Support Apache Airflow Monitoring
Motivation
Apache Airflow is an open-source workflow management platform primarily used for scheduling and monitoring workflows. It can be used to handle complex data pipelines and has been widely applied in the fields of data engineering and data science. Airflow allows users to write workflows called DAGs (Directed Acyclic Graphs). Each DAG contains a series of tasks that can be executed in a specific sequence and dependency relationship. Due to its support for multitasking in complex scenarios, monitoring the health and operational status of Airflow is crucial. Through these metrics, it is possible to help analyze task health status, formulate optimization plans, and design risk prevention strategies.
Architecture Graph
graph LR;
AirflowOTEL("Airflow OTEL") --> OpenTelemetryCollector("OpenTelemetry Collector") --> SkyWalkingOTELReceiver("SkyWalking OTEL Receiver") --> SkyWalkingMALEngine("SkyWalking MAL Engine") --> HorizonUI("Horizon UI")
Proposed Changes
- Airflow exports metrics via native OpenTelemetry (
otel_on/OTEL_EXPORTER_OTLP_*). - OpenTelemetry Collector receives OTLP metrics from Airflow and forwards them to SkyWalking OTel Receiver via the OpenTelemetry exporter.
- The SkyWalking OAP Server parses expressions with MAL to filter, calculate, aggregate, and store the results.
- Metrics are displayed via Horizon UI under the Workflow Scheduler menu group and can be customized on dashboards.
SkyWalking models an Airflow deployment as Layer: AIRFLOW:
- Service — one logical cluster (
airflow::{cluster}), keyed by resource attributecluster. - Instance — scheduler or triggerer host (
host.nameresource attribute).
Horizon labels this entity Components rather than Instance so operators are not led to confuse it with Airflow Task Instance (a single task execution within one DAG run). See Airflow monitoring setup for the full naming rationale.
Airflow Service Supported Metrics
| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Tasks Executable | count | meter_airflow_scheduler_tasks_executable | Tasks ready for execution |
| Running Tasks | count | meter_airflow_executor_running_tasks | Tasks currently running on executor |
| Queued Tasks | count | meter_airflow_executor_queued_tasks | Queued tasks on executor |
| Scheduled Slots | count | meter_airflow_pool_scheduled_slots | Scheduled but not yet running slots in pool (aggregated across pools via aggregate_labels in the UI KPI card) |
| Executor Open Slots | count | meter_airflow_executor_open_slots | Open executor slots |
| DAG File Queue Size | count | meter_airflow_dag_file_queue_size | DAG files pending scan |
| DAG Import Errors | count | meter_airflow_dag_import_errors | DAG files that failed to parse |
| DAG Bag Size | count | meter_airflow_dagbag_size | DAGs found in the last scheduler scan |
| DAG Total Parse Time | seconds | meter_airflow_dag_total_parse_time | Time to scan and import queued DAG files |
| DAG File Refresh Errors | count/min | meter_airflow_dag_file_refresh_error | DAG file load failures per minute |
| Asset Updates | count/min | meter_airflow_asset_updates | Updated assets per minute |
Airflow Instance Supported Metrics
| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Pool Open / Deferred / Running Slots | count | meter_airflow_instance_pool_open_slots, meter_airflow_instance_pool_deferred_slots, meter_airflow_instance_pool_running_slots | Pool capacity on the scheduler |
| Running Tasks / Scheduled Slots | count | meter_airflow_instance_executor_running_tasks, meter_airflow_instance_pool_scheduled_slots | Executor queue depth and pool slots waiting to run |
| Scheduler Heartbeat | count/min | meter_airflow_instance_scheduler_heartbeat | Scheduler heartbeats per minute |
| Executor Open / Queued Slots | count | meter_airflow_instance_executor_open_slots, meter_airflow_instance_executor_queued_tasks | Executor capacity and queue depth on the scheduler |
| Asset Updates | count/min | meter_airflow_instance_asset_updates | Asset updates on this host |
| Asset Triggered DagRuns | count/min | meter_airflow_instance_asset_triggered_dagruns | DagRuns triggered by assets |
| Triggerer Heartbeat | count/min | meter_airflow_instance_triggerer_heartbeat | Triggerer heartbeats per minute |
| Triggers Running / Capacity Left | count | meter_airflow_instance_triggers_running, meter_airflow_instance_triggerer_capacity_left | Live deferrable trigger load on the triggerer |
| Triggers Blocked / Failed / Succeeded | count/min | meter_airflow_instance_triggers_blocked_main_thread, meter_airflow_instance_triggers_failed, meter_airflow_instance_triggers_succeeded | Deferred trigger outcomes on the triggerer |
Service-level panels aggregate cluster-wide samples. Instance-level panels are scoped per
host.name (shown as Components in the UI). Do not sum instance-scoped samples into service dashboards when each component
exports the same instrument independently.
Scheduler and triggerer components use Airflow core OpenTelemetry export only.
Airflow 3.x only. Asset metrics use OTel instruments airflow.asset.* (Airflow 2.x
airflow.dataset.* is not supported).
Bundled Horizon UI dashboards chart the metrics above one-to-one (27 MAL metrics: 11 Service +
16 Instance). Tasks Executable is Service-only (meter_airflow_scheduler_tasks_executable).
The Service dashboard has 11 panels. The Components template defines nine widgets;
scheduler hosts show six, triggerer hosts show three (see
setup doc).
Imported Dependencies libs and their licenses.
No new dependency.
Compatibility
No breaking changes.