Runbook

Bottleneck in Airflow DAG Scheduler Causing Task Execution Delays

Overview

This runbook covers a bottleneck in the Apache Airflow scheduler that delays task execution. Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows, which it models as DAGs (Directed Acyclic Graphs). The scheduler decides when each task runs, based on the task's dependencies and on available resources. When the scheduler becomes a bottleneck, tasks wait longer before they start, which slows the overall workflow and can cause failures. Investigate and resolve the bottleneck to restore the performance and reliability of the Airflow platform.

Debug

List all Apache Airflow Pods running in the cluster
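
A minimal sketch of this step, assuming Airflow runs in a namespace named `airflow` (an assumption; substitute your own) and that `kubectl` is configured for the cluster. It wraps `kubectl get pods -o json` and reduces the result to name, phase, and restart count, since a scheduler Pod that keeps restarting is an early hint of trouble.

```python
import json
import subprocess

NAMESPACE = "airflow"  # assumption: the namespace Airflow is deployed in


def list_pods_command(namespace=NAMESPACE):
    """Build the kubectl command that lists every Pod in the namespace as JSON."""
    return ["kubectl", "get", "pods", "-n", namespace, "-o", "json"]


def summarize_pods(pod_list):
    """Return (name, phase, total container restarts) for each Pod."""
    rows = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        restarts = sum(
            s.get("restartCount", 0) for s in status.get("containerStatuses", [])
        )
        rows.append((pod["metadata"]["name"], status.get("phase", "Unknown"), restarts))
    return rows


def list_airflow_pods(namespace=NAMESPACE):
    """Run kubectl and summarize the Pods it returns."""
    result = subprocess.run(
        list_pods_command(namespace), capture_output=True, text=True, check=True
    )
    return summarize_pods(json.loads(result.stdout))
```

Calling `list_airflow_pods()` returns one tuple per Pod. Look for the scheduler Pod and note its exact name for the steps that follow.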

Check the logs of the DAG Scheduler Pod
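
One way to do this is sketched below. It fetches recent scheduler logs with `kubectl logs` and keeps only lines that carry a `WARNING`, `ERROR`, or `CRITICAL` level. The namespace, the one-hour window, and the filtering by level name are assumptions made for illustration; the Pod name comes from the previous step.

```python
import re
import subprocess

NAMESPACE = "airflow"  # assumption: the namespace Airflow is deployed in

# Matches the standard Python logging level names that usually signal a problem.
LEVEL_PATTERN = re.compile(r"\b(WARNING|ERROR|CRITICAL)\b")


def scheduler_logs_command(pod, namespace=NAMESPACE, since="1h", container=None):
    """Build the kubectl command that fetches recent logs from the Pod."""
    cmd = ["kubectl", "logs", pod, "-n", namespace, f"--since={since}", "--timestamps"]
    if container:
        # Needed when the Pod has more than one container.
        cmd += ["-c", container]
    return cmd


def problem_lines(log_text):
    """Keep only the log lines that mention a warning or error level."""
    return [line for line in log_text.splitlines() if LEVEL_PATTERN.search(line)]


def fetch_problem_lines(pod, **kwargs):
    """Run kubectl and return the filtered log lines."""
    result = subprocess.run(
        scheduler_logs_command(pod, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return problem_lines(result.stdout)
```

Filtering by level is only a first pass. A scheduler that is slow but not failing may log nothing above `INFO`, so read the unfiltered output as well if this returns little.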

Check the resource usage of the DAG Scheduler Pod
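
The sketch below reads current usage with `kubectl top pods`, which requires the Kubernetes Metrics Server to be installed in the cluster. The namespace `airflow` and the label selector `component=scheduler` are assumptions; check the labels on your own scheduler Pod before relying on them.

```python
import subprocess

# Conversion factors from Kubernetes binary suffixes to MiB.
_MEMORY_FACTORS = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}


def parse_cpu_millicores(value):
    """Convert a CPU quantity such as '250m' or '2' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def parse_memory_mib(value):
    """Convert a memory quantity such as '512Mi' or '1Gi' to MiB."""
    for suffix, factor in _MEMORY_FACTORS.items():
        if value.endswith(suffix):
            return float(value[:-2]) * factor
    return float(value) / (1024 * 1024)  # no suffix: plain bytes


def parse_top_pods(output):
    """Parse `kubectl top pods` text into {pod_name: (cpu_millicores, memory_mib)}."""
    usage = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] == "NAME":
            continue  # skip the header and anything unexpected
        name, cpu, memory = fields
        usage[name] = (parse_cpu_millicores(cpu), parse_memory_mib(memory))
    return usage


def scheduler_usage(namespace="airflow", selector="component=scheduler"):
    """Run kubectl top for Pods matching the (assumed) scheduler label."""
    cmd = ["kubectl", "top", "pods", "-n", namespace, "-l", selector]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_top_pods(result.stdout)
```

These numbers are a snapshot. Compare them against the Pod's configured limits before drawing conclusions.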

Check the status of the Kubernetes Nodes
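
As a rough illustration, the function below inspects the standard node conditions reported by `kubectl get nodes -o json` and lists any node that is not `Ready` or that reports memory, disk, PID, or network problems. It assumes nothing about Airflow itself, only that `kubectl` can reach the cluster.

```python
import json
import subprocess

# Conditions that indicate a problem when their status is "True".
BAD_WHEN_TRUE = {"MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}


def node_problems(node_list):
    """Return {node_name: [problem, ...]} for nodes with unhealthy conditions."""
    problems = {}
    for node in node_list.get("items", []):
        issues = []
        for cond in node.get("status", {}).get("conditions", []):
            ctype, status = cond.get("type"), cond.get("status")
            if ctype == "Ready" and status != "True":
                issues.append("NotReady")
            elif ctype in BAD_WHEN_TRUE and status == "True":
                issues.append(ctype)
        if issues:
            problems[node["metadata"]["name"]] = issues
    return problems


def check_nodes():
    """Fetch all nodes from the cluster and report the unhealthy ones."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return node_problems(json.loads(result.stdout))
```

An empty result from `check_nodes()` means every node reports healthy conditions. If the node hosting the scheduler Pod appears in the result, address the node first.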

Check the resource usage of the Kubernetes Nodes
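
A possible way to script this check is to parse `kubectl top nodes` (Metrics Server required) and flag nodes whose CPU or memory utilization is at or above a threshold. The 80% threshold is an arbitrary assumption for illustration, not a recommended value.

```python
import subprocess


def _percent(value):
    """Convert '42%' to 42; return None for values such as '<unknown>'."""
    digits = value.rstrip("%")
    return int(digits) if digits.isdigit() else None


def parse_top_nodes(output):
    """Parse `kubectl top nodes` text into (name, cpu_percent, memory_percent) rows."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "NAME":
            continue  # skip the header and anything unexpected
        name, _cpu, cpu_pct, _memory, memory_pct = fields
        rows.append((name, _percent(cpu_pct), _percent(memory_pct)))
    return rows


def busy_nodes(rows, threshold=80):
    """Return names of nodes whose CPU or memory usage is at or above the threshold."""
    return [
        name for name, cpu, memory in rows
        if (cpu is not None and cpu >= threshold)
        or (memory is not None and memory >= threshold)
    ]


def check_node_usage(threshold=80):
    """Run kubectl top nodes and report the heavily used nodes."""
    result = subprocess.run(
        ["kubectl", "top", "nodes"], capture_output=True, text=True, check=True
    )
    return busy_nodes(parse_top_nodes(result.stdout), threshold)
```

A scheduler Pod on a saturated node can be starved of CPU even when its own limits look generous.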

Check the CPU and memory limits set for the DAG Scheduler Pod
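
The following sketch reads the `resources` block of each container from the Pod's JSON definition and reports containers that have no CPU or memory limit set. The namespace default is an assumption, and the Pod name must be supplied from the earlier listing step.

```python
import json
import subprocess


def container_resources(pod):
    """Return {container_name: {"requests": {...}, "limits": {...}}} for a Pod."""
    result = {}
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources", {})
        result[container["name"]] = {
            "requests": resources.get("requests", {}),
            "limits": resources.get("limits", {}),
        }
    return result


def missing_limits(pod):
    """Return names of containers that lack a CPU limit, a memory limit, or both."""
    return [
        name for name, res in container_resources(pod).items()
        if not {"cpu", "memory"} <= set(res["limits"])
    ]


def fetch_pod(pod_name, namespace="airflow"):
    """Fetch the Pod definition as a dict (namespace default is an assumption)."""
    result = subprocess.run(
        ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

A CPU limit that sits close to the observed usage suggests the scheduler is being throttled. A memory limit that sits close to usage puts the Pod at risk of being OOM-killed.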

Check the CPU and memory usage of the DAG Scheduler Pod over time
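
If no metrics dashboard is available, a crude substitute is to sample `kubectl top pod` at a fixed interval and summarize the readings, as sketched here. The sample count, the interval, and the namespace are assumptions, and the parsing assumes `kubectl top` reports CPU in millicores (`m`) and memory in `Mi`.

```python
import statistics
import subprocess
import time


def read_usage(pod, namespace="airflow"):
    """Return (cpu_millicores, memory_mib) for one Pod from `kubectl top pod`."""
    result = subprocess.run(
        ["kubectl", "top", "pod", pod, "-n", namespace, "--no-headers"],
        capture_output=True, text=True, check=True,
    )
    _name, cpu, memory = result.stdout.split()[:3]
    cpu_m = int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)
    mem_mib = int(memory[:-2]) if memory.endswith("Mi") else int(memory)
    return cpu_m, mem_mib


def collect_samples(reader, count=10, interval_seconds=30, sleep=time.sleep):
    """Call reader() `count` times, pausing between (not after) readings."""
    samples = []
    for i in range(count):
        samples.append(reader())
        if i < count - 1:
            sleep(interval_seconds)
    return samples


def summarize(values):
    """Return the min, max, and mean of a list of numeric readings."""
    return {"min": min(values), "max": max(values), "mean": statistics.mean(values)}
```

For example, `samples = collect_samples(lambda: read_usage("<scheduler-pod>"))` followed by `summarize([cpu for cpu, _ in samples])` gives a CPU summary over roughly five minutes. Usage that climbs steadily or stays pinned at the limit points to the scheduler as the bottleneck.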

Check the network connectivity between the DAG Scheduler Pod and other Pods
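
One hedged approach is to open a TCP connection from inside the scheduler Pod to the service it depends on, for example the metadata database. The sketch below builds a `kubectl exec` command that runs a one-line Python probe in the Pod. It assumes the scheduler image has `python` on its `PATH`, and the host and port you pass in are specific to your deployment.

```python
import subprocess


def tcp_probe_snippet(host, port, timeout=5):
    """Return a Python one-liner that exits non-zero if host:port is unreachable."""
    return (
        "import socket; "
        f"socket.create_connection(({host!r}, {int(port)}), timeout={float(timeout)}).close(); "
        "print('reachable')"
    )


def exec_probe_command(pod, host, port, namespace="airflow", container=None):
    """Build the kubectl exec command that runs the probe inside the Pod."""
    cmd = ["kubectl", "exec", "-n", namespace, pod]
    if container:
        # Needed when the Pod has more than one container.
        cmd += ["-c", container]
    cmd += ["--", "python", "-c", tcp_probe_snippet(host, port)]
    return cmd


def can_reach(pod, host, port, **kwargs):
    """Run the probe and report whether the connection succeeded."""
    result = subprocess.run(
        exec_probe_command(pod, host, port, **kwargs),
        capture_output=True, text=True,
    )
    return result.returncode == 0
```

A successful TCP connection only shows that the network path is open. It does not prove that the remote service is healthy or responding quickly.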

Repair

Increase the number of DAG Scheduler workers to distribute the workload and reduce the bottleneck.
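
A sketch of this repair, reading "scheduler workers" as scheduler replicas, which is an interpretation rather than something this runbook states. It assumes the scheduler runs as a Deployment named `airflow-scheduler` in the `airflow` namespace; both names are hypothetical. Running more than one scheduler requires Airflow 2.0 or later and a metadata database that supports row-level locking, such as PostgreSQL or MySQL 8+.

```python
import subprocess


def scale_scheduler_command(
    replicas, name="airflow-scheduler", namespace="airflow", kind="deployment"
):
    """Build the kubectl command that sets the scheduler replica count."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return ["kubectl", "scale", f"{kind}/{name}", "-n", namespace, f"--replicas={replicas}"]


def rollout_status_command(name="airflow-scheduler", namespace="airflow", kind="deployment"):
    """Build the kubectl command that waits for the rollout to finish."""
    return ["kubectl", "rollout", "status", f"{kind}/{name}", "-n", namespace, "--timeout=120s"]


def scale_scheduler(replicas, **kwargs):
    """Scale the scheduler, then block until the new replicas are ready."""
    subprocess.run(scale_scheduler_command(replicas, **kwargs), check=True)
    subprocess.run(rollout_status_command(**kwargs), check=True)
```

If the deployment is managed by Helm or another tool, change the replica count in that tool's configuration instead, because a manual `kubectl scale` is likely to be overwritten on the next release. After scaling, repeat the Debug steps to confirm that task delays have dropped.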
