Runbook

Airflow worker resource exhaustion incident.

Overview

This incident type covers situations where Apache Airflow workers have exhausted their resources. Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Workers are the processes that execute tasks, so the platform cannot function without them. When a worker runs out of CPU, memory, or disk, platform performance degrades and workflows can fail. This incident type requires immediate attention to confirm that workers have sufficient resources to execute tasks and that the platform is functioning correctly.

Debug

Check the CPU usage of the affected worker node
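One way to approximate this check is to compare the load average with the core count. The sketch below is a minimal illustration, assuming a Unix-like worker node; the helper name `cpu_load_ratio` is hypothetical, and load average is only a rough proxy for CPU pressure (on Linux it also counts processes waiting on I/O).

```python
import os


def cpu_load_ratio(load_avg: float, cores: int) -> float:
    """Return the load average per core.

    Values near or above 1.0 suggest the node is saturated.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-min load per core:  {cpu_load_ratio(one_min, cores):.2f}")
    print(f"15-min load per core: {cpu_load_ratio(fifteen_min, cores):.2f}")
```

A 1-minute ratio well above the 15-minute ratio points to a recent spike rather than sustained pressure.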

Check the disk usage of the affected worker node
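A minimal sketch of this check using only the Python standard library. The helper name `disk_percent_used` and the choice of `/` as the path are assumptions; point it at whichever filesystem holds the worker's logs and temporary files.

```python
import shutil


def disk_percent_used(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is used."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    print(f"/ is {disk_percent_used('/'):.1f}% used")
```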

Check the memory usage of the affected worker node
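On a Linux worker, memory pressure can be read from `/proc/meminfo`. The sketch below is one possible approach under that assumption; the helper names `parse_meminfo` and `memory_percent_used` are hypothetical, and the `MemAvailable` field requires a reasonably recent kernel.

```python
import os


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into a mapping of field name to kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_percent_used(info: dict[str, int]) -> float:
    """Percentage of memory not available for new allocations."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):  # Linux only
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
        print(f"memory used: {memory_percent_used(info):.1f}%")
```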

Check the currently running processes on the affected worker node
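To see which processes hold the most memory, the sketch below walks `/proc` and ranks processes by resident set size. It assumes a Linux node; `parse_status` and `top_processes_by_rss` are illustrative names, not part of Airflow.

```python
import os


def parse_status(text: str) -> tuple[str, int]:
    """Extract (name, rss_kb) from /proc/<pid>/status content.

    Kernel threads have no VmRSS line, so their RSS is reported as 0.
    """
    name, rss_kb = "", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss_kb = int(line.split()[1])
    return name, rss_kb


def top_processes_by_rss(limit: int = 10, proc_root: str = "/proc"):
    """Return up to `limit` (rss_kb, pid, name) tuples, largest RSS first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # process exited or is not readable
        rows.append((rss_kb, int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    if os.path.isdir("/proc"):  # Linux only
        for rss_kb, pid, name in top_processes_by_rss():
            print(f"{pid:>8} {rss_kb:>12} kB  {name}")
```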

Identify any specific Airflow tasks that may be causing the resource exhaustion
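One rough way to narrow this down is to list processes whose command line mentions Airflow and then cross-reference their PIDs with the CPU and memory findings above. The sketch assumes a Linux node and assumes that task processes carry the string `airflow` in their command line, which may not hold in every deployment; the helper names are hypothetical.

```python
import os


def decode_cmdline(raw: bytes) -> str:
    """Turn /proc/<pid>/cmdline bytes (NUL-separated arguments) into one string."""
    return raw.replace(b"\x00", b" ").decode("utf-8", errors="replace").strip()


def airflow_processes(proc_root: str = "/proc", needle: str = "airflow"):
    """Return (pid, cmdline) pairs for processes whose command line contains `needle`."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                cmdline = decode_cmdline(fh.read())
        except OSError:
            continue  # process exited or is not readable
        if needle in cmdline:
            matches.append((int(entry), cmdline))
    return matches


if __name__ == "__main__":
    if os.path.isdir("/proc"):  # Linux only
        for pid, cmdline in airflow_processes():
            print(pid, cmdline)
```

When a command line includes a DAG or task identifier, that identifier shows which workload to inspect first.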

Restart the Airflow worker process on the affected node
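How the worker is restarted depends on how it is supervised. The sketch below assumes the worker runs as a systemd unit; the unit name `airflow-worker` is a placeholder, not something this runbook specifies, so substitute the real one. It defaults to a dry run so the command can be reviewed before it is executed.

```python
import subprocess


def restart_command(unit: str) -> list[str]:
    """Build the systemctl command that restarts `unit`."""
    return ["systemctl", "restart", unit]


def restart_worker(unit: str = "airflow-worker", dry_run: bool = True) -> int:
    """Restart the worker unit, or only print the command when dry_run is True.

    "airflow-worker" is a hypothetical unit name. Returns the exit code.
    """
    cmd = restart_command(unit)
    if dry_run:
        print("would run:", " ".join(cmd))
        return 0
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    restart_worker(dry_run=True)
```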

Possible root cause: the worker was not provisioned with enough CPU, memory, or disk to handle the workload assigned to it.

Repair

Scale up the number of Airflow workers so that enough resources are available to handle the workload.
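To estimate how many workers to scale to, a simple calculation can help. The sketch below assumes a Celery-based deployment where each worker runs up to `worker_concurrency` tasks at once; the function name `workers_needed` is hypothetical, and the value 16 is only an example, so check the setting in your own configuration.

```python
import math


def workers_needed(concurrent_tasks: int, worker_concurrency: int = 16) -> int:
    """Return the minimum number of workers for the expected concurrent tasks.

    `worker_concurrency` is the number of task slots per worker (example value).
    At least one worker is always returned.
    """
    if worker_concurrency <= 0:
        raise ValueError("worker_concurrency must be positive")
    return max(1, math.ceil(concurrent_tasks / worker_concurrency))


if __name__ == "__main__":
    # Example: 40 tasks expected to run at the same time.
    print(workers_needed(40))
```

This only counts task slots. If individual tasks are memory-heavy, fewer slots per worker with more workers may be the safer choice.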
