This incident occurs when the Airflow server, which schedules and manages workflows, runs out of CPU, memory, or disk space during periods of high workflow execution. The result is delayed workflow runs or outright failures. Monitor the server's resource usage and allocate enough capacity to keep workflow execution uninterrupted.
Parameters
Debug
Check if there are any zombie processes
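One way to run this check is to list processes with `ps` and keep those whose state starts with `Z`. The sketch below assumes a Linux host with the procps `ps` command; the function names `find_zombies` and `list_zombies` are illustrative, not part of Airflow.

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, ppid, command) for each process whose state starts with 'Z'."""
    zombies = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def list_zombies():
    """Run ps on the local host (assumes procps-style output columns)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_zombies(result.stdout)
```

The parent PID is included because a zombie is only cleared once its parent reaps it, so the parent is usually the process to investigate.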
Check CPU usage
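A rough way to gauge CPU pressure is to compare the load average with the number of cores. This is a proxy rather than a true utilization figure, and the threshold of 1.0 per core is an assumption to tune for your workload. The helper names are hypothetical.

```python
import os


def load_per_core(load_avg, cores):
    """Normalize a load average by the number of CPU cores."""
    return load_avg / max(cores, 1)


def is_cpu_saturated(load_avg, cores, threshold=1.0):
    """True when load per core meets or exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) >= threshold


def current_cpu_report():
    """Read the 1, 5, and 15 minute load averages (Unix only)."""
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    return {
        "cores": cores,
        "load_1m": one,
        "load_5m": five,
        "load_15m": fifteen,
        "saturated": is_cpu_saturated(five, cores),
    }
```

The 5 minute average is used for the saturation flag so that a single short burst does not trigger it.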
Check memory usage
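On Linux, memory usage can be derived from `/proc/meminfo`. The sketch below computes the share of memory in use from the `MemTotal` and `MemAvailable` fields. It assumes a kernel that reports `MemAvailable`; the function names are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_used_percent(text):
    """Percentage of memory in use, based on MemTotal and MemAvailable."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return (total - available) / total * 100


def read_meminfo(path="/proc/meminfo"):
    """Read the live values from the local host (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

`MemAvailable` is used instead of `MemFree` because it accounts for page cache the kernel can reclaim, which gives a more realistic picture of headroom.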
Check disk space usage
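Disk usage can be read with `shutil.disk_usage` from the standard library. The sketch below reports the percentage used for a path and flags paths above a threshold. The 90% default is an assumption, and the paths to check (for example the Airflow log directory) depend on your installation.

```python
import shutil


def percent_used(used, total):
    """Used space as a percentage of total; 0.0 when total is zero."""
    return used / total * 100 if total else 0.0


def disk_used_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use.

    Note: this is used/total, which can differ slightly from the Use%
    column of `df` because of blocks reserved for root.
    """
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.total)


def paths_over_threshold(paths, threshold=90.0):
    """Return the paths whose filesystem usage meets or exceeds the threshold."""
    return [p for p in paths if disk_used_percent(p) >= threshold]
```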
Check Airflow logs for errors or warnings
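A quick way to triage the logs is to count lines by severity. The sketch below assumes logs are plain-text `*.log` files under `$AIRFLOW_HOME/logs` (with `~/airflow` as the fallback for `AIRFLOW_HOME`); adjust the path if your deployment sets a different log folder or uses remote logging. The function names are hypothetical.

```python
import os
import re
from collections import Counter
from pathlib import Path

LEVEL_PATTERN = re.compile(r"\b(WARNING|ERROR|CRITICAL)\b")


def count_levels(lines):
    """Count WARNING, ERROR, and CRITICAL occurrences in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def default_log_dir():
    """Assumed location: $AIRFLOW_HOME/logs, falling back to ~/airflow/logs."""
    home = os.environ.get("AIRFLOW_HOME", os.path.expanduser("~/airflow"))
    return os.path.join(home, "logs")


def scan_log_dir(log_dir):
    """Aggregate severity counts across every *.log file under log_dir."""
    totals = Counter()
    for path in Path(log_dir).rglob("*.log"):
        with open(path, errors="replace") as handle:
            totals.update(count_levels(handle))
    return totals
```

Calling `scan_log_dir(default_log_dir())` gives a per-severity total that can be compared before and after a load spike.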
Check Airflow configuration for resource limits
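The concurrency settings in `airflow.cfg` bound how much work the server takes on at once. The sketch below reads a few commonly used options with `configparser`. The option names shown (`parallelism`, `max_active_tasks_per_dag`, `max_active_runs_per_dag`, `worker_concurrency`) exist in recent Airflow 2 releases but vary by version, so treat the list as an assumption to check against your installation.

```python
import configparser

# Options to look for, grouped by config section (assumed; version dependent).
LIMIT_KEYS = {
    "core": ["parallelism", "max_active_tasks_per_dag", "max_active_runs_per_dag"],
    "celery": ["worker_concurrency"],
}


def read_limits(cfg_text):
    """Return the concurrency limits found in airflow.cfg content."""
    # Interpolation is disabled because airflow.cfg values may contain '%'.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(cfg_text)
    limits = {}
    for section, keys in LIMIT_KEYS.items():
        for key in keys:
            if parser.has_option(section, key):
                limits[f"{section}.{key}"] = parser.getint(section, key)
    return limits


def read_limits_from_file(path):
    """Convenience wrapper for a config file on disk."""
    with open(path) as handle:
        return read_limits(handle.read())
```

Options missing from the file are simply omitted from the result, which means Airflow is using its built-in default for them.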
Check if Airflow workers are consuming too many resources
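To see which Airflow processes use the most resources, the output of `ps` can be filtered and ranked. The sketch below assumes procps-style output from `ps -eo pid,pcpu,pmem,args` and treats any command line containing the string `airflow` as an Airflow process, which is a heuristic. The function names are illustrative.

```python
import subprocess


def top_airflow_processes(ps_output, limit=5):
    """Return the Airflow processes with the highest memory (then CPU) share."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4 or "airflow" not in parts[3]:
            continue
        pid, cpu, mem, args = parts
        rows.append({
            "pid": int(pid),
            "cpu": float(cpu),
            "mem": float(mem),
            "args": args.strip(),
        })
    rows.sort(key=lambda row: (row["mem"], row["cpu"]), reverse=True)
    return rows[:limit]


def current_top_airflow_processes(limit=5):
    """Run ps on the local host and rank the Airflow processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,args"],
        capture_output=True, text=True, check=True,
    )
    return top_airflow_processes(result.stdout, limit)
```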
Check if there are any blocked I/O operations
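On Linux, `/proc/stat` includes a `procs_blocked` line with the number of processes currently blocked waiting for I/O. The sketch below extracts that value. It assumes a Linux host, and the function names are hypothetical.

```python
def blocked_process_count(proc_stat_text):
    """Return the procs_blocked value from /proc/stat content, or None if absent."""
    for line in proc_stat_text.splitlines():
        if line.startswith("procs_blocked"):
            return int(line.split()[1])
    return None


def read_blocked_process_count(path="/proc/stat"):
    """Read the live value from the local host (Linux only)."""
    with open(path) as handle:
        return blocked_process_count(handle.read())
```

A value that stays above zero across several samples suggests that storage, rather than CPU or memory, is the bottleneck.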
The Airflow server is running on a machine with insufficient resources, such as low memory or CPU capacity, to handle peak workflow loads.
Repair
Scaling up server resources: Increase the resources available to the Airflow server during peak execution periods, either by adding CPU, memory, or disk space to the existing server or by moving to a more powerful server.