Runbook

Kubernetes Cronjob Failure

Overview

A Kubernetes Cronjob Failure incident occurs when a scheduled task, or cronjob, in a Kubernetes cluster fails to execute as expected. Common causes include misconfiguration, resource constraints, and software bugs. The incident requires investigation and debugging to identify the root cause of the failure and resolve the issue to restore normal operation. There is also a Kubernetes limitation: once a cronjob has missed too many (more than 100) scheduled start times, the controller logs an error instead of starting the job, and the cronjob stays stopped until this is corrected.

Parameters

Debug

Check if the cronjob is still active
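
One way to answer this is to read the cronjob object itself. The sketch below assumes the JSON was produced by `kubectl get cronjob <name> -n <namespace> -o json` and reads the batch/v1 fields `spec.suspend`, `status.active`, and `status.lastScheduleTime`. The function name `summarize_cronjob` and the sample object are illustrative, not part of any tool.

```python
def summarize_cronjob(cronjob: dict) -> dict:
    """Summarize whether a CronJob object is suspended or has running Jobs."""
    spec = cronjob.get("spec", {})
    status = cronjob.get("status", {})
    return {
        # spec.suspend defaults to false when it is not set.
        "suspended": bool(spec.get("suspend", False)),
        # status.active lists references to Jobs that are currently running.
        "active_jobs": [ref.get("name") for ref in status.get("active", [])],
        "last_schedule_time": status.get("lastScheduleTime"),
    }


# Hypothetical object, trimmed to the fields the function reads.
sample = {
    "spec": {"schedule": "*/5 * * * *", "suspend": True},
    "status": {"lastScheduleTime": "2024-01-01T00:05:00Z"},
}
print(summarize_cronjob(sample))
```

A suspended cronjob, or a last schedule time far in the past, is usually the first thing to rule out.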

Check if the pods created by the cronjob are still running

Check the logs of the pods created by the cronjob

Check if the cronjob schedule is correct
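
A quick sanity check of the schedule string can catch obvious mistakes before reading it field by field. The sketch below is a simplified validator written for this runbook, not the parser Kubernetes uses. It assumes the standard five numeric fields and does not handle macros such as `@hourly`, names such as `MON` or `JAN`, or time zone prefixes, so those are reported as problems even though Kubernetes may accept them.

```python
# Allowed numeric range for each of the five standard cron fields.
FIELD_RANGES = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day of month", 1, 31),
    ("month", 1, 12),
    ("day of week", 0, 6),
]


def _check_part(part: str, low: int, high: int) -> bool:
    """Validate one list element such as '*', '5', '1-5' or '*/15'."""
    base, slash, step = part.partition("/")
    if slash and (not step.isdecimal() or int(step) == 0):
        return False
    if base == "*":
        return True
    start, dash, end = base.partition("-")
    if not start.isdecimal():
        return False
    if dash and not end.isdecimal():
        return False
    first = int(start)
    last = int(end) if dash else first
    return low <= first <= last <= high


def schedule_problems(schedule: str) -> list:
    """Return readable problems; an empty list means none were found."""
    fields = schedule.split()
    if len(fields) != 5:
        return [f"expected 5 fields, found {len(fields)}"]
    problems = []
    for value, (name, low, high) in zip(fields, FIELD_RANGES):
        if not all(_check_part(p, low, high) for p in value.split(",")):
            problems.append(f"{name} field {value!r} is malformed or outside {low}-{high}")
    return problems


print(schedule_problems("0 25 * * *"))
```

An empty result only means the string is well formed; whether it fires at the intended time still has to be checked against the time zone the controller uses.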

Check if there are any errors in the cronjob events
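
Events can be filtered programmatically when the namespace is noisy. The sketch below assumes the input is the JSON list returned by `kubectl get events -n <namespace> -o json` and that each item carries the core/v1 fields `type`, `reason`, `message`, and `involvedObject`. The function name and the sample events are made up for illustration.

```python
def warning_events(event_list: dict, kind: str, name: str) -> list:
    """Return (reason, message) pairs for Warning events about one object."""
    found = []
    for event in event_list.get("items", []):
        if event.get("type") != "Warning":
            continue
        involved = event.get("involvedObject", {})
        if involved.get("kind") == kind and involved.get("name") == name:
            found.append((event.get("reason", ""), event.get("message", "")))
    return found


# Hypothetical events; the names and messages are placeholders.
events = {
    "items": [
        {"type": "Normal", "reason": "SuccessfulCreate", "message": "Created job",
         "involvedObject": {"kind": "CronJob", "name": "nightly-report"}},
        {"type": "Warning", "reason": "FailedCreate", "message": "example failure",
         "involvedObject": {"kind": "CronJob", "name": "nightly-report"}},
    ]
}
print(warning_events(events, "CronJob", "nightly-report"))
```

The same helper works for pod events by passing `"Pod"` and the pod name.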

Check if the cronjob image exists in the container registry
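
A missing or mistyped image normally shows up in the pod status before anyone looks at the registry. The sketch below does not query a registry. It assumes pod JSON from `kubectl get pod <pod> -n <namespace> -o json` and looks for the kubelet waiting reasons `ErrImagePull`, `ImagePullBackOff`, and `InvalidImageName`. The helper name is hypothetical.

```python
PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def image_pull_failures(pod: dict) -> list:
    """Return (container, image, reason) for containers stuck pulling an image."""
    status = pod.get("status", {})
    statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
    failures = []
    for container in statuses:
        # state.waiting is only present while the container has not started.
        waiting = container.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in PULL_FAILURE_REASONS:
            failures.append((container.get("name"), container.get("image"), waiting["reason"]))
    return failures
```

If this returns an entry, compare the reported image reference with what exists in the registry, including the tag and any pull secret the pod needs.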

Check the status of the last cronjob run
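
The last run is the most recently created Job owned by the cronjob. The sketch below assumes the input is the JSON list from `kubectl get jobs -n <namespace> -o json`, finds Jobs whose `metadata.ownerReferences` point at the cronjob, and classifies the newest one from its `Complete` or `Failed` condition. The function names and the outcome labels are choices made for this example.

```python
def job_outcome(job: dict) -> str:
    """Classify a Job as 'succeeded', 'failed' or 'unfinished' from its conditions."""
    for cond in job.get("status", {}).get("conditions", []):
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "succeeded"
        if cond.get("type") == "Failed":
            return "failed"
    return "unfinished"


def last_run(job_list: dict, cronjob_name: str):
    """Return (job name, outcome) for the newest Job owned by the cronjob, or None."""
    owned = [
        job for job in job_list.get("items", [])
        if any(
            ref.get("kind") == "CronJob" and ref.get("name") == cronjob_name
            for ref in job.get("metadata", {}).get("ownerReferences", [])
        )
    ]
    if not owned:
        return None
    # creationTimestamp is an RFC 3339 UTC string, so string order is time order.
    latest = max(owned, key=lambda job: job["metadata"].get("creationTimestamp", ""))
    return latest["metadata"].get("name"), job_outcome(latest)
```

A result of `None` means no Job owned by the cronjob was found, which points back at scheduling rather than at the workload.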

Check if the pods created by the cronjob are running on the expected node

Check if the pod has sufficient resources
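
Resource shortages tend to leave a few recognizable marks on the pod status. The sketch below assumes pod JSON from `kubectl get pod <pod> -n <namespace> -o json` and looks for three of them: an `Unschedulable` scheduling condition, a container terminated with reason `OOMKilled`, and a pod status reason of `Evicted`. It is a starting point and does not cover every way a pod can run short of CPU, memory, or storage.

```python
def resource_signals(pod: dict) -> list:
    """Return readable hints that the pod lacked CPU, memory or node capacity."""
    signals = []
    status = pod.get("status", {})
    if status.get("reason") == "Evicted":
        signals.append("evicted: " + status.get("message", ""))
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            signals.append("unschedulable: " + cond.get("message", ""))
    for container in status.get("containerStatuses", []):
        # Check both the current and the previous container state.
        for key in ("state", "lastState"):
            terminated = container.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                signals.append(f"container {container.get('name')} was OOMKilled")
                break
    return signals
```

An empty list does not prove the pod had enough resources; compare its requests and limits with actual usage if the job is slow or restarts.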

Check if there are any errors in the pod events

Repair

Check the cronjob configuration to ensure that it is correctly defined and scheduled to run at the intended time.

Check the cronjob events and controller logs for the "too many missed start times" (more than 100) error. Recreate the cronjob if it is found.
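
The check above can be sketched as a scan over event or log messages. The exact wording of the error differs between Kubernetes versions, so the pattern below is an assumption that matches loosely on the phrase "too many missed start time(s)". The messages would come from the cronjob events or the controller logs; the ones shown here are examples.

```python
import re

# Assumed wording; matched loosely because it varies across versions.
MISSED_PATTERN = re.compile(r"too many missed start times?", re.IGNORECASE)


def has_missed_start_error(messages) -> bool:
    """Return True if any message reports too many missed start times."""
    return any(MISSED_PATTERN.search(message) for message in messages)


# Example messages for illustration only.
messages = [
    "Created job nightly-report-28400000",
    "Cannot determine if job needs to be started: too many missed start time (> 100)",
]
print(has_missed_start_error(messages))
```

Before recreating the cronjob, export its current definition so the recreated object matches the original.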
