This incident type involves an alert triggered by the Datadog monitoring agent indicating that it has stopped running in a particular Kubernetes cluster. This can lead to a variety of issues such as loss of visibility into cluster performance, potential security risks, and other problems that can impact operations. The incident requires immediate attention and resolution to ensure the Datadog agent is running properly and that the cluster is operating as expected.
Parameters
Debug
Check if the Datadog agent is deployed in the Kubernetes cluster
Check the logs of the Datadog agent pod
Check if the Datadog agent is running on all nodes in the Kubernetes cluster
Check if the Datadog agent is able to connect to the Datadog backend
Check the status of the Kubernetes nodes in the cluster
Check the status of the Kubernetes pods in the cluster
Check the status of the Kubernetes services in the cluster
There may have been an issue with the network or communication between the Datadog agent and the Kubernetes cluster.
Repair
Verify the connection between the Kubernetes cluster and the Datadog agent.
Perform a rolling restart of the Datadog agent DaemonSet.
Learn more
Related Runbooks
Check out these related runbooks to help you debug and resolve similar issues.