Runbook
Docker Swarm Node Failure
Overview
A Docker Swarm Node Failure incident occurs when one or more nodes in a Docker Swarm cluster become unresponsive or go down, so that the applications or services running on those nodes fail. This can disrupt the availability and performance of the affected application or service, and may require immediate action from the DevOps team to diagnose and fix the issue.
Debug
Check the status of the Docker service on the affected node
Check the logs for any errors or warnings related to the Docker service
List all the running containers on the affected node
Inspect the logs of a specific container to look for any error messages
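The four node-level checks above could be sketched as the shell helpers below. This is a minimal sketch, not a definitive procedure: it assumes the affected node runs systemd and that the Docker unit is named `docker`. The function names, the one-hour log window, the 200-line tail, and the error-matching pattern are illustrative choices rather than part of the original runbook.

```shell
# Node-level checks. Run these on the affected node.
# Assumption: systemd-based host with a unit named "docker".

# Show the current state of the Docker daemon.
check_docker_service() {
  systemctl status docker --no-pager
}

# Show daemon log entries at warning priority or higher.
# $1: journalctl --since value (defaults to "1 hour ago").
docker_daemon_warnings() {
  journalctl -u docker --since "${1:-1 hour ago}" -p warning --no-pager
}

# List the containers currently running on this node.
list_running_containers() {
  docker ps --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}'
}

# Print the likely error lines from the tail of one container's logs.
# $1: container name or ID, $2: number of lines to scan (defaults to 200).
# The grep pattern is a heuristic; prints nothing if no line matches.
container_errors() {
  docker logs --tail "${2:-200}" "$1" 2>&1 | grep -iE 'error|warn|fatal|panic' || true
}
```

For example, `container_errors <container_name> 500` scans the last 500 log lines of one container; replace `<container_name>` with a name or ID taken from `list_running_containers`.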
Check the health status of a specific service running on the Docker Swarm cluster
Check the status of the Docker Swarm cluster and its nodes
Check the status of the Docker Swarm manager nodes
Check the Docker Swarm events for any recent changes or failures
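The cluster-level checks above could be sketched as follows. These helpers are meant to run on a reachable manager node, since `docker node` and `docker service` commands only work there. The function names and the 30-minute default event window are assumptions made for illustration.

```shell
# Cluster-level checks. Run these on a reachable manager node.

# Show where each task of a service runs, with its current state and last error.
# $1: service name.
service_task_status() {
  docker service ps --no-trunc \
    --format 'table {{.Name}}\t{{.Node}}\t{{.CurrentState}}\t{{.Error}}' "$1"
}

# List all nodes with their status, availability, and manager status.
list_nodes() {
  docker node ls
}

# List only the manager nodes.
list_managers() {
  docker node ls --filter role=manager
}

# Print the hostnames of nodes that the managers report as Down.
down_nodes() {
  docker node ls --format '{{.Hostname}} {{.Status}}' | awk '$2 == "Down" { print $1 }'
}

# Print recent node and service events, then exit instead of streaming.
# $1: how far back to look (defaults to 30m).
recent_swarm_events() {
  docker events --since "${1:-30m}" --until "$(date +%s)" \
    --filter type=node --filter type=service
}
```

In the `list_managers` output, a healthy manager shows `Leader` or `Reachable` in the MANAGER STATUS column. `Unreachable` there suggests the manager itself is affected.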
Repair
Identify the failed node(s) using Docker Swarm commands or monitoring tools, and verify whether each node is still reachable or has gone down completely.
If the node is still reachable, try restarting the Docker service, and if that does not help, the node itself.
If the node is completely down, remove it from the Swarm cluster and replace it with a new node, or add capacity to the remaining nodes so that they can handle the workload.
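The repair steps above could be sketched as the helpers below. This is a hedged sketch under stated assumptions: the host uses systemd, the caller has `sudo` rights, and the function names are illustrative. Draining before a restart is an optional precaution that the steps above do not require. Removing a manager node generally requires demoting it first with `docker node demote`, which is not shown here.

```shell
# Repair helpers. Node names are passed as arguments.

# Restart the Docker daemon. Run on the affected node while it is still reachable.
# Assumption: systemd-based host and sudo access.
restart_docker() {
  sudo systemctl restart docker
}

# Stop scheduling new tasks on a node and move its tasks elsewhere. Run on a manager.
# $1: node hostname or ID.
drain_node() {
  docker node update --availability drain "$1"
}

# Return a drained node to service once it is healthy again. Run on a manager.
# $1: node hostname or ID.
activate_node() {
  docker node update --availability active "$1"
}

# Remove a node that is down from the cluster. Run on a manager.
# $1: node hostname or ID.
remove_node() {
  docker node rm "$1"
}

# Print the join command, including the token, for a replacement worker.
# Run on a manager, then run the printed command on the new node.
worker_join_command() {
  docker swarm join-token worker
}
```

A replacement flow might then be `remove_node <node_name>` on a manager, followed by `worker_join_command`, with the printed `docker swarm join` command executed on the new node.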