Runbook

Docker Swarm Node Failure

Overview

A Docker Swarm Node Failure incident occurs when one or more nodes in a Docker Swarm cluster become unresponsive or go down, causing the service tasks running on those nodes to fail. This can degrade the availability and performance of the affected applications and may require immediate action from the DevOps team to diagnose and fix the issue.

Debug

Check the status of the Docker service on the affected node
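
One way to run this check is sketched below. It assumes a systemd-based host where the Docker unit is named `docker`; the function name `check_docker_service` is illustrative and not part of any tool.

```shell
# Report whether the Docker daemon is running on this node.
# Assumption: the host uses systemd and the unit is called "docker".
check_docker_service() {
    if systemctl is-active --quiet docker; then
        echo "docker service: active"
    else
        echo "docker service: NOT active"
        return 1
    fi
}
```

Run `check_docker_service` on the affected node. If it reports a problem, `journalctl -u docker --since "1 hour ago"` usually shows why the daemon stopped.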

List all the running containers on the affected node
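
A minimal sketch of this step, using `docker ps` with a compact table format. The function names are hypothetical helpers; the second one is an optional extra that lists exited containers, which is often useful after a failure.

```shell
# Show running containers on this node with ID, name, and status.
list_running_containers() {
    docker ps --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}'
}

# Optional: show containers that have exited, to spot recent crashes.
list_exited_containers() {
    docker ps --all --filter status=exited \
        --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}'
}
```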

Inspect the logs of a specific container to look for any error messages
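
The log check could look roughly like this. The helper `container_errors` and the keyword list (`error`, `fatal`, `panic`) are assumptions chosen for illustration; adjust the pattern to whatever the application actually logs.

```shell
# Print recent log lines from a container that look like errors.
# $1: container name or ID; $2: number of trailing lines (default 200).
container_errors() {
    docker logs --tail "${2:-200}" --timestamps "$1" 2>&1 |
        grep -iE 'error|fatal|panic'
}
```

For example, `container_errors my-container 500` scans the last 500 lines. The function returns a non-zero status when nothing matches.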

Check the health status of a specific service running on the Docker Swarm cluster
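
As a rough sketch, the replica count and per-task state of a service can be shown together. This must run on a manager node; `service_health` is an illustrative name.

```shell
# Show replica counts, then the state of each task of one service.
# $1: service name. Run on a Swarm manager.
service_health() {
    docker service ls --filter "name=$1"
    docker service ps --no-trunc \
        --format 'table {{.Name}}\t{{.Node}}\t{{.DesiredState}}\t{{.CurrentState}}\t{{.Error}}' \
        "$1"
}
```

Tasks whose current state is not `Running`, or whose error column is non-empty, point to the nodes worth investigating.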

Check the status of the Docker Swarm cluster and its nodes
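
A possible way to pick out problem nodes is to filter the output of `docker node ls`, as sketched here. The function name is hypothetical, and the command must run on a manager.

```shell
# Print the hostnames of nodes whose status is anything other than Ready.
# Run on a Swarm manager.
unhealthy_nodes() {
    docker node ls --format '{{.Hostname}} {{.Status}}' |
        awk '$2 != "Ready" { print $1 }'
}
```

An empty result means every node currently reports `Ready`. Plain `docker node ls` gives the full table, including availability and manager status.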

Check the status of the Docker Swarm manager nodes
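
The manager check matters because Swarm needs a majority of managers reachable to keep a Raft quorum. The sketch below counts reachable managers; `manager_quorum` is an illustrative helper, not a Docker command.

```shell
# Count managers reporting Leader or Reachable and check for a majority.
# Returns 0 when more than half are reachable, 1 otherwise.
manager_quorum() {
    docker node ls --filter role=manager \
        --format '{{.Hostname}} {{.ManagerStatus}}' |
    awk '
        { total++ }
        $2 == "Leader" || $2 == "Reachable" { up++ }
        END {
            printf "%d of %d managers reachable\n", up, total
            rc = (up * 2 > total) ? 0 : 1
            exit rc
        }'
}
```

If quorum is already lost, `docker node ls` itself fails on the manager, so the function reports `0 of 0` and returns 1.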

Check the Docker Swarm events for any recent changes or failures
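
Here is one hedged way to review recent events without leaving a stream open. `docker events` normally runs until interrupted, so the sketch passes `--until` with the current Unix timestamp. The function name and the 30-minute default window are assumptions.

```shell
# Print node, service, and container events from a recent time window.
# $1: how far back to look, e.g. 30m or 2h (default 30m).
# Node and service events are only visible on a Swarm manager.
recent_swarm_events() {
    docker events --since "${1:-30m}" --until "$(date +%s)" \
        --filter type=node --filter type=service --filter type=container
}
```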

Repair

Identify the failed node(s) using Docker Swarm commands or monitoring tools, and verify whether each node is still reachable or has gone down completely.
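
The reachability check might be sketched as follows, run from a manager. It looks up the address Swarm recorded for the node and pings it. `probe_node` is a hypothetical helper, and it assumes ICMP is allowed between the hosts.

```shell
# Look up a node's recorded address and test basic network reachability.
# $1: node hostname or ID. Returns 0 reachable, 1 unreachable, 2 lookup failed.
probe_node() {
    local node="$1" addr
    addr=$(docker node inspect --format '{{.Status.Addr}}' "$node") || return 2
    if ping -c 3 "$addr" >/dev/null 2>&1; then
        echo "$node ($addr): reachable"
    else
        echo "$node ($addr): unreachable"
        return 1
    fi
}
```

A successful ping only shows the host is up. The Docker daemon on it may still be down, so follow up with a service check on the node itself.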

If the node is still reachable, try restarting the Docker service or the node itself to see if it resolves the issue.
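
A cautious version of the restart drains the node first so Swarm moves its tasks elsewhere. The helpers below are illustrative and assume a systemd host with `sudo` access. The drain and activate steps run on a manager, and the restart runs on the affected node.

```shell
# On a manager: stop scheduling tasks on the node and move existing ones away.
drain_node() {
    docker node update --availability drain "$1"
}

# On the affected node: restart the Docker daemon and report its state.
restart_docker_service() {
    sudo systemctl restart docker && systemctl is-active docker
}

# On a manager: let the node accept tasks again once it is healthy.
activate_node() {
    docker node update --availability active "$1"
}
```

If restarting the service does not help, rebooting the host is the next step. Keep the node drained until it reports `Ready` again.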

If the node is completely down, remove it from the Swarm cluster and replace it with a new node, or scale up the existing nodes to ensure sufficient capacity to handle the workload.
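
Removing a dead node and preparing its replacement could look like this sketch, run on a healthy manager. The function names are hypothetical. The demote step assumes the remaining managers still hold quorum.

```shell
# Remove a node that is down. A manager is demoted to worker first.
# $1: node hostname or ID.
remove_dead_node() {
    local node="$1" role
    role=$(docker node inspect --format '{{.Spec.Role}}' "$node") || return 1
    if [ "$role" = "manager" ]; then
        docker node demote "$node" || return 1
    fi
    docker node rm --force "$node"
}

# Print the command a replacement host should run to join as a worker.
worker_join_command() {
    docker swarm join-token worker
}
```

Run the command printed by `worker_join_command` on the new host, then confirm with `docker node ls` that it appears as `Ready`.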
