Runbook

Etcd member communication slow incident.

Overview

This incident occurs when communication between Etcd members slows down, degrading the performance of the Etcd cluster. It is triggered when the 99th percentile of member-to-member communication time exceeds 0.15 seconds. Because slow communication between members can affect the functionality and stability of the cluster, it requires immediate attention to restore normal operation.
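To make the trigger condition concrete, here is a minimal sketch that estimates a 99th percentile from cumulative histogram buckets and compares it with the 0.15 second threshold. It assumes the alert is derived from a Prometheus-style latency histogram (etcd exposes peer round-trip time as `etcd_network_peer_round_trip_time_seconds`); the function name and the bucket counts are hypothetical and exist only for illustration.

```python
import math

THRESHOLD_SECONDS = 0.15  # trigger threshold stated in the overview


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket,
    mirroring the layout of a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up sample: cumulative counts of observations per latency bucket.
SAMPLE_BUCKETS = [(0.05, 900), (0.1, 980), (0.2, 995), (0.4, 1000), (math.inf, 1000)]
SAMPLE_P99 = estimate_quantile(0.99, SAMPLE_BUCKETS)
SAMPLE_BREACHES = SAMPLE_P99 > THRESHOLD_SECONDS
```

With these sample buckets the estimate lands between 0.1 and 0.2 seconds, above the threshold, so the incident would fire. The estimate is only as precise as the bucket boundaries allow.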

Debug

Check if the Etcd service is running
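A minimal sketch of this check, assuming Etcd runs as a systemd unit named `etcd` (an assumption; deployments that run Etcd as a container or static pod need a different check). It runs `systemctl is-active` and interprets the single-word status that command prints. The helper names are hypothetical.

```python
import subprocess


def is_unit_active(status_output):
    """Return True if `systemctl is-active` reported the unit as active."""
    return status_output.strip() == "active"


def check_etcd_service(unit="etcd"):
    """Run `systemctl is-active <unit>` and interpret its output."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True, text=True, check=False, timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False  # systemctl is not available or did not respond
    return is_unit_active(result.stdout)
```

Calling `check_etcd_service()` on each member host gives a quick yes/no answer before moving on to the deeper checks.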

Check the logs for the Etcd service
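One way to scan the logs is sketched below, again assuming a systemd unit named `etcd` whose output goes to the journal. The substrings in `SLOW_PATTERNS` are illustrative examples of messages that can accompany slow communication between members; they are not an official or exhaustive list, and the helper names are hypothetical.

```python
import subprocess

# Example substrings only; adjust to the messages your Etcd version emits.
SLOW_PATTERNS = (
    "took too long",
    "failed to send out heartbeat",
    "lost the TCP streaming connection",
)


def find_slow_lines(log_lines, patterns=SLOW_PATTERNS):
    """Return the log lines that contain any of the given substrings."""
    return [line for line in log_lines if any(p in line for p in patterns)]


def read_etcd_journal(unit="etcd", since="1 hour ago"):
    """Fetch recent journal entries for the unit; empty list on failure."""
    try:
        result = subprocess.run(
            ["journalctl", "-u", unit, "--since", since, "--no-pager"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return []
    return result.stdout.splitlines()
```

A typical use is `find_slow_lines(read_etcd_journal())`, which narrows the last hour of logs down to the lines worth reading first.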

Check Etcd cluster health

Check the health of each Etcd member
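The cluster-wide and per-member health checks can be sketched together. The example assumes `etcdctl` (API v3) is on the PATH and that `etcdctl endpoint health --cluster -w json` prints a JSON list with `endpoint`, `health`, and `took` fields; treat the field names as an assumption to verify against your etcdctl version. Endpoint and TLS flags are omitted and would need to be added for a secured cluster. The helper names are hypothetical.

```python
import json
import subprocess


def unhealthy_endpoints(health_json):
    """Return the endpoints whose `health` field is not true."""
    entries = json.loads(health_json)
    return [e.get("endpoint", "?") for e in entries if not e.get("health", False)]


def fetch_cluster_health(etcdctl="etcdctl"):
    """Run the health command and return its raw stdout, or None on failure."""
    try:
        result = subprocess.run(
            [etcdctl, "endpoint", "health", "--cluster", "-w", "json"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None
    return result.stdout or None
```

An empty list from `unhealthy_endpoints` means every member answered the health probe; the `took` value of each entry is also worth comparing across members, since one slow member often stands out.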

Check the network latency between Etcd members
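A rough latency probe can be sketched by timing a TCP connection to each peer. The example assumes members listen for peer traffic on port 2380, the Etcd default; adjust it if your cluster uses another port. Connection time is only an approximation of network round-trip time, and the function name is hypothetical.

```python
import socket
import time


def tcp_connect_seconds(host, port=2380, timeout=2.0):
    """Return the time taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None  # refused, unreachable, or timed out
```

Running this from each member against every other member, several times in a row, shows whether the delay is specific to one link or one host. A `None` result points to a reachability or firewall problem instead of a slow link.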

Check the CPU and memory usage of Etcd processes
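Resource usage can be sampled with `ps`, as in the sketch below. It assumes a Linux host with procps `ps` and a process whose command name is `etcd` (both assumptions). Note that the `pcpu` column is CPU time averaged over the life of the process, not an instantaneous reading, and `rss` is reported in KiB. The helper names are hypothetical.

```python
import subprocess


def parse_ps_rows(ps_output):
    """Parse `pid pcpu rss` rows into dictionaries (rss converted to MiB)."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        pid, cpu, rss_kib = parts
        rows.append({
            "pid": int(pid),
            "cpu_percent": float(cpu),
            "rss_mib": int(rss_kib) / 1024,
        })
    return rows


def etcd_resource_usage(command="etcd"):
    """Run `ps` for the given command name; empty list on failure."""
    try:
        result = subprocess.run(
            ["ps", "-C", command, "-o", "pid=,pcpu=,rss="],
            capture_output=True, text=True, check=False, timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return []
    return parse_ps_rows(result.stdout)
```

Sustained high CPU or memory pressure on a member is a hint that the Repair step of adding resources applies.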

Check the network traffic between Etcd members
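Traffic volume can be approximated from interface byte counters, as in this sketch. It assumes a Linux host where `/proc/net/dev` is available, and it measures all traffic on an interface, not only traffic between Etcd members, so treat the result as an upper bound. The helper names are hypothetical.

```python
def parse_net_dev(text):
    """Map interface name to (rx_bytes, tx_bytes) from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not a complete counter row
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def bytes_per_second(before, after, interval_seconds):
    """Return (rx, tx) throughput between two (rx_bytes, tx_bytes) samples."""
    rx = (after[0] - before[0]) / interval_seconds
    tx = (after[1] - before[1]) / interval_seconds
    return rx, tx


def read_counters(path="/proc/net/dev"):
    """Read and parse the current interface counters."""
    with open(path, encoding="ascii") as handle:
        return parse_net_dev(handle.read())
```

Taking two `read_counters()` samples a few seconds apart and passing the values for the relevant interface to `bytes_per_second` gives the current throughput, which can then be compared with the link capacity.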

Check the network bandwidth between the Etcd members

Check the firewall rules for Etcd ports

Check the configuration file for Etcd

Possible cause: high network traffic between Etcd cluster members.

Repair

Increase the resources allocated to the Etcd cluster by adding more nodes or increasing the CPU and memory on the existing nodes.
