This incident type covers slow communication between etcd members, which degrades the performance of the etcd cluster. The incident is triggered when the 99th percentile of member-to-member communication time exceeds 0.15 seconds. Because members rely on this traffic for heartbeats and log replication, sustained slowness can affect the functionality and stability of the cluster and requires immediate attention to restore normal operation.
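To make the trigger condition concrete, here is a minimal Python sketch that estimates a 99th percentile from cumulative histogram buckets and compares it with the 0.15 second threshold. It assumes the latency data is exported as a Prometheus-style histogram (for etcd this is commonly the etcd_network_peer_round_trip_time_seconds metric, but the runbook does not name the metric, so treat that as an assumption). The function names and bucket values are illustrative only.

```python
from math import inf

P99_THRESHOLD_SECONDS = 0.15  # trigger threshold stated in this runbook


def histogram_quantile(q, buckets):
    """Estimate the q-th quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound, with +inf as the last bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == inf:
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def is_communication_slow(p99_seconds):
    """Return True when the p99 exceeds the runbook's 0.15 s threshold."""
    return p99_seconds > P99_THRESHOLD_SECONDS


# Illustrative (made-up) bucket counts, not real measurements.
example_buckets = [(0.05, 90), (0.1, 95), (0.2, 100), (inf, 100)]
example_p99 = histogram_quantile(0.99, example_buckets)
```

With these made-up counts, 99% of observations fall below roughly 0.18 seconds, so is_communication_slow(example_p99) would report the incident condition as met.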
Parameters
Debug
Check if the Etcd service is running
Check the logs for the Etcd service
Check Etcd cluster health
Check the health of each Etcd member
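The two health checks above can be scripted with etcdctl. The sketch below is one possible wrapper, assuming etcdctl (v3 API) is on the PATH and that any TLS flags your cluster needs (such as --cacert, --cert and --key) are supplied separately. The helper names are hypothetical; only the etcdctl subcommand and flags come from the tool itself.

```python
import subprocess


def endpoint_health_command(endpoints=None):
    """Build an `etcdctl endpoint health` command line.

    With no endpoints, `--cluster` asks etcdctl to check every member it can
    discover; with endpoints, only the listed members are checked.
    """
    cmd = ["etcdctl", "endpoint", "health"]
    if endpoints:
        cmd.append("--endpoints=" + ",".join(endpoints))
    else:
        cmd.append("--cluster")
    return cmd


def run_health_check(endpoints=None, timeout=10):
    """Run the health check and return (healthy, combined_output)."""
    result = subprocess.run(
        endpoint_health_command(endpoints),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # etcdctl exits non-zero when at least one endpoint is unhealthy.
    return result.returncode == 0, (result.stdout + result.stderr).strip()
```

Calling run_health_check() checks the whole cluster, while run_health_check(["https://10.0.0.1:2379"]) checks a single member (the address is a placeholder).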
Check the network latency between Etcd members
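One rough way to measure latency from one member to another is to time a TCP connection to the peer port. The sketch below assumes the default etcd peer port 2380 and measures only TCP handshake time, which approximates one network round trip; it is not the same measurement etcd uses internally. Function names are illustrative.

```python
import socket
import time


def tcp_connect_latency(host, port=2380, timeout=2.0):
    """Return the seconds taken to open a TCP connection to host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start


def sample_latencies(host, port=2380, samples=5, timeout=2.0):
    """Take several measurements, since a single sample can be noisy."""
    return [tcp_connect_latency(host, port, timeout) for _ in range(samples)]
```

Run sample_latencies from each member against every other member and compare the worst values with the 0.15 second threshold. A refused or timed-out connection raises OSError, which can itself point to a firewall or routing problem.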
Check the CPU and memory usage of Etcd processes
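For the CPU and memory check, a small parser over ps output is often enough. This sketch assumes a Linux host with procps, where a command such as `ps -C etcd -o pid=,pcpu=,pmem=,rss=` prints one header-less line per etcd process; the parser name and the sample line in the usage note are made up for illustration.

```python
def parse_ps_output(text):
    """Parse header-less `ps -o pid=,pcpu=,pmem=,rss=` output into dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or unexpected lines
        pid, cpu, mem, rss = parts
        rows.append(
            {
                "pid": int(pid),
                "cpu_percent": float(cpu),
                "mem_percent": float(mem),
                "rss_kib": int(rss),  # ps reports resident set size in KiB
            }
        )
    return rows
```

For example, parse_ps_output(" 1234  3.5  1.2 204800\n") yields a single record for PID 1234. Sustained high CPU on an etcd process is one plausible reason for slow responses to peers.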
Check the network traffic between Etcd members
Check the network bandwidth between the Etcd members
Check the firewall rules for Etcd ports
Check the configuration file for Etcd
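When reviewing the configuration, the heartbeat-interval and election-timeout settings (both in milliseconds) are the ones most directly tied to peer latency. The sketch below encodes the rule of thumb from the etcd tuning guidance as I understand it: heartbeat interval roughly 0.5 to 1.5 times the round-trip time, and election timeout at least 10 times the round-trip time. Treat these ratios and the function name as assumptions to verify against the documentation for your etcd version.

```python
def check_raft_timing(heartbeat_ms, election_ms, rtt_ms):
    """Return a list of warnings about timing settings relative to measured RTT."""
    warnings = []
    if heartbeat_ms < 0.5 * rtt_ms:
        warnings.append(
            f"heartbeat-interval {heartbeat_ms} ms is below 0.5x the RTT ({rtt_ms} ms)"
        )
    if election_ms < 10 * rtt_ms:
        warnings.append(
            f"election-timeout {election_ms} ms is below 10x the RTT ({rtt_ms} ms)"
        )
    return warnings
```

With the defaults of 100 ms and 1000 ms, check_raft_timing(100, 1000, 150) returns one warning about the election timeout, which suggests the defaults may be too tight for a cluster whose members are 150 ms apart.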
Possible cause: high network traffic between etcd cluster members.
Repair
Increase the resources allocated to the Etcd cluster by adding more nodes or increasing the CPU and memory on the existing nodes.