Runbook

Elasticsearch Instability and Cluster Failures


Overview

This runbook covers frequent or unexpected instability and cluster failures in Elasticsearch, a distributed search and analytics engine. Such failures degrade performance and can lead to downtime and data loss. Causes vary and include hardware failure, software bugs, network issues, and configuration errors. Address these incidents quickly to limit their impact and restore the stability and reliability of the system.

Parameters

Debug

Check Elasticsearch cluster health
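
One way to run this check is to call the cluster health API (GET /_cluster/health) and read the status field. The sketch below assumes an unsecured node reachable at http://localhost:9200; ES_URL and both function names are illustrative, not part of the original runbook.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth


def fetch_cluster_health(base_url=ES_URL, timeout=10):
    """GET /_cluster/health and return the parsed JSON body."""
    url = f"{base_url}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(health):
    """Reduce a health response to a one-line summary."""
    return (
        f"status={health['status']} "
        f"nodes={health['number_of_nodes']} "
        f"unassigned_shards={health['unassigned_shards']}"
    )
```

Calling summarize_health(fetch_cluster_health()) prints a compact summary. A status of yellow means at least one replica is unassigned, and red means at least one primary shard is unassigned.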

Check Elasticsearch cluster state
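
The cluster state API can be very verbose, so it may help to request only the metrics needed. This sketch asks for master_node and nodes and resolves the elected master's name; the base URL and function names are assumptions for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth


def fetch_cluster_state(base_url=ES_URL, timeout=10):
    """GET /_cluster/state restricted to the master_node and nodes metrics."""
    url = f"{base_url}/_cluster/state/master_node,nodes"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def elected_master(state):
    """Return the name of the elected master node, or None if there is none."""
    master_id = state.get("master_node")
    node = state.get("nodes", {}).get(master_id)
    return node["name"] if node else None
```

If elected_master returns None, or the result changes often between calls, master elections are a likely place to look next.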

Check Elasticsearch node stats
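
Node stats expose per-node resource usage. As a minimal sketch, the code below reads JVM heap usage from GET /_nodes/stats/jvm and lists the nodes above a threshold. The 85 percent threshold, the base URL, and the function names are assumptions and should be tuned to the cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth


def fetch_node_stats(base_url=ES_URL, timeout=10):
    """GET /_nodes/stats/jvm and return the parsed JSON body."""
    url = f"{base_url}/_nodes/stats/jvm"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def nodes_over_heap(stats, threshold=85):
    """Return sorted names of nodes whose heap usage is at or above threshold."""
    hot = [
        node["name"]
        for node in stats["nodes"].values()
        if node["jvm"]["mem"]["heap_used_percent"] >= threshold
    ]
    return sorted(hot)
```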

Check Elasticsearch node info
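
Node info (GET /_nodes) reports static details such as version and roles. One use, sketched here under the same localhost assumption, is to group nodes by version so that a mixed-version cluster is easy to spot. The function names are illustrative.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth


def fetch_node_info(base_url=ES_URL, timeout=10):
    """GET /_nodes and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_nodes", timeout=timeout) as resp:
        return json.load(resp)


def versions_by_node(info):
    """Map each Elasticsearch version to the sorted names of nodes running it."""
    grouped = {}
    for node in info["nodes"].values():
        grouped.setdefault(node["version"], []).append(node["name"])
    return {version: sorted(names) for version, names in grouped.items()}
```

More than one key in the result means the nodes are not all on the same version.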

Check Elasticsearch index health
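
Per-index health is available from the cluster health API with level=indices. The following sketch keeps only the indices that are not green; the base URL and function names are assumptions.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth


def fetch_index_health(base_url=ES_URL, timeout=10):
    """GET /_cluster/health?level=indices and return the parsed JSON body."""
    url = f"{base_url}/_cluster/health?level=indices"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_indices(health):
    """Return {index_name: status} for every index that is yellow or red."""
    return {
        name: details["status"]
        for name, details in health.get("indices", {}).items()
        if details["status"] != "green"
    }
```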

Check Elasticsearch index stats
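
Index stats (GET /_stats) can show which indices dominate disk usage. This sketch requests the docs and store metrics and ranks indices by primary store size; the top count of 5, the base URL, and the function names are illustrative choices.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth


def fetch_index_stats(base_url=ES_URL, timeout=10):
    """GET /_stats/docs,store and return the parsed JSON body."""
    url = f"{base_url}/_stats/docs,store"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def largest_indices(stats, top=5):
    """Return (index_name, primary_size_in_bytes) pairs, largest first."""
    sizes = [
        (name, data["primaries"]["store"]["size_in_bytes"])
        for name, data in stats["indices"].items()
    ]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:top]
```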

Check Elasticsearch shard allocation
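
Shard allocation can be inspected through the cat shards API. As a sketch, the code below requests JSON output with a few columns and filters for unassigned shards. The column selection, base URL, and function names are assumptions.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth
COLUMNS = "index,shard,prirep,state,unassigned.reason"


def fetch_shards(base_url=ES_URL, timeout=10):
    """GET /_cat/shards as JSON, limited to the columns in COLUMNS."""
    url = f"{base_url}/_cat/shards?format=json&h={COLUMNS}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def unassigned_shards(rows):
    """Return only the rows whose state is UNASSIGNED."""
    return [row for row in rows if row.get("state") == "UNASSIGNED"]
```

For a shard that stays unassigned, the cluster allocation explain API (GET /_cluster/allocation/explain) gives the reason in more detail.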

Check Elasticsearch logs for errors
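
A simple way to scan the logs is to filter for WARN, ERROR, and FATAL entries. The sketch below assumes the plain-text log format, in which the level appears in square brackets, and a hypothetical log path of /var/log/elasticsearch/elasticsearch.log; the actual file name depends on the cluster name and the install method.

```python
import re

# Assumption: package install with the default cluster name.
LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"

# Levels are padded in the plain-text format, e.g. "[WARN ]".
LEVEL_PATTERN = re.compile(r"\[(?:WARN|ERROR|FATAL)\s*\]")


def problem_lines(lines):
    """Return the lines that carry a WARN, ERROR, or FATAL level marker."""
    return [line.rstrip("\n") for line in lines if LEVEL_PATTERN.search(line)]


def scan_log(path=LOG_PATH):
    """Read the log file at path and return its problem lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return problem_lines(handle)
```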

Restart Elasticsearch service
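
When restarting a node in a multi-node cluster, it is common to limit shard allocation to primaries first so the cluster does not start rebalancing while the node is down. The sketch below assumes a systemd unit named elasticsearch (as installed by the deb and rpm packages) and an unsecured node at localhost; the function names are illustrative.

```python
import json
import subprocess
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth


def allocation_request(mode, base_url=ES_URL):
    """Build a PUT /_cluster/settings request for shard allocation.

    mode is e.g. "primaries"; None resets the setting to its default.
    """
    body = {"persistent": {"cluster.routing.allocation.enable": mode}}
    return urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def restart_node(base_url=ES_URL):
    """Limit allocation to primaries, then restart the local service."""
    urllib.request.urlopen(allocation_request("primaries", base_url), timeout=10).close()
    # Assumption: systemd unit name "elasticsearch"; requires root privileges.
    subprocess.run(["systemctl", "restart", "elasticsearch"], check=True)
```

Once the node has rejoined the cluster, sending allocation_request(None) resets allocation to its default.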

Repair

Consider adjusting the cluster configuration to improve performance and stability, such as changing the number of nodes, shards, or replicas, or adjusting the allocation of resources.
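
As one example of such an adjustment, the replica count of an index can be changed at runtime through the index settings API. The sketch below builds and sends that request; the base URL and function names are assumptions, and the index name is supplied by the caller.

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth


def replica_settings_request(index, replicas, base_url=ES_URL):
    """Build a PUT /<index>/_settings request that sets number_of_replicas."""
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    body = {"index": {"number_of_replicas": replicas}}
    return urllib.request.Request(
        f"{base_url}/{urllib.parse.quote(index, safe='')}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def set_replicas(index, replicas, base_url=ES_URL, timeout=10):
    """Send the request and return the parsed JSON response."""
    request = replica_settings_request(index, replicas, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

The number of primary shards, unlike the replica count, cannot be changed this way on an existing index.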
