Runbook

Spark tasks experiencing shuffle spills and high disk I/O.

Overview

This incident type occurs in distributed computing systems where Spark tasks experience high disk I/O and shuffle spills. Spark is a popular distributed computing engine that uses shuffle operations to move data between nodes in a cluster. A spill happens when the data being shuffled exceeds the execution memory allocated to the shuffle, forcing Spark to write intermediate data to local disk; the extra serialization and disk I/O can significantly degrade performance. Resolving this incident requires optimizing the shuffle operations to reduce spills and improve overall performance.

Parameters

Debug

Check the disk I/O usage
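
A minimal sketch for sampling disk throughput on a worker node, assuming the third-party psutil library is installed; on a real cluster you would run this (or an equivalent iostat check) on the nodes hosting the affected executors:

    import time
    import psutil

    # Take two snapshots of disk I/O counters a few seconds apart.
    before = psutil.disk_io_counters()
    time.sleep(5)
    after = psutil.disk_io_counters()

    read_mb = (after.read_bytes - before.read_bytes) / 1e6
    write_mb = (after.write_bytes - before.write_bytes) / 1e6
    print(f"Disk read:  {read_mb:.1f} MB over 5s")
    print(f"Disk write: {write_mb:.1f} MB over 5s")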

Check the network bandwidth usage
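
The same sampling approach works for network traffic, which is useful for telling shuffle-related network pressure apart from disk pressure; again a sketch assuming psutil is available on the node:

    import time
    import psutil

    # Snapshot network counters a few seconds apart to estimate throughput.
    before = psutil.net_io_counters()
    time.sleep(5)
    after = psutil.net_io_counters()

    sent_mb = (after.bytes_sent - before.bytes_sent) / 1e6
    recv_mb = (after.bytes_recv - before.bytes_recv) / 1e6
    print(f"Sent:     {sent_mb:.1f} MB over 5s")
    print(f"Received: {recv_mb:.1f} MB over 5s")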

Check the Spark task metrics
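
Per-stage task metrics can be pulled from the Spark monitoring REST API served by the driver UI. A minimal sketch, assuming the UI is reachable at localhost:4040 (adjust the host and port for your cluster), the requests library is installed, and exactly one application is attached to that UI:

    import requests

    BASE = "http://localhost:4040/api/v1"

    # List applications known to this driver and take the first one.
    apps = requests.get(f"{BASE}/applications").json()
    app_id = apps[0]["id"]

    # Print headline task metrics for each stage.
    for stage in requests.get(f"{BASE}/applications/{app_id}/stages").json():
        print(stage["stageId"], stage["status"],
              "completedTasks:", stage["numCompleteTasks"],
              "runTime(ms):", stage["executorRunTime"],
              "input(MB):", round(stage["inputBytes"] / 1e6, 1))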

Check the shuffle size and spill metrics
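
The same stage endpoint also exposes shuffle and spill counters, so the previous query can be extended to flag only the stages that actually spilled to disk; field names below follow the v1 stage endpoint, and the host/port remains an assumption:

    import requests

    BASE = "http://localhost:4040/api/v1"
    apps = requests.get(f"{BASE}/applications").json()
    app_id = apps[0]["id"]

    # Report shuffle volume for any stage that spilled data to disk.
    for stage in requests.get(f"{BASE}/applications/{app_id}/stages").json():
        spill = stage["diskBytesSpilled"]
        if spill > 0:
            print(f"stage {stage['stageId']}: "
                  f"shuffle read {stage['shuffleReadBytes'] / 1e6:.1f} MB, "
                  f"shuffle write {stage['shuffleWriteBytes'] / 1e6:.1f} MB, "
                  f"spilled to disk {spill / 1e6:.1f} MB")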

Check the system resource usage
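
A quick snapshot of overall CPU and memory pressure on a node, again sketched with psutil:

    import psutil

    # One-second CPU sample plus current memory utilization.
    cpu = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory()
    print(f"CPU usage:    {cpu:.1f}%")
    print(f"Memory usage: {mem.percent:.1f}% "
          f"({mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB)")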

A common root cause is insufficient memory allocated for shuffle operations.

Repair

Increase the memory available to shuffle operations so that shuffled data fits in memory instead of spilling to disk. In Spark 1.6 and later, which use unified memory management, this is controlled by spark.memory.fraction together with the executor memory itself (spark.executor.memory); the older spark.shuffle.memoryFraction parameter applies only to Spark's legacy memory manager.
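
A minimal sketch of applying these settings when building a PySpark session; the values and application name are illustrative, not tuned recommendations:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("shuffle-tuning")                       # hypothetical app name
        .config("spark.executor.memory", "8g")           # more total executor memory
        .config("spark.memory.fraction", "0.8")          # unified memory fraction; default is 0.6
        .config("spark.sql.shuffle.partitions", "400")   # more partitions -> smaller blocks per task
        .getOrCreate()
    )

After changing these settings, re-run the affected job and compare the spill metrics from the Debug section to confirm that diskBytesSpilled has dropped.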