---
id: 177d23ab-a8ac-4f23-b251-ed4dd05748ce
---

# Spark YARN Container Failure
---

A Spark YARN container failure occurs when a container that YARN (Yet Another Resource Negotiator) allocated to a Spark application exits before completing its assigned task. Containers run on NodeManager hosts and are granted by the ResourceManager, so the cause may lie on either side: resource constraints (for example, a container exceeding its memory limit), hardware failures on the node, or software bugs. Spark normally retries the affected tasks, which delays the job; repeated container failures can cause the whole Spark job to fail, leading to reduced performance or data loss.

### Parameters
```shell
export YARN_RESOURCEMANAGER_SERVICE="PLACEHOLDER"
export YARN_NODEMANAGER_SERVICE="PLACEHOLDER"
export APPLICATION_ID="PLACEHOLDER"
export MEMORY_ALLOCATED_TO_THE_CONTAINER_IN_GB="PLACEHOLDER"
export NUMBER_OF_CPUS_ALLOCATED_TO_THE_CONTAINER="PLACEHOLDER"
export CPU_THRESHOLD_VALUE="PLACEHOLDER"
export MEMORY_THRESHOLD_VALUE="PLACEHOLDER"
```

## Debug

### 1. Check if YARN ResourceManager is running
```shell
sudo systemctl status ${YARN_RESOURCEMANAGER_SERVICE}
```

### 2. Check if YARN NodeManager is running on the failed node
```shell
sudo systemctl status ${YARN_NODEMANAGER_SERVICE}
```

### 3. Check if the Spark application is running on the YARN cluster
```shell
yarn application -list
```
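On a busy cluster the full list is hard to scan. The sketch below narrows it to applications that failed or were killed. It assumes the tab-separated column layout that `yarn application -list` prints (application ID first, State sixth, Final-State seventh); the function name `list_unsuccessful_apps` is illustrative and not part of the YARN CLI.

```shell
#!/bin/bash

# Read `yarn application -list` output on stdin and print the ID, state and
# final state of every application that failed or was killed.
# Assumed columns (tab-separated): ID, Name, Type, User, Queue, State, Final-State.
list_unsuccessful_apps() {
  awk -F'\t' '
    $1 ~ /^[ ]*application_/ {
      # Strip the space padding the CLI adds around each column
      for (i = 1; i <= NF; i++) gsub(/^[ ]+|[ ]+$/, "", $i)
      if ($6 == "FAILED" || $6 == "KILLED" || $7 == "FAILED" || $7 == "KILLED")
        print $1, $6, $7
    }'
}

# Example (requires the yarn CLI on PATH):
# yarn application -list -appTypes SPARK -appStates ALL | list_unsuccessful_apps
```

Passing `-appStates ALL` matters here: without it, the list shows only applications that have not finished yet, so a job that has already failed would not appear.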

### 4. Check if the Spark application is using the correct YARN queue
```shell
yarn application -status "${APPLICATION_ID}" | grep -i queue
```

### 5. Check if the Spark application logs show any errors
```shell
yarn logs -applicationId ${APPLICATION_ID} | grep ERROR
```
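A bare `grep ERROR` can return thousands of lines. As a rough triage aid, the sketch below counts a few well-known failure signatures in the aggregated logs so the likely cause stands out. The pattern list is an illustrative starting point rather than an exhaustive catalogue, and `summarize_failures` is a hypothetical helper name.

```shell
#!/bin/bash

# Read aggregated application logs on stdin and print how many lines match
# each of a few common container-failure signatures.
summarize_failures() {
  local tmp p
  local patterns=(
    "java.lang.OutOfMemoryError"
    "Container killed by YARN for exceeding memory limits"
    "Container killed on request"
    "ExecutorLostFailure"
  )
  # Buffer the logs in a temp file so each pattern can scan them in turn
  tmp=$(mktemp)
  cat > "$tmp"
  for p in "${patterns[@]}"; do
    # -c counts matching lines; -F treats the pattern as a fixed string
    printf '%6d  %s\n' "$(grep -cF -- "$p" "$tmp")" "$p"
  done
  rm -f "$tmp"
}

# Example (requires the yarn CLI on PATH):
# yarn logs -applicationId "${APPLICATION_ID}" | summarize_failures
```

A high count for the memory-limit or `OutOfMemoryError` patterns points toward the resource checks and the Repair step in this runbook.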

### 6. Check if there are any system-level errors on the failed node
```shell
sudo journalctl -u ${YARN_NODEMANAGER_SERVICE} -n 50 | grep ERROR
```

### Resource constraints: If the Spark YARN container is running on limited resources such as memory or CPU, it may fail due to insufficient resources.
```shell
#!/bin/bash

# Get the current CPU and memory usage.
# cpu_usage is the user-space CPU percentage reported by top;
# memory_usage is used memory as a percentage of total memory.
cpu_usage=$(top -b -n 1 | grep "Cpu(s)" | awk '{print $2}')
memory_usage=$(free | awk '/Mem/{printf "%.2f", $3/$2 * 100}')

# Set the threshold for CPU and memory usage
cpu_threshold=${CPU_THRESHOLD_VALUE} # e.g. 80
memory_threshold=${MEMORY_THRESHOLD_VALUE} # e.g. 80

# Check if CPU or memory usage exceeds the threshold
if (( $(echo "$cpu_usage > $cpu_threshold" | bc -l) )); then
  echo "CPU usage is too high (${cpu_usage}%)."
fi

if (( $(echo "$memory_usage > $memory_threshold" | bc -l) )); then
  echo "Memory usage is too high (${memory_usage}%)."
fi
```
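`bc` is not installed on every node image. As a hedged alternative, the same floating-point comparison can be done with `awk`; `exceeds_threshold` is a hypothetical helper name, not an existing utility.

```shell
#!/bin/bash

# Return success (exit 0) when usage is strictly greater than the threshold.
# Both arguments may be decimals, e.g. exceeds_threshold 85.5 80.
exceeds_threshold() {
  # Adding 0 forces a numeric (not string) comparison in awk
  awk -v usage="$1" -v limit="$2" 'BEGIN { exit !(usage + 0 > limit + 0) }'
}

# Example: compare current memory usage against the runbook threshold.
# mem=$(free | awk '/^Mem:/ { printf "%.2f", $3 / $2 * 100 }')
# if exceeds_threshold "$mem" "${MEMORY_THRESHOLD_VALUE}"; then
#   echo "Memory usage is too high (${mem}%)."
# fi
```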

## Repair

### Increase the resources allocated to the Spark YARN container, such as memory or CPU resources. This may prevent future failures caused by resource constraints.
```shell
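#!/bin/bash

# Set the variables
MEMORY=${MEMORY_ALLOCATED_TO_THE_CONTAINER_IN_GB}
CPU=${NUMBER_OF_CPUS_ALLOCATED_TO_THE_CONTAINER}

# The yarn CLI has no "config -set" subcommand. These properties control the
# total resources a NodeManager offers to containers and are read from
# yarn-site.xml, so print the values to set there.
echo "yarn.nodemanager.resource.memory-mb = $((MEMORY * 1024))"
echo "yarn.nodemanager.resource.cpu-vcores = ${CPU}"
read -r -p "Update yarn-site.xml on this node, then press Enter to restart the NodeManager..."

# Restart the YARN NodeManager service so the new values take effect
sudo systemctl restart "${YARN_NODEMANAGER_SERVICE}"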
```
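Raising the NodeManager limits only helps if the Spark application then requests larger executors. Each executor container that YARN must grant is the executor heap plus an off-heap overhead. The sketch below estimates that size, assuming Spark's documented default overhead (10% of executor memory, with a 384 MiB floor) and ignoring the rounding YARN applies to its allocation increment; `executor_container_mb` is an illustrative helper name.

```shell
#!/bin/bash

# Estimate the memory (in MiB) YARN must grant for one Spark executor,
# given the executor heap size in GiB.
# Assumes the default overhead: max(384 MiB, 10% of executor memory).
executor_container_mb() {
  local heap_mb=$(( $1 * 1024 ))
  local overhead_mb=$(( heap_mb / 10 ))
  if (( overhead_mb < 384 )); then
    overhead_mb=384
  fi
  echo $(( heap_mb + overhead_mb ))
}

# Example: request larger executors when resubmitting the job.
# echo "Each executor needs about $(executor_container_mb "${MEMORY_ALLOCATED_TO_THE_CONTAINER_IN_GB}") MiB"
# spark-submit --master yarn \
#   --executor-memory "${MEMORY_ALLOCATED_TO_THE_CONTAINER_IN_GB}g" \
#   --executor-cores "${NUMBER_OF_CPUS_ALLOCATED_TO_THE_CONTAINER}" \
#   <application jar or script>
```

The estimate must fit within both `yarn.scheduler.maximum-allocation-mb` and the node's `yarn.nodemanager.resource.memory-mb`; otherwise YARN cannot place the container.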

## Related Runbooks

### YARN ResourceManager Failure Impacting Spark Jobs

The ResourceManager manages resources in a Hadoop cluster. When it fails, Spark jobs cannot run properly. This incident requires investigating the root cause of the failure, restoring the ResourceManager, and preventing similar failures in the future.

### Spark tasks failing due to out of memory errors

Spark tasks fail with out of memory errors when the data volume exceeds the memory allocated to them. This can delay data processing or cause downtime, which affects the overall performance of the application.

### Spark tasks experiencing shuffle spills and high disk I/O

Spark uses shuffle operations to move data between nodes in a cluster. When the data being shuffled exceeds the memory allocated for the shuffle, it spills to disk, which causes high disk I/O and degrades performance. This incident requires optimizing the shuffle operations to reduce spills.

### Spark job failures due to cluster resource contentions

Spark jobs fail when several jobs compete for the same resources or data at the same time. This happens when cluster resources are not properly allocated or when the number of concurrent jobs exceeds the cluster's capacity, and it disrupts data processing and analysis.

### Spark executor failure during job execution

Spark executors are worker processes that run computations and store data in memory or on disk. When one or more executors fail during a job, the job can fail entirely or run with degraded performance. Causes include hardware or network issues, memory errors, and software bugs.
