---
id: 5fd226f3-2c23-4ad9-8f63-0335d65831de
---

# GPU Environment Debugging
---

This runbook provides a series of systematic checks for troubleshooting a machine learning GPU environment. The checks confirm that common tools such as PyTorch and TensorFlow are set up correctly and can use the GPU through CUDA, NVIDIA's parallel computing platform and application programming interface. They proceed from the hardware, to the driver, to the CUDA toolkit, and finally to each framework.

### Parameters
```shell
# Environment variables

# Driver branch used in the apt package name nvidia-driver-<branch>, for example 535
export NVIDIA_DRIVER_VERSION="PLACEHOLDER"
```

## Debug

### Check for available GPU hardware
```shell
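# NVIDIA compute GPUs are often listed as "3D controller" rather than "VGA"
lspci | grep -iE 'vga|3d controller|nvidia'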
```

### Get details of GPU cards installed on machine
```shell
nvidia-smi -q
```

### Check if CUDA is installed and check its installation version
```shell
nvcc --version
```
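
The `nvcc` output is verbose, and the command is missing entirely when the toolkit is not on `PATH`. The sketch below pulls out just the release number and handles the missing-binary case. The helper name `nvcc_release` is hypothetical, and it assumes the usual `release X.Y` wording in the `nvcc --version` output.

```shell
# Hypothetical helper: read `nvcc --version` output on stdin and
# print only the toolkit release number (for example 12.2).
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

if command -v nvcc >/dev/null 2>&1; then
  echo "CUDA toolkit release: $(nvcc --version | nvcc_release)"
else
  # /usr/local/cuda/bin is a common default location, not a guarantee.
  echo "nvcc not found on PATH; check /usr/local/cuda/bin or install the CUDA toolkit"
fi
```

A missing `nvcc` does not necessarily mean the frameworks cannot use the GPU, since prebuilt PyTorch and TensorFlow packages can ship their own CUDA runtime libraries.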

### Check PyTorch version
```shell
python3 -c "import torch; print(torch.__version__)"
```

### Check if PyTorch can access the GPU
```shell
python3 -c "import torch; print(torch.cuda.is_available())"
```
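
When this check prints `False`, combining it with the CUDA version of the PyTorch build helps narrow down the cause, because `torch.version.cuda` is `None` for builds without CUDA support. The following is a minimal sketch; the function name `diagnose_torch` and the labels it prints are hypothetical.

```shell
# Hypothetical helper: classify the result of the two PyTorch checks.
#   $1: output of torch.cuda.is_available()  (True or False)
#   $2: output of torch.version.cuda         (for example 12.1, or None)
diagnose_torch() {
  if [ "$1" = "True" ]; then
    echo "ok"
  elif [ "$2" = "None" ]; then
    echo "no-cuda-build"
  else
    echo "cuda-build-cannot-reach-gpu"
  fi
}

# Only run the live check when PyTorch is importable.
if python3 -c "import torch" >/dev/null 2>&1; then
  diagnose_torch \
    "$(python3 -c 'import torch; print(torch.cuda.is_available())')" \
    "$(python3 -c 'import torch; print(torch.version.cuda)')"
else
  echo "PyTorch is not importable with this python3"
fi
```

A result of `no-cuda-build` suggests reinstalling a CUDA-enabled PyTorch package, while `cuda-build-cannot-reach-gpu` points toward the driver or toward device visibility settings such as `CUDA_VISIBLE_DEVICES`.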

### Check CUDA version being used by PyTorch, ensuring it is supported by the driver version
```shell
python3 -c "import torch; print(torch.version.cuda)"
```
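
To judge whether the driver supports that version, it can be compared with the highest CUDA version the driver reports in the `nvidia-smi` header. The sketch below is a rough check under stated assumptions: the helper names `driver_cuda_version` and `version_le` are hypothetical, the header is assumed to contain `CUDA Version: X.Y`, and `sort -V` must be available.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the
# CUDA version shown in its header.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed when version $1 <= version $2.
# sort -V orders version strings, so 9.2 sorts before 10.1.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  driver_cuda="$(nvidia-smi | driver_cuda_version)"
  torch_cuda="$(python3 -c 'import torch; print(torch.version.cuda)' 2>/dev/null)"
  case "$torch_cuda" in
    ''|None)
      echo "PyTorch CUDA version unavailable" ;;
    *)
      if [ -z "$driver_cuda" ]; then
        echo "Could not read the driver's CUDA version"
      elif version_le "$torch_cuda" "$driver_cuda"; then
        echo "OK: PyTorch CUDA $torch_cuda <= driver CUDA $driver_cuda"
      else
        echo "Mismatch: PyTorch CUDA $torch_cuda > driver CUDA $driver_cuda"
      fi ;;
  esac
else
  echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```

The same comparison applies to the CUDA version reported by TensorFlow. Treat a mismatch as a prompt to consult NVIDIA's compatibility documentation rather than as a definitive failure, since some version combinations within a major release remain compatible.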

### Check TensorFlow version
```shell
python3 -c "import tensorflow as tf; print(tf.__version__)"
```

### Check if TensorFlow can access the GPU
```shell
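# An empty list means TensorFlow cannot see a GPU (tf.test.is_gpu_available() is deprecated)
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"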
```

### Check CUDA version being used by TensorFlow, ensuring it is supported by the driver version
```shell
python3 -c "import tensorflow as tf; print(tf.sysconfig.get_build_info()['cuda_version'])"
```

### Update version of NVIDIA driver if necessary
```shell
sudo apt-get update && sudo apt-get install "nvidia-driver-${NVIDIA_DRIVER_VERSION}"
```
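
Before installing, it can help to confirm that an update is needed at all. The sketch below compares the installed driver's branch with `NVIDIA_DRIVER_VERSION`. It assumes that variable holds a numeric branch such as `535` (an example value only), and the helper names `driver_branch` and `needs_update` are hypothetical.

```shell
# Hypothetical helper: print the branch (major) number of a full
# driver version, for example 535.154.05 -> 535.
driver_branch() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeed when the installed driver ($1) is on an
# older branch than the target branch ($2).
needs_update() {
  [ "$(driver_branch "$1")" -lt "$2" ]
}

target="${NVIDIA_DRIVER_VERSION:-}"
case "$target" in
  ''|*[!0-9]*)
    echo "Set NVIDIA_DRIVER_VERSION to a numeric branch (for example 535) first" ;;
  *)
    if command -v nvidia-smi >/dev/null 2>&1; then
      installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
      if [ -z "$installed" ]; then
        echo "Could not read the installed driver version"
      elif needs_update "$installed" "$target"; then
        echo "Installed driver $installed is older than branch $target"
      else
        echo "Installed driver $installed already satisfies branch $target"
      fi
    else
      echo "nvidia-smi not found; no NVIDIA driver appears to be installed"
    fi ;;
esac
```

After a driver install or upgrade, a reboot is typically needed before the new kernel module is loaded and `nvidia-smi` reports the new version.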

