Runbook

GPU Environment Debugging

Overview

This runbook walks through a series of systematic checks for troubleshooting machine learning environments that rely on GPUs. The checks confirm that common tools such as PyTorch and TensorFlow are set up and functioning properly in a GPU environment and with CUDA, NVIDIA's parallel computing platform and application programming interface.

Debug

Check for available GPU hardware
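
One way to approach this step on Linux is to look for NVIDIA devices on the PCI bus. The sketch below assumes the `lspci` utility is installed; the helper names `filter_nvidia_devices` and `list_nvidia_devices` are illustrative, not part of any library.

```python
import shutil
import subprocess


def filter_nvidia_devices(lspci_output: str) -> list[str]:
    """Return the lspci lines that mention an NVIDIA device."""
    return [
        line.strip()
        for line in lspci_output.splitlines()
        if "nvidia" in line.lower()
    ]


def list_nvidia_devices() -> list[str]:
    """Run lspci (if present) and return the lines describing NVIDIA devices."""
    if shutil.which("lspci") is None:
        return []  # lspci is not installed, so nothing can be concluded
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=False)
    return filter_nvidia_devices(result.stdout)


if __name__ == "__main__":
    devices = list_nvidia_devices()
    print("\n".join(devices) if devices else "No NVIDIA devices reported by lspci")
```

An empty result only means that `lspci` reported nothing. Inside a container or virtual machine the GPU may not be passed through even though the host has one.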

Get details of the GPU cards installed on the machine
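
A minimal sketch of this step, assuming the NVIDIA driver is installed and `nvidia-smi` is on the `PATH`. It uses the `--query-gpu` and `--format=csv,noheader` options of `nvidia-smi`; the selection of fields and the helper names are illustrative choices.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; chosen here for illustration.
QUERY_FIELDS = ["index", "name", "driver_version", "memory.total"]


def parse_gpu_csv(text: str) -> list[dict[str, str]]:
    """Turn the CSV output of nvidia-smi into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
        for row in rows
        if row  # skip blank lines
    ]


def query_gpus() -> list[dict[str, str]]:
    """Return details for each GPU, or an empty list if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []  # typically a driver problem; inspect result.stderr by hand
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```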

Check whether CUDA is installed and, if so, which version
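
The CUDA toolkit version can usually be read from the compiler driver with `nvcc --version`. The sketch below assumes `nvcc` is on the `PATH`; the helper names are illustrative.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(text: str) -> Optional[tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' part of nvcc's output."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_version() -> Optional[tuple[int, int]]:
    """Return the CUDA toolkit version, or None if nvcc cannot be found."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("CUDA toolkit:", toolkit_version())
```

A `None` result does not prove that CUDA is absent. The toolkit may be installed outside the `PATH` (for example under `/usr/local/cuda/bin`), and framework packages often bundle their own CUDA runtime libraries without shipping `nvcc`.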

Check PyTorch version
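
A small sketch of this step, assuming PyTorch may or may not be installed in the active environment. PyTorch version strings often carry a local build tag after a `+` (for example a `cu...` tag for CUDA builds or `cpu` for CPU-only builds), so the helper `split_local_version`, a name chosen for illustration, separates the two parts.

```python
import importlib.util
from typing import Optional


def split_local_version(version: str) -> tuple[str, Optional[str]]:
    """Split 'X.Y.Z+tag' into ('X.Y.Z', 'tag'); the tag is None if absent."""
    base, _, tag = version.partition("+")
    return base, (tag or None)


def pytorch_version() -> Optional[str]:
    """Return torch.__version__ as a plain string, or None if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return str(torch.__version__)


if __name__ == "__main__":
    version = pytorch_version()
    if version is None:
        print("PyTorch is not installed in this environment")
    else:
        print("PyTorch:", split_local_version(version))
```

A `cpu` build tag is a common reason for PyTorch not seeing a GPU that the driver reports correctly.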

Check if PyTorch can access the GPU
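
This check can be sketched with `torch.cuda.is_available()`, `torch.cuda.device_count()` and `torch.cuda.get_device_name()`. The function name `pytorch_gpu_report` and the small on-device addition used as a smoke test are illustrative choices.

```python
import importlib.util
from typing import Optional


def pytorch_gpu_report() -> Optional[dict]:
    """Summarize PyTorch's view of the GPUs, or return None if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    report = {
        "cuda_available": available,
        "device_count": count,
        "device_names": [torch.cuda.get_device_name(i) for i in range(count)],
        "smoke_test_ok": None,  # stays None when no GPU is visible
    }
    if available:
        # Run a tiny computation on the first GPU to confirm it is usable.
        x = torch.ones(2, device="cuda")
        report["smoke_test_ok"] = float((x + x).sum().item()) == 4.0
    return report


if __name__ == "__main__":
    print(pytorch_gpu_report())
```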

Check CUDA version being used by PyTorch, ensuring it is supported by the driver version
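
PyTorch exposes the CUDA version it was built against as `torch.version.cuda` (it is `None` for CPU-only builds), and the header of plain `nvidia-smi` output reports the highest CUDA version the installed driver supports. A minimal sketch comparing the two follows; the helper names are illustrative, and the comparison is deliberately conservative.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_driver_cuda(nvidia_smi_output: str) -> Optional[tuple[int, int]]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' header field."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_cuda_version() -> Optional[tuple[int, int]]:
    """Return the highest CUDA version the driver reports, if nvidia-smi works."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=False)
    return parse_driver_cuda(result.stdout)


def runtime_supported(runtime_cuda: str, driver_cuda: tuple[int, int]) -> bool:
    """Conservative check: the framework's CUDA must not be newer than the driver's."""
    major, minor = (int(part) for part in runtime_cuda.split(".")[:2])
    return (major, minor) <= driver_cuda


if __name__ == "__main__":
    import importlib.util

    driver = driver_cuda_version()
    if importlib.util.find_spec("torch") is not None and driver is not None:
        import torch

        if torch.version.cuda is not None:
            print("supported:", runtime_supported(torch.version.cuda, driver))
```

NVIDIA's minor-version compatibility can allow a newer minor release within the same major version to run on an older driver, so a `False` result here is a prompt to read the compatibility documentation rather than a definitive failure.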

Check TensorFlow version
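
Importing TensorFlow can be slow, so one option is to read the installed version from package metadata instead. The sketch below uses the standard library's `importlib.metadata`; the list of distribution names is an assumption about how TensorFlow might have been installed, and `first_installed` is an illustrative helper name.

```python
from importlib import metadata
from typing import Optional

# Distribution names TensorFlow is commonly published under (assumed list).
TENSORFLOW_DISTRIBUTIONS = ["tensorflow", "tensorflow-cpu", "tensorflow-gpu", "tf-nightly"]


def first_installed(candidates: list[str]) -> Optional[tuple[str, str]]:
    """Return (distribution name, version) for the first installed candidate."""
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # not installed under this name; try the next one
    return None


if __name__ == "__main__":
    found = first_installed(TENSORFLOW_DISTRIBUTIONS)
    print(found if found else "TensorFlow is not installed in this environment")
```

Once TensorFlow is imported, the same information is available as `tf.__version__`.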

Check if TensorFlow can access the GPU
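
In TensorFlow 2.x this check is commonly done with `tf.config.list_physical_devices("GPU")`. A minimal sketch, with `tensorflow_gpu_names` as an illustrative helper name:

```python
import importlib.util
from typing import Optional


def tensorflow_gpu_names() -> Optional[list[str]]:
    """Return the names of GPUs visible to TensorFlow, or None if it is missing."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = tensorflow_gpu_names()
    if names is None:
        print("TensorFlow is not installed in this environment")
    elif not names:
        print("TensorFlow is installed but sees no GPU")
    else:
        print("GPUs visible to TensorFlow:", names)
```

An empty list alongside a working `nvidia-smi` often points to a CPU-only TensorFlow build or to CUDA libraries that TensorFlow cannot load; its startup log messages usually name the missing library.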

Check CUDA version being used by TensorFlow, ensuring it is supported by the driver version
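
TensorFlow reports the CUDA and cuDNN versions it was built against through `tf.sysconfig.get_build_info()`. The sketch below reads the relevant keys defensively, since CPU-only builds may omit them; `summarize_build_info` is an illustrative helper name.

```python
import importlib.util
from typing import Optional


def summarize_build_info(info: dict) -> dict:
    """Pick the CUDA-related entries out of TensorFlow's build info."""
    return {
        "is_cuda_build": bool(info.get("is_cuda_build", False)),
        "cuda_version": info.get("cuda_version"),
        "cudnn_version": info.get("cudnn_version"),
    }


def tensorflow_cuda_build() -> Optional[dict]:
    """Return CUDA build details for TensorFlow, or None if it is missing."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    return summarize_build_info(dict(tf.sysconfig.get_build_info()))


if __name__ == "__main__":
    print(tensorflow_cuda_build())
```

The reported `cuda_version` can then be compared with the "CUDA Version" field in the header of `nvidia-smi`, which is the highest CUDA version the installed driver supports.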

Update the NVIDIA driver if necessary
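
Before updating, it helps to confirm that the driver really is too old. The sketch below compares a driver version with the minimum Linux driver for a CUDA major version. The two thresholds in `MIN_LINUX_DRIVER` are taken from NVIDIA's CUDA compatibility documentation as understood at the time of writing and should be verified against the current release notes; the helper names are illustrative. The update procedure itself depends on the operating system and is not sketched here.

```python
from typing import Optional

# Minimum Linux driver per CUDA major version (assumed values; verify against
# NVIDIA's CUDA compatibility documentation before relying on them).
MIN_LINUX_DRIVER = {
    11: (450, 80, 2),
    12: (525, 60, 13),
}


def parse_driver_version(text: str) -> tuple[int, ...]:
    """Convert a dotted driver version string into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_meets_minimum(driver_version: str, cuda_major: int) -> Optional[bool]:
    """Return True/False for known CUDA majors, or None if the major is unknown."""
    minimum = MIN_LINUX_DRIVER.get(cuda_major)
    if minimum is None:
        return None
    return parse_driver_version(driver_version) >= minimum


if __name__ == "__main__":
    # Hypothetical driver version string, in the format printed by nvidia-smi.
    print(driver_meets_minimum("535.104.05", 12))
```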
