Debugging the GPU environment is a core part of managing and troubleshooting machine learning systems. This runbook walks through a series of systematic checks to confirm that common frameworks such as PyTorch and TensorFlow are set up and working correctly with the GPU and with CUDA, NVIDIA's parallel computing platform and application programming interface.
Parameters
Debug
Check for available GPU hardware
Get details of the GPU cards installed on the machine
Check whether CUDA is installed and, if so, which version
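The two hardware checks above can be sketched in Python by calling `nvidia-smi`, the command-line tool that ships with the NVIDIA driver. This is a minimal sketch, not the runbook's own implementation: the helper name `list_gpus` and the choice of queried fields are assumptions made for illustration. An empty result means either that `nvidia-smi` is not on the PATH or that the driver could not report any GPU.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; the selection is illustrative.
GPU_FIELDS = ("index", "name", "memory.total", "driver_version")


def list_gpus():
    """Return one dict per GPU reported by nvidia-smi, or [] if unavailable.

    Hypothetical helper: an empty list means nvidia-smi is missing or failed.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(GPU_FIELDS),
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) == len(GPU_FIELDS):
            gpus.append(dict(zip(GPU_FIELDS, values)))
    return gpus


if __name__ == "__main__":
    found = list_gpus()
    if not found:
        print("No NVIDIA GPU reported (nvidia-smi missing or driver not loaded).")
    for gpu in found:
        print(gpu)
```

Running `nvidia-smi` with no arguments prints the same information as a table, which is often enough for a quick manual check.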
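One way to run this check is to ask `nvcc`, the compiler included in the CUDA toolkit, for its version and parse the release number from the output. The sketch below assumes `nvcc` is on the PATH when the toolkit is installed; the helper names `parse_nvcc_release` and `cuda_toolkit_version` are hypothetical. Note that `nvcc` reports the installed toolkit version, while the "CUDA Version" shown in the `nvidia-smi` header is the highest CUDA version the driver supports, so the two can differ.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Extract a release string such as '12.2' from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


def cuda_toolkit_version():
    """Return the CUDA toolkit release, or None if nvcc is not available.

    Hypothetical helper: None does not prove CUDA is absent, only that
    nvcc was not found on the PATH or did not report a release.
    """
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("CUDA toolkit version:", cuda_toolkit_version())
```

Frameworks installed from prebuilt packages usually bundle their own CUDA runtime libraries, so a missing `nvcc` does not by itself mean the framework cannot use the GPU.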
Check PyTorch version
Check if PyTorch can access the GPU
Check CUDA version being used by PyTorch, ensuring it is supported by the driver version
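The three PyTorch checks above can be combined into one short report. This is a sketch under stated assumptions: the function names `pytorch_gpu_report` and `cuda_supported_by_driver` are hypothetical, and the driver's maximum supported CUDA version has to be read manually from the `nvidia-smi` header. The comparison is a simplification that treats a framework CUDA version as supported when it is not newer than the driver's maximum; NVIDIA's full compatibility rules are more detailed.

```python
def pytorch_gpu_report():
    """Collect PyTorch version and GPU visibility into a dict."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    report = {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "devices": [],
    }
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


def cuda_supported_by_driver(framework_cuda, driver_max_cuda):
    """Simplified check: framework CUDA must not exceed the driver's maximum.

    Both arguments are 'major.minor' strings, e.g. '11.8' and '12.2'.
    """
    def as_tuple(version):
        return tuple(int(part) for part in version.split(".")[:2])

    return as_tuple(framework_cuda) <= as_tuple(driver_max_cuda)


if __name__ == "__main__":
    print(pytorch_gpu_report())
```

If `cuda_available` is false while `built_with_cuda` is set, the installed build supports CUDA but cannot reach a working driver or device, which points at the driver rather than at PyTorch.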
Check TensorFlow version
Check if TensorFlow can access the GPU
Check CUDA version being used by TensorFlow, ensuring it is supported by the driver version
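The TensorFlow checks follow the same pattern. The sketch below assumes TensorFlow 2.3 or later, where `tf.sysconfig.get_build_info()` is available; the function name `tensorflow_gpu_report` is hypothetical. The `cuda_version` entry is read with `.get` because it may be absent in CPU-only builds. As with PyTorch, the reported CUDA version should not be newer than the "CUDA Version" shown in the `nvidia-smi` header.

```python
def tensorflow_gpu_report():
    """Collect TensorFlow version and GPU visibility into a dict."""
    try:
        import tensorflow as tf
    except ImportError:
        return {"installed": False}
    build_info = tf.sysconfig.get_build_info()
    return {
        "installed": True,
        "version": tf.__version__,
        # Physical GPUs TensorFlow can see; an empty list means none.
        "gpus": [device.name for device in tf.config.list_physical_devices("GPU")],
        # CUDA version TensorFlow was built against; None if not recorded.
        "built_with_cuda": build_info.get("cuda_version"),
    }


if __name__ == "__main__":
    print(tensorflow_gpu_report())
```

An empty `gpus` list alongside a populated `built_with_cuda` value suggests that the CUDA libraries or the driver are the problem, not the TensorFlow installation.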
Update the NVIDIA driver if necessary