Restarting CUDA Devices

At some point we all need some rest and a kick in the ass to get back into production mode. The same sometimes happens to my Docker containers when (re)starting them on a 'warm' system. Starting a container running Jupyter, assigned to work with CUDA on one (or more) GPUs, fails with something like:

Cannot start service training: OCI runtime create failed.
nvidia-container-cli: initialization error: cuda error: unknown error

Not so cool. A restart of the whole system fixes the issue. But hey, there are reasons we are not using Windows! Restarting the Docker engine does not solve the issue either. BUT unloading and reloading the nvidia_uvm kernel module solved the problem for me:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

This hotfix has been tested on different Ubuntu machines (16+, with and without a GUI). The crash can be reproduced by running a Jupyter notebook with a kernel using the GPU and then killing the Docker container (yes, not so nice, but sometimes it happens).
