At some point we all need some rest and a kick in the ass to get back into production mode. The same happens to my Docker containers sometimes when (re)starting them on a ‘warm’ system. Starting a container that runs Jupyter and is assigned to work with CUDA on one (or more) GPUs fails with something like:
Cannot start service training: OCI runtime create failed.
nvidia-container-cli: initialization error: cuda error: unknown error
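For reference, the failing start is essentially this kind of command. The ‘training’ service comes from a Compose setup (as the error message suggests); the image name, port, and flags below are an illustrative sketch and assume the nvidia runtime (nvidia-container-runtime) is installed:

docker-compose up training
# ...or as a direct run (image is illustrative):
docker run --runtime=nvidia --rm -p 8888:8888 jupyter/tensorflow-notebook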
Not so cool. A restart of the whole system fixes the issue. But hey, there are reasons we are not using Windows! Restarting the Docker engine does not solve the issue either. BUT unloading and reloading the nvidia_uvm kernel module solved the problem for me:
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
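If rmmod refuses with a ‘module is in use’ error, some process is still holding the UVM device; a quick way to spot it (a sketch using the usual device node) is:

sudo lsof /dev/nvidia-uvm   # list processes still holding the UVM device

Stop those processes and retry the two commands above. To verify that containers can see the GPU again afterwards (the image tag is illustrative):

docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi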
This hotfix is tested on different Ubuntu machines (16.04+, with and without an X.org GUI). The crash can be reproduced by running a Jupyter notebook with a kernel using the GPU and then killing the Docker container (yes, not so nice, but sometimes it happens).
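In shell terms, a sketch of that reproduction (the service name is illustrative and reuses the ‘training’ service from the error above):

docker-compose up -d training   # start the GPU-backed Jupyter service
# ...run a notebook cell that initializes CUDA on the GPU...
docker-compose kill training    # hard-kill while the CUDA context is live
docker-compose up training      # this start now fails with the error above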