I am using GKE on GCP and have a GPU node, but I could not get my Docker container to work on it:
None
Traceback (most recent call last):
  File "/api/server.py", line 13, in <module>
    user_src.init()
  File "/api/app.py", line 61, in init
    "device": torch.cuda.get_device_name(),
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/cuda/__init__.py", line 329, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/cuda/__init__.py", line 359, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
ERROR conda.cli.main_run:execute(47): `conda run /bin/bash -c python3 -u server.py` failed. (See above for error)
I am running other models on the same node without any problem, so the official GKE drivers are working (installed with the DaemonSet from: Run GPUs in GKE Standard node pools | Google Kubernetes Engine (GKE) | Google Cloud).
Thinking about it, this may be a problem with running PyTorch. I will try to isolate it.
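
To isolate it, a small diagnostic script run inside the failing container might help separate "the driver is not visible in this container" from "PyTorch cannot find the driver". The sketch below is only a starting point under some assumptions: the function name `gpu_diagnostics` is made up for illustration, and the `/usr/local/nvidia/lib64` path is where I understand GKE exposes the driver libraries inside containers, which is worth verifying against the GKE docs for your node image. It only reports what it sees and does not fix anything.

```python
import os
import shutil


def gpu_diagnostics():
    """Collect basic facts about GPU/driver visibility inside the container.

    Hypothetical helper for illustration; it only reads state.
    """
    report = {}

    # Location of nvidia-smi if it is on PATH, otherwise None.
    report["nvidia_smi_path"] = shutil.which("nvidia-smi")

    # Environment variables that commonly affect driver library discovery.
    for name in ("PATH", "LD_LIBRARY_PATH", "CUDA_VISIBLE_DEVICES"):
        report[name] = os.environ.get(name)

    # Device node that normally exists when a GPU is attached to the container.
    report["dev_nvidia0_exists"] = os.path.exists("/dev/nvidia0")

    # Assumption: GKE mounts the driver libraries at this path in the container.
    report["gke_driver_lib_dir_exists"] = os.path.isdir("/usr/local/nvidia/lib64")

    # PyTorch's own view, if PyTorch is installed in this environment.
    report["torch_version"] = None
    report["torch_cuda_build"] = None
    report["cuda_available"] = None
    try:
        import torch

        report["torch_version"] = torch.__version__
        # CUDA version PyTorch was built against (None for CPU-only builds).
        report["torch_cuda_build"] = torch.version.cuda
        # Returns False instead of raising when no driver is found.
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        pass

    return report


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print(f"{key}: {value}")
```

Running this both directly and through `conda run` (the way `server.py` is started in the log above) could show whether the conda environment sees a different `PATH` or `LD_LIBRARY_PATH` than the plain shell. If `dev_nvidia0_exists` is False, the pod may not be requesting a GPU at all (for example, a missing `nvidia.com/gpu` resource limit), which would point away from PyTorch.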