Help running on Google Kubernetes Engine (GKE)

I am using GCP GKE and have a GPU node, but I could not get the Docker container to work on it:

None
Traceback (most recent call last):
  File "/api/server.py", line 13, in <module>
    user_src.init()
  File "/api/app.py", line 61, in init
    "device": torch.cuda.get_device_name(),
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/cuda/__init__.py", line 329, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/cuda/__init__.py", line 359, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
ERROR conda.cli.main_run:execute(47): `conda run /bin/bash -c python3 -u server.py` failed. (See above for error)

I am running other models on the same node without any problem, so the official GKE drivers are working (installed with the DaemonSet from: Run GPUs in GKE Standard node pools | Google Kubernetes Engine (GKE) | Google Cloud).

Thinking about it, this may be a problem with how PyTorch is being run :thinking: I will try to isolate it.
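
For example, running something like this inside the container on its own (instead of the full server) should show whether PyTorch can see the driver at all; the pod and script names are just placeholders:

```python
# check_gpu.py -- minimal isolation check, run inside the container's conda env,
# e.g. kubectl exec -it <pod> -- conda run -n xformers python3 check_gpu.py
import torch

print("torch version:  ", torch.__version__)
print("built with CUDA:", torch.version.cuda)         # toolkit the wheel was built against
print("cuda available: ", torch.cuda.is_available())  # False means the NVIDIA driver is not visible
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device name: ", torch.cuda.get_device_name())
```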

Hey @Alexis_De_La_Torre, welcome to the forums :slight_smile:

I moved this to its own topic as it seems GKE-specific. Unfortunately I don’t have experience with GKE personally, so please do keep updating this thread with your findings.

Does nvidia-smi work inside the container (as opposed to on the node)?
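
For example (just a sketch), running something like this from a Python shell inside the pod would show whether the driver binaries were actually mounted into the container:

```python
# Rough check that the driver installed by the GKE DaemonSet is actually
# visible from inside this container (not just on the node).
import subprocess

try:
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print(result.stdout or result.stderr)
except FileNotFoundError:
    print("nvidia-smi not found: the driver binaries were not mounted into the container")
```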

I did find a note that the NVIDIA driver and CUDA toolkit versions must be compatible. The image is built on top of the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime base image, so it ships with CUDA 11.3 and cuDNN 8.
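
If nvidia-smi does work, it might be worth comparing the driver's supported CUDA version against that 11.3 toolkit; a rough sketch, assuming the pynvml bindings (pip install nvidia-ml-py) can be installed in the container:

```python
# Sketch: compare the driver's maximum supported CUDA version with the 11.3
# toolkit baked into the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime image.
import pynvml

pynvml.nvmlInit()
print("driver version:", pynvml.nvmlSystemGetDriverVersion())

cuda = pynvml.nvmlSystemGetCudaDriverVersion()  # e.g. 11040 means CUDA 11.4
print("driver supports CUDA up to: %d.%d" % (cuda // 1000, (cuda % 1000) // 10))
# This needs to be >= 11.3 for the image above to initialize CUDA.

pynvml.nvmlShutdown()
```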

Also, not sure if it’s relevant, but I found this post where the person got things working by reinstalling the drivers.

Anyways, good luck and let us know what you figure out :pray: