nvidia-device-plugin v0.18.0 failing to start #1470

@gabrielbussolo

The nvidia runtime is already set as the default runtime in containerd:

sudo crictl info | jq '.config.containerd.defaultRuntimeName'
"nvidia"

When applying the v0.18.0 manifest with k apply:

k apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.18.0/deployments/static/nvidia-device-plugin.yml

the plugin pod fails with: error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy

k logs nvidia-device-plugin-daemonset-fzltg -n kube-system
I1022 18:32:39.566652       1 main.go:239] "Starting NVIDIA Device Plugin" version=<
        3c9ffca9
        commit: 3c9ffca9491f0d2d362a7064138dfcd71bb57592
 >
I1022 18:32:39.566674       1 main.go:242] Starting FS watcher for /var/lib/kubelet/device-plugins
I1022 18:32:39.566692       1 main.go:249] Starting OS watcher.
I1022 18:32:39.566840       1 main.go:264] Starting Plugins.
I1022 18:32:39.566851       1 main.go:321] Loading configuration.
I1022 18:32:39.567169       1 main.go:346] Updating config with default resource matching patterns.
I1022 18:32:39.567245       1 main.go:357] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdrcopyEnabled": false,
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1022 18:32:39.567250       1 main.go:360] Retrieving plugins.
E1022 18:32:39.567304       1 factory.go:113] Incompatible strategy detected auto
E1022 18:32:39.567309       1 factory.go:114] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1022 18:32:39.567311       1 factory.go:115] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1022 18:32:39.567312       1 factory.go:116] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1022 18:32:39.567314       1 factory.go:117] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E1022 18:32:39.567388       1 main.go:177] error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy

v0.17.4 works fine on the same node:

$ k apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.4/deployments/static/nvidia-device-plugin.yml
daemonset.apps/nvidia-device-plugin-daemonset created
$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-ss9lm      0/1     ContainerCreating   0          8s
$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-ss9lm      1/1     Running     0          12s

Log from v0.17.4:

$ k logs nvidia-device-plugin-daemonset-ss9lm -n kube-system
I1022 18:38:35.316289       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
        fd56a747
        commit: fd56a747defe15333adce40fcd3a06ffb129251b
 >
I1022 18:38:35.316319       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1022 18:38:35.316336       1 main.go:245] Starting OS watcher.
I1022 18:38:35.316431       1 main.go:260] Starting Plugins.
I1022 18:38:35.316442       1 main.go:317] Loading configuration.
I1022 18:38:35.316944       1 main.go:342] Updating config with default resource matching patterns.
I1022 18:38:35.317034       1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1022 18:38:35.317039       1 main.go:356] Retrieving plugins.
I1022 18:38:35.331421       1 server.go:195] Starting GRPC server for 'nvidia.com/gpu'
I1022 18:38:35.331842       1 server.go:139] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1022 18:38:35.332662       1 server.go:146] Registered device plugin for 'nvidia.com/gpu' with Kubelet
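
Comparing the two "Running with config" dumps above, the plugin, resources, sharing and imex sections are identical; only the top-level flags differ. The sketch below diffs those flags, with the values copied by hand from the two logs. The variable names old_flags and new_flags and the helper flag_diff are made up for illustration and are not part of the device plugin. Whether either difference is related to the failure is not established here.

```shell
# Top-level "flags" values copied from the v0.17.4 config dump above.
old_flags='migStrategy=none
failOnInitError=false
mpsRoot=
nvidiaDriverRoot=/
nvidiaDevRoot=/
gdsEnabled=false
mofedEnabled=false
useNodeFeatureAPI=null
deviceDiscoveryStrategy=auto'

# Top-level "flags" values copied from the v0.18.0 config dump above.
new_flags='migStrategy=none
failOnInitError=true
mpsRoot=
nvidiaDriverRoot=/
nvidiaDevRoot=/
gdrcopyEnabled=false
gdsEnabled=false
mofedEnabled=false
useNodeFeatureAPI=null
deviceDiscoveryStrategy=auto'

# Hypothetical helper: print a plain diff of the two flag lists.
flag_diff() {
  tmp=$(mktemp -d)
  printf '%s\n' "$old_flags" > "$tmp/old"
  printf '%s\n' "$new_flags" > "$tmp/new"
  # diff exits with status 1 when the inputs differ, so mask the status.
  diff "$tmp/old" "$tmp/new" || true
  rm -rf "$tmp"
}

flag_diff
```

Lines prefixed with < come from the v0.17.4 dump and lines prefixed with > from the v0.18.0 dump. The two differences are failOnInitError (false in v0.17.4, true in v0.18.0) and gdrcopyEnabled, which appears only in the v0.18.0 dump. deviceDiscoveryStrategy is "auto" in both.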
