Installation

Prerequisites

  • NVIDIA Driver v565+
  • Kubernetes v1.32+
  • ACP v4.1+
  • Cluster administrator access to your ACP cluster
  • CDI must be enabled in the underlying container runtime, such as containerd (see Enable CDI)
  • DRA and the corresponding API groups must be enabled (see Enable DRA)
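
To double-check the last two prerequisites before installing the plugin, you can confirm that the DRA API group is served by the cluster and that CDI is switched on in containerd. A minimal sketch, assuming kubectl access and containerd v1.7+ on the GPU nodes (where CDI support is governed by the enable_cdi option):

    # The resource.k8s.io API group and its resource types should be listed.
    kubectl api-resources --api-group=resource.k8s.io

    # On a GPU node: CDI should be enabled in the containerd configuration.
    containerd config dump | grep enable_cdi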

Procedure

Installing the NVIDIA driver on your GPU node

Refer to the Installation guide on the official NVIDIA website.
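
After the driver is installed, a quick sanity check on the GPU node is to run nvidia-smi, which should list every GPU together with the installed driver version (assuming the driver utilities are on the node's PATH):

    # On the GPU node: list detected GPUs and the driver version.
    nvidia-smi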

Installing the NVIDIA Container Toolkit

Refer to the Installation guide of the NVIDIA Container Toolkit.
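
After installing the toolkit, you can confirm that its CLI is present and that containerd is configured with the nvidia runtime, which the Pod spec later in this guide selects via runtimeClassName: nvidia. A minimal sketch, assuming a containerd-based node (the official guide covers these steps in detail):

    # Confirm the NVIDIA Container Toolkit CLI is installed.
    nvidia-ctk --version

    # Configure containerd with the nvidia runtime (if not already done) and restart it.
    sudo nvidia-ctk runtime configure --runtime=containerd
    sudo systemctl restart containerd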

Downloading the Cluster plugin

INFO

The Alauda Build of NVIDIA DRA Driver for GPUs cluster plugin can be retrieved from the Customer Portal.

Please contact Customer Support for more information.

Uploading the Cluster plugin

For more information on uploading the cluster plugin, please refer to Uploading Cluster Plugins.

Installing Alauda Build of NVIDIA DRA Driver for GPUs

  1. Add the label "nvidia-device-enable=pgpu-dra" to your GPU node so that the nvidia-dra-driver-gpu-kubelet-plugin can be scheduled onto it (a quick check of the label is shown after this procedure).

     kubectl label nodes {nodeid} nvidia-device-enable=pgpu-dra
    INFO

    Note: On the same node, you can only set one of the following labels: gpu=on, nvidia-device-enable=pgpu, or nvidia-device-enable=pgpu-dra.

  2. Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of NVIDIA DRA Driver for GPUs Cluster plugin.
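
To confirm the label from step 1 has been applied, you can list the nodes that carry it:

    # GPU nodes labeled for the DRA kubelet plugin should appear here.
    kubectl get nodes -l nvidia-device-enable=pgpu-dra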

Verifying the DRA setup

  1. Check the DRA driver controller and kubelet plugin pods:

    kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"

    You should get results similar to:

     nvidia-dra-driver-gpu-controller-675644bfb5-c2hq4   1/1     Running   0              18h
     nvidia-dra-driver-gpu-kubelet-plugin-65fjt          2/2     Running   0              18h
  2. Verify ResourceSlice objects:

    kubectl get resourceslices -o yaml

    For GPU nodes, you should see output similar to:

    apiVersion: resource.k8s.io/v1beta1
    kind: ResourceSlice
    metadata:
      generateName: 192.168.140.59-gpu.nvidia.com-
      name: 192.168.140.59-gpu.nvidia.com-gbl46
      ownerReferences:
      - apiVersion: v1
        controller: true
        kind: Node
        name: 192.168.140.59
        uid: 4ab2c24c-fc35-4c75-bcaf-db038356575c
    spec:
      devices:
      - basic:
          attributes:
            architecture:
              string: Pascal
            brand:
              string: Tesla
            cudaComputeCapability:
              version: 6.0.0
            cudaDriverVersion:
              version: 12.8.0
            driverVersion:
              version: 570.124.6
            pcieBusID:
              string: 0000:00:0b.0
            productName:
              string: Tesla P100-PCIE-16GB
            resource.kubernetes.io/pcieRoot:
              string: pci0000:00
            type:
              string: gpu
            uuid:
              string: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66
          capacity:
            memory:
              value: 16Gi
        name: gpu-0
      driver: gpu.nvidia.com
      nodeName: 192.168.140.59
      pool:
        generation: 1
        name: 192.168.140.59
        resourceSliceCount: 1
  3. Deploy workloads with DRA.

    INFO

    Note: Fill in the selector field of the following ResourceClaimTemplate resource according to your specific GPU model. You can use Common Expression Language (CEL) to select devices based on specific attributes; an alternative selector sketch follows at the end of this procedure.

    Create spec file:

    cat <<EOF > dra-gpu-test.yaml
    ---
    apiVersion: resource.k8s.io/v1beta1
    kind: ResourceClaimTemplate
    metadata:
      name: gpu-template
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            deviceClassName: gpu.nvidia.com
            selectors:
            - cel:
                expression: "device.attributes['gpu.nvidia.com'].productName == 'Tesla P100-PCIE-16GB'"
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: dra-gpu-workload
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      runtimeClassName: nvidia
      restartPolicy: OnFailure
      resourceClaims:
      - name: gpu-claim
        resourceClaimTemplateName: gpu-template
      containers:
      - name: cuda-container
        image: "ubuntu:22.04"
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: gpu-claim

    Apply spec:

    kubectl apply -f dra-gpu-test.yaml

    Obtain the output of the container in the pod:

    kubectl logs pod/dra-gpu-workload -f

    The output is expected to show the GPU model and UUID from inside the container. Example:

    GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66)
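
    As mentioned in the note above, the CEL selector is not limited to productName: any attribute published in the ResourceSlice can be used. A minimal sketch of an alternative request for the ResourceClaimTemplate, matching on the architecture and brand attributes from the ResourceSlice example above (adjust the values to your hardware):

    # Alternative: select by architecture and brand instead of the exact product name.
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].architecture == 'Pascal' && device.attributes['gpu.nvidia.com'].brand == 'Tesla'"

    When you are done testing, the workload and claim template can be removed with kubectl delete -f dra-gpu-test.yaml.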