OpenShift RDMA
- Chapter 5. NVIDIA GPUDirect Remote Direct Memory Access (RDMA) | Hardware accelerators | OpenShift Container Platform | 4.19 | Red Hat Documentation
- Single Root IO Virtualization (SR-IOV) - NVIDIA Docs
- Overview of Single Root I/O Virtualization (SR-IOV) - Windows drivers | Microsoft Learn
- RDMA over Converged Ethernet (RoCE) - NVIDIA Docs
- Chapter 3. Node Feature Discovery Operator | Specialized hardware and driver enablement | OpenShift Container Platform | 4.18 | Red Hat Documentation
- Getting Started with Red Hat OpenShift - NVIDIA Docs
- SCHMAUSTECH: RDMA with NVIDIA on OpenShift
- Running DeepSeek distributed inference with InfiniBand RDMA on a self-managed Kubernetes cluster - Zhihu
# 5.2. Disabling the IRDMA kernel module
# The in-tree irdma module can conflict with the out-of-tree DOCA/MLNX_OFED driver that the NVIDIA network operator loads later, so blacklist it on the worker nodes with a MachineConfig kernel argument (this triggers a rolling reboot of the workers).
cat <<EOF > 99-machine-config-blacklist-irdma.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-blacklist-irdma
spec:
  kernelArguments:
    - "module_blacklist=irdma"
EOF
oc create -f 99-machine-config-blacklist-irdma.yaml
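# One way to watch the rollout until UPDATED returns to True (convenience command, not from the Red Hat guide):
oc get mcp worker -w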
[root@admin ocp]# oc get events --all-namespaces --field-selector involvedObject.name=node86 -o custom-columns=TIME:.lastTimestamp,TYPE:.type,REASON:.reason,MESSAGE:.message
TIME TYPE REASON MESSAGE
2025-10-21T02:40:19Z Normal RegisteredNode Node node86 event: Registered Node node86 in Controller
2025-10-21T02:41:20Z Normal RegisteredNode Node node86 event: Registered Node node86 in Controller
2025-10-21T02:48:22Z Normal RegisteredNode Node node86 event: Registered Node node86 in Controller
2025-10-21T05:43:54Z Normal NodeNotSchedulable Node node86 status is now: NodeNotSchedulable
2025-10-21T05:46:48Z Normal OSUpdateStaged Changes to OS staged
2025-10-21T05:49:03Z Normal NodeNotReady Node node86 status is now: NodeNotReady
2025-10-21T05:49:05Z Normal Starting Starting kubelet.
2025-10-21T05:49:05Z Normal NodeAllocatableEnforced Updated Node Allocatable limit across pods
2025-10-21T05:49:05Z Normal NodeHasSufficientMemory Node node86 status is now: NodeHasSufficientMemory
2025-10-21T05:49:05Z Normal NodeHasNoDiskPressure Node node86 status is now: NodeHasNoDiskPressure
2025-10-21T05:49:05Z Normal NodeHasSufficientPID Node node86 status is now: NodeHasSufficientPID
2025-10-21T05:49:05Z Warning Rebooted Node node86 has been rebooted, boot id: 969b75c0-876e-4245-81d3-dd0b6219c388
2025-10-21T05:49:05Z Normal NodeNotReady Node node86 status is now: NodeNotReady
2025-10-21T05:49:05Z Normal NodeNotSchedulable Node node86 status is now: NodeNotSchedulable
2025-10-21T05:49:16Z Normal NodeReady Node node86 status is now: NodeReady
2025-10-21T05:49:26Z Normal NodeSchedulable Node node86 status is now: NodeSchedulable
2025-10-21T02:36:29Z Normal Discovered Discovered host with no BMC details
2025-10-21T02:39:05Z Normal Uncordon Update completed for config rendered-worker-802db99064302fc7c3c821ddb310a829 and node has been uncordoned
2025-10-21T02:39:05Z Normal NodeDone Setting node node86, currentConfig rendered-worker-802db99064302fc7c3c821ddb310a829 to Done
2025-10-21T02:39:05Z Normal ConfigDriftMonitorStarted Config Drift Monitor started, watching against rendered-worker-802db99064302fc7c3c821ddb310a829
2025-10-21T05:43:48Z Normal ConfigDriftMonitorStopped Config Drift Monitor stopped
2025-10-21T05:43:48Z Normal AddSigtermProtection Adding SIGTERM protection
2025-10-21T05:43:48Z Normal Cordon Cordoned node to apply update
2025-10-21T05:43:48Z Normal Drain Draining node to update config.
2025-10-21T05:46:45Z Normal OSUpdateStarted Changing kernel arguments
2025-10-21T05:46:45Z Normal OSUpgradeSkipped OS upgrade skipped; new MachineConfig (rendered-worker-1882e0995f812495d9774cd7c73b3cbd) has same OS image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f6e0c5c7d4177c6631277deea60df55f67b15ba40f7d06f3d4e2eeb88fd4530) as old MachineConfig (rendered-worker-802db99064302fc7c3c821ddb310a829)
2025-10-21T05:46:48Z Normal RemoveSigtermProtection Removing SIGTERM protection
2025-10-21T05:46:48Z Normal Reboot Node will reboot into config rendered-worker-1882e0995f812495d9774cd7c73b3cbd
2025-10-21T05:49:22Z Normal Uncordon Update completed for config rendered-worker-1882e0995f812495d9774cd7c73b3cbd and node has been uncordoned
2025-10-21T05:49:22Z Normal NodeDone Setting node node86, currentConfig rendered-worker-1882e0995f812495d9774cd7c73b3cbd to Done
2025-10-21T05:49:22Z Normal ConfigDriftMonitorStarted Config Drift Monitor started, watching against rendered-worker-1882e0995f812495d9774cd7c73b3cbd
[root@admin ocp]#
[root@admin ocp]# oc get events --all-namespaces --field-selector involvedObject.name=node95 -o custom-columns=TIME:.lastTimestamp,TYPE:.type,REASON:.reason,MESSAGE:.message
TIME TYPE REASON MESSAGE
2025-10-21T05:49:42Z Normal NodeNotSchedulable Node node95 status is now: NodeNotSchedulable
2025-10-21T05:52:59Z Normal OSUpdateStaged Changes to OS staged
2025-10-21T05:55:10Z Normal Starting Starting kubelet.
2025-10-21T05:55:10Z Normal NodeAllocatableEnforced Updated Node Allocatable limit across pods
2025-10-21T05:55:10Z Normal NodeHasSufficientMemory Node node95 status is now: NodeHasSufficientMemory
2025-10-21T05:55:10Z Normal NodeHasNoDiskPressure Node node95 status is now: NodeHasNoDiskPressure
2025-10-21T05:55:10Z Normal NodeHasSufficientPID Node node95 status is now: NodeHasSufficientPID
2025-10-21T05:55:10Z Warning Rebooted Node node95 has been rebooted, boot id: 92f92bf4-2221-4a83-b92f-21cc453f35e3
2025-10-21T05:55:10Z Normal NodeNotReady Node node95 status is now: NodeNotReady
2025-10-21T05:55:10Z Normal NodeNotSchedulable Node node95 status is now: NodeNotSchedulable
2025-10-21T05:55:11Z Normal NodeNotReady Node node95 status is now: NodeNotReady
2025-10-21T05:55:20Z Normal NodeReady Node node95 status is now: NodeReady
2025-10-21T05:55:30Z Normal NodeSchedulable Node node95 status is now: NodeSchedulable
2025-10-21T05:49:29Z Normal ConfigDriftMonitorStopped Config Drift Monitor stopped
2025-10-21T05:49:29Z Normal AddSigtermProtection Adding SIGTERM protection
2025-10-21T05:49:29Z Normal Cordon Cordoned node to apply update
2025-10-21T05:49:29Z Normal Drain Draining node to update config.
2025-10-21T05:52:56Z Normal OSUpdateStarted Changing kernel arguments
2025-10-21T05:52:56Z Normal OSUpgradeSkipped OS upgrade skipped; new MachineConfig (rendered-worker-1882e0995f812495d9774cd7c73b3cbd) has same OS image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f6e0c5c7d4177c6631277deea60df55f67b15ba40f7d06f3d4e2eeb88fd4530) as old MachineConfig (rendered-worker-802db99064302fc7c3c821ddb310a829)
2025-10-21T05:52:59Z Normal RemoveSigtermProtection Removing SIGTERM protection
2025-10-21T05:52:59Z Normal Reboot Node will reboot into config rendered-worker-1882e0995f812495d9774cd7c73b3cbd
2025-10-21T05:55:27Z Normal Uncordon Update completed for config rendered-worker-1882e0995f812495d9774cd7c73b3cbd and node has been uncordoned
2025-10-21T05:55:27Z Normal NodeDone Setting node node95, currentConfig rendered-worker-1882e0995f812495d9774cd7c73b3cbd to Done
2025-10-21T05:55:27Z Normal ConfigDriftMonitorStarted Config Drift Monitor started, watching against rendered-worker-1882e0995f812495d9774cd7c73b3cbd
[root@admin ocp]#
[root@admin ocp]# oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-2b46ddf5c9b88261be14aa9a6670060a True False False 3 3 3 0 3h52m
worker rendered-worker-1882e0995f812495d9774cd7c73b3cbd True False False 2 2 2 0 3h52m
[root@admin ocp]#
# Verify on each node that the irdma module is no longer loaded:
oc debug node/node86 -- chroot /host bash -c "lsmod | grep irdma && echo 'Module found' || echo 'Module not found'"
oc debug node/node95 -- chroot /host bash -c "lsmod | grep irdma && echo 'Module found' || echo 'Module not found'"
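# As an extra check (my own addition), confirm the kernel argument actually made it onto the boot command line:
oc debug node/node86 -- chroot /host bash -c "grep -o module_blacklist=irdma /proc/cmdline"
oc debug node/node95 -- chroot /host bash -c "grep -o module_blacklist=irdma /proc/cmdline"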
# Install the NFD Operator
cat <<EOF > nfd-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
  labels:
    name: openshift-nfd
    openshift.io/cluster-monitoring: "true"
EOF
oc create -f nfd-namespace.yaml
cat <<EOF > nfd-operatorgroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-nfd
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
    - openshift-nfd
EOF
oc create -f nfd-operatorgroup.yaml
cat <<EOF > nfd-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
oc create -f nfd-sub.yaml
oc get pods -n openshift-nfd
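# Optionally confirm the operator CSV reached Succeeded before creating the operand (extra check, not in the original flow):
oc get csv -n openshift-nfd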
# With the NFD controller running, generate the NodeFeatureDiscovery instance and add it to the cluster
NFD_OPERAND_IMAGE=$(oc get csv -n openshift-nfd -o json \
  | jq -r '.items[0].metadata.annotations["alm-examples"]' \
  | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image')
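# Sanity-check the extracted operand image before templating it into the CR (my own addition):
echo "${NFD_OPERAND_IMAGE}"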
cat <<EOF > nfd-instance.yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  instance: ''
  operand:
    image: '${NFD_OPERAND_IMAGE}'
    servicePort: 12000
  prunerOnDelete: false
  topologyUpdater: false
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
            - "12"
          deviceLabelFields:
            - "vendor"
EOF
oc create -f nfd-instance.yaml
oc get pods -n openshift-nfd
# Wait a short period of time and then verify that NFD has added labels to the node.
oc describe node | grep -E 'Roles|pci' | grep pci-15b3
# feature.node.kubernetes.io/pci-15b3.present=true
# feature.node.kubernetes.io/pci-15b3.sriov.capable=true
# feature.node.kubernetes.io/pci-15b3.present=true
# feature.node.kubernetes.io/pci-15b3.sriov.capable=true
oc describe node node95 | grep -E 'Roles|pci' | grep pci-15b3
oc describe node node86 | grep -E 'Roles|pci' | grep pci-15b3
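# To dump every NFD label on a node in one shot (optional jq one-liner; jq was already used above):
oc get node node86 -o json | jq '.metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'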
# Install the SR-IOV Operator
cat <<EOF > sriov-network-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
  labels:
    name: openshift-sriov-network-operator
    openshift.io/cluster-monitoring: "true"
EOF
oc create -f sriov-network-namespace.yaml
cat <<EOF > sriov-network-operatorgroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-sriov-network
  name: openshift-sriov-network
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
    - openshift-sriov-network-operator
EOF
oc create -f sriov-network-operatorgroup.yaml
cat <<EOF > sriov-network-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator
  namespace: openshift-sriov-network-operator
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
oc create -f sriov-network-sub.yaml
# Validate that the Operator is installed and running
oc get pods -n openshift-sriov-network-operator
# For the default SriovOperatorConfig CR to work with the MLNX_OFED container, create the following resource:
cat <<EOF > sriov-operator-config.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
EOF
oc create -f sriov-operator-config.yaml
# Patch the sriov-operator so the MOFED container can work with it
oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
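# Verify the patch landed by reading back the node selector (verification step, my own addition):
oc get sriovoperatorconfig default -n openshift-sriov-network-operator -o jsonpath='{.spec.configDaemonNodeSelector}{"\n"}'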
# Install the NVIDIA Network Operator
cat <<EOF > nvidia-network-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-network-operator
  labels:
    name: nvidia-network-operator
    openshift.io/cluster-monitoring: "true"
EOF
oc create -f nvidia-network-namespace.yaml
cat <<EOF > nvidia-network-operatorgroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: nvidia-network-operator-group
  name: nvidia-network-operator-group
  namespace: nvidia-network-operator
spec:
  targetNamespaces:
    - nvidia-network-operator
EOF
oc create -f nvidia-network-operatorgroup.yaml
cat <<EOF > nvidia-network-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nvidia-network-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
oc create -f nvidia-network-sub.yaml
# Validate that the Operator is installed and running
oc get pods -n nvidia-network-operator
# Before creating the NicClusterPolicy custom resource, identify the Mellanox NICs and their interface names:
lspci | grep -i mellanox
ip addr show | grep ib
ip addr show | grep -E '(eno|ens)'
# With the Operator running, create the NicClusterPolicy custom resource file. The devices (ifNames) you choose depend on your system configuration.
# https://docs.nvidia.com/networking/display/kubernetes2501/life-cycle-management.html#automatic-doca-driver-upgrade
# Other doca-driver versions are listed at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/mellanox/containers/doca-driver/tags
# image: doca-driver
# repository: nvcr.io/nvidia/mellanox
# version: 25.04-0.6.1.0-2
#
# 24.10-0.7.0.0-0 will trigger ImagePullBackOff for nvcr.io/nvidia/mellanox:24.10-0.7.0.0-0
# 25.01-0.6.0.0-0 will cause rdma/rdma_shared_device_eth: 0 (see the FAQ at the end)
# image: k8s-rdma-shared-dev-plugin version: v1.5.2
cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibp129s0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["eno1np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.2
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    env:
      - name: UNLOAD_STORAGE_MODULES
        value: "true"
      - name: RESTORE_DRIVER_ON_POD_TERMINATION
        value: "true"
      - name: CREATE_IFNAMES_UDEV
        value: "true"
EOF
oc create -f network-sharedrdma-nic-cluster-policy.yaml
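# The policy takes several minutes to converge while the DOCA driver builds and loads. Its status should eventually report ready (status field assumed from the NicClusterPolicy CRD):
oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}{"\n"}'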
oc get pods -n nvidia-network-operator
# NAME READY STATUS RESTARTS AGE
# doca-telemetry-service-czk2s 1/1 Running 0 5m40s
# doca-telemetry-service-hh2zh 1/1 Running 0 5m40s
# kube-ipoib-cni-ds-2x4wg 1/1 Running 0 14m
# kube-ipoib-cni-ds-cs5fp 1/1 Running 0 14m
# mofed-rhcos4.17-86bc7c5555-ds-k95ck 2/2 Running 0 14m
# mofed-rhcos4.17-86bc7c5555-ds-kcxnw 2/2 Running 0 14m
# nic-feature-discovery-ds-9qdtq 1/1 Running 0 14m
# nic-feature-discovery-ds-pdcpm 1/1 Running 0 14m
# nv-ipam-controller-67556c846b-9l5db 1/1 Running 0 14m
# nv-ipam-controller-67556c846b-vzjbp 1/1 Running 0 14m
# nv-ipam-node-s2zg8 1/1 Running 0 14m
# nv-ipam-node-tlshw 1/1 Running 0 14m
# nvidia-network-operator-controller-manager-6f87b5b879-wk5ht 1/1 Running 0 125m
# rdma-shared-dp-ds-gr622 1/1 Running 0 5m40s
# rdma-shared-dp-ds-xjjc9
oc get pods -n nvidia-network-operator -o name | grep mofed
oc rsh -n nvidia-network-operator -c mofed-container mofed-rhcos4.17-86bc7c5555-ds-k95ck
# sh-5.1# ofed_info -s
# OFED-internal-25.01-0.6.0:
# sh-5.1# ibdev2netdev -v
# 0000:81:00.0 mlx5_0 (MT4123 - 1.01 ) Supermicro Network Adapter fw 20.28.1002 port 1 (INIT ) ==> ibp129s0 (Down)
# 0000:02:00.0 mlx5_1 (MT4117 - 1.01 ) Supermicro Network Adapter fw 14.27.1016 port 1 (ACTIVE) ==> eno1np0 (Up)
# 0000:02:00.1 mlx5_2 (MT4117 - 1.01 ) Supermicro Network Adapter fw 14.27.1016 port 1 (ACTIVE) ==> eno2np1 (Up)
# sh-5.1#
# Create a MacvlanNetwork custom resource file for your other interface
cat <<EOF > macvlan-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: eno1np0
  mode: bridge
  mtu: 1500
  ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
EOF
oc create -f macvlan-network.yaml
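# The network operator should render the MacvlanNetwork into a NetworkAttachmentDefinition in the target namespace; checking for it is a quick validation (my own addition):
oc get network-attachment-definitions -n default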
# Install the NVIDIA GPU Operator
cat <<EOF > nvidia-gpu-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
  labels:
    name: nvidia-gpu-operator
    openshift.io/cluster-monitoring: "true"
EOF
oc create -f nvidia-gpu-namespace.yaml
cat <<EOF > nvidia-gpu-operatorgroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: nvidia-gpu-operator-group
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
    - nvidia-gpu-operator
EOF
oc create -f nvidia-gpu-operatorgroup.yaml
cat <<EOF > nvidia-gpu-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
oc create -f nvidia-gpu-sub.yaml
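# As with the other operators, confirm the CSV reaches Succeeded before creating the ClusterPolicy (optional check, my own addition):
oc get csv -n nvidia-gpu-operator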
# Create a GPU cluster policy custom resource
# https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/gpu-operator/values.yaml
# gds.version => from 2.20.5 (ImagePullBackOff) to 2.26.6
# spec.driver.rdma.enabled: true
cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: true
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: nvidia-fs
    version: 2.26.6
    repository: nvcr.io/nvidia/cloud-native
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
EOF
oc create -f gpu-cluster-policy.yaml
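# The ClusterPolicy rollout takes several minutes; its status should end up ready (status field assumed from the GPU Operator CRD):
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'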
oc get pods -n nvidia-gpu-operator -o wide
# NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
# gpu-feature-discovery-27x8t 1/1 Running 0 13s 10.128.3.39 node95 <none> <none>
# gpu-feature-discovery-nk4xg 1/1 Running 0 5m15s 10.128.2.30 node86 <none> <none>
# gpu-operator-7b6f9d8f4f-5njmk 1/1 Running 0 32m 10.128.2.241 node86 <none> <none>
# nvidia-container-toolkit-daemonset-bpgcn 1/1 Running 0 5m15s 10.128.2.32 node86 <none> <none>
# nvidia-container-toolkit-daemonset-jhxxv 1/1 Running 0 5m15s 10.128.3.34 node95 <none> <none>
# nvidia-cuda-validator-5b7qm 0/1 Completed 0 9s 10.128.3.43 node95 <none> <none>
# nvidia-cuda-validator-lgzs5 0/1 Completed 0 2m13s 10.128.2.33 node86 <none> <none>
# nvidia-dcgm-2l425 1/1 Running 0 5m15s 10.128.2.29 node86 <none> <none>
# nvidia-dcgm-exporter-594nm 1/1 Running 3 (2m29s ago) 5m15s 10.128.2.27 node86 <none> <none>
# nvidia-dcgm-exporter-qwn4x 1/1 Running 0 13s 10.128.3.42 node95 <none> <none>
# nvidia-dcgm-sz8tp 1/1 Running 0 13s 10.128.3.41 node95 <none> <none>
# nvidia-device-plugin-daemonset-d6ggh 1/1 Running 0 13s 10.128.3.40 node95 <none> <none>
# nvidia-device-plugin-daemonset-llxhb 1/1 Running 0 5m15s 10.128.2.31 node86 <none> <none>
# nvidia-driver-daemonset-417.94.202412180008-0-5rmmp 4/4 Running 3 (4m44s ago) 5m49s 10.128.3.21 node95 <none> <none>
# nvidia-driver-daemonset-417.94.202412180008-0-qmnr4 4/4 Running 3 (4m44s ago) 5m49s 10.128.2.21 node86 <none> <none>
# nvidia-mig-manager-bkmxc 1/1 Running 0 98s 10.128.3.36 node95 <none> <none>
# nvidia-mig-manager-qvd4p 1/1 Running 0 98s 10.128.2.34 node86 <none> <none>
# nvidia-node-status-exporter-7xr77 1/1 Running 0 5m46s 10.128.3.28 node95 <none> <none>
# nvidia-node-status-exporter-kpjxp 1/1 Running 0 5m46s 10.128.2.26 node86 <none> <none>
# nvidia-operator-validator-g4v97 1/1 Running 0 13s 10.128.3.38 node95 <none> <none>
# nvidia-operator-validator-sgf99 1/1 Running 0 5m15s 10.128.2.28 node86 <none> <none>
# Remote shell into the NVIDIA driver daemonset pods and confirm that the NVIDIA modules are loaded. Specifically, ensure nvidia_peermem is loaded.
[root@admin ~]# oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver
pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984
pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984 -- lsmod | grep nvidia
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984 -- lsmod | grep nvidia_peermem
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984 -- nvidia-smi
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh -- lsmod | grep nvidia
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh -- lsmod | grep nvidia_peermem
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh -- nvidia-smi
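# The same checks can be looped over every driver pod instead of naming them one by one (convenience sketch, my own addition):
for p in $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver); do
  oc exec -n nvidia-gpu-operator $p -- lsmod | grep -E 'nvidia_peermem|nvidia_fs'
done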
[root@admin ~]# oc rsh -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984
sh-5.1# lsmod|grep nvidia
nvidia_fs 327680 0
nvidia_peermem 24576 0
nvidia_modeset 1753088 0
nvidia_uvm 4087808 12
nvidia 14381056 32 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
video 73728 1 nvidia_modeset
ib_uverbs 217088 17 nvidia_peermem,rdma_ucm,mlx5_ib
drm 741376 5 drm_kms_helper,ast,drm_shmem_helper,nvidia
sh-5.1#
sh-5.1# nvidia-smi
Mon Oct 27 07:59:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:01:00.0 Off | 0 |
| N/A 36C P0 45W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
sh-5.1# exit
[root@admin ~]# oc rsh -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh
sh-5.1# lsmod|grep nvidia
nvidia_fs 327680 0
nvidia_peermem 24576 0
nvidia_modeset 1753088 0
nvidia_uvm 4087808 12
nvidia 14381056 32 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
video 73728 1 nvidia_modeset
ib_uverbs 217088 17 nvidia_peermem,rdma_ucm,mlx5_ib
drm 741376 5 drm_kms_helper,ast,drm_shmem_helper,nvidia
sh-5.1# nvidia-smi
Mon Oct 27 08:02:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:01:00.0 Off | 0 |
| N/A 34C P0 38W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
sh-5.1#
# FAQ
# Symptom: the node advertises rdma/rdma_shared_device_eth: 0 instead of rdma/rdma_shared_device_eth: 63
- "missing RDMA device spec for device 0000:e5:00.1, RDMA device \"issm\" not found" · Issue #94 · Mellanox/k8s-rdma-shared-dev-plugin
- This plugin not working when used IB NIC the LINK_TYPE_P1=ETH! · Issue #98 · Mellanox/k8s-rdma-shared-dev-plugin
# Log from the rdma-shared-dp daemonset pod (oc logs output):
Defaulted container "rdma-shared-dp" out of: rdma-shared-dp, ofed-driver-validation (init)
Using Kubelet Plugin Registry Mode
2025/10/22 08:50:51 Starting K8s RDMA Shared Device Plugin version= master
2025/10/22 08:50:51 resource manager reading configs
2025/10/22 08:50:51 Reading /k8s-rdma-shared-dev-plugin/config.json
2025/10/22 08:50:51 loaded config: [{ResourceName:rdma_shared_device_ib ResourcePrefix: RdmaHcaMax:63 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ibp129s0] LinkTypes:[]}} {ResourceName:rdma_shared_device_eth ResourcePrefix: RdmaHcaMax:63 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[eno1np0] LinkTypes:[]}}]
2025/10/22 08:50:51 no periodic update interval is set, use default interval 60 seconds
2025/10/22 08:50:51 Discovering host devices
2025/10/22 08:50:51 discovering host network devices
2025/10/22 08:50:51 DiscoverHostDevices(): device found: 0000:02:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2025/10/22 08:50:51 DiscoverHostDevices(): device found: 0000:02:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2025/10/22 08:50:51 DiscoverHostDevices(): device found: 0000:81:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2025/10/22 08:50:51 Initializing resource servers
2025/10/22 08:50:51 Resource: &{ResourceName:rdma_shared_device_ib ResourcePrefix:rdma RdmaHcaMax:63 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ibp129s0] LinkTypes:[]}}
2025/10/22 08:50:51 error creating new device: "missing RDMA device spec for device 0000:02:00.0, RDMA device \"issm\" not found"
2025/10/22 08:50:51 error creating new device: "missing RDMA device spec for device 0000:02:00.1, RDMA device \"issm\" not found"
2025/10/22 08:50:51 Resource: &{ResourceName:rdma_shared_device_eth ResourcePrefix:rdma RdmaHcaMax:63 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[eno1np0] LinkTypes:[]}}
2025/10/22 08:50:51 error creating new device: "missing RDMA device spec for device 0000:02:00.0, RDMA device \"issm\" not found"
2025/10/22 08:50:51 error creating new device: "missing RDMA device spec for device 0000:02:00.1, RDMA device \"issm\" not found"
2025/10/22 08:50:51 Warning: no devices in device pool, creating empty resource server for rdma_shared_device_eth
2025/10/22 08:50:51 Warning: no Rdma Devices were found for resource rdma_shared_device_eth
2025/10/22 08:50:51 Starting all servers...
2025/10/22 08:50:51 starting rdma/rdma_shared_device_ib device plugin endpoint at: rdma_shared_device_ib.sock
2025/10/22 08:50:51 rdma/rdma_shared_device_ib device plugin endpoint started serving
2025/10/22 08:50:51 starting rdma/rdma_shared_device_eth device plugin endpoint at: rdma_shared_device_eth.sock
2025/10/22 08:50:51 rdma/rdma_shared_device_eth device plugin endpoint started serving
2025/10/22 08:50:51 All servers started.
2025/10/22 08:50:51 Listening for term signals
2025/10/22 08:50:51 Starting OS watcher.
2025/10/22 08:50:52 ListAndWatch called by kubelet for: rdma/rdma_shared_device_ib
2025/10/22 08:50:52 Updating "rdma/rdma_shared_device_ib" devices
2025/10/22 08:50:52 rdma_shared_device_ib.sock gets registered successfully at Kubelet
2025/10/22 08:50:52 rdma_shared_device_eth.sock gets registered successfully at Kubelet
2025/10/22 08:50:52 ListAndWatch called by kubelet for: rdma/rdma_shared_device_eth
2025/10/22 08:50:52 Updating "rdma/rdma_shared_device_eth" devices
2025/10/22 08:50:52 exposing "0" devices
2025/10/22 08:50:52 exposing "63" devices
# Resolution
Change the doca-driver version from 25.04-0.6.1.0-2 to 24.10-0.7.0.0-0, and the k8s-rdma-shared-dev-plugin version from v1.5.1 to v1.5.2.
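# One way to apply that fix in place, rather than recreating the CR (merge-patch sketch; field paths taken from the NicClusterPolicy spec above):
oc patch nicclusterpolicy nic-cluster-policy --type=merge -p '{"spec":{"ofedDriver":{"version":"24.10-0.7.0.0-0"},"rdmaSharedDevicePlugin":{"version":"v1.5.2"}}}'
# The DOCA driver pods restart on the new version; afterwards the node should again advertise rdma/rdma_shared_device_eth: 63.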