OpenShift RDMA
- Chapter 5. NVIDIA GPUDirect Remote Direct Memory Access (RDMA) | Hardware accelerators | OpenShift Container Platform | 4.19 | Red Hat Documentation
- Single Root IO Virtualization (SR-IOV) - NVIDIA Docs
- Overview of Single Root I/O Virtualization (SR-IOV) - Windows drivers | Microsoft Learn
- RDMA over Converged Ethernet (RoCE) - NVIDIA Docs
- Chapter 3. Node Feature Discovery Operator | Specialized hardware and driver enablement | OpenShift Container Platform | 4.18 | Red Hat Documentation
- Getting Started with Red Hat OpenShift - NVIDIA Docs
- SCHMAUSTECH: RDMA with NVIDIA on OpenShift
- Running DeepSeek distributed inference with InfiniBand RDMA on a self-managed Kubernetes cluster - Zhihu
# 5.2. Disabling the IRDMA kernel module
# The in-tree irdma module can conflict with the out-of-tree DOCA/MLNX_OFED driver that the NVIDIA network operator loads later, so blacklist it on the worker nodes with a MachineConfig kernel argument (this triggers a rolling reboot of the workers).
cat <<EOF > 99-machine-config-blacklist-irdma.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-blacklist-irdma
spec:
  kernelArguments:
    - "module_blacklist=irdma"
EOF
oc create -f 99-machine-config-blacklist-irdma.yaml
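# One way to watch the rollout until UPDATED returns to True (convenience command, not from the Red Hat guide):
oc get mcp worker -w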
[root@admin ocp]# oc get events --all-namespaces --field-selector involvedObject.name=node86 -o custom-columns=TIME:.lastTimestamp,TYPE:.type,REASON:.reason,MESSAGE:.message
TIME TYPE REASON MESSAGE
2025-10-21T02:40:19Z Normal RegisteredNode Node node86 event: Registered Node node86 in Controller
2025-10-21T02:41:20Z Normal RegisteredNode Node node86 event: Registered Node node86 in Controller
2025-10-21T02:48:22Z Normal RegisteredNode Node node86 event: Registered Node node86 in Controller
2025-10-21T05:43:54Z Normal NodeNotSchedulable Node node86 status is now: NodeNotSchedulable
2025-10-21T05:46:48Z Normal OSUpdateStaged Changes to OS staged
2025-10-21T05:49:03Z Normal NodeNotReady Node node86 status is now: NodeNotReady
2025-10-21T05:49:05Z Normal Starting Starting kubelet.
2025-10-21T05:49:05Z Normal NodeAllocatableEnforced Updated Node Allocatable limit across pods
2025-10-21T05:49:05Z Normal NodeHasSufficientMemory Node node86 status is now: NodeHasSufficientMemory
2025-10-21T05:49:05Z Normal NodeHasNoDiskPressure Node node86 status is now: NodeHasNoDiskPressure
2025-10-21T05:49:05Z Normal NodeHasSufficientPID Node node86 status is now: NodeHasSufficientPID
2025-10-21T05:49:05Z Warning Rebooted Node node86 has been rebooted, boot id: 969b75c0-876e-4245-81d3-dd0b6219c388
2025-10-21T05:49:05Z Normal NodeNotReady Node node86 status is now: NodeNotReady
2025-10-21T05:49:05Z Normal NodeNotSchedulable Node node86 status is now: NodeNotSchedulable
2025-10-21T05:49:16Z Normal NodeReady Node node86 status is now: NodeReady
2025-10-21T05:49:26Z Normal NodeSchedulable Node node86 status is now: NodeSchedulable
2025-10-21T02:36:29Z Normal Discovered Discovered host with no BMC details
2025-10-21T02:39:05Z Normal Uncordon Update completed for config rendered-worker-802db99064302fc7c3c821ddb310a829 and node has been uncordoned
2025-10-21T02:39:05Z Normal NodeDone Setting node node86, currentConfig rendered-worker-802db99064302fc7c3c821ddb310a829 to Done
2025-10-21T02:39:05Z Normal ConfigDriftMonitorStarted Config Drift Monitor started, watching against rendered-worker-802db99064302fc7c3c821ddb310a829
2025-10-21T05:43:48Z Normal ConfigDriftMonitorStopped Config Drift Monitor stopped
2025-10-21T05:43:48Z Normal AddSigtermProtection Adding SIGTERM protection
2025-10-21T05:43:48Z Normal Cordon Cordoned node to apply update
2025-10-21T05:43:48Z Normal Drain Draining node to update config.
2025-10-21T05:46:45Z Normal OSUpdateStarted Changing kernel arguments
2025-10-21T05:46:45Z Normal OSUpgradeSkipped OS upgrade skipped; new MachineConfig (rendered-worker-1882e0995f812495d9774cd7c73b3cbd) has same OS image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f6e0c5c7d4177c6631277deea60df55f67b15ba40f7d06f3d4e2eeb88fd4530) as old MachineConfig (rendered-worker-802db99064302fc7c3c821ddb310a829)
2025-10-21T05:46:48Z Normal RemoveSigtermProtection Removing SIGTERM protection
2025-10-21T05:46:48Z Normal Reboot Node will reboot into config rendered-worker-1882e0995f812495d9774cd7c73b3cbd
2025-10-21T05:49:22Z Normal Uncordon Update completed for config rendered-worker-1882e0995f812495d9774cd7c73b3cbd and node has been uncordoned
2025-10-21T05:49:22Z Normal NodeDone Setting node node86, currentConfig rendered-worker-1882e0995f812495d9774cd7c73b3cbd to Done
2025-10-21T05:49:22Z Normal ConfigDriftMonitorStarted Config Drift Monitor started, watching against rendered-worker-1882e0995f812495d9774cd7c73b3cbd
[root@admin ocp]#
[root@admin ocp]# oc get events --all-namespaces --field-selector involvedObject.name=node95 -o custom-columns=TIME:.lastTimestamp,TYPE:.type,REASON:.reason,MESSAGE:.message
TIME TYPE REASON MESSAGE
2025-10-21T05:49:42Z Normal NodeNotSchedulable Node node95 status is now: NodeNotSchedulable
2025-10-21T05:52:59Z Normal OSUpdateStaged Changes to OS staged
2025-10-21T05:55:10Z Normal Starting Starting kubelet.
2025-10-21T05:55:10Z Normal NodeAllocatableEnforced Updated Node Allocatable limit across pods
2025-10-21T05:55:10Z Normal NodeHasSufficientMemory Node node95 status is now: NodeHasSufficientMemory
2025-10-21T05:55:10Z Normal NodeHasNoDiskPressure Node node95 status is now: NodeHasNoDiskPressure
2025-10-21T05:55:10Z Normal NodeHasSufficientPID Node node95 status is now: NodeHasSufficientPID
2025-10-21T05:55:10Z Warning Rebooted Node node95 has been rebooted, boot id: 92f92bf4-2221-4a83-b92f-21cc453f35e3
2025-10-21T05:55:10Z Normal NodeNotReady Node node95 status is now: NodeNotReady
2025-10-21T05:55:10Z Normal NodeNotSchedulable Node node95 status is now: NodeNotSchedulable
2025-10-21T05:55:11Z Normal NodeNotReady Node node95 status is now: NodeNotReady
2025-10-21T05:55:20Z Normal NodeReady Node node95 status is now: NodeReady
2025-10-21T05:55:30Z Normal NodeSchedulable Node node95 status is now: NodeSchedulable
2025-10-21T05:49:29Z Normal ConfigDriftMonitorStopped Config Drift Monitor stopped
2025-10-21T05:49:29Z Normal AddSigtermProtection Adding SIGTERM protection
2025-10-21T05:49:29Z Normal Cordon Cordoned node to apply update
2025-10-21T05:49:29Z Normal Drain Draining node to update config.
2025-10-21T05:52:56Z Normal OSUpdateStarted Changing kernel arguments
2025-10-21T05:52:56Z Normal OSUpgradeSkipped OS upgrade skipped; new MachineConfig (rendered-worker-1882e0995f812495d9774cd7c73b3cbd) has same OS image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f6e0c5c7d4177c6631277deea60df55f67b15ba40f7d06f3d4e2eeb88fd4530) as old MachineConfig (rendered-worker-802db99064302fc7c3c821ddb310a829)
2025-10-21T05:52:59Z Normal RemoveSigtermProtection Removing SIGTERM protection
2025-10-21T05:52:59Z Normal Reboot Node will reboot into config rendered-worker-1882e0995f812495d9774cd7c73b3cbd
2025-10-21T05:55:27Z Normal Uncordon Update completed for config rendered-worker-1882e0995f812495d9774cd7c73b3cbd and node has been uncordoned
2025-10-21T05:55:27Z Normal NodeDone Setting node node95, currentConfig rendered-worker-1882e0995f812495d9774cd7c73b3cbd to Done
2025-10-21T05:55:27Z Normal ConfigDriftMonitorStarted Config Drift Monitor started, watching against rendered-worker-1882e0995f812495d9774cd7c73b3cbd
[root@admin ocp]#
[root@admin ocp]# oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-2b46ddf5c9b88261be14aa9a6670060a True False False 3 3 3 0 3h52m
worker rendered-worker-1882e0995f812495d9774cd7c73b3cbd True False False 2 2 2 0 3h52m
[root@admin ocp]#
# Verify on each node that the irdma module is no longer loaded:
oc debug node/node86 -- chroot /host bash -c "lsmod | grep irdma && echo 'Module found' || echo 'Module not found'"
oc debug node/node95 -- chroot /host bash -c "lsmod | grep irdma && echo 'Module found' || echo 'Module not found'"
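# As an extra check (my own addition), confirm the kernel argument actually made it onto the boot command line:
oc debug node/node86 -- chroot /host bash -c "grep -o module_blacklist=irdma /proc/cmdline"
oc debug node/node95 -- chroot /host bash -c "grep -o module_blacklist=irdma /proc/cmdline"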
# Install the NFD Operator
cat <<EOF > nfd-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
  labels:
    name: openshift-nfd
    openshift.io/cluster-monitoring: "true"
EOF
oc create -f nfd-namespace.yaml
cat <<EOF > nfd-operatorgroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-nfd
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
    - openshift-nfd
EOF
oc create -f nfd-operatorgroup.yaml
cat <<EOF > nfd-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
oc create -f nfd-sub.yaml
oc get pods -n openshift-nfd
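# Optionally confirm the operator CSV reached Succeeded before creating the operand (extra check, not in the original flow):
oc get csv -n openshift-nfd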
# With the NFD controller running, generate the NodeFeatureDiscovery instance and add it to the cluster
NFD_OPERAND_IMAGE=$(oc get csv -n openshift-nfd -o json \
  | jq -r '.items[0].metadata.annotations["alm-examples"]' \
  | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image')
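# Sanity-check the extracted operand image before templating it into the CR (my own addition):
echo "${NFD_OPERAND_IMAGE}"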
cat <<EOF > nfd-instance.yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  instance: ''
  operand:
    image: '${NFD_OPERAND_IMAGE}'
    servicePort: 12000
  prunerOnDelete: false
  topologyUpdater: false
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
            - "12"
          deviceLabelFields:
            - "vendor"
EOF
oc create -f nfd-instance.yaml
oc get pods -n openshift-nfd
# Wait a short period of time and then verify that NFD has added labels to the node.
oc describe node | grep -E 'Roles|pci' | grep pci-15b3
# feature.node.kubernetes.io/pci-15b3.present=true
# feature.node.kubernetes.io/pci-15b3.sriov.capable=true
# feature.node.kubernetes.io/pci-15b3.present=true
# feature.node.kubernetes.io/pci-15b3.sriov.capable=true
oc describe node node95 | grep -E 'Roles|pci' | grep pci-15b3
oc describe node node86 | grep -E 'Roles|pci' | grep pci-15b3
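# To dump every NFD label on a node in one shot (optional jq one-liner; jq was already used above):
oc get node node86 -o json | jq '.metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'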
# Install the SR-IOV Operator
cat <<EOF > sriov-network-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
  labels:
    name: openshift-sriov-network-operator
    openshift.io/cluster-monitoring: "true"
EOF
oc create -f sriov-network-namespace.yaml
cat <<EOF > sriov-network-operatorgroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-sriov-network
  name: openshift-sriov-network
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
    - openshift-sriov-network-operator
EOF
oc create -f sriov-network-operatorgroup.yaml
cat <<EOF > sriov-network-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator
  namespace: openshift-sriov-network-operator
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
oc create -f sriov-network-sub.yaml
# Validate that the Operator is installed and running
oc get pods -n openshift-sriov-network-operator
# For the default SriovOperatorConfig CR to work with the MLNX_OFED container, create the following resource:
cat <<EOF > sriov-operator-config.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
EOF
oc create -f sriov-operator-config.yaml
# Patch the sriov-operator so the MOFED container can work with it
oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
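# Verify the patch landed by reading back the node selector (verification step, my own addition):
oc get sriovoperatorconfig default -n openshift-sriov-network-operator -o jsonpath='{.spec.configDaemonNodeSelector}{"\n"}'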
# Install the NVIDIA Network Operator
cat <<EOF > nvidia-network-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-network-operator
  labels:
    name: nvidia-network-operator
    openshift.io/cluster-monitoring: "true"
EOF
oc create -f nvidia-network-namespace.yaml
cat <<EOF > nvidia-network-operatorgroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: nvidia-network-operator-group
  name: nvidia-network-operator-group
  namespace: nvidia-network-operator
spec:
  targetNamespaces:
    - nvidia-network-operator
EOF
oc create -f nvidia-network-operatorgroup.yaml
cat <<EOF > nvidia-network-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nvidia-network-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
oc create -f nvidia-network-sub.yaml
# Validate that the Operator is installed and running
oc get pods -n nvidia-network-operator
# Before creating the NicClusterPolicy custom resource, identify the Mellanox NICs and their interface names:
lspci | grep -i mellanox
ip addr show | grep ib
ip addr show | grep -E '(eno|ens)'
# With the Operator running, create the NicClusterPolicy custom resource file. The devices (ifNames) you choose depend on your system configuration.
# https://docs.nvidia.com/networking/display/kubernetes2501/life-cycle-management.html#automatic-doca-driver-upgrade
# Other doca-driver versions are listed at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/mellanox/containers/doca-driver/tags
# image: doca-driver
# repository: nvcr.io/nvidia/mellanox
# version: 25.04-0.6.1.0-2
#
# 24.10-0.7.0.0-0 will trigger ImagePullBackOff for nvcr.io/nvidia/mellanox:24.10-0.7.0.0-0
# 25.01-0.6.0.0-0 will cause rdma/rdma_shared_device_eth: 0 (see the FAQ at the end)
# image: k8s-rdma-shared-dev-plugin version: v1.5.2
cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibp129s0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["eno1np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.2
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    env:
      - name: UNLOAD_STORAGE_MODULES
        value: "true"
      - name: RESTORE_DRIVER_ON_POD_TERMINATION
        value: "true"
      - name: CREATE_IFNAMES_UDEV
        value: "true"
EOF
oc create -f network-sharedrdma-nic-cluster-policy.yaml
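# The policy takes several minutes to converge while the DOCA driver builds and loads. Its status should eventually report ready (status field assumed from the NicClusterPolicy CRD):
oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}{"\n"}'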
oc get pods -n nvidia-network-operator
# NAME READY STATUS RESTARTS AGE
# doca-telemetry-service-czk2s 1/1 Running 0 5m40s
# doca-telemetry-service-hh2zh 1/1 Running 0 5m40s
# kube-ipoib-cni-ds-2x4wg 1/1 Running 0 14m
# kube-ipoib-cni-ds-cs5fp 1/1 Running 0 14m
# mofed-rhcos4.17-86bc7c5555-ds-k95ck 2/2 Running 0 14m
# mofed-rhcos4.17-86bc7c5555-ds-kcxnw 2/2 Running 0 14m
# nic-feature-discovery-ds-9qdtq 1/1 Running 0 14m
# nic-feature-discovery-ds-pdcpm 1/1 Running 0 14m
# nv-ipam-controller-67556c846b-9l5db 1/1 Running 0 14m
# nv-ipam-controller-67556c846b-vzjbp 1/1 Running 0 14m
# nv-ipam-node-s2zg8 1/1 Running 0 14m
# nv-ipam-node-tlshw 1/1 Running 0 14m
# nvidia-network-operator-controller-manager-6f87b5b879-wk5ht 1/1 Running 0 125m
# rdma-shared-dp-ds-gr622 1/1 Running 0 5m40s
# rdma-shared-dp-ds-xjjc9
oc get pods -n nvidia-network-operator -o name | grep mofed
oc rsh -n nvidia-network-operator -c mofed-container mofed-rhcos4.17-86bc7c5555-ds-k95ck
# sh-5.1# ofed_info -s
# OFED-internal-25.01-0.6.0:
# sh-5.1# ibdev2netdev -v
# 0000:81:00.0 mlx5_0 (MT4123 - 1.01 ) Supermicro Network Adapter fw 20.28.1002 port 1 (INIT ) ==> ibp129s0 (Down)
# 0000:02:00.0 mlx5_1 (MT4117 - 1.01 ) Supermicro Network Adapter fw 14.27.1016 port 1 (ACTIVE) ==> eno1np0 (Up)
# 0000:02:00.1 mlx5_2 (MT4117 - 1.01 ) Supermicro Network Adapter fw 14.27.1016 port 1 (ACTIVE) ==> eno2np1 (Up)
# sh-5.1#
# Create a MacvlanNetwork custom resource file for your other interface
cat <<EOF > macvlan-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: eno1np0
  mode: bridge
  mtu: 1500
  ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
EOF
oc create -f macvlan-network.yaml
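# The network operator should render the MacvlanNetwork into a NetworkAttachmentDefinition in the target namespace; checking for it is a quick validation (my own addition):
oc get network-attachment-definitions -n default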
# Install the NVIDIA GPU Operator
cat <<EOF > nvidia-gpu-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
  labels:
    name: nvidia-gpu-operator
    openshift.io/cluster-monitoring: "true"
EOF
oc create -f nvidia-gpu-namespace.yaml
cat <<EOF > nvidia-gpu-operatorgroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: nvidia-gpu-operator-group
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
    - nvidia-gpu-operator
EOF
oc create -f nvidia-gpu-operatorgroup.yaml
cat <<EOF > nvidia-gpu-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
oc create -f nvidia-gpu-sub.yaml
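# As with the other operators, confirm the CSV reaches Succeeded before creating the ClusterPolicy (optional check, my own addition):
oc get csv -n nvidia-gpu-operator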
# Create a GPU cluster policy custom resource
# https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/gpu-operator/values.yaml
# gds.version => from 2.20.5 (ImagePullBackOff) to 2.26.6
# spec.driver.rdma.enabled: true
cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: true
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: nvidia-fs
    version: 2.26.6
    repository: nvcr.io/nvidia/cloud-native
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
EOF
oc create -f gpu-cluster-policy.yaml
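# The ClusterPolicy rollout takes several minutes; its status should end up ready (status field assumed from the GPU Operator CRD):
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'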
oc get pods -n nvidia-gpu-operator -o wide
# NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
# gpu-feature-discovery-27x8t 1/1 Running 0 13s 10.128.3.39 node95 <none> <none>
# gpu-feature-discovery-nk4xg 1/1 Running 0 5m15s 10.128.2.30 node86 <none> <none>
# gpu-operator-7b6f9d8f4f-5njmk 1/1 Running 0 32m 10.128.2.241 node86 <none> <none>
# nvidia-container-toolkit-daemonset-bpgcn 1/1 Running 0 5m15s 10.128.2.32 node86 <none> <none>
# nvidia-container-toolkit-daemonset-jhxxv 1/1 Running 0 5m15s 10.128.3.34 node95 <none> <none>
# nvidia-cuda-validator-5b7qm 0/1 Completed 0 9s 10.128.3.43 node95 <none> <none>
# nvidia-cuda-validator-lgzs5 0/1 Completed 0 2m13s 10.128.2.33 node86 <none> <none>
# nvidia-dcgm-2l425 1/1 Running 0 5m15s 10.128.2.29 node86 <none> <none>
# nvidia-dcgm-exporter-594nm 1/1 Running 3 (2m29s ago) 5m15s 10.128.2.27 node86 <none> <none>
# nvidia-dcgm-exporter-qwn4x 1/1 Running 0 13s 10.128.3.42 node95 <none> <none>
# nvidia-dcgm-sz8tp 1/1 Running 0 13s 10.128.3.41 node95 <none> <none>
# nvidia-device-plugin-daemonset-d6ggh 1/1 Running 0 13s 10.128.3.40 node95 <none> <none>
# nvidia-device-plugin-daemonset-llxhb 1/1 Running 0 5m15s 10.128.2.31 node86 <none> <none>
# nvidia-driver-daemonset-417.94.202412180008-0-5rmmp 4/4 Running 3 (4m44s ago) 5m49s 10.128.3.21 node95 <none> <none>
# nvidia-driver-daemonset-417.94.202412180008-0-qmnr4 4/4 Running 3 (4m44s ago) 5m49s 10.128.2.21 node86 <none> <none>
# nvidia-mig-manager-bkmxc 1/1 Running 0 98s 10.128.3.36 node95 <none> <none>
# nvidia-mig-manager-qvd4p 1/1 Running 0 98s 10.128.2.34 node86 <none> <none>
# nvidia-node-status-exporter-7xr77 1/1 Running 0 5m46s 10.128.3.28 node95 <none> <none>
# nvidia-node-status-exporter-kpjxp 1/1 Running 0 5m46s 10.128.2.26 node86 <none> <none>
# nvidia-operator-validator-g4v97 1/1 Running 0 13s 10.128.3.38 node95 <none> <none>
# nvidia-operator-validator-sgf99 1/1 Running 0 5m15s 10.128.2.28 node86 <none> <none>
# Remote shell into the NVIDIA driver daemonset pods and confirm that the NVIDIA modules are loaded. Specifically, ensure nvidia_peermem is loaded.
[root@admin ~]# oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver
pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984
pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984 -- lsmod | grep nvidia
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984 -- lsmod | grep nvidia_peermem
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984 -- nvidia-smi
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh -- lsmod | grep nvidia
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh -- lsmod | grep nvidia_peermem
oc exec -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh -- nvidia-smi
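# The same checks can be looped over every driver pod instead of naming them one by one (convenience sketch, my own addition):
for p in $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver); do
  oc exec -n nvidia-gpu-operator $p -- lsmod | grep -E 'nvidia_peermem|nvidia_fs'
done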
[root@admin ~]# oc rsh -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-bb984
sh-5.1# lsmod|grep nvidia
nvidia_fs 327680 0
nvidia_peermem 24576 0
nvidia_modeset 1753088 0
nvidia_uvm 4087808 12
nvidia 14381056 32 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
video 73728 1 nvidia_modeset
ib_uverbs 217088 17 nvidia_peermem,rdma_ucm,mlx5_ib
drm 741376 5 drm_kms_helper,ast,drm_shmem_helper,nvidia
sh-5.1#
sh-5.1# nvidia-smi
Mon Oct 27 07:59:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:01:00.0 Off | 0 |
| N/A 36C P0 45W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
sh-5.1# exit
[root@admin ~]# oc rsh -n nvidia-gpu-operator pod/nvidia-driver-daemonset-417.94.202412180008-0-gjshh
sh-5.1# lsmod|grep nvidia
nvidia_fs 327680 0
nvidia_peermem 24576 0
nvidia_modeset 1753088 0
nvidia_uvm 4087808 12
nvidia 14381056 32 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
video 73728 1 nvidia_modeset
ib_uverbs 217088 17 nvidia_peermem,rdma_ucm,mlx5_ib
drm 741376 5 drm_kms_helper,ast,drm_shmem_helper,nvidia
sh-5.1# nvidia-smi
Mon Oct 27 08:02:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:01:00.0 Off | 0 |
| N/A 34C P0 38W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
sh-5.1#
# FAQ
# Symptom: the node advertises rdma/rdma_shared_device_eth: 0 instead of rdma/rdma_shared_device_eth: 63
- "missing RDMA device spec for device 0000:e5:00.1, RDMA device \"issm\" not found" · Issue #94 · Mellanox/k8s-rdma-shared-dev-plugin
- This plugin not working when used IB NIC the LINK_TYPE_P1=ETH! · Issue #98 · Mellanox/k8s-rdma-shared-dev-plugin
# Log from the rdma-shared-dp daemonset pod (oc logs output):
Defaulted container "rdma-shared-dp" out of: rdma-shared-dp, ofed-driver-validation (init)
Using Kubelet Plugin Registry Mode
2025/10/22 08:50:51 Starting K8s RDMA Shared Device Plugin version= master
2025/10/22 08:50:51 resource manager reading configs
2025/10/22 08:50:51 Reading /k8s-rdma-shared-dev-plugin/config.json
2025/10/22 08:50:51 loaded config: [{ResourceName:rdma_shared_device_ib ResourcePrefix: RdmaHcaMax:63 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ibp129s0] LinkTypes:[]}} {ResourceName:rdma_shared_device_eth ResourcePrefix: RdmaHcaMax:63 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[eno1np0] LinkTypes:[]}}]
2025/10/22 08:50:51 no periodic update interval is set, use default interval 60 seconds
2025/10/22 08:50:51 Discovering host devices
2025/10/22 08:50:51 discovering host network devices
2025/10/22 08:50:51 DiscoverHostDevices(): device found: 0000:02:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2025/10/22 08:50:51 DiscoverHostDevices(): device found: 0000:02:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2025/10/22 08:50:51 DiscoverHostDevices(): device found: 0000:81:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2025/10/22 08:50:51 Initializing resource servers
2025/10/22 08:50:51 Resource: &{ResourceName:rdma_shared_device_ib ResourcePrefix:rdma RdmaHcaMax:63 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ibp129s0] LinkTypes:[]}}
2025/10/22 08:50:51 error creating new device: "missing RDMA device spec for device 0000:02:00.0, RDMA device \"issm\" not found"
2025/10/22 08:50:51 error creating new device: "missing RDMA device spec for device 0000:02:00.1, RDMA device \"issm\" not found"
2025/10/22 08:50:51 Resource: &{ResourceName:rdma_shared_device_eth ResourcePrefix:rdma RdmaHcaMax:63 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[eno1np0] LinkTypes:[]}}
2025/10/22 08:50:51 error creating new device: "missing RDMA device spec for device 0000:02:00.0, RDMA device \"issm\" not found"
2025/10/22 08:50:51 error creating new device: "missing RDMA device spec for device 0000:02:00.1, RDMA device \"issm\" not found"
2025/10/22 08:50:51 Warning: no devices in device pool, creating empty resource server for rdma_shared_device_eth
2025/10/22 08:50:51 Warning: no Rdma Devices were found for resource rdma_shared_device_eth
2025/10/22 08:50:51 Starting all servers...
2025/10/22 08:50:51 starting rdma/rdma_shared_device_ib device plugin endpoint at: rdma_shared_device_ib.sock
2025/10/22 08:50:51 rdma/rdma_shared_device_ib device plugin endpoint started serving
2025/10/22 08:50:51 starting rdma/rdma_shared_device_eth device plugin endpoint at: rdma_shared_device_eth.sock
2025/10/22 08:50:51 rdma/rdma_shared_device_eth device plugin endpoint started serving
2025/10/22 08:50:51 All servers started.
2025/10/22 08:50:51 Listening for term signals
2025/10/22 08:50:51 Starting OS watcher.
2025/10/22 08:50:52 ListAndWatch called by kubelet for: rdma/rdma_shared_device_ib
2025/10/22 08:50:52 Updating "rdma/rdma_shared_device_ib" devices
2025/10/22 08:50:52 rdma_shared_device_ib.sock gets registered successfully at Kubelet
2025/10/22 08:50:52 rdma_shared_device_eth.sock gets registered successfully at Kubelet
2025/10/22 08:50:52 ListAndWatch called by kubelet for: rdma/rdma_shared_device_eth
2025/10/22 08:50:52 Updating "rdma/rdma_shared_device_eth" devices
2025/10/22 08:50:52 exposing "0" devices
2025/10/22 08:50:52 exposing "63" devices
# Resolution
Change the doca-driver version from 25.04-0.6.1.0-2 to 24.10-0.7.0.0-0, and the k8s-rdma-shared-dev-plugin version from v1.5.1 to v1.5.2.
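# One way to apply that fix in place, rather than recreating the CR (merge-patch sketch; field paths taken from the NicClusterPolicy spec above):
oc patch nicclusterpolicy nic-cluster-policy --type=merge -p '{"spec":{"ofedDriver":{"version":"24.10-0.7.0.0-0"},"rdmaSharedDevicePlugin":{"version":"v1.5.2"}}}'
# The DOCA driver pods restart on the new version; afterwards the node should again advertise rdma/rdma_shared_device_eth: 63.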