OpenShift
- Red Hat Hybrid Cloud Console
- :star: Deploying a simple Python app to Kubernetes/OpenShift | JJ Asghar | Conf42 Python 2022
- 『紅帽』的 Cloud-Native 工作術: 從 Container 到 OpenShift 。 :: 第 12 屆 iThome 鐵人賽
- 免 YAML 部署 App 到 OpenShift: new-app 跟 Template 淺談 - iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天
oc get all -o name | xargs oc delete
- Docker獸 究極進化 ~~ Kubernetes獸 :: 第 12 屆 iThome 鐵人賽
- 愛的走馬看花 Red Hat CoreOS 與 Red Hat OpenShift Part 1 - 魂系架構 Phil's Workspace
- Machine configuration tasks | Post-installation configuration | OpenShift Container Platform 4.13
- How to Configure a Squid Proxy and SSH Tunnel on RHEL 8 to access OpenShift Console from your local machine - Goglides Dev 🌱
- Configure access to a Red Hat OpenShift cluster on a private network in IBM Power Systems Virtual Server - IBM Developer
- Red Hat OpenStack Services on OpenShift: Rethinking storage design in pod-based architectures
- Course
What companies use OpenShift
- What companies use OpenShift
- Innovation Awards 2024
- NanShan Life Insurance
- Next Bank
- Innovation Awards 2023
- KGI Securities
- National Taiwan University Hospital (NTUH)
- Innovation Awards 2022
- Taiwan High Speed Rail Corporation
- Innovation Awards 2021
- National Center for High-performance Computing (NCHC)
- Taiwan Business Bank
install
- Chapter 2. Selecting a cluster installation method and preparing it for users OpenShift Container Platform 4.11 | Red Hat Customer Portal
- OpenShift 4.10 安裝步驟 - HackMD
- Day 0 到底該如何規劃 Openshift Container Platform
- Day 1 到底該如何安裝 Openshift Container Platform (Part 1)
- Day 1 到底該如何安裝 Openshift Container Platform (Part 2)
- Deploy OpenShift Container Platform 4.17 on KVM | ComputingForGeeks
- agent install
- Preparing to install with Agent-based Installer - Installing an on-premise cluster with the Agent-based Installer | Installing | OpenShift Container Platform 4.14
- Installing a cluster with Agent-based Installer - Installing an on-premise cluster with the Agent-based Installer | Installing | OpenShift Container Platform 4.14
- Gathering log data from a failed Agent-based installation
- ./openshift-install --dir <install_directory> agent wait-for bootstrap-complete --log-level=debug
- ./openshift-install --dir <install_directory> agent wait-for install-complete --log-level=debug
- Better securing the future: Navigating Red Hat OpenShift disconnected installations with the agent-based installer
- OpenShift Agent install disconnected - HackMD
- What is the best practice for dealing with kubeadmin user in OpenShift 4? - Red Hat Customer Portal
- Add worker to cluster built with Agent based installation
- Troubleshooting installations - Troubleshooting | Support | OpenShift Container Platform 4.15
- The initial kubeadmin password can be found in <install_directory>/auth/kubeadmin-password on the installation host.
which
- installer-provisioned infrastructure installation
- user-provisioned infrastructure installation
openshift 4.17 build worker node ISO
export REGISTRY_AUTH_FILE=/tmp/ocp/mirror-registry/pull-secret.json
oc adm node-image create nodes-config.yaml
oc adm node-image monitor --ip-addresses <ip_addresses>
oc get csr
oc adm certificate approve <csr_name>
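When several nodes join at once, the pending CSRs can be approved in one pass; a minimal sketch using the command from the OpenShift docs:
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
# run it twice: a second (serving) CSR appears after the first approval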
nodes-config.yaml
hosts:
  - hostname: extra-worker-1
    rootDeviceHints:
      deviceName: /dev/nvme0n1
    interfaces:
      - macAddress: 90:5a:08:03:6a:30
        name: enp23s0f0np0
      - macAddress: 5E:09:6B:17:DE:F6
        name: enp0s20f0u1u1c2
      - macAddress: 90:5a:08:03:6a:31
        name: enp23s0f1np1
    networkConfig:
      interfaces:
        - name: enp23s0f0np0
          type: ethernet
          state: up
          mac-address: 90:5a:08:03:6a:30
          ipv4:
            enabled: true
            address:
              - ip: 172.17.217.240
                prefix-length: 24
            dhcp: false
            auto-dns: false
          ipv6:
            enabled: false
        - name: enp0s20f0u1u1c2
          type: ethernet
          state: down
          mac-address: 5E:09:6B:17:DE:F6
          ipv4:
            enabled: false
          ipv6:
            enabled: false
        - name: enp23s0f1np1
          type: ethernet
          state: down
          mac-address: 90:5a:08:03:6a:31
          ipv4:
            enabled: false
          ipv6:
            enabled: false
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: 172.17.217.1
            next-hop-interface: enp23s0f0np0
            table-id: 254
      dns-resolver:
        config:
          search:
            - b3qportal.com
          server:
            - 172.17.217.241
CLI
oc cluster-info
oc project
oc login https://api.ocp4.example.com:6443
# oc login https://172.24.131.126:6443 --username=kubeadmin --password=bar --insecure-skip-tls-verify
oc whoami -c
oc whoami --show-console
oc api-versions
oc status
# view your current CLI configuration
oc config view
# list the memory and CPU usage of all pods in the cluster; --sum prints the total resource usage, and -A shows pods from all namespaces.
oc adm top pods -A --sum
# Use the --containers option to display the resource usage of containers within a pod.
oc adm top pods apiserver-75ff56786f-25rpd -n openshift-apiserver --containers
oc get clusteroperator
oc get operators
oc get operators nfd.openshift-nfd
oc get RESOURCE_TYPE
oc get RESOURCE_TYPE RESOURCE_NAME -o yaml
oc get RESOURCE_TYPE RESOURCE_NAME -o json
oc get all
oc get all -n openshift-apiserver --show-kind
oc get all -n openshift-monitoring --show-kind
# to execute commands against a different project, include the --namespace or -n option.
oc get pods -n openshift-apiserver
oc get pods -n openshift-apiserver -o yaml
oc get pods -n openshift-apiserver -o json
# print the labels used by the pods.
oc get pods -n openshift-apiserver --show-labels
oc get pod --all-namespaces -o wide
# shows additional fields.
oc get pods -o wide
# this function is not available across all resources.
oc describe RESOURCE_TYPE RESOURCE_NAME
# to print the documentation of a specific field of a resource.
# Fields are identified via a JSONPath identifier.
# Information about each field is retrieved from the server in OpenAPI format.
oc explain pods
oc explain pods.spec.containers.resources
# display all fields of a resource without descriptions.
oc explain pods --recursive
# create an RHOCP resource in the current project.
# often paired with the oc get RESOURCE_TYPE RESOURCE_NAME -o yaml command to export and edit definitions.
# the -f flag indicates the file that contains the JSON or YAML representation of the resource.
oc create -f pod.yaml
# delete an existing RHOCP resource from the current project.
# you must specify both the resource type and the resource name.
oc delete pod quotes-ui
# RBAC
oc get clusterrole.rbac
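A short sketch of common RBAC operations (the user name ocpadmin is an assumption):
# grant a cluster role to a user
oc adm policy add-cluster-role-to-user cluster-admin ocpadmin
# check who can perform an action in a namespace
oc adm policy who-can delete pods -n openshift-apiserver
# list role bindings in a namespace
oc get rolebindings -n openshift-apiserver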
monitoring and logging for the cluster
# return the log output for a container within a pod (the first container by default)
oc logs alertmanager-main-0 -n openshift-monitoring
oc get nodes master-0 -o json | jq '.status.conditions'
oc get nodes worker-0 -o json | jq '.status.conditions'
oc adm node-logs worker-0
oc adm node-logs worker-0 --tail 10
# start a debug session on the node
oc debug node/worker-0
oc get pods alertmanager-main-0 -n openshift-monitoring -o jsonpath='{.spec.containers[*].name}'
oc logs alertmanager-main-0 -n openshift-monitoring -c alertmanager-proxy
oc exec -n openshift-monitoring alertmanager-main-0 -c alertmanager-proxy -it -- bash -il
Tab completion
- https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#cli-enabling-tab-completion You can also save the file to a local directory and source it from your .bashrc file instead.
Tab completion is enabled when you open a new terminal.
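A sketch of the setup described in the doc link above:
oc completion bash > oc_bash_completion
sudo cp oc_bash_completion /etc/bash_completion.d/
# or keep it locally and source it from ~/.bashrc instead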
Authentication with OAuth
For users to interact with RHOCP, they must first authenticate to the cluster. The authentication layer identifies the user that is associated with requests to the RHOCP API. After authentication, the authorization layer then uses information about the requesting user to determine whether the request is allowed.
A user in OpenShift is an entity that can make requests to the RHOCP API.
An RHOCP User object represents an actor that can be granted permissions in the system by adding roles to the user or to the user's groups.
- Regular users
  - An RHOCP User object represents a regular user.
- System users
- Service accounts
  - ServiceAccount objects represent service accounts.
  - RHOCP creates service accounts automatically when a project is created.
The RHOCP control plane includes a built-in OAuth server.
To authenticate themselves to the API, users obtain OAuth access tokens. Token authentication is the only guaranteed method to work with any OpenShift cluster
To retrieve an OAuth token by using the OpenShift web console, navigate to Help → Command line tools.
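A sketch of logging in with the copied token (the token value and API URL are placeholders):
oc login --token=sha256~<token> --server=https://api.ocp4.example.com:6443
# print the token used by the current session
oc whoami -t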
image
- Chapter 9. Image configuration resources OpenShift Container Platform 4.11 | Red Hat Customer Portal
- OpenShift External/Mirror Image registry是怎麼運作的? | by Albert Weng | Medium
- OpenShift 4 - 配置OpenShift可使用的外部Image Registry和Mirror Registry_openshift配置外部私有registry-CSDN博客
- Openshift - Quay 本地私有 Registry 倉庫 (standalone) - HowHow の WebSite
quay
- Chapter 2. Creating a mirror registry with mirror registry for Red Hat OpenShift | Red Hat Product Documentation
- GitHub - quay/mirror-registry: A standalone registry used to mirror images for Openshift installations.
- Installing OpenShift in a disconnected network, step-by-step - HackMD
- Chapter 1. SSL and TLS for Red Hat Quay | Red Hat Product Documentation
- OpenShift External/Mirror Image registry是怎麼運作的? | by Albert Weng | Medium
- Quay.io rate limiting - Red Hat Customer Portal
with internet
wget https://mirror.openshift.com/pub/cgw/mirror-registry/latest/mirror-registry-amd64.tar.gz
tar -xvf mirror-registry-amd64.tar.gz
./mirror-registry install --quayHostname $(hostname -f) --quayRoot /home/foo/quay --initUser foo --initPassword barbarbar
# use different port
# https://github.com/quay/mirror-registry/blob/e609475d2eba1825866909d5d5997b048da5bc88/ansible-runner/context/app/project/roles/mirror_appliance/templates/pod.service.j2#L15
./mirror-registry install --quayHostname $(hostname -f):18443 --quayRoot /home/foo/quay --initUser foo --initPassword barbarbar
air-gapped https://github.com/quay/mirror-registry#installation
wget https://github.com/quay/mirror-registry/releases/download/v2.0.3/mirror-registry-offline.tar.gz
tar -zxvf mirror-registry-offline.tar.gz
./mirror-registry install --quayHostname $(hostname -f) --quayRoot /home/foo/quay --initUser foo --initPassword barbarbar
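A quick check that the registry came up, assuming the default Quay port 8443 and the credentials passed to --initUser/--initPassword:
podman login -u foo -p barbarbar --tls-verify=false $(hostname -f):8443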
pull images
with internet
oc adm release mirror -a /tmp/ocp/mirror-registry/pull-secret.json \
--from=quay.io/openshift-release-dev/ocp-release:4.17.10-x86_64 \
--to=<LOCAL_REGISTRY>/<LOCAL_REPOSITORY> \
--to-release-image=<LOCAL_REGISTRY>/<LOCAL_REPOSITORY>:4.17.10-x86_64
air-gapped Mirror the images to a directory on the removable media
oc adm release mirror -a /tmp/ocp/mirror-registry/pull-secret.json \
--to-dir=/tmp/mirror \
quay.io/openshift-release-dev/ocp-release:4.17.10-x86_64
:::info
info: Mirroring completed in 30m21.7s (10.99MB/s)
Success
Update image: openshift/release:4.17.10-x86_64
To upload local images to a registry, run:
oc image mirror --from-dir=/tmp/mirror 'file://openshift/release:4.17.10-x86_64*' REGISTRY/REPOSITORY
Configmap signature file /tmp/mirror/config/signature-sha256-4c8cc149a8e4ef2f.json created
:::
oc image mirror \
-a /tmp/ocp/mirror-registry/pull-secret.json \
--certificate-authority=/home/foo/quay/quay-rootCA/rootCA.pem \
--from-dir=/tmp/mirror \
'file://openshift/release:4.17.10-x86_64*' <LOCAL_REGISTRY>/<LOCAL_REPOSITORY>
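If the mirrored release is used for a cluster update, the signature config map written under /tmp/mirror/config (see the output above) can also be applied to the cluster; a sketch:
oc apply -f /tmp/mirror/config/signature-sha256-4c8cc149a8e4ef2f.json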
uninstall
debug
Nexus
API
# Get OAuth Route Hostname
oc get route -n openshift-authentication -o jsonpath='{.items[].spec.host}{"\n"}'
# Oauth Bearer Token: Method 1
TOKEN=$(curl -s -k -i -L -X GET --user USER:PASSWORD 'https://<OAuth-route-hostname>/oauth/authorize?response_type=token&client_id=openshift-challenging-client' | grep -oP "access_token=\K[^&]*")
# Oauth Bearer Token: Method 2
TOKEN=$(curl -s -k -i -L -X GET --user USER:PASSWORD 'https://<oauth-route-hostname>/oauth/authorize?response_type=token&client_id=openshift-challenging-client' | grep "access_token=" | awk -F'=' '{print $2}' | awk -F'&' '{print $1}')
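# Oauth Bearer Token: Method 3 (assumption: the oc CLI is already logged in to the cluster)
TOKEN=$(oc whoami -t)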
# Test
curl -s -k -H "Authorization: Bearer $TOKEN" -X GET https://<API host>:6443/apis/project.openshift.io/v1/projects
Operators
curl -s -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/subscriptions
curl -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/web-terminal
curl -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions/rhods-operator
web-terminal
# remove
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/web-terminal
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/clusterserviceversions/web-terminal.v1.9.0-0.1708477317.p
# install
curl -k -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -X POST https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions -d '{"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"name":"web-terminal","namespace":"openshift-operators"},"spec":{"channel":"fast","name":"web-terminal","source":"redhat-operators","sourceNamespace":"openshift-marketplace","startingCSV":"web-terminal.v1.9.0"}}'
openshift AI
# get subscription
curl -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions/rhods-operator
# remove
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions/rhods-operator
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/clusterserviceversions/rhods-operator.2.8.0
# install
curl -k -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -X POST https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions -d '{"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"name":"rhods-operator","namespace":"redhat-ods-operator"},"spec":{"channel":"stable","name":"rhods-operator","source":"redhat-operators","sourceNamespace":"openshift-marketplace", "installPlanApproval": "Automatic", "startingCSV":"rhods-operator.2.8.0"}}'
Basic
- :star:How to deploy a web service on OpenShift | Enable Sysadmin
- Running a PostgreSQL app in Openshift & connecting to it! | by Harshit Dawar | Medium
- Harness the Power of Python Microservices in OpenShift · MeatyBytes
template
oc process openshift//postgresql-persistent POSTGRESQL_USER=test POSTGRESQL_PASSWORD=test POSTGRESQL_DATABASE=test0328 | oc create -n tedchangchien-dev -f -
oc status
oc get pods
oc rsh <pod name>
psql -U test -W test0328
Network
Machine Network
This is the network at the infrastructure layer of an OpenShift cluster, typically used to connect physical or virtual nodes (e.g., masters, worker nodes). The IP range of the machine network is used for communication between nodes and for running management services like ETCD and the Kubernetes control plane.
Cluster Network
The internal Pod network within the cluster used for communication between Pods.
Each Pod is typically assigned a unique IP address. OpenShift uses Software-Defined Networking (SDN) to manage the cluster network, ensuring seamless communication between Pods.
Service Network
This network manages the virtual IP range for Kubernetes Services. Each Service is assigned a Cluster IP to handle traffic from both internal and external sources. Service IPs usually do not communicate directly with the external world but instead use a Service Proxy or Load Balancer for traffic forwarding.
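To see how these ranges are configured on a running cluster (a sketch; the machine network CIDR is set in install-config.yaml/agent-config.yaml rather than in this resource):
oc get network.config/cluster -o jsonpath='{.spec.clusterNetwork}{"\n"}{.spec.serviceNetwork}{"\n"}'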
Service
- https://www.youtube.com/watch?v=AObTrhIeK2U
- internal
  - ClusterIP
- external
  - NodePort
  - LoadBalancer
- Kubernetes Service:Overview|方格子 vocus
- [Day 9] 建立外部服務與Pods的溝通管道 - Services
- ClusterIP vs NodePort vs LoadBalancer vs Ingress - Red Hat Learning Community
Ingress
- [Day 19] 在 Kubernetes 中實現負載平衡 - Ingress Controller
- 免除 Ingress Controller 煩惱,擁抱 OpenShift Route 新世界。
- Kubernetes 那些事 — Ingress 篇(一). 前言 | by Andy Chen | Andy的技術分享blog | Medium
- Kubernetes 那些事 — Ingress 篇(二). 前言 | by Andy Chen | Andy的技術分享blog | Medium
- Kubernetes Ingress vs OpenShift Route
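A minimal sketch of the Route approach from the links above (the service name and hostname are placeholders):
# create a route for an existing service; the default router handles the external traffic
oc expose service quotes-ui --hostname=quotes.apps.ocp4.example.com
oc get route quotes-ui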
RBAC
Console
StorageClass
- :star:Chapter 4. Configuring persistent storage | Red Hat Product Documentation
- OpenShift External/Mirror Image registry是怎麼運作的? | by Albert Weng | Medium
OpenShift Data Foundation
Operators
- Adding Operators to a cluster - Administrator tasks | Operators | OpenShift Container Platform 4.15
- Deleting Operators from a cluster - Administrator tasks | Operators | OpenShift Container Platform 4.15
- 3scale Operator on OpenShift4.2 | 野生的工程師
openshift AI
- Red Hat OpenShift AI Overview | Red Hat Developer
- Chapter 5. Installing the Red Hat OpenShift AI Operator Red Hat OpenShift AI Self-Managed 2.6 | Red Hat Customer Portal
- How to create a natural language processing (NLP) application using Red Hat OpenShift AI | Red Hat Developer
NVIDIA GPU
- NVIDIA GPU architecture | Hardware accelerators | OpenShift Container Platform 4.17
- NVIDIA AI Enterprise
- Subscription Required for "NVIDIA AI Enterprise Essentials"
- Overview — NVIDIA AI Enterprise Licensing Guide
- NVIDIA AI Enterprise - NVIDIA Docs
- Activate Your NVIDIA AI Enterprise License | NVIDIA
- Deploying NVIDIA AI Enterprise Containers — NVIDIA AI Enterprise: OpenShift on Bare-metal Deployment Guide
- Generating NGC API Keys
- 申請 NVIDIA NGC API key 用於 TAO toolkit DLI 課程 - CAVEDU教育團隊技術部落格
- https://ngc.nvidia.com/signin
- Pulling and Running NVIDIA AI Enterprise Containers — NVIDIA AI Enterprise: Cloud Deployment Guide
- Demo: NVIDIA AI Enterprise with Red Hat OpenShift - YouTube
- pa-nvidia-steamline-gen-ai-development-brief-1468598-202410-en.pdf
- NVIDIA GPU Operator
- About the NVIDIA GPU Operator — NVIDIA GPU Operator
- Installing the NVIDIA GPU Operator on OpenShift — NVIDIA GPU Operator on Red Hat OpenShift Container Platform
- NVIDIA AI Enterprise with OpenShift — NVIDIA GPU Operator on Red Hat OpenShift Container Platform
- Red Hat OpenShift on Bare Metal — NVIDIA AI Enterprise: OpenShift on Bare-metal Deployment Guide
- NVIDIA NIM
- Introduction — NVIDIA NIM for Large Language Models (LLMs)
- Deliver generative AI at scale with NVIDIA NIM on OpenShift AI | Red Hat Developer
- Installing NVIDIA NIM Operator on Red Hat OpenShift — NVIDIA NIM Operator
- NVIDIA NeMo 微服務正式發布,協助企業建立 AI 代理提升生產力 - INSIDE
- NIM focuses on inference (running models), ensuring the best GPU efficiency in terms of throughput, latency, and cost
- NeMo focuses on training and improving model capabilities
- Used by the US telecom carrier AT&T, BlackRock, Cisco's Outshift team, and Nasdaq
- What is the difference between NVIDIA NIM and NVIDIA Nemo - perplexity
- What is the difference between NVIDIA NIM and NVIDIA Nemo - chatgpt
- Others
GPUs and bare metal
In addition, the worker nodes can host one or more GPUs, but they must be of the same type. For example, a node can have two NVIDIA A100 GPUs, but a node with one A100 GPU and one T4 GPU is not supported. The NVIDIA Device Plugin for Kubernetes does not support mixing different GPU models on the same node.
Multi-instance GPU (MIG) partitioning
MIG is only supported with A30, A100, A100X, A800, AX800, H100, and H800.
For instance, the NVIDIA A100 40GB offers multiple partitioning options:
- 1g.5gb: 1 Compute Instance (CI), 5GB memory
- 2g.10gb: 2 CIs, 10GB memory
- 3g.20gb: 3 CIs, 20GB memory
- 4g.20gb: 4 CIs, 20GB memory
- 7g.40gb: 7 CIs, 40GB memory
Check the supported profiles
oc rsh -n nvidia-gpu-operator $(oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-dcgm-exporter.*' | awk '{print $1}') nvidia-smi mig -lgip
# if not all GPU nodes support MIG, list the dcgm-exporter pods and pick the one on the MIG-capable node
oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-dcgm-exporter.*' | awk '{print $1}'
oc rsh -n nvidia-gpu-operator nvidia-dcgm-exporter-dln49 nvidia-smi mig -lgip
Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init), init-pod-nvidia-node-status-exporter (init)
+-----------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|=============================================================================|
| 0 MIG 1g.5gb 19 7/7 4.75 No 14 0 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.5gb+me 20 1/1 4.75 No 14 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.10gb 15 4/4 9.75 No 14 1 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.10gb 14 3/3 9.75 No 28 1 0 |
| 2 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 3g.20gb 9 2/2 19.62 No 42 2 0 |
| 3 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 4g.20gb 5 1/1 19.62 No 56 2 0 |
| 4 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 7g.40gb 0 1/1 39.38 No 98 5 0 |
| 7 1 1 |
+-----------------------------------------------------------------------------+
config MIG
# apply the MIG config to node/worker-1 (which has an A100 GPU); adjust the node name to your environment
oc label node/worker-1 nvidia.com/mig.config=all-1g.10gb --overwrite=true
# check the log
oc logs -n nvidia-gpu-operator $(oc get pods -n nvidia-gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}')
# check the node
oc describe node worker-1
oc describe nodes | grep -A 6 "Capacity"
oc get nodes -o=custom-columns='Node:metadata.name,GPU Product:metadata.labels.nvidia\.com/gpu\.product,GPU Capacity:status.capacity.nvidia\.com/gpu'
# show the mig
oc rsh -n nvidia-gpu-operator \
$(oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-driver-daemonset.*' | awk '{print $1}') nvidia-smi mig -lgi
# if not all gpu nodes support the MIG
oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-driver-daemonset.*' | awk '{print $1}'
oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202412180008-0-rzp8x nvidia-smi mig -lgi
After configuring it as all-1g.10gb
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 0 MIG 1g.10gb 15 3 4:2 |
+-------------------------------------------------------+
| 0 MIG 1g.10gb 15 4 6:2 |
+-------------------------------------------------------+
| 0 MIG 1g.10gb 15 5 0:2 |
+-------------------------------------------------------+
| 0 MIG 1g.10gb 15 6 2:2 |
+-------------------------------------------------------+
After configuring it as all-1g.5gb
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 0 MIG 1g.5gb 19 7 4:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 8 5:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 9 6:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 11 0:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 12 1:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 13 2:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 14 3:1 |
+-------------------------------------------------------+
# run the gpu application
cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubi8"
      resources:
        limits:
          nvidia.com/gpu: 4
EOF
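To confirm the sample pod actually consumed the MIG slices (a sketch; with the single MIG strategy the resource still appears as nvidia.com/gpu):
oc get pod cuda-vectoradd -o jsonpath='{.spec.containers[0].resources.limits}{"\n"}'
oc logs cuda-vectoradd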
Disable MIG
# disable MIG on node/worker-1
oc label node/worker-1 nvidia.com/mig.config=all-disabled --overwrite=true
Deploying NVIDIA AI Enterprise Containers
prerequisite: 1. apply for an NGC API key
- create a secret for pulling images from NGC (see the sketch below)
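A sketch of creating the regcred pull secret used by the deployments below, assuming the NGC API key is exported as NGC_API_KEY (for nvcr.io the user name is literally $oauthtoken):
oc create secret docker-registry regcred \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"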
jupyter
tensorflow-jupyter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-jupyter-notebook
  labels:
    app: tensorflow-jupyter-notebook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-jupyter-notebook
  template:
    metadata:
      labels:
        app: tensorflow-jupyter-notebook
    spec:
      containers:
        - name: tensorflow-container
          image: nvcr.io/nvidia/tensorflow-pb24h2:24.08.07-tf2-py3
          # image: nvcr.io/nvaie/tensorflow-2-3:22.09-tf2-nvaie-2.3-py3
          ports:
            - containerPort: 8888
          command: ["jupyter-notebook"]
          args: ["--NotebookApp.token=''"]
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
        - name: regcred
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-jupyter-notebook
spec:
  type: NodePort
  selector:
    app: tensorflow-jupyter-notebook
  ports:
    - protocol: TCP
      nodePort: 30040
      port: 8888
      targetPort: 8888
oc apply -f tensorflow-jupyter.yaml
oc get pods
oc describe pod <pod-name>
# Note the FQDN or IP of the node it is running on and construct the URL for accessing the notebook.
# http://<NODE_FQDN_OR_IP>:30040
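If the NodePort is not reachable from the workstation, a port-forward is an alternative (a sketch):
oc port-forward deployment/tensorflow-jupyter-notebook 8888:8888
# then browse to http://localhost:8888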
Running ResNet-50 with TensorFlow
tensorflow-gpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-gpu
  labels:
    app: tensorflow-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-gpu
  template:
    metadata:
      labels:
        app: tensorflow-gpu
    spec:
      containers:
        - name: tensorflow
          image: nvcr.io/nvidia/tensorflow-pb24h2:24.08.07-tf2-py3
          command: ["/bin/bash"]
          args: ["-c", "sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
        - name: regcred
oc apply -f tensorflow-gpu.yaml
oc get pods
oc exec -it <pod-name> -- /bin/bash
cd /workspace/nvidia-examples/cnn
python resnet.py -b 16 -i 200 -u batch --precision fp16
# it works
python resnet.py -b 32 -i 200 -u batch --precision fp16
# it works
# mpiexec --allow-run-as-root --bind-to socket -np 7 python resnet.py -b 32 -i 200 -u batch --precision fp16
# it does not work: MIG instances do not support peer-to-peer/NCCL communication between them
# ncclCommInitRank failed: invalid usage (run with NCCL_DEBUG=WARN for details)
Every 2.0s: nvidia-smi tensorflow-gpu-56754d7d47-2lgjh: Tue Apr 22 10:00:38 2025
Tue Apr 22 10:00:38 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:01:00.0 Off | On |
| N/A 55C P0 104W / 250W | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 7 0 0 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 8 0 1 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 9 0 2 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 11 0 3 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 12 0 4 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 13 0 5 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 14 0 6 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 7 0 35027 C python 332MiB |
| 0 8 0 35002 C python 332MiB |
| 0 9 0 35014 C python 332MiB |
| 0 11 0 35001 C python 332MiB |
| 0 12 0 35009 C python 332MiB |
| 0 13 0 35008 C python 332MiB |
| 0 14 0 34993 C python 332MiB |
+-----------------------------------------------------------------------------------------+
PyTorch MNIST
pytorch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-mnist
  labels:
    app: pytorch-mnist
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-mnist
  template:
    metadata:
      labels:
        app: pytorch-mnist
    spec:
      containers:
        - name: pytorch
          image: nvcr.io/nvidia/pytorch-ltsb2:23.08-lws2.1.0-py3
          command: ["/bin/bash"]
          args: ["-c", "sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
        - name: regcred
oc apply -f pytorch.yaml
oc get pods
oc exec -it <pod-name> -- /bin/bash
cd /workspace/examples/upstream/mnist
python main.py
# it works
Debug
- oauth - Lost my openshift console ("Application is not available") - Stack Overflow
- Troubleshooting Red Hat OpenShift Container Platform 4: DNS - Red Hat Customer Portal
image
- "unable to sync: Operation cannot be fulfilled on configs.imageregistry.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again, requeuing" in OpenShift 4.4.x - Red Hat Customer Portal
- OpenShift 4 Container Image Management - YouTube
- Image Registry operator is in degraded state with error "Unable to apply resources: storage backend not configured" - Red Hat Customer Portal
- 啟用 OpenShift 內部映像檔登錄 - IBM 說明文件
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
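After patching, a quick check that the operator leaves the degraded state:
oc get clusteroperator image-registry
oc get pods -n openshift-image-registry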
agent install
- Installing a cluster with Agent-based Installer - Installing an on-premise cluster with the Agent-based Installer | Installing | OpenShift Container Platform 4.14
- Troubleshooting installation issues | Installing | OpenShift Container Platform 4.15
- Troubleshooting installations - Troubleshooting | Support | OpenShift Container Platform 4.15
- How to recover install-config.yaml file from an installed OpenShift 4.x cluster - Red Hat Customer Portal
- Day-19-Kubernetes_除錯思路分享 - iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天
- [Kubernetes] Taints and Tolerations | 小信豬的原始部落
some cluster operators are not available
ClusterVersion: Installing "4.14.15" for About an hour: Unable to apply 4.14.15: some cluster operators are not available
DEBUG Still waiting for the cluster to initialize: Cluster operators authentication, console, ingress, machine-api, monitoring are not available
- OCP 4.x Installation incomplete: cluster failed to initialize due to some cluster operators are still updating - Red Hat Customer Portal
- openshift-install not creating the worker vm using IPI · Issue #386 · okd-project/okd · GitHub
- Crio and kubelet services are stuck in "dead" status and are unable to start in OCP 4 - Red Hat Customer Portal
- Master kubelet gets Unauthorized and stuck when bootstrapping masters with 10 min gaps in OpenShift 4 - Red Hat Customer Portal
oc get nodes -o wide
oc get clusteroperators
oc get pod --all-namespaces -o wide
oc get po -n openshift-ingress
oc describe pod -n openshift-ingress router-default-66f58c7559-f2gqf
on the rendezvous node
journalctl -u assisted-service.service
journalctl -b -f -u release-image.service -u bootkube.service
oc get pods -n openshift-ingress router-default-66f58c7559
oc describe pod/router-default-66f58c7559-fmx72 -n openshift-ingress
oc get mcp
on the failed node
systemctl list-jobs
podman login registry.redhat.io --authfile /var/lib/kubelet/config.json
dig registry.redhat.io
nslookup registry.redhat.io
log in to the target CoreOS node
DNS issue
172.20.0.1 is the internal DNS; if it can reach the internet, it's fine.
Otherwise, add the external DNS servers to agent-config.yaml.
vmware virtual machine
- VM Options => Boot Options => Enable UEFI secure boot
- VM Options => Advanced => Configuration Parameters
- disk.EnableUUID: TRUE
mirror registry
check the /etc/containers/registries.conf
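One way to inspect it directly on a node without SSH (a sketch; the node name is an example):
oc debug node/worker-0 -- chroot /host cat /etc/containers/registries.conf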
deployment failure
Integration
Identity provider
- Openshift Oauth use OpenID (Keycloak) as identity provider, oc login failed but can login successful with web console - Red Hat Customer Portal
- How to Integrate OpenShift with Keycloak - The New Stack
- Keycloak (as an Identity Provider) to secure Openshift | by Abhishek koserwal | Keycloak | Medium
- Keycloak - OpenShift Examples
- install Keycloak in OpenShift
LDAP
Subscription
- Chapter 3. Cluster subscriptions and registration | Red Hat Product Documentation
- Red Hat OpenShift pricing
- Self-managed Red Hat OpenShift subscription guide
HyperShift
Support
OpenShift Lifecycle: https://access.redhat.com/support/policy/updates/openshift
OpenShift AAA: https://docs.openshift.com/container-platform/4.17/authentication/index.html
OpenShift Identity Providers: https://docs.openshift.com/container-platform/4.17/authentication/understanding-identity-provider.html
Certificates: https://docs.openshift.com/container-platform/4.17/security/index.html
Deploying ODF on Bare Metal: https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/deploying_openshift_data_foundation_using_bare_metal_infrastructure/index
ODF Architecture (Internal/External approach): https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/planning_your_deployment/odf-architecture_rhodf#odf-architecture_rhodf
Mirror Registry: https://docs.openshift.com/container-platform/4.17/disconnected/mirroring/installing-mirroring-creating-registry.html
OpenShift Web Console customizations: https://docs.openshift.com/container-platform/4.17/web_console/customizing-the-web-console.html
Add Worker Node to an OpenShift cluster: https://docs.openshift.com/container-platform/4.17/nodes/nodes/nodes-nodes-adding-node-iso.html