OpenShift
- Red Hat Hybrid Cloud Console
- :star: Deploying a simple Python app to Kubernetes/OpenShift | JJ Asghar | Conf42 Python 2022
- 『紅帽』的 Cloud-Native 工作術: 從 Container 到 OpenShift 。 :: 第 12 屆 iThome 鐵人賽
- 免 YAML 部署 App 到 OpenShift: new-app 跟 Template 淺談 - iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天
oc get all -o name | xargs oc delete
- Docker獸 究極進化 ~~ Kubernetes獸 :: 第 12 屆 iThome 鐵人賽
- 愛的走馬看花 Red Hat CoreOS 與 Red Hat OpenShift Part 1 - 魂系架構 Phil's Workspace
- Machine configuration tasks | Post-installation configuration | OpenShift Container Platform 4.13
- How to Configure a Squid Proxy and SSH Tunnel on RHEL 8 to access OpenShift Console from your local machine - Goglides Dev 🌱
- Configure access to a Red Hat OpenShift cluster on a private network in IBM Power Systems Virtual Server - IBM Developer
- Red Hat OpenStack Services on OpenShift: Rethinking storage design in pod-based architectures
- Course
What companies use OpenShift
- What companies use OpenShift
- Innovation Awards 2024
- NanShan Life Insurance
- Next Bank
- Innovation Awards 2023
- KGI Securities
- National Taiwan University Hospital (NTUH)
- Innovation Awards 2022
- Taiwan High Speed Rail Corporation
- Innovation Awards 2021
- National Center for High-performance Computing (NCHC)
- Taiwan Business Bank
install
- Chapter 2. Selecting a cluster installation method and preparing it for users OpenShift Container Platform 4.11 | Red Hat Customer Portal
- OpenShift 4.10 安裝步驟 - HackMD
- Day 0 到底該如何規劃 Openshift Container Platform
- Day 1 到底該如何安裝 Openshift Container Platform (Part 1)
- Day 1 到底該如何安裝 Openshift Container Platform (Part 2)
- Deploy OpenShift Container Platform 4.17 on KVM | ComputingForGeeks
- agent install
- Preparing to install with Agent-based Installer - Installing an on-premise cluster with the Agent-based Installer | Installing | OpenShift Container Platform 4.14
- Installing a cluster with Agent-based Installer - Installing an on-premise cluster with the Agent-based Installer | Installing | OpenShift Container Platform 4.14
- Gathering log data from a failed Agent-based installation
- ./openshift-install --dir <install_directory> agent wait-for bootstrap-complete --log-level=debug
- ./openshift-install --dir <install_directory> agent wait-for install-complete --log-level=debug
- Better securing the future: Navigating Red Hat OpenShift disconnected installations with the agent-based installer
- OpenShift Agent install disconnected - HackMD
- What is the best practice for dealing with kubeadmin user in OpenShift 4? - Red Hat Customer Portal
- Add worker to cluster built with Agent based installation
- Troubleshooting installations - Troubleshooting | Support | OpenShift Container Platform 4.15
- The initial kubeadmin password can be found in <install_directory>/auth/kubeadmin-password on the installation host.
which
- installer-provisioned infrastructure installation
- user-provisioned infrastructure installation
openshift 4.17 build worker node ISO
export REGISTRY_AUTH_FILE=/tmp/ocp/mirror-registry/pull-secret.json
oc adm node-image create nodes-config.yaml
oc adm node-image monitor --ip-addresses <ip_addresses>
oc get csr
oc adm certificate approve <csr_name>
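When several nodes join at once, the pending CSRs can be approved in one pass; a minimal sketch using the command from the OpenShift docs:
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
# run it twice: a second (serving) CSR appears after the first approval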
nodes-config.yaml
hosts:
  - hostname: extra-worker-1
    rootDeviceHints:
      deviceName: /dev/nvme0n1
    interfaces:
      - macAddress: 90:5a:08:03:6a:30
        name: enp23s0f0np0
      - macAddress: 5E:09:6B:17:DE:F6
        name: enp0s20f0u1u1c2
      - macAddress: 90:5a:08:03:6a:31
        name: enp23s0f1np1
    networkConfig:
      interfaces:
        - name: enp23s0f0np0
          type: ethernet
          state: up
          mac-address: 90:5a:08:03:6a:30
          ipv4:
            enabled: true
            address:
              - ip: 172.17.217.240
                prefix-length: 24
            dhcp: false
            auto-dns: false
          ipv6:
            enabled: false
        - name: enp0s20f0u1u1c2
          type: ethernet
          state: down
          mac-address: 5E:09:6B:17:DE:F6
          ipv4:
            enabled: false
          ipv6:
            enabled: false
        - name: enp23s0f1np1
          type: ethernet
          state: down
          mac-address: 90:5a:08:03:6a:31
          ipv4:
            enabled: false
          ipv6:
            enabled: false
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: 172.17.217.1
            next-hop-interface: enp23s0f0np0
            table-id: 254
      dns-resolver:
        config:
          search:
            - b3qportal.com
          server:
            - 172.17.217.241
CLI
oc cluster-info
oc project
oc login https://api.ocp4.example.com:6443
# oc login https://172.24.131.126:6443 --username=kubeadmin --password=bar --insecure-skip-tls-verify
oc whoami -c
oc whoami --show-console
oc api-versions
oc status
# view your current CLI configuration
oc config view
# list the memory and CPU usage of all pods in the cluster; --sum prints the total resource usage, and -A shows pods from all namespaces.
oc adm top pods -A --sum
# Use the --containers option to display the resource usage of containers within a pod.
oc adm top pods apiserver-75ff56786f-25rpd -n openshift-apiserver --containers
oc get clusteroperator
oc get operators
oc get operators nfd.openshift-nfd
oc get RESOURCE_TYPE
oc get RESOURCE_TYPE RESOURCE_NAME -o yaml
oc get RESOURCE_TYPE RESOURCE_NAME -o json
oc get all
oc get all -n openshift-apiserver --show-kind
oc get all -n openshift-monitoring --show-kind
# to execute commands against a different project, include the --namespace or -n option.
oc get pods -n openshift-apiserver
oc get pods -n openshift-apiserver -o yaml
oc get pods -n openshift-apiserver -o json
# print the labels used by the pods.
oc get pods -n openshift-apiserver --show-labels
oc get pod --all-namespaces -o wide
# shows additional fields.
oc get pods -o wide
# this function is not available across all resources.
oc describe RESOURCE_TYPE RESOURCE_NAME
# to print the documentation of a specific field of a resource.
# Fields are identified via a JSONPath identifier.
# Information about each field is retrieved from the server in OpenAPI format.
oc explain pods
oc explain pods.spec.containers.resources
# display all fields of a resource without descriptions.
oc explain pods --recursive
# create an RHOCP resource in the current project.
# often paired with the oc get RESOURCE_TYPE RESOURCE_NAME -o yaml command to export and edit definitions.
# the -f flag indicates the file that contains the JSON or YAML representation of the resource.
oc create -f pod.yaml
# delete an existing RHOCP resource from the current project.
# you must specify both the resource type and the resource name.
oc delete pod quotes-ui
# RBAC
oc get clusterrole.rbac
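A short sketch of common RBAC operations (the user name ocpadmin is an assumption):
# grant a cluster role to a user
oc adm policy add-cluster-role-to-user cluster-admin ocpadmin
# check who can perform an action in a namespace
oc adm policy who-can delete pods -n openshift-apiserver
# list role bindings in a namespace
oc get rolebindings -n openshift-apiserver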
monitoring and logging for the cluster
# return the log output for a container within a pod (the first container by default)
oc logs alertmanager-main-0 -n openshift-monitoring
oc get nodes master-0 -o json | jq '.status.conditions'
oc get nodes worker-0 -o json | jq '.status.conditions'
oc adm node-logs worker-0
oc adm node-logs worker-0 --tail 10
# start a debug session on the node
oc debug node/worker-0
oc get pods alertmanager-main-0 -n openshift-monitoring -o jsonpath='{.spec.containers[*].name}'
oc logs alertmanager-main-0 -n openshift-monitoring -c alertmanager-proxy
oc exec -n openshift-monitoring alertmanager-main-0 -c alertmanager-proxy -it -- bash -il
Tab completion
- https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#cli-enabling-tab-completion You can also save the file to a local directory and source it from your .bashrc file instead.
Tab completion is enabled when you open a new terminal.
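A sketch of the setup described in the doc link above:
oc completion bash > oc_bash_completion
sudo cp oc_bash_completion /etc/bash_completion.d/
# or keep it locally and source it from ~/.bashrc instead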
Authentication with OAuth
For users to interact with RHOCP, they must first authenticate to the cluster. The authentication layer identifies the user that is associated with requests to the RHOCP API. After authentication, the authorization layer then uses information about the requesting user to determine whether the request is allowed.
A user in OpenShift is an entity that can make requests to the RHOCP API.
An RHOCP User object represents an actor that can be granted permissions in the system by adding roles to the user or to the user's groups.
- Regular users
  - An RHOCP User object represents a regular user.
- System users
- Service accounts
  - ServiceAccount objects represent service accounts.
  - RHOCP creates service accounts automatically when a project is created.
The RHOCP control plane includes a built-in OAuth server.
To authenticate themselves to the API, users obtain OAuth access tokens. Token authentication is the only guaranteed method to work with any OpenShift cluster
To retrieve an OAuth token by using the OpenShift web console, navigate to Help → Command line tools.
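A sketch of logging in with the copied token (the token value and API URL are placeholders):
oc login --token=sha256~<token> --server=https://api.ocp4.example.com:6443
# print the token used by the current session
oc whoami -t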
image
- Chapter 9. Image configuration resources OpenShift Container Platform 4.11 | Red Hat Customer Portal
- OpenShift External/Mirror Image registry是怎麼運作的? | by Albert Weng | Medium
- OpenShift 4 - 配置OpenShift可使用的外部Image Registry和Mirror Registry_openshift配置外部私有registry-CSDN博客
- Openshift - Quay 本地私有 Registry 倉庫 (standalone) - HowHow の WebSite
quay
- Chapter 2. Creating a mirror registry with mirror registry for Red Hat OpenShift | Red Hat Product Documentation
- GitHub - quay/mirror-registry: A standalone registry used to mirror images for Openshift installations.
- Installing OpenShift in a disconnected network, step-by-step - HackMD
- Chapter 1. SSL and TLS for Red Hat Quay | Red Hat Product Documentation
- OpenShift External/Mirror Image registry是怎麼運作的? | by Albert Weng | Medium
- Quay.io rate limiting - Red Hat Customer Portal
with internet
wget https://mirror.openshift.com/pub/cgw/mirror-registry/latest/mirror-registry-amd64.tar.gz
tar -xvf mirror-registry-amd64.tar.gz
./mirror-registry install --quayHostname $(hostname -f) --quayRoot /home/foo/quay --initUser foo --initPassword barbarbar
# use different port
# https://github.com/quay/mirror-registry/blob/e609475d2eba1825866909d5d5997b048da5bc88/ansible-runner/context/app/project/roles/mirror_appliance/templates/pod.service.j2#L15
./mirror-registry install --quayHostname $(hostname -f):18443 --quayRoot /home/foo/quay --initUser foo --initPassword barbarbar
air-gapped https://github.com/quay/mirror-registry#installation
wget https://github.com/quay/mirror-registry/releases/download/v2.0.3/mirror-registry-offline.tar.gz
tar -zxvf mirror-registry-offline.tar.gz
./mirror-registry install --quayHostname $(hostname -f) --quayRoot /home/foo/quay --initUser foo --initPassword barbarbar
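A quick check that the registry came up, assuming the default Quay port 8443 and the credentials passed to --initUser/--initPassword:
podman login -u foo -p barbarbar --tls-verify=false $(hostname -f):8443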
pull images
with internet
oc adm release mirror -a /tmp/ocp/mirror-registry/pull-secret.json \
--from=quay.io/openshift-release-dev/ocp-release:4.17.10-x86_64 \
--to=<LOCAL_REGISTRY>/<LOCAL_REPOSITORY> \
--to-release-image=<LOCAL_REGISTRY>/<LOCAL_REPOSITORY>:4.17.10-x86_64
air-gapped Mirror the images to a directory on the removable media
oc adm release mirror -a /tmp/ocp/mirror-registry/pull-secret.json \
--to-dir=/tmp/mirror \
quay.io/openshift-release-dev/ocp-release:4.17.10-x86_64
:::info
info: Mirroring completed in 30m21.7s (10.99MB/s)
Success
Update image: openshift/release:4.17.10-x86_64
To upload local images to a registry, run:
oc image mirror --from-dir=/tmp/mirror 'file://openshift/release:4.17.10-x86_64*' REGISTRY/REPOSITORY
Configmap signature file /tmp/mirror/config/signature-sha256-4c8cc149a8e4ef2f.json created
:::
oc image mirror \
-a /tmp/ocp/mirror-registry/pull-secret.json \
--certificate-authority=/home/foo/quay/quay-rootCA/rootCA.pem \
--from-dir=/tmp/mirror \
'file://openshift/release:4.17.10-x86_64*' <LOCAL_REGISTRY>/<LOCAL_REPOSITORY>
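If the mirrored release is used for a cluster update, the signature config map written under /tmp/mirror/config (see the output above) can also be applied to the cluster; a sketch:
oc apply -f /tmp/mirror/config/signature-sha256-4c8cc149a8e4ef2f.json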
uninstall
debug
Nexus
API
# Get OAuth Route Hostname
oc get route -n openshift-authentication -o jsonpath='{.items[].spec.host}{"\n"}'
# Oauth Bearer Token: Method 1
TOKEN=$(curl -s -k -i -L -X GET --user USER:PASSWORD 'https://<OAuth-route-hostname>/oauth/authorize?response_type=token&client_id=openshift-challenging-client' | grep -oP "access_token=\K[^&]*")
# Oauth Bearer Token: Method 2
TOKEN=$(curl -s -k -i -L -X GET --user USER:PASSWORD 'https://<oauth-route-hostname>/oauth/authorize?response_type=token&client_id=openshift-challenging-client' | grep "access_token=" | awk -F'=' '{print $2}' | awk -F'&' '{print $1}')
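# Oauth Bearer Token: Method 3 (assumption: the oc CLI is already logged in to the cluster)
TOKEN=$(oc whoami -t)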
# Test
curl -s -k -H "Authorization: Bearer $TOKEN" -X GET https://<API host>:6443/apis/project.openshift.io/v1/projects
Operators
curl -s -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/subscriptions
curl -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/web-terminal
curl -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions/rhods-operator
web-terminal
# remove
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/web-terminal
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/clusterserviceversions/web-terminal.v1.9.0-0.1708477317.p
# install
curl -k -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -X POST https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions -d '{"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"name":"web-terminal","namespace":"openshift-operators"},"spec":{"channel":"fast","name":"web-terminal","source":"redhat-operators","sourceNamespace":"openshift-marketplace","startingCSV":"web-terminal.v1.9.0"}}'
openshift AI
# get subscription
curl -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions/rhods-operator
# remove
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions/rhods-operator
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/clusterserviceversions/rhods-operator.2.8.0
# install
curl -k -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -X POST https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions -d '{"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"name":"rhods-operator","namespace":"redhat-ods-operator"},"spec":{"channel":"stable","name":"rhods-operator","source":"redhat-operators","sourceNamespace":"openshift-marketplace", "installPlanApproval": "Automatic", "startingCSV":"rhods-operator.2.8.0"}}'
Basic
- :star:How to deploy a web service on OpenShift | Enable Sysadmin
- Running a PostgreSQL app in Openshift & connecting to it! | by Harshit Dawar | Medium
- Harness the Power of Python Microservices in OpenShift · MeatyBytes
template
oc process openshift//postgresql-persistent POSTGRESQL_USER=test POSTGRESQL_PASSWORD=test POSTGRESQL_DATABASE=test0328 | oc create -n tedchangchien-dev -f -
oc status
oc get pods
oc rsh <pod name>
psql -U test -W test0328
Network
Machine Network
This is the network at the infrastructure layer of an OpenShift cluster, typically used to connect physical or virtual nodes (e.g., masters, worker nodes). The IP range of the machine network is used for communication between nodes and for running management services like ETCD and the Kubernetes control plane.
Cluster Network
The internal Pod network within the cluster used for communication between Pods.
Each Pod is typically assigned a unique IP address. OpenShift uses Software-Defined Networking (SDN) to manage the cluster network, ensuring seamless communication between Pods.
Service Network
This network manages the virtual IP range for Kubernetes Services. Each Service is assigned a Cluster IP to handle traffic from both internal and external sources. Service IPs usually do not communicate directly with the external world but instead use a Service Proxy or Load Balancer for traffic forwarding.
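To see how these ranges are configured on a running cluster (a sketch; the machine network CIDR is set in install-config.yaml/agent-config.yaml rather than in this resource):
oc get network.config/cluster -o jsonpath='{.spec.clusterNetwork}{"\n"}{.spec.serviceNetwork}{"\n"}'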
Service
- https://www.youtube.com/watch?v=AObTrhIeK2U
- internal
  - ClusterIP
- external
  - NodePort
  - LoadBalancer
- Kubernetes Service:Overview|方格子 vocus
- [Day 9] 建立外部服務與Pods的溝通管道 - Services
- ClusterIP vs NodePort vs LoadBalancer vs Ingress - Red Hat Learning Community
Ingress
- [Day 19] 在 Kubernetes 中實現負載平衡 - Ingress Controller
- 免除 Ingress Controller 煩惱,擁抱 OpenShift Route 新世界。
- Kubernetes 那些事 — Ingress 篇(一). 前言 | by Andy Chen | Andy的技術分享blog | Medium
- Kubernetes 那些事 — Ingress 篇(二). 前言 | by Andy Chen | Andy的技術分享blog | Medium
- Kubernetes Ingress vs OpenShift Route
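A minimal sketch of the Route approach from the links above (the service name and hostname are placeholders):
# create a route for an existing service; the default router handles the external traffic
oc expose service quotes-ui --hostname=quotes.apps.ocp4.example.com
oc get route quotes-ui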
RBAC
Console
StorageClass
- :star:Chapter 4. Configuring persistent storage | Red Hat Product Documentation
- OpenShift External/Mirror Image registry是怎麼運作的? | by Albert Weng | Medium
OpenShift Data Foundation
Operators
- Adding Operators to a cluster - Administrator tasks | Operators | OpenShift Container Platform 4.15
- Deleting Operators from a cluster - Administrator tasks | Operators | OpenShift Container Platform 4.15
- 3scale Operator on OpenShift4.2 | 野生的工程師
openshift AI
- Red Hat OpenShift AI Overview | Red Hat Developer
- Chapter 5. Installing the Red Hat OpenShift AI Operator Red Hat OpenShift AI Self-Managed 2.6 | Red Hat Customer Portal
- How to create a natural language processing (NLP) application using Red Hat OpenShift AI | Red Hat Developer
NVIDIA GPU
- NVIDIA GPU architecture | Hardware accelerators | OpenShift Container Platform 4.17
- NVIDIA AI Enterprise
- Subscription Required for "NVIDIA AI Enterprise Essentials"
- Overview — NVIDIA AI Enterprise Licensing Guide
- NVIDIA AI Enterprise - NVIDIA Docs
- Activate Your NVIDIA AI Enterprise License | NVIDIA
- Deploying NVIDIA AI Enterprise Containers — NVIDIA AI Enterprise: OpenShift on Bare-metal Deployment Guide
- Generating NGC API Keys
- 申請 NVIDIA NGC API key 用於 TAO toolkit DLI 課程 - CAVEDU教育團隊技術部落格
- https://ngc.nvidia.com/signin
- Pulling and Running NVIDIA AI Enterprise Containers — NVIDIA AI Enterprise: Cloud Deployment Guide
- Demo: NVIDIA AI Enterprise with Red Hat OpenShift - YouTube
- pa-nvidia-steamline-gen-ai-development-brief-1468598-202410-en.pdf
- NVIDIA GPU Operator
- About the NVIDIA GPU Operator — NVIDIA GPU Operator
- Installing the NVIDIA GPU Operator on OpenShift — NVIDIA GPU Operator on Red Hat OpenShift Container Platform
- NVIDIA AI Enterprise with OpenShift — NVIDIA GPU Operator on Red Hat OpenShift Container Platform
- Red Hat OpenShift on Bare Metal — NVIDIA AI Enterprise: OpenShift on Bare-metal Deployment Guide
- NVIDIA NIM
- Introduction — NVIDIA NIM for Large Language Models (LLMs)
- Deliver generative AI at scale with NVIDIA NIM on OpenShift AI | Red Hat Developer
- Installing NVIDIA NIM Operator on Red Hat OpenShift — NVIDIA NIM Operator
- NVIDIA NeMo 微服務正式發布,協助企業建立 AI 代理提升生產力 - INSIDE
- NIM focuses on inference (running models), ensuring the best GPU efficiency in terms of throughput, latency, and cost
- NeMo focuses on training and improving model capabilities
- Used by the US telecom carrier AT&T, BlackRock, Cisco's Outshift team, and Nasdaq
- What is the difference between NVIDIA NIM and NVIDIA Nemo - perplexity
- What is the difference between NVIDIA NIM and NVIDIA Nemo - chatgpt
- Others
GPUs and bare metal
In addition, the worker nodes can host one or more GPUs, but they must be of the same type. For example, a node can have two NVIDIA A100 GPUs, but a node with one A100 GPU and one T4 GPU is not supported. The NVIDIA Device Plugin for Kubernetes does not support mixing different GPU models on the same node.
Multi-instance GPU (MIG) partitioning
MIG is only supported with A30, A100, A100X, A800, AX800, H100, and H800.
For instance, the NVIDIA A100 40GB offers multiple partitioning options:
- 1g.5gb: 1 Compute Instance (CI), 5GB memory
- 2g.10gb: 2 CIs, 10GB memory
- 3g.20gb: 3 CIs, 20GB memory
- 4g.20gb: 4 CIs, 20GB memory
- 7g.40gb: 7 CIs, 40GB memory
Check the supported profiles
oc rsh -n nvidia-gpu-operator $(oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-dcgm-exporter.*' | awk '{print $1}') nvidia-smi mig -lgip
# if not all GPU nodes support MIG, list the dcgm-exporter pods and pick the one on the MIG-capable node
oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-dcgm-exporter.*' | awk '{print $1}'
oc rsh -n nvidia-gpu-operator nvidia-dcgm-exporter-dln49 nvidia-smi mig -lgip
Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init), init-pod-nvidia-node-status-exporter (init)
+-----------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|=============================================================================|
| 0 MIG 1g.5gb 19 7/7 4.75 No 14 0 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.5gb+me 20 1/1 4.75 No 14 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.10gb 15 4/4 9.75 No 14 1 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.10gb 14 3/3 9.75 No 28 1 0 |
| 2 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 3g.20gb 9 2/2 19.62 No 42 2 0 |
| 3 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 4g.20gb 5 1/1 19.62 No 56 2 0 |
| 4 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 7g.40gb 0 1/1 39.38 No 98 5 0 |
| 7 1 1 |
+-----------------------------------------------------------------------------+
config MIG
# apply the MIG config to node/worker-1 (which has an A100 GPU); adjust the node name to your environment
oc label node/worker-1 nvidia.com/mig.config=all-1g.10gb --overwrite=true
# check the log
oc logs -n nvidia-gpu-operator $(oc get pods -n nvidia-gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}')
# check the node
oc describe node worker-1
oc describe nodes | grep -A 6 "Capacity"
oc get nodes -o=custom-columns='Node:metadata.name,GPU Product:metadata.labels.nvidia\.com/gpu\.product,GPU Capacity:status.capacity.nvidia\.com/gpu'
# show the mig
oc rsh -n nvidia-gpu-operator \
$(oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-driver-daemonset.*' | awk '{print $1}') nvidia-smi mig -lgi
# if not all gpu nodes support the MIG
oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-driver-daemonset.*' | awk '{print $1}'
oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202412180008-0-rzp8x nvidia-smi mig -lgi
After configuring it as all-1g.10gb
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 0 MIG 1g.10gb 15 3 4:2 |
+-------------------------------------------------------+
| 0 MIG 1g.10gb 15 4 6:2 |
+-------------------------------------------------------+
| 0 MIG 1g.10gb 15 5 0:2 |
+-------------------------------------------------------+
| 0 MIG 1g.10gb 15 6 2:2 |
+-------------------------------------------------------+
After configuring it as all-1g.5gb
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 0 MIG 1g.5gb 19 7 4:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 8 5:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 9 6:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 11 0:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 12 1:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 13 2:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 14 3:1 |
+-------------------------------------------------------+
# run the gpu application
cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubi8"
      resources:
        limits:
          nvidia.com/gpu: 4
EOF
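To confirm the sample pod actually consumed the MIG slices (a sketch; with the single MIG strategy the resource still appears as nvidia.com/gpu):
oc get pod cuda-vectoradd -o jsonpath='{.spec.containers[0].resources.limits}{"\n"}'
oc logs cuda-vectoradd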
Disable MIG
# disable MIG on node/worker-1
oc label node/worker-1 nvidia.com/mig.config=all-disabled --overwrite=true
Deploying NVIDIA AI Enterprise Containers
prerequisite: 1. apply for an NGC API key
- create a secret for pulling images from NGC (see the sketch below)
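A sketch of creating the regcred pull secret used by the deployments below, assuming the NGC API key is exported as NGC_API_KEY (for nvcr.io the user name is literally $oauthtoken):
oc create secret docker-registry regcred \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"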
jupyter
tensorflow-jupyter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-jupyter-notebook
  labels:
    app: tensorflow-jupyter-notebook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-jupyter-notebook
  template:
    metadata:
      labels:
        app: tensorflow-jupyter-notebook
    spec:
      containers:
        - name: tensorflow-container
          image: nvcr.io/nvidia/tensorflow-pb24h2:24.08.07-tf2-py3
          # image: nvcr.io/nvaie/tensorflow-2-3:22.09-tf2-nvaie-2.3-py3
          ports:
            - containerPort: 8888
          command: ["jupyter-notebook"]
          args: ["--NotebookApp.token=''"]
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
        - name: regcred
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-jupyter-notebook
spec:
  type: NodePort
  selector:
    app: tensorflow-jupyter-notebook
  ports:
    - protocol: TCP
      nodePort: 30040
      port: 8888
      targetPort: 8888
oc apply -f tensorflow-jupyter.yaml
oc get pods
oc describe pod <pod-name>
# Note the FQDN or IP of the node it is running on and construct the URL for accessing the notebook.
# http://<NODE_FQDN_OR_IP>:30040
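If the NodePort is not reachable from the workstation, a port-forward is an alternative (a sketch):
oc port-forward deployment/tensorflow-jupyter-notebook 8888:8888
# then browse to http://localhost:8888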
Running ResNet-50 with TensorFlow
tensorflow-gpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-gpu
  labels:
    app: tensorflow-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-gpu
  template:
    metadata:
      labels:
        app: tensorflow-gpu
    spec:
      containers:
        - name: tensorflow
          image: nvcr.io/nvidia/tensorflow-pb24h2:24.08.07-tf2-py3
          command: ["/bin/bash"]
          args: ["-c", "sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
        - name: regcred
oc apply -f tensorflow-gpu.yaml
oc get pods
oc exec -it <pod-name> -- /bin/bash
cd /workspace/nvidia-examples/cnn
python resnet.py -b 16 -i 200 -u batch --precision fp16
# it works
python resnet.py -b 32 -i 200 -u batch --precision fp16
# it works
# mpiexec --allow-run-as-root --bind-to socket -np 7 python resnet.py -b 32 -i 200 -u batch --precision fp16
# it does not work: MIG instances do not support peer-to-peer/NCCL communication between them
# ncclCommInitRank failed: invalid usage (run with NCCL_DEBUG=WARN for details)
Every 2.0s: nvidia-smi tensorflow-gpu-56754d7d47-2lgjh: Tue Apr 22 10:00:38 2025
Tue Apr 22 10:00:38 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:01:00.0 Off | On |
| N/A 55C P0 104W / 250W | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 7 0 0 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 8 0 1 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 9 0 2 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 11 0 3 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 12 0 4 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 13 0 5 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 14 0 6 | 375MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 7 0 35027 C python 332MiB |
| 0 8 0 35002 C python 332MiB |
| 0 9 0 35014 C python 332MiB |
| 0 11 0 35001 C python 332MiB |
| 0 12 0 35009 C python 332MiB |
| 0 13 0 35008 C python 332MiB |
| 0 14 0 34993 C python 332MiB |
+-----------------------------------------------------------------------------------------+
PyTorch MNIST
pytorch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-mnist
  labels:
    app: pytorch-mnist
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-mnist
  template:
    metadata:
      labels:
        app: pytorch-mnist
    spec:
      containers:
        - name: pytorch
          image: nvcr.io/nvidia/pytorch-ltsb2:23.08-lws2.1.0-py3
          command: ["/bin/bash"]
          args: ["-c", "sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
        - name: regcred
oc apply -f pytorch.yaml
oc get pods
oc exec -it <pod-name> -- /bin/bash
cd /workspace/examples/upstream/mnist
python main.py
# it works
Debug
- oauth - Lost my openshift console ("Application is not available") - Stack Overflow
- Troubleshooting Red Hat OpenShift Container Platform 4: DNS - Red Hat Customer Portal
image
- "unable to sync: Operation cannot be fulfilled on configs.imageregistry.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again, requeuing" in OpenShift 4.4.x - Red Hat Customer Portal
- OpenShift 4 Container Image Management - YouTube
- Image Registry operator is in degraded state with error "Unable to apply resources: storage backend not configured" - Red Hat Customer Portal
- 啟用 OpenShift 內部映像檔登錄 - IBM 說明文件
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
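After patching, a quick check that the operator leaves the degraded state:
oc get clusteroperator image-registry
oc get pods -n openshift-image-registry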
agent install
- Installing a cluster with Agent-based Installer - Installing an on-premise cluster with the Agent-based Installer | Installing | OpenShift Container Platform 4.14
- Troubleshooting installation issues | Installing | OpenShift Container Platform 4.15
- Troubleshooting installations - Troubleshooting | Support | OpenShift Container Platform 4.15
- How to recover install-config.yaml file from an installed OpenShift 4.x cluster - Red Hat Customer Portal
- Day-19-Kubernetes_除錯思路分享 - iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天
- [Kubernetes] Taints and Tolerations | 小信豬的原始部落
some cluster operators are not available
ClusterVersion: Installing "4.14.15" for About an hour: Unable to apply 4.14.15: some cluster operators are not available
DEBUG Still waiting for the cluster to initialize: Cluster operators authentication, console, ingress, machine-api, monitoring are not available
- OCP 4.x Installation incomplete: cluster failed to initialize due to some cluster operators are still updating - Red Hat Customer Portal
- openshift-install not creating the worker vm using IPI · Issue #386 · okd-project/okd · GitHub
- Crio and kubelet services are stuck in "dead" status and are unable to start in OCP 4 - Red Hat Customer Portal
- Master kubelet gets Unauthorized and stuck when bootstrapping masters with 10 min gaps in OpenShift 4 - Red Hat Customer Portal
oc get nodes -o wide
oc get clusteroperators
oc get pod --all-namespaces -o wide
oc get po -n openshift-ingress
oc describe pod -n openshift-ingress router-default-66f58c7559-f2gqf
on the rendezvous node
journalctl -u assisted-service.service
journalctl -b -f -u release-image.service -u bootkube.service
oc get pods -n openshift-ingress router-default-66f58c7559
oc describe pod/router-default-66f58c7559-fmx72 -n openshift-ingress
oc get mcp
on the failed node
systemctl list-jobs
podman login registry.redhat.io --authfile /var/lib/kubelet/config.json
dig registry.redhat.io
nslookup registry.redhat.io
log in to the target CoreOS node
DNS issue
172.20.0.1 is the internal DNS; if it can reach the internet, it's fine.
Otherwise, add the external DNS servers to agent-config.yaml.
vmware virtual machine
- VM Options => Boot Options => Enable UEFI secure boot
- VM Options => Advanced => Configuration Parameters
- disk.EnableUUID: TRUE
mirror registry
check the /etc/containers/registries.conf
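One way to inspect it directly on a node without SSH (a sketch; the node name is an example):
oc debug node/worker-0 -- chroot /host cat /etc/containers/registries.conf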
deployment failure
Integration
Identity provider
- Openshift Oauth use OpenID (Keycloak) as identity provider, oc login failed but can login successful with web console - Red Hat Customer Portal
- How to Integrate OpenShift with Keycloak - The New Stack
- Keycloak (as an Identity Provider) to secure Openshift | by Abhishek koserwal | Keycloak | Medium
- Keycloak - OpenShift Examples
- install Keycloak in OpenShift
LDAP
Subscription
- Chapter 3. Cluster subscriptions and registration | Red Hat Product Documentation
- Red Hat OpenShift pricing
- Self-managed Red Hat OpenShift subscription guide
HyperShift
Support
OpenShift Lifecycle: https://access.redhat.com/support/policy/updates/openshift
OpenShift AAA: https://docs.openshift.com/container-platform/4.17/authentication/index.html
OpenShift Identity Providers: https://docs.openshift.com/container-platform/4.17/authentication/understanding-identity-provider.html
Certificates: https://docs.openshift.com/container-platform/4.17/security/index.html
Deploying ODF on Bare Metal: https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/deploying_openshift_data_foundation_using_bare_metal_infrastructure/index
ODF Architecture (Internal/External approach): https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/planning_your_deployment/odf-architecture_rhodf#odf-architecture_rhodf
Mirror Registry: https://docs.openshift.com/container-platform/4.17/disconnected/mirroring/installing-mirroring-creating-registry.html
OpenShift Web Console customizations: https://docs.openshift.com/container-platform/4.17/web_console/customizing-the-web-console.html
Add Worker Node to an OpenShift cluster: https://docs.openshift.com/container-platform/4.17/nodes/nodes/nodes-nodes-adding-node-iso.html