OpenShift

Which companies use OpenShift

install

Which one:

  • installer-provisioned infrastructure installation
  • user-provisioned infrastructure installation

OpenShift 4.17: build worker node ISO

export REGISTRY_AUTH_FILE=/tmp/ocp/mirror-registry/pull-secret.json
oc adm node-image create nodes-config.yaml
oc adm node-image monitor --ip-addresses <ip_addresses>
oc get csr
oc adm certificate approve <csr_name>
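
If several CSRs show up while the new worker joins, they can be approved in one pass; a sketch using the go-template filter from the OpenShift docs:

# approve every CSR that still has no status (i.e. is pending)
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve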

nodes-config.yaml

hosts:
  - hostname: extra-worker-1
    rootDeviceHints:
      deviceName: /dev/nvme0n1
    interfaces:
      - macAddress: 90:5a:08:03:6a:30
        name: enp23s0f0np0
      - macAddress: 5E:09:6B:17:DE:F6
        name: enp0s20f0u1u1c2
      - macAddress: 90:5a:08:03:6a:31
        name: enp23s0f1np1
    networkConfig:
      interfaces:
        - name: enp23s0f0np0
          type: ethernet
          state: up
          mac-address: 90:5a:08:03:6a:30
          ipv4:
            enabled: true
            address:
              - ip: 172.17.217.240
                prefix-length: 24
            dhcp: false
            auto-dns: false
          ipv6:
            enabled: false
        - name: enp0s20f0u1u1c2
          type: ethernet
          state: down
          mac-address: 5E:09:6B:17:DE:F6
          ipv4:
            enabled: false
          ipv6:
            enabled: false
        - name: enp23s0f1np1
          type: ethernet
          state: down
          mac-address: 90:5a:08:03:6a:31
          ipv4:
            enabled: false
          ipv6:
            enabled: false
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: 172.17.217.1
            next-hop-interface: enp23s0f0np0
            table-id: 254
      dns-resolver:
        config:
          search:
            - b3qportal.com
          server:
            - 172.17.217.241

CLI

oc cluster-info
oc project
oc login https://api.ocp4.example.com:6443
# oc login https://172.24.131.126:6443 --username=kubeadmin --password=bar --insecure-skip-tls-verify
oc whoami -c
oc whoami --show-console
oc api-versions
oc status
# view your current CLI configuration
oc config view
# list the memory and CPU usage of all pods in the cluster; --sum prints the total resource usage and -A shows pods from all namespaces.
oc adm top pods -A --sum
# Use the --containers option to display the resource usage of containers within a pod.
oc adm top pods apiserver-75ff56786f-25rpd -n openshift-apiserver --containers

oc get clusteroperator
oc get operators
oc get operators nfd.openshift-nfd
oc get RESOURCE_TYPE
oc get RESOURCE_TYPE RESOURCE_NAME -o yaml
oc get RESOURCE_TYPE RESOURCE_NAME -o json
oc get all
oc get all -n openshift-apiserver --show-kind
oc get all -n openshift-monitoring --show-kind
# to run commands against a different project, include the --namespace or -n option.
oc get pods -n openshift-apiserver
oc get pods -n openshift-apiserver -o yaml
oc get pods -n openshift-apiserver -o json
# print the labels used by the pods.
oc get pods -n openshift-apiserver --show-labels
oc get pod --all-namespaces -o wide
# shows additional fields.
oc get pods -o wide

# note: this is not available for all resource types.
oc describe RESOURCE_TYPE RESOURCE_NAME

# to print the documentation of a specific field of a resource. 
# Fields are identified via a JSONPath identifier.
# Information about each field is retrieved from the server in OpenAPI format.
oc explain pods
oc explain pods.spec.containers.resources
# display all fields of a resource without descriptions.
oc explain pods --recursive


# create an RHOCP resource in the current project.
# often paired with oc get RESOURCE_TYPE RESOURCE_NAME -o yaml for editing definitions.
# the -f option points to the file that contains the JSON or YAML representation of the resource.
oc create -f pod.yaml

# delete an existing RHOCP resource from the current project.
# must specify the resource type and the resource name.
oc delete pod quotes-ui

# RBAC
oc get clusterrole.rbac

monitor and view logs for the cluster

# return the logs for a pod; use -c to target a container within the pod
oc logs alertmanager-main-0 -n openshift-monitoring
oc get nodes master-0 -o json | jq '.status.conditions'
oc get nodes worker-0 -o json | jq '.status.conditions'
oc adm node-logs worker-0
oc adm node-logs worker-0 --tail 10
## start a debug session on the node 
oc debug node/worker-0
oc get pods alertmanager-main-0 -n openshift-monitoring -o jsonpath='{.spec.containers[*].name}'
oc logs alertmanager-main-0 -n openshift-monitoring -c alertmanager-proxy
oc exec -n openshift-monitoring alertmanager-main-0 -c alertmanager-proxy -it -- bash -il

Tab completion

  • https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#cli-enabling-tab-completion
    oc completion bash > oc_bash_completion
    sudo cp oc_bash_completion /etc/bash_completion.d/
    
    You can also save the file to a local directory and source it from your .bashrc file instead.
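
    For example, a minimal version of that approach (the ~/.oc_bash_completion path is just an example):

    oc completion bash > ~/.oc_bash_completion
    echo 'source ~/.oc_bash_completion' >> ~/.bashrc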

Tab completion is enabled when you open a new terminal.

Authentication with OAuth

For users to interact with RHOCP, they must first authenticate to the cluster. The authentication layer identifies the user that is associated with requests to the RHOCP API. After authentication, the authorization layer then uses information about the requesting user to determine whether the request is allowed.

A user in OpenShift is an entity that can make requests to the RHOCP API.

An RHOCP User object represents an actor that can be granted permissions in the system by adding roles to the user or to the user's groups.

  • Regular users
    • An RHOCP User object represents a regular user.
  • System users
  • Service accounts
    • ServiceAccount objects represent service accounts.
    • RHOCP creates service accounts automatically when a project is created

The RHOCP control plane includes a built-in OAuth server.

To authenticate themselves to the API, users obtain OAuth access tokens. Token authentication is the only method that is guaranteed to work with any OpenShift cluster.

To retrieve an OAuth token by using the OpenShift web console, navigate to Help → Command line tools.

[user@host ~]$ oc login --token=sha256-BW...rA8 \
  --server=https://api.ocp4.example.com:6443
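
Once logged in, the session token can be reused for scripting against the API; a minimal sketch (the token request URL shape is the usual default and may differ per cluster):

# print the OAuth token for the current session
oc whoami -t
# a fresh token can also be requested from the OAuth route's token page, typically:
# https://oauth-openshift.apps.<cluster-domain>/oauth/token/request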


quay

with internet

wget https://mirror.openshift.com/pub/cgw/mirror-registry/latest/mirror-registry-amd64.tar.gz
tar -xvf mirror-registry-amd64.tar.gz
./mirror-registry install --quayHostname $(hostname -f) --quayRoot /home/foo/quay --initUser foo --initPassword barbarbar

# use different port
# https://github.com/quay/mirror-registry/blob/e609475d2eba1825866909d5d5997b048da5bc88/ansible-runner/context/app/project/roles/mirror_appliance/templates/pod.service.j2#L15
./mirror-registry install --quayHostname $(hostname -f):18443 --quayRoot /home/foo/quay --initUser foo --initPassword barbarbar

air-gapped https://github.com/quay/mirror-registry#installation

wget https://github.com/quay/mirror-registry/releases/download/v2.0.3/mirror-registry-offline.tar.gz
tar -zxvf mirror-registry-offline.tar.gz
./mirror-registry install --quayHostname $(hostname -f) --quayRoot /home/foo/quay --initUser foo --initPassword barbarbar

pull images

with internet

oc adm release mirror -a /tmp/ocp/mirror-registry/pull-secret.json \
--from=quay.io/openshift-release-dev/ocp-release:4.17.10-x86_64 \
--to=<LOCAL_REGISTRY>/<LOCAL_REPOSITORY> \
--to-release-image=<LOCAL_REGISTRY>/<LOCAL_REPOSITORY>:4.17.10-x86_64

air-gapped: mirror the images to a directory on the removable media

oc adm release mirror -a /tmp/ocp/mirror-registry/pull-secret.json \
--to-dir=/tmp/mirror \
quay.io/openshift-release-dev/ocp-release:4.17.10-x86_64

:::info
info: Mirroring completed in 30m21.7s (10.99MB/s)

Success
Update image: openshift/release:4.17.10-x86_64

To upload local images to a registry, run:

oc image mirror --from-dir=/tmp/mirror 'file://openshift/release:4.17.10-x86_64*' REGISTRY/REPOSITORY

Configmap signature file /tmp/mirror/config/signature-sha256-4c8cc149a8e4ef2f.json created
:::

oc image mirror \
-a /tmp/ocp/mirror-registry/pull-secret.json \
--certificate-authority=/home/foo/quay/quay-rootCA/rootCA.pem \
--from-dir=/tmp/mirror \
'file://openshift/release:4.17.10-x86_64*' <LOCAL_REGISTRY>/<LOCAL_REPOSITORY>

uninstall

./mirror-registry uninstall -v --autoApprove --quayRoot /home/foo/quay

debug

podman secret ls
podman secret rm redis_pass

Nexus

API

# Get OAuth Route Hostname
oc get route -n openshift-authentication -o jsonpath='{.items[].spec.host}{"\n"}'  

# Oauth Bearer Token: Method 1
TOKEN=$(curl -s -k -i -L -X GET --user USER:PASSWORD 'https://<OAuth-route-hostname>/oauth/authorize?response_type=token&client_id=openshift-challenging-client' | grep -oP "access_token=\K[^&]*")

# Oauth Bearer Token: Method 2
TOKEN=$(curl -s -k -i -L -X GET --user USER:PASSWORD 'https://<oauth-route-hostname>/oauth/authorize?response_type=token&client_id=openshift-challenging-client' | grep  "access_token=" | awk -F'=' '{print $2}' | awk -F'&' '{print $1}')
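
# Oauth Bearer Token: Method 3 (alternative sketch, assuming the oc CLI is available): log in and reuse the session token
# oc login https://<API host>:6443 --username=USER --password=PASSWORD --insecure-skip-tls-verify
TOKEN=$(oc whoami -t)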

# Test
curl -s -k -H "Authorization: Bearer $TOKEN" -X GET https://<API host>:6443/apis/project.openshift.io/v1/projects

Operators

curl -s -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/subscriptions

curl -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/web-terminal

curl -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions/rhods-operator

web-terminal

# remove
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/web-terminal
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/clusterserviceversions/web-terminal.v1.9.0-0.1708477317.p


# install
curl -k -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -X POST https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions -d '{"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"name":"web-terminal","namespace":"openshift-operators"},"spec":{"channel":"fast","name":"web-terminal","source":"redhat-operators","sourceNamespace":"openshift-marketplace","startingCSV":"web-terminal.v1.9.0"}}'
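
To confirm the install finished, the ClusterServiceVersion in the target namespace should reach phase Succeeded; a sketch (jq is assumed to be available, as in the node-status examples above):

curl -s -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/clusterserviceversions | jq -r '.items[] | "\(.metadata.name) \(.status.phase)"'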

OpenShift AI

# get subscription
curl -k -H "Authorization: Bearer $TOKEN" -X GET https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions/rhods-operator

# remove
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions/rhods-operator
curl -k -H "Authorization: Bearer $TOKEN" -X DELETE https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/clusterserviceversions/rhods-operator.2.8.0

# install
curl -k -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -X POST https://api.test.supershift.com:6443/apis/operators.coreos.com/v1alpha1/namespaces/redhat-ods-operator/subscriptions -d '{"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"name":"rhods-operator","namespace":"redhat-ods-operator"},"spec":{"channel":"stable","name":"rhods-operator","source":"redhat-operators","sourceNamespace":"openshift-marketplace", "installPlanApproval": "Automatic", "startingCSV":"rhods-operator.2.8.0"}}'

Basic

template

oc process openshift//postgresql-persistent POSTGRESQL_USER=test POSTGRESQL_PASSWORD=test POSTGRESQL_DATABASE=test0328 | oc create -n tedchangchien-dev -f -
oc status
oc get pods
oc rsh <pod name>

psql -U test -W test0328

Network

Machine Network

This is the network at the infrastructure layer of an OpenShift cluster, typically used to connect physical or virtual nodes (e.g., masters, worker nodes). The IP range of the machine network is used for communication between nodes and for running management services like ETCD and the Kubernetes control plane.

Cluster Network

The internal Pod network within the cluster used for communication between Pods.

Each Pod is typically assigned a unique IP address. OpenShift uses Software-Defined Networking (SDN) to manage the cluster network, ensuring seamless communication between Pods.

Service Network

This network manages the virtual IP range for Kubernetes Services. Each Service is assigned a Cluster IP to handle traffic from both internal and external sources. Service IPs usually do not communicate directly with the external world but instead use a Service Proxy or Load Balancer for traffic forwarding.
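
To confirm the configured ranges on a running cluster, the cluster Network resource records the cluster and service network CIDRs; a minimal check (the machine network itself is defined at install time in install-config.yaml / agent-config.yaml):

# print the Pod (cluster) network and Service network CIDRs
oc get network.config/cluster -o jsonpath='{.spec.clusterNetwork}{"\n"}{.spec.serviceNetwork}{"\n"}'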

Service

Ingress

RBAC

Console

StorageClass

OpenShift Data Foundation

Operators

OpenShift AI

NVIDIA GPU

GPUs and bare metal

In addition, the worker nodes can host one or more GPUs, but they must be of the same type. For example, a node can have two NVIDIA A100 GPUs, but a node with one A100 GPU and one T4 GPU is not supported. The NVIDIA Device Plugin for Kubernetes does not support mixing different GPU models on the same node.
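
A quick way to confirm which GPU model each node exposes is to print the nvidia.com/gpu.product node label set by the GPU operator's feature discovery (the same label used in the custom-columns query further down); a minimal check:

oc get nodes -L nvidia.com/gpu.product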

Multi-instance GPU (MIG) partitioning

MIG is only supported with A30, A100, A100X, A800, AX800, H100, and H800.

For instance, the NVIDIA A100 40GB offers multiple partitioning options:

  • 1g.5gb: 1 Compute Instance (CI), 5GB memory
  • 2g.10gb: 2 CIs, 10GB memory
  • 3g.20gb: 3 CIs, 20GB memory
  • 4g.20gb: 4 CIs, 20GB memory
  • 7g.40gb: 7 CIs, 40GB memory

Check the supported profiles

oc rsh  -n nvidia-gpu-operator  $(oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-dcgm-exporter.*' | awk '{print $1}') nvidia-smi mig -lgip

# if not all GPU nodes support MIG, list the exporter pods and pick one on a MIG-capable node
oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-dcgm-exporter.*' | awk '{print $1}'
oc rsh  -n nvidia-gpu-operator nvidia-dcgm-exporter-dln49  nvidia-smi mig -lgip

Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init), init-pod-nvidia-node-status-exporter (init)
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.5gb        19     7/7        4.75       No     14     0     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.5gb+me     20     1/1        4.75       No     14     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.10gb       15     4/4        9.75       No     14     1     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.10gb       14     3/3        9.75       No     28     1     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.20gb        9     2/2        19.62      No     42     2     0   |
|                                                             3     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.20gb        5     1/1        19.62      No     56     2     0   |
|                                                             4     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.40gb        0     1/1        39.38      No     98     5     0   |
|                                                             7     1     1   |
+-----------------------------------------------------------------------------+

config MIG

# configure MIG (all-1g.10gb) on node/worker-1, which has an A100 GPU; adjust for your environment
oc label  node/worker-1 nvidia.com/mig.config=all-1g.10gb --overwrite=true
# check the log
oc logs -n nvidia-gpu-operator $(oc get pods -n nvidia-gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}')
# check the node
oc describe node worker-1
oc describe nodes | grep -A 6 "Capacity"
oc get nodes -o=custom-columns='Node:metadata.name,GPU Product:metadata.labels.nvidia\.com/gpu\.product,GPU Capacity:status.capacity.nvidia\.com/gpu'

# show the mig
oc rsh -n nvidia-gpu-operator \
$(oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-driver-daemonset.*' | awk '{print $1}') nvidia-smi mig -lgi

# if not all GPU nodes support MIG, list the driver daemonset pods and pick one on a MIG-capable node
oc get pods -n nvidia-gpu-operator | grep -E 'nvidia-driver-daemonset.*' | awk '{print $1}'
oc rsh  -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202412180008-0-rzp8x nvidia-smi mig -lgi

After configuring it as all-1g.10gb

+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 1g.10gb         15        3          4:2     |
+-------------------------------------------------------+
|   0  MIG 1g.10gb         15        4          6:2     |
+-------------------------------------------------------+
|   0  MIG 1g.10gb         15        5          0:2     |
+-------------------------------------------------------+
|   0  MIG 1g.10gb         15        6          2:2     |
+-------------------------------------------------------+

After configuring it as all-1g.5gb

+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 1g.5gb          19        7          4:1     |
+-------------------------------------------------------+
|   0  MIG 1g.5gb          19        8          5:1     |
+-------------------------------------------------------+
|   0  MIG 1g.5gb          19        9          6:1     |
+-------------------------------------------------------+
|   0  MIG 1g.5gb          19       11          0:1     |
+-------------------------------------------------------+
|   0  MIG 1g.5gb          19       12          1:1     |
+-------------------------------------------------------+
|   0  MIG 1g.5gb          19       13          2:1     |
+-------------------------------------------------------+
|   0  MIG 1g.5gb          19       14          3:1     |
+-------------------------------------------------------+

# run the gpu application
cat << EOF | oc create -f -

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubi8"
    resources:
      limits:
        nvidia.com/gpu: 4
EOF
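
The vectoradd sample exits quickly; checking its log is enough to confirm the MIG slices are usable (the sample prints "Test PASSED" on success):

oc logs cuda-vectoradd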

Disable MIG

# disable MIG on node/worker-1
oc label  node/worker-1 nvidia.com/mig.config=all-disabled --overwrite=true

Deploying NVIDIA AI Enterprise Containers

Prerequisite: apply for an NGC_KEY

  1. create a secret for pulling images from NGC
    oc create secret docker-registry regcred --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<YOUR_NGC_KEY> --docker-email=<your_email_id> -n default
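
Optionally, instead of adding imagePullSecrets to every workload, the pull secret can be linked to the namespace's default service account; a sketch for the default namespace used above:

oc secrets link default regcred --for=pull -n default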
    
jupyter

tensorflow-jupyter.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-jupyter-notebook
  labels:
    app: tensorflow-jupyter-notebook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-jupyter-notebook
  template:
    metadata:
      labels:
        app: tensorflow-jupyter-notebook
    spec:
      containers:
      - name: tensorflow-container
        image: nvcr.io/nvidia/tensorflow-pb24h2:24.08.07-tf2-py3
        # image: nvcr.io/nvaie/tensorflow-2-3:22.09-tf2-nvaie-2.3-py3
        ports:
        - containerPort: 8888
        command: ["jupyter-notebook"]
        args: ["--NotebookApp.token=''"]
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
      imagePullSecrets:
      - name: regcred
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-jupyter-notebook
spec:
  type: NodePort
  selector:
    app: tensorflow-jupyter-notebook
  ports:
  - protocol: TCP
    nodePort: 30040
    port: 8888
    targetPort: 8888

oc apply -f tensorflow-jupyter.yaml
oc get pods
oc describe pod <pod-name>
# Note the FQDN or IP of the node it is running on and construct the URL for accessing the notebook.
# http://<NODE_FQDN_OR_IP>:30040
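A quick way to recover the node name and NodePort for that URL (a sketch using the labels and Service defined above):

# node the notebook pod landed on
oc get pod -l app=tensorflow-jupyter-notebook -o jsonpath='{.items[0].spec.nodeName}{"\n"}'
# NodePort assigned to the Service (30040 as requested above)
oc get svc tensorflow-jupyter-notebook -o jsonpath='{.spec.ports[0].nodePort}{"\n"}'
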
Running ResNet-50 with TensorFlow

tensorflow-gpu.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-gpu
  labels:
    app: tensorflow-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-gpu
  template:
    metadata:
      labels:
        app: tensorflow-gpu
    spec:
      containers:
      - name: tensorflow
        image: nvcr.io/nvidia/tensorflow-pb24h2:24.08.07-tf2-py3
        command: ["/bin/bash"]
        args: ["-c", "sleep infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1
      imagePullSecrets:
      - name: regcred

oc apply -f tensorflow-gpu.yaml
oc get pods
oc exec -it <pod-name> -- /bin/bash
cd /workspace/nvidia-examples/cnn

python resnet.py  -b 16 -i 200 -u batch --precision fp16
# it works
python resnet.py  -b 32 -i 200 -u batch --precision fp16
# it works

# mpiexec --allow-run-as-root --bind-to socket -np 7 python resnet.py  -b 32 -i 200 -u batch --precision fp16
# this does not work: NCCL cannot initialize across MIG instances
# ncclCommInitRank failed: invalid usage (run with NCCL_DEBUG=WARN for details)
Every 2.0s: nvidia-smi                                                                                                            tensorflow-gpu-56754d7d47-2lgjh: Tue Apr 22 10:00:38 2025

Tue Apr 22 10:00:38 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:01:00.0 Off |                   On |
| N/A   55C    P0            104W /  250W |                  N/A   |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    7   0   0  |             375MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 2MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    8   0   1  |             375MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 2MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    9   0   2  |             375MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 2MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   11   0   3  |             375MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 2MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   12   0   4  |             375MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 2MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   13   0   5  |             375MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 2MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   14   0   6  |             375MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 2MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0    7    0            35027      C   python                                  332MiB |
|    0    8    0            35002      C   python                                  332MiB |
|    0    9    0            35014      C   python                                  332MiB |
|    0   11    0            35001      C   python                                  332MiB |
|    0   12    0            35009      C   python                                  332MiB |
|    0   13    0            35008      C   python                                  332MiB |
|    0   14    0            34993      C   python                                  332MiB |
+-----------------------------------------------------------------------------------------+
PyTorch MNIST

pytorch.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-mnist
  labels:
    app: pytorch-mnist
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-mnist
  template:
    metadata:
      labels:
        app: pytorch-mnist
    spec:
      containers:
      - name: pytorch
        image: nvcr.io/nvidia/pytorch-ltsb2:23.08-lws2.1.0-py3
        command: ["/bin/bash"]
        args: ["-c", "sleep infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1
      imagePullSecrets:
      - name: regcred

oc apply -f pytorch.yaml
oc get pods
oc exec -it <pod-name> -- /bin/bash
cd /workspace/examples/upstream/mnist

python main.py
# it works

Debug

image registry
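
The two patches below switch the internal image registry operator back to Managed and give it ephemeral emptyDir storage (registry contents are lost when the registry pod restarts), which is fine for labs but not for production.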

oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
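
A quick check that the change took effect (the registry pods live in openshift-image-registry):

oc get clusteroperator image-registry
oc get pods -n openshift-image-registry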

agent install

some cluster operators are not available

ClusterVersion: Installing "4.14.15" for About an hour: Unable to apply 4.14.15: some cluster operators are not available

DEBUG Still waiting for the cluster to initialize: Cluster operators authentication, console, ingress, machine-api, monitoring are not available

Related references:

  • OCP 4.x Installation incomplete: cluster failed to initialize due to some cluster operators are still updating - Red Hat Customer Portal
  • openshift-install not creating the worker vm using IPI · Issue #386 · okd-project/okd · GitHub
  • Crio and kubelet services are stuck in "dead" status and are unable to start in OCP 4 - Red Hat Customer Portal
  • Master kubelet gets Unauthorized and stuck when bootstrapping masters with 10 min gaps in OpenShift 4 - Red Hat Customer Portal

oc get nodes -o wide
oc get clusteroperators
oc get pod --all-namespaces -o wide
oc get po -n openshift-ingress
oc describe pod -n openshift-ingress router-default-66f58c7559-f2gqf

on Rendezvous node

journalctl -u assisted-service.service
journalctl -b -f -u release-image.service -u bootkube.service

oc get pods -n openshift-ingress  router-default-66f58c7559
oc describe pod/router-default-66f58c7559-fmx72 -n openshift-ingress
oc get mcp

on failed node

systemctl list-jobs
podman login registry.redhat.io --authfile /var/lib/kubelet/config.json
dig registry.redhat.io
nslookup registry.redhat.io

log in to the target CoreOS node

ssh core@<target core os ip or FQDN>

DNS issue

172.20.0.1 is the internal DNS server; if it can reach the internet, it's OK.

Otherwise, add an external DNS server to agent-config.yaml:

dns-resolver:
  config:
    search:
      - supershift.com
    server:
      - 172.20.0.1
      - 8.8.8.8
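
After the node is up, the effective resolver config can be sanity-checked from the node itself (a sketch reusing the ssh access shown above):

ssh core@<target core os ip or FQDN>
cat /etc/resolv.conf
nslookup registry.redhat.io 8.8.8.8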

vmware virtual machine

  • VM Options => Boot Options => Enable UEFI Secure Boot
  • VM Options => Advanced => Configuration Parameters
    • disk.EnableUUID: TRUE

mirror registry

check the /etc/containers/registries.conf
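
On a running cluster the mirror entries in that file are usually rendered onto the nodes from ImageContentSourcePolicy / ImageDigestMirrorSet objects, so it is worth checking those as well (a sketch):

oc get imagecontentsourcepolicy
oc get imagedigestmirrorset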

deployment failure

journalctl -u kubelet -n 100
journalctl -u crio -n 100

Integration

Identity provider

LDAP

Subscription

HyperShift

Support

OpenShift Lifecycle: https://access.redhat.com/support/policy/updates/openshift

OpenShift AAA: https://docs.openshift.com/container-platform/4.17/authentication/index.html

OpenShift Identity Providers: https://docs.openshift.com/container-platform/4.17/authentication/understanding-identity-provider.html

Certificates: https://docs.openshift.com/container-platform/4.17/security/index.html

Deploying ODF on Bare Metal: https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/deploying_openshift_data_foundation_using_bare_metal_infrastructure/index

ODF Architecture (Internal/External approach): https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/planning_your_deployment/odf-architecture_rhodf#odf-architecture_rhodf

Mirror Registry: https://docs.openshift.com/container-platform/4.17/disconnected/mirroring/installing-mirroring-creating-registry.html

OpenShift Web Console customizations: https://docs.openshift.com/container-platform/4.17/web_console/customizing-the-web-console.html

Add Worker Node to an OpenShift cluster: https://docs.openshift.com/container-platform/4.17/nodes/nodes/nodes-nodes-adding-node-iso.html