GPU

First check whether the system uses PCIe GPUs or an HGX (NVLink) baseboard:

mst status -v
nvidia-smi topo -m

flow

With GPUDirect enabled: GPU1 <=> IB card <=> IB card <=> GPU2. When GPUDirect is available, this path is shown in green.

With GPUDirect disabled: GPU1 <=> CPU <=> IB card <=> IB card <=> CPU <=> GPU2. Without GPUDirect, the path is shown in purple.
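One quick hint about which of the two paths a host is likely to take is whether a GPUDirect RDMA peer-memory kernel module is loaded. The sketch below assumes the module is called `nvidia_peermem` (or the older `nv_peer_mem`); the helper name `gdr_module_loaded` is made up for this note. Newer stacks may use dma-buf without either module, so a negative result is only a hint.

```shell
# Succeeds if a peer-memory module appears in /proc/modules-style text on stdin.
# Reading stdin keeps the check testable on machines without a GPU.
gdr_module_loaded() {
    awk '$1 == "nvidia_peermem" || $1 == "nv_peer_mem" { found = 1 } END { exit !found }'
}

if [ -r /proc/modules ] && gdr_module_loaded < /proc/modules; then
    echo "peer-memory module loaded: the direct GPU <=> IB card path is possible"
else
    echo "no peer-memory module found: traffic is likely staged through the CPU"
fi
```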

Confirm which CPU each slot is attached to

lspci -tvv
nvidia-smi topo -m

GPU and NIC mapping

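Look for the PIX entries: a GPU and a NIC that share a PIX entry sit behind the same PCIe bridge, so they should be paired.

In the output below, GPU0 maps to NIC0, GPU2 maps to NIC1, GPU4 maps to NIC4, and GPU6 maps to NIC5.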

nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     PIX     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     PIX     SYS     32-63,96-127    1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     PIX     32-63,96-127    1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
NIC0    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS
NIC1    SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS
NIC4    SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
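Reading the PIX pairs off the matrix by eye gets tedious on larger hosts. The following is a minimal sketch that extracts them with awk. It assumes the plain-text layout shown above (a header row of GPUn/NICn names followed by one row per device); the function name `gpu_nic_pix_pairs` is made up for this note. Some nvidia-smi versions may add terminal formatting to the header, in which case the header match would need adjusting.

```shell
# Print "GPUx NICy" for every GPU row whose NIC column reads PIX.
# Reads `nvidia-smi topo -m` style text on stdin.
gpu_nic_pix_pairs() {
    awk '
        # Header row: both the first and second fields are column names.
        !have_header && $1 ~ /^GPU[0-9]+$/ && $2 ~ /^(GPU|NIC)[0-9]+$/ {
            # Data rows start with a row label, so column i lands in field i+1.
            for (i = 1; i <= NF; i++) col[i + 1] = $i
            have_header = 1
            next
        }
        # GPU data rows: report every NIC column that holds PIX.
        have_header && $1 ~ /^GPU[0-9]+$/ {
            for (i = 2; i <= NF; i++)
                if ($i == "PIX" && col[i] ~ /^NIC[0-9]+$/) print $1, col[i]
        }
    '
}

# Only call the real tool when it is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi topo -m | gpu_nic_pix_pairs
fi
```

The NIC Legend above then translates each NICn into its mlx5_n device name.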

root@tester:~/tools/perftest-cuda/bin# mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module is not loaded
PCI devices:
------------
DEVICE_TYPE             MST      PCI       RDMA            NET                                     NUMA
ConnectX6(rev:0)        NA       5d:00.0   mlx5_2          net-ibp93s0f0                           0

ConnectX7(rev:0)        NA       c0:00.0   mlx5_5          net-ibp192s0                            1

ConnectX7(rev:0)        NA       9c:00.0   mlx5_4          net-ibp156s0                            1

ConnectX7(rev:0)        NA       40:00.0   mlx5_1          net-ibp64s0                             0

ConnectX7(rev:0)        NA       1a:00.0   mlx5_0          net-ibp26s0                             0

ConnectX6(rev:0)        NA       5d:00.1   mlx5_3          net-ibp93s0f1                           0
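To go from an mlx5_n name to its PCI address and NUMA node, the device table above can be reduced to three columns. This is a rough sketch that assumes every row has all six columns filled in, as in the output above; if a column is empty the field positions shift. The function name `rdma_pci_numa` is made up for this note.

```shell
# Print "RDMA_DEVICE PCI_ADDRESS NUMA_NODE" per adapter, sorted by device index.
# Reads `mst status -v` style text on stdin.
rdma_pci_numa() {
    awk '$4 ~ /^mlx5_[0-9]+$/ { print $4, $3, $6 }' | sort -t_ -k2,2n
}

# Only call the real tool when it is installed (it may need root).
if command -v mst >/dev/null 2>&1; then
    mst status -v | rdma_pci_numa
fi
```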

Multi-Instance GPU (MIG)

Multi-Instance GPU (MIG) is similar in spirit to multi-processing: the GPU is split into separate instances that run side by side.

single: the GPU is partitioned into multiple instances of the same size. For example, an NVIDIA A100 GPU can be divided into seven instances, each with equal resources.

mixed: the GPU is partitioned into instances of different sizes. This allows a more flexible allocation of resources based on the specific needs of each workload.
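As a rough illustration of the equal-sized layout, the sketch below walks through the nvidia-smi steps for seven instances. It assumes GPU index 0 and the `1g.5gb` profile name of an A100 40GB; other GPUs expose different profiles, so check the `-lgip` listing first. The `run` wrapper and the `DRY_RUN` and `GPU_INDEX` variables are made up for this note. Commands are only printed unless `DRY_RUN=0` is set, because repartitioning disrupts running workloads.

```shell
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
GPU_INDEX="${GPU_INDEX:-0}"     # assumption: GPU 0 is the one to partition

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Enable MIG mode on the chosen GPU (the GPU may need to be idle or reset).
run sudo nvidia-smi -i "$GPU_INDEX" -mig 1
# List the GPU instance profiles this GPU actually supports.
run nvidia-smi mig -i "$GPU_INDEX" -lgip
# Equal-sized layout: seven 1g.5gb GPU instances, each with a compute instance (-C).
run sudo nvidia-smi mig -i "$GPU_INDEX" -cgi 1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb -C
# Show the resulting MIG devices.
run nvidia-smi -L
```

A mixed layout would pass different profile names to `-cgi` instead of seven identical ones.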

Time-Slicing GPUs

GPU time-slicing is similar to a single thread with an event loop: workloads take turns on the same GPU instead of each getting its own partition.

GPU time-slicing can be used with bare-metal applications, virtual machines with GPU passthrough, and virtual machines with NVIDIA vGPU.

NVIDIA License Server

- :star: License System User Guide - NVIDIA Docs
- Creating a License Service for NVIDIA AI Enterprise or Virtual GPU - YouTube

NVIDIA CUDA

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
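After the package install, the toolkit is usually not on the shell's search path yet. The following sketch assumes the default install prefix `/usr/local/cuda-12.4`; adjust `CUDA_HOME` if the toolkit landed elsewhere.

```shell
# Assumption: cuda-toolkit-12-4 was installed to its default prefix.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.4}"

# Prepend the toolkit to the search paths without leaving a stray ':' behind.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Confirm the compiler is reachable.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found under $CUDA_HOME/bin"
fi
```

Putting the two `export` lines in `~/.bashrc` makes the change persistent.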

runfile

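# The runfile builds kernel modules, so a compiler toolchain is needed first.
sudo apt-get install build-essential gcc-12
sudo ln -s -f /usr/bin/gcc-12 /usr/bin/gcc
# --silent --driver installs only the bundled 560.35.03 driver, not the toolkit.
sudo sh cuda_12.6.2_560.35.03_linux.run --silent --driver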
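Because `--silent` suppresses the installer's output, it may be worth confirming afterwards that the expected driver is the one actually loaded. This is a minimal sketch; the helper name `driver_matches` is made up for this note, and the expected version is simply read off the runfile name.

```shell
EXPECTED_DRIVER="560.35.03"    # taken from the runfile name

# Succeeds only if stdin has at least one line and every line equals the wanted version.
driver_matches() {
    awk -v want="$1" 'NF { n++; if ($1 != want) bad = 1 } END { exit (n == 0 || bad) }'
}

# Only query the driver when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi --query-gpu=driver_version --format=csv,noheader | driver_matches "$EXPECTED_DRIVER"; then
        echo "driver $EXPECTED_DRIVER is active on all GPUs"
    else
        echo "driver version differs from $EXPECTED_DRIVER (a reboot or module reload may be pending)"
    fi
fi
```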

NCCL

docker

Driver

Official Drivers | NVIDIA

Debug

NIM

NVIDIA Inference Microservice - What is the NIM that NVIDIA CEO Jensen Huang talked about at Computex 2024? - CAVEDU Education Team Tech Blog

Training and certification

AMD ROCm

ROCm Validation Suite(RVS)

ROCm Validation Suite documentation — RVS 1.1.0 Documentation
ROCmValidationSuite/docs/ug1main.md at master · ROCm/ROCmValidationSuite · GitHub
- example: /opt/rocm/share/rocm-validation-suite/conf/
- example: https://github.com/ROCm/ROCmValidationSuite/tree/master/rvs/conf
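The example configurations are passed to RVS with `rvs -c`. The sketch below only prints one candidate command per `.conf` file found in the local example directory, so nothing runs until a line is copied out by hand. The function name `list_rvs_commands` and the `RVS_CONF_DIR` variable are made up for this note, and it assumes `rvs` is on the PATH.

```shell
RVS_CONF_DIR="${RVS_CONF_DIR:-/opt/rocm/share/rocm-validation-suite/conf}"

# Print one "rvs -c <file>" command line per .conf file found in a directory.
list_rvs_commands() {
    for f in "$1"/*.conf; do
        [ -e "$f" ] || continue     # directory missing or no .conf files
        echo "rvs -c $f"
    done
}

list_rvs_commands "$RVS_CONF_DIR"
```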
