GPU
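- A brief look at what a GPU really is (Part 1): Different computation models (133369) - Cool3c
- A brief look at what a GPU really is (Part 2): SIMT, combining the advantages of SIMD and MIMD (133370) - Cool3c
- Hard Tech: A brief look at what a GPU really is (Part 3): GPGPU moving toward general-purpose computing (134057) - Cool3c
- Hard Tech: Why GPU virtualization is so hard (Part 1) #CPU (157525) - Cool3c
- Hard Tech: Why GPU virtualization is so hard (Part 2) #api (157526) - Cool3c
- Hard Tech: Why GPU virtualization is so hard (Part 3) #nvidia (157527) - Cool3c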
- PCI devices
Is it a PCIe GPU or HGX?
flow
With GPUDirect enabled: GPU1 <=> IB card <=> IB card <=> GPU2 (shown in green)
With GPUDirect disabled: GPU1 <=> CPU <=> IB card <=> IB card <=> CPU <=> GPU2 (shown in purple)
Check the CPU slot location
GPU and NIC mapping
Look for the PIX entries (a connection traversing at most a single PCIe bridge) in the matrix:
GPU0 maps to NIC0
GPU2 maps to NIC1
GPU4 maps to NIC4
GPU6 maps to NIC5
nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX   SYS   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  SYS   PIX   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   SYS   SYS   PIX   SYS   32-63,96-127  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   32-63,96-127  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   SYS   SYS   PIX   32-63,96-127  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   SYS   SYS   32-63,96-127  1              N/A
NIC0  PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS   SYS
NIC1  SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS
NIC2  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     PIX   SYS   SYS
NIC3  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   X     SYS   SYS
NIC4  SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS
NIC5  SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
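Reading the GPU-to-NIC pairing off the matrix by eye is error-prone, so here is a minimal Python sketch that extracts it from text in the `nvidia-smi topo -m` layout. The sample is a trimmed subset of the matrix above (GPU0, GPU1, NIC0, NIC1 only); the function names `parse_topo` and `nearest_nics` are made up for illustration, and the parser assumes whitespace-separated columns as shown here.

```python
# Trimmed subset of the `nvidia-smi topo -m` matrix shown above.
TOPO_SAMPLE = """\
      GPU0  GPU1  NIC0  NIC1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  PIX   SYS   0-31,64-95    0              N/A
GPU1  NV18  X     SYS   SYS   0-31,64-95    0              N/A
NIC0  PIX   SYS   X     SYS
NIC1  SYS   SYS   SYS   X
"""


def is_device(token):
    """True for tokens such as GPU0 or NIC5 (but not the 'GPU' in 'GPU NUMA ID')."""
    return token[:3] in ("GPU", "NIC") and token[3:].isdigit()


def parse_topo(text):
    """Return {row_device: {column_device: link_type}} for the device columns only."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    columns = [tok for tok in rows[0] if is_device(tok)]
    matrix = {}
    for tokens in rows[1:]:
        if not is_device(tokens[0]):
            continue  # skip legend lines or anything that is not a matrix row
        # The cells right after the row label line up with the device columns;
        # the trailing affinity columns are ignored.
        matrix[tokens[0]] = dict(zip(columns, tokens[1:1 + len(columns)]))
    return matrix


def nearest_nics(matrix, link="PIX"):
    """For each GPU, list the NICs reachable over the given link type."""
    return {
        gpu: [dev for dev, kind in row.items()
              if dev.startswith("NIC") and kind == link]
        for gpu, row in matrix.items()
        if gpu.startswith("GPU")
    }


if __name__ == "__main__":
    print(nearest_nics(parse_topo(TOPO_SAMPLE)))
```

Feeding it the full matrix should give the same GPU and NIC pairs as the PIX entries noted under "GPU and NIC mapping"; the NIC Legend then translates each NICn into its mlx5_n device name.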
root@tester:~/tools/perftest-cuda/bin# mst status -v
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module is not loaded
PCI devices:
------------
DEVICE_TYPE       MST  PCI      RDMA    NET             NUMA
ConnectX6(rev:0)  NA   5d:00.0  mlx5_2  net-ibp93s0f0   0
ConnectX7(rev:0)  NA   c0:00.0  mlx5_5  net-ibp192s0    1
ConnectX7(rev:0)  NA   9c:00.0  mlx5_4  net-ibp156s0    1
ConnectX7(rev:0)  NA   40:00.0  mlx5_1  net-ibp64s0     0
ConnectX7(rev:0)  NA   1a:00.0  mlx5_0  net-ibp26s0     0
ConnectX6(rev:0)  NA   5d:00.1  mlx5_3  net-ibp93s0f1   0
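To connect the mlx5_N names from the NIC Legend with their PCI address and NUMA node, the `mst status -v` device table can be turned into a lookup keyed by RDMA device. This is a rough sketch, assuming the six whitespace-separated columns shown above; `parse_mst` is a hypothetical helper name, and the sample rows are copied from the output above.

```python
# Three rows copied from the `mst status -v` output above.
MST_SAMPLE = """\
DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX7(rev:0) NA 1a:00.0 mlx5_0 net-ibp26s0 0
ConnectX6(rev:0) NA 5d:00.0 mlx5_2 net-ibp93s0f0 0
ConnectX7(rev:0) NA c0:00.0 mlx5_5 net-ibp192s0 1
"""


def parse_mst(text):
    """Return {rdma_device: row_dict} from the PCI devices table."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header = [name.lower() for name in lines[0]]
    devices = {}
    for fields in lines[1:]:
        if len(fields) != len(header):
            continue  # skip separators or rows with missing columns
        row = dict(zip(header, fields))
        row["numa"] = int(row["numa"])
        devices[row["rdma"]] = row
    return devices


if __name__ == "__main__":
    for rdma, row in sorted(parse_mst(MST_SAMPLE).items()):
        print(rdma, row["pci"], row["net"], "NUMA", row["numa"])
```

With this, mlx5_0 (NIC0 in the topology matrix) resolves to PCI address 1a:00.0 on NUMA node 0, the same NUMA node as GPU0.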
sharing Nvidia GPU resources
Multi-Instance GPU (MIG)
Multi-Instance GPU (MIG) is similar to multi-process
single: the GPU is partitioned into multiple instances of the same size. For example, an NVIDIA A100 GPU can be divided into seven instances, each with equal resources.
mixed: the GPU is partitioned into instances of different sizes. This allows for a more flexible allocation of resources based on the specific needs of each workload.
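The difference between the two strategies can be illustrated with a small Python sketch that classifies a list of MIG profile names as single or mixed. It assumes A100 40GB profile names such as 1g.5gb and 3g.20gb, where the leading number is the count of compute slices out of seven. `classify_layout` is a hypothetical helper, and it only checks the compute-slice budget; real MIG placement also has memory and position constraints that are not modeled here.

```python
def compute_slices(profile):
    """Compute slices used by a profile name, e.g. '3g.20gb' uses 3."""
    return int(profile.split("g.", 1)[0])


def classify_layout(profiles, total_slices=7):
    """Return 'single' if all instances are the same size, else 'mixed'.

    Raises ValueError when the layout is empty or needs more compute
    slices than the GPU has. Placement rules are NOT checked.
    """
    if not profiles:
        raise ValueError("layout has no instances")
    used = sum(compute_slices(p) for p in profiles)
    if used > total_slices:
        raise ValueError(f"layout needs {used} slices, GPU has {total_slices}")
    return "single" if len(set(profiles)) == 1 else "mixed"


if __name__ == "__main__":
    print(classify_layout(["1g.5gb"] * 7))
    print(classify_layout(["3g.20gb", "2g.10gb", "1g.5gb", "1g.5gb"]))
```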
Time-Slicing GPUs
Time-Slicing GPUs is similar to a single thread with an event loop
GPU time-slicing can be used with bare-metal applications, virtual machines with GPU passthrough, and virtual machines with NVIDIA vGPU.
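The event-loop analogy can be made concrete with a toy round-robin model in Python: only one workload runs at a time, and each gets a fixed quantum before the next one takes its turn. This is purely an illustration of the idea; it is not how the NVIDIA scheduler is implemented, and `time_slice`, the workload names, and the abstract work units are all assumptions.

```python
from collections import deque


def time_slice(workloads, quantum):
    """Run workloads round-robin, one at a time.

    workloads: dict mapping a name to its remaining work units.
    Returns the timeline as a list of (name, units_run) tuples.
    """
    queue = deque(workloads.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(quantum, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Not finished: go to the back of the queue and wait for the next turn.
            queue.append((name, remaining - ran))
    return timeline


if __name__ == "__main__":
    print(time_slice({"train": 3, "infer": 1}, quantum=2))
```

Unlike MIG, nothing here partitions the resources: every workload sees the whole device during its turn, which is the contrast the multi-process and event-loop analogies are pointing at.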
Nvidia License Server
Nvidia CUDA
- CUDA Toolkit 12.4 Update 1 Downloads | NVIDIA Developer
- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/
- 1. Introduction — Installation Guide for Linux 12.3 documentation
- 1. Introduction — Quick Start Guide 12.4 documentation
- CUDA Compatibility
- Driver and Runtime
- CUDA Driver VS CUDA Runtime - Lei Mao's Log Book
- CUDA has two APIs: 1. The runtime api (libcudart.so) 2. The driver api (libcuda.... | Hacker News
- CUDA driver API, runtime API, and libraries - Zhihu
- CUDA C++ Programming Guide - 3.3. Versioning and Compatibility
- Cuda toolkit — Cuda driver. Before using Nvidia’s profiling tools… | by Gia Huy ( CisMine) | Medium
- CUDA Installation Guide for Linux - 18. Removing CUDA Toolkit and Driver
- Runfile
deb (local)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
runfile
sudo apt-get install build-essential gcc-12
sudo ln -s -f /usr/bin/gcc-12 /usr/bin/gcc
sudo sh cuda_12.6.2_560.35.03_linux.run --silent --driver
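The commands above mix a 12.4.1 toolkit package with a driver taken from the 12.6.2 runfile. Both installer filenames embed a CUDA version and a driver version, so a quick Python sketch can pull them out and confirm that the driver being installed is at least as new as the one the toolkit package was bundled with (a newer driver can run applications built with an older toolkit). `parse_versions` is a hypothetical helper, and the regular expression is an assumption based only on the two filenames used here.

```python
import re


def parse_versions(filename):
    """Return ((cuda version), (driver version)) tuples from an installer filename."""
    match = re.search(r"_(\d+\.\d+\.\d+)[-_](\d+(?:\.\d+)+)", filename)
    if match is None:
        raise ValueError(f"no CUDA/driver version found in {filename!r}")

    def to_tuple(version):
        return tuple(int(part) for part in version.split("."))

    return to_tuple(match.group(1)), to_tuple(match.group(2))


if __name__ == "__main__":
    deb = "cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb"
    run = "cuda_12.6.2_560.35.03_linux.run"
    toolkit_cuda, bundled_driver = parse_versions(deb)
    _, installed_driver = parse_versions(run)
    print("toolkit:", toolkit_cuda, "bundled driver:", bundled_driver)
    print("driver from runfile:", installed_driver)
    print("driver new enough:", installed_driver >= bundled_driver)
```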
NCCL
- GitHub - NVIDIA/cloud-native-stack: Run cloud native workloads on NVIDIA GPUs
- GitHub - NVIDIA/nccl: Optimized primitives for collective multi-GPU communication
docker
- Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.15.0 documentation
- Hands-on: using a GPU in a Docker environment - IT Bunny Lee
Driver
Debug
- NVIDIA GPU Debug Guidelines :: GPU Deployment and Management Documentation
- Bug #1915413 “Milan Delta A100 GPU fails to detect on Ubuntu 18....” : Bugs : Ubuntu
NIM
NVIDIA Inference Microservice - What is the NIM that NVIDIA CEO Jensen Huang talked about at Computex 2024? - CAVEDU Education Team Tech Blog
AMD ROCm
- ROCm quick start install guide for Linux — ROCm installation (Linux)
- New ROCm Documentation Site : r/ROCm
- System requirements (Linux) — ROCm installation (Linux)
- Compatibility matrix — ROCm Documentation
- GPU-enabled Message Passing Interface — GPU cluster networking documentation
ROCm Validation Suite(RVS)
- ROCm Validation Suite documentation — RVS 1.1.0 Documentation
- ROCmValidationSuite/docs/ug1main.md at master · ROCm/ROCmValidationSuite · GitHub
- example: /opt/rocm/share/rocm-validation-suite/conf/
- example: https://github.com/ROCm/ROCmValidationSuite/tree/master/rvs/conf