CUDA 18VOL · CONTENT-FIRST · A4 LANDSCAPE · 20p

Distributed Communication & Parallelism in One Volume

NCCL Internals · 5D Parallelism (DP·TP·PP·EP·CP) · NVLink/NVSwitch · Overlap · α-β Model

Volume V15/18

Tier T5 Distributed / Serving

Prerequisite V01 (PMPP core)

Purpose A map of multi-GPU, multi-node communication

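Legend

Key term (yellow background)

Very important · table header

Definition / formula box

Example · worked

Red: caution · common mistakes

★ Key point (≤3 per page)

(!) Mnemonic (≤5 per volume)

↗ Cross-reference between volumes (xref)

α · β: latency · inverse bandwidth

N: number of ranks (GPUs)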


NCCL docs · Megatron-LM · DeepSpeed/ZeRO · FSDP · Ring/Tree allreduce papers · NVIDIA DGX H100 guide

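1 Collective definition

Definition collective: group communication in which all N ranks take part at the same time. Every participant must issue the call with the same op and the same root setting; otherwise the call can deadlock instead of completing.

Counterpart of point-to-point (SEND/RECV)
Semantics: (1) input set, (2) output set, (3) reduction op (sum/min/max/prod)
Defined per communicator, with ranks 0..N-1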

2 AllReduce ★ most important y[i] = Σ x_r[i]

y_r[i] = ⊕_{r'=0..N−1} x_{r'}[i]  ∀ r
r: rank · ⊕: reduction (sum · max · min · prod). Every rank obtains the same result y.

DP gradient sync · summation after a TP row-parallel layer
= AllGather ∘ ReduceScatter (decomposable: ReduceScatter first, then AllGather)

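3 Reduce single root

y_root[i] = ⊕_{r'} x_{r'}[i]  (the other ranks only contribute x)
A restricted AllReduce: only the single root holds the result.

In practice: log and statistics aggregation. Rarely used in training, where AllReduce takes its place.

4 Broadcast ↔ pairs with Reduce

y_r[i] = x_root[i]  ∀ r
The root's buffer is replicated to every rank.

Initial weight distribution · random seed synchronization
Traffic: (N−1)·S in total · theoretical minimum step count = ⌈log₂N⌉ (tree)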

5 AllGather ★

y_r[k, i] = x_k[i]  ∀ r, k∈[0,N)
The shards of all ranks are concatenated, so every rank ends up holding the full tensor.

ZeRO-3 / FSDP param unshard ↗ V17 §3
Replaces the input broadcast for TP column-parallel
ring: bytes/GPU = (N−1)/N · S

6 ReduceScatter ★

y_r[i] = ⊕_{r'} x_{r'}[r·C + i]  i∈[0,C), C = S/N
Reduce, then split into chunks and hand one chunk to each rank.

ZeRO-2 grad sharding · FSDP backward
AllReduce = AllGather ∘ ReduceScatter (decomposition identity)

7 AllToAll ★ shuffle

y_r[s, i] = x_s[r, i]  ∀ r, s
A rank×rank transpose: rank r receives from rank s the chunk at row s, column r.

EP (MoE) token routing · SP (Ulysses) head↔seq transpose
Traffic: (N−1)/N · S per rank · ring / Bruck / pairwise algorithms

8 Collective 의미론 매트릭스

op	input	output	reduce
Broadcast	root: S	all: S	—
Reduce	all: S	root: S	⊕
AllReduce	all: S	all: S	⊕
AllGather	all: S/N	all: S	—
ReduceScatter	all: S	all: S/N	⊕
AllToAll	all: S	all: S	—
Scatter	root: S	all: S/N	—
Gather	all: S/N	root: S	—

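S = total message size in bytes. N = number of ranks. NCCL: Scatter/Gather are rarely used because AllGather/ReduceScatter cover the same needs better.

9 Decomposition identity (!) AR=AG∘RS

AllReduce = AllGather ∘ ReduceScatter
Read right to left: ReduceScatter runs first, then AllGather. The reverse order does not produce an AllReduce. This identity is the key to deriving the ring allreduce algorithm.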
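The identity can be checked on plain Python lists. This is a minimal sketch, not NCCL code: each rank is modeled as one list, the reduction is a sum, and `reduce_scatter`, `all_gather`, and `all_reduce` are hypothetical helper names chosen for illustration.

```python
from functools import reduce
import operator


def reduce_scatter(inputs, op=operator.add):
    """inputs[r] is rank r's full vector (length S, divisible by N).
    Rank r receives reduced chunk r of length C = S / N."""
    n = len(inputs)
    c = len(inputs[0]) // n
    return [
        [reduce(op, (inputs[src][r * c + i] for src in range(n))) for i in range(c)]
        for r in range(n)
    ]


def all_gather(shards):
    """shards[r] is rank r's chunk; every rank ends with the concatenation."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(inputs, op=operator.add):
    """Reference semantics: every rank gets the element-wise reduction."""
    n = len(inputs)
    reduced = [reduce(op, (inputs[src][i] for src in range(n)))
               for i in range(len(inputs[0]))]
    return [list(reduced) for _ in range(n)]


inputs = [[r * 10 + i for i in range(8)] for r in range(4)]  # N=4, S=8 elements
print(all_gather(reduce_scatter(inputs)) == all_reduce(inputs))
```

ReduceScatter leaves each rank with S/N elements and AllGather restores S, which is why the composition only type-checks in that order.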

1 Ring AllReduce ★ bandwidth-optimal RS→AG

Setup Arrange the N ranks in a logical ring (r → r+1 mod N). Split message S into N chunks: C = S/N. The algorithm runs in 2 phases:

Phase A (ReduceScatter): N−1 steps; in each step a rank sends one chunk, receives one chunk, and accumulates it with a local reduce. At the end each rank holds exactly one fully reduced chunk.
Phase B (AllGather): N−1 steps; the fully reduced chunks rotate around the ring. At the end every rank holds the same complete result S.

2 Ring step-by-step diagram (N=4)

C=S/4 ; rank r sends chunk r first and ends Phase A owning the complete chunk (r+1) mod N.

Phase A (ReduceScatter, step 0..N-2):
 step0:  r → r+1  send chunk (r)     recv chunk (r-1)    local += recv
 step1:  r → r+1  send chunk (r-1)   recv chunk (r-2)    local += recv
 step2:  r → r+1  send chunk (r-2)   recv chunk (r-3)    local += recv
 ─ after N-1=3 steps:  r0 holds chunk 1 (complete) , r1→2 , r2→3 , r3→0

Phase B (AllGather, step 0..N-2):
 step0:  r → r+1  send complete chunk       recv upstream's complete chunk
 step1:  r → r+1  send chunk just received  recv the chunk from two ranks upstream
 step2:  r → r+1  …
 ─ after N-1=3 steps:  every rank = [c0,c1,c2,c3] full S

Total steps = 2(N−1). Per-step transfer = C = S/N. Total sent per rank = 2(N−1)·S/N → BW-optimal.
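The two phases above can be traced with a small simulation. This is a sketch under simplifying assumptions: each chunk is a single number, all ranks send "simultaneously" by snapshotting messages before applying them, and `ring_allreduce` is a hypothetical name, not an NCCL function.

```python
def ring_allreduce(data):
    """Simulate ring AllReduce (sum); data[r][c] is chunk c on rank r.

    Returns (buffers, chunks_sent_per_rank, value_owned_after_phase_a).
    """
    n = len(data)
    buf = [list(row) for row in data]
    sent = [0] * n

    # Phase A (ReduceScatter): at step s, rank r sends chunk (r - s) mod n.
    for step in range(n - 1):
        msgs = [((r - step) % n, buf[r][(r - step) % n]) for r in range(n)]
        for r, (idx, val) in enumerate(msgs):
            buf[(r + 1) % n][idx] += val  # receiver accumulates
            sent[r] += 1

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    owned = [buf[r][(r + 1) % n] for r in range(n)]

    # Phase B (AllGather): at step s, rank r forwards chunk (r + 1 - s) mod n.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, buf[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (idx, val) in enumerate(msgs):
            buf[(r + 1) % n][idx] = val  # receiver overwrites with final value
            sent[r] += 1

    return buf, sent, owned


data = [[4 * r + c for c in range(4)] for r in range(4)]
result, sent, owned = ring_allreduce(data)
print(result[0], sent, owned)
```

Each rank sends 2(N−1) chunks of size S/N, which reproduces the 2(N−1)/N · S bytes/GPU figure.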

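3 Ring complexity

T_ring = 2(N−1)·α + 2(N−1)/N · S · β
bytes/GPU = 2(N−1)/N · S → 2S as N→∞
α: per-message latency · β: time per byte (1/BW) · S: total message size

The larger S is, the more ring wins (β-term dominates)
The latency term grows as N·α, so ring loses for small S and large N; switch to tree there

4 Tree AllReduce latency-optimal small

Setup binary tree: Reduce from the leaves to the root, then Broadcast from the root to the leaves. steps = 2⌈log₂N⌉.

T_tree = 2⌈log₂N⌉·α + 2⌈log₂N⌉·S·β
Latency shrinks to log N, but without pipelining every step sends all of S, so the β-term grows.

Better than ring for small S / large N.

5 Double Binary Tree ★ often the NCCL default

idea Run two complementary binary trees at the same time, each carrying half of the data. A rank that is a leaf in one tree is an interior node in the other, which fixes the wasted upstream bandwidth at the leaves of a single tree.

Uses both directions of the full-duplex links → the two trees together reach full BW
NCCL: frequently selected on 2+ nodes
steps ≈ 2⌈log N⌉ · bytes/GPU ≈ 2S with pipelining (algo-specific)

6 Recursive Halving-Doubling

Setup Assume N = 2^k. log N steps of RS (partner distance 2ⁱ, message halves) + log N steps of AG (partner distance 2ⁱ, message doubles). Rabenseifner's algorithm.

T_HD = 2 log N · α + 2(N−1)/N · S · β
Tree-level latency plus ring-level bandwidth → the strengths of both.

Clean only when N is a power of two. Non-power-of-2 needs virtual rank adjustment.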

7 Algorithm comparison ★

algo	latency (α)	bytes/GPU (β)	preferred when
Ring	2(N−1)	2(N−1)/N · S	large S
Tree	2 log N	2 log N · S	small S, large N
Double BT	2 log N	≈ 2S	inter-node, mixed
Halving-Doubling	2 log N	2(N−1)/N · S	N = 2ᵏ
NVLS (in-net)	~ 2	≈ S	intra-switch, H100+

bytes/GPU 는 양방향 합. NCCL 은 tuner로 자동 선택 · NCCL_ALGO 로 강제 가능.

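8 Theoretical lower bound (!) 2(N-1)/N · S

Ring's 2(N−1)/N·S matches the AllReduce bandwidth lower bound when the reduction is performed on the end hosts: no such algorithm can send less (Patarasuk & Yuan 2009). In-network reduction such as NVLS sits outside this model, which is how it reaches ≈ S.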

1 NCCL 계층 용어

용어	의미
Communicator	참여 rank 집합 (`ncclComm_t`)
Rank	comm 내 GPU 하나 (0..N-1)
Channel	병렬 전송 경로 (= ring/tree 인스턴스)
Protocol	LL · LL128 · Simple
Algo	Ring · Tree · CollNet · NVLS
Transport	P2P(NVLink) · SHM · NET(IB/TCP)

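2 Communicator creation flow

ncclGetUniqueId (rank 0) → distribute via MPI or a file
Each rank calls ncclCommInitRank(&comm, N, id, r)
Internally: topology detection · graph search · channel build
On completion the comm handle holds the entire communication plan

Initialization costs O(N), roughly hundreds of ms. Always reuse the communicator.

3 Channel ★ parallel rings

Definition The same collective is split across several logical rings that run in parallel, using SM occupancy and link BW at the same time.

channel count = 4~32 by default (chosen automatically from topology and size)
Tune with NCCL_MIN_NCHANNELS · NCCL_MAX_NCHANNELS
More channels mean costlier setup and wasted SMs for small messages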

4 Topology Detection

단계 nccl은 시작 시 host graph를 구성: PCIe/NVLink/NVSwitch/NIC 링크를 그래프 edge로 놓고 BW·hop·NUMA를 가중치로 설정. XML cache (NCCL_TOPO_FILE) 가능.

source: /sys/class/nvidia, nvidia-smi topo -m 내부 API, HWLOC
NIC 는 PCIe root 기준으로 CPU affinity·GPU locality 판단
결과로 ring/tree graph 후보 집합 생성

5 Ring 구성 알고리즘

candidate Hamilton cycle 탐색 (GPU 노드를 한 번씩 방문)
edge 가중치 최소화: NVLink 선호 · PCIe 회피 · NIC traversal 최소
multi-node: intra-node local ring + inter-node stitch (single NIC per rail)
결과: graph->inter/intra 경로 확정

6 Protocol 차이 ★

proto	size	방식
LL	< 16 KB	flag bit 끼워 low-latency, BW 절반
LL128	중간	128B 단위, 120/128 payload
Simple	large	flag 없음, full BW

NCCL_PROTO=LL,LL128,Simple 조합 지정 가능.

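7 Proxy Thread

Definition On inter-node segments a host CPU proxy drives the NIC on the GPU's behalf: the GPU kernel writes into a shared-memory FIFO, and the proxy thread drains it and posts RDMA WRITEs.

With GPUDirect RDMA the NIC transfers GPU memory directly (the proxy handles metadata only) ↗ §16
Bottleneck segment: proxy thread (CPU) ↔ NIC PCIe lanes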

8 Kernel launch path

host API (ncclAllReduce)
 └─ enqueue (graph+args)
    └─ kernel launch (stream)
       ├─ intra-node: P2P load/store NVLink
       ├─ inter-node: FIFO write → proxy → NIC
       │                                └─ NET send/recv
       └─ reduce (in-kernel) + progress

One collective = one CUDA kernel launch. CUDA Graph capture is possible.

9 Graph Mode & Capture

NCCL 2.11+ CUDA Graph capture 지원 · launch overhead 제거
decode path 에 유용 ↗ V16 §10
NCCL_GRAPH_MIXING_SUPPORT=1

1 NVLink 세대별 BW

gen	GPU	링크/GPU	BW/GPU (양방향)
NVLink 2	V100	6	300 GB/s
NVLink 3	A100	12	600 GB/s
NVLink 4	H100	18	900 GB/s
NVLink 5	B200	18	1,800 GB/s

공식 스펙 (per-GPU aggregate, both directions). PCIe Gen5 x16 = 128 GB/s 와 비교 시 7× 이상.

2 PCIe vs NVLink ★

항목	PCIe Gen5	NVLink 4
raw BW	~128 GB/s	~900 GB/s
latency	μs 수십	μs 수 (P2P)
topology	tree (root PCIe)	mesh / switch
NUMA	socket 경계 비용	없음

3 NVSwitch fabric

정의 NVSwitch: NVLink non-blocking crossbar. DGX A100: 6 switch · DGX H100: 4 3세대 switch. 모든 GPU pair가 full BW 로 통신 가능 (n×n bipartite).

H100 NVSwitch: 64 port NVLink4, switch당 aggregate 3.2 TB/s
8-GPU DGX 내부: any-to-any non-blocking

4 DGX H100 8-GPU topology

              NVSwitch (x4)
       ╔══════════════════════════╗
       ║   full mesh crossbar     ║
       ╚╤═╤═╤═╤═╤═╤═╤═╤══════════╝
        │ │ │ │ │ │ │ │
       G0 G1 G2 G3 G4 G5 G6 G7
        │                    │
       NIC0 ... (8 CX-7 NIC, 400G IB)
        │
       ── ConnectX-7 → 2 rail (rail-optimized)

각 GPU → 2 NIC (via PXN). NVLink4 900 GB/s bi-dir intra-node.

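5 Multi-node: IB Fabric

8 GPUs + 8 CX-7 NICs per node (each 400 Gb/s = 50 GB/s)
rail-optimized: NICs on the same rail share a leaf switch → low latency for inter-node same-rail traffic
NIC aggregate per node ≈ 400 GB/s, i.e. 50 GB/s per GPU per direction versus 450 GB/s per direction on NVLink 4 → roughly 1/9 of NVLink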

6 NVLink BW 이론 활용 ring

BW_effective = min(NVLink_GPU, NIC_GPU) [layer에 따라]
intra-node AllReduce BW ≈ NVLink · (N-1)/N
inter-node AllReduce BW ≈ min(NIC, NVLink intra) / (hop depth) hop depth: node 사이 switch 통과 수

7 왜 full-mesh 가 필요한가

이유 ring allreduce는 neighbor-only 통신이면 충분. 하지만 tree/DBT/AllToAll 은 임의 pair 통신 필요 → NVSwitch 없으면 hop 중첩으로 BW 낭비.

TP + EP 조합: alltoall 빈발 → fabric 필수
NVSwitch 없는 8-way 보드: ring 만 full BW, 나머지는 절반 이하

8 비(非) DGX 구성 warning

hybrid cube-mesh (NVLink 일부만 연결, 예: 4-GPU 보드): 논리 ring은 가능하지만 AllToAll·Tree 는 PCIe hop 발생. NCCL topology detection 이 자동 대응하지만 peak BW 의 30~50% 수준.

1 NVLS 개념 ★ switch-in reduce

정의 NVLink SHARP: 3세대 NVSwitch (H100+) 내부 ALU 가 스트림 전송 중인 payload 를 in-network reduce. 즉 switch 가 단순 라우팅이 아니라 reduction step 자체를 수행.

기존 IB SHARP 의 NVLink fabric 판
AllReduce / ReduceScatter / Broadcast 가속

2 기존 Ring 과의 차이

Ring (no NVLS):
  GPU0 → GPU1 → GPU2 → ... → GPU_{N-1} (N-1 step)
   각 GPU 가 부분합을 수행

NVLS (in-switch):
  GPU_r ──┐
  GPU_r' ─┼─► switch (ALU: sum) ──► all GPUs
  GPU_r'' ┘
   1 shot multicast-reduce

3 Hardware requirements

H100 SXM + NVSwitch 3 (DGX H100 / NVL72)
CUDA 12.0+ · NCCL 2.17+
Multicast object support in the CUDA driver API (cuMulticast*)
Hopper PTX: multimem.ld_reduce, multimem.st

4 Bandwidth gain ★

AllReduce_ring: bytes/GPU = 2(N−1)/N · S ≈ 2S
AllReduce_NVLS: bytes/GPU ≈ S · (1/N + 1) ≈ S
Data goes up to the switch once and comes down once, so the latency step count is fixed at 2.

Effective BW gain of 2× (in theory)
Measured: 1.5~1.9× speedup on large messages

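5 Multicast Object

Definition Multicast memory: physical memory slices on several GPUs are mapped behind a single virtual address. NVSwitch routes accesses to that VA to its broadcast/reduce hardware.

CUmemGenericAllocationHandle mc;
CUmulticastObjectProp mc_prop = {...};      // numDevices, size, handleTypes, flags
cuMulticastCreate(&mc, &mc_prop);
cuMulticastAddDevice(mc, dev);              // repeat for every participating device
cuMulticastBindAddr(mc, 0, addr, size, 0);  // (handle, mcOffset, memptr, size, flags)
// access through multimem.* PTX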

6 multimem PTX

instr	의미
`multimem.ld_reduce`	switch ALU 에서 reduce 후 load
`multimem.st`	모든 참여 GPU 에 broadcast store
`multimem.red`	atomic reduce to multicast mem

7 Enabling in NCCL

NCCL_ALGO=NVLS · NCCL_NVLS_ENABLE=1
Detected automatically at comm init: active when the GPUs, the NVSwitch fabric, and the driver all support NVLink multicast
Limit: intra-node only (single NVSwitch fabric)
multi-node: CollNet SHARP (IB SHARP) is separate → ↗ §16

8 용도별 효과

용도	효과
small AllReduce (grad bucket)	latency ↓↓ (2 step)
large AllReduce	BW 1.5~2×
AllGather	multicast store 로 가속
AllToAll	적용 X — pairwise 이므로

9 Trade-off 주의

NVLS 는 mmap 주소 공유 에 의존 — 장애 시 recover 복잡. fine-grained sync 섞이면 효용 하락. FP8 / INT 은 switch ALU 지원 dtype 만 가속 (fp16/bf16/fp32 기본).

1 문제 정의

상황 DGX 노드: GPU 8 · NIC 8. 각 GPU 는 "local NIC" 하나 (같은 PCIe root). 하지만 rail-optimized 토폴로지 에서는 GPU 와 목적지 GPU 가 서로 다른 rail 이면 local NIC 이 아니라 다른 rail 의 NIC 을 써야 inter-node single-hop.

2 Rail-optimized 란

Leaf switches: L0 L1 L2 ... L7 (rail 별 1개)
node A : G0→NIC0→L0 · G1→NIC1→L1 · ...
node B : G0→NIC0→L0 · G1→NIC1→L1 · ...
 같은 rank index 끼리는 같은 leaf → 1 hop
 다른 rank 끼리는 spine switch 경유 → 2+ hop

학습 시 AllReduce 는 ring 구성상 주로 same-rank pair → rail-matched 가 최적.

3 PXN 아이디어 ★ (!) local P2P then NET

전략 GPU_r 이 목적지에 맞는 다른 GPU(같은 node, 원하는 rail) 로 NVLink P2P 전송 → 그 GPU 가 자기 NIC 으로 NET send. 결과: NUMA 경계를 우회해 항상 local-rail NIC 사용.

4 PXN 흐름도

GPU_r (rail i, 목적지 rail j)
   │ ① NVLink P2P
   ▼
GPU_k (같은 node, rail j)
   │ ② RDMA WRITE via NIC_j
   ▼
NET (leaf-L_j 직행)
   ▼
GPU_r' (remote node, rail j)

비교 (no-PXN):
GPU_r → NIC_i → spine → L_j → GPU_r'  (2 hop)

5 효과 수치

scenario	no-PXN	PXN
same rail	1 hop	1 hop (의미 無)
cross rail	2+ hop	1 hop (+ NVLink)
BW 활용	spine BW 나눔	leaf 만 사용
NIC contention	높음	분산

6 NCCL 활성화

NCCL_PXN_DISABLE=0 (기본값) — auto 활성
NVLink P2P 가능 pair 간에만 동작
ACS 이슈 없어야 함 (ACS = PCIe Access Control Services, 켜져있으면 P2P 차단)

7 NUMA & CPU affinity

정의 NUMA node = CPU socket + 로컬 DRAM + 로컬 PCIe root. GPU·NIC 가 서로 다른 NUMA 에 있으면 QPI/UPI 경유 → CPU proxy thread 가 cross-NUMA 접근 시 latency·BW 모두 하락.

DGX: 2 NUMA · 4 GPU per NUMA
PXN 이 cross-NUMA 만큼은 NVLink 로 우회

8 Rail 활용 체크리스트

GPU ↔ NIC 쌍 확인: nvidia-smi topo -m
IB 인터페이스 bind: NCCL_IB_HCA, NCCL_IB_GID_INDEX
rail 인덱스 일치: node 간 같은 rank → 같은 leaf
CPU affinity: proxy thread 를 GPU-local NUMA 에 pin
ACS off · IOMMU pass-through

9 Intra-node 변형

PXN 이 intra-node 병목 에도 쓰인다. 예: NVLink 가 없는 pair 간 (AMD CPU + consumer GPU) 에 PCIe P2P 로 우회. 단 BW 는 PCIe 한계.

1 TP 정의 matmul 분할

정의 weight matrix 를 hidden 차원 으로 쪼개 각 GPU 가 부분곱 수행, collective 로 합산/concat. layer 하나가 여러 GPU 에 걸쳐 존재.

Megatron-LM (Shoeybi 2019) 가 표준
TP group = 동일 layer 를 공유하는 rank 집합

2 Column-parallel ★

W ∈ ℝ^K×N → W_r ∈ ℝ^K×(N/P) (열 분할)
y_r = x · W_r (각 GPU 가 N/P 열 보유)
y = concat(y₀,…,y_P−1) (필요 시 AllGather) x 는 전 rank 복제 · 입력 통신 無 · 출력 partition

3 Row-parallel ★

W ∈ ℝ^K×N → W_r ∈ ℝ^(K/P)×N (행 분할)
y_r = x_r · W_r (입력 x 도 partition)
y = AllReduce(y₀,…,y_P−1) x 는 partition 입력 (앞 단계가 col-parallel 이므로 OK)

4 Megatron MLP pattern ★ col→row→AR

MLP(x) = GeLU(x · W₁) · W₂

 W₁ : column-parallel (H×4H, split along the 4H columns)
   y = GeLU(x · W₁ᵣ)     # output partition
 W₂ : row-parallel (4H×H, split along the 4H rows)
   z = y · W₂ᵣ
   out = AllReduce(z)      # ← 1 AllReduce / MLP

The activation (GeLU) is element-wise, so it can be applied while the tensor stays partitioned.
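The col→row pattern can be verified numerically with pure-Python matrices. This is a sketch under assumptions: `tp_mlp` and the helpers are hypothetical names, each "rank" is just a loop iteration, and the AllReduce is modeled as an element-wise sum of the per-rank partial outputs.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gelu(m):
    return [[0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))) for x in row] for row in m]


def split_cols(w, p):
    c = len(w[0]) // p
    return [[row[r * c:(r + 1) * c] for row in w] for r in range(p)]


def split_rows(w, p):
    c = len(w) // p
    return [w[r * c:(r + 1) * c] for r in range(p)]


def tp_mlp(x, w1, w2, p):
    """W1 column-parallel, W2 row-parallel, one summing AllReduce at the end."""
    partials = []
    for w1_r, w2_r in zip(split_cols(w1, p), split_rows(w2, p)):
        y_r = gelu(matmul(x, w1_r))         # [B, 4H/P], no communication needed
        partials.append(matmul(y_r, w2_r))  # partial sum of shape [B, H]
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(part[i][j] for part in partials) for j in range(cols)]
            for i in range(rows)]


x = [[0.5, -1.0], [2.0, 0.25]]                                           # B=2, H=2
w1 = [[((i * 8 + j) % 5 - 2) * 0.1 for j in range(8)] for i in range(2)]  # H x 4H
w2 = [[((i * 2 + j) % 3 - 1) * 0.5 for j in range(2)] for i in range(8)]  # 4H x H
reference = matmul(gelu(matmul(x, w1)), w2)
print(tp_mlp(x, w1, w2, p=4), reference)
```

Because GeLU acts per element, applying it to each column slice gives the same values as applying it to the full intermediate, so only the final partial sums need communication.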

5 Attention 패턴

QKV proj   : column-parallel (head 차원 분할)
 each GPU → head subset 의 Q,K,V

attention  : local (heads 독립)

out proj   : row-parallel + AllReduce
 ← 1 AllReduce / attention block

head 수가 P 의 배수 여야 깔끔. GQA/MQA 에선 KV head 수 제약 ↗ V07 §6.

6 Communication volume (per layer)

forward: 2× AllReduce (MLP + Attn) @ size B·S·H
backward: 2× additional AllReduce
total forward+backward = 4× AllReduce(B·S·H)
B: batch · S: seq · H: hidden dim

7 TP 크기 선택

TP size	제약
≤ 8 (intra-node)	NVLink 필수 — AllReduce BW
= 4	GQA 4-KV-head 에 친화
> 8 (cross-node)	IB AllReduce 병목 — 일반적으로 회피
= 1	TP 없음 (DP 만)

8 왜 AllReduce 위치가 그런가

원리 col→row 조합은 중간에 통신을 피한다. col-output 을 row-input 으로 자연 매치 (같은 partition 축). 마지막에 partial sum 을 AR 하면 full output. 반대 순서 (row→col) 는 중간에 AllGather 필요.

9 SP 와의 겹침 (!) SP=RS+AG

Megatron+SP 에선 TP 의 AllReduce = ReduceScatter + AllGather 로 분해되어 dropout/layernorm 사이에 SP shard 를 끼워 활성 메모리 감소. ↗ §11

1 PP 정의 layer 분할

정의 transformer L 개 layer 를 P 개 stage 로 나눠 각 GPU 가 stage 하나 담당. 순차적 데이터 흐름: stage_s 출력 = stage_{s+1} 입력 (P2P send/recv).

micro-batch M 개 로 batch 쪼갬 → pipeline 채움
통신 = stage 경계 activation P2P · backward grad P2P

2 GPipe naive

stage↓  time →
 0 :  F1 F2 F3 F4 ..          B4 B3 B2 B1
 1 :     F1 F2 F3 F4 ..      B4 B3 B2 B1
 2 :        F1 F2 F3 F4 ..  B4 B3 B2 B1
 3 :           F1 F2 F3 F4 .B4 B3 B2 B1
 forward 전부 → backward 전부 (all-forward-all-backward)

단순 · 하지만 activation 전부 보관 → 메모리 ↑
bubble ratio = (P−1)/M

3 1F1B ★ PipeDream / Megatron

stage↓  time →
 0: F1 F2 F3 F4 B1 F5 B2 F6 B3 F7 B4 ..
 1:    F1 F2 F3 F4 B1 F5 B2 F6 B3 F7 ..
 ...
   F 와 B 를 1 micro-batch 씩 교대

memory: stage 당 보관 activation ≈ P (GPipe 는 M)
bubble ratio = (P−1)/M (동일)
throughput 유사, memory 우위

4 Interleaved 1F1B ★ Megatron virtual

정의 각 GPU 가 v 개의 virtual stage 담당 (model chunk). layer 를 stage 순이 아니라 round-robin 배치 → bubble 감소.

bubble_interleaved = (P−1) / (M · v) v = virtual stage 수 per GPU · 통신량은 v× 증가

5 Zero Bubble 최근

backward 를 B (input grad) + W (weight grad) 로 쪼갬
W 는 critical path 밖 → bubble 채움
조건 만족 시 bubble = 0 (수식상)

6 Bubble 공식 비교

schedule	bubble ratio	mem/stage
GPipe	(P−1)/M	M activations
1F1B	(P−1)/M	P activations
interleaved 1F1B	(P−1)/(Mv)	v·P activations
Zero Bubble	≈ 0	similar to 1F1B

Megatron 기본: v=2~4. throughput 개선 vs 통신 overhead trade.

7 통신 패턴

stage 경계: SEND/RECV activation (forward) + grad (backward)
P2P message — NCCL ncclSend/ncclRecv
size: B · S · H per micro-batch per boundary
NVLink intra-node / IB inter-node

8 PP 크기 선택 (!) M ≥ 4P

When choosing the PP size P, aim for M ≥ 4P micro-batches. PP is only worthwhile once the bubble ratio is at or below 25%. If M is too small, throughput suffers badly under 1F1B or any other schedule.
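The bubble formulas from the comparison table can be turned into a quick calculator. A minimal sketch: `bubble_ratio` and `min_microbatches` are hypothetical helper names, and the formula is the (P−1)/(M·v) expression from this section, with v=1 for GPipe and plain 1F1B.

```python
import math


def bubble_ratio(p, m, v=1):
    """Bubble ratio (P-1)/(M*v); v is the number of virtual stages per GPU."""
    return (p - 1) / (m * v)


def min_microbatches(p, target=0.25, v=1):
    """Smallest M whose bubble ratio is at or below the target."""
    return math.ceil((p - 1) / (target * v))


for p in (4, 8, 16):
    m = 4 * p  # the M >= 4P rule of thumb
    print(p, m, round(bubble_ratio(p, m), 4), min_microbatches(p))
```

With M = 4P the ratio is (P−1)/(4P), which stays just under 25% for any P; interleaving with v=2 halves it at the price of more P2P traffic.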

9 PP & 학습 loop

학습 loop 통합·FSDP 연동은 ↗ V17 §10 에서 상세. 본 권은 통신 패턴에만 초점.

1 Vanilla DP

정의 각 GPU 가 model 전체 복제. batch 만 쪼개 병렬. backward 후 grad AllReduce 로 sync.

2 학습 메모리 구성 Adam 기준

성분	bytes/param	설명
param (BF16)	2	forward weight
grad (BF16)	2	backward
param master (FP32)	4	update precision
momentum (FP32)	4	m
variance (FP32)	4	v
합계	16	per param

자세한 mixed precision 은 ↗ V09 §2 · 학습 loop 통합 ↗ V17 §2.

3 ZeRO-1 optimizer shard

mem/GPU = 2 + 2 + 12/N = 4 + 12/N bytes/param
comm = ReduceScatter(grad) + AllGather(updated param)
In total the same volume as one AllReduce (same as DDP)

4 ZeRO-2 + grad shard

mem/GPU = 2 + (2 + 12)/N = 2 + 14/N
comm: ReduceScatter(grad) + AllGather(updated param)
= same BW as AllReduce (decomposition identity)

5 ZeRO-3 / FSDP ★ + param shard

mem/GPU = 16/N
comm = forward AllGather(param) + backward AllGather(param) + ReduceScatter(grad)
total = 1.5× the vanilla DP volume
AG+RS = AR, but each layer adds one more AG (forward).

PyTorch FSDP = native ZeRO-3 · unit = FlatParameter (several params concatenated).

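6 FSDP flow for one layer ★

layer forward:
  AllGather(param)        ← rebuild full W (N shards → 1)
  y = layer(x)
  free full param         ← back to shard only

layer backward:
  AllGather(param)        ← needed again
  dx, dw = layer.bwd(...)
  ReduceScatter(dw)       ← into grad shards
  free full param

Prefetch per layer so the AG overlaps with compute
use_orig_params=True (an FSDP1 constructor option) exposes the original parameters rather than the FlatParameter, which eases reuse of the model outside training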

7 Offload 변형

ZeRO-Offload: CPU 로 opt state 내림
ZeRO-Infinity: NVMe 까지 확장
제한: CPU-GPU BW (PCIe Gen4 ~32 GB/s) 가 bottleneck

8 비교 매트릭스

단계	mem (B/param)	extra comm
DDP	16	0
ZeRO-1	4 + 12/N	0
ZeRO-2	2 + 14/N	0 (RS 대체)
ZeRO-3/FSDP	16/N	+50% (AG forward)

N = DP rank 수. 상세 수식 유도 ↗ V17 §3.
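The memory column of the matrix follows directly from the 2 + 2 + 12 bytes/param breakdown. A small sketch, assuming BF16 params and grads plus FP32 Adam state as in this section; `bytes_per_param` and `model_state_gb` are hypothetical names, and activations and buffers are deliberately ignored.

```python
PARAM, GRAD, OPT = 2, 2, 12  # bytes/param: BF16 weight, BF16 grad, FP32 master + m + v


def bytes_per_param(stage, n):
    """Model-state bytes per parameter on one GPU for ZeRO stage 0..3 with N DP ranks."""
    if stage == 0:   # plain DDP: everything replicated
        return PARAM + GRAD + OPT
    if stage == 1:   # optimizer state sharded
        return PARAM + GRAD + OPT / n
    if stage == 2:   # + gradients sharded
        return PARAM + (GRAD + OPT) / n
    if stage == 3:   # + parameters sharded (FSDP)
        return (PARAM + GRAD + OPT) / n
    raise ValueError("stage must be 0, 1, 2 or 3")


def model_state_gb(num_params, stage, n):
    return num_params * bytes_per_param(stage, n) / 1e9


for stage in range(4):
    print(stage, bytes_per_param(stage, 8), round(model_state_gb(7e9, stage, 8), 1))
```

For a 7B model on 8 GPUs this gives 112 GB per GPU under DDP versus 14 GB under ZeRO-3, which is the reason the 7B row of the sizing matrix uses ZeRO-3 on a single node.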

1 MoE 배경

정의 Mixture of Experts: MLP layer 를 E 개 expert 로 병렬. router (gate) 가 각 token 을 top-k expert 에 보냄. sparse activation — token 당 k ≪ E.

param ↑↑ (E·MLP_size) · FLOP 는 k/E 배
Switch Transformer (k=1) · Mixtral (k=2, E=8) 등

2 EP 정의

정의 expert 들을 rank 에 분산 저장 (expert_r 들이 rank r 에). token 을 해당 expert 가 있는 rank 로 전송 → MLP → 결과 역송.

3 핵심 통신 = AllToAll ★

MoE forward:
 x [B·S, H]                       각 rank 보유 token
   └─ router → assignment [B·S, k]
   └─ sort by expert
   └─ AllToAll (dispatch)  ← 1
   └─ expert.MLP (local)
   └─ AllToAll (combine)   ← 2
   └─ scale by gate · sum top-k

1 layer 당 2× AllToAll

4 Capacity Factor

정의 capacity c = ⌈(B·S·k / E) · cf⌉ (cf = capacity factor, 일반적으로 1.0~1.5). expert 당 최대 c token 수용 — 초과 시 drop.

cf 높이면 drop ↓, 통신량 ↑
cf 낮추면 GPU 활용도 ↑, 성능 손실 위험

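5 AllToAll size

dispatch bytes/rank ≈ B·S·k · H · 2 · (E−1)/E
total per layer = 2 × the value above (dispatch + combine)
B·S = tokens held by one rank · k = top-k · H = hidden · 2 = bytes per BF16 element · E = number of experts, one expert per rank with uniform routing, so (E−1)/E of the traffic is remote

Mixtral 8x7B (k=2, E=8, H=4096): B·S = 8192 tokens per rank · k·2·H = 16 KB/token · 8192 × 16 KB ≈ 134 MB, of which 7/8 ≈ 117 MB/rank/layer leaves the rank in each direction.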
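The capacity and AllToAll sizes can be computed in a few lines. A sketch under stated assumptions: tokens are counted per rank, there is one expert per rank, routing is uniform, elements are BF16 (2 bytes), and `expert_capacity` and `dispatch_bytes_per_rank` are hypothetical helper names.

```python
import math


def expert_capacity(tokens, k, num_experts, cf):
    """c = ceil(tokens * k / E * cf): max tokens one expert accepts before dropping."""
    return math.ceil(tokens * k / num_experts * cf)


def dispatch_bytes_per_rank(tokens_per_rank, k, hidden, num_ranks, bytes_per_elt=2):
    """Bytes one rank sends to other ranks in a single dispatch AllToAll."""
    total = tokens_per_rank * k * hidden * bytes_per_elt
    return total * (num_ranks - 1) / num_ranks  # the local expert's share stays put


# Mixtral-like numbers from the example above.
one_way = dispatch_bytes_per_rank(8192, k=2, hidden=4096, num_ranks=8)
print(one_way / 1e6, 2 * one_way / 1e6)       # dispatch only, dispatch + combine (MB)
print(expert_capacity(8192, 2, 8, cf=1.25))   # tokens per expert at cf = 1.25
```

Raising cf increases the padded buffer that each AllToAll has to move, which is the drop-rate versus traffic trade-off described in the capacity-factor section.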

6 Token Routing 스키마

방식	특징
Top-k	token → k 개 expert
Expert-choice	expert → top-k token 선택
Hash routing	gating 없이 hash
BASE layer	학습 가능한 balance

7 AllToAll 구현 알고리즘

Ring: N−1 step, neighbor 전송. BW-optimal.
Bruck: log N step, index reorder 필요. latency-optimal small msg.
Pairwise exchange: N step. NVSwitch 하에서 단순.
NCCL 내부 hierarchical: intra-node pairwise + inter-node ring

8 EP + TP 조합

DeepSpeed-MoE EP group (expert 위치) · TP group (expert 내부 MLP sharding) 중첩. comm = AllToAll(EP) + AllReduce(TP).

typical: EP=8, TP=1 (expert 단위로 분할)
EP > 8 → inter-node AllToAll 병목

9 Load Balance Loss (!) uniform→no drop

router 학습 시 auxiliary loss(load balance) 없으면 expert 사용 편중 → 특정 GPU 만 로드, 나머지 유휴. DeepSeek-MoE 는 aux-loss-free 변형.

1 SP 동기 TP 한계

문제 TP 만 쓰면 LayerNorm·Dropout·residual 은 전 rank 복제 → activation 메모리 N 배. sequence 차원 은 TP 로 분할 안 됨.

Megatron-SP: TP 그룹 내부에서 norm/dropout 을 seq 차원 shard
Ulysses (DeepSpeed): 독립적 SP — seq 차원 shard, head 차원 AllToAll

2 Megatron-SP ★ TP 보조

기존 TP:
 [B,S,H] → AllReduce(TP) → [B,S,H] (전 rank 복제)

Megatron-SP:
 [B,S,H] → ReduceScatter(S축) → [B,S/P,H]
         → LayerNorm/Dropout (local)
         → AllGather(S축) → [B,S,H]
         → TP attn/MLP
AR = RS + AG 분해 → 추가 통신 없음, activation ÷P

3 Ulysses ★ 독립 SP

idea Input: [B, S/P, H] (sequence shard). Before attention, an AllToAll (S↔Head) produces [B, S, H/P]. After attention, a second AllToAll applies the inverse transform.

[B, S/P, H]
   │ AllToAll (S ↔ Head)
   ▼
[B, S, H/P]   ← head 차원 분할
 attention (per head 독립)
   │ AllToAll (Head ↔ S)
   ▼
[B, S/P, H]
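The layout change performed by the two AllToAll calls can be sketched with nested lists. Assumptions: batch size 1, the hidden axis is treated as P contiguous head blocks, and `seq_to_head` / `head_to_seq` are hypothetical names that model the data movement only, not the DeepSpeed implementation.

```python
def seq_to_head(x_shards):
    """x_shards[r]: S/P tokens on rank r, each a list of H features.
    After the AllToAll, rank d holds all S tokens but only head block d (H/P features)."""
    p = len(x_shards)
    hb = len(x_shards[0][0]) // p
    return [
        [tok[d * hb:(d + 1) * hb] for r in range(p) for tok in x_shards[r]]
        for d in range(p)
    ]


def head_to_seq(y_shards):
    """Inverse AllToAll: back to S/P tokens per rank with all H features."""
    p = len(y_shards)
    sb = len(y_shards[0]) // p
    return [
        [sum((y_shards[d][t] for d in range(p)), []) for t in range(r * sb, (r + 1) * sb)]
        for r in range(p)
    ]


# P=2 ranks, S=4 tokens, H=4 features; feature value = token * 10 + feature index.
x = [[[t * 10 + f for f in range(4)] for t in (0, 1)],
     [[t * 10 + f for f in range(4)] for t in (2, 3)]]
y = seq_to_head(x)
print(y[0])
print(head_to_seq(y) == x)
```

After the first exchange every rank sees the whole sequence for its own heads, so attention needs no further communication until the inverse exchange.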

4 Ulysses 통신량

AllToAll bytes/rank = 2 · (P−1)/P · B·S·H · bytes_per_elt
per layer = 4× (forward 2 + backward 2) TP 의 AllReduce 와 비슷한 규모지만 head 독립이라 attention 내부 통신 無

5 SP & 활성 메모리

방식	activation mem/rank
TP only	B·S·H
TP + Megatron-SP	B·S·H / P
Ulysses (pure SP)	B·S·H / P
TP · SP 조합	B·S·H / (P_TP·P_SP)

6 제약 head 수

Ulysses 는 head 수 = P · k 여야 깔끔 분할. GQA 에선 KV head 수 작아 SP 크게 못 키움. Ring attention (CP) 로 보완 ↗ §12.

7 Norm 과의 상호작용

LayerNorm/RMSNorm 은 hidden 축 평균 → shard 상태에서도 local 계산 가능 (hidden 축 full)
Megatron-SP: norm 앞에 RS (S축 shard) · 뒤에 AG (복원)
Dropout mask = random → rank 간 seed 동기화 필수 (같은 shard 는 같은 mask)

8 TP + SP combined flow

input [B,S/P,H]
 ├─ AllGather → [B,S,H]
 ├─ attention QKV (col-TP) → [B,S,H/P]
 ├─ attention (heads independent)
 ├─ out proj (row-TP) → ReduceScatter → [B,S/P,H]
 ├─ dropout (local, seed sync)
 ├─ layernorm (local)
 ├─ AllGather → [B,S,H] → MLP W1 (col-TP) → [B,S,4H/P]
 ├─ GeLU (local)
 └─ MLP W2 (row-TP) → ReduceScatter → [B,S/P,H]

AR is decomposed into RS+AG to combine with SP. Total communication equals TP-only, while activations are sharded.

9 SP 크기 가이드

일반: SP = TP (같은 그룹 재사용)
seq 매우 길면: Ulysses SP 독립 확장 · 혹은 CP 결합
attention 커널 호환성 확인 (FlashAttn varlen) ↗ V07 §9

1 CP 동기 long seq

문제 128K / 1M context 에서 Q·K·V · attention matrix 가 단일 GPU 에 안 들어감. SP (Ulysses) 는 head 수 제약 → CP 가 해법.

Q/K/V 를 seq 축으로 P 등분
attention 은 K/V 를 ring 으로 순환 시켜 모든 Q 가 모든 K/V 를 만남

2 Ring Attention ★ 핵심

GPU_r 보유: Q_r, K_r, V_r  (seq shard r)

for t in 0..P-1:
  K_{r-t}, V_{r-t} 로 local attention (Q_r × K_{r-t}·V_{r-t})
  partial softmax + weighted sum 누적
  P2P SEND K,V → rank r+1 ; RECV from rank r-1

FlashAttn-style online softmax 로 수치 안정

P2P 전송과 attention 연산 overlap 가능 → BW 덜 드러남. FA의 online max/lse 유지가 관건 ↗ V07 §4.

3 Online softmax accumulation

m^(t) = max(m^(t−1), m_new)
l^(t) = e^{m^(t−1)−m^(t)}·l^(t−1) + e^{m_new−m^(t)}·l_new
O^(t) = e^{m^(t−1)−m^(t)}·O^(t−1) + e^{m_new−m^(t)}·O_new
m: running max · l: running sum of exponentials (the softmax denominator) · O: running unnormalized output; the final output is O / l
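The update rule can be checked against an ordinary softmax on a single query. A minimal sketch: `local_attention`, `merge`, and `ring_attention_one_query` are hypothetical names, K/V shards are visited sequentially instead of being passed around a ring, and the 1/√d scaling is omitted for brevity.

```python
import math


def local_attention(q, keys, values):
    """Return (m, l, o): max score, sum of exp(score - m), unnormalized weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    o = [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(len(values[0]))]
    return m, sum(w), o


def merge(state, new):
    """Apply the m / l / O update from this section to two partial results."""
    m_old, l_old, o_old = state
    m_new, l_new, o_new = new
    m = max(m_old, m_new)
    a, b = math.exp(m_old - m), math.exp(m_new - m)
    return m, a * l_old + b * l_new, [a * x + b * y for x, y in zip(o_old, o_new)]


def ring_attention_one_query(q, kv_shards):
    state = local_attention(q, *kv_shards[0])
    for keys, values in kv_shards[1:]:
        state = merge(state, local_attention(q, keys, values))
    _, l, o = state
    return [x / l for x in o]  # normalize once at the end


q = [0.3, -0.2]
keys = [[1.0, 0.0], [0.5, 2.0], [-3.0, 1.0], [4.0, 4.0], [0.0, -1.0], [2.0, 2.0]]
values = [[float(i), float(-i)] for i in range(6)]
shards = [(keys[i:i + 2], values[i:i + 2]) for i in range(0, 6, 2)]
print(ring_attention_one_query(q, shards))
print(ring_attention_one_query(q, [(keys, values)]))  # single-shard reference
```

Rescaling by exp(m_old − m) keeps every exponent non-positive, which is what makes the accumulation numerically stable.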

4 Causal Load Imbalance ★

Problem Under a causal mask, rank 0 (the front of the sequence) needs to see only a few K/V blocks. A naive contiguous shard leaves the workload unbalanced across ranks.

work of rank r = Σ_{t≤r} |K_t| ∝ r+1
the last rank does P× the work of the first

5 Striped CP balance

original: rank r = [r·S/P, (r+1)·S/P) contiguous block
striped: rank r = [r, r+P, r+2P, ...] stride
 → causal workload ≈ uniform for every block
 (early and late tokens mixed)

Context Parallel (NV), "Striped Attention" (Brandon 2023).
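The effect of striping on causal workload can be counted directly. A sketch under a simplifying assumption: the cost of a query at position t is taken as t + 1 (the number of keys it may attend to), and only total work per rank is compared, not the per-step schedule; `contiguous`, `striped`, and `causal_work` are hypothetical names.

```python
def contiguous(s, p):
    """Rank r gets tokens [r*S/P, (r+1)*S/P)."""
    b = s // p
    return [list(range(r * b, (r + 1) * b)) for r in range(p)]


def striped(s, p):
    """Rank r gets tokens r, r+P, r+2P, ..."""
    return [list(range(r, s, p)) for r in range(p)]


def causal_work(assignment):
    """Total number of (query, key) pairs each rank computes under a causal mask."""
    return [sum(t + 1 for t in tokens) for tokens in assignment]


for layout in (contiguous, striped):
    work = causal_work(layout(16, 4))
    print(layout.__name__, work, round(max(work) / min(work), 2))
```

With S=16 and P=4 the contiguous layout has a 5.8× gap between the busiest and idlest rank, while striping brings it down to about 1.4×, and the gap narrows further as S/P grows.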

6 Communication volume

P2P bytes/rank = 2 (K and V) · (P−1) steps · (S/P) tokens · d_head·n_head · 2 bytes
= 4 · (P−1)·(S/P)·H
forward+backward total ≈ 8·(P−1)·(S/P)·H
S = full seq · H = hidden · d_head·n_head = H · per sequence in the batch, BF16

7 CP vs Ulysses 비교

항목	Ulysses	CP (Ring)
통신	AllToAll (S↔H)	P2P ring (K,V)
head 제약	H/P 분할 필요	없음
seq 상한	보통 ~32K	~1M+
overlap	어려움	자연스러움
causal	간단	striped 필요

8 CP & 다른 축 조합

CP + TP: attention 내부를 CP, MLP/proj 는 TP (독립 그룹)
CP + EP: MoE long-context 학습
CP = DP 처럼 data 축으로 보이지만 attention 에선 sync 필요

SP/CP attention 상세 알고리즘 ↗ V07 §11.

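1 Overall composition formula

world_size = DP × TP × PP × EP × CP
Each rank is identified by a 5-tuple (dp, tp, pp, ep, cp)
process_group = created per axis (5+ NCCL comms)
In practice the axes form a product of groups, and the product of the axis sizes equals the total GPU count.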

2 축별 통신 요약

축	primary collective	frequency
DP	AllReduce / RS (ZeRO)	step 끝
TP	AllReduce / AG+RS	layer × 2~4
PP	SEND/RECV (P2P)	stage 경계
EP	AllToAll	MoE layer × 2
CP	P2P ring (K,V)	attention 내부

3 축 배치 원칙 (!) TP≤8 · PP

TP 는 NVLink 내부 (≤8, 동일 node). PP 는 node 경계 OK. DP 는 가장 바깥 (ZeRO 시 RS BW 요구). EP 는 TP 와 겹치거나 독립.

4 모델 크기별 매트릭스 ★★

model	GPU	TP	PP	DP (ZeRO)	EP	CP
7B dense	8	1	1	8 (Z3)	—	1
7B long-ctx	8	1	1	2	—	4
70B dense	64	8	1	8 (Z3)	—	1
70B dense	512	8	4	16	—	1
400B dense	2048	8	8	32	—	1
400B dense long	4096	8	8	8	—	8
8×22B MoE	256	4	1	8	8	1
DeepSeek V3 (685B)	2048	1	16	16	8	1

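Example configurations. The product of all listed axes (TP·PP·DP·EP·CP) equals the GPU count. The ZeRO stage (1/2/3) is chosen inside the DP axis. The numbers are representative points and real deployments need tuning; the DeepSeek V3 row in particular is simplified (the published recipe reports 16-way PP with 64-way EP spanning 8 nodes and ZeRO-1 DP).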

5 7B 선택 근거

param·grad·opt 모두 한 node (8 GPU) 에 ZeRO-3 로 적재
TP=1: 작은 모델은 TP 통신 overhead 손해 더 큼
long-ctx: CP 확장으로 memory 해결

6 70B / 400B 선택 근거

70B: TP=8 + DP(ZeRO-3) · node 내에 layer 1개가 다 들어가는 상한
400B: TP=8 + PP=8 (node 간 layer 쪼갬) + DP 로 scale-out
PP 는 bubble (P-1)/M 을 만족할 M 필요 (M ≥ 32 목표)

7 MoE 특수

EP 축 추가 → AllToAll 비용 (node 내 유지가 최선)
TP 줄이고 EP 확장 경향 (expert 가 이미 param 분산 역할)
DeepSeek V3: TP=1, PP 로 파이프 · EP=8 내부

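8 Rank mapping formula

rank_global = dp·(TP·PP·EP·CP) + tp·(PP·EP·CP) + pp·(EP·CP) + ep·CP + cp
(or the reverse order, depending on the implementation)
When creating NCCL subgroups, the communicators for each of the 5 axes are built from that axis's stride.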
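The mapping and the per-axis groups can be written out explicitly. A sketch assuming the (DP, TP, PP, EP, CP) ordering of the formula above with CP varying fastest; `to_global`, `from_global`, and `axis_groups` are hypothetical helper names, and real frameworks may order the axes differently.

```python
import itertools


def to_global(coord, sizes):
    """(dp, tp, pp, ep, cp) -> global rank, last axis varying fastest."""
    rank = 0
    for c, n in zip(coord, sizes):
        rank = rank * n + c
    return rank


def from_global(rank, sizes):
    coord = []
    for n in reversed(sizes):
        coord.append(rank % n)
        rank //= n
    return tuple(reversed(coord))


def axis_groups(sizes, axis):
    """Rank lists that would each form one communicator along the given axis."""
    others = [range(n) for i, n in enumerate(sizes) if i != axis]
    groups = []
    for fixed in itertools.product(*others):
        group = []
        for a in range(sizes[axis]):
            coord = list(fixed[:axis]) + [a] + list(fixed[axis:])
            group.append(to_global(coord, sizes))
        groups.append(group)
    return groups


sizes = (2, 2, 2, 1, 1)                  # DP, TP, PP, EP, CP on 8 GPUs
print(to_global((1, 0, 1, 0, 0), sizes))
print(axis_groups(sizes, axis=1))        # TP groups
print(axis_groups(sizes, axis=0))        # DP groups
```

The stride of an axis is the product of the sizes to its right, so placing TP (and CP/EP) toward the right of the tuple keeps those groups on neighbouring ranks, i.e. inside a node.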

9 Node affinity

실무 rule of thumb: TP·CP·EP 는 intra-node, DP·PP 는 inter-node. AllToAll 비용을 NVLink 안에 가둔다.

1 Overlap 기본 원리

정의 통신 kernel 은 compute kernel 과 다른 CUDA stream 에 두고 event 로 의존성만 걸면, 두 kernel 이 HW 레벨에서 동시에 진행된다. HW: NVLink/NIC 는 SM 과 독립 unit.

SM 소비: NCCL kernel 은 channel × SM 을 얕게 잡음 (~2-4 SM)
나머지 SM 은 compute 여유 → overlap 성립

2 Backward AllReduce Bucketing

naive: grad 1개 생성마다 AR → 작은 msg × 많음
bucket: 연속 param grad 를 buffer 에 쌓아 합치기
 └─ 25 MB 정도 버킷 권장
 └─ 첫 bucket ready 시점부터 AR 시작
 └─ 이후 bucket 은 compute 와 overlap

PyTorch DDP bucket_cap_mb=25 기본. bucket 클수록 BW 효율 ↑, latency 시작 ↑.

3 FSDP Prefetch ★

layer_n forward  :  AG(param_n)   ┐
                    compute_n      │
layer_n+1 forward: AG(param_n+1) prefetch (overlap)

layer_n backward : AG(param_n)    │
                   compute_n      │
                   RS(grad_n)     │ overlap with next layer's AG
layer_n-1 backward: AG(param_n-1) prefetch

PyTorch FSDP backward_prefetch=BACKWARD_PRE
통신 stream + compute stream 분리 필수

4 Async Collective API

NCCL: kernel launch 후 반환 → CPU 는 다음 작업
PyTorch async_op=True → work handle
수동 overlap: handle.wait() 를 compute 이후에 배치

5 PP & overlap

1F1B 의 backward 기간 동안 다음 forward 와 P2P 교차
stream 2개: activation P2P · grad P2P 분리
NCCL send/recv 를 서로 다른 comm 으로 → 큐 충돌 회피

6 TP & overlap (advanced)

idea TP AllReduce 를 fine-grained 하게 tiled matmul 과 interleave — matmul tile 끝날 때마다 부분합 AR.

Megatron "async tensor parallel" — WGMMA 와 interleave ↗ V04 §7
Ring-allreduce 는 natural pipelining

7 Overlap 한계

증상	원인
overlap 0%	SM 꽉 참 / kernel 너무 짧음
overlap 부분	bucket too small / sync 빈번
NIC idle	bucket too big · 첫 AR 지연
div crash	stream 순서 버그 · event 누락

8 CUDA Graph & overlap

전체 iteration 을 graph capture → launch overhead 제거
decode path 에 효과 극대 ↗ V16 §10
NCCL 2.11+ partial support (NCCL_GRAPH_MIXING_SUPPORT)

9 SM 사용량 분배

NCCL kernel 이 compute SM 을 너무 많이 잡으면 overlap 이 안 된다. NCCL_MAX_NCHANNELS 낮추거나, collective 크기를 bucket 조정으로 짧게 유지.

1 α-β 모델 정의 ★

T(n) = α + β · n
α : per-message latency (fixed overhead)
β : per-byte cost = 1 / BW Hockney model (1994). MPI·NCCL 성능 분석의 기본.

α ∝ hop count · driver · NIC 지연
β ∝ 1/link_BW

2 확장: α-β-γ

T(n) = α + β·n + γ·n (γ : reduce compute) reduction op (sum, max) 의 FLOP 이 무시 못 할 때

BF16 allreduce 에선 γ ≪ β, 무시해도 됨. FP32 + 큰 tensor 에선 γ 가 드러남.

3 Typical values H100 DGX

link	α (μs)	β (ps/B)
NVLink 4 P2P	~3	~1.1
NVSwitch hop	~3.5	~1.1
IB 400G	~5	~20
PCIe Gen5 P2P	~10	~8

Approximate; varies with driver and stack. β is 1/BW, e.g. 900 GB/s → 1.1 ps/B and 50 GB/s → 20 ps/B. Source: nominal NCCL-tests values.

4 AllReduce theoretical time

T_ring(N, S) = 2(N−1) α + 2(N−1)/N · S · β
T_tree(N, S) = 2 log₂ N · (α + S · β)
T_HD(N, S) = 2 log₂ N · α + 2(N−1)/N · S · β
ring latency ↑ (linear in N−1) · tree/HD latency ↓ (log N)

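5 Switching point ★

S_* : the size where T_ring = T_tree
2(N−1) α + 2(N−1)/N · S · β = 2 log N · α + 2 log N · S · β
S_* = α · [N − 1 − log N] / (β · [log N − (N−1)/N])
Above S_* use ring; below it use tree.

Example: N=8, α=5μs, β=1ns/B → S_* = 5μs · 4 / (1 ns/B · 2.125) ≈ 9.4 KB. S_* scales with α/β, so a faster link (smaller β) pushes the crossover toward larger messages.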
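The cost models and the crossover size are easy to evaluate numerically. A sketch that implements the T_ring, T_tree, and T_HD expressions of this section as written (non-pipelined tree, log₂ without ceiling in the crossover); the function names are hypothetical and the α, β inputs are the example values, not measurements.

```python
import math


def t_ring(n, s, alpha, beta):
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * s * beta


def t_tree(n, s, alpha, beta):
    return 2 * math.log2(n) * (alpha + s * beta)


def t_hd(n, s, alpha, beta):
    return 2 * math.log2(n) * alpha + 2 * (n - 1) / n * s * beta


def crossover_ring_tree(n, alpha, beta):
    """Message size S_* at which T_ring == T_tree."""
    log_n = math.log2(n)
    return alpha * (n - 1 - log_n) / (beta * (log_n - (n - 1) / n))


n, alpha, beta = 8, 5e-6, 1e-9           # 5 us latency, 1 ns/B
s_star = crossover_ring_tree(n, alpha, beta)
print(round(s_star))                     # bytes
for s in (1_000, 1_000_000):
    print(s, t_ring(n, s, alpha, beta) < t_tree(n, s, alpha, beta))
```

Below S_* the tree's smaller α-term wins and above it the ring's smaller β-term wins; halving-doubling is never slower than either under this model when N is a power of two.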

6 이론 하한 — Bandwidth

min bytes/GPU (AllReduce) = 2(N−1)/N · S
min bytes/GPU (AllGather) = (N−1)/N · S
min bytes/GPU (ReduceScatter) = (N−1)/N · S
min bytes/GPU (AllToAll) = (N−1)/N · S 정보 이론적 하한 — 링크를 1개 썼을 때 기준

7 이론 하한 — Latency

min steps (AllReduce) = ⌈log₂ N⌉
min steps (AllGather) = ⌈log₂ N⌉
min steps (AllToAll) = ⌈log₂ N⌉ (hypercube) message-count 하한 — 실제 구현은 algorithm 선택에 따라 더 많음

8 Effective BW formula

BW_eff = S / T(N, S)
algorithm-adjusted: BW_ring,eff = S / [2(N−1)·(α + S·β/N)]
S→∞ : BW_eff → N·link_BW / (2(N−1)) → link_BW/2 (for large N)
This is why the ideal AllReduce efficiency is about 50% of link_BW.
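The approach to the asymptote can be tabulated. A sketch using the ring model only; `ring_eff_bw` and `ring_eff_bw_limit` are hypothetical names, and the α and link bandwidth values are illustrative inputs (450 GB/s per direction is an assumption for an NVLink 4 class link).

```python
def ring_eff_bw(n, s, alpha, beta):
    """BW_eff = S / T_ring(N, S) in bytes per second."""
    return s / (2 * (n - 1) * (alpha + s * beta / n))


def ring_eff_bw_limit(n, beta):
    """Limit for S -> infinity: N * link_BW / (2 (N - 1))."""
    return n / (2 * (n - 1) * beta)


n, alpha, link_bw = 8, 3e-6, 450e9
beta = 1 / link_bw
for s in (64e3, 1e6, 64e6, 1e9):
    frac = ring_eff_bw(n, s, alpha, beta) / ring_eff_bw_limit(n, beta)
    print(f"{s:>12.0f} B  {frac:6.1%} of the asymptote")
```

Small messages sit far below the limit because the 2(N−1)·α term dominates, which is the quantitative reason behind bucketing.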

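9 Decomposition and α reduction (!) bucket = α↓

Packing small messages into one bucket pays the α cost once per bucket instead of once per message, while the β cost stays the same. This is critical when there are many small grads, and it is the theoretical basis of DDP bucketing.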
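The saving can be estimated with the ring model. A sketch with assumed inputs: 1000 gradients of 100 KB each, a 25 MB bucket threshold (decimal MB for round numbers), and illustrative α and β; `make_buckets` is a hypothetical greedy packer, not the PyTorch DDP implementation.

```python
def t_ring(n, s, alpha, beta):
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * s * beta


def make_buckets(sizes, bucket_bytes):
    """Greedily pack consecutive gradients until a bucket reaches the threshold."""
    buckets, current = [], 0
    for s in sizes:
        current += s
        if current >= bucket_bytes:
            buckets.append(current)
            current = 0
    if current:
        buckets.append(current)
    return buckets


n, alpha, beta = 8, 5e-6, 2e-11
grads = [100_000] * 1000
buckets = make_buckets(grads, 25_000_000)

t_naive = sum(t_ring(n, s, alpha, beta) for s in grads)
t_bucketed = sum(t_ring(n, s, alpha, beta) for s in buckets)
print(len(buckets), round(t_naive * 1e3, 2), round(t_bucketed * 1e3, 2))  # count, ms, ms
```

The β-term is identical in both cases, so the whole difference is 2(N−1)·α times the number of collectives avoided.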

1 RDMA 원리

정의 Remote Direct Memory Access: NIC 이 원격 메모리에 직접 read/write. kernel bypass · CPU 개입 없음 · zero-copy.

registered memory region (MR) 의 rkey 로 원격 접근
completion queue (CQ) 로 비동기 신호

2 InfiniBand vs RoCE

항목	InfiniBand	RoCE v2
L2	IB transport	Ethernet
L3	IB routing	IP/UDP
lossless	credit-based	PFC (DCB) 필요
latency	~1 μs	~2-3 μs
생태계	HPC 정통	cloud · hyperscale

RoCE v2 는 UDP 위에서 돌기 때문에 IP routing 가능 — 대규모 cloud 에 유리.

3 Queue Pair (QP) ★

정의 QP = (SendQ, RecvQ) 한 쌍. 두 NIC 이 서로의 QP 를 connected state 로 만들면 통신 가능. 각 rank pair 는 QP 여러 개 가능 (multi-queue).

RC (Reliable Connected) · UD (Unreliable Datagram) · UC (Unreliable Connected)
NCCL : RC 주로 사용, DC (Dynamic Connected) 실험

4 SEND/RECV vs RDMA WRITE ★

op	sender	receiver
SEND/RECV	post_send	pre-post recv buf
RDMA WRITE	post_send(addr,rkey)	passive (mem 준비만)
RDMA READ	post_send(addr,rkey)	passive
ATOMIC	FA/CAS 64bit	passive

NCCL intra-collective 는 주로 RDMA WRITE (receiver 가 어디에 둘지 알고 있음). 큰 msg 에 유리.

5 Work Request 흐름

app → ibv_post_send(wr)   # doorbell PCIe MMIO
NIC ─ DMA from reg-mem ─→ wire
wire ─→ remote NIC ─ DMA to reg-mem ─→ remote
complete → CQ entry → app polls / event

6 GPUDirect RDMA ★ (!) NIC ↔ GPU

정의 NIC 이 GPU HBM 을 직접 DMA source/sink 로 사용. CPU bounce buffer 삭제. nvidia-peermem 모듈 필요.

조건: PCIe ACS off · NIC-GPU 같은 root complex 또는 NVLink
성능: latency 2-3× 감소, BW 실효 ↑

7 GPUDirect Storage 구분

GDR (NIC ↔ GPU) 과 GDS (NVMe ↔ GPU) 는 별개. NCCL 은 GDR 만 사용. GDS 는 I/O 용.

8 IB SHARP in-net

정의 SHARP: IB switch 내부 aggregation tree 가 reduction 수행. inter-node AllReduce 가속. NVLS 의 IB 판 ↗ §5.

ConnectX-6+ 지원 · subnet manager 설정 필요
NCCL NCCL_COLLNET_ENABLE=1

9 디바이스 선택 환경

NCCL_IB_HCA=mlx5_0,mlx5_1 명시적 NIC 지정
NCCL_IB_GID_INDEX RoCE v2 에서 필수
NCCL_NET_GDR_LEVEL=PXB GDR 레벨 조정
NCCL_SOCKET_IFNAME bootstrap 용 TCP iface

1 Debug / 진단

env	값
`NCCL_DEBUG`	VERSION · WARN · INFO · TRACE
`NCCL_DEBUG_SUBSYS`	INIT, COLL, P2P, NET, GRAPH…
`NCCL_DEBUG_FILE`	rank-별 로그 파일 경로
`NCCL_TOPO_DUMP_FILE`	topology XML dump
`NCCL_GRAPH_DUMP_FILE`	ring/tree graph dump

첫 실행 시 NCCL_DEBUG=INFO 필수. 어떤 algo/proto 가 쓰이는지 확인.

2 Algorithm 선택

env	값
`NCCL_ALGO`	Ring · Tree · CollNet · NVLS · NVLSTree
`NCCL_PROTO`	LL · LL128 · Simple
`NCCL_NVLS_ENABLE`	1 / 0
`NCCL_COLLNET_ENABLE`	1 / 0 (SHARP)

3 Channel 수

NCCL_MIN_NCHANNELS, NCCL_MAX_NCHANNELS
default: topology에 따라 자동
너무 크면 small msg 초기화 비용 ↑

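4 P2P control

env	purpose
`NCCL_P2P_DISABLE`	turn off P2P (for debugging)
`NCCL_P2P_LEVEL`	LOC · NVL · PIX · PXB · PHB · SYS
`NCCL_SHM_DISABLE`	turn off the host shared-memory path
`NCCL_PXN_DISABLE`	turn off the PXN detour

level: NVL (NVLink) < PIX (same PCIe switch) < PXB (across PCIe switches, no host bridge) < PHB (host bridge) < SYS (across sockets). The value states the farthest topology distance at which P2P is still allowed.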

5 IB / 네트워크

env	용도
`NCCL_IB_HCA`	HCA list (mlx5_0,mlx5_1)
`NCCL_IB_GID_INDEX`	RoCE v2 GID
`NCCL_IB_TC`	traffic class (QoS)
`NCCL_IB_SL`	service level
`NCCL_IB_TIMEOUT`	InfiniBand timeout exponent
`NCCL_NET_GDR_LEVEL`	GPUDirect 허용 level

6 Buffer 크기

env	default
`NCCL_BUFFSIZE`	~4 MB
`NCCL_NTHREADS`	256/512 (kernel)
`NCCL_P2P_NET_CHUNKSIZE`	~128KB

BUFFSIZE ↑ → large msg pipelining ↑, GPU mem 도 ↑.

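7 Operating modes

NCCL_LAUNCH_MODE=PARALLEL: all ranks launch concurrently
NCCL_ASYNC_ERROR_HANDLING=1: surface errors through events instead of blocking (a PyTorch ProcessGroupNCCL setting; newer releases spell it TORCH_NCCL_ASYNC_ERROR_HANDLING)
NCCL_BLOCKING_WAIT=1: wait() blocks the host until completion or timeout (also PyTorch-side; newer spelling TORCH_NCCL_BLOCKING_WAIT)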

8 Tuning 실마리 ★

증상	env 시도
small AR 느림	ALGO=NVLS / Tree · PROTO=LL
large AR 느림	ALGO=Ring · PROTO=Simple · channel ↑
inter-node 저하	IB_HCA 명시 · PXN 확인
init 오래	TOPO_FILE 캐시 사용
OOM	BUFFSIZE ↓ · channel ↓

9 Reproducibility

NCCL reduction 은 연산 순서에 따라 결과가 비결정적. NCCL_ALGO=Tree + 고정 seed + 고정 proto 로 준재현 가능. 완전 bitwise 재현은 어려움.

1 MPI 위치

정의 Message Passing Interface: HPC 의 표준. CPU 기반 다수 프로세스 간 통신. send/recv · collective · one-sided · I/O 까지.

구현체: OpenMPI · MPICH · MVAPICH · Intel MPI
GPU-aware MPI: CUDA ptr 을 send 에 직접 전달 가능

2 NCCL 위치

정의 NVIDIA Collective Comm Library: GPU 중심 collective 라이브러리. NVLink/IB 를 직접 통합 · CUDA kernel 로 구현.

GPU-GPU 만 다룸 (CPU 쪽 MPI 보완 관계)
collective + P2P 지원 (2.7+)

3 기능 비교 ★

기능	MPI	NCCL
프로세스 시작·rank	O	—
CPU ↔ CPU	O	—
GPU ↔ GPU P2P	O (GPU-aware)	O
GPU collective 성능	중간	최적
NVLink 활용	부분	full
topology 탐색	제한	O
one-sided	O	—
I/O	O (MPI-IO)	—

4 Hybrid 패턴 ★ (!) MPI rank + NCCL comm

표준 패턴 MPI 로 rank bootstrap · world_size 관리. 각 MPI rank 가 GPU 1개 담당. NCCL ncclUniqueId 를 MPI_Bcast 로 뿌린 후 ncclCommInitRank.

MPI_Init → MPI_Comm_rank/size
if rank==0: ncclGetUniqueId(&id)
MPI_Bcast(&id, ...)
cudaSetDevice(local_rank)
ncclCommInitRank(&comm, size, id, rank)
// now NCCL for collectives, MPI for orchestration

5 torchrun 대안

PyTorch: torchrun·torch.distributed.init_process_group("nccl")
내부적으로 TCPStore (rendezvous) + NCCL
MPI 없이 동작 — 많은 학습 스택에서 기본

6 Launcher 비교

launcher	특징
mpirun / mpiexec	slurm 친화 · HPC 표준
srun (Slurm)	plug with PMI/PMIx
torchrun	PyTorch native · elastic
accelerate	HuggingFace wrapper
DeepSpeed	deepspeed launcher

7 MPI 선택 이유

HPC cluster (Slurm + PMIx) — 기존 infra 재사용
CPU 측 complex workflow (preprocessing · simulation)
I/O 에 MPI-IO 필요 (large dataset stream)
MPI RMA (one-sided) 로 fault handling

8 NCCL 단독 선택

순수 학습/추론 스택 — overhead ↓
cloud k8s 환경 (TCPStore rendezvous 편리)
elastic/fault-tolerant 중시 (torch elastic)

9 상호 겹침 주의

MPI collective 와 NCCL collective 를 동일 데이터에 혼용 금지. 두 라이브러리의 progress engine 이 서로 다른 stream/thread → deadlock/ordering 위험. rank 관리만 MPI, 실제 collective 는 NCCL 로 분리 유지.

10 GLOO · UCC 맥락

PyTorch gloo backend: CPU collectives. ucc: UCX-based hybrid backend. Both serve as fallbacks where NCCL is unavailable, such as CPU-only runs.

1 Parallelism 선택 결정 트리 ★

model & batch 한 GPU fit ?
├─ YES → DP (DDP or ZeRO-1)  · done
└─ NO
   │
   param 한 GPU fit ?
   ├─ YES → ZeRO-2 / ZeRO-3 (FSDP)
   └─ NO
      │
      layer 1개 한 GPU fit ?
      ├─ YES → TP + ZeRO-3
      └─ NO
         │
         layer set 하나가 intra-node 들어감 ?
         ├─ YES → TP + PP + ZeRO
         └─ NO → TP + PP + EP (if MoE) + DP

seq > 32K ?                 → + SP (Ulysses)
seq > 128K ?                → + CP (ring attention)
MoE layer 포함 ?            → + EP

2 Collective × topology BW 표 ★★

op	algo	bytes/GPU	steps
AllReduce	Ring	2(N−1)/N · S	2(N−1)
	Tree	2 log N · S	2 log N
	DBT	~2S	2 log N
	NVLS	~ S	2
ReduceScatter	Ring	(N−1)/N · S	N−1
AllGather	Ring	(N−1)/N · S	N−1
AllToAll	Ring	(N−1)/N · S	N−1
AllToAll	Bruck	(log N / 2) · S	⌈log N⌉
Broadcast	Tree	log N · S	log N
Reduce	Tree	log N · S	log N

bytes/GPU = 양방향 합. step 은 α 계수. NCCL tuner 가 threshold 로 자동 선택.

3 5-축 통신 요약

축	collective	위치
DP (ZeRO-3)	AG + RS	step + layer
TP	AR (or RS+AG)	layer × 2~4
PP	SEND/RECV	stage 경계
EP	AllToAll × 2	MoE layer
SP (Ulysses)	AllToAll × 2	attention 경계
CP	P2P ring	attention 내부

4 α-β quick estimate

small (S around α/β or below): T ≈ α · steps
large (S ≫ α/β): T ≈ bytes · β
crossover: S_* = α · (steps_ring − steps_tree) / (β · (b_tree − b_ring)), where b is the bytes/GPU coefficient of S
A cheat formula for algorithm selection.

5 NVLink intra vs IB inter BW

구간	peak	typical eff.
H100 NVLink 4	900 GB/s	~400 GB/s
NVSwitch 3 fabric	full bipartite	non-blocking
IB 400G (x8 NIC)	400 GB/s	~200 GB/s
PCIe Gen5 x16	128 GB/s	~100 GB/s

"typical eff" 는 AllReduce busBW 기준 근사. NCCL-tests 결과와 유사.

6 디버깅 10대 체크

NCCL_DEBUG=INFO 로그 확인 — algo/proto
nvidia-smi topo -m 으로 GPU-NIC affinity
rank ↔ GPU 매핑 (CUDA_VISIBLE_DEVICES)
IB HCA 명시: NCCL_IB_HCA
PXN 활성 / ACS off 확인
bucket size (DDP) · prefetch (FSDP) 설정
stream 분리 — 통신 ↔ 연산
TP ≤ 8 (intra-node), PP bubble ratio ≤ 0.25
GPUDirect RDMA 동작 (peermem loaded)
NCCL version 고정 · 여러 버전 혼재 금지

7 관련 권

주제	xref
학습 loop · FSDP 통합	↗ V17
inference 서빙 (vLLM)	↗ V16
attention SP/CP kernel	↗ V07
Hopper WGMMA · async	↗ V04
mixed precision grad	↗ V09
profiling · stall	↗ V18

8 5-축 니모닉 (!) DT-PE-C

병렬 5축: DT · PE · C (Data · Tensor · Pipeline · Expert · Context)
SP 는 TP 의 짝 (activation shard)

9 One-line summary

Keep TP·SP traffic inside NVLink, and send only DP·PP traffic over IB. AllToAll goes intra-node first. Ring is king of bandwidth for large S; tree/NVLS are kings of latency for small S.