《GPU Mode》 L078 2025 High priority transcript · failed

Iris — Multi-GPU Programming in Triton

Triton 은 단일 GPU 의 tile DSL 이다. 그러나 실제 inference 와 학습은 8-way, 16-way, 256-way로 늘어나고 — 그 자리에서 NCCL 의 collective 추상은 fine-grained overlap 의 여지를 거의 안 준다. AMD 의 Iris 는 Triton 의 tile-level 추상을 multi-GPU 의 communication primitive 까지 확장한 시도. 학습 노트.

Iris Triton multi-GPU AMD RDMA SHMEM overlap collective

Speaker

Muhammad Awad

AMD

Speaker

Muhammad Osama

AMD

Speaker

Brandon Potter

AMD

강의 번호

L078

학습 우선순위

High

자막

failed

백엔드

AMD MI300+

SUB-CHAPTERS · 12 SECTIONS↓ jump to

01강의가 풀려는 문제why this lecture exists 02Triton 위 multi-GPU 동기why Iris 03Iris 추상core abstractions 04통신 lifecycleput/get/wait 05예제all-gather GEMM 06NCCL 과의 관계where Iris fits 07채택 사례production users 08한계caveats 09다음 단계future work 10기억할 메모key takeaways 11다른 강의로의 연결connections 12열린 질문open questions

§ 01강의가 풀려는 문제· why this lecture exists

"Triton 으로 multi-GPU 가 가능한가" 의 답을 도구로 만든다

Triton 은 단일 GPU 의 tile DSL 로 디자인됐다. 그러나 실제 LLM 학습 / inference 는 multi-GPU 가 default 다. tensor parallel, sequence parallel, expert parallel 같은 분산 패턴이 모두 그 위에 깔린다. 단일 GPU 추상으로는 이 자리를 표현할 수 없다.

강의의 frame 은 명확하다.

NCCL / RCCL 의 collective 는 black box 다 — allreduce, allgather 한 줄로 동작하지만 그 안에서 어떤 chunk 가 언제 어디로 가는지를 통제할 수 없다. fine-grained overlap 이 어렵다.
compute 와 통신을 같은 커널 안에서 짤 수 있어야 한다 — tensor parallel 의 GEMM + reduce-scatter, 또는 sequence parallel 의 ring-attention 같은 패턴은 본질적으로 compute-communication overlap 이다. NCCL 위에서는 별도 stream 으로 어색하게 표현된다.
Triton 의 tile 추상을 multi-GPU 까지 확장한다. tile 안에 "remote" 위치까지 표현. 한 program 이 다른 GPU 의 데이터를 자기 것처럼 indexing.

강의의 인지적 frame

"NCCL 위에 wrapper 를 또 짜는 게 아니다 — multi-GPU 를 단일 커널의 추상으로 끌어내린다." Iris 의 핵심 입장. compute kernel 안에서 직접 remote put/get 을 발사하고 barrier 를 걸 수 있는 모델이 된다.

"NCCL 의 collective 는 너무 추상적이다 — 우리는 그 한 단계 아래에서 compute 와 communication 이 같이 짜이는 자리를 원했다."Iris 저자 · 확인 필요

§ 02Triton 위 multi-GPU 동기· why Iris

NCCL 위 wrapper 가 아닌 — 커널 안에서 communication primitive

왜 NCCL 의 한 줄이 부족한가 — 강의에서 짚는 세 가지 자리.

패턴	NCCL 으로 표현	한계
TP GEMM + reduce	`matmul` + `allreduce` 두 커널	두 커널 사이 sync, overlap 안 됨
SP attention	ring 통신을 user 가 짜야	compute step 마다 NCCL 호출, overhead 큼
EP routing	all-to-all + matmul	token 분포에 따라 imbalanced — collective 비효율
fine-grained pipeline	chunk 별 send/recv	chunk 마다 launch overhead

NCCL 의 모델은 "여러 GPU 가 동시에 같은 collective 를 부르고 결과가 나올 때까지 기다린다". 이 모델은 scale 에서는 효율적이지만 compute 안에 communication 을 끼워 넣고 싶은 자리에는 어울리지 않는다.

대안의 위치

NVIDIA 의 NVSHMEM, OpenMPI 의 SHMEM, MSCCL — 모두 "kernel 안에서 remote put/get" 을 허용한다. Iris 는 그 추상을 Triton tile model 위에서 표현하려는 시도. AMD 의 ROCm + RCCL 인프라 위에서.

§ 03Iris 추상· core abstractions

tile 안에 "rank" 가 들어온다

Iris 의 핵심 자료형 두 개 — distributed tensor 와 remote tile load. Triton 에 익숙한 개발자가 자연스럽게 따라갈 수 있는 형태로 디자인.

SymmetricHeap

모든 rank 에 동일한 virtual address 가 mapped. NVSHMEM/SHMEM 의 표준 model.

DistributedTensor

rank 별 shard 의 tile-level view. 한 program 이 자기 shard + remote shard 를 indexing.

iris.load(remote_rank, addr)

other GPU 의 tile 을 가져온다. async, RDMA 발사.

iris.barrier(group)

tile 단위 sync. NCCL 의 group 보다 fine-grained.

주요 primitive 들의 의미.

iris.load_remote(rank, ptr, mask) — Triton 의 tl.load 와 비슷하지만 첫 인자가 rank. RDMA 또는 NVLink 로 발사.
iris.store_remote(rank, ptr, val, mask) — remote tile 에 직접 write. one-sided put.
iris.atomic_add_remote — remote rank 의 tensor 에 직접 atomic. all-reduce 의 더 fine-grained 버전.
iris.signal(rank, flag) / iris.wait(flag) — fine-grained sync. mbarrier 의 inter-GPU 버전.

NVSHMEM 과의 관계

Iris 의 primitive 는 NVSHMEM/SHMEM 의 one-sided communication 과 1:1 으로 매핑된다. 차이는 Triton 의 program 안에서 호출 가능하다는 점. 즉 컴파일러가 그 호출을 RDMA op 로 lowering.

# Iris 의 distributed GEMM (개념)
@triton.jit
def tp_gemm_kernel(
    A_ptr, B_ptr, C_ptr,
    M, N, K, world_size: tl.constexpr,
    BLOCK: tl.constexpr,
):
    pid = tl.program_id(0)
    rank = iris.rank()

    # 자기 shard 의 일부
    a_block = tl.load(A_ptr + offsets_a)

    # 다른 rank 의 B shard 를 가져온다
    for r in range(world_size):
        if r != rank:
            b_remote = iris.load_remote(
                r, B_ptr + offsets_b, mask=mask)
            acc += tl.dot(a_block, b_remote)

    # local 결과 reduce
    iris.atomic_add_remote(
        0, C_ptr + offsets_c, acc)
    iris.barrier(group=ALL)
    tl.store(C_ptr + offsets_c, final)

§ 04통신 lifecycle· put/get/wait

fine-grained barrier 가 핵심 — collective 한 줄이 아닌 progress

Iris 의 통신은 one-sided + tile-level barrier. NCCL 의 group call 이 아닌 RDMA 모델. 그래서 progress (실제 데이터가 도착했는지) 의 통제가 fine-grained 하다.

01setup — SymmetricHeap allocate, 모든 rank 에 동일 VA

02distributed tensor 등록 — Triton 컴파일러에 layout 정보 전달

03kernel launch — 모든 rank 가 같은 kernel 동시 launch

04remote load 발사 — RDMA / NVLink 로 async fetch

05local compute — fetch 된 tile 위 tl.dot 등

06remote store 또는 atomic — one-sided put

07tile-level barrier — 같은 group 의 program 들 사이 sync

08final result load — barrier 통과 후 결과 사용

FIG · NCCL vs Iris 의 timelinecompute-communication overlap

이 overlap 의 효과는 tile 갯수가 많을수록 큼. 작은 GEMM 에서는 NCCL 도 충분, 큰 TP GEMM 에서는 Iris 의 fine-grained 모델이 우월.

§ 05예제· all-gather GEMM

tensor parallel 의 가장 흔한 패턴 — Iris 로 어떻게 짜이는가

강의에서 가장 진하게 다룬 예제. Megatron-style TP 의 핵심 — "내 weight shard 와 모든 rank 의 input 의 곱을 누적". NCCL 로는 GEMM + allreduce 두 단계, Iris 로는 한 커널.

NCCL 으로 짤 때의 형태.

모든 rank 가 자기 weight shard W_r 와 input X 의 곱 — local Y_r.
모든 rank 의 Y_r 을 sum (allreduce). 결과 Y 가 모든 rank 에 같다.

두 step 사이 sync. 통신과 compute 가 직렬.

Iris 로 짤 때의 형태.

tile 단위로 자기 weight tile 과 input tile 의 곱 — partial accumulator.
각 partial 을 다른 rank 의 buffer 에 atomic_add_remote.
같은 tile group 의 program 들이 barrier.
최종 결과 read-out.

compute 와 통신이 인터리브된 한 커널.

# NCCL 형 (PyTorch 표준)
y_local = torch.matmul(x, w_shard)
torch.distributed.all_reduce(y_local)
# Y = y_local now

# --- vs ---

# Iris 형 (단일 Triton kernel)
@triton.jit
def ag_gemm(x, w, y, ...):
    pid_m, pid_n = tl.program_id(0), tl.program_id(1)
    rank = iris.rank()

    acc = tl.zeros([BM, BN], dtype=tl.float32)
    for k in range(0, K, BK):
        x_tile = tl.load(x + offs_x)
        w_tile = tl.load(w + offs_w)
        acc += tl.dot(x_tile, w_tile)

    # 모든 rank 의 y[m,n] 에 atomic add
    for r in range(world_size):
        iris.atomic_add_remote(r, y + offs_y, acc)

    iris.barrier(group=ALL)

강의의 측정 (확인 필요): 큰 TP GEMM (예: M=N=K=8192, 8-way TP) 에서 Iris 가 NCCL 기반 baseline 보다 20-40% 더 빠른 wallclock 을 보였다. 작은 GEMM 에서는 NCCL 이 비슷하거나 더 빠름 — overhead 의 amortization 차이.

왜 fine-grained 가 빠른가

NCCL allreduce 는 ring-based 또는 tree-based 알고리즘으로 모든 데이터를 한 번에 처리. 그래서 시작 → 끝까지 일정 latency 가 정해져 있다. Iris 는 tile 단위로 partial 을 보내므로 각 tile 은 작은 latency, 다음 tile 의 compute 와 overlap. 평균 latency 가 작아진다.

§ 06NCCL 과의 관계· where Iris fits

대체재가 아니다 — 보완재

강의에서 명시적으로 강조 — Iris 는 NCCL/RCCL 을 대체하지 않는다. 두 도구의 자리가 다르다.

차원	NCCL/RCCL	Iris
모델	collective (group call)	one-sided put/get + barrier
granularity	전체 tensor	tile 단위
overlap	cuda stream 으로 어색하게	같은 커널 안에서 자연
topology 인식	자동 — ring/tree	사용자가 명시
scale 한계	1000+ GPU 까지 검증	node 단위 (확인 필요)
학습 곡선	완만 (한 줄)	가파름 — Triton + SHMEM 모델
최적 use	data-parallel allreduce, gradient sync	TP/SP 안 fine-grained overlap

실전 워크플로 — 강의에서 권장하는 형태.

data parallel gradient sync → NCCL/RCCL allreduce 한 줄. 단순하고 충분.
tensor parallel 의 forward/backward GEMM → Iris kernel 한 개. compute-communication overlap 으로 throughput 회복.
sequence parallel ring attention → Iris 가 가장 빛나는 자리. ring 단계마다 partial K/V tile 을 fine-grained 하게 가져온다.
expert parallel all-to-all → 강의에서 다뤄졌는지 확인 필요. NCCL all_to_all 보다 token imbalance 에 robust.

§ 07채택 사례· production users

AMD 인프라의 frontline

AMD ROCm 공식 라이브러리 — Iris 가 ROCm 의 multi-GPU Triton 표준 path 로 자리잡고 있다 (확인 필요). MI300X / MI325X / MI355X 에 default 로 들어감.
vLLM AMD 백엔드 — TP serving 의 핫패스. Iris 기반 GEMM-allreduce fused kernel.
SGLang / LMDeploy — 일부 fused multi-GPU op 가 Iris 위에서 짜여 있음 (확인 필요).
Megatron-LM AMD fork — NCCL 기반 standard 외에 Iris 기반 TP 구현 옵션.

채택 곡선의 패턴 — AMD 진영의 "NVIDIA 따라잡기" 의 한 점. NVIDIA 의 NVSHMEM + Triton 등가물이 AMD 위에서 표준화된 길로 자리잡으려는 시도. 성공은 ROCm 의 전체 채택 곡선에 의존.

왜 AMD 가 이걸 만드는가

NVIDIA 는 NVSHMEM + Triton + cuDNN 의 깊은 통합이 이미 있다. AMD 는 그 layer 가 부족했다. Iris 는 그 공백을 정확히 메우는 piece — Triton 위에서 multi-GPU 를 표현 가능하게.

§ 08한계· caveats

fine-grained 의 비용 — 그리고 추상의 누수

node 간 RDMA 의 overhead — 같은 node 의 NVLink/Infinity Fabric 위에서는 빠르지만 multi-node 의 InfiniBand RDMA 는 latency 큼. tile 단위 통신이 오히려 NCCL 보다 느려질 수 있다.
scale 의 한계 — NCCL 은 1000+ GPU 까지 검증, Iris 는 강의 시점에 single-node 또는 small multi-node 위주. 큰 cluster 에서의 동작은 확인 필요.
SymmetricHeap 의 메모리 비용 — 모든 rank 에 동일 VA 가 mapped 되어야 하므로 메모리 fragmentation 비용이 있다.
디버깅의 어려움 — 한 rank 의 wrong store 가 다른 rank 의 silent corruption 으로 나타난다. 단일 GPU Triton 보다 디버깅 난이도 큼.
NVIDIA 위 portability — Iris 는 AMD 출신. NVIDIA 위 NVSHMEM 백엔드가 가능하지만 강의 시점의 성숙도 확인 필요.
AMD 외 채택의 inertia — 산업 default 가 NCCL. Iris 의 가치가 깊지만 학습 곡선 + AMD-bound 가 채택을 늦춘다.

§ 09다음 단계· future work

강의에서 명시적으로 던진 다음 자리

multi-node 검증 — 16 / 32 node 위에서 Iris 의 성능 곡선. RDMA overhead 의 amortization 가능성.
NVIDIA NVSHMEM 백엔드 — 같은 추상이 NVIDIA 인프라 위에서도 동작하도록.
autotuning 통합 — tile size, communication granularity 의 자동 결정.
collective 의 자동 합성 — 사용자가 high-level 표현 (allreduce 한 줄) 을 적으면 Iris 가 fine-grained 으로 자동 lowering.
compiler-level fusion — torch.compile 또는 FlexAttention 같은 high-level 표현에서 Iris 백엔드로 자동 lowering.
FlashAttention with Iris — ring attention 의 fully-fused 버전.

"NCCL 의 collective 가 default 인 것은 abstractions 의 합의 — Iris 는 그 합의 아래로 한 단계 내려가서 새 자리를 만든다."학습 노트

§ 10기억할 메모· key takeaways

다시 열었을 때 손에 잡혀야 할 것

multi-GPU Triton

NCCL 위 wrapper 가 아닌 — kernel 안에서 directly remote put/get 가능. AMD 출신 Iris.

SymmetricHeap

모든 rank 에 동일 VA. NVSHMEM/SHMEM 의 표준 model. one-sided communication 의 base.

DistributedTensor

rank 별 shard 의 tile-level view. 한 program 이 자기 + remote shard 를 indexing.

load_remote / store_remote

RDMA / NVLink put/get. async, tile 단위. atomic_add_remote 도 있음.

fine-grained barrier

NCCL 의 group call 보다 tile 단위. signal/wait 으로 program 간 sync.

overlap 의 가치

큰 TP GEMM 에서 NCCL 대비 20-40% wallclock 이득 (확인 필요). compute 가 통신을 가린다.

NCCL 대체 아님

data-parallel gradient sync 는 NCCL. TP/SP 안 fine-grained 자리에 Iris.

AMD 인프라

MI300+ 의 multi-GPU Triton 표준 path. ROCm 의 NVSHMEM 등가물 자리.

YouTube강의 영상 (확인 필요)

Repogithub.com/ROCm/iris (확인 필요)

손에 새기기 — 실습 시퀀스

local Iris setup — 2-GPU 기준으로 hello-world. SymmetricHeap allocate, simple put/get.
TP GEMM 직접 짜기 — 4096³ matmul 의 4-way TP 를 Iris 로. NCCL allreduce baseline 과 wallclock 비교.
ring attention — sequence parallel attention 의 ring 단계를 Iris 로. K/V tile 의 fine-grained 통신.
tile size sweep — Iris 의 tile size 와 communication granularity 의 trade-off.
NCCL 과 cross-check — 같은 input 의 결과가 두 backend 에서 비트 동일한지 확인. silent drift 점검.
profile 해석 — rocprof 로 timeline 을 dump. compute 와 communication 의 overlap 을 직접 본다.

§ 11다른 강의로의 연결· connections

이 강의가 시리즈 안에서 어디로 이어지는가

L017

GPU collectives / NCCL

collective 추상의 base — Iris 의 출발점

L025

Composable Kernel

AMD 의 GEMM 인프라 — Iris 와 같은 진영

L077

DSLs for GPU Kernels

Iris 가 Triton 추상의 한 확장

L079

Mirage MPK

단일 GPU mega kernel — Iris 의 multi-GPU 등가물 후보

L039

FlashAttention deep dive

ring attention — Iris 와 자연스러운 결합

L055

vLLM internals

TP serving 의 핫패스 — Iris 의 채택 자리

§ 12열린 질문· open questions

다음에 다시 들었을 때 직접 검증해야 할 것들

repo 의 정확한 위치 — github.com/ROCm/iris 추정. 강의 영상에 명시된 url 확인.
NVIDIA backend 의 성숙도 — NVSHMEM 위 동작이 production-ready 인지.
multi-node 측정 — 16+ node 위 측정 결과. RDMA latency 의 dominance 시점.
EP all-to-all 패턴 — Mixture-of-Experts 에서 token routing. Iris 가 자연스럽게 표현 가능한지.
벤치마크의 baseline — § 05 의 20-40% 가 RCCL allreduce baseline 인지, 더 정교한 NCCL+overlap baseline 인지.
호환 PyTorch 버전 — Triton dependency 의 정확한 버전.
FlashAttention-3 with Iris — sequence parallel ring attention 의 fully-fused 버전이 강의에서 다뤄졌는지.

검증 메모

이 노트의 대부분은 NVSHMEM/SHMEM 모델 + Triton 도메인 지식 + AMD 인프라 일반론으로 재구성. Iris 의 정확한 API 명, repo URL, 측정 수치는 영상 직접 확인 필요. 특히 "20-40% 빠름" 은 추정값.

← Lecture 077DSLs for GPU Kernels Lecture 079 →Mirage MPK — Compiling LLMs into Mega Kernels