《GPU Mode》 L009 2024 · MAR · 09 High priority transcript · available

Reductions

PMPP Ch.10 의 정수 — “병렬 sum/max/min 이 왜 어렵고, 어떻게 한 단씩 빠르게 만들어 가는가”. 의존성 있는 첫 패턴 (앞 결과를 다음이 쓰는) 을 GPU 위에 어떻게 풀까. naive 부터 simple shared, control divergence 제거, coarsening, multi-stream, segment reduce 까지 — 같은 sum 알고리즘이 7개의 형태로 진화. 그리고 마지막에 PyTorch 가 일반 reduction 커널을 어떻게 일반화해 두는지의 한 페이지. Mark Saroufim 의 두 번째 “직접 .cu 파일을 빌드하면서 NCU 측정” 워크숍.

tree reduction control divergence shared memory warp shuffle atomic 의 함정 multi-stream segment reduce PMPP Ch.10

Speaker

Mark Saroufim

Meta · PyTorch core · L008 의 직속 후속

강의 번호

L009

스피커

Mark Saroufim

학습 우선순위

High · 정독

다시 볼 때

7개 reduction kernel 직접

하위 목차 · 12개 섹션↓ 클릭해서 이동

01강의가 풀려는 문제의존성 있는 첫 패턴 02왜 reduction 이 trivial 하지 않은가병렬 sum 의 본질 03tree reduction 의 stride 패턴log₂(n) step 04naive — simple_reduce.cudivergence 가 박힌 첫 형태 05divergence 제거 — control_divergence_reducestride 방향을 바꾼다 06shared memory — shared_reduce.cuglobal 트래픽 1/log(n) 07coarsening — reduce_coarsening한 thread 가 여러 element 08multi-stream · segment reduce한 SM 더 채우기 09정확도 — float reduction 의 함정accuracy.py · nondeterminism.py 10기억할 메모와 코드 자료key takeaways · repo 11다른 강의로 이어지는 길connections 12열린 질문open questions

§ 01강의가 풀려는 문제· Why this lecture exists

지금까지 다룬 모든 패턴은 “독립” — 이번엔 의존성

L002~L005 까지의 모든 예제 (vector add, RGB→grayscale, mean filter, matmul) 는 한 thread 가 자기 출력 element 만 책임지는 embarrassingly parallel 패턴이었다. reduction 은 그 가정이 처음 깨지는 자리. 한 출력 (예: sum) 을 만들기 위해 모든 thread 가 협력해야 한다. block 사이의 통신, atomic, syncthreads, warp shuffle 같은 새 도구들이 모두 이 한 영역에서 등장.

강의가 답하는 세 질문.

병렬 sum 이 왜 trivial 하지 않은가 — 의존성 있는 알고리즘의 일반적 어려움.
같은 sum 이 어떤 점에서 한 단씩 빨라지는가 — naive 부터 7개 kernel 의 진화.
PyTorch 의 일반 reduction 은 어떻게 동작하는가 — min, max, norm, mean 이 한 dispatch path 에서.

시리즈 안의 위치

이 강의는 L008 CUDA Performance Checklist 의 직접 후속. 같은 “.cu 파일 빌드 후 NCU 로 검증” 의 워크숍 형식. 그리고 L012 FlashAttention 의 online softmax 가 reduction 의 가장 정교한 형태 — 이 강의가 그 prequel.

“embarrassingly parallel 의 다음 단 — 의존성을 GPU 의 layered 구조로 어떻게 푸는가.”학습 노트 · L009 §01

§ 02왜 reduction 이 trivial 하지 않은가· 병렬 sum 의 본질

한 출력을 만들려면 모든 입력의 “정보” 가 그 한 자리에 모여야 한다

시퀀셜 sum 은 한 줄 — for i: s += a[i]. 의존성이 있다 (다음 step 이 이전 step 의 결과를 본다). 단순히 thread N 개로 쪼개면 — 각 thread 가 서로의 결과를 기다려야 한다. 답은 tree다.

n 개 element 의 sum 을 첫 step 에서 n/2 개의 pair 로 모은다 (각 pair 가 1 thread).
다음 step 에서 n/4 개, 그 다음 n/8 ... 총 log₂(n) step.
각 step 사이에 모든 thread 의 진행이 sync — block 안이면 __syncthreads, block 사이면 kernel boundary 또는 atomic.

이 tree 패턴이 모든 reduction 알고리즘의 base. min, max, norm, dot product, softmax 의 normalize 부분 모두 같은 형태. “연산이 결합 법칙을 만족하면 (associative) tree 로 풀 수 있다” 가 reduction 알고리즘의 충분 조건.

왜 “block 사이” 통신이 어려운가

L004 §03 에서 깐 사실 — block 사이 동기화는 kernel boundary 가 유일. 즉 큰 reduction 은 한 kernel 으로 끝낼 수 없다. block 안에서 부분합 → block 별 출력을 따로 저장 → 두 번째 kernel 이 그것들을 다시 reduce. 또는 atomic 으로 마지막 step.

§ 03tree reduction 의 stride 패턴· log₂(n) step

같은 tree 를 어떤 방향으로 그리느냐 — naive vs sequential addressing

tree reduction 의 step 별 active thread 와 그 지속 거리는 두 방식 — “인접 pair” vs “half-half”. 같은 알고리즘이지만 warp 활용도가 다르다.

핵심 차이 — naive 는 active thread 가 stride 마다 “alternate” 한다 (t0, t2, t4, … vs t0, t4, t8, …). 같은 warp 안의 thread 들이 active/idle 로 갈라져 — divergence. sequential addressing 은 “첫 절반 → 첫 1/4 → …” 로 active 한 thread 들이 모인다 — 한 warp 이 한 path.

이 한 시각 차이가 강의의 핵심 진화. naive (simple_reduce.cu) 가 “인접 pair” 방식이고 — 후속 버전이 모두 sequential addressing 으로.

§ 04naive · simple_reduce.cu· divergence 가 박힌 첫 형태

PMPP 책의 첫 형태 — 정확하지만 warp divergence 가 심하다

강의의 첫 reduction. 한 thread 가 인접 pair 를 합치고 sync, 그 결과를 다음 step 의 wider stride 로 다시 합치고 sync. 코드는 짧고 명확하지만 warp divergence 가 심하다.

// simple_reduce.cu — 강의 repo 그대로
__global__ void SimpleSumReductionKernel(float* input, float* output)
{
    unsigned int i = 2 * threadIdx.x;
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (threadIdx.x % stride == 0) {                    // ← divergence 의 자리
            input[i] += input[i + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) *output = input[0];
}

stride=1

t0t1t2t3t4t5t6t7 t8t9t10t11t12t13t14t15

stride=2

t0t1t2t3t4t5t6t7 t8t9t10t11t12t13t14t15

stride=4

t0t1t2t3t4t5t6t7 t8t9t10t11t12t13t14t15

한 warp 안에 active 와 idle 이 섞임 → 모든 step 이 warp divergence 를 겪는다. 동시에 모든 read/write 가 global memory 에 — bandwidth 도 비효율.

이 첫 형태는 정확하다. 작은 입력 (size = 2048) 에서는 결과가 맞고, 강의의 메시지는 “이게 한참 부족한 출발점이다, 다음 단계들이 같은 일을 빠르게”.

§ 05divergence 제거· control_divergence_reduce

stride 방향을 바꾼다 — active thread 들이 한 warp 에 모이도록

두 번째 형태는 sequential addressing. stride 를 1, 2, 4, ... 로 키우는 대신 N/2, N/4, N/8 ... 로 줄인다. 결과 — active thread 들이 첫 절반에 모이고, 그 다음 첫 1/4 에 모이고, ... 한 warp 이 한 path 를 유지.

// control_divergence_reduce.cu — sequential addressing
__global__ void ConvergedSumReduction(float* input, float* output)
{
    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
        if (t < stride) {                          // 첫 stride 개 thread 만 active
            input[t] += input[t + stride];
        }
        __syncthreads();
    }
    if (t == 0) *output = input[0];
}

stride=8

t0t1t2t3t4t5t6t7 t8t9t10t11t12t13t14t15

stride=4

t0t1t2t3t4t5t6t7 t8t9t10t11t12t13t14t15

stride=2

t0t1t2t3t4t5t6t7 t8t9t10t11t12t13t14t15

active 와 idle 의 경계가 warp 경계와 정렬. 첫 step 에서 한 warp 전체가 active. 마지막 step 들은 한 warp 만 active 지만 — 적어도 한 warp 안에서는 같은 path.

이 변환의 효과 — warp execution efficiency 가 극적으로 좋아짐. NCU 가 직접 보여준다. 같은 알고리즘, 같은 정확도, 측정 가능한 시간 차이.

§ 06shared memory· shared_reduce.cu

중간 결과를 global memory 에 안 둔다 — log(n) 번의 HBM 왕복 → 1번

이전 두 버전은 모든 step 에서 input[t] += input[t + stride] — global memory 에 read/write. log(n) step × 2 = 2 log(n) HBM access per element. shared memory 로 옮기면 그 모두가 on-chip 으로 사라진다.

// shared_reduce.cu — 강의 repo 그대로
__global__ void SharedMemoryReduction(float* input, float* output)
{
    __shared__ float input_s[BLOCK_DIM];
    unsigned int t = threadIdx.x;
    // 첫 add 도 동시에 — 한 thread 가 두 element 를 한 번에
    input_s[t] = input[t] + input[t + BLOCK_DIM];
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];           // ← shared, 빠름
        }
    }
    if (t == 0) *output = input_s[0];
}

강의에서 Mark 의 솔직한 한 마디

코드 주석에 그대로 — “This is the code from the book but I couldn't get this to run faster even with occupancy calculator. L1 throughput is dramatically increased though.” 이게 흥미로운 메시지: NCU 의 L1 throughput 카운터는 좋아 보이는데 실제 시간은 안 줄어들 수 있다. 작은 입력 (size=2048) 에서는 launch overhead 가 dominant 해서 어떤 변환의 효과도 묻힌다.

이 일화의 메시지를 풀어 보면 — 변환의 효과는 입력 사이즈에 의존한다. 큰 입력 (수천만 element) 에서는 shared 로 옮기는 게 큰 차이를 만들지만, 작은 입력에서는 거의 차이 없을 수 있다. NCU 의 metric 이 좋아 보이는 것과 wall-clock 시간이 정렬되지 않는 자리.

크기 한계의 함정

Mark 가 강의에서 직접 짚은 자리 — “입력을 2048 보다 키웠더니 결과가 0 으로 나왔다.” 원인 — 이 코드가 입력 전체가 한 block 의 shared 에 들어간다고 가정하기 때문. 즉 size 가 BLOCK_DIM 의 두 배보다 크면 안 됨. 더 큰 입력엔 — 두 단계 (block 안 reduce + block 끼리 reduce) 의 hierarchical 형태가 필요.

§ 07coarsening · reduce_coarsening· 한 thread 가 여러 element

L008 의 [04] coarsening 을 reduction 에 적용

L008 §06 의 thread coarsening 이 reduction 에 가장 자연스럽게 적용된다. 한 thread 가 시작할 때 4 또는 8 element 를 register 에 누적해 sum 을 만든 뒤, 그 sum 들을 shared 에 적고 tree reduction.

// reduce_coarsening.cu — coarsen factor 4 의 골격
__global__ void CoarsenedReduction(float* input, float* output, int n)
{
    __shared__ float input_s[BLOCK_DIM];
    unsigned int t = threadIdx.x;
    unsigned int base = blockIdx.x * blockDim.x * 4 + t;

    float sum = 0;                                    // register 에 누적
    sum += input[base + 0 * blockDim.x];
    sum += input[base + 1 * blockDim.x];
    sum += input[base + 2 * blockDim.x];
    sum += input[base + 3 * blockDim.x];
    input_s[t] = sum;

    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) input_s[t] += input_s[t + stride];
    }
    if (t == 0) atomicAdd(output, input_s[0]);     // block 끝에 1번 atomic
}

이 형태가 실용적으로 가장 빠른 패턴. 이유들.

register accumulator — 첫 4 add 가 모두 register 에서. shared 에 한 번만 적는다.
coalesced HBM read — thread 들이 stride blockDim.x 로 떨어진 element 를 읽음. 한 warp = 한 transaction.
blocks 끼리 atomic 한 번 — block 의 partial sum 을 global atomic 으로 모음. block 수만큼만의 atomic.
한 kernel 으로 끝 — 두 단계 launch 안 함. atomic 의 결과가 곧 답.

왜 block 끝에 atomic 이 빠른가

block 수가 input 사이즈에 비해 작다 (예: input 1B element, block 256 thread × coarsen 4 = block 1024 element → block 수 ~1M). atomic 의 contention 은 block 수에 비례 — 1M 번이 100M 번보다 1/100 빠름. privatization 의 일반화 (L008 §07).

“가장 빠른 reduction 은 ‘한 thread 안에서 가능한 한 많이 누적, 그 다음 shared 에서 tree, 마지막에 atomic 한 번’.”학습 노트 · L009 §07

§ 08multi-stream · segment reduce· 한 SM 더 채우기

여러 reduction 을 동시에 — 또는 한 입력의 다른 segment 를 따로

강의의 multistream-reduce.cu 와 segment_reduce.cu 가 마지막 두 형태. multi-stream 은 여러 다른 reduction (예: per-row sum 의 row 별) 을 다른 CUDA stream 에서 동시에 띄움 — 한 reduction 의 launch overhead 를 다른 reduction 의 compute 와 overlap.

segment reduce 는 한 input 을 여러 segment 로 나누고 각 segment 의 sum 을 따로 — “per-row sum” 의 한 형태. PyTorch 의 tensor.sum(dim=…) 이 이 형태로 lower 됨.

multistream여러 reduction 동시 같은 입력의 다른 segment 또는 다른 입력들을 별개 stream 에서. cudaStreamCreate, kernel launch 시 stream 인자. overlap +10-30%

segment reduceper-row, per-group 한 thread block = 한 segment. 같은 block 안에서 self-contained reduction. 마지막 atomic 도 segment-별 출력에. parallel rows scaling

실전에서의 segment reduce

이 패턴이 layernorm, RMSnorm, softmax 의 backbone. 각 row (sequence position) 가 자기 reduction 을 따로. attention 의 softmax, batch norm 의 mean/var, beam search 의 score sort 모두 segment reduce 의 변형.

강의의 마지막 한 페이지는 — “PyTorch 안에 min, max, sum 의 분리된 kernel 이 없다, 한 일반 reduction kernel 이 op 인자로 받음”. 이 사실의 의미가 흥미롭다. 같은 코드가 여러 op 에 reuse 된다는 추상의 가치, 그리고 — 커스텀 reduction 도 같은 path 위에 layered 가능.

§ 09정확도 — float reduction 의 함정· accuracy.py · nondeterminism.py

병렬 sum 은 시퀀셜 sum 과 같은 답을 주지 않는다 — 결합 법칙이 깨진다

강의에서 깐 가장 미묘한 자리. accuracy.py 와 sensitivity.py 가 보여주는 사실 — float 산술은 결합 법칙을 만족하지 않는다. (a + b) + c ≠ a + (b + c) 인 케이스가 있다. 그런데 tree reduction 은 시퀀셜과 다른 순서로 sum — 결과가 다를 수 있다.

시퀀셜((1000 + 0.001) + 0.001) + … + 0.001~1000.5 (가능)

tree(1000 + 0.001) + (0.001 + 0.001) + …~1001.0

매 run 다른 순서block 끝의 atomic 순서가 비결정매번 다름

accuracy.py 의 핵심 예 — 1000.0 (fp32) 에 0.001 을 1000번 더한다. fp32 의 mantissa 가 1000 근처에서는 0.001 을 표현하지 못함 — 그래서 더해도 1000 그대로. 즉 시퀀셜이면 답이 1000.0 인데 — tree 로 0.001 들을 먼저 모으면 1.0 이 되고 그게 1000.0 에 더해져 1001.0. 시퀀셜과 tree 의 답이 다르다.

# accuracy.py — 강의 repo (요약)
import torch
large_value = torch.tensor([1000.0], dtype=torch.float32)
small_values = torch.full((1000,), 0.001, dtype=torch.float32)

# 시퀀셜
res = large_value.clone()
for v in small_values:
    res += v
# res = ~1000.0  (작은 값들이 mantissa 에 쌓이지 못함)

# tree (PyTorch 의 sum)
res2 = large_value + small_values.sum()
# res2 = ~1001.0  (작은 값들이 먼저 모여 1.0 이 됨)

그리고 nondeterminism.py 가 보여주는 더 큰 함정 — 같은 input, 같은 코드, 다른 run 에서 결과가 다르다. 이유는 atomic 의 순서가 비결정. block 끝의 atomic add 가 어떤 순서로 일어나느냐가 GPU 의 scheduling 에 의존. “training 의 reproducibility 가 깨지는 자리”.

실전에서의 의미

training 의 reproducibility 를 원하면 — PyTorch 의 torch.use_deterministic_algorithms(True) 와 같은 설정으로 atomic 을 피하는 (느린) reduction kernel 로 fallback. 속도와 결정성의 trade-off. 정확한 정확도가 아닌, “매번 같은 답” 의 문제.

“GPU reduction 은 fast 이고, 하지만 ‘determinism 과 fast 의 둘 다’ 는 어렵다 — 이게 numerical reproducibility 의 가장 흔한 함정.”Mark Saroufim · L009

§ 10기억할 메모와 코드· key takeaways · repo

다시 열었을 때 5분 안에 손으로 잡혀야 할 것

7개 reduction 의 진화와 각 단계의 핵심 한 줄.

tree reduction

log₂(n) step. 각 step 에 active thread 절반. associative op 라면 적용 가능.

sequential addressing

stride 를 N/2, N/4 ... 로 줄임. active thread 들이 첫 절반에 모여 — warp 한 단위로 align.

shared memory

log(n) HBM 왕복을 1번으로. 단 입력 전체가 한 block 의 shared 에 들어가야 함.

coarsening

한 thread 가 4-8 element 를 register 에 누적. HBM read 의 효율 + accumulator reuse.

block-level atomic

각 block 의 partial sum 을 atomic 으로 global 에 한 번. block 수만큼만의 atomic.

multi-stream

여러 reduction 을 다른 stream 에서 동시. launch overhead overlap.

segment reduce

per-row, per-group reduction. block 한 개 = segment 한 개. layernorm/softmax 의 base.

float 결합 법칙

시퀀셜 sum ≠ tree sum. 큰 값 + 작은 값 들의 케이스에서 차이가 보인다.

nondeterminism

atomic 의 순서가 GPU scheduling 에 의존 → 매 run 결과 다를 수 있음. torch.use_deterministic_algorithms.

YouTube youtube.com/watch?v=09wntC6BT5o

Slides Google Slides

Code (CUDA) simple_reduce.cu · control_divergence_reduce.cu · shared_reduce.cu · reduce_coarsening.cu · segment_reduce.cu · multistream-reduce.cu

Code (Python) accuracy.py · nondeterminism.py · sensitivity.py · reduce_compile.py

Repo gpu-mode/lectures/lecture_009

참고 PMPP Ch.10 (Parallel Reduction) · Mark Harris “Optimizing Parallel Reduction in CUDA” NVIDIA whitepaper · CUB 라이브러리

손에 새기기 — 실습 시퀀스

7개 .cu 모두 빌드 · 측정 — repo clone, 모든 reduction kernel 의 시간을 자기 GPU 에서. 강의의 진화가 자기 측정에서도 같은 방향으로 나오는가.
입력 사이즈 sweep — size = 1024, 2048, 65536, 1M, 100M. 각 형태의 시간 변화. shared 의 효과가 작은 입력에서는 안 보일 수 있다.
warp execution efficiency 측정 — NCU 의 해당 카운터를 simple vs convergent 에 대해 본다. divergence 제거의 정량적 효과.
coarsen factor sweep — N = 1, 2, 4, 8, 16, 32. 시간과 register 사용량의 관계.
accuracy.py 직접 돌리기 — 시퀀셜 sum 과 PyTorch sum 의 차이를 확인. 1000 + 0.001×1000 의 케이스가 자기 환경에서 재현되는가.
nondeterminism.py 의 결과 — 같은 코드를 여러 번 돌려 결과가 매번 같은지 (또는 다른지) 확인. torch.use_deterministic_algorithms(True) 의 효과.
warp shuffle 추가 도전 — strict tree reduction 의 마지막 32 thread 단계를 __shfl_xor_sync 로 바꿔본다 (강의에서 본격 안 다루지만 NVIDIA whitepaper 의 표준).
한 페이지 plan — 자기 모델의 한 reduction (예: layernorm 의 mean) 을 골라 7개 형태 중 어느 것이 적합한지 — bound 와 segment 수 기반 결정.

§ 11다른 강의로 이어지는 길· connections

이 강의의 reduction 패턴이 어디서 본격적으로 등장하는지

tree reduction · segment reduce · online reduction 의 패턴이 시리즈 거의 모든 LLM kernel 강의에서 다른 옷을 입고 등장.

L001

Profiling CUDA kernels in PyTorch

Mark Saroufim — 도구의 base. 이 강의의 NCU 측정 모두가 그 위에서.

L004

Compute and Memory Basics

Thomas Viehmann — block 사이 통신의 어려움의 정량.

L005

Going Further with CUDA for Python

Jeremy Howard — shared memory 의 첫 도구. 이 강의의 base 도구 중 하나.

L008

CUDA Performance Checklist

Mark Saroufim — 이 강의의 직접 prequel. divergence/coarsening/privatization 의 같은 항목들.

L012

Flash Attention

Tri Dao — online softmax 가 “tile 단위 segment reduce + running max 추적” 의 정수 형태

L013

Ring Attention

Hao Liu — multi-GPU segment reduce. 이 강의의 multi-stream 의 distributed 형태.

L028

Liger Kernel

Byron Hsu — fused layernorm/softmax/cross-entropy 가 모두 segment reduce + 다른 op 의 fusion

L034

Low Bit Triton Kernels

Pavan Balaji — quantized scale 계산이 reduction. 이 강의의 patterns 가 Triton 으로.

§ 12열린 질문· open questions

다음에 다시 들었을 때 직접 검증해야 할 것들

강의에서 흐릿한 자리들과, 자기 GPU 에서 직접 봐야 손에 박힐 사실들.

warp shuffle 의 본격 형태 — 강의에서 직접 다루지 않음. __shfl_xor_sync, __shfl_down_sync 가 마지막 32 thread reduction 의 표준. shared 와 sync 가 사라짐 → 빠름. NVIDIA whitepaper 또는 CUB 의 BlockReduce 코드 참조.
shared_reduce 가 안 빨라지는 이유의 정확한 진단 — Mark 의 코드 주석이 그대로 남아 있는 자리. 입력 사이즈 변화로 효과가 보이는가? 아니면 본질적인 다른 이유가 있는가? (확인 필요)
cooperative groups — block 사이 동기화의 새 메커니즘. cudaLaunchCooperativeKernel. 강의에서 안 다룸. 한 kernel 으로 큰 reduction 을 끝낼 수 있게 함.
CUB 라이브러리 — production reduction 의 사실상 표준. cub::DeviceReduce::Sum. 강의에서 직접 인용 안 함. PyTorch 의 일반 reduction kernel 도 일부 CUB 사용.
Hopper 의 distributed shared memory — block 끼리 직접 shared 를 share 가능. 큰 reduction 의 새 옵션.
online softmax 와의 관계 — 이 강의의 “단순 reduction” 이 attention 의 online softmax 로 가는 길은 — running max 도 같이 추적. L012 FlashAttention 에서.
fp16/bf16 reduction 의 정확도 — accuracy.py 가 fp32 의 케이스. fp16 에서는 mantissa 가 짧아 같은 함정이 더 일찍 발생. mixed precision training 에서의 reduction 은 fp32 accumulator 가 표준.

검증 메모

이 노트의 모든 NCU 측정 추정 (warp execution efficiency, L1 throughput) 은 자기 GPU 에서 직접 측정해야 한다. 강의의 측정도 작은 입력 (size=2048) 에서는 차이가 묻혀 보일 수 있음. 큰 입력 (수천만 element) 에서 진정한 차이가 드러난다.

← Lecture 008 Mark Saroufim — CUDA Performance Checklist Lecture 010 → Oscar Booth — Build a Prod Ready CUDA library