LESSON 03 · 2026.04.18 · T4

Reduction — atomic 이 100배 느린 이유

sum(x) 하나를 구하는 5 가지 구현. shared memory, warp shuffle, 그리고 __syncthreads 하나가 만든 29% 의 차이.

GPU · T4 버전 · v1–v4 + thrust sweep · 25 runs

다섯 버전

v1 atomic — 모든 스레드가 하나의 global 주소에 atomicAdd (baseline, 나쁜 예)
v2 shared — 블록당 shared memory 트리, 루트만 atomic
v3 unroll — v2 + 마지막 warp 를 shuffle 로 대체, __syncthreads 5 번 제거
v4 shuffle — warp 가 shuffle 로 로컬 리듀스, warp 간에만 shared memory
thrust — thrust::reduce, cub 기반 multi-pass

버전별 best_ms

n	v1	v2	v3	v4	thrust
2²⁰	2.082	0.018	0.013	0.012	0.028
2²²	8.309	0.070	0.066	0.066	0.089
2²⁴	33.227	0.260	0.258	0.258	0.289
2²⁶	132.903	1.089	1.087	1.090	1.118
2²⁸	531.577	4.860	4.870	4.887	4.648

교훈 1 · v1 의 "2 GB/s floor"

v1 의 유효 대역폭이 n 에 무관하게 ~2.0 GB/s 고정. 이건 HBM 대역폭이 아니라 atomic throughput 상한이다. 100만 스레드가 한 주소에 동시 접근하면 하드웨어가 직렬화 — 실질적으로 1 스레드와 같다.

atomic 은 "결과 1 개를 만드는 장치"로 쓰면 안 된다. 블록당 1 번 (트리 리덕션 루트) 이나, 저빈도 카운터 수준으로.

교훈 2 · v2 → v3 → v4, 각 단계의 기여

구간	Δ time	개선	제거된 비용
v1 → v2	2.082 → 0.018 ms	113×	직렬 atomic → 병렬 트리
v2 → v3	0.018 → 0.013 ms	−29%	`__syncthreads` 5 번
v3 → v4	0.013 → 0.012 ms	−6%	smem → 레지스터 shuffle

__syncthreads() 하나가 수백 cycle. 5 번이면 작은 커널 실행 시간의 1/3 을 차지한다. 이 숫자가 머릿속에 남는다.

교훈 3 · 큰 n 에선 모두 수렴

n ≥ 2²⁴ 부터 v2/v3/v4/thrust 가 ±5% 안에서 동률. 이유는 단순하다. HBM 대역폭이 바닥 (~77%) 에 닿아 있어서, 읽어야 하는 바이트가 고정이면 "가져오는 시간" 이 전체를 지배한다. 리덕션 트리를 어떻게 꾸며도 이 시간은 안 줄어든다.

작은 n 에서만 tail 최적화를 신경 써라. 큰 n 이라면 "코드 단순성 > 커널 내부 미세 최적화".

Thrust 는 언제 이기는가

Thrust 는 cub 기반 multi-pass: 1 차로 블록당 부분합을 임시 버퍼에 쓰고 2 차 커널로 최종. 오버헤드 때문에 작은 n 엔 불리, 큰 n 에서 HBM 스케줄링이 유리해 역전한다. 정확도는 thrust 가 항상 최고 — 트리 깊이가 균형잡혀 FP 오차 상쇄가 유리.

왜 직접 짜보는가

프로덕션에서 답이 thrust/cub 이라면, v1-v4 를 손으로 짜는 이유는 하나다. "왜 atomic 이 100배 느린가" 를 숫자로 기억하는 엔지니어와 그렇지 않은 엔지니어의 판단력 차이. Triton, Mojo, 새로운 컴파일러를 평가할 기준점이 필요하다. fusion 결정, bottleneck 추정, roofline 독해 — 전부 이 감각 위에서 움직인다.

Prev · 02pageable vs pinned Next · 04matmul

LESSON 03 · 2026.04.18 · T4

Reduction — why atomic is 100× slower

Five implementations for sum(x). Shared memory, warp shuffle, and the 29% one __syncthreads made.

GPU · T4 versions · v1–v4 + thrust sweep · 25 runs

Five versions

v1 atomic — every thread does atomicAdd on a single global address (baseline, bad example)
v2 shared — per-block shared-memory tree, only the root does an atomic
v3 unroll — v2 + final warp replaced with shuffles, 5 __syncthreads removed
v4 shuffle — warps reduce locally via shuffle, shared memory only between warps
thrust — thrust::reduce, cub-based multi-pass

best_ms by version

n	v1	v2	v3	v4	thrust
2²⁰	2.082	0.018	0.013	0.012	0.028
2²²	8.309	0.070	0.066	0.066	0.089
2²⁴	33.227	0.260	0.258	0.258	0.289
2²⁶	132.903	1.089	1.087	1.090	1.118
2²⁸	531.577	4.860	4.870	4.887	4.648

Lesson 1 · v1's "2 GB/s floor"

v1's effective bandwidth is pinned at ~2.0 GB/s regardless of n. That's not an HBM ceiling — it's an atomic throughput ceiling. When a million threads hit the same address simultaneously, the hardware serializes — effectively one thread.

Never use atomic as "the thing that produces one result." Use it once per block (the tree-reduction root), or as a low-frequency counter.

Lesson 2 · contribution of each step v2 → v3 → v4

step	Δ time	speedup	cost removed
v1 → v2	2.082 → 0.018 ms	113×	serial atomic → parallel tree
v2 → v3	0.018 → 0.013 ms	−29%	5 `__syncthreads`
v3 → v4	0.013 → 0.012 ms	−6%	smem → register shuffle

A single __syncthreads() costs hundreds of cycles. Five of them take up 1/3 of a small kernel's runtime. A number worth remembering.

Lesson 3 · at large n, everything converges

From n ≥ 2²⁴, v2/v3/v4/thrust all tie within ±5%. The reason is simple. HBM bandwidth is bottomed out (~77%) — once the bytes-to-read is fixed, "fetch time" dominates everything. No reduction tree layout shrinks that time.

Only worry about tail optimization at small n. At large n, prefer "code simplicity > in-kernel micro-optimization."

When does Thrust win?

Thrust is cub-based multi-pass: pass 1 writes per-block partial sums to a temp buffer, pass 2 finalizes. Overhead hurts at small n, HBM scheduling helps at large n — hence the crossover. Thrust always wins on accuracy — balanced tree depth cancels FP error more favorably.

Why write these by hand?

If the production answer is thrust/cub, the reason to hand-write v1–v4 is one thing: the gap in judgment between an engineer who remembers "why atomic is 100× slower" in numbers and one who doesn't. You need a reference point to evaluate Triton, Mojo, or the next compiler. Fusion decisions, bottleneck estimates, roofline reading — all of it runs on this sense.

Prev · 02pageable vs pinned Next · 04matmul