《The Stack》 EP.42 2026 · APR · 19 38 MIN IMAGINARY

Why did your GPT
end up using my GPU this way?

Replay in reverse the 7-week path of an engineer who walked from vector_add to Flash Attention — and you start to see why the modern LLM evolved into exactly this shape, and how hardware and model co-evolved.

Host

Sam

imaginary · asking from the model side

Guest

Jensen

imaginary · answering from the chip side

CHAPTERS · 6 ACTS · 38:12

08:03 · 01 · Distance is the costliest thing · HBM vs FMA, 100×
13:41 · 02 · Parallelism isn't free · atomic 133 ms vs tree 1 ms
20:02 · 03 · Matmul feeds the LLM · 0.4 → 7.9 TFLOPS
25:38 · 04 · Fusion, the dark art · HBM trips 4 → 2
28:16 · 05 · The moment everything converges · Flash Attention, 65× less
33:07 · 06 · Why PyTorch · torch.ops, single entry point

00 · 02:14 · Cold Open

Let's go in reverse — then everything explains itself.

SAM

Slightly unusual concept today. I watched an engineer learn CUDA from the sidelines. Seven weeks, from vector_add to Flash Attention. And one question kept circling. Why did GPGPU evolve into exactly this shape?

JENSEN

(laughs) Then let's walk it backward. The kernel people call most these days is Flash Attention, right? Go backward from there and almost everything explains itself.

01 · 08:03 · Distance is the costliest thing

Compute is nearly free, and moving data is 100× more expensive.

Inside a GPU, adding one number is 1 ns. Fetching it from HBM is 100 ns. This single ratio governs every decision in GPU programming.

FIG 1 · access cost by tier (lower = faster)

FMA (register) · ~1 ns
L1 / smem · ~5 ns
L2 cache · ~20 ns
HBM · ~100 ns
PCIe H↔D · ~10 µs
InfiniBand · ~2 µs /byte²

What lessons 1 & 2 confirmed in practice

The vector_add kernel: 3.5 ms. The PCIe copy of the same data alone: 60 ms. The kernel is 17× shorter. Without pinned memory, end-to-end runs 5.6× slower.
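
A quick back-of-the-envelope check of that callout, using only the two timings quoted in it. The variable names are illustrative, and treating the copy and the kernel as strictly sequential (no overlap) is an assumption made for this sketch.

```python
# Timings quoted in the callout, in milliseconds.
kernel_ms = 3.5
pcie_copy_ms = 60.0

# How many times longer the PCIe copy is than the kernel itself.
copy_to_kernel_ratio = pcie_copy_ms / kernel_ms

# Share of a copy-then-compute run spent on the bus,
# assuming the two phases do not overlap (an assumption).
transfer_share = pcie_copy_ms / (pcie_copy_ms + kernel_ms)

print(f"copy / kernel  = {copy_to_kernel_ratio:.1f}x")
print(f"transfer share = {transfer_share:.1%}")
```

Under those assumptions the bus accounts for well over 90% of the run, which is why the kernel is rarely the first thing worth optimizing.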

SAM

So… how long does adding one number inside the GPU take?

JENSEN

About 1 nanosecond. Fetching it from HBM: 400–500 cycles, nearly 100 nanoseconds. Compute is nearly free, and data movement is 100× more expensive.

"The moment you leave the silicon, 100×.
That one fact set half the shape of LLM architecture."— Jensen, 09:41

SAM

So the game isn't "compute faster," it's "avoid moving data."

JENSEN

Exactly. That's why we built NVLink, pushed HBM density, and grew L2 cache to 48 MB. We keep shrinking the "distance." Physics only lets us go so far.

SAM

So that's why people scream to keep GPUs in the same node when they build clusters.

↳ Lesson 01 · vector_add | Lesson 02 · pinned vs pageable

02 · 13:41 · Parallelism isn't free

Feed a serial algorithm to parallel hardware and the hardware turns serial.

If you have tens of thousands of threads, doing atomic adds together should be fast, right? No. 100× slower. That's why GPUs ship weird instructions like __shfl_down_sync.

SAM

The GPU has tens of thousands of threads. Wouldn't adding everything in parallel be fast?

JENSEN

That's the trap in Lesson 3. Version 1: every thread does atomicAdd on the same address. Guess how many ms?

SAM

… fast, I'd think?

133 ms. Other versions: 1 ms.
100× difference.— Jensen, 14:58

SAM

Wait, so the GPU's weird hardware features, shared memory and warp shuffle, all went into the silicon because software needed these patterns?

JENSEN

That's co-design. Warp shuffle didn't exist in CUDA 1.0. People used tree reductions so much that we added shuffle instructions with Kepler, about five years later, and the __shfl_down_sync form followed in CUDA 9. Now it's standard.

FIG 2A · atomicAdd: a million in a single line (SERIAL)

FIG 2B · tree reduction: log N levels (PARALLEL)
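
The contrast between the two figures can be sketched in plain Python. This is a CPU-side model rather than the Lesson 3 kernel: the function name `tree_reduce` is made up for illustration, and the inner loop stands in for pairs that a GPU would add simultaneously.

```python
def tree_reduce(values):
    """Sum a non-empty list the way a block-level tree reduction does.

    Returns (total, levels). Each level folds pairs that are `stride`
    apart, so levels == ceil(log2(len(values))).
    """
    data = list(values)
    n = len(data)
    levels = 0
    stride = 1
    while stride < n:
        # On a GPU, every pair in this loop is handled by a different thread at once.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
        levels += 1
    return data[0], levels


values = list(range(1, 1025))        # 1024 elements
total, levels = tree_reduce(values)
serial_steps = len(values)           # same-address atomicAdd: one update at a time
print(total, levels, serial_steps)   # 524800 10 1024
```

Ten dependent levels against 1024 serialized updates is the shape of the gap; the measured 133 ms vs 1 ms also includes effects this model ignores.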

__syncthreads() is not free either

Remove five of them and the kernel runs 29% faster. In small kernels, synchronization can eat about a third of the runtime. That's why "last warp only shuffles, no sync" became an idiom.
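
To make the "last warp only shuffles" idiom concrete, here is a small Python emulation of a shuffle-based warp sum. It models one full 32-lane warp; `shfl_down` and `warp_reduce_sum` are illustrative names, not CUDA APIs, and the emulation only mirrors the data movement of __shfl_down_sync, not its timing.

```python
WARP_SIZE = 32


def shfl_down(lanes, offset):
    """Emulate __shfl_down_sync for a full warp: lane i reads lane i + offset.

    Lanes whose source would fall outside the warp get their own value back.
    """
    return [
        lanes[i + offset] if i + offset < WARP_SIZE else lanes[i]
        for i in range(WARP_SIZE)
    ]


def warp_reduce_sum(lanes):
    """Five shuffle steps (16, 8, 4, 2, 1); lane 0 ends up with the warp total."""
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]


print(warp_reduce_sum(range(32)))   # 496
```

No shared memory and no barrier appear anywhere in the loop, which is the point: within a warp the lanes exchange registers directly.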

↳ Lesson 03 · Four versions of Reduction

03 · 20:02 · Matmul feeds the LLM

Four jumps — and the last one is a new piece of hardware.

Why Lesson 4 was the longest. Tiling and register blocking are software techniques that raise arithmetic intensity; the final breakthrough past the 8-TFLOPS ceiling came from a new unit called the Tensor Core.

FIG 3 · matmul v1 → v4, T4, 4096³ (TFLOPS)

SAM

The axis just jumped at v4.

JENSEN

The first three are "read less HBM" — lifting arithmetic intensity (AI), flops per byte. No matter what you do, FP32 peak 8 TFLOPS is the ceiling. So Volta baked a 4×4 matmul block into the hardware.
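
A rough roofline sketch of what "flops per byte" buys. The 8 TFLOPS ceiling is the figure from the conversation; the 320 GB/s memory bandwidth is an assumption (a commonly quoted T4 spec, not stated in the episode), and the tile model is a textbook simplification rather than the Lesson 4 kernels.

```python
PEAK_FP32_TFLOPS = 8.0   # FP32 ceiling quoted in the episode
MEM_BW_GBPS = 320.0      # ASSUMED memory bandwidth in GB/s; not from the episode
BYTES_PER_FLOAT = 4


def tiled_matmul_intensity(tile):
    """Flops per byte when a tile x tile block of A and of B is loaded once
    and reused for tile**3 multiply-adds (2 flops each)."""
    flops = 2 * tile ** 3
    bytes_loaded = 2 * tile ** 2 * BYTES_PER_FLOAT
    return flops / bytes_loaded


def roofline_tflops(intensity):
    """Attainable throughput: min(compute peak, intensity * bandwidth)."""
    memory_bound = intensity * MEM_BW_GBPS / 1000.0   # GFLOPS -> TFLOPS
    return min(PEAK_FP32_TFLOPS, memory_bound)


for tile in (1, 8, 32, 128):
    ai = tiled_matmul_intensity(tile)
    print(f"tile={tile:>3}  AI={ai:6.2f} flop/B  bound={roofline_tflops(ai):5.2f} TFLOPS")
```

In this model the intensity grows as tile/4, so larger tiles climb the memory-bound slope until they hit the FP32 ceiling. A real kernel can beat the tile=1 bound because caches serve some of the loads.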

SAM

The FP16 / BF16 / FP8 stuff I kept hearing while training GPT-4 in 2023 — all of it was to feed Tensor Cores.

100% — Tensor Cores only work on reduced-precision inputs. The descent to FP4 is about the chip as much as it is about the model.— Jensen, 22:48

occupancy trap

"Bigger tiles raise AI, so they're faster" is only half true. You also need enough blocks to feed all the SMs. That's why decode kernels and prefill kernels look different.

↳ Lesson 04 · Four versions of Matmul
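
The occupancy trap from the callout above can be illustrated with a block count. This is a simplified sketch: the SM count of 40 is an assumption (the usual T4 figure, not stated in the episode), the shapes are hypothetical, and "one resident block per SM" ignores real occupancy limits.

```python
import math

NUM_SMS = 40   # ASSUMED SM count; not stated in the episode


def blocks_launched(m, n, tile_m, tile_n):
    """One thread block per output tile of an (m x n) result."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)


def waves(m, n, tile_m, tile_n, num_sms=NUM_SMS):
    """Blocks per SM, assuming one resident block per SM (a simplification)."""
    return blocks_launched(m, n, tile_m, tile_n) / num_sms


# Prefill-like shape: many rows of activations, big square tiles.
print(blocks_launched(4096, 4096, 128, 128), waves(4096, 4096, 128, 128))
# Decode-like shape: a single row, same big tiles. Fewer blocks than SMs.
print(blocks_launched(1, 4096, 128, 128), waves(1, 4096, 128, 128))
# Same decode shape with a narrow tile that matches the single row.
print(blocks_launched(1, 4096, 1, 32), waves(1, 4096, 1, 32))
```

With the big tile, the decode-like shape launches 32 blocks for 40 SMs, so part of the chip idles no matter how high the arithmetic intensity per block is.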

04 · 25:38 · Fusion, the dark art

Fuse three kernels into one and HBM trips are halved: that's a 2× speedup.

FIG 4A · BEFORE · 3 kernels, 4 HBM trips (unfused)

k1 · max(x) → m · R → W
k2 · exp(x − m) → num, sum · R → W
k3 · num / sum → y · R → W
total · 4 trips · 1.0×

FIG 4B · AFTER · 1 kernel, 2 HBM trips (fused online)

k1 · pass 1: running (m, s) in registers · R
k1 · pass 2: y = exp(x−m)/s · R → W
total · 2 trips · ~2×

SAM

Softmax is mathematically three steps. Fine to just call three kernels, right?

JENSEN

Then you pay 4 HBM trips. Merge the three steps into one kernel and it's 2 trips, so 2× faster. Measured: 2.02×, 1.86×, 1.92×, almost exactly the theoretical figure. That's operator fusion.
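
A minimal sketch of the two shapes Jensen describes, in plain Python on a single row. It shows that the fused version produces the same output while keeping only a running (m, s) in pass 1; it does not model HBM or timing, and the function names are illustrative.

```python
import math


def softmax_three_kernels(x):
    """Unfused: max, then exp and sum, then divide. Each step rereads a full array."""
    m = max(x)                                   # k1
    num = [math.exp(v - m) for v in x]           # k2
    s = sum(num)
    return [v / s for v in num]                  # k3


def softmax_fused_online(x):
    """Fused: pass 1 keeps a running (m, s); pass 2 writes y."""
    m, s = float("-inf"), 0.0
    for v in x:                                  # pass 1: read only
        new_m = max(m, v)
        # Rescale the old sum to the new max before adding the new term.
        s = s * math.exp(m - new_m) + math.exp(v - new_m)
        m = new_m
    return [math.exp(v - m) / s for v in x]      # pass 2: read, then write


x = [0.5, 2.0, -1.0, 3.5, 3.5, 0.0]
a = softmax_three_kernels(x)
b = softmax_fused_online(x)
print(max(abs(p - q) for p, q in zip(a, b)))
```

The intermediate `num` array of the unfused version never exists in the fused one, which is where the saved memory traffic comes from.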

Why not fuse everything into one giant kernel?
Shared memory is finite, code combinations explode, and precision becomes an issue.
It isn't "always good"; it's "a technique you pick only for hot paths." — Jensen, 27:10

SAM

And online softmax — Lesson 5's v3 — felt like it mattered more than it looked.

JENSEN

That's the mathematical half of Flash Attention: the formula that exactly merges (max, sum) when you can't fit the whole row. It appeared in a 2018 paper by NVIDIA engineers (Milakov and Gimelshein); Tri Dao applied it to attention in 2022.

ONLINE MERGE · formula (exact, O(N) memory)

# merging two partial softmax statistics: (m1, s1) and (m2, s2)
from math import exp

new_max = max(m1, m2)
new_sum = (s1 * exp(m1 - new_max)
           + s2 * exp(m2 - new_max))
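
To check that the merge really is exact, the formula can be applied chunk by chunk and compared against statistics computed over the whole row. This is a sketch with made-up data; `partial_stats` and `merge` are illustrative helper names.

```python
import math


def partial_stats(chunk):
    """(max, sum of exp(x - max)) for one chunk of a row."""
    m = max(chunk)
    return m, sum(math.exp(v - m) for v in chunk)


def merge(a, b):
    """Combine two (max, sum) pairs with the online merge formula."""
    (m1, s1), (m2, s2) = a, b
    new_max = max(m1, m2)
    new_sum = s1 * math.exp(m1 - new_max) + s2 * math.exp(m2 - new_max)
    return new_max, new_sum


row = [1.0, 4.0, -2.0, 0.5, 7.0, 3.0, 3.0, -1.5]
chunks = [row[0:3], row[3:6], row[6:8]]

# Fold the chunks left to right, never holding more than one chunk at a time.
stats = partial_stats(chunks[0])
for chunk in chunks[1:]:
    stats = merge(stats, partial_stats(chunk))

full = partial_stats(row)
print(stats, full)
```

Because the merge is symmetric in its two arguments, the chunks could also be combined in a tree, which is what lets different thread blocks work on different parts of a row.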

↳ Lesson 05 · Softmax & Online

05 · 28:16 · The moment everything converges

Flash Attention uses no fundamentally new technique — the combination was just sharp.

It never physically creates the N×N intermediate. All five earlier techniques — coalesced loads, HBM intuition, warp reduce, tiled matmul, online softmax — fit inside one kernel.
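
How the N×N intermediate disappears can be sketched in pure Python. This is a toy, single-head model with made-up inputs: it streams K and V in blocks and keeps only a running (m, s, acc) per query row. The usual 1/sqrt(d) scaling is left out of both versions for brevity, and nothing here reflects the actual kernel layout.

```python
import math


def naive_attention(Q, K, V):
    """Reference: builds every row of the N x N score matrix explicitly."""
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        p = [math.exp(s - m) for s in scores]
        z = sum(p)
        out.append([sum(p[j] * V[j][c] for j in range(len(V))) / z
                    for c in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Streams K/V in blocks; keeps only (m, s, acc) per query row."""
    d_v = len(V[0])
    out = []
    for q in Q:
        m, s, acc = float("-inf"), 0.0, [0.0] * d_v
        for start in range(0, len(K), block):
            k_blk, v_blk = K[start:start + block], V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_m = max(m, max(scores))
            scale = math.exp(m - new_m)   # rescale everything accumulated so far
            p = [math.exp(x - new_m) for x in scores]
            s = s * scale + sum(p)
            acc = [acc[c] * scale + sum(p[j] * v_blk[j][c] for j in range(len(v_blk)))
                   for c in range(d_v)]
            m = new_m
        out.append([a / s for a in acc])
    return out


Q = [[0.1, 0.2], [1.0, -1.0], [0.5, 0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
V = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0], [-1.0, 0.0, 1.0], [4.0, 0.0, 4.0]]
ref, tiled = naive_attention(Q, K, V), tiled_attention(Q, K, V, block=2)
print(max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t)))
```

The tiled version only ever holds `block` scores at a time, so its working set is independent of N, while the result matches the reference to rounding error.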

FIG 5 · HBM traffic at N=4096 · 65× less traffic, 4.79× faster

SAM

65× less should mean 65× faster, shouldn't it?

JENSEN

(laughs) In practice it's 4.79×. Why? The naive version wasn't really HBM-bound: L2 cached a large part of S. FA's real win shows up beyond the size L2 can hold, at N=8k, 16k, 32k. At those sizes the naive version won't even run.
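
The size argument is easy to put in numbers. The sketch below assumes the score matrix S is stored in FP32 for a single head and a single batch element (both assumptions), and compares it with the 48 MB L2 figure mentioned in Act 01.

```python
BYTES_FP32 = 4
L2_BYTES = 48 * 1024 ** 2   # the 48 MB L2 figure from Act 01


def score_matrix_bytes(n, heads=1, batch=1):
    """Size of the N x N score matrix S if it were materialized in FP32."""
    return batch * heads * n * n * BYTES_FP32


for n in (4096, 8192, 16384, 32768):
    size = score_matrix_bytes(n)
    print(f"N={n:>6}  S = {size / 1024**2:8.0f} MiB  ({size / L2_BYTES:6.1f}x L2)")
```

At N=4096 the matrix is 64 MiB, close enough to the cache size that a large part can stay resident. Every doubling of N quadruples it, and real workloads multiply that again by heads and batch.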

The reason GPT-4o's 128k context works is FA-2 + FA-3. Without them, it's impossible.— Jensen, 31:44

FIG 5B · five lessons inside one FA kernel (synthesis)

L1 · Q / K / V coalesced load · → reg
L2 · HBM ↔ L2 intuition, tile sizes · Br, Bc
L3 · warp-reduce for row max / sum · __shfl_xor
L4 · tiled matmul in registers · Q@Kᵀ, P@V
L5 · online (m, s) merge · Act 4 formula

↳ Lesson 06 · Flash Attention (capstone)

06 · 33:07 · Why PyTorch

The single entry point of the modern LLM-serving stack — `torch.ops.*`.

SAM

Until 2018 a CUDA kernel was mostly a standalone binary. Nobody does that now.

JENSEN

Because of the ecosystem. GPT-5 training runs under PyTorch — checkpointing, data loading, FSDP, autograd, optimizers, model zoo, everything. Rebuilding that is insane. So to ship a new kernel you have to plug into PyTorch.

vLLM, FlashAttention-3, Mamba — all CUDA kernels registered under torch.ops.*.
This became the single entry point of production LLM serving.— Jensen, 35:20

FIG 6 · a kernel plugging into the ecosystem (torch.compile fullgraph)

Measurement · Lesson 9

Placed the whole AttentionBlock inside torch.compile(..., fullgraph=True). Graph breaks: 0. eager vs compiled err = 0.00e+00 — bit-for-bit identical.

↳ Lesson 07 · PyTorch custom op | Lesson 09 · MHA causal

So why did it end up looking like this

Running the engineer's 7-week path backward, you can see that Flash Attention was inevitable.

FIG 7 · 6-step inevitability chain (forward → reverse)

SAM

One last thing. Five years from now, what do you think you'll be optimizing besides matmul?

JENSEN

(a brief silence) … let's save that for the next episode.

All numbers come from cudatraining lesson 1–9 handoff docs, measured on T4 sm_75 / L4 sm_89.
