2026 · APR · 18 → 20 · 3 DAYS · T4 → L4

From a single line of vector_add
to 80% of cuDNN FA-2.

9 CUDA / Triton kernels

4.79× Flash Attention speedup

65× HBM traffic reduction

54 TFLOPS Triton fp16 matmul

0.86× relative to cuDNN FA-2

PART I · Memory

vector_add: what the first kernel taught

230 GB/s on T4. But the real bottleneck was not the kernel; it was the copy over PCIe.

LESSON 01

The true cost of copies: pageable vs pinned

The single reason D2H gets 12.7× slower. And a 5.6× end-to-end difference.

LESSON 02

PART II · Parallelism

Reduction: why atomic is 100× slower

shared memory, warp shuffle, and the 29% that a single __syncthreads accounted for.

LESSON 03

Matmul: from memory-bound to Tensor Cores

0.4 → 7.9 TFLOPS. A 20× roofline journey in four layers.

LESSON 04

PART III · Fusion & Attention

Softmax & Fusion: the mathematical half of Flash Attention

Fuse three kernels into one and it runs exactly 2× faster. The birth of online softmax.

LESSON 05
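The online softmax named in this lesson can be sketched in a few lines of plain Python. This is a minimal illustration of the general technique (a running maximum and a rescaled running sum computed in one pass), not the course's kernel; the function names `online_softmax` and `reference_softmax` are hypothetical, and finite inputs are assumed.

```python
import math


def online_softmax(xs):
    """Softmax with the max and the normalizer gathered in a single pass.

    Assumes finite inputs. Names are illustrative, not from the course code.
    """
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # One final pass to normalize with the completed statistics.
    return [math.exp(x - m) / d for x in xs]


def reference_softmax(xs):
    """Conventional version: separate passes for max, sum, and normalize."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]
```

The rescaling step `d * exp(m - m_new)` is what lets the statistics be updated block by block, which is the property a fused attention kernel relies on.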

Flash Attention: five lessons converging into 80 lines

At N=4096, 4.79× faster with 65× less HBM traffic. And where that advantage ends.

LESSON 06 · CAPSTONE

PART IV · Integration

PyTorch Custom Op: one layer broken through in 50 lines

torch.ops.mylib.flash_attention. From this point on, the vLLM code becomes readable.

LESSON 07

Triton port: what does abstraction cost?

Jump to L4. Four kernels ported to Triton. Found a range where it narrowly beats cuBLAS.

LESSON 08

MHA + Causal Flash Attention

4-D shape + causal + torch.compile(fullgraph=True). 80–90% of cuDNN FA-2.

LESSON 09

PART V · Essays

Triton vs CUDA: where does the cost of abstraction show up?

Can 50 lines replace 5000? The answer from four kernels.

ESSAY

How a 300-line Triton FA closes in on 80–90% of cuDNN

IS_CAUSAL: tl.constexpr, loop-skip, custom_op. The sum of three tricks.

ESSAY
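As a rough illustration of why a causal loop-skip matters, the sketch below counts how many (query block, key block) tiles a blocked attention loop would visit with and without it. It assumes "loop-skip" means stopping the key loop at the end of the current query block, which is the common approach but is not confirmed by the essay itself; the function name `kv_tiles_visited` and the block sizes are hypothetical.

```python
def kv_tiles_visited(n_tokens, block_m, block_n, causal):
    """Count the tiles a blocked attention loop would process.

    Hypothetical helper: block_m / block_n are query / key tile sizes.
    """
    visited = 0
    for q_start in range(0, n_tokens, block_m):
        q_end = min(q_start + block_m, n_tokens)  # exclusive end of query tile
        # Under a causal mask, keys at or past q_end are masked for every row
        # in this query tile, so the key loop can stop there.
        kv_limit = q_end if causal else n_tokens
        for _k_start in range(0, kv_limit, block_n):
            visited += 1
    return visited


full = kv_tiles_visited(4096, 64, 64, causal=False)
skipped = kv_tiles_visited(4096, 64, 64, causal=True)
print(full, skipped)
```

With equal tile sizes the causal count approaches half of the full count as the sequence grows, so the skip removes close to half of the loop iterations rather than merely masking their results.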

BONUS · 《The Stack》 Podcast

Why did your GPT end up using my GPU this way?

Sam × Jensen (fictional). Replay the seven-week path in reverse and Flash Attention turns out to have been inevitable.

EP.42 · 38 MIN

What Triton hides, and what it exposes

Seven files laid open, read line by line. When CUDA folds into Triton, what disappears and what remains?

EP.43 · 42 MIN
