LESSON 06 · 2026.04.18 · T4 · CAPSTONE

Flash Attention — five lessons converging into 80 lines

Tiled matmul + online softmax + HBM-traffic reduction. Everything lessons 1 to 5 built, compressed into a single kernel. At N=4096 it runs 4.79× faster than the naive version and moves 65× less data through HBM.

GPU: T4 · d = 64, FP32 · sweep: 8 runs

Two implementations

naive: 3 kernels (QK → softmax → PV). Fully materializes the intermediates S and P (each N×N) in HBM.
flash: single kernel. Q is split into blocks of Br=64 rows, K/V into blocks of Bc=32 rows (Bc columns of S). For each tile it computes Q@K^T, applies the online-softmax update, and accumulates P@V in one pass. S and P never touch HBM.
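
The two data flows above can be sketched on the CPU in plain Python. This is a minimal reference under stated assumptions, not the lesson's CUDA kernel: the function names, the 1/sqrt(d) scaling, and the max-subtracted softmax are assumptions, and the loops stand in for what thread blocks and warps would do on the GPU.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(Q, K, V):
    """Builds every full row of S and P before touching V (the materializing path)."""
    scale = 1.0 / math.sqrt(len(Q[0]))   # assumed scaling, not stated in the lesson
    dv = len(V[0])
    out = []
    for q in Q:
        s = [scale * dot(q, k) for k in K]        # one full row of S
        m = max(s)
        p = [math.exp(x - m) for x in s]          # one full row of P (unnormalized)
        l = sum(p)
        out.append([sum(p[j] * V[j][c] for j in range(len(V))) / l for c in range(dv)])
    return out

def flash_attention(Q, K, V, Br=64, Bc=32):
    """Tiles Q by Br rows and K/V by Bc rows; keeps only running (m, l, acc) per query row."""
    n, dv = len(Q), len(V[0])
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = [None] * n
    for i0 in range(0, n, Br):                    # one Q block (one thread block on the GPU)
        rows = range(i0, min(i0 + Br, n))
        m = {i: -math.inf for i in rows}          # running row max
        l = {i: 0.0 for i in rows}                # running softmax denominator
        acc = {i: [0.0] * dv for i in rows}       # running unnormalized output row
        for j0 in range(0, len(K), Bc):           # stream K/V tiles past this Q block
            cols = range(j0, min(j0 + Bc, len(K)))
            for i in rows:
                s = [scale * dot(Q[i], K[j]) for j in cols]   # Bc scores, never stored
                m_new = max(m[i], max(s))
                alpha = math.exp(m[i] - m_new)    # rescales the old state to the new max
                p = [math.exp(x - m_new) for x in s]
                l[i] = alpha * l[i] + sum(p)
                acc[i] = [alpha * acc[i][c] + sum(pj * V[j][c] for pj, j in zip(p, cols))
                          for c in range(dv)]
                m[i] = m_new
        for i in rows:
            out[i] = [x / l[i] for x in acc[i]]   # normalize once at the end
    return out
```

The only state that survives between K/V tiles is (m, l, acc) per query row, which is why nothing of size N×N ever needs to be written out.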

Results

N	naive (ms)	flash (ms)	speedup	naive HBM (MB)	flash HBM (MB)	HBM ratio
512	0.402	0.593	0.68×	4.5	0.5	9×
1024	1.088	0.881	1.24×	17.0	1.0	17×
2048	3.076	1.235	2.49×	66.0	2.0	33×
4096	11.857	2.477	4.79×	260.0	4.0	65×

GFLOP/s (by compute count), from N=512 to N=4096: naive 169 → 366, flash 115 → 1754 (22% of the T4's roughly 8.1 TFLOP/s FP32 peak). Accuracy: both max_abs_err < 5e-7.
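
As a sanity check on the throughput figures, here is a small sketch that recomputes GFLOP/s from the table's timings. It assumes a matmul-only count of 4·N²·d FLOPs (2·N²·d each for Q@K^T and P@V); the lesson does not state its exact FLOP count, so the results land within a few percent of the reported numbers rather than matching them exactly. The function name is hypothetical.

```python
T4_FP32_PEAK_GFLOPS = 8100.0  # T4 spec-sheet FP32 peak, about 8.1 TFLOP/s

def attention_gflops(n, d, time_ms):
    """GFLOP/s assuming 4*N*N*d FLOPs (matmuls only, softmax work ignored)."""
    flops = 4 * n * n * d
    return flops / (time_ms * 1e-3) / 1e9

# (N, naive ms, flash ms) rows taken from the results table
rows = [(512, 0.402, 0.593), (1024, 1.088, 0.881), (2048, 3.076, 1.235), (4096, 11.857, 2.477)]
for n, naive_ms, flash_ms in rows:
    g_naive = attention_gflops(n, 64, naive_ms)
    g_flash = attention_gflops(n, 64, flash_ms)
    print(f"N={n}: naive {g_naive:.0f}, flash {g_flash:.0f} GFLOP/s "
          f"({100 * g_flash / T4_FP32_PEAK_GFLOPS:.0f}% of peak)")
```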

Lesson 1 · HBM reduction scales N² → N·d

At N=4096, d=64:

naive HBM ≈ (4N² + 4N·d) floats × 4 B = 260 MB
flash HBM ≈ 4N·d floats × 4 B = 4 MB

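The ratio is N/d + 1 = 65×. Naive traffic grows as N² while flash traffic grows as N·d, so the ratio widens linearly in N (9× → 17× → 33× → 65× as N doubles). This is the whole of FA. The rest is engineering that converts this reduction into time.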
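
The traffic model can be written out as a short sketch. It assumes one plausible reading of the 4N² term (S written once and read once, P written once and read once) and counts Q, K, V, and O once each; the function names are hypothetical. Under this model the table's "MB" comes out as 2^20 bytes.

```python
BYTES_FP32 = 4
MIB = 2**20

def naive_hbm_bytes(n, d):
    # Assumed breakdown: write S, read S, write P, read P (4 passes over N*N),
    # plus Q, K, V, O once each (4 passes over N*d).
    return (4 * n * n + 4 * n * d) * BYTES_FP32

def flash_hbm_bytes(n, d):
    # Idealized: Q, K, V, O once each; repeated K/V loads per Q block are not counted.
    return 4 * n * d * BYTES_FP32

for n in (512, 1024, 2048, 4096):
    naive, flash = naive_hbm_bytes(n, 64), flash_hbm_bytes(n, 64)
    print(f"N={n}: naive {naive / MIB:.1f}, flash {flash / MIB:.1f}, ratio {naive / flash:.0f}x")
```

Dividing the two expressions gives N/d + 1, so the ratio column doubles (minus one) each time N doubles.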

Lesson 2 · Crossover — naive is faster at N=512

At N=512, flash is actually slower (0.68×). Reasons:

naive's S (1 MB) fits in T4's L2 (4 MB) → actual HBM loads are far below the theoretical count
flash re-streams all of K/V for every Q block → repeated loads and per-tile rescaling overhead
N/Br = 512/64 = 8 Q blocks cannot fill T4's 40 SMs → most SMs sit idle
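
The first and third reasons are simple arithmetic, sketched below. It assumes one thread block per Q block with a single head and batch size 1 (the lesson does not state the launch configuration); the helper names are hypothetical.

```python
T4_SM_COUNT = 40
T4_L2_BYTES = 4 * 2**20   # 4 MB L2 on T4

def q_block_count(n, br=64):
    """Number of Q row blocks, i.e. thread blocks under the one-block-per-Q-block assumption."""
    return (n + br - 1) // br

def score_matrix_bytes(n, bytes_per_elem=4):
    """Size of one FP32 N x N matrix such as S."""
    return n * n * bytes_per_elem

for n in (512, 1024, 2048, 4096):
    blocks = q_block_count(n)
    fits = score_matrix_bytes(n) <= T4_L2_BYTES
    print(f"N={n}: {blocks} blocks for {T4_SM_COUNT} SMs, S fits in L2: {fits}")
```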

"FA is always faster" is false. It wins once sequence length is long. That's why FA is decisive for real LLM prefill (N=4096–32k).

Lesson 3 · HBM reduction vs wall-clock speed

At N=4096, HBM drops 65× but wall-clock only 4.79×. Why?

naive isn't really HBM-bound: 260 MB in 11.857 ms is an effective 22 GB/s, about 7% of T4's 320 GB/s peak. L2 absorbs most of it.
FLOPs are identical: flash only cuts data movement, not compute
flash plateaus at 1.75 TFLOP/s → it is entering a compute-bound regime
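
The effective-bandwidth figure can be reproduced from the table. This sketch takes the table's MB values at face value as 10^6 bytes; reading them as 2^20 bytes gives roughly 23 GB/s for naive, which does not change the conclusion. The 320 GB/s peak is the T4 spec-sheet number, and the function name is hypothetical.

```python
T4_PEAK_GBPS = 320.0  # T4 spec-sheet memory bandwidth

def effective_gbps(traffic_mb, time_ms):
    """Modeled HBM traffic divided by measured time, in GB/s."""
    return (traffic_mb * 1e6) / (time_ms * 1e-3) / 1e9

naive_bw = effective_gbps(260.0, 11.857)   # N=4096 row of the table
flash_bw = effective_gbps(4.0, 2.477)

print(f"naive: {naive_bw:.1f} GB/s ({100 * naive_bw / T4_PEAK_GBPS:.1f}% of peak)")
print(f"flash: {flash_bw:.1f} GB/s ({100 * flash_bw / T4_PEAK_GBPS:.1f}% of peak)")
```

Neither kernel is anywhere near the bandwidth roof at this size, which is consistent with the 65× traffic cut turning into a much smaller wall-clock gain.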

FA's big wins appear on high-bandwidth + high-peak silicon like H100. On T4 the savings are real but not dramatic.

Lesson 4 · Five lessons converge in one kernel

Lesson 01 vector_add  → coalesced load pattern (FA's Q/K/V loads)
Lesson 02 memory      → HBM↔L2 traffic awareness (the reason FA exists)
Lesson 03 reduction   → warp reduce (row max/sum)
Lesson 04 matmul      → tiled matmul (S = Q@K^T, O += P@V)
Lesson 05 softmax     → online (max, sum) update (heart)
Lesson 06 flash       → fusion of the five above

All of that sits inside one 80-line kernel. That's why the FA paper is often described as "a sharp combination of not-particularly-novel techniques."
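
One way to see why the reduction and online-softmax lessons compose: the (max, sum) state forms a reduction operator of its own. The sketch below is an illustration with hypothetical names, not code from the lesson. It merges per-chunk states and checks that the result matches a single pass over the whole row.

```python
import math

def chunk_state(xs):
    """(max, sum of exp(x - max)) for one chunk of scores."""
    m = max(xs)
    return (m, sum(math.exp(x - m) for x in xs))

def merge(a, b):
    """Combine two (max, sum) states; associative, so it can run as a tree or warp reduce."""
    (m1, l1), (m2, l2) = a, b
    m = max(m1, m2)
    if m == -math.inf:            # both states empty: avoid exp(-inf - (-inf)) = exp(nan)
        return (m, 0.0)
    return (m, l1 * math.exp(m1 - m) + l2 * math.exp(m2 - m))

scores = [0.5, -1.0, 3.0, 2.0, -4.0, 1.5, 0.0]
state = (-math.inf, 0.0)          # identity element of the merge
for j in range(0, len(scores), 3):            # stream the row in chunks of 3
    state = merge(state, chunk_state(scores[j:j + 3]))

full = chunk_state(scores)        # single pass over the whole row, for comparison
```

Streaming tiles left to right, as the flash kernel does, is one evaluation order of this merge; a warp-level tree reduce is another.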

LLM-serving translation

Prefill: N in the thousands to tens of thousands. Without FA the N×N intermediates alone exhaust memory. This is where vLLM / TensorRT-LLM call FA-2/FA-3.
Decode: query length = 1 (only the current token). Q is so small that FA's tile structure is overkill. Other optimizations such as Paged Attention matter more.
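
To put numbers on the prefill/decode contrast, the sketch below sizes the FP32 score matrix that a materializing implementation would need. The head count of 32 is a hypothetical example value, not taken from the lesson, and the helper name is an assumption.

```python
GIB = 2**30

def score_matrix_gib(q_len, kv_len, heads=1, bytes_per_elem=4):
    """Memory for one materialized q_len x kv_len score matrix per head, in GiB."""
    return heads * q_len * kv_len * bytes_per_elem / GIB

prefill_one_head = score_matrix_gib(32768, 32768)          # N = 32k prefill, one head
prefill_32_heads = score_matrix_gib(32768, 32768, heads=32)  # hypothetical 32-head layer
decode_one_head = score_matrix_gib(1, 32768)               # one query against a 32k context

print(f"prefill, 1 head:   {prefill_one_head:.1f} GiB")
print(f"prefill, 32 heads: {prefill_32_heads:.1f} GiB")
print(f"decode, 1 head:    {decode_one_head * GIB / 1024:.0f} KiB")
```

For scale, a T4 has 16 GB of device memory, so the prefill case does not fit once more than a few heads are materialized, while the decode case is a single short row.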

Our single-kernel FA explains the dynamics on the prefill side. This closes "CUDA Level 1." Next lesson onward we attach this to PyTorch.
