LESSON 04 · 2026.04.18 · T4

Matmul — from memory-bound to Tensor Cores

0.4 → 7.9 TFLOPS. Same GPU, same matrices. Across four implementations the ceiling shifted twice; the last shift is where the compute roof of the roofline jumps vertically.

GPU: T4 · peak: FP32 8.1 TFLOPS, TC 65 TFLOPS · sweep: 4 sizes × 4 versions

Four versions

v1 naive — 1 thread / 1 output. global memory only
v2 tiled — 32×32 shared-memory tile (AI = 8 FLOP/byte)
v3 register — block 128×128, thread tile 8×8, BK=8 (AI = 32 FLOP/byte)
v4 tensor — WMMA API, block 64×64, FP16 inputs → FP32 accumulate
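
The AI figures quoted for v2 and v3 can be reproduced with a few lines of arithmetic. The sketch below assumes FP32 operands (4 bytes), counts one FMA as 2 FLOPs, and counts only the global-memory loads of the A and B tiles per K-step (traffic for C is ignored). It also assumes v2's K-step equals its tile width, 32. The function name is illustrative and not taken from the lesson's kernels.

```python
def tile_arithmetic_intensity(bm, bn, bk, bytes_per_elem=4):
    """FLOP per byte of global-memory traffic for one K-step of a block tile.

    Assumes each block loads a bm x bk tile of A and a bk x bn tile of B
    exactly once per K-step, and that one FMA counts as 2 FLOPs.
    """
    flops = 2 * bm * bn * bk                           # FMAs for the tile product
    bytes_loaded = (bm * bk + bk * bn) * bytes_per_elem
    return flops / bytes_loaded


v2_ai = tile_arithmetic_intensity(32, 32, 32)    # v2: 32x32 tile (BK=32 assumed)
v3_ai = tile_arithmetic_intensity(128, 128, 8)   # v3: 128x128 tile, BK=8
print(v2_ai, v3_ai)  # 8.0 32.0
```

For a square T×T tile the BK term cancels and AI reduces to T/4 for FP32, which is why a 4× wider tile gives 4× the AI.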

TFLOPS matrix

N	v1 naive	v2 tiled	v3 reg	v4 tensor
256	0.22	0.56	0.20 ⚠	1.51
512	0.47	0.75	0.95	3.71
1024	0.40	0.64	1.43	3.59
2048	0.40	0.83	2.05	7.93

At N=2048, v3 reaches 25% of FP32 peak (2.05 / 8.1) and v4 reaches 12% of TC peak (7.93 / 65). The point: plain WMMA alone lands 7.93 TFLOPS, about what a perfect FP32 kernel could reach on this GPU.
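
The percentages above follow directly from the table and the peak figures in the header. A minimal sketch, assuming the usual convention that an N×N×N matmul costs 2·N³ FLOPs (the lesson does not state its FLOP convention); the helper names are hypothetical.

```python
FP32_PEAK_TFLOPS = 8.1   # T4 FP32 peak, from the header
TC_PEAK_TFLOPS = 65.0    # T4 Tensor Core peak, from the header


def matmul_tflops(n, seconds):
    """Achieved TFLOPS for an n x n x n matmul (2*n^3 FLOPs) that took `seconds`."""
    return 2 * n**3 / seconds / 1e12


def implied_seconds(n, tflops):
    """Kernel time implied by a TFLOPS figure for an n x n x n matmul."""
    return 2 * n**3 / (tflops * 1e12)


v3_fraction = 2.05 / FP32_PEAK_TFLOPS   # v3 @ N=2048
v4_fraction = 7.93 / TC_PEAK_TFLOPS     # v4 @ N=2048
print(f"v3: {v3_fraction:.0%} of FP32 peak, v4: {v4_fraction:.0%} of TC peak")
print(f"v4 @ 2048 implies about {implied_seconds(2048, 7.93) * 1e3:.2f} ms per matmul")
```

Under that convention, 7.93 TFLOPS at N=2048 corresponds to a kernel time of a little over 2 ms.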

Lesson 1 · Occupancy trap — v3 is slower than v2 at N=256

v3's block tile is 128×128. At N=256 that's just 4 blocks. 36 of T4's 40 SMs sit idle. v2's 32×32 tiles → 64 blocks → all SMs working.

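Bigger tiles mean higher AI, which is good, but only while the grid still fills the GPU. Rule of thumb: keep the block count ≥ 2 × the SM count (80 blocks on T4). Miss that and the bigger tile runs slower, not faster.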
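
The block counts above are easy to check for every size in the sweep. This sketch assumes a square N×N output cut into square tiles, one block per tile; the function names and the `blocks_per_sm` parameter are illustrative.

```python
import math

T4_SM_COUNT = 40  # from the lesson text


def grid_blocks(n, tile):
    """Number of thread blocks when an n x n output is cut into tile x tile blocks."""
    per_side = math.ceil(n / tile)
    return per_side * per_side


def fills_gpu(n, tile, sm_count=T4_SM_COUNT, blocks_per_sm=2):
    """True if the grid meets the 'blocks >= 2 x SM count' rule of thumb."""
    return grid_blocks(n, tile) >= blocks_per_sm * sm_count


for n in (256, 512, 1024, 2048):
    # columns: N, v2 blocks (32x32), v3 blocks (128x128), v3 meets the rule?
    print(n, grid_blocks(n, 32), grid_blocks(n, 128), fills_gpu(n, 128))
```

By this count v3's 128×128 tile only clears the 80-block threshold at N=2048 (256 blocks). v2's 64 blocks at N=256 cover every SM but also fall short of the 2× margin.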

LLM-serving implication: big matmuls (FFN) are fine, but the small-batch decode attention matmul needs a "small-tile variant."

Lesson 2 · What v3 is really doing

Each thread holds an 8×8 output tile in registers. That gives:

8× fewer reads of the same shared-memory value compared to v2
Threads per block: 1024 → 256 (4× fewer) → more resident blocks per SM = occupancy restored
Block tile: 32×32 → 128×128 (16×) → AI up 4×

These three compound to 2.05 TFLOPS at N=2048, roughly 50–70% of the 3–4 TFLOPS quoted here for cuBLAS SGEMM.
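
The 8× and 4× factors in the list above come from simple counting. A sketch under one assumption: per K-step, a thread that owns a tm×tn output tile reads tm values of A and tn values of B from shared memory and then does tm·tn FMAs out of registers. The helper names are hypothetical.

```python
def shared_reads_per_fma(tm, tn):
    """Shared-memory reads per FMA when a thread owns a tm x tn output tile.

    Per K-step the thread reads tm values of A and tn values of B,
    then performs tm * tn FMAs from registers (an outer product).
    """
    return (tm + tn) / (tm * tn)


def threads_per_block(block_m, block_n, tm, tn):
    """Threads needed when each thread owns a tm x tn piece of the block tile."""
    return (block_m // tm) * (block_n // tn)


v2_reads = shared_reads_per_fma(1, 1)   # v2: one output per thread -> 2.0
v3_reads = shared_reads_per_fma(8, 8)   # v3: 8x8 outputs per thread -> 0.25
print(v2_reads / v3_reads)                  # 8.0
print(threads_per_block(32, 32, 1, 1))      # 1024
print(threads_per_block(128, 128, 8, 8))    # 256
```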

Lesson 3 · v4 is a new ceiling — Tensor Cores

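One wmma::mma_sync call computes a 16×16×16 matrix product = 4096 FMAs per warp, estimated here at ~8 cycles. Per warp-cycle, that would be 16× the throughput of FP32 FMAs (4096 / 8 = 512 FMAs per cycle versus 32 for a warp of FP32 lanes). Treat the cycle count as approximate: the peak figures in the header (65 vs 8.1 TFLOPS) imply about 8× chip-wide.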
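
The throughput claim is plain arithmetic, spelled out below. Two inputs are assumptions rather than measurements: the ~8-cycle figure is the lesson's estimate, and the FP32 baseline assumes one FMA per lane per cycle for a 32-lane warp.

```python
WARP_SIZE = 32

fmas_per_mma = 16 * 16 * 16             # one 16x16x16 tile product
mma_cycles = 8                          # the lesson's estimate, not a measured value
tc_fmas_per_cycle = fmas_per_mma / mma_cycles
fp32_fmas_per_cycle = WARP_SIZE         # assumes one FP32 FMA per lane per cycle

per_warp_ratio = tc_fmas_per_cycle / fp32_fmas_per_cycle
peak_ratio = 65 / 8.1                   # TC peak / FP32 peak from the header

print(fmas_per_mma, per_warp_ratio, round(peak_ratio, 1))  # 4096 16.0 8.0
```

The per-warp estimate and the ratio of the quoted peaks differ by about 2×, so at least one of the per-warp assumptions is optimistic. The numbers are a rough guide, not a hardware specification.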

Why we only hit 12% of peak:

Fragment loads are not in swizzled layout
No double buffering (loading the next tile from global memory into shared memory while the current tile's mma runs)
Block tile 64×64 is small (AI limited)

CUTLASS implements all three. Months of work.
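
To see why the missing double buffering matters, here is a toy timeline model. It is an idealized sketch, not a model of the T4: the per-tile load and compute costs are hypothetical numbers in arbitrary units, and overlap is assumed to be perfect.

```python
def serial_time(n_tiles, load, compute):
    """No overlap: every tile is loaded, then multiplied, one after another."""
    return n_tiles * (load + compute)


def double_buffered_time(n_tiles, load, compute):
    """Idealized overlap: tile i+1 loads while tile i is multiplied.

    The first load and the last compute cannot be hidden; each step in
    between costs whichever of load/compute is slower.
    """
    if n_tiles == 0:
        return 0
    return load + (n_tiles - 1) * max(load, compute) + compute


# Hypothetical per-tile costs in arbitrary time units, for illustration only.
tiles, load, compute = 128, 3, 2
print(serial_time(tiles, load, compute))           # 640
print(double_buffered_time(tiles, load, compute))  # 386
```

With overlap, the loop cost per tile drops from load + compute to max(load, compute), so the gain is largest when the two are close in size.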

Lesson 4 · The cost of precision

version	max_abs_err @ N=1024	vs FP32
v1, v2, v3 (FP32)	7.6e-5	1×
v4 (FP16 input)	1.4e-2	≈180×

FP16 inputs → 10-bit mantissa → about 3 decimal digits of precision. That is generally acceptable for LLM inference, while training typically needs a BF16/FP32 mix. It is also part of why AWQ, GPTQ, and FP8 quantization exist: lower precision → faster TC → if the output is nearly the same, it is a net win.
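
The "about 3 decimal digits" figure can be checked without a GPU. The sketch below uses Python's `struct` half-precision format (`"e"`) to round values to FP16. The dot product at the end only mimics "FP16 inputs, higher-precision accumulate": Python floats are FP64, not FP32, and the input vectors are made-up test data, so its error is illustrative and not comparable to the table above.

```python
import struct

FP16_UNIT_ROUNDOFF = 2.0 ** -11  # half the spacing (2^-10) of FP16 values in [1, 2)


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


x = 0.1
x16 = to_fp16(x)
rel_err = abs(x16 - x) / x
print(x16)                              # 0.0999755859375
print(rel_err <= FP16_UNIT_ROUNDOFF)    # True

# Dot product with FP16-rounded inputs but full-precision accumulation.
a = [((i * 37) % 101) / 101 for i in range(1024)]   # made-up inputs in [0, 1)
b = [((i * 53) % 103) / 103 for i in range(1024)]
exact = sum(p * q for p, q in zip(a, b))
mixed = sum(to_fp16(p) * to_fp16(q) for p, q in zip(a, b))
print(abs(mixed - exact))
```

A relative error of at most 2^-11 ≈ 4.9e-4 per input is where the "about 3 digits" comes from; the error of a length-K dot product then grows with K.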

Roofline intuition

                    │       Tensor Core peak 65 ━━━━━━━━━━━
perf ▲              │
(TFLOPS)            │
                    │      ← FP32 peak 8.1 ──────────────
                    │                                  v4 ●
                    │                                (7.9)
                    │                         v3 ●
                    │                       (2.0)
                    │                 v2 ●
                    │              (0.8)
                    │      v1 ●
                    │    (0.4)
                    └────────────────────────────────────▶
                       low AI                       high AI

Up through v3 the fight is over memory bandwidth; v4 moves under a different, much higher ceiling, the Tensor Core roof, even though at 7.9 TFLOPS it has only just reached the FP32 peak. LLM inference runs almost entirely in the bottom two rows. v1–v4 is the ladder up to there.
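
The roofline picture reduces to one formula: attainable performance = min(compute peak, AI × memory bandwidth). A minimal sketch follows. The 320 GB/s bandwidth is an assumption taken from the T4 datasheet and is not stated in this lesson; the function names are illustrative.

```python
def roofline_tflops(ai_flop_per_byte, peak_tflops, bandwidth_gb_s):
    """Attainable TFLOPS under the roofline model: min(compute roof, AI x bandwidth)."""
    memory_roof = ai_flop_per_byte * bandwidth_gb_s / 1000.0  # GFLOP/s -> TFLOP/s
    return min(peak_tflops, memory_roof)


def ridge_point(peak_tflops, bandwidth_gb_s):
    """AI (FLOP/byte) at which the memory roof meets the compute roof."""
    return peak_tflops * 1000.0 / bandwidth_gb_s


T4_BANDWIDTH_GB_S = 320.0  # assumption: T4 datasheet figure, not stated in this lesson

print(roofline_tflops(8, 8.1, T4_BANDWIDTH_GB_S))      # 2.56
print(round(ridge_point(8.1, T4_BANDWIDTH_GB_S), 1))   # 25.3
print(round(ridge_point(65.0, T4_BANDWIDTH_GB_S), 1))  # 203.1
```

Under that bandwidth assumption the Tensor Core ridge point sits about 8× further right than the FP32 one, so a block tile that was large enough for FP32 can still leave Tensor Cores starved for data.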
