《GPU Mode》 L023 Tensor Cores · CUTLASS · CuTe High priority transcript · slides · available

Tensor Cores

Volta 부터 Hopper 까지 — 한 instruction 으로 작은 행렬 곱을 끝내는 hardware unit 이 어떻게 진화했는가. 그리고 그 위에 올라가는 software stack CUTLASS · CuTe · Collective 의 layered abstraction. NVIDIA Architecture 의 Vijay Thakkar 와 Pradeep Ramani 가 직접 와서 보여주는 hardware MMA 와 그 사용자 facing API 의 학습 노트.

Tensor Core MMA CUTLASS CuTe WGMMA · TMA FP16 · BF16 · FP8 Volta → Hopper collective MMA epilogue fusion tile-based programming

Speakers

Vijay Thakkar & Pradeep Ramani

NVIDIA · CUTLASS / CuTe core team

강의 번호

L023

스피커

Vijay Thakkar & Pradeep Ramani

학습 우선순위

High · 정독

다시 볼 때

CUTLASS example 직접

하위 목차 · 12개 섹션↓ 클릭해서 이동

01강의가 풀려는 문제왜 이 강의가 존재하는가 02Tensor Core 의 진화Volta → Hopper 03FP16 · BF16 · TF32 · FP8정밀도 와 throughput 04mma.sync vs WMMA vs WGMMAPTX instruction 의 layer 05tile shape · M N Klayout · swizzle 06CUTLASS 의 layered APIcollective · kernel · device 07CuTe — layout 의 algebratile-based programming 08Hopper 의 새 기능들TMA · WGMMA · async 09HBM 대역폭과의 관계peak FLOPs vs feed 10기억할 메모와 코드 자료key takeaways 11다른 강의로 이어지는 길connections 12열린 질문open questions

§ 01강의가 풀려는 문제· 왜 이 강의가 존재하는가

Tensor Core 위에서 “정상속도” 까지 짜내려면 — 그 layer 들을 안다

현대 GPU 의 산술 power 의 95%+ 가 Tensor Core 에서 나온다 (FP16 기준 H100 의 989 TFLOPS 중 ~95%). 그런데 평범한 CUDA 커널은 거의 자연스럽게 그 unit 을 못 쓴다 — mma.sync 같은 PTX 또는 그 위 abstraction (wmma::, CUTLASS) 을 명시적으로 호출해야 한다.

강의가 깐 큰 질문 세 개.

Tensor Core 가 어떻게 4 세대를 거치며 진화했는가 — Volta 의 m16n16k16 부터 Hopper 의 m64n256k16 WGMMA 까지(§ 02).
그 hardware 위에 올라가는 software 의 layer 가 어떻게 쌓여 있는가 — CUTLASS 의 collective / kernel / device 3 단계와 CuTe 의 layout algebra(§ 06–07).
이걸 왜 알아야 하는가 — FlashAttention, GEMM 변형, custom epilogue 같은 모든 modern 커널이 이 stack 위에서 짜인다.

강의의 인지적 frame

“CUTLASS 는 한 layer 만 써도 되고, 모든 layer 를 써도 된다”. 가장 흔한 entry — collective builder 한 줄로 GEMM 한 번. 더 깊이 들어가야 할 때는 layer 별로 customize. 강의의 backbone 메시지는 — “tensor core 는 복잡한데, 그 복잡함의 대부분은 여러 layer 의 helper 들로 가려져 있다”.

“tensor core 를 동작하게 만드는 건 hardware 가 아니라 layout 이다 — 같은 hardware 가 다른 layout 으로는 동작도 안 한다.”Vijay Thakkar (요약)

§ 02Tensor Core 의 진화· Volta → Hopper

4 세대 — 더 큰 tile, 더 적은 thread, 더 많은 dtype

FIG · Tensor Core 4 세대같은 idea, 다른 size

Volta · 2017

1st gen TC

m16n16k16 (FP16)

한 warp 가 4×4 sub-tiles 들을 들고 — sub-warp granularity. wmma:: API 등장.

Turing · 2018

2nd gen

+ INT8 · INT4

inference 용 저정밀. 같은 tile size 골격, 새 dtype.

Ampere · 2020

3rd gen

m16n8k16 (BF16/FP16)
+ TF32 · sparsity

warp-level mma.sync 의 표준. 4 warps cooperatively for SM-wide tile. async copy 도 등장.

Hopper · 2022

4th gen

WGMMA m64n256k16
+ FP8 (E4M3·E5M2)

warp-group MMA — 4 warp 이 한 묶음. shared memory 에서 직접 read. TMA 와 함께 async pipeline.

한 instruction 의 tile size 가 16×16×16 → 64×256×16 으로 256× 커졌다. 그만큼 instruction 한 번의 일이 커지고 — 동시에 instruction 이 asynchronous 화 되어, MMA 가 도는 동안 다음 tile load 가 동시에 진행됨.

strict 한 의미에서 — 한 instruction 이 끝내는 산술의 양이 256배 커졌다. instruction 자체가 길어진 게 아니라, hardware 가 그 한 명령에 더 많은 일을 한다. 같은 산술 throughput 을 더 적은 instruction 으로 — instruction issue overhead 가 비례적으로 줄어든다.

§ 03FP16 · BF16 · TF32 · FP8· 정밀도와 throughput

같은 hardware, 다른 dtype — throughput 이 정확히 비례한다

현대 Tensor Core 는 다양한 dtype 을 지원. 각 dtype 의 throughput 은 거의 정확히 bit width 에 반비례.

FP16 (half)

10-bit mantissa, 5-bit exponent. 학습/추론 모두. dynamic range 좁음 — overflow 위험. loss scaling 필요.

BF16

7-bit mantissa, 8-bit exponent (= FP32 와 같은 range). gradient 안전. 현대 학습의 표준. throughput 은 FP16 과 동일.

TF32

10-bit mantissa, 8-bit exponent. FP32 와 호환되지만 산술은 19-bit 정밀도. FP32 코드를 TF32 로 묵시 변환해주는 게 Ampere 의 표준 동작.

FP8 — E4M3

4-bit exponent, 3-bit mantissa. forward 에 자연. throughput FP16 의 2×.

FP8 — E5M2

5-bit exponent, 2-bit mantissa. backward gradient (큰 dynamic range) 에 자연.

INT8 / INT4

inference 용 quantize. 정확도 보전을 위해 calibration / QAT 가 필요한 경우 많음.

throughput 의 산수 (H100 SXM)

FP32 (cuda core) — 67 TFLOPS. FP16/BF16 (TC) — 989 TFLOPS. FP8 (TC) — 1979 TFLOPS. 거의 정확히 dtype bit width 에 반비례. “같은 시간에 2× 더 많은 산술 — 그 대가는 정밀도 손실”의 trade-off 가 명확.

§ 04mma.sync vs WMMA vs WGMMA· PTX instruction 의 layer

같은 hardware, 세 abstraction layer

L0 · PTXmma.sync (Ampere) / wgmma.mma_async (Hopper)warp 또는 warp-group 이 직접 호출하는 instructionSASS 가깝게

L1 · WMMAnvcuda::wmma:: C++ APIfragment 단위 — 사용자가 thread 별 데이터 위치를 추상화CUDA SDK

L2 · CuTe MMA atomSM80_16x8x16_F32F16F16F32_TN 같은 atomlayout 정보가 type 에 박혀 있음 — compile-timeCUTLASS 4.x

L3 · Collective MMAcutlass::gemm::collective::CollectiveMma4 warps cooperative — SM-wide tileCUTLASS

L4 · GEMM kernelcutlass::gemm::kernel::GemmUniversal전체 GEMM kernel 로 wrapCUTLASS

L5 · Device adaptercutlass::gemm::device::Gemmhost 에서 호출하는 cuBLAS-like APICUTLASS

강의의 강한 메시지 — “각 layer 가 독립적으로 customize 가능”. 가장 흔한 entry 가 L5 (cuBLAS-like 한 줄). 새 dtype 이나 새 epilogue 가 필요하면 L3 만 새로 짜고 L4–L5 는 재사용. 진짜 새 hardware feature 면 L2 까지 내려간다.

왜 이런 layered 디자인인가

tensor core 의 layout 제약이 워낙 미묘해서 — “register 의 어느 lane 에 어느 element 가 있는지” 까지 정해져 있다. 이 layout 을 type system 안에 넣어 — compile time 에 잘못된 결합을 막는 게 CUTLASS/CuTe 의 핵심 디자인 결정.

§ 05tile shape · M N K· layout · swizzle

한 MMA instruction 의 “모양” — 그리고 그 모양의 register 분배

Ampere 의 표준 instruction mma.sync.aligned.m16n8k16.f32.f16.f16.f32. 의미 — 한 warp(32 thread) 가 협동해 16×16 의 A · 16×8 의 B → 16×8 의 C 를 곱한다. 그런데 “어느 thread 가 A 의 어느 element 를 들고 있는가” 가 정확히 정해져 있다.

FIG · m16n8k16 의 thread-layout (개념)32 thread 가 A·B·C 를 어떻게 나눠 듬

A 의 16×16 = 256 element 를 32 thread 가 나누면 thread 당 8 element. C accumulator 는 thread 당 4 element. 이 분배가 hardware 가 강제. 사용자는 — 데이터를 register 에 올릴 때 이 분배에 맞춰 올려야 함. 안 맞으면 __shfl 등으로 재배치 필요(비싸다).

그리고 swizzle — shared memory 의 layout 을 비틀어 같은 32-thread access 가 bank conflict 없이 도게 만드는 것. CUTLASS 의 layout type 이 이 swizzle 정보를 자동으로 담는다.

왜 이게 까다로운가

(1) shared memory 의 32 bank conflict 회피. (2) MMA 가 요구하는 thread-별 element 분배. (3) global memory coalesced load. 세 제약을 동시에 만족하는 layout 이 swizzle. 손으로 짜기 매우 어려운 자리 — CUTLASS 가 이걸 type 으로 박아서 자동화한다.

§ 06CUTLASS 의 layered API· collective · kernel · device

“가장 위 한 줄” 로도 동작하고, “가장 깊이” 도 들어갈 수 있는 stack

실제 사용자 코드 예시. 같은 GEMM 을 두 layer 에서.

Device API — cuBLAS-like 한 줄

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<
  cutlass::gemm::kernel::DefaultGemmUniversal<
    half_t, RowMajor,    // A
    half_t, ColMajor,    // B
    float,  RowMajor,    // C
    float                // acc
  >::GemmKernel>;

Gemm gemm;
gemm({{M,N,K}, {dA,K}, {dB,K}, {dC,N}, {dC,N}, {alpha, beta}});

Collective Builder — 한 단계 내려가

using CollectiveOp = cutlass::gemm::collective::CollectiveBuilder<
  arch::Sm90, arch::OpClassTensorOp,
  half_t, LayoutA, 8,
  half_t, LayoutB, 8,
  float,
  Shape<_128, _256, _64>,    // CTA tile shape
  Shape<_2, _1, _1>,          // cluster
  collective::StageCountAuto,
  collective::KernelTmaWarpSpecializedPingpong
>::CollectiveOp;

강의의 작은 demo

강의에서 Vijay 가 보여준 단계적 변형 — 같은 collective 위에서 (a) default schedule, (b) persistent + pingpong schedule, (c) ping pong + 5 stages 로 옮겨가면서 throughput 이 80% → 92% → 96% 로 올라감. “코드의 큰 부분은 그대로, schedule 만 swap”. layered abstraction 의 직접적 증거.

“localize your changes to the smallest subset that is affected — 새 epilogue 가 필요하면 collective 만 새로 짜고 schedule/kernel 은 재사용.”Vijay Thakkar (요약)

§ 07CuTe — layout 의 algebra· tile-based programming

“tensor 의 layout 을 type 으로 표현해 compile-time 에 검증”

CUTLASS 4.x 의 핵심 새 abstraction. tensor 의 (shape, stride) 를 단순한 metadata 가 아니라 compile-time 에 알려진 type 으로 들고 다닌다. 그 위에 layout 의 algebra (composition, complement, divide, product) 가 정의돼 있다.

// CuTe 핵심 — tensor 의 layout
using Shape  = cute::Shape<_128, _256>;      // (M=128, N=256)
using Stride = cute::Stride<_256, _1>;       // row-major
using Layout = cute::Layout<Shape, Stride>;

cute::Tensor t = cute::make_tensor(ptr, Layout{});
// t 는 type 안에 layout 을 다 알고 있음

// tile 분해 — type-level 연산
auto tCgC = local_tile(t, Shape<_64, _64>{}, make_coord(_, _));
// tCgC 는 (64×64) tile 들의 tensor of tensors

layout = (shape, stride)

tensor 의 spatial 표현. shape 가 size, stride 가 next-element offset. row-major / col-major / swizzle 등이 stride 패턴으로 표현됨.

composition

두 layout 을 합성. 한 tensor 를 다른 tensor 의 “관점” 에서 본다. matmul 의 A·B → C 의 index 변환을 합성으로 표현.

tiling

한 layout 을 더 작은 tile 의 layout 으로 분해. local_tile, logical_divide. CTA tile, warp tile, MMA atom 까지의 hierarchy.

copy atom

"이 layout 의 source 에서 저 layout 의 dest 로 데이터를 옮기는 한 instruction". cp.async, TMA 등.

왜 type-level 이어야 하나

layout 정보가 runtime 에만 있으면 — 한 step 마다 if 로 분기하느라 instruction 효율이 떨어진다. type 안에 박혀 있으면 — compile time 에 모든 분기가 inline, runtime 에는 곧장 mma instruction sequence 만 남는다. C++ template 메타프로그래밍이 hardware-tier 성능에 매핑되는 자리.

§ 08Hopper 의 새 기능들· TMA · WGMMA · async

“데이터 이동도 instruction 한 번 — MMA 도 한 번 — 둘이 동시에 도는”

TMA (Tensor Memory Accelerator)

global → shared memory 의 “bulk copy” 를 한 instruction 으로. tensor 의 shape, stride, dtype 정보를 descriptor 에 넣고 hardware DMA 가 실행. ALU 가 풀려난다.

WGMMA (Warp-Group MMA)

4 warp = 1 warp-group 이 한 묶음으로 한 큰 MMA 를 함. m64n256k16 같은 큰 tile. shared memory 에서 직접 read — register 로 미리 load 안 해도 됨.

async pipeline

MMA 가 도는 동안 다음 tile 을 TMA 로 load. 두 instruction 이 동시에 진행. cp.async.bulk + wgmma.commit_group + wgmma.wait_group 의 패턴.

distributed shared memory

cluster (인접 SM 들의 묶음) 의 SM 간 shared memory 가 서로 access 가능. cross-SM tile 통신.

FP8 dtype

E4M3, E5M2 두 변종. throughput FP16 의 2×. dynamic range 가 좁아 per-tensor scaling 필요.

warp specialization

한 block 안 일부 warp 는 producer (TMA load), 일부 warp 는 consumer (WGMMA). pipeline parallelism.

CUTLASS 의 새 schedule 들

Hopper 위의 KernelTmaWarpSpecialized, KernelTmaWarpSpecializedPingpong, KernelTmaWarpSpecializedCooperative 등 — 같은 collective 위에서 “producer 와 consumer 의 분배” 가 다른 schedule. 각 schedule 별로 강한 영역이 다름. autotune 의 새로운 차원.

§ 09HBM 대역폭과의 관계· peak FLOPs vs feed

“TC 가 peak 으로 돌게 하려면 HBM 이 그만큼 빨라야 한다”

arithmetic intensity 의 회계. H100 의 peak — FP16 989 TFLOPS, HBM 3 TB/s. 비율 ≈ 329 FLOP/byte 가 필요하다 (peak utilization 을 위해). 모든 GEMM 이 이 임계값을 넘는 건 아니다.

arithmetic intensity 의 산수

(M, N, K) GEMM — 산술 ≈ 2·M·N·K, byte ≈ 2·(M·K + K·N + M·N) (FP16). intensity = 산술/byte = M·N·K / (M·K + K·N + M·N). 이 값이 크려면 — M, N, K 모두 충분히 커야 함. K 가 작은 GEMV 는 거의 항상 memory-bound, decode 의 weight read 가 그 영역 (L022 § 02).

FIG · roofline — TC peak 도달 가능성arithmetic intensity 별

GEMV (decode) · M=1, N=4096, K=4096~1 FLOP/byte · memory-bound

small GEMM · 64·64·64~32 FLOP/byte

medium GEMM · 512·512·512~170 FLOP/byte

large GEMM · 4096·4096·4096~1365 FLOP/byte · TC peak

FlashAttention forward (long seq)~250 FLOP/byte

큰 GEMM 만이 TC peak 에 가까이 간다. medium 이하 size 는 memory 가 dominant — 그래서 attention/decode 같은 워크로드는 “HBM 이 bottleneck” 이 표준 분석.

“tensor core 의 peak 를 보는 건 — large GEMM 의 매우 좋은 case 에서. 실전 LLM 의 평균은 그 절반에서 한참 떨어진다.”Pradeep Ramani (요약)

§ 10기억할 메모와 코드 자료· key takeaways

4 세대 진화

Volta (m16n16k16) → Ampere (m16n8k16, async copy) → Hopper (WGMMA m64n256k16, TMA, FP8). 한 instruction 의 일이 256× 커짐.

layout 이 핵심

tensor core 는 thread 별 element 위치가 hardware 가 강제. swizzle, stride, fragment 의 정렬이 안 맞으면 결과가 틀리거나 매우 느려짐.

CUTLASS 의 6 layer

PTX → WMMA → CuTe atom → Collective → kernel → device. 각 layer 가 독립 customize. 한 줄로도 동작하고 깊이도 들어감.

CuTe = layout algebra

tensor 의 (shape, stride) 가 type 안에 박힘. composition, tiling, copy/MMA atom 의 algebra. compile-time inline.

dtype 의 throughput 산수

FP32 < FP16/BF16 < FP8 < INT4. 거의 정확히 bit width 에 반비례. H100 peak: FP16 989 TFLOPS, FP8 1979.

Hopper 의 async

TMA 로 데이터 이동, WGMMA 로 산술 — 두 instruction 이 동시에. warp specialization 으로 producer/consumer 분리.

arithmetic intensity 임계

H100 FP16 — 329 FLOP/byte 필요. 작은 GEMM/GEMV 는 거의 항상 memory-bound.

epilogue fusion

Collective 의 epilogue visitor 로 bias/activation/scale 을 register 단계에서 fuse. L018 § 07 의 LoRA 가 그 패턴.

YouTube youtube.com/watch?v=hQ9GPnV0-50

Slides Google Drive — Tensor Cores 강의 자료

Repos NVIDIA/cutlass · CUTLASS examples · CuTe headers

PTX docs PTX ISA — mma.sync / wgmma

손에 새기기 — 실습 시퀀스

WMMA hello — nvcuda::wmma API 로 16×16×16 GEMM 한 번. fragment load/store/mma 의 순서 손에 박기.
CUTLASS 한 줄 — cutlass::gemm::device::Gemm 으로 cuBLAS 와 throughput 비교. 큰 GEMM (4096^3) 에서 두 라이브러리가 거의 같은 peak 에 닿는지.
collective builder swap — 같은 GEMM 에 schedule 을 default → pingpong → cooperative 로 바꿔보고 시간 비교.
arithmetic intensity 측정 — 작은 GEMM / 큰 GEMM / GEMV 세 종 — intensity 계산 + 실제 tensor core utilization NCU 로 측정. roofline 그리기.
CuTe layout play — cute::print_layout 으로 row-major / col-major / swizzle 의 stride 직접 출력. local_tile 의 결과가 어떤 layout 을 가지는지.
FP8 GEMM — H100 환경이라면 cutlass::gemm::device::Gemm 의 FP8 변종을 띄워보고 FP16 대비 throughput 2× 확인.
epilogue fusion — bias + ReLU 가 fuse 된 GEMM 을 collective 의 EpilogueVisitor 로. 이 강의의 L018 와 직접 연결.
SASS 확인 — 작은 mma kernel 의 SASS dump (cuobjdump --dump-sass). 실제 명령이 HMMA / QGMMA 로 박혀 있는지.

§ 11다른 강의로 이어지는 길· connections

L018

Fusing Kernels

Kapil Sharma — § 06 의 epilogue fusion 의 LoRA 케이스. CUTLASS abstraction 의 직접 응용

L012

Flash Attention

— FlashAttention 이 CUTLASS/CuTe 위에서 짜여 있음. 이 강의의 사용 사례

L025

Composable Kernel (CK)

Haocong Wang — AMD 의 CUTLASS 대응. 같은 idea, 다른 hardware

L022

Speculative Decoding

— § 09 의 “batch 큰 영역에서 compute-bound” 의 그 compute 자체

L034

Low Bit Triton Kernels

— § 03 의 FP8/INT4 dtype 을 Triton 에서 어떻게 다루는지의 자리

L001

CUDA profiling

— § 09 의 roofline 분석 도구. NCU 의 “TC utilization” metric

L028

Liger Kernel

— LLM 학습 fused 커널이 CUTLASS / Triton 위에 어떻게 올라가는지

L029

Triton Internals

Kapil Sharma — Triton 이 자체적으로 어떻게 mma instruction 을 emit 하는지

§ 12열린 질문· open questions

WGMMA layout 의 정확한 thread 분배 — 64×256×16 의 4 warp 가 element 를 어떻게 나누는지의 정확한 표는 강의에서 자세히 안 보여줌. PTX docs 직접 확인 필요.
warp specialization schedule 의 trade-off — pingpong vs cooperative vs persistent 의 성능 차이가 어떤 GEMM shape 에서 어떻게 갈리는지. autotune 결과를 직접 봐야.
Blackwell 의 5세대 TC — 강의 시점 (2024) 이후의 generation. 새 instruction (wgmma3?) 과 dtype (FP4?) 은 후속 자료.
distributed shared memory 의 사용 사례 — Hopper cluster 의 DSMEM 이 실제 어떤 커널에서 큰 이득을 주는지의 구체 사례는 별도 추적.
CuTe 의 learning curve — type-level layout 이 처음에 매우 어렵다. 정식 학습 path (CUTLASS examples 의 어떤 순서) 는 강의에서 짧게만 언급.
FP8 의 calibration — dynamic range 가 좁아 per-tensor scale 이 필요. 학습 안정성을 위한 정식 recipe 는 별도 자료.

검증 메모

이 노트의 throughput 숫자(989 TFLOPS 등)는 NVIDIA 공식 spec sheet 의 H100 SXM 기준. 실제 측정값은 thermal/power/cooling 환경에 따라 더 낮을 수 있다.

← Lecture 022 Hacker's Guide to Speculative Decoding in vLLM Lecture 024 → Scan at the Speed of Light — Jake Hemstad & Georgii Evtushenko