CUDA 18VOL SERIES · T1 HW & ISA · A4 LANDSCAPE · 20p

GPU Architecture Quantitative

SM Internals · Memory Subsystem · Ampere(GA100) · Hopper(GH100) · Blackwell(GB100/202)

Volume 02/18

Tier T1 HW & ISA

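Prerequisite V01

Purpose Quantitative HW map (recall trigger)

Legend

Key term (yellow background)

Table header / very important

Definition · formula box

Example · worked box

Red: caution · easy to get wrong

★ Exam · practice essential

(!) Mnemonic (first-letter memorization)

↗ Cross-reference between volumes

→ Cause · flow

∵∴ Reason · conclusion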


NVIDIA GA100 WP v1.0 · GH100 WP v1.04 · GB200 NVL72 WP · CUDA C PG 12.x App. H

1 SM = 4 processing partitions Common to Ampere/Hopper · 4 quads

Structure One Streaming Multiprocessor (SM) is split into 4 processing partitions (= processing blocks). Each partition has its own warp scheduler, dispatch unit, register-file sub-bank, and set of execution units.

Warps per partition ≤ 16 (GA100: 64 warps/SM = 16 × 4)
Each partition's warp scheduler issues 1 warp instruction per cycle
Shared across partitions: L1/Smem, Tex units, L1 instruction cache, constant cache

Source NVIDIA GA100 WP v1.0 Fig.5 / GH100 WP v1.04 Fig.7

2 Units inside a partition

unit	role	count/partition
SP/FP32	FP32 FFMA	16 (A100) · 32 (H100)
INT32	integer IMAD	16
FP64	DFMA	8 (A100) · 16 (H100)
Tensor Core	MMA/WGMMA	1 (3rd/4th gen)
LSU	Load/Store	8
SFU	transcendental	4

Source GA100 WP Fig.6 · GH100 WP §3.1

3 SM block diagram ASCII

┌──────────────── SM ────────────────┐
│ L1 I-Cache · const cache (shared)  │
├────┬────┬────┬────┐                │
│ P0 │ P1 │ P2 │ P3 │   4 partitions │
│ ┌──┴──┐ ...                         │
│ │warp │ scheduler · dispatch        │
│ │sched│ 1 warp/cyc                  │
│ └──┬──┘                             │
│ ┌──┴──┐                             │
│ │ RF  │ 16K × 32b (partition)       │
│ └──┬──┘                             │
│ ┌──┴──┐                             │
│ │ OC  │ operand collector            │
│ └──┬──┘                             │
│ ┌──┴──────────────────┐              │
│ │FP32·INT·FP64·SFU·LSU│              │
│ │ Tensor Core (4th)   │              │
│ └──┬──────────────────┘              │
├────┴──────── L1/Smem 256 KB ───────┤
│ Tex unit · RT Core (GA10x/AD10x)   │
└────────────────────────────────────┘

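4 Dual-issue conditions Overlap across pipes within a partition

One warp dispatches at most 1 instruction/cycle
Per partition, consecutive issues can target different pipes (FP32 + LSU, etc.), so instructions from different warps execute concurrently
compute capability 8.0+: overlap requires independent instructions and a free target pipe; the scheduler itself still issues 1 warp instruction per cycle, which matches the dispatch ceiling in card 7

Source CUDA C PG 12.6 App. H "Compute Capability 8.0/9.0"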

5 Instruction / Constant Cache

cache	capacity	scope
L0 I-cache	private per partition	1 warp scheduler
L1 I-cache	~32 KB	whole SM
constant $	8 KB working set / SM	`__constant__`

Same hierarchy on GA100/GH100. Figures are estimated upper bounds based on CUDA C PG App. H.

6 RT Core For reference (absent on data-center GA100/GH100)

BVH traversal + triangle-ray intersect HW
Ada(RTX 40): 3rd-gen RT, opacity micromap
Absent on DC cards (A100 · H100) ↗ V02 §17

7 Key formula: dispatch ceiling

peak_warp_inst/SM/cyc = partitions × 1
= 4 inst/cyc/SM
(1 scheduler per partition, each issuing 1 warp instruction per cycle)

With too few eligible warps, actual ≪ peak. ↗ V02 §11

1 Register File capacity ★ 65536 × 4B

arch	reg/SM	byte/SM	max reg/thread
Ampere sm_80	65,536	256 KB	255
Hopper sm_90	65,536	256 KB	255
Blackwell sm_100	65,536	256 KB	255

Source CUDA C PG 12.6 Table 21 "Technical Specifications per Compute Capability"

2 Sub-banks inside a partition

256 KB RF = 4 partitions × 64 KB, each split into multiple banks
Register placement: round-robin (RR) across sub-banks
Multiple reads of the same bank in one cycle → bank conflict

HW does not expose the banks, so they cannot be controlled directly. The compiler (ptxas) handles allocation.

3 Operand Collector Definition

OC The Operand Collector is a staging buffer that gathers an issued instruction's source operands from the RF over several cycles and forwards them, aligned, to the ALU ports. It pipelines bank conflicts to hide them.

4 FFMA 3-operand collection flow

cycle t:    FFMA r0, r1, r2, r3
             │
  ┌──────────┴──────────┐
  │   Operand Collector │
  │   slot: A B C accum │
  │   ↑ ↑ ↑             │
  └───┬──┬──┬──┬────────┘
      │  │  │  │
     RF RF RF RF  (bank select cycle-staggered)
      │  │  │  │
      └──┴──┴──┴─→ ALU FP32 FMA

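Conceptual diagram · Source: Greg Smith GTC 2010 "Fermi SM" · same structure since Volta.

5 Register Spill cost ★

spill When a thread's live registers exceed its allocation limit, the compiler moves them to local memory (thread-private memory in the global address space), outside the HW register file.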

storage	latency	BW
register	~1 cyc	full
local (L1 hit)	~30 cyc	SM L1 BW
local (HBM)	~400 cyc	HBM BW

Source CUDA C PG App. H "Local Memory" + typical micro-benchmark values.

6 Occupancy constraint formulas

warp/SM ≤ floor(65536 / (reg/thread × 32))
warp/SM ≤ floor(smem_cap / smem/block) × block_warps
warp/SM ≤ max_warp (A100 64 · H100 64)
Actual occupancy is the min of the three constraints.

Source CUDA Occupancy Calculator (deprecated), same formulas · PG Table 21.
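
The three limits above can be combined in a few lines. This is a host-side C++ sketch, not the CUDA Occupancy Calculator: `SmLimits` and `resident_warps` are hypothetical names, the limits are the table values from this sheet, and register-allocation granularity and the per-SM block-count limit are deliberately ignored, so real occupancy can be lower. Rounding down to whole blocks is an added assumption.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the per-SM limits quoted in this sheet.
struct SmLimits {
    int regs_per_sm;  // 65,536 32-bit registers
    int smem_bytes;   // shared memory available per SM for the chosen carve-out
    int max_warps;    // 64 on A100 and H100
};

// Upper bound on resident warps per SM: min of the register, smem, and
// max-warp limits, rounded down to a whole number of blocks.
inline int resident_warps(const SmLimits& sm, int regs_per_thread,
                          int smem_per_block, int threads_per_block) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    assert(threads_per_block % 32 == 0);
    const int block_warps = threads_per_block / 32;
    const int by_regs = sm.regs_per_sm / (regs_per_thread * 32);
    const int by_smem = smem_per_block > 0
        ? (sm.smem_bytes / smem_per_block) * block_warps
        : sm.max_warps;  // no smem use: this limit does not bind
    const int warps = std::min({by_regs, by_smem, sm.max_warps});
    return (warps / block_warps) * block_warps;  // only whole blocks are resident
}
```

With A100 limits, 128 registers/thread, 48 KB smem/block, and 256-thread blocks, registers are the binding limit at 16 warps (25% of 64).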

7 Hopper `setmaxnreg` ★

Dynamic per-warp register budget reallocation, executed warpgroup-wide
setmaxnreg.inc.sync.aligned.u32 232: consumer warps take a large budget
setmaxnreg.dec.sync.aligned.u32 40: producer warps keep the minimum

Warp-specialized GEMM: producer 40 reg, consumer 232 reg → larger accumulator tile at the same thread count per SM. ↗ V04 §10
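
A quick budget check shows why 40 and 232 pair up. The sketch assumes a layout of 1 producer warpgroup plus 2 consumer warpgroups per CTA with one CTA per SM; that layout is an assumption for illustration and is not stated above. `regs_used` and `fits_register_file` are hypothetical helpers.

```cpp
#include <cassert>

constexpr int kRegsPerSm = 65536;          // 32-bit registers per SM
constexpr int kThreadsPerWarpgroup = 128;  // 4 warps x 32 threads

// Registers consumed by a CTA built from producer and consumer warpgroups,
// each warpgroup running with its own per-thread register budget.
inline int regs_used(int producer_wgs, int producer_regs,
                     int consumer_wgs, int consumer_regs) {
    assert(producer_wgs >= 0 && consumer_wgs >= 0);
    return kThreadsPerWarpgroup *
           (producer_wgs * producer_regs + consumer_wgs * consumer_regs);
}

// True when that demand fits in one SM's register file.
inline bool fits_register_file(int producer_wgs, int producer_regs,
                               int consumer_wgs, int consumer_regs) {
    return regs_used(producer_wgs, producer_regs,
                     consumer_wgs, consumer_regs) <= kRegsPerSm;
}
```

Under these assumptions 1 × 40 + 2 × 232 = 504 registers per thread-slot, the same total as three warpgroups at a uniform 168, so the consumers gain 64 registers each without changing the thread count.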

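8 Practical guide

Set an upper bound with -maxrregcount or __launch_bounds__
Check spills with -Xptxas -v (spill stores/loads, stack frame bytes)
A spill is not always bad: it is acceptable when the occupancy it recovers is worth more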

1 L1/Smem unified ★ unified SRAM

Unified Since Volta, the L1 data cache and shared memory share one SRAM block per SM. The carve-out between them is a per-kernel runtime setting.

arch	L1+Smem total/SM	max Smem/SM
Volta (GV100)	128 KB	96 KB
Ampere (GA100)	192 KB	164 KB
Ada (AD102)	128 KB	100 KB
Hopper (GH100)	256 KB	228 KB
Blackwell (GB100)	256 KB	228 KB

On sm_80 and later a single block can use 1 KB less than the per-SM maximum (163 KB on GA100, 99 KB on AD102, 227 KB on GH100).

Source GA100 WP Table 2 · GH100 WP §3.2 · GB100 WP v1.0 §3.1

2 Carve-out options (GA100)

0 / 8 / 16 / 32 / 64 / 100 / 132 / 164 KB smem
The remainder serves as L1 data cache
cudaFuncSetAttribute(..., cudaFuncAttributePreferredSharedMemoryCarveout, %)

Source CUDA C PG 12.6 §3.2.5 / GA100 WP Table 3
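
As a rough illustration of how the list above is used, the sketch below picks the smallest GA100 carve-out that covers a kernel's per-SM shared-memory demand and reports what is left for L1. All function names are hypothetical. The percentage helper assumes the hint is expressed relative to the 164 KB maximum; treat that as an assumption to check against the Programming Guide, since the driver is free to choose a different configuration.

```cpp
#include <array>
#include <cassert>

// GA100 shared-memory carve-out sizes in KB, taken from the list above.
constexpr std::array<int, 8> kGa100CarveoutKB = {0, 8, 16, 32, 64, 100, 132, 164};
constexpr int kGa100UnifiedKB = 192;  // L1 + shared memory per SM

// Smallest listed carve-out that holds `smem_kb_per_sm`, or -1 if none does.
inline int pick_carveout_kb(int smem_kb_per_sm) {
    assert(smem_kb_per_sm >= 0);
    for (int kb : kGa100CarveoutKB)
        if (kb >= smem_kb_per_sm) return kb;
    return -1;
}

// Carve-out as a percentage of the 164 KB maximum, rounded up
// (assumed meaning of the carve-out hint value).
inline int carveout_percent(int carveout_kb) {
    return (carveout_kb * 100 + 163) / 164;
}

// Capacity left for the L1 data cache after the carve-out.
inline int l1_left_kb(int carveout_kb) { return kGa100UnifiedKB - carveout_kb; }
```

A kernel that needs 48 KB per SM lands on the 64 KB option, which leaves 128 KB for L1.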

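3 Smem Bank structure ★

bank smem = 32 banks × 4 bytes/cycle (unchanged on 8.0+). 32 lanes hitting 32 different banks → 1 cycle. Different words in the same bank → serialized.

32-bit mode: word = 4B, bank = (addr/4) mod 32
64-bit mode: for double; cudaDeviceSetSharedMemConfig only takes effect on Kepler (CC 3.x), later architectures always use 4-byte banks
broadcast: lanes reading the same address are all served in 1 cycle

stride-2 FP32 access → 2-way conflict. Column access to a 32-float-wide tile (stride 32) → 32-way conflict; padding rows to stride 33 spreads the lanes over all banks.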
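
The bank rule can be checked numerically. The sketch below models one warp in which lane i reads the 4-byte word at index i × stride and returns the worst-case number of distinct words that fall into one bank, which is the serialization factor. `conflict_degree` is a hypothetical helper, and the model covers only this simple strided pattern.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <set>

// Worst-case serialization for one warp: the largest number of distinct
// 4-byte words mapped to the same bank. Lanes that read the same word
// count once, which models the broadcast case.
inline int conflict_degree(int stride_words) {
    assert(stride_words >= 0);
    std::array<std::set<long>, 32> words_per_bank;
    for (int lane = 0; lane < 32; ++lane) {
        const long word = static_cast<long>(lane) * stride_words;  // addr / 4
        words_per_bank[word % 32].insert(word);                    // bank = word mod 32
    }
    int worst = 0;
    for (const auto& words : words_per_bank)
        worst = std::max(worst, static_cast<int>(words.size()));
    return worst;
}
```

Stride 1 and stride 33 both give degree 1, stride 2 gives 2, and stride 32 gives 32, which is why one padding word per row removes the column-access conflict.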

4 Hopper swizzle For WGMMA

A TMA load selects a 128B / 64B / 32B swizzle mode
The WGMMA matrix descriptor carries the swizzle type
HW guarantees bank-conflict-free access for the swizzled layout

PTX details ↗ V04 §4·§7

5 Distributed Shared Memory ★ DSM = cluster smem

DSM Blocks in a Hopper Thread Block Cluster can read and write each other's smem directly. The smem of up to 16 blocks in a cluster forms one virtual address space.

cluster (1..16 CTA, portable ≤ 8)
 ├── CTA0.smem 228 KB ─┐
 ├── CTA1.smem 228 KB ─┤  unified
 ├── CTA2.smem 228 KB ─┼── logical
 └── CTA3.smem 228 KB ─┘  space
addr = (rank, offset)

6 DSM access path

cluster.map_shared_rank (cooperative groups; PTX mapa) → remote ptr
Travels over the SM-to-SM network (inside the GPC)
read latency ≫ local smem, but ≪ L2

Source GH100 WP §3.3 "Thread Block Cluster" · PTX ISA 8.x §9.7.12

7 Three canonical smem uses

tile buffer (GEMM/conv) — V01 §7, V06
reduction scratchpad — V01 §11
producer/consumer ring — ↗ V04 §9

1 L2 capacity · partitioning ★

arch	L2 total	structure
Volta GV100	6 MB	unified
Ampere GA100	40 MB	2 partition × 20 MB
Ada AD102	96 MB	unified
Hopper GH100	50 MB	2 partition × 25 MB
Blackwell GB100 die	~60 MB / die	dual-die via NV-HBI

Source GA100 WP Table 2 · GH100 WP §3.4 · GB200 WP §2.1 (die partitioned)

2 L2 partition structure ★ NUMA-like

partition The L2 on Ampere/Hopper is split into 2 physical partitions. An SM that accesses the remote partition pays extra latency and gets reduced BW. NUMA-like.

GH100: the partition A ↔ B link BW is a fraction of the L2 BW
Blackwell: NV-HBI between the 2 chiplet dies, 10 TB/s (official)

Source GH100 WP Fig.10 · GB200 WP "NV-HBI" entry

3 L2 Persistence Window ★

Feature An access policy window that keeps a given global-memory range preferentially resident in L2. Useful for lookup tables and KV caches with recurrent access.

// host: p = device pointer, N = window size in bytes, s = cudaStream_t
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = p;
attr.accessPolicyWindow.num_bytes = N;
attr.accessPolicyWindow.hitRatio  = 1.0f;  // fraction of the window treated as persisting
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(s, cudaStreamAttributeAccessPolicyWindow, &attr);

Source CUDA C PG 12.6 §5.2.6 "L2 Cache Set-Aside"

4 Set-aside carving

arch	max L2 set-aside
GA100	30 MB (of 40 MB)
GH100	≤ 75%

cudaDeviceGetLimit(&size, cudaLimitPersistingL2CacheSize)

5 Compression Lossless CCS

GA100+: HW compression of compressible sectors in L2
Targets repeated byte patterns and zero runs → the gain is data-dependent
Transparent to the user. No dedicated PTX directive

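The GA100 whitepaper quotes only best-case figures (up to 4× DRAM/L2 bandwidth, up to 2× L2 capacity). The real gain depends on the data and can be zero.

Source GA100 WP §2.4 "Compute Data Compression"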

6 Why the L2's position matters

HBM_req_rate = (1 − hit_L2) × access_rate
sustained_BW ≈ min(L2_BW, HBM_BW / (1 − hit_L2))
hit_L2 ↑ → more HBM bandwidth headroom

GEMM: reuse → L2 hit rate depends on tile size
attention KV cache: misses once its size exceeds the L2
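
The two formulas above translate directly into code. This is a minimal sketch of that first-order model only; the function names are hypothetical, units are whatever the caller uses consistently (TB/s in the example), and effects such as partition crossing or compression are not modeled.

```cpp
#include <algorithm>
#include <cassert>

// HBM_req_rate = (1 - hit_L2) * access_rate
inline double hbm_request_rate(double access_rate, double hit_l2) {
    assert(hit_l2 >= 0.0 && hit_l2 <= 1.0);
    return (1.0 - hit_l2) * access_rate;
}

// sustained_BW ~= min(L2_BW, HBM_BW / (1 - hit_L2)).
// With a 100% hit rate only the L2 bandwidth limits the kernel.
inline double sustained_bw(double l2_bw, double hbm_bw, double hit_l2) {
    assert(hit_l2 >= 0.0 && hit_l2 <= 1.0);
    if (hit_l2 >= 1.0) return l2_bw;
    return std::min(l2_bw, hbm_bw / (1.0 - hit_l2));
}
```

With L2 at 4 TB/s and HBM at 2 TB/s, a 50% hit rate is already enough for the L2 to become the limit in this model.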

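7 Common mistakes

Forgetting that persisting lines stay in L2 after the kernel ends → reset needed (cudaCtxResetPersistingL2Cache)
Setting the window larger than the L2 set-aside → persisting lines evict each other; lowering hitRatio limits the thrashing, but the benefit still shrinks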

1 HBM generation comparison ★ 2e→3→3e

spec	HBM2e	HBM3	HBM3e
stack height	8-hi	8/12-hi	8/12-hi
stack capacity	16 GB	16/24 GB	24/36 GB
ch/stack	8	16	16
width/ch	128-bit	64-bit	64-bit
data rate	3.2 Gb/s	6.4 Gb/s	8~9.2 Gb/s
BW/stack	~410 GB/s	~820 GB/s	~1.2 TB/s

Source JEDEC JESD238A (HBM3) · official NVIDIA GH100/GB200 WP figures

2 HBM configuration per GPU

GPU	stacks	capacity	peak BW
A100 SXM4 80GB	5 (1 of 6 sites unused)	80 GB HBM2e	2.039 TB/s
H100 SXM5 80GB	5 (1 of 6 sites unused)	80 GB HBM3	3.35 TB/s
H200 SXM	6	141 GB HBM3e	4.8 TB/s
B200 (GB100)	8 (2 die × 4)	192 GB HBM3e	8.0 TB/s

Source NVIDIA A100 WP Table 2 · H100 WP Table 2 · H200 datasheet · B200 GTC'24 spec

3 Memory Controller hierarchy

SM → L2 slice → MC → HBM stack
                   ↑
              per-stack controller
                   ↑
              channel pseudo-channel
                   ↑
              bank group · bank · row

GH100: 12 × 512-bit HBM3 MC (6144-bit total)
Each MC coalesces accesses and manages open rows

Source GH100 WP §3.4 "Memory Subsystem"

4 Coalescing conditions Review

Condition When the global load/store addresses of a warp's 32 threads are contiguous and aligned within one 128-byte segment → 1 transaction. Scattered addresses → N transactions.

sector = 32 bytes (minimum L2 ↔ HBM unit)
128 B = 4 sectors = size of a warp-wide FP32 load

PG §5.3.2 "Device Memory Access"
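
To make the segment and sector counts concrete, the sketch below counts how many distinct 32-byte sectors and 128-byte segments one warp touches for a simple strided pattern. `WarpAccess` and `count_transactions` are hypothetical names, and the model assumes lane i accesses `bytes_per_thread` bytes starting at base + i × stride.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

struct WarpAccess {
    int sectors;   // distinct 32-byte sectors touched
    int segments;  // distinct 128-byte segments touched
};

// Count the sectors and segments covered by a 32-lane strided access.
inline WarpAccess count_transactions(std::uint64_t base,
                                     std::uint64_t stride_bytes,
                                     int bytes_per_thread) {
    assert(bytes_per_thread > 0);
    std::set<std::uint64_t> sectors, segments;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        const std::uint64_t first = base + lane * stride_bytes;
        const std::uint64_t last = first + bytes_per_thread - 1;
        for (std::uint64_t s = first / 32; s <= last / 32; ++s) {
            sectors.insert(s);
            segments.insert(s / 4);  // 4 sectors per 128-byte segment
        }
    }
    return {static_cast<int>(sectors.size()), static_cast<int>(segments.size())};
}
```

An aligned, contiguous FP32 load touches 4 sectors in 1 segment; shifting the base by 4 bytes already costs a fifth sector and a second segment, and a 128-byte stride touches 32 of each.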

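5 ECC Overhead ★

ECC HBM2/3 provide side-band ECC: the SECDED check bits live in dedicated ECC bits next to the data. On by default on A100/H100.

item	effect
capacity	user-visible size already accounts for ECC (the stated 80 GB)
BW	the official peak assumes ECC on
off option	A100+ support `nvidia-smi -e 0` (DC only)

Source NVIDIA HBM ECC technical notes · nvidia-smi man page.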

6 GPU Memory Partition = GPC · uGPU

GA100 (full die): 8 GPC × 8 TPC × 2 SM = 128 SM, plus L2 slices
GH100: 8 GPC (H100 SXM5 enables 66 TPC = 132 SM)
MIG cuts the GPU into bundles of GPC + L2 slice + HBM channels ↗ V02 §18

7 PCIe / NVLink External links

link	A100	H100	B200
NVLink	600 GB/s (3rd)	900 GB/s (4th)	1800 GB/s (5th)
PCIe	Gen4 x16 64 GB/s	Gen5 x16 128 GB/s	Gen5 x16 128 GB/s

Source A100/H100 WP · GB200 GTC'24 session

1 Generation overview

gen	arch	CC	representative shape
1st	Volta GV100	7.0	m8n8k4 FP16→FP32
2nd	Turing TU102	7.5	+ INT8/INT4
3rd	Ampere GA100	8.0	m16n8k16 + TF32/BF16/FP64
4th	Hopper GH100	9.0	wgmma m64nNk16 + FP8
5th	Blackwell GB100	10.0	+ FP4/FP6 (micro-scale)

Source Official NVIDIA generation naming (each arch WP).

2 dtype support matrix ★★ 16→BF→TF→8→4

dtype	V100	T4	A100	H100	B200
FP64 TC	—	—	✓	✓	✓
TF32	—	—	✓	✓	✓
FP16	✓	✓	✓	✓	✓
BF16	—	—	✓	✓	✓
INT8	—	✓	✓	✓	✓
INT4	—	✓	✓	(dep)	—
FP8 E4M3/E5M2	—	—	—	✓	✓
FP6 E3M2/E2M3	—	—	—	—	✓
FP4 E2M1	—	—	—	—	✓

Source GV100/GA100/GH100/GB200 WP + PTX ISA 8.x "mma" list of allowed dtypes

3 Dense TC peak (SXM / W=TDP) 16·19·31

TF @ dtype	A100	H100	B200
FP64 TC	19.5	67	40
TF32	156	495	1100
FP16/BF16	312	989	2250
FP8	—	1979	4500
FP4	—	—	9000

Unit TF/s · dense (no 2:4 sparsity) · Source A100/H100 WP Table 2 · B200 GTC'24 keynote official slide

4 2:4 Sparsity

2:4 Structured sparsity in which 2 of every 4 elements are 0. A100+ TCs accelerate it ×2 in HW. Requires a pruned, compressed layout prepared ahead of time.

Supported on A100 · H100 · B200
Applies to both FP8 and FP16 (Hopper+)
Measured ×1.3~1.5 (×2 is the theoretical ceiling)

Source GA100 WP §2.3 "Sparsity"

5 Semantic changes per generation

gen	change	software impact
1st(V100)	warp-level WMMA	C++ API only
3rd(A100)	PTX `mma.sync`	direct per-thread fragments
3rd+cp.async	A100 G→Smem direct	simpler double buffering
4th(H100)	`wgmma` async + smem operand	warp-group programming
5th(B200)	micro-scale + per-block scale	dtype declaration includes the scale block

6 Accumulator rules

FP16/BF16/FP8 → accum FP32
FP16 × FP16 → FP16 accum also available (lower accuracy)
TF32 → FP32 accum (always)
FP64 TC → FP64 accum

PTX ISA 8.x §9.7.13 "Matrix multiply-accumulate"

7 Shapes · operand placement

Ampere mma: A·B·C·D are all register fragments
Hopper wgmma: A = reg or smem, B always smem, D = reg
Detailed layouts ↗ V03 §7 · V04 §7

1 Die configuration ★

spec	value
process	TSMC N7 (7 nm)
die size	826 mm²
transistors	54.2 B
GPC / TPC / SM (full)	8 / 64 / 128
A100 enabled SM	108 (SXM4)
FP32 core / SM	64
TC / SM	4 (3rd gen)

Source NVIDIA GA100 WP v1.0 §2.1 / Table 2

2 Total compute resources (A100 SXM4)

resource	value
FP32 core	6,912
FP64 core	3,456
Tensor Core	432
base clock	1095 MHz
boost clock	1410 MHz

3 Memory hierarchy summary

level	cap	BW/latency
Reg/SM	256 KB	~1 cyc
L1+Smem/SM	192 KB	~20 cyc
L2	40 MB	~200 cyc
HBM2 / HBM2e	40 / 80 GB	1.555 / 2.039 TB/s

Source GA100 WP Table 2 · latencies are typical public micro-benchmark values ↗ V02 §14

4 New in Ampere ★

cp.async: G→Smem direct, no staging through registers
3rd-gen TC: TF32/BF16/FP64/INT8
2:4 Sparsity: ×2 speedup
MIG: up to 7 instances
Async copy barrier (early form of mbarrier)
3rd-gen NVLink 600 GB/s

Source GA100 WP §2

5 Compute peak summary (SXM4)

dtype	dense TF	2:4 TF
FP64	9.7	—
FP64 TC	19.5	—
FP32	19.5	—
TF32 TC	156	312
FP16/BF16 TC	312	624
INT8 TC	624 TOPS	1248 TOPS

Source A100 WP Table 2 · at boost clock

6 A100 variant

model	mem	TDP	form
A100 SXM4 40	40 GB	400 W	SXM4
A100 SXM4 80	80 GB	400 W	SXM4
A100 PCIe 40	40 GB	250 W	PCIe Gen4
A100 PCIe 80	80 GB	300 W	PCIe Gen4

NVIDIA A100 Datasheet (2021)

7 GA100 vs GA10x

GA100: FP64 1:2 of FP32, Smem 164 KB, NVLink
GA102 (RTX 3090): FP64 1:64, 128 KB L1+Smem, RT Core
Detailed differentiation ↗ V02 §17

1 Die configuration ★

spec	value
process	TSMC 4N
die size	814 mm²
transistors	80 B
GPC / TPC / SM (full)	8 / 72 / 144
H100 SXM5 enabled SM	132
FP32 core / SM	128
TC / SM	4 (4th gen)

Source NVIDIA GH100 WP v1.04 §2.1 / Table 2

2 Total compute resources (H100 SXM5)

resource	value
FP32 core	16,896
FP64 core	8,448
Tensor Core	528
base clock	1590 MHz
boost clock	1980 MHz

3 Memory hierarchy summary

level	cap	BW/latency
Reg/SM	256 KB	~1 cyc
L1+Smem/SM	256 KB	~23 cyc
L2	50 MB (2×25)	~260 cyc
HBM3	80 GB	3.35 TB/s
NVLink 4.0	18 link	900 GB/s

Source GH100 WP Table 2 · latencies from NVIDIA H100 micro-benchmarks (NVIDIA GTC'22)

4 Five new Hopper features ★★ TWCDT

TMA: Tensor Memory Accelerator
WGMMA: warp-group async MMA
Cluster: Thread Block Cluster
DSM: Distributed Shared Memory
TE: Transformer Engine (FP8)

Details ↗ V02 §9 · V04 §1–§7

5 Compute peak (H100 SXM5)

dtype	dense TF	2:4 TF
FP64	34	—
FP64 TC	67	—
FP32	67	—
TF32 TC	495	989
FP16/BF16 TC	989	1979
FP8 TC	1979	3958
INT8 TC	1979 TOPS	3958 TOPS

Source H100 WP Table 2 · SXM5 / boost clock

6 H100 variant

model	mem	TDP
H100 SXM5	80 GB HBM3	700 W
H100 PCIe	80 GB HBM2e	350 W
H100 NVL	94 GB HBM3	400 W
H200 SXM	141 GB HBM3e	700 W

H100 datasheet / H200 datasheet (2024)

7 DPX instruction Dynamic Programming eXt

vimax3·vimin3: 3-input SIMD max/min + relu
Accelerates Smith-Waterman/Needleman-Wunsch (genomics)
Fused add + compare (max/min) reduction step in 1 cycle

Source GH100 WP §2.6 "DPX"

1 TMA — Tensor Memory Accelerator ★

Definition Dedicated HW copy engine. Moves multi-dimensional tiles between global and smem with a single-thread issue. Coordinates, strides, and OOB handling are automatic.

descriptor: created on the host with cuTensorMapEncodeTiled
PTX: cp.async.bulk.tensor
mode: tile / im2col
OOB fill: zero / NaN

Source GH100 WP §2.4 · PTX 8.x §9.7.8 · details ↗ V04 §4·§5

2 Why TMA is needed

issue	sm_80 cp.async	sm_90 TMA
issue	32 threads cooperate	1 thread
address calc	SW	HW (descriptor)
OOB	branch	HW fill
multi-D	manual	native
multicast	—	cluster-wide

3 WGMMA — Warpgroup MMA ★

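Definition An async Tensor Core instruction at warpgroup granularity (4 warps = 128 threads). The warp proceeds to the next instruction right after issue; completion is observed via fence/wait.

shape: m64nNk16 (FP16/BF16), m64nNk32 (FP8), N ∈ {8,16,…,256}
A: register or smem / B: smem required / D: register
operand: 64-bit matrix descriptor (smem base+stride+swizzle)
throughput: one FP16 m64n256k16 is 524,288 FLOP, so at the H100 dense FP16 peak (989 TF/s over 132 SMs) it keeps an SM's Tensor Cores busy for roughly 130 cycles

Source PTX ISA 8.x §9.7.13.2 "wgmma" · CUTLASS 3.x Hopper kernel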

4 Cluster — Thread Block Cluster ★

Definition A new scope: grid > cluster > block > warp > thread. Blocks running on SMs of the same GPC synchronize and communicate as one unit.

portable max 8 blocks/cluster (16 opt-in)
__cluster_dims__(x,y,z) or a launch attribute
cluster.sync / cluster.arrive

Source GH100 WP §3.3 · CUDA C PG §7.27

5 DSM — Distributed Shared Memory ★

Definition Remote smem access between the blocks of a cluster. The cluster's combined smem forms a single virtual address space, reached over the SM-to-SM network.

PTX: st.shared::cluster, ld.shared::cluster
atomic: atom.shared::cluster
8-block cluster: 228 × 8 = ~1.78 MB shareable

Source PTX ISA 8.x §9.7.12 · GH100 WP §3.3

6 TE — Transformer Engine ★

Definition SW+HW stack for dynamic FP8 scaling. Keeps a per-layer scale history and automatically picks the suitable dtype between E4M3 and E5M2.

E4M3: weights/activations (dynamic range ↓, precision ↑)
E5M2: gradients (range ↑, precision ↓)
Scale updated from the amax history
Library integrations for PyTorch/JAX

Source GH100 WP §2.3 "Transformer Engine" · NVIDIA TE github docs

7 How the 5 features relate

TMA  ─────┐         ┌─── smem ─── WGMMA
          ├─(mbar)──┤
producer  │         │   consumer
warps     ↓         ↓   (warp-specialized)
Cluster + DSM: smem shared across CTAs
TE: wrapper around FP8 layers

1 Die configuration ★

spec	B200 (GB100 dual-die)
process	TSMC 4NP
die size	2 × 800 mm² (reticle limit)
transistors	208 B (total)
die-to-die link	NV-HBI 10 TB/s
SM (enabled)	2 × 80 = 160
FP32 core / SM	128
TC / SM	4 (5th gen)

Source NVIDIA GB200 NVL72 WP · GTC'24 Keynote spec slide

2 Memory

level	B200
Reg/SM	256 KB
L1+Smem/SM	256 KB
L2/die	~60 MB
HBM3e	192 GB (8 stack)
HBM BW	8.0 TB/s
NVLink 5.0	1800 GB/s

Source GB200 WP §2 · official B200 datasheet (2024)

3 New features ★ FP4·NV-HBI·TE2

5th-gen TC: FP4 E2M1 · FP6 E3M2/E2M3
Micro-scale block FP: HW support for per-block scale factors
2nd-gen Transformer Engine: automatic FP4/FP6 selection
NV-HBI 10 TB/s dual-die fusion
NVLink 5 1800 GB/s / GPU
5th-gen RAS: Dedicated RAS engine

Source GB200 NVL72 WP §2.2 · GTC'24

4 Compute peak (B200 official)

dtype	dense PF	2:4 PF
FP64 TC	0.04	—
TF32	1.1	2.2
FP16/BF16	2.25	4.5
FP8	4.5	9
FP6	4.5	9
FP4	9	18

Unit PF/s (PetaFLOP/s) · Source NVIDIA B200 official spec sheet · dense = no sparsity

5 Three-generation comparison table ★★ SM·HBM·NVL·TC

metric	A100	H100	B200
process	N7	4N	4NP
transistor	54 B	80 B	208 B
SM	108	132	160
FP32 core	6912	16896	20480
TC gen	3rd	4th	5th
Reg/SM	256K	256K	256K
Smem/SM	164K	228K	228K
L2	40 MB	50 MB	~120 MB
HBM	80 HBM2e	80 HBM3	192 HBM3e
BW	2.04 TB/s	3.35 TB/s	8.0 TB/s
NVLink	600 GB/s	900 GB/s	1800 GB/s
TDP (SXM)	400 W	700 W	1000 W
FP16 TC	312 TF	989 TF	2250 TF
FP8 TC	—	1979 TF	4500 TF

Source A100/H100/GB200 official WP · SXM form · boost clock · dense

1 Scheduler basics select → issue

Role Picks one eligible warp among the partition's resident warps and dispatches 1 instruction per cycle. eligible = dependencies resolved + functional unit free + resources OK.

Warp pool since GV100: up to 16 per partition
Selection policy: undisclosed by NVIDIA (approximately a GTO / LRR heuristic)
An empty queue is not always bad: the bubble is filled by another warp

2 Eligibility conditions

Instruction at the PC has been fetched
Source operands ready (scoreboard clear)
Destination reg slot free
Functional unit available
Not waiting at a barrier/mbarrier

Any failure → the warp stalls and another warp issues

3 Stall Reason classification ★ L·S·W·D·MIO·T·B

reason	cause
Long Scoreboard	waiting on a global / local / surface load
Short Scoreboard	waiting on smem / TC pipeline
Wait	waiting for a fixed-latency inst to complete
Drain	synchronization on the exit path
MIO Throttle	LSU/atomic/mem queue saturated
Tex Throttle	waiting on the Tex unit
Barrier	waiting at `__syncthreads` · mbarrier
Dispatch Stall	contention for the issue slot
No Instruction	I-cache miss
Not Selected	eligible but not picked

Source Nsight Compute "Warp State Statistics" section · detailed interpretation of measurements ↗ V18 §6

4 Typical cause → remedy mapping

Long Scoreboard ↑ → coalesce · tiling · cp.async/TMA
Short Scoreboard ↑ → fix smem bank conflicts · reuse TC fragments
Barrier ↑ → smaller block size / deeper pipeline
Not Selected ↑ → occupancy is already sufficient (look for another bottleneck)

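5 Latency Hiding formula

warps_needed = ⌈L × R / N⌉
L: latency (cyc), R: rate the SM must sustain for that instruction (warp-inst/cyc/SM), N: independent in-flight instructions per warp (ILP, = 1 by default)

A100 HBM load, L = 400 cyc: peak HBM BW per SM is 2.039 TB/s ÷ 108 SM ÷ 1.41 GHz ≈ 13.4 B/cyc, i.e. R ≈ 0.105 warp-loads/cyc at 128 B per warp load → ≈ 42 warps. With 16 B/thread vector loads (512 B per warp load) ≈ 11 warps, and with ILP 2 ≈ 6.

Source Application of Little's law. See NVIDIA GTC'16 "GPU Performance Analysis".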
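
The Little's-law estimate can be reproduced with a small host-side sketch. It assumes the rate R is set by HBM bandwidth (bytes the SM must keep moving per cycle divided by bytes per warp-wide load) rather than by the LSU issue rate; that choice, and the function names, are assumptions of this example. Inputs are the A100 figures from this sheet.

```cpp
#include <cassert>
#include <cmath>

// Bytes of HBM traffic one SM must move per cycle to reach peak bandwidth.
inline double bytes_per_cycle_per_sm(double hbm_bytes_per_s, int sms, double clock_hz) {
    assert(sms > 0 && clock_hz > 0.0);
    return hbm_bytes_per_s / (sms * clock_hz);
}

// warps_needed = ceil(L * R / N), with R in warp-loads per cycle per SM
// and N the number of independent loads each warp keeps in flight (ILP).
inline int warps_needed(double latency_cyc, double bytes_per_cyc,
                        int bytes_per_thread, int ilp) {
    assert(bytes_per_thread > 0 && ilp > 0);
    const double bytes_per_warp_load = 32.0 * bytes_per_thread;
    const double r = bytes_per_cyc / bytes_per_warp_load;
    return static_cast<int>(std::ceil(latency_cyc * r / ilp));
}
```

For example, `warps_needed(400, bytes_per_cycle_per_sm(2.039e12, 108, 1.41e9), 16, 2)` evaluates the vector-load, ILP-2 case.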

6 Issue Slot formulas

peak_IPC/SM = partitions × 1 = 4
utilization = issued / (4 × active_cycle)

Nsight Compute: sm__inst_issued metrics
FP32 FFMA peak (warp-inst/cyc/SM) = partitions × (FP32 lanes per partition / 32): A100 4 × 16/32 = 2, H100 4 × 32/32 = 4

7 Hopper improvements

async instructions (WGMMA/TMA): the warp moves on right after issue
More scoreboards extend ILP
Warp specialization separates stall domains

1 5-stage abstract pipeline F·D·OC·EX·WB

[F]  fetch         I-cache → IB
[D]  decode        opcode + operand spec
[OC] operand coll. RF bank read staging
[EX] execute       ALU / LSU / TC / SFU
[WB] write-back    RF write

Each stage may take several cycles (HW details undisclosed)
Pipeline depth is not a public parameter. This is a logical model.

Source Greg Smith "Fermi SM" GTC'10 · synthesis of the Volta/Turing whitepapers.

2 Execution pipe classes

pipe	example instruction	latency (cyc, A100)
FP32 FMA	FFMA	4
INT IMAD	IMAD	4
SFU	RCP·RSQ·EX2	16~32
TC MMA	HMMA	16~32
LSU smem	LDS	~20
LSU L2	LDG.L2	~200
LSU HBM	LDG	~400

Latencies are typical values from published micro-benchmarks of NVIDIA GPUs (Jia et al. 2019 / 2021) · varies by generation ↗ V02 §14

3 Hazard types

RAW (read-after-write): reads the result of an earlier write. Stalled by the scoreboard.
WAW: back-to-back writes to the same reg. Resolved by in-order HW.
WAR: a write must not overtake an earlier read of the old value. Resolved by in-order issue.
Structural: simultaneous requests to the same unit. Dispatch blocks.

4 Scoreboard HW dependency tracking

scoreboard Tracks, per warp, the read/write regs of pending instructions as a bit mask. When a dependency resolves, the eligible bit is set.

Separate short (fixed-latency) and long (variable-latency) scoreboards
SASS control codes encode scoreboard alloc/wait

SASS details ↗ V04 §14·§15

5 Bypass / Forwarding

EX → EX direct forwarding: passes the value to the next instruction before the register write
FMA chain: bypass between back-to-back FFMAs lowers dependent-step latency
HW implementation details are undisclosed. The working model assumes a 4-cycle FMA.

6 In-order Issue · Dual Issue

in-order Instructions within one warp issue in program order, but can interleave with other warps.

dual-issue: dispatch to different pipes of a partition overlaps when the conditions are met
The compiler (ptxas) schedules instructions so that such pairs can form

7 Async pipeline (Hopper)

WGMMA / TMA / cp.async go to a retire queue right after issue
Completion is observed via mbarrier or wgmma.wait_group
They pass through a separate async proxy in the pipeline → a fence against the generic proxy is needed

Async proxy details ↗ V04 §10

1 Base model relaxed + scope

Model CUDA is weakly ordered (relaxed). The hardware may reorder memory operations relative to program order. Ordering is controlled with explicit fences and acquire/release.

Scoped atomics / fences are explicit from sm_70 on
The CUDA memory model is formalized in the PTX ISA (§8)

Source PTX ISA 8.x §8 "Memory Consistency Model" (2021~)

2 Scope hierarchy ★ cta·cluster·gpu·sys

scope	extent
`.cta`	threads in a block
`.cluster`	blocks in a Hopper cluster
`.gpu`	whole device
`.sys`	host + multiple devices (UM)

Scopes go from small → large. .sys is the most expensive and is needed for UM/peer synchronization.

3 Ordering qualifier

qualifier	meaning
`.relaxed`	atomicity only, no ordering
`.acquire`	later memory ops may not be reordered before it (load side)
`.release`	earlier memory ops may not be reordered after it (store side)
`.acq_rel`	both
`.sc`	sequential consistency (acq_rel + total order); fence only

PTX ISA 8.x §8.4 "Memory Operation Ordering"

4 Acquire·Release pattern

// producer
st.global.u32             [data], %v;
st.release.gpu.global.u32 [flag], 1;

// consumer
spin:
ld.acquire.gpu.global.u32 %f, [flag];
setp.eq.u32               %p, %f, 0;
@%p bra                   spin;
ld.global.u32             %r, [data]; // observes v

If the scope is too narrow, a consumer on another SM can see stale data. Use .gpu within one GPU and .sys for peer/host communication.
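
The same message-passing shape can be tried on the host. The sketch below is a CPU analogue using `std::atomic`, not CUDA code: it has no notion of `.cta`/`.gpu`/`.sys` scope, and `message_pass` is a hypothetical helper. It only illustrates the pairing of a release store on the flag with an acquire load that makes the earlier plain write visible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Producer: plain store to data, then release store to flag.
// Consumer: acquire load of flag in a spin loop, then plain load of data.
inline int message_pass(int value) {
    int data = 0;                 // plain (non-atomic) payload
    std::atomic<int> flag{0};
    int seen = -1;

    std::thread consumer([&] {
        while (flag.load(std::memory_order_acquire) == 0) {
            // spin until the producer publishes
        }
        seen = data;              // ordered after the producer's write to data
    });
    std::thread producer([&] {
        data = value;
        flag.store(1, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Replacing both orderings with `std::memory_order_relaxed` would remove the guarantee on `data`, which mirrors the stale-data risk described above.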

5 Fence instructions

instruction	meaning
`fence.acq_rel.cta`	ordering within a block
`fence.acq_rel.gpu`	whole device
`fence.sc.gpu`	SC + device
`fence.proxy.async`	generic ↔ async proxy
`membar.gl`	legacy global fence
`__threadfence_block` / `__threadfence` / `__threadfence_system`	CUDA C++ wrappers

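6 Proxy distinction (Hopper)

proxy generic proxy = ordinary load/store. async proxy = asynchronous HW units such as TMA/WGMMA. Touching the same address through both proxies requires fence.proxy.async to establish order.

Results written to smem with st.shared that a TMA store or wgmma will read: fence.proxy.async between the write and the async op
TMA load followed by an ordinary ld: wait on the mbarrier that tracks the copy before reading

PTX ISA 8.x §9.7.13.3 "async proxy fence"

7 Atomic types

atom.global.add.s32 · also .min · .max · .and · .or · .xor · .exch · .cas
scope: .cta / .cluster (sm_90) / .gpu / .sys
float atomics: atom.add.f32 · f64 on sm_60+

PTX atomic details ↗ V03 §11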

1 Compute latency (cyc) 4·20·200·400

op	V100	A100	H100
FFMA (FP32)	4	4	4
IMAD	4	4	4
FP64 FMA	8	8	8
SFU RCP	~18	~16	~16
HMMA (FP16)	~22	~16	~16

Source Jia et al. "Dissecting GPU Architectures" (2019, 2021) · typical values. HW details are undisclosed by NVIDIA.

2 Memory access latency (cyc)

access	V100	A100	H100
LDS (smem)	~23	~22	~23
LDG L1 hit	~28	~30	~33
LDG L2 hit	~193	~200	~260
LDG L2 miss (HBM)	~400	~450	~490
Atomic L2	~270	~260	~300

Source Jia 2019/2021 · NVIDIA GTC'22 micro-benchmark slides. Unit: clock cycles.

3 What HBM latency means

time_HBM = cyc_HBM / f_clock
A100: 450 / 1.41 GHz ≈ 319 ns
H100: 490 / 1.98 GHz ≈ 247 ns
The cycle count rises with the faster clock, so absolute time improves less than the clock ratio

HBM physical access ~100 ns + NoC + L2 + MC overhead
Cycle count ≠ absolute time (the clock differs per generation)
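
The conversion is a one-liner, shown here only so the two generations can be compared on the same axis. `cycles_to_ns` is a hypothetical helper, and the cycle counts are the micro-benchmark estimates quoted above, not vendor specifications.

```cpp
#include <cassert>

// time = cycles / f_clock. With the clock in GHz the result is in ns,
// because 1 GHz corresponds to 1 cycle per nanosecond.
inline double cycles_to_ns(double cycles, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return cycles / clock_ghz;
}
```

`cycles_to_ns(450, 1.41)` and `cycles_to_ns(490, 1.98)` give the A100 and H100 figures above.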

4 Warp-level hiding Example

A100 HBM ~450 cyc: sustaining peak BW takes 2.039 TB/s ÷ 108 SM ÷ 1.41 GHz ≈ 13.4 B/cyc/SM, so ≈ 6 KB must be in flight per SM → ≈ 48 warps at 128 B per warp load. 16 B/thread vector loads → ~12 warps; adding ILP 2 → ~6.

Little's law · reference paper Volkov 2010

5 TC async instructions (Hopper)

op	H100 cyc
`wgmma.mma_async m64n256k16`	~130 at peak dense FP16 rate (throughput-bound)
`cp.async.bulk.tensor` 128B	~100+
`mbarrier.try_wait`	~1 (polling)

Source CUTLASS 3.x Hopper kernel timing · NVIDIA GTC'22

6 Rule of Thumb

smem ≈ 20 cyc
L2 ≈ 200~260 cyc
HBM ≈ 400~500 cyc
FFMA ≈ 4 cyc
HMMA ≈ 16 cyc

7 Caution

All figures above are micro-benchmark estimates, not published HW specs. For interpreting measurements see ↗ V18 §6. Do not make cycle-exact 1:1 decisions in kernel tuning.

1 Theoretical BW per level ★

level	A100	H100	B200
Reg (all SMs)	~14 TB/s	~31 TB/s	~36 TB/s
Smem (all SMs)	~19 TB/s	~33 TB/s	~33 TB/s
L2 total	~4 TB/s	~5.5 TB/s	~13 TB/s
HBM	2.04 TB/s	3.35 TB/s	8.0 TB/s

Source A100/H100/GB200 WP + GTC'22, GTC'24 official slides. SM-internal BW is per-SM BW × SM count.

2 Peak vs Sustained

sustained Effective BW = average measured over back-to-back kernels. Typically 70~90% of peak. Affected by ECC, open rows, coalescing, and L2 hit rate.

Fully coalesced, large streaming access reaches 80~90% of HBM peak
Random access or stride ≠ 1 drops it below half

3 Intra-SM bandwidth calculation

Smem_BW/SM = 32 banks × 4 B × clock
A100: 32·4·1.41e9 ≈ 180 GB/s/SM
total: 180·108 ≈ 19 TB/s
Assumes one conflict-free 32-lane smem access per cycle

4 Roofline inputs FLOP/Byte

GPU	peak FP16 TC	HBM	ridge (FLOP/B)
A100	312 TF	2.04 TB/s	153
H100	989 TF	3.35 TB/s	295
H100 FP8	1979 TF	3.35 TB/s	591
B200 FP8	4500 TF	8.0 TB/s	562

ridge = minimum arithmetic intensity (AI) at which a kernel becomes compute-bound · ↗ V18 §2
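
The ridge column is just peak compute divided by peak bandwidth, and the same two numbers give the attainable rate at any arithmetic intensity. The sketch below is a plain roofline model with hypothetical function names; it uses HBM bandwidth only and ignores L2 reuse.

```cpp
#include <algorithm>
#include <cassert>

// ridge (FLOP/byte) = peak compute / peak bandwidth.
// TFLOP/s divided by TB/s gives FLOP/byte directly.
inline double ridge_flop_per_byte(double peak_tflops, double hbm_tb_per_s) {
    assert(hbm_tb_per_s > 0.0);
    return peak_tflops / hbm_tb_per_s;
}

// Attainable TFLOP/s at arithmetic intensity `ai` (FLOP/byte):
// bandwidth-bound below the ridge, compute-bound above it.
inline double attainable_tflops(double peak_tflops, double hbm_tb_per_s, double ai) {
    assert(ai >= 0.0);
    return std::min(peak_tflops, ai * hbm_tb_per_s);
}
```

On A100 FP16 TC numbers, a kernel at 50 FLOP/B is capped near 102 TF/s by HBM, while one at 500 FLOP/B can reach the 312 TF/s compute peak.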

5 PCIe Gen4 vs Gen5

gen	x16 per-dir	GPU
Gen4	32 GB/s	A100 PCIe
Gen5	64 GB/s	H100/B200 PCIe

64 GB/s per direction = 128 GB/s bidirectional aggregate (the difference is only in quoting convention)

6 NVLink generation summary

gen	links	total BW
NVLink 3 (A100)	12	600 GB/s
NVLink 4 (H100)	18	900 GB/s
NVLink 5 (B200)	18	1800 GB/s

NVSwitch and NVL topology ↗ V15 §3

7 Main causes of effective BW loss

stride access → wasted sectors
partial warp load → no merging
row thrashing → HBM row open/close
Turning ECC off is not recommended even where it is possible

1 TDP by generation

GPU	form	TDP
V100 SXM2	SXM2	300 W
A100 SXM4	SXM4	400 W
A100 PCIe	PCIe	250~300 W
H100 SXM5	SXM5	700 W
H100 PCIe	PCIe	350 W
H200 SXM	SXM5	700 W
B200 SXM	SXM6	1000 W
GB200 Superchip	liquid	2700 W (GPU+CPU)

Source datasheet of each generation · GB200 NVL72 WP §3.1

2 DVFS Dynamic frequency·voltage

DVFS Clock and voltage are adjusted automatically according to the power cap, temperature, and load. Exposed in Nsight as the SM clock metric.

Boost clock is the ideal-case ceiling
Under sustained 100% load the clock converges toward base
nvidia-smi -lgc can lock the clock (DC)

3 Thermal Throttling

state	trigger	reaction
T<T_slowdown	none	boost held
T≥T_slowdown	HW sensor	clock stepped down
T≥T_shutdown	critical	clock drop or halt

T_slowdown (A100/H100 ≈ 88~90°C)
Liquid-cooled GB200 targets roughly 50~55°C

Source NVIDIA DCGM user guide · datasheet thermal spec

4 Power Instrument

nvidia-smi --query-gpu=power.draw,power.limit
NVML API: nvmlDeviceGetPowerUsage
Sampling rate is a few Hz (ms-scale events cannot be resolved)

5 Power efficiency metric

FLOP/s per W = FLOP per Joule
H100 FP16 peak: 989 TF / 700 W ≈ 1.41 TFLOP/s/W
B200 FP4 peak: 9000 TF / 1000 W ≈ 9 TFLOP/s/W
dense · vendor-published figures

6 Clock Gating

The clock to idle pipes is stopped automatically → dynamic power ↓
If a warp uses only the SFU, the FP64 pipe is gated off
Not controllable from SW; automatic in HW

7 What it means for profiling

Power saturation = possible DVFS throttling → clocks differ between kernels
Short burst kernels hold boost; long-running ones sit near base
Clocks can differ between an isolated kernel profile and a real cluster run

Interpreting measurements / Nsight ↗ V18 §7

1 GA100 vs GA102 ★

item	GA100 (A100)	GA102 (RTX 3090)
target	DC / HPC / AI	Gaming / Workstation
process	N7	Samsung 8N
SM	108	82
FP32/SM	64	128
FP64 ratio	1:2	1:64
Smem/SM max	164 KB	100 KB
L2	40 MB	6 MB
NVLink	600 GB/s	limited (3090: 2-way bridge only)
RT Core	—	2nd gen
ECC HBM	✓	—
MIG	✓	—

Source GA100 WP · GA102 WP (2020)

2 H100 vs L40S (Ada DC)

item	H100 SXM	L40S
arch	Hopper GH100	Ada AD102
SM	132	142
FP64	67 TF (TC)	1.4 TF (no FP64 TC)
FP16 TC	989 TF	362 TF
FP8 TC	1979 TF	733 TF
Memory	80 GB HBM3	48 GB GDDR6
BW	3.35 TB/s	864 GB/s
NVLink	900 GB/s	—
RT Core	—	3rd
TDP	700 W	350 W

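Source H100/L40S official datasheets

3 Why DC parts specialize in FP64·NVLink ★

HPC (CFD/climate/nuclear) requires FP64
Large models / distributed training are bottlenecked by NVLink·NVSwitch BW
On RTX cards, FP64 is the market segmentation line

4 Main lineage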

arch	DC chip	RTX chip
Pascal	GP100	GP102
Volta	GV100	none (split off as Turing)
Turing	T4 (TU104)	TU102
Ampere	GA100	GA102
Ada	AD102 (L40)	AD102 (RTX 40)
Hopper	GH100	— (DC only)
Blackwell	GB100/GB200	GB202 (RTX 50)

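The arch name is the same; only the chip (die) differs. PMPP 4e App.

5 Common to DC / RTX

PTX ISA / CUDA API are almost identical
Features branch on compute capability
RTX also carries TCs (same generation)

6 Practical cautions

EULA warning: the GeForce driver license restricts data-center use. Production deployment requires DC cards such as Tesla/H100.

Prototype on RTX in a dev PC → deploy on DC
sm_86 (Ampere RTX) vs sm_80 (A100): PTX-compatible, but the optimal tuning values differ

1 MIG definition slice = GPC + L2 + HBM

MIG Splits one physical GPU in HW into up to 7 independent instances. Each instance has its own SMs, L2 slices, HBM region, and engines, and is fully isolated.

A100: up to 7 slices · H100: up to 7 slices
Supported on H100 PCIe / H200 / A100. Not available on RTX.

Source NVIDIA MIG User Guide (2023) · A100/H100 WP "MIG" section

2 GPU Instance (GI) profiles (A100 80GB)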

profile	SM	mem	slot
MIG 1g.10gb	14	10 GB	1
MIG 2g.20gb	28	20 GB	2
MIG 3g.40gb	42	40 GB	3
MIG 4g.40gb	56	40 GB	4
MIG 7g.80gb	98	80 GB	7

Source NVIDIA A100 MIG User Guide Table 2

3 H100 MIG profiles

profile	SM	mem (80GB)
1g.10gb	16	10 GB
1g.20gb	16	20 GB
2g.20gb	32	20 GB
3g.40gb	60	40 GB
4g.40gb	64	40 GB
7g.80gb	132	80 GB

Source NVIDIA H100 MIG User Guide · 2024

4 Compute Instance (CI)

CI A sub-partition that further divides the compute of a GI. A 7g GI can be split into CIs 1c+1c+…

GI = memory isolation · CI = SM group split
CIs in the same GI share L2/HBM

5 Isolation level ★

SMs are separate: a stall in another MIG instance has no effect
L2 slices are separate: cache pollution is blocked
HBM addresses are separate: memory BW is partitioned
QoS: bandwidth guaranteed
However, some resources such as NVLink/NVDEC may be shared

Source A100 MIG User Guide §2.3

6 Usage scenarios

scenario	benefit
serving multi-tenant	QoS · isolation
small batch infer	GPU utilization ↑
Dev + Prod separation	safety
CI/CD tests	parallel queues

7 Limits

Cluster/DSM use is constrained under MIG: a cluster is confined to one GPC, so a cluster cannot span MIG slices.

Reconfiguration is impossible while kernels run (nvidia-smi, only when MIG is idle)
Peak FLOPS in a slice ≈ total × (SM fraction)
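
The SM-fraction rule can be applied to the profile tables directly. The sketch below is only that proportional approximation; `mig_slice_peak_tflops` is a hypothetical helper, and it assumes the slice runs at the same clock as the full GPU.

```cpp
#include <cassert>

// slice peak ~= full-GPU peak * (slice SMs / SMs enabled on the full GPU)
inline double mig_slice_peak_tflops(double full_peak_tflops,
                                    int slice_sms, int full_sms) {
    assert(full_sms > 0 && slice_sms >= 0 && slice_sms <= full_sms);
    return full_peak_tflops * slice_sms / full_sms;
}
```

For an A100 (108 SMs, 312 TF/s dense FP16 TC) a 1g.10gb slice with 14 SMs comes out near 40 TF/s, and the 7g.80gb profile with 98 SMs near 283 TF/s, below the non-MIG peak because 10 SMs are not assigned to any slice.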

1 GPU arch lineage ★ K·M·P·V·T·A·Ada·H·B

arch	year	CC	representative chip
Kepler	2012	3.x	K20·K80
Maxwell	2014	5.x	M40
Pascal	2016	6.x	P100 (GP100)
Volta	2017	7.0	V100 (GV100)
Turing	2018	7.5	T4 · RTX 20
Ampere	2020	8.0/8.6	A100 · RTX 30
Ada	2022	8.9	L40 · RTX 40
Hopper	2022	9.0	H100 · H200
Blackwell	2024	10.0 / 12.0	B200 · GB200 (10.0) · RTX 50 (12.0)

Source NVIDIA CUDA C PG App. H generation table · release year of each WP

2 One-line summary per generation

Volta: 1st Tensor Core, NVLink 2
Turing: INT TC + RT Core
Ampere: 3rd TC, TF32, MIG, cp.async
Ada: RTX-only (DC = L40), shader execution reordering
Hopper: TMA·WGMMA·Cluster·FP8·TE
Blackwell: FP4/FP6·dual-die·NVLink 5

3 Memory at a glance

level	cap	lat	BW
Reg	256 KB/SM	~1	full
Smem	164/228 KB	~20	~19/33 TB/s
L1	shared w/ smem	~30	—
L2	40/50 MB	~200	~4/5.5 TB/s
HBM	80/192 GB	~400	2/3.35/8 TB/s

In A100 / H100 / B200 order · unit: cycles or TB/s

4 Hopper mnemonic

Hopper 5 features: TWCDT (TMA · WGMMA · Cluster · DSM · Transformer Engine)

7 main stalls: LSWDMBN (Long · Short · Wait · Drain · MIO · Barrier · NoInst)

5 Compute capability quick lookup

arch	-arch string
Ampere A100	`sm_80`
Ampere RTX 30	`sm_86`
Ada L40/RTX 40	`sm_89`
Hopper H100	`sm_90` / `sm_90a`
Blackwell B200	`sm_100` / `sm_100a`

WGMMA and setmaxnreg require sm_90a (architecture-specific); TMA is available on plain sm_90. Details ↗ V04 §1

6 Three-generation summary ★★

row	A100	H100	B200
SM	108	132	160
TC gen	3rd	4th	5th
HBM	80 2e	80 3	192 3e
BW TB/s	2.0	3.35	8.0
NVLink	600	900	1800
FP16 TC	312	989	2250
FP8 TC	—	1979	4500
FP4 TC	—	—	9000
TDP W	400	700	1000
Process	N7	4N	4NP

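7 Ten mistakes to avoid

sm_90 alone cannot build WGMMA → use sm_90a
256 KB reg/SM, but a single thread is capped at 255 registers
smem sync uses __syncthreads; TMA completion uses mbarrier
cluster ≤ 8 blocks portable / 16 opt-in
DSM works only within the same cluster
The L2 persistence window follows the stream's lifetime
MIG shares part of NVLink · cluster constraints
The official peak BW assumes ECC on
FP16 with FP16 accum ≠ FP32 accum (accuracy differs)
Cycle estimates come from micro-benchmarks and are not absolute time

8 Learning path

Next: PTX ISA ↗ V03
Hopper in depth: ↗ V04
SM figures → kernel design: ↗ V01 §4·V06
Measurement · verification: ↗ V18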
