CUDA ONEBOOKS · T6 METHODOLOGY · A4 LANDSCAPE · 14p

Roofline & Profiling 단권화

Roofline Model · Nsight Metric Framework · Stall Reason · Bound-type Decision Tree 메트릭의 의미와 해석 프레임만

Volume V18/18

Tier T6 방법론

선행 V02 (GPU HW)

용도 프로파일링 판독 지도 정지된 지도

범례

핵심 용어 (노랑)

매우 중요·표 헤더

정의·수식 박스

예시 박스

빨강오해·함정

★실무 핵심

(!)니모닉

AIArithmetic Intensity FLOP/Byte

SOLSpeed of Light (peak 대비 %)

↗ V##타 권 cross-ref

Out of Scope · Nsight UI 조작법 · 실측 벤치마크 수치 · 튜닝 사례 · trace 분석 감각 (코드 영역)

Roofline (Williams 2009) · Nsight Compute metric spec · NVIDIA profiler guide — 14 pages

1 Roofline Equation 성능 상한 모델 P = min(peak, AI·BW)

P(AI) = min( P_peak, BW · AI ) P : attainable FLOP/s P_peak : compute roof BW : memory bandwidth AI : arithmetic intensity

의미 kernel 성능은 compute roof와 memory roof 중 낮은 쪽으로 bound된다. AI가 작으면 BW·AI (사선), 크면 P_peak (수평) 이 지배.

Williams, Waterman, Patterson "Roofline: An Insightful Visual Performance Model" (CACM 2009)

2 Arithmetic Intensity 정의 ★

AI = FLOPs / Bytes_HBM Bytes_HBM : DRAM traffic only · on-chip reuse는 제외

FLOPs = 유효 부동소수 연산 수 fused ops는 FMA=2로 카운트
Bytes = 메모리 계층 간 실제 전송량 기준 레벨에 따라 달라짐
AI는 알고리즘 + 구현 결합 특성 — 같은 수식도 tile 크기로 바뀜

3 Ridge Point 분기점 AI* = peak/BW

AI^* = P_peak / BW [FLOP/Byte] AI < AI* → memory-bound AI ≥ AI* → compute-bound

예 A100 FP16: 312 TF / 1.5 TB/s ≈ 208 FLOP/B
H100 FP16: 989 TF / 3.35 TB/s ≈ 295 FLOP/B
H100 FP8 : 1979 TF / 3.35 TB/s ≈ 591 FLOP/B

수치 출처: NVIDIA A100/H100 whitepapers · cross-ref ↗ V02 §8·§9

4 Log-Log Plot Template ★

log FLOP/s
  ▲
  │              ┌──────── P_peak (compute roof)
  │             /
  │  slope=BW  /
  │           /
  │          /  ← memory roof (BW·AI)
  │         /
  └────────┼─────────────────► log AI
          AI*
  kernel point (AI, P_obs):
    below both roofs → headroom
    on memory roof   → memory-bound
    on compute roof  → compute-bound

5 해석 규칙 3원칙

수직 거리 = peak 대비 얼마나 loss % 로 환산 가능
가로축 위치 = AI를 바꿔야 ceiling 상승 알고리즘/tile 문제
roof 선택 = 해당 dtype의 peak (FP32 roof ≠ Tensor Core roof)

6 Roof 종류 dtype별

roof	지표
FP64 peak	scientific
FP32 CUDA core	general
FP16/BF16 TC	training
FP8 TC	inference
INT8 TC	quantized

dtype별로 서로 다른 roof line — 같은 plot에 여러 수평선

7 한계 모델이 잡지 않는 것

latency (occupancy 부족) — Roofline은 steady-state throughput 만 본다
atomic contention, bank conflict로 인한 effective BW 감소
kernel launch overhead, sync cost

함정: "memory roof 아래 = 자동으로 memory-bound" 아님. latency-bound도 memory roof 아래에 찍힘.

1 왜 Multi-tier인가 ★

문제 단일 "HBM roof" 만으로는 on-chip reuse가 좋은 kernel을 판단 못 함. cache hit가 높으면 effective BW는 HBM 이상.

같은 AI_HBM 라도 L2 hit↑ 이면 L2 roof 아래에서도 돌아감
계층별 BW가 다르므로 ceiling도 계층별
kernel 하나에 여러 point를 찍어 어느 level이 병목인지 식별

2 계층별 Bandwidth 랭킹 Ampere 기준

level	BW (per SM·cycle or agg.)	AI* (FP16 TC)
HBM	~1.5 TB/s	~208
L2	~5 TB/s	~62
SMEM (per SM agg.)	~19 TB/s	~16
Register	~TB/s 단위 실효	—

order-of-magnitude · A100 whitepaper / NVIDIA Tuning Guide · ↗ V02 §15

3 Hierarchical Plot ★

log FLOP/s
  ▲
  │            ──────── P_peak
  │           /
  │          / SMEM roof (19 TB/s · AI_smem)
  │         /  ← 기울기 가장 급
  │        /
  │       / L2 roof (5 TB/s · AI_L2)
  │      /
  │     /
  │    / HBM roof (1.5 TB/s · AI_HBM)
  │   /
  └──┼─────────────────────────► log AI
     각 level의 AI = FLOPs / Bytes_level
     level별로 kernel 좌표 다름

4 Level별 AI 정의

AI_HBM = FLOPs / Bytes_HBM
AI_L2 = FLOPs / Bytes_L2
AI_SMEM = FLOPs / Bytes_SMEM 같은 kernel이라도 level별 Bytes 다름 → AI 다름 → plot 좌표 다름

5 판독 규칙 어느 level이 병목

kernel point가 HBM roof에 붙음 → HBM-bound (reuse 없음)
HBM roof는 여유, L2 roof에 붙음 → L2-bound (hit 낮거나 L2 BW 한계)
HBM·L2 여유, SMEM roof에 붙음 → SMEM BW 또는 bank conflict
모든 memory roof 위, P_peak에 붙음 → compute-bound (목표 상태)

6 GEMM 단계별 이동

예 naive GEMM → HBM roof
+ shmem tiling → L2/SMEM roof 방향으로 우측 이동
+ Tensor Core → compute roof 근접
cross-ref ↗ V06 §3 GEMM tiling 단계

7 Multi-tier의 함정

AI_SMEM 기준으로만 plot 하면 SMEM bank conflict에 의한 effective BW 감소가 숨겨진다 — conflict 지표 별도 확인 필요 (§6).

1 GEMM FLOPs/Bytes ★

C = A·B, A:M×K, B:K×N, C:M×N
FLOPs = 2·M·N·K FMA=2
Bytes_naive = sizeof · (M·K + K·N + M·N) tile 없이 매 inner product마다 A·B 재로드하면 K배 이상 증폭됨

2 GEMM AI (이상적 단일 패스)

AI_ideal = 2MNK / ( 2·(MK + KN + MN) ) M=N=K=d 라면 AI ≈ d/3 · FP16(2B) → AI_FLOP/B = d/6

예 M=N=K=4096, FP16 FLOPs = 2·4096³ ≈ 1.37×10¹¹
Bytes = 2·(3·4096²) ≈ 10⁸
AI ≈ 1365 FLOP/B >> AI*_H100=295 → compute-bound

3 GEMM 형상별 AI 표

M=N=K	AI (FP16)	region
128	~21	memory-bound
512	~85	mem/comp 경계
1024	~170	compute 근처
4096	~683	compute-bound
8192	~1365	deep compute

AI ≈ d/6 근사 · ridge AI_H100=295 대비

4 Tall-Skinny / Short-Fat

M 작고 K 큰 GEMV-류 → AI 낮음 (memory-bound)
decoder bs=1 inference → GEMV, ridge 왼쪽
reuse 낮음이 본질적 → tile 늘려도 한계

Split-K / Stream-K 선택 ↗ V06 §5

5 FA Kernel AI ★

FA: seq=N, head=d, block B_r·B_c
FLOPs ≈ 4·N²·d QK + PV 각 2N²d
Bytes_HBM ≈ sizeof·(N·d + N·d + N·d) = 3·N·d·s S=QK^T matrix는 SMEM에서만 — HBM에 안 씀

AI_FA 근사 AI ≈ 4N²d / (3Nd·s) = 4N/(3s)
FP16 → AI ≈ 2N/3 · N=2048 → ≈ 1365
→ compute-bound (긴 시퀀스일수록)

cross-ref ↗ V07 §4 FA AI 유도

6 Elementwise / Softmax AI

op	AI (FP16)
add, mul	~0.25
RMSNorm	~0.5
softmax (1 pass)	~0.5
LayerNorm	~1.0

항상 memory-bound → fusion 만이 유일 해결책 ↗ V08

7 AI 개선 경로 일반 규칙

reuse 증가: tile ↑, register blocking
redundant load 제거: fusion
Bytes 감소: dtype ↓ (FP8/INT8)
recompute: bytes ↓ 대신 FLOPs ↑

1 NCU Section 계층 ★

정의 Nsight Compute (NCU) 리포트는 Section 단위로 구성. 각 Section = 관련 metric 묶음 + 해석 rule.

GPU Speed of Light (SOL) — 상위 요약, peak 대비 %
Memory Workload Analysis — DRAM/L2/L1/SMEM 트래픽
Compute Workload Analysis — pipe 별 utilization
Scheduler Statistics — warp issue/eligible
Warp State Statistics — stall 원인
Source Counters — 소스 라인 attribution

NVIDIA Nsight Compute Documentation · Section 정의만 참조

2 판독 순서 top-down SOL→Mem→Comp→Sched→Warp→Src

SOL 으로 bound type 1차 판별
Memory / Compute Workload 로 근거 확인
Scheduler / Warp State 로 원인 범주
Source View 로 라인 단위 attribution

3 Metric 명명 규칙

prefix	의미
`sm__`	SM 단위 카운터
`smsp__`	SM sub-partition
`l1tex__`	L1/TEX unit
`lts__`	L2 slice
`dram__`	HBM 컨트롤러
`gpc__`	GPC (SM 묶음)

4 Metric Suffix 측정 방식

suffix	의미
`.avg`	평균
`.sum`	합계
`.pct_of_peak`	peak 대비 %
`.per_cycle_active`	active cycle당
`.ratio`	비율

5 Section ↔ 해석 질문 매핑

section	답하는 질문
SOL	compute vs memory 어디?
Memory	어느 계층이 포화?
Compute	어느 pipe가 포화?
Scheduler	warp가 충분히 준비?
Warp State	왜 stall?
Source	어느 라인?

6 Counter 종류

Throughput counter — byte/s, FLOP/s, inst/s
Utilization counter — peak 대비 %
Event counter — 누적 횟수 (cache miss 등)
Ratio counter — derived (hit rate 등)

7 해석 원칙

단일 metric에 매달리지 말 것. Section 통째로 보고 교차 확인. SOL Memory 만 보고 "memory-bound" 단정은 오진 가능.

1 SOL 개념 ★

정의 SOL = 해당 unit이 peak 대비 몇 %로 돌아갔는가. kernel 전체 runtime 에 대해 평균한 값.

SOL_unit = observed_unit / peak_unit · 100 [%] peak은 HW 이론치 (whitepaper) · observed는 실측 counter

2 SOL 3대 지표

지표	peak 기준
SOL SM	SM compute pipe 포화
SOL Memory	HBM BW 포화
SOL TEX/L1	L1/TEX 단위 처리량
SOL L2	L2 cache 처리량

3 해석 규칙 ★

조건	판정
SOL Mem > SOL SM (차이 큼)	memory-bound 후보
SOL SM > SOL Mem (차이 큼)	compute-bound 후보
둘 다 낮음 (< 60%)	latency-bound 의심
둘 다 높음	balanced, 목표 상태

NVIDIA Nsight Compute Kernel Profiling Guide — SOL Section 해석 rule

4 SOL SM 세부

의미 SM 내부 모든 pipe (FMA, ALU, LSU, Tensor, SFU) 중 최대값.

어느 pipe가 bottleneck 인지는 §9 Compute Workload
Tensor Core 전용 kernel이면 Tensor pipe가 SM SOL을 지배

5 SOL Memory 세부

의미 memory 하위 시스템 (DRAM, L2, L1, SMEM) 중 최대 처리량 %.

SOL Memory 80%+ → HBM BW saturate, 추가 BW 확보 불가
계층별 세부는 Memory Chart 에서 (§6)

6 SOL TEX / L1

L1 unit이 포화되면 LSU bottleneck
shared memory load/store 도 L1 unit 경유 → 영향
uniform memory instruction 많으면 MIO throttle 과 연관

7 SOL 해석 함정 (!) 니모닉

SOL 판독 3Q: CMB (Compute·Memory·Balance)

함정: SOL SM 이 낮아도 compute-bound 일 수 있다 — warp divergence 때문에 pipe가 놀고 있지만 실제로는 compute waste 중인 경우.

1 Memory Chart 구조 ★

의미 DRAM ↔ L2 ↔ L1/TEX ↔ SMEM ↔ Register 간 데이터 흐름을 화살표·숫자로 표기. 각 단계별 throughput + hit rate.

  DRAM
   ↓ bytes_r / bytes_w
    L2  (hit rate %)
   ↓
  L1/TEX (hit rate %)
   ↓
  SMEM (load/store · bank conflict)
   ↓
  Register file

2 DRAM Throughput

DRAM_% = dram__bytes.sum / (peak_BW · time) · 100 peak_BW : HBM 이론치 · ↗ V02 §5

read/write 분리 표시 — 불균형 시 특정 패턴 의심
ECC overhead는 peak 에 이미 반영 (whitepaper 수치)

3 Cache Hit Rate ★

level	hit rate 의미
L1 hit	block 내 reuse
L2 hit	SM 간 공유 / prefetch 성공
DRAM (miss)	HBM 까지 내려감

Hit rate 낮음 = traffic 증폭 = DRAM BW 낭비

4 Sector / Request

sectors / request = 실제 32B sector 수 / memory instruction 수 이상 coalesced load = 4 sectors/request (128B access) · 이상 값이 크면 uncoalesced

global load: sectors/req 1~4 정상
sectors/req > 4 → uncoalesced 의심

5 Shared Memory Load/Store

지표	해석
SMEM load throughput	reuse 활용도
SMEM store throughput	write 빈도
bank conflict count	access pattern 문제
wavefronts per request	1 이상 = 충돌

6 Bank Conflict

정의 warp 내 스레드가 같은 bank (32 bank)에 동시에 접근 → serialization. N-way conflict = N cycle.

wavefronts / request 값이 핵심 지표
해결 범주: padding, swizzle, ldmatrix ↗ V03 §8

7 L2 Hit 전략 범주

block 배치로 working set 축소
L2 persistence ↗ V02 §4
tile 순서 변경 (z-order, swizzle)

함정: L1 hit 100%인데 DRAM throughput 여전히 높음 → 다른 block이 같은 데이터 재로드 (L2 miss).

1 Stall Reason 개념 ★

정의 매 cycle warp scheduler가 sample한 각 warp의 상태. "준비 안 된" warp들의 왜를 집계한 것이 stall reason.

분포는 cycle count 가중 %로 표시
top 1~2 reason 이 진짜 bottleneck
원인 ≠ 해법 — 아래 표는 범주만

NVIDIA Nsight Compute · Warp State Statistics Section · ↗ V02 §11

2 판독 원칙

상위 2~3 reason 누적 %
나머지는 백색 잡음으로 간주
reason 은 증상일 뿐, 원인 위치는 Source View

3 Stall Reason 8종 정의 매트릭스 ★★

이름	의미	전형적 원인 범주	해법 범주 프레임 수준
Long Scoreboard	global/local memory load 결과 대기	HBM latency, reuse 부족, prefetch 없음	shmem tiling, `cp.async` pipelining, L2 hit ↑
Short Scoreboard	SMEM load / MIO 결과 대기	shared memory 지연, bank conflict	swizzle, `ldmatrix`, padding
Wait	이전 instruction의 fixed latency 대기	dependency chain, low ILP	unroll, 독립 op 삽입, ILP 확보
Not Selected	ready 인데 scheduler가 다른 warp 선택	eligible warp 과다 (좋은 신호)	통상 건드리지 않음 — ILP 확보 여지
Drain	block exit 시점 작업 drain 대기	tail effect, block 수 vs SM 배수 부정합	grid 크기 조정, persistent kernel
Barrier	`__syncthreads` / cluster barrier 대기	sync 빈도 과다, work imbalance	sync 횟수 ↓, warp specialization
MIO Throttle	Memory I/O unit queue 포화	uniform load 쏠림, LDG 폭주	load 분산, vectorize (LDG.128)
Tex Throttle	TEX/L1 unit 포화	texture/L1 요청 과다	request 분산, cache 활용 재검토

"해법"은 프레임워크 수준 범주일 뿐 — 구체 튜닝 사례·UI 조작은 단권화 범위 밖 (out-of-scope)

4 Stall 그룹핑 bound 추론

지배 stall	추정 bound
Long Scoreboard ↑↑	HBM latency/BW bound
Short Scoreboard ↑↑	SMEM bound
Barrier ↑↑	sync bound (work imbalance)
Wait ↑↑ + Math pipe ↑↑	compute bound
Not Selected ↑↑	여유 있음 — bottleneck은 다른 곳

1 Warp Scheduler 요약 ★

역할 매 cycle 각 scheduler sub-partition에서 eligible warp 하나를 골라 instruction을 issue. Ampere/Hopper는 dual issue 가능.

SM 내부 구조 ↗ V02 §1·§11

2 핵심 3지표

metric	의미
Active Warps	assigned warp 수 (평균/SM)
Eligible Warps / cycle	issue 가능 warp 수
Issued Warps / cycle	실제 issue된 warp 수

3 상태 전이

Active → Eligible → Issued
  ↓         ↓         ↓
stalled   not-selected  success

4 Issue Slot Utilization

slot_util = issued_warps_per_cycle / max_schedulers max_schedulers = SM 내 scheduler 수 (Ampere/Hopper = 4 sub-partition)

100% = 매 cycle issue 성공
낮으면 stall 또는 eligible 부족

5 Eligible Warps 판독 ★

조건	진단
Active 높음, Eligible 낮음	전원 stall — latency hiding 실패
Eligible 높음, Issued 낮음	issue 단의 pipe 제약
Eligible > 1	ILP 여유 있음 (Not Selected ↑)
Eligible < 1	warp 수 부족, occupancy ↑ 검토

6 Latency Hiding 조건

warps_필요 ≥ L · T / S L : instruction latency (cycle) T : throughput (inst/cycle) S : scheduler 수

latency 높을수록 warp 많이 필요
HBM latency ~400~600 cycle → warp 대기 비용 큼
↗ V02 §14 latency table

7 Scheduler ↔ Occupancy

연결 Eligible Warps 부족이 반복되면 occupancy 부족 가능성. 단, occupancy 높다고 eligible이 보장되지 않음 (전원 stall 가능).

Occupancy 상세 ↗ p11 §10

8 함정

오해: Issue Slot Util 100% = 최적. 사실은 특정 pipe만 포화된 상태일 수 있음. Compute Workload (§9) 로 pipe별 확인 필요.

1 Pipe 구조 ★

개념 SM 내부는 기능별 pipe로 나뉨. Compute Workload Section은 pipe 별 utilization %를 보여줌.

FMA, ALU, FP64, LSU, Tensor, SFU, XU 등
pipe마다 별도 throughput · 직접 경합 없음

Pipe 구성 세대별 차이 ↗ V02 §1

2 주요 Pipe 정의 표

pipe	담당 연산
FMA	FP32 fused multiply-add
ALU	int / logical / bitwise
FP64	double precision
LSU	load/store unit
Tensor (TC)	MMA, WGMMA
SFU	sin, cos, rcp, rsqrt, exp
XU (Ampere+)	branch, predicate, misc

3 해석 규칙 ★

포화 pipe	bound 의미
Tensor ↑↑	TC-bound — 목표 상태 (GEMM)
FMA ↑↑	FP32 compute-bound
ALU ↑↑	int 연산 포화 (indexing 과다?)
LSU ↑↑	load/store 집중 (SMEM traffic)
SFU ↑↑	초월함수 병목 (exp, softmax)
모든 pipe 낮음	compute 아닌 다른 bound

4 Tensor Pipe Active %

TC% = mma_inst · shape_FLOP / (cycle · SM · TC_/SM · FLOP_/TC) GEMM 최적 커널은 60~80%+ · FA forward는 70%+ 가능

5 Pipe 합산의 함정

함정: 여러 pipe util을 단순 합산하면 100% 초과. pipe들은 병렬이므로 각각 독립 %.

6 Tensor vs non-Tensor

TC-enabled kernel: TC pipe 높고 FMA 낮음이 정상
TC 낮음 + FMA 높음 → TC 경로 누락 dtype / ldmatrix 확인
TC + FMA 둘 다 낮음 → memory/latency bound

7 Pipe 별 개선 범주

포화 pipe	개선 방향 (범주)
ALU 과다	indexing 단순화, const folding
LSU 과다	vectorized load, SMEM tiling
SFU 과다	fast-math, approx 함수
TC 낮음	shape/dtype 재검토 ↗ V06

1 Occupancy 정의 ★

Occupancy = active_warps_per_SM / max_warps_per_SM max : Ampere 64 / Hopper 64 warp per SM

occupancy = latency hiding 여력의 상한
성능 ≠ occupancy — 충분조건 아님

2 Theoretical vs Achieved

종류	의미
Theoretical	HW 자원 한계로 정해지는 상한
Achieved	실행 중 실제 평균

Achieved < Theoretical → tail effect, sync imbalance

3 Theoretical 제한 요인 ★

자원	제약
Register / thread	reg · threads · warp32 ≤ 256KB/SM
Shared Mem / block	blocks · smem/block ≤ SMEM/SM
Threads / block	blocks · threads ≤ max_threads
Blocks / SM	≤ 16 (Ampere) / 32 (Hopper)

상수 ↗ V02 §2·§3

4 제한 요인 식별

셋 중 가장 작은 한계가 occupancy를 결정
__launch_bounds__ 로 reg 상한 힌트
NCU Occupancy Section 에 Limiter 명시됨

5 Occupancy ↑ ≠ 성능 ↑ ★

핵심 오해: occupancy 100% 추구하다가 reg 부족 → spill → local memory (HBM) traffic 증가 → 성능 악화.

compute-bound kernel 은 occupancy 50%로도 충분
memory-bound kernel 은 occupancy ↑ 도움되는 편
sweet spot 은 kernel마다 다름

6 Achieved < Theoretical 원인

tail: grid 크기가 SM 수 배수 아님
work imbalance: 일부 block 조기 종료
barrier: block 내 sync imbalance

7 판독 절차

Theoretical limiter 확인 (reg / smem / block)
Achieved 와 gap 확인
gap이 크면 tail/barrier 의심
Eligible Warps 와 교차 검토 (§8)

Occupancy 3한: R·S·B (Register · SMEM · Block count)

1 Source View 목적 ★

의미 Section 지표를 per-line으로 분해. CUDA C/PTX/SASS 소스 각 줄별 sampled warp state와 instruction 카운트를 매핑.

Nsight Compute · Source Counters Section

2 3계층 표시

계층	표시
CUDA C	원본 라인
PTX	가상 ISA (컴파일 중간)
SASS	실제 실행 명령 ↗ V04 §12

C ↔ PTX ↔ SASS 대응은 컴파일러가 주석으로 제공 — 1:1 아닐 수 있음

3 Per-line Metric 종류

metric	의미
Sampled Warp State	이 라인에서 관측된 stall reason 분포
Instructions Executed	누적 실행 횟수
Predicated-Off	predicate false로 무효 처리
Live Registers	이 지점 live reg 개수
Branch Taken / NotTaken	분기 통계

4 Sampled Warp State ★

의미 매 sample cycle 해당 PC에서 관측된 warp의 stall reason. Warp State Statistics Section을 PC별로 분해한 것.

특정 라인에 Long Scoreboard 집중 → 그 라인의 load가 HBM 대기
특정 라인에 Barrier 집중 → 그 앞의 sync 문제

5 Live Register Count

의미 해당 PC에서 동시 live인 register 수. max live 가 occupancy 제한 요인이 됨.

function body 중간에 spike 있으면 전체 reg/thread 상승
spill 방지 힌트로 __launch_bounds__

6 Stall Attribution 순서 ★

Kernel 전체 Warp State 에서 top reason 확인
Source View 에서 그 reason 이 집중된 라인 식별
해당 라인 앞의 긴 latency 원인 추적 (load, sync, SFU)
범주적 해법 (§7 표) 에 맵핑

7 한계

sampling 기반 — 드물게 실행되는 라인은 통계적 유의성 낮음. 또 compiler 최적화로 C 라인과 SASS 매핑이 흩어질 수 있음.

1 Nsys vs NCU ★

	Nsight Systems	Nsight Compute
scope	system-wide timeline	single kernel deep-dive
단위	μs~s	cycle
질문	"언제·어디서"	"왜 느린가"
대상	CPU+GPU 협업	GPU kernel 내부

2 Trace 구성요소

CUDA API — cudaMalloc, Memcpy, launch
Kernel — 실제 GPU 실행 span
Memcpy — H2D / D2H / D2D
NCCL / NVLink — comm span ↗ V15
NVTX range — 사용자 정의 span
CPU stack — host thread sampling

3 Timeline Lane 구조

CPU thread 0 |  cudaMemcpyAsync | launch | sync |
CUDA HWQ     |  H2D            | kernelA | kernelB |
NVLink       |                 | allreduce       |
NVTX         |  [forward               ][backward]

lane별 독립 타임라인 — 겹침(overlap) 확인에 사용

4 Overlap 판정 ★

패턴	해석
kernel + Memcpy 겹침	stream 병렬 활용 OK
compute + comm 겹침	comm hiding 성공
kernel 직후 긴 host gap	launch overhead / sync 대기
GPU idle span	host 병목

5 NVTX Range 용도

사용자가 nvtxRangePush/Pop 으로 구간 태깅
의미 단위 (step, layer, forward) 를 lane에 표시
kernel 집합을 semantic하게 묶음

6 판독 질문 5종

GPU idle 구간이 있는가
H2D / D2H 가 compute 와 겹치는가
allreduce 가 backward 와 겹치는가
kernel 간 gap은 normal 한가
CPU thread bottleneck이 있는가

7 분산 / Multi-GPU 확장

확장 multi-rank trace 는 rank별 lane이 추가됨. NCCL coll 구간이 모든 rank에서 정렬되는지 확인.

NCCL 통신 상세 ↗ V15 §5·§6

1 Top-down Decision Tree ★

Start
 │
 ├─ SOL Mem > SOL SM (큰 차이)? ── Y → [Memory Bound]
 │       N
 ├─ SOL SM > SOL Mem (큰 차이)? ── Y → [Compute Bound]
 │       N
 ├─ 둘 다 < 60%? ─────────────── Y → [Latency Bound]
 │       N
 ├─ Warp Divergence 크다? ────── Y → [Divergence Bound]
 │       N
 └─ Eligible < 1? ────────────── Y → [Occupancy Bound]
         N → Balanced (목표)

2 우선순위 원칙 ★

최상위 bottleneck 먼저 — 나머지 개선은 mask 됨
SOL 차이 크기 순서로 attack
범주적 해법 → 측정 → 재판독 루프 단, 측정·재판독은 코드 영역

3 Bound Type × Metric × 원인 매트릭스 ★★

bound type	우선 검사 metric	가능한 원인 범주
Memory Bound	SOL Memory, DRAM throughput, L2/L1 hit rate, sectors/req, Long Scoreboard %	reuse 부족, uncoalesced access, hit rate 낮음, working set > L2, prefetch 없음
Compute Bound	SOL SM, pipe utilization (FMA/TC/SFU), Wait stall, Math pipe %	peak 도달 (목표), 또는 특정 pipe 편중 (SFU 초월함수, ALU indexing)
Latency Bound	Eligible Warps / cycle, Issue Slot Util, Wait stall, Active Warps	warp 수 부족, ILP 부족, dependency chain, register spill
Divergence Bound	Warp Execution Efficiency, Branch Efficiency, SM active cycles	data-dependent branch, uneven work, lane 별 다른 path
Occupancy Bound	Achieved Occupancy, Theoretical limiter (reg/smem/block), Active Warps	reg/thread 과다 (spill 가능), smem/block 과다, block 수 부족
Sync / Barrier Bound	Barrier stall %, Drain stall %, Achieved vs Theoretical gap	__syncthreads 과다, work imbalance, tail effect
SMEM Bound	Short Scoreboard %, SMEM throughput, wavefronts/request (bank conflict)	bank conflict, swizzle 부재, ldmatrix 누락, SMEM 용량 과다

범주적 원인만 · 구체 tuning trick / benchmark 수치 / UI 조작은 out-of-scope

4 진단 루프 요약

Nsys trace → kernel 식별 언제
NCU SOL → bound type 무엇
Section (Mem/Comp/Sched/Warp) → 원인 범주 왜
Source View → 라인 attribution 어디
범주적 해법 선정 → 구현 (실측 루프는 코드)

5 Cross-ref 지도

bound	관련 권
HW 구조 근거	↗ V02
GEMM tuning frame	↗ V06
FA tuning frame	↗ V07
Distributed overlap	↗ V15
Inference latency	↗ V16

6 니모닉 총정리 (!)

판독 3Q: RWO (Roofline · Warp stall · Occ vs Throughput)

Section 6: SMCSWS (SOL · Memory · Compute · Scheduler · Warp · Source)

Bound 5: MCLDO (Memory · Compute · Latency · Divergence · Occupancy)

단권화 범위: 이 책은 "메트릭의 의미"만. 실제 튜닝 의사결정은 측정-수정-재측정 루프이므로 코드·실습에서만 체화된다.

Roofline & Profiling 단권화

목차

범례

1 Roofline Equation 성능 상한 모델 P = min(peak, AI·BW)

2 Arithmetic Intensity 정의 ★

3 Ridge Point 분기점 AI* = peak/BW

4 Log-Log Plot Template ★

5 해석 규칙 3원칙

6 Roof 종류 dtype별

7 한계 모델이 잡지 않는 것

1 왜 Multi-tier인가 ★

2 계층별 Bandwidth 랭킹 Ampere 기준

3 Hierarchical Plot ★

4 Level별 AI 정의

5 판독 규칙 어느 level이 병목

6 GEMM 단계별 이동

7 Multi-tier의 함정

1 GEMM FLOPs/Bytes ★

2 GEMM AI (이상적 단일 패스)

3 GEMM 형상별 AI 표

4 Tall-Skinny / Short-Fat

5 FA Kernel AI ★

6 Elementwise / Softmax AI

7 AI 개선 경로 일반 규칙

1 NCU Section 계층 ★

2 판독 순서 top-down SOL→Mem→Comp→Sched→Warp→Src

3 Metric 명명 규칙

4 Metric Suffix 측정 방식

5 Section ↔ 해석 질문 매핑

6 Counter 종류

7 해석 원칙

1 SOL 개념 ★

2 SOL 3대 지표

3 해석 규칙 ★

4 SOL SM 세부

5 SOL Memory 세부

6 SOL TEX / L1

7 SOL 해석 함정 (!) 니모닉

1 Memory Chart 구조 ★

2 DRAM Throughput

3 Cache Hit Rate ★

4 Sector / Request

5 Shared Memory Load/Store

6 Bank Conflict

7 L2 Hit 전략 범주

1 Stall Reason 개념 ★

2 판독 원칙

3 Stall Reason 8종 정의 매트릭스 ★★

4 Stall 그룹핑 bound 추론

1 Warp Scheduler 요약 ★

2 핵심 3지표

3 상태 전이

4 Issue Slot Utilization

5 Eligible Warps 판독 ★

6 Latency Hiding 조건

7 Scheduler ↔ Occupancy

8 함정

1 Pipe 구조 ★

2 주요 Pipe 정의 표

3 해석 규칙 ★

4 Tensor Pipe Active %

5 Pipe 합산의 함정

6 Tensor vs non-Tensor

7 Pipe 별 개선 범주

1 Occupancy 정의 ★

2 Theoretical vs Achieved

3 Theoretical 제한 요인 ★

4 제한 요인 식별

5 Occupancy ↑ ≠ 성능 ↑ ★

6 Achieved < Theoretical 원인

7 판독 절차

1 Source View 목적 ★

2 3계층 표시

3 Per-line Metric 종류

4 Sampled Warp State ★

5 Live Register Count

6 Stall Attribution 순서 ★

7 한계

1 Nsys vs NCU ★

2 Trace 구성요소