CUDA 18-VOL SERIES · V01 · CONTENT-FIRST · A4 LANDSCAPE · 18p

PMPP 기초 재정리

Programming Model · Indexing · Warp · Memory Hierarchy · Parallel Patterns

Volume 01 / 18

Tier T1 HW & ISA

선행 없음

용도 18권 전체의 geometry — 이후 V02+ 의 모든 용어 좌표계

범례

핵심 용어 (노랑 배경)

매우 중요 · 표 헤더

정의 · 공식 박스

예시 · 워크드 박스

빨강주의 · 실수하기 쉬움

★시험 · 실무 핵심 (페이지당 ≤3)

(!)니모닉 (권당 ≤5)

↗다른 권 cross-ref

→인과 · 흐름

∵∴이유 · 결론

Out of Scope Nsight 실측 ↗ V18 · 튜닝 의사결정 ↗ V18 · 세대별 트릭 ↗ V02

Source: PMPP 4e (Hwu, Kirk, Hajj) Ch 1–14 · A100/H100 기준 · Ctrl+P A4 landscape

1 Heterogeneous Parallel Computing 정의

정의 Heterogeneous = CPU (latency-oriented) + GPU (throughput-oriented) → sequential은 CPU, data-parallel은 GPU로 역할 분담.

2003 이후 clock scaling 종료 (열·전력 벽) → multi-core / many-thread 전환
Multi-core: sequential 성능 유지 · ILP · OoO · large cache
Many-thread: throughput 극대화 · massive thread · relaxed memory
von Neumann 모델 · thread = 순차 실행 단위 (독립 PC · register)

2 CPU vs GPU 설계 축

축	CPU (latency)	GPU (throughput)
목표	단일 thread 지연 최소	총 FLOPs 최대
ALU	소수 · 대형 · 저지연	다수 · 소형 · 장지연 허용
캐시	L1/L2/L3 대용량	작음 · BW 버퍼
제어	branch predict · OoO	in-order · SIMT mask
메모리	강한 일관성	relaxed · ~10× BW
쓰레드	수~수십	수만

경제학: latency 반감 비용 ≫ throughput 두 배 비용. ∴ 대규모 병렬 워크로드는 GPU가 유일해.

3 Amdahl's Law ★ 1/((1-p)+p/s)

Speedup(p, s) = 1 / ((1 − p) + p/s)
lim_s→∞ Speedup = 1 / (1 − p) p : 병렬 가능 fraction ∈ [0,1] · s : 병렬부 speedup · (1−p) : sequential bottleneck

p=0.30, s=100 → 1/(0.70 + 0.003) ≈ 1.42× · 상한 1.43
p=0.90, s=100 → 1/(0.10 + 0.009) ≈ 9.2× · 상한 10
p=0.99, s=100 → 1/(0.01 + 0.0099) ≈ 50× · 상한 100
p=0.999, s=∞ → 상한 1000×

∴ p 를 99%+ 로 끌어올리는 것이 s 키우기보다 훨씬 중요.

4 Gustafson 보정 weak scaling

Speedup_G(p, N) = (1 − p) + p · N 문제 크기가 N에 따라 선형 증가 → sequential fraction이 상대적으로 작아져 확장성 ↑

Amdahl = strong scaling (고정 문제, rank ↑)
Gustafson = weak scaling (rank 당 일감 고정, rank ↑ → 총 문제 ↑)
↗ V15 §16 strong/weak scaling · α-β 모델 상세

5 A100 vs CPU 정량 baseline ★

지표	A100 SXM4	24-core CPU
FP64 peak	9.7 TF	~0.33 TF
FP32 peak	19.5 TF	~0.66 TF
TF32 TC	156 TF	—
FP16 TC	312 TF	—
BW	2.0 TB/s	~0.2 TB/s
Thread	~221K (108 SM × 2048)	48 HT

Source: NVIDIA A100 Datasheet v1.3 · 출처 고정값만 표기 · 실측 아님

6 병렬 프로그래밍 4대 난제 알메인동

Algorithmic complexity: 병렬화로 work 증가 가능 → work-efficient 확보
Memory-bound: 대부분 kernel이 BW 제약 → tiling 필수 ↗ §7
Input-sensitive: skewed 분포 → load imbalance
Synchronization: barrier · atomic · 결합성

4대 난제: 알메인동 (알고리즘 · 메모리 · 입력편향 · 동기화)

7 Programming Interface 지형

모델	범위	특징
CUDA	single node	explicit control · C/C++ ext
OpenCL	cross-vendor	CUDA와 유사 · JIT
OpenMP	shared-mem	pragma · GPU offload
MPI	cluster	message passing ↗ §16

1 Data Parallelism SPMD

정의 데이터의 독립 partition에 동일 연산 적용 → SPMD (Single Program Multiple Data). task parallelism과 구분 (서로 다른 task 동시).

예: vector add · 픽셀 변환 · GEMM · FFT
GPU가 빛나는 축: 각 thread 독립 · 동일 코드

2 CUDA 프로그램 5단계 할복실복해

할당 cudaMalloc(&d_A, n*sizeof(float))
복사 H→D cudaMemcpy(d_A, h_A, n, H2D)
실행 kernel<<<G, B>>>(d_A, ...)
복사 D→H cudaMemcpy(h_A, d_A, n, D2H)
해제 cudaFree(d_A)

5단계: 할복실복해 (할당·복사·실행·복사·해제)

3 함수 한정자 (Qualifiers)

한정자	호출측	실행측	비고
`__host__`	host	host	기본값
`__global__`	host (+dyn)	device	반드시 `void`
`__device__`	device	device	inline 기본
`__host__ __device__`	양쪽	양쪽	공통 유틸

4 Vector Add ★ canonical kernel

__global__ void vecAdd(const float* A,
                       const float* B,
                       float* C, int n){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) C[i] = A[i] + B[i];
}
// host launch
int T = 256;
int G = (n + T - 1) / T;   // ceil div
vecAdd<<<G, T>>>(d_A, d_B, d_C, n);

if (i < n) boundary check 필수. n이 T 배수 아닐 때 OOB 발생.

5 Kernel 실행 구성 <<<G, B, shmem, stream>>>

grid = block 집합 (1D/2D/3D)
block = thread 집합 (≤1024 thread, 3D)
block 간 독립 실행 가정 → 동기화 불가 (transparent scalability)
block 내부 → __syncthreads(), shared memory OK
optional: dynamic shmem 크기 · CUDA stream 지정

grid[ block(0,0) block(1,0) ... ]
 └─ block[ t(0,0) t(1,0) ... t(31,0)  ← warp 0
            t(0,1) ...  t(31,1)       ← warp 1 ]

6 내장 변수 (Built-in)

변수	의미	최대
`gridDim`	grid 내 block 수	2³¹-1, 65535, 65535
`blockDim`	block 내 thread 수	1024, 1024, 64 (곱 ≤ 1024)
`blockIdx`	block의 grid 좌표	—
`threadIdx`	thread의 block 좌표	—

7 Error Handling 패턴

#define CUDA_CHECK(x) do{ \
  cudaError_t e = (x); \
  if (e != cudaSuccess){ \
    fprintf(stderr, "CUDA %s:%d: %s\n", \
      __FILE__,__LINE__, cudaGetErrorString(e)); \
    exit(1); \
  } \
} while(0)

CUDA_CHECK(cudaMalloc(&d, n));
kernel<<<G,B>>>(d);
CUDA_CHECK(cudaGetLastError());    // launch
CUDA_CHECK(cudaDeviceSynchronize()); // async

kernel 호출 자체는 async — sync 후에야 kernel 내부 오류 확정.

8 Compilation Flow .cu→PTX→SASS

foo.cu
 └─ nvcc frontend
     ├─ host C++ → g++/clang → host object
     └─ device code
         └─ cicc → NVVM IR → libNVVM → PTX (.ptx)
             └─ ptxas → SASS (cubin, sm_80/sm_90)
                 └─ fatbin → host object 링크

PTX = forward-compatible virtual ISA ↗ V03 §1
SASS = architecture-specific 실제 opcode ↗ V04 §12
JIT: 런타임에 driver가 PTX → 신 SASS 생성 가능
-arch=sm_80 Ampere · -arch=sm_90 Hopper

9 Memory Management 보조 API

API	용도
`cudaMallocHost`	pinned host · H↔D BW ↑
`cudaMallocManaged`	Unified Memory · page fault migrate
`cudaMemcpyAsync`	stream 기반 async copy
`cudaMemPrefetchAsync`	UM page를 GPU로 pull
`cudaMemset`	device 메모리 초기화

흔한 실수 3종: (a) host 포인터 kernel에 전달 → segfault · (b) memcpy 방향 플래그 혼동 · (c) launch 직후 kernel 오류 무시 (cudaGetLastError 누락).

1 왜 Multi-D ?

이미지 · 행렬 · volume · tensor = 본질적으로 다차원
threadIdx·blockIdx 모두 (x, y, z) → 자연스런 mapping
index 계산 오류 ↓ · 가독성 ↑

dim3 block(16, 16);                        // 256 thread
dim3 grid((W+15)/16, (H+15)/16);     // ceil
kernel<<<grid, block>>>(...);

2 1D Indexing ★

i = blockIdx.x · blockDim.x + threadIdx.x
stride = gridDim.x · blockDim.x grid-stride loop: for(i; i<N; i+=stride) — fixed grid로 큰 N 처리

for(int i = blockIdx.x*blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x*blockDim.x){
  out[i] = f(in[i]);
}

3 2D Indexing ★ row·col

row = blockIdx.y·blockDim.y + threadIdx.y
col = blockIdx.x·blockDim.x + threadIdx.x
idx = row · W + col (row-major)

혼동 주의: .x = col (너비축), .y = row (높이축). 반대로 쓰면 coalescing 붕괴.

4 3D Indexing volume

x = bx·Bx + tx
y = by·By + ty
z = bz·Bz + tz
idx = z · (H · W) + y · W + x

dim3 blk(8, 8, 8);          // 512 thread
dim3 grd((W+7)/8,
         (H+7)/8,
         (D+7)/8);

MRI · CT · fluid sim · 3D stencil에 쓰임.

5 Row-major Linearization

M[r][c] = M_flat[r · W + c]
V[d][r][c] = V_flat[d·(H·W) + r·W + c]
T[n][c][h][w] = T_flat[n·(C·H·W) + c·(H·W) + h·W + w] C/CUDA = row-major · Fortran/MATLAB = col-major

PyTorch 기본 = row-major (C contiguous)
stride 벡터로 일반화 ↗ V06 §8 CuTe Layout

6 2D Kernel Skeleton ★

__global__ void rgb2gray(uchar* out,
                         const uchar* in,
                         int W, int H){
  int c = blockIdx.x*blockDim.x + threadIdx.x;
  int r = blockIdx.y*blockDim.y + threadIdx.y;
  if (r < H && c < W){
    int i = r*W + c;
    const uchar* p = &in[3*i];
    out[i] = 0.21f*p[0]
           + 0.72f*p[1]
           + 0.07f*p[2];
  }
}

7 Naive Matrix Multiplication

__global__ void matMul(float* C,
                        const float* A,
                        const float* B, int N){
  int r = blockIdx.y*blockDim.y + threadIdx.y;
  int c = blockIdx.x*blockDim.x + threadIdx.x;
  if (r < N && c < N){
    float s = 0.0f;
    for (int k = 0; k < N; ++k)
      s += A[r*N + k] * B[k*N + c];
    C[r*N + c] = s;
  }
}

CGMA = 2 FLOP / 8 B = 0.25 FLOP/B — memory-bound. Tiling 필요 ↗ §7.

8 Block Size 경험칙 32의 배수

thread 수 = 32의 배수 (warp 단위)
128 / 256 / 512 에서 탐색 시작
16×16 = 256 (warp 8) · 32×32 = 1024 (max)
너무 작음 → occupancy ↓ · 너무 큼 → register 부족
__launch_bounds__(maxT, minB)로 컴파일러 힌트

9 Boundary Pattern

grid 계산 = ⌈N/B⌉ 올림 division
kernel 진입 직후 if (r<H && c<W) return;
padding 로드 시 0 또는 neutral element

W=1920, H=1080, block=16×16 → grid = 120 × 68 = 8,160 blocks, 총 thread = 2,088,960 (≈ 2.09 M).

1 실행 5-level ★

Grid
 └─ (Hopper+) Cluster        ← cluster.sync · DSMEM
     └─ Block (CTA)          ← __syncthreads
         └─ Warp = 32 thread ← __syncwarp · shfl
             └─ Thread       ← register · PC

Cluster는 sm_90+ 전용 ↗ V02 §10
Block 간 독립 가정 → transparent scalability
Warp = HW scheduling 단위 (scalar thread 아님)

2 SM 내부 구조 A100 예

SM (108개)
 ├─ 4 processing block (partition)
 │   ├─ warp scheduler × 1
 │   ├─ INT32 unit × 16
 │   ├─ FP32 unit × 16
 │   ├─ FP64 unit × 8
 │   ├─ Tensor Core (3rd gen) × 1
 │   ├─ LD/ST unit × 8
 │   └─ SFU × 4     (sin/exp/rsqrt)
 ├─ Register File (64K × 32b = 256KB)
 ├─ L1 / Smem 통합 (192 KB, carveable)
 ├─ L1 instruction cache
 └─ Constant cache (read-only)

Source: NVIDIA A100 GPU Architecture Whitepaper v1.0. Hopper 내부 상세 ↗ V02 §1.

3 SM Count 비교

GPU	SM	FP32 core	TC
V100 (sm_70)	80	5120	640 (1st)
A100 (sm_80)	108	6912	432 (3rd)
H100 (sm_90)	132	16896	528 (4th)

4 Warp ★ 정의 32 thread · SIMT

정의 Warp = 32 thread 한 묶음. SM 이 lock-step으로 실행. thread 들은 같은 PC · 서로 다른 data → SIMT (Single Instruction Multiple Thread).

block → warp 자동 분할, x축 우선으로 32씩
blockDim이 32 배수가 아니면 마지막 warp가 partial (비활성 lane)
Independent Thread Scheduling (Volta+) → thread별 PC 가능 → __syncwarp() 필수

5 SIMT vs SIMD

항목	SIMD	SIMT
레인	고정 vector width	thread ID로 주소·분기
분기	명시적 mask 코드	HW 자동 mask / serialize
메모리	aligned gather	lane별 독립 주소
모델	data-parallel	SPMD
예	AVX-512 · ARM SVE	NVIDIA warp · AMD wavefront

6 Warp Scheduler State E·I·S

Eligible: 의존성 해소, 실행 준비 완료
Issued: 이번 cycle에 dispatch 됨
Stalled: barrier / long-latency / scoreboard 대기

SM 매 cycle = pick one eligible warp per scheduler
A100 SM = 4 scheduler → 최대 4 warp / cycle issue stall reason 분류 ↗ V18 §7

7 Thread Hierarchy 매트릭스

단위	통신	동기	한계
Thread	register	—	255 reg
Warp	shfl · vote	`__syncwarp`	32 thread
Block	smem	`__syncthreads`	1024 thread · 48+ KB smem
Cluster	DSMEM	`cluster.sync`	≤16 block (sm_90+)
Grid	global	kernel 경계	2³¹ block

8 Transparent Scalability ★

원리 block 간 실행 순서·동시성을 가정하지 않는다 → 동일 바이너리가 SM 수 더 많은 미래 GPU 에서 자동으로 빨라짐.

∴ block 간 동기화 API가 의도적으로 부재
grid-level 동기는 kernel 분할로 달성
또는 cooperative groups grid_group::sync() (persistent kernel ↗ V05 §9)

9 Warp Formation 규칙

block (Bx, By, Bz) → 1D linearize t = tx + ty·Bx + tz·Bx·By
warp id = t / 32 · lane id = t % 32
같은 warp ⟺ t / 32 동일
∴ 16×16 block → warp 0 = (0..15, 0), warp 1 = (0..15, 1) 등

Compute 5핵심: SWDLT (SM · Warp · Divergence · Latency hiding · Transparent scale)

1 Control Divergence ★ uniform warp

정의 같은 warp 내 thread 가 서로 다른 경로로 분기 → path 별 순차 실행 + 반대 쪽 thread mask → throughput ÷ path 수.

if (tid % 2 == 0)    // ❌ warp ½ mask
  even_work();
else
  odd_work();             // 두 경로 직렬

해법: 분기 조건을 warp 경계 에 맞춰라. if (tid / 32 ...) 처럼 warp 전체가 같은 path.

2 Divergence 비용 공식

T_div = Σ_path T_path(active_path)
vs T_uniform = max path 분기 수 = path 수 · 비용은 path 길이 합

단, 모든 thread 가 skip 하면 whole-warp skip (cost 0)
early-return 이 mid-loop break 보다 저렴 (compiler 가 predicate)

3 분기 회피 패턴

조건 산술화: y = c ? a : b → y = a*c + b*(1-c)
boundary check는 가장 바깥 block 에서만
warp 별 uniform branch 유지 (tid/32 기준)
predicate instruction: compiler가 짧은 분기를 @p instr 로 변환

4 Synchronization 3종

API	scope	용도
`__syncthreads()`	block	smem 가시성
`__syncwarp(mask)`	warp	ITS 후 필수
`__threadfence()`	device	global 순서 강제
`__threadfence_block()`	block	smem 순서 강제
`cluster.sync()`	cluster	sm_90+

__syncthreads() 를 if 안쪽 에 두면 일부 thread 만 도달 → deadlock. 조건을 uniform 하게 만든 뒤 호출.

5 Latency Hiding ★

원리 한 warp가 long-latency op (DRAM load) 대기 → scheduler 가 다른 eligible warp 실행. 충분한 warp → latency 은폐.

warps_needed ≥ latency · issue_rate / ops_per_warp
Little's Law: in-flight bytes = BW × latency A100: 2.0 TB/s × 400 cyc @ 1.4 GHz ≈ 572 KB in-flight · ILP · MLP 와 결합

6 Occupancy 정의 ★ active / max

Occupancy = warps_active / warps_max/SM
A100: max_warps/SM = 64 (= 2048 thread / 32)

block=256 (8 warp) · SM당 4 block → 32 active warp
Occupancy = 32/64 = 0.50 (50%)

7 Occupancy 제약 4요소 B·T·R·S

자원	A100 /SM	H100 /SM	영향
Blocks	32	32	block 수 상한
Threads	2048	2048	warp 수 상한
Regs	65,536	65,536	thread당 reg
Smem	164 KB	228 KB	block당 shared

warps_limit = min(
max_warps,
max_blocks · warps/block,
reg_SM / (reg_thr·32),
smem_SM / smem_block)

Occupancy 4제약: BTRS (Block · Thread · Register · Shared)

8 Occupancy ≠ Performance

높은 occupancy = latency hiding 여력
너무 높이면 register spill → local memory (DRAM) → 오히려 ↓
GEMM · attention 은 reg 많이 씀 → 25~50% 가 정상
ILP / MLP 로 보완 가능 (한 thread 가 여러 독립 op 동시 in-flight)

목표는 충분한 occupancy (50%+) , 100% 아님.

9 `__launch_bounds__`

__global__
__launch_bounds__(256, 4)   // T, minBlocks/SM
void k(...){ ... }

컴파일러에게 "이 block 크기와 최소 occupancy 맞춰라" 지시
reg 사용을 제한해 지정 occupancy 달성
-Xptxas -v 로 실제 reg/smem 확인

1 메모리 계층 정량표 ★★ 전상지공레텍

종류	scope	latency	크기	선언
Register	thread	1 cyc	255/thr	auto
Shared / L1	block	~30 cyc	164/228 KB	`__shared__`
DSMEM	cluster	~50 cyc	≤16 SM	sm_90+
Constant	device RO	~L1 cache	64 KB	`__constant__`
L2	device	~200 cyc	40/50 MB	auto
Texture	grid RO	spatial cache	—	tex API
Local	thread	~DRAM	stack spill	auto
Global (HBM)	grid · host	~400 cyc	40/80 GB	`__device__`

Source: A100 / H100 Whitepaper. latency는 근사치, 세대별 실측 ↗ V02 §14.

2 Bandwidth 비교

level	A100 BW	H100 BW
Reg file /SM	—	—
Smem /SM	~19 TB/s	~33 TB/s
L2	~4 TB/s	~5.5 TB/s
HBM	2.0 TB/s	3.35 TB/s
NVLink	600 GB/s	900 GB/s
PCIe Gen4 x16	64 GB/s	—
PCIe Gen5 x16	—	128 GB/s

3 A100 vs H100 정량 비교 ★★ 10+ metric

지표	A100 SXM4 80GB	H100 SXM5 80GB
Compute Cap	8.0	9.0
SM 수	108	132
FP32 core	6,912	16,896
Tensor Core	432 (3rd)	528 (4th)
Register / SM	256 KB	256 KB
Smem max / SM	164 KB	228 KB
L2 cache	40 MB	50 MB
HBM	80 GB HBM2e	80 GB HBM3
Mem BW	2.0 TB/s	3.35 TB/s
FP64 peak	9.7 TF	34 TF
FP32 peak	19.5 TF	67 TF
TF32 TC	156 TF	495 TF
FP16 / BF16 TC	312 TF	989 TF
FP8 TC	—	1,979 TF
NVLink	600 GB/s	900 GB/s
TDP	400 W	700 W

세대별 신기능 표 ↗ V02 §7, §8.

4 Local Memory ⚠ 이름이 속임수

Local memory 는 물리적으로 HBM. "지역" 은 scope 만 의미. register spill · 큰 자동 배열 → local 로 떨어짐 → global latency.

compiler 가 register 부족 시 자동 allocate
-Xptxas -v 로 "X bytes stack frame" 확인
해법: loop fuse, 배열 크기 compile-time 상수화

5 PTX Storage Class 매핑

CUDA C	PTX	backing
auto var	`.reg`	register file
spill	`.local`	HBM (thread)
`__shared__`	`.shared`	SRAM (block)
`__constant__`	`.const`	const cache
`__device__`	`.global`	HBM
tex / surf	`.tex`	read-only L1
kernel arg	`.param`	const bank

PTX state space 상세 ↗ V03 §3.

6 Compute / Memory Ratio ★ CGMA · AI

CGMA = FLOPs / Bytes_HBM
ridge point = peak_FLOPS / peak_BW

A100: 156 TF / 2.0 TB/s = 78 FLOP/B (TF32)
H100: 989 TF / 3.35 TB/s = 295 FLOP/B (FP16)
AI < ridge → memory-bound · AI ≥ ridge → compute-bound

Roofline 상세 공식과 multi-tier 차트 ↗ V18 §1~3.

1 Tiling 핵심 ★★

정의 큰 행렬을 TILE × TILE 조각으로 분할 → shared memory 에 한 번 로드 후 block 내부 thread 들이 재사용 → DRAM 방문 횟수 ÷ TILE 배 감소.

Load phase: A 타일 · B 타일 → smem (협력 로드)
__syncthreads()
Compute phase: smem 에서 내적 부분합 누적
__syncthreads() · 다음 phase

Tiling 4step: 로싱계싱 (로드·sync·계산·sync)

2 Reuse Factor 분석

naive: DRAM loads/thr = 2N · 8B (A, B)
tiled: DRAM loads/thr = 2N/TILE · 8B
reuse = TILE
CGMA(tile) = 2·TILE FLOP / (2·8B) = TILE/8 FLOP/B TILE=32 → CGMA 4 FLOP/B · BW 요구 1/32

N=1024, TILE=32 · thread당 DRAM load: naive 2048 → tiled 64. compute 동일 → AI × 32 개선.

3 Tile Size Trade-off

TILE	smem/block	CGMA	Occupancy
8	0.5 KB	1 F/B	높음
16	2 KB	2 F/B	좋음
32	8 KB	4 F/B	감소
64	32 KB	8 F/B	낮음

4 Tiled Matmul ★★ shared mem full listing

#define TILE 32

__global__ void matMulTiled(
    float* C, const float* A,
    const float* B, int N){

  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];

  int bx = blockIdx.x, by = blockIdx.y;
  int tx = threadIdx.x, ty = threadIdx.y;

  int row = by*TILE + ty;
  int col = bx*TILE + tx;
  float acc = 0.0f;

  // phase 순회 (K 방향)
  for(int ph = 0; ph < (N+TILE-1)/TILE; ++ph){

    // ── 1. 협력 로드 ──
    int aCol = ph*TILE + tx;
    int bRow = ph*TILE + ty;
    As[ty][tx] = (row < N && aCol < N)
               ? A[row*N + aCol] : 0.0f;
    Bs[ty][tx] = (bRow < N && col < N)
               ? B[bRow*N + col] : 0.0f;
    __syncthreads();

    // ── 2. 부분 내적 ──
    #pragma unroll
    for(int k = 0; k < TILE; ++k)
      acc += As[ty][k] * Bs[k][tx];

    __syncthreads();
  }

  if(row < N && col < N)
    C[row*N + col] = acc;
}

5 Double Buffering 개념

개념 두 smem 버퍼 A/B 번갈아 — A 계산 중 B 로드 → compute · memory 병렬 (pipeline).

iter i   : load → B      compute ← A
iter i+1 : load → A      compute ← B
iter i+2 : load → B      compute ← A

smem 2배 필요 (occupancy ↓ 가능)
Ampere cp.async 로 자동화 ↗ V03 §6
Hopper TMA + mbarrier ↗ V04 §4

6 Thread Coarsening

정의 한 thread 가 출력 여러 개 (예 2×2) 담당 → smem 재사용 ↑, 중복 로드 ↓, register 사용 ↑.

coarsening factor K → CGMA · K 배 개선
단, register 사용 · K 배 → occupancy 감소 상충 일반 matmul: K = 4~8 가 sweet spot

7 Boundary Tile

N 이 TILE 배수 아닐 때 out-of-range → 0.0f 로드 (neutral)
출력 쓰기 직전 if(r<N && c<N)
zero padding → 수치적 안전 (summation 에 0 기여)

두 번째 __syncthreads() 생략 금지. 다음 phase 의 로드가 아직 쓰이는 As/Bs 를 덮어써 race condition.

1 Coalesced Access ★★ warp → 1 trans

정의 warp 32 thread 가 정렬된 128 B segment 에 접근 → HW 가 1 transaction 으로 묶음. 흐트러지면 최대 32 trans.

✓ coalesced:  t0→a[0],  t1→a[1],  ... t31→a[31]   (128 B)
✗ strided  :  t0→a[0],  t1→a[N],  ...             (N trans)
✗ random   :  t0→a[?]                              (최대 32 trans)

2 Coalescing 조건

warp 전체가 동일 명령 (not divergent)
주소 = base + stride·lane, stride = element size
base 가 128 B 정렬 (preferred)
elements 합 = 32 · 4B = 128 B (or 256B for 8B type)

efficiency = requested_bytes / transferred_bytes
strided(stride=2): 50% · stride=32: 3.1% H/W 가 aligned 128 B 단위로만 읽기 때문

3 Row vs Col Major 접근

접근	row-major A[r][c]	결과
tid→c	A[r][tid]	✓ coalesced
tid→r	A[tid][c]	✗ strided
transpose	A[c][tid]	strided 읽기 + coalesced 쓰기

Matmul B 접근 (col-major) 이 strided 될 때 → smem transpose 로 해결.

4 AoS vs SoA

// AoS (Array of Struct) — ❌ strided per-field
struct Atom { float x, y, z, m; };
Atom A[N];
__global__ void k(Atom* A){
  int i = ...;
  float x = A[i].x;      // stride 16B · 25% eff
}

// SoA (Struct of Array) — ✓ coalesced
struct Atoms {
  float* x; float* y;
  float* z; float* m;
};
float x = A.x[i];         // stride 4B · 100% eff

LLM tensor layout 은 자연스럽게 SoA (각 tensor 가 별도 배열).

5 Shared Memory Bank 구조

정의 shared mem = 32 bank · 각 bank 는 4B stripe · 동일 bank 에 서로 다른 word 동시 접근 → N-way conflict (직렬화 N배). 같은 word → broadcast (free).

bank(addr) = (addr / 4) mod 32 Kepler+: 32 bank, 4B word · 64-bit 모드 선택 가능 (sm_80+)

6 Bank Conflict 예시

s[tid]       : bank(tid) → 모두 다른 bank ✓ OK
s[2*tid]     : bank 0,2,4,...,62 → 2-way conflict
s[32*tid]    : 모두 bank 0        → 32-way (완전 직렬)
s[tid ^ pad] : XOR swizzle        → conflict 회피

A[32][32] col 접근 = 32-way conflict
A[32][33] padding → conflict 0 (한 칸 offset)
swizzle: XOR-based 재매핑 ↗ V06 §11

7 Padding Trick

// ❌ 32-way on column read
__shared__ float A[32][32];

// ✓ no conflict · 1 col wasted
__shared__ float A[32][33];

비용: smem 3% 낭비 (32→33)
이득: 32× throughput
vectorized load (float4) 시에는 padding 전략 달라짐

8 Vectorized Load/Store

float4 load = 128 B / 32 thread (1 tx)
int4, double2 동일 효과
alignment 필수: __align__(16) 또는 reinterpret_cast<float4*>
PTX ld.global.v4.f32 → 4-way vectorized

transactions = requests / 4 (float → float4) inst 수 ↓ · issue 여력 ↑ · BW 동일

9 Latency Hiding 역산

필요 parallelism = BW × latency
A100: 2.0 TB/s × 400 ns ≈ 800 KB in-flight Little's Law · 32 B / load → 25,000 동시 load 필요

Mem 최적화 5: 병코재분레 (병렬 · 코alesce · 재사용 · 분기 · 레지스터 균형)

1 1D · 2D Convolution 정의

y[i] = Σ_k=−R..R x[i+k] · F[k+R]
Y[r][c] = Σ_i=−R..R Σ_j=−R..R X[r+i][c+j] · F[i+R][j+R] F = filter/mask · R = radius = (K−1)/2 · K = kernel size

각 출력점 = 주변 (2R+1) 개 입력의 가중합
F 는 작고 (5×5, 7×7) 전 thread 가 재사용

2 Naive 1D Convolution

__global__ void conv1d(float* y,
    const float* x, const float* F,
    int N, int R){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if(i >= N) return;
  float s = 0.0f;
  for(int k = -R; k <= R; ++k){
    int j = i + k;
    if(j >= 0 && j < N)
      s += x[j] * F[k + R];
  }
  y[i] = s;
}

CGMA: (2R+1) FLOP / ((2R+1)·4B) = 0.25 F/B — memory-bound.

3 3단계 최적화 ★ 상타할

상수 memory: F 를 __constant__ 로 (broadcast cache)
타일링: input tile 을 smem 에 로드
할로: tile + R 테두리 확장 로드 (neighbor 접근)

Conv 3: 상타할 (상수 · 타일 · 할로)

4 Constant Memory 패턴

__constant__ float F[25];    // 5×5 filter

// host
cudaMemcpyToSymbol(F, h_F, sizeof(F));

// device
__global__ void conv(...){
  ...
  s += X[...] * F[k + R];   // broadcast
  ...
}

64 KB 총량 · warp 전체가 같은 주소 읽으면 1 cycle
서로 다른 주소 → 직렬화
cached → 두 번째 접근부터 L1-cache 수준

5 Tiled + Halo 구조 ★

tile: TILE=16, R=2
┌──────── tile + halo ────────┐
│ halo ▒▒ │  inner tile  │ ▒▒ │
│      R  │   TILE       │  R │
└─────────┴──────────────┴────┘
smem 크기 = (TILE + 2R)
output 범위 = TILE
input 범위 = TILE + 2R

block thread 수 = TILE (또는 TILE+2R 협력)
halo thread 가 경계 예외 처리
경계 밖 → zero (또는 mirror/clamp/wrap)

6 Halo 로드 패턴

__shared__ float sX[TILE + 2*R];
int tx = threadIdx.x;
int gi = blockIdx.x*TILE + tx;

// 중앙 로드
sX[tx + R] = (gi < N) ? x[gi] : 0;

// 좌측 halo
if(tx < R){
  int li = gi - R;
  sX[tx] = (li >= 0) ? x[li] : 0;
}
// 우측 halo 유사
__syncthreads();

7 Boundary 처리 모드

mode	규칙	용도
zero	경계 밖 = 0	기본 · 수치 안전
clamp	가장자리 복제	edge detection
mirror	반사	영상
wrap	순환 (periodic)	FFT · physics

8 Separable Convolution

정의 2D filter F = f_row ⊗ f_col (외적 분해) → 2 개의 1D pass 로 대체. O(K²) → O(2K).

Gaussian · box · Sobel 등 분리 가능
임시 버퍼 T = x * f_row, y = T * f_col
K = 5 → 25 → 10 연산 (2.5× 절약)

9 CGMA 개선 예 2D 5×5

Naive: 25 FLOP / 25·4B = 0.25 F/B
Tiled 16×16 + halo R=2: input 20×20 = 400 · 4B = 1600B
output 16×16 = 256 thread → 25 FLOP/thr · 6.25 B/thr
CGMA = 4.0 F/B (16× 개선)

최신 GPU (Ampere+) 는 L2 hit rate 가 높아 naive + constant mem 조합으로도 양호. halo tile 은 큰 K 에서 이득 큼.

1 Stencil 정의

정의 격자 점이 자기 자신 + 이웃 고정 패턴 만 참조해 시간 step 진행. PDE · 유체 · 열 · CA.

u^t+1[i,j] = c₀·u^t[i,j] + c₁·(u^t[i±1,j] + u^t[i,j±1]) 5-point 2D · 7-point 3D · 9-point · 13-point 등

2 3D Tiled Stencil

3D volume → z-axis slab 순회
slab[z]        : smem (xy-plane, tile + halo)
slab[z-1, z+1] : register (z-neighbor)
                 → smem 절약
┌──────┐
│ halo │   reg: u[z-1], u[z], u[z+1]
│ xy   │   smem: tile(z) + halo
│ halo │
└──────┘

brick / 3.5D blocking 기법
read-read-write pattern 유지
thread coarsening (2~4개 z 담당) 효과 큼

3 안정성 조건 (PDE)

CFL: Δt ≤ Δx / max|v| (advection)
CFL: Δt ≤ Δx² / (2·D) (diffusion) 미달 시 수치적 발산 — kernel 수치 안정성 제약

double buffer: u_old ↔ u_new 교대
halo 초기화 = Dirichlet / Neumann / periodic 경계 조건

4 Histogram 정의

정의 입력 원소를 bin 으로 분류 → 각 bin 카운트 증가. output 공유 → race condition 필연 → atomic 필요.

__global__ void histNaive(
    int* h, const uchar* x, int N){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if(i < N) atomicAdd(&h[x[i]], 1);
}

skewed input (거의 동일 bin) → global atomic contention → 수십× 저하.

5 Privatization ★ 사프머

block 마다 private smem hist 소유
thread 가 smem atomic 으로 카운트 (빠름)
__syncthreads()
block 결과 → global atomic merge

contention: N threads → N/G × B/B_block (≈ 1/G) G = block 수 · 하드웨어 원자 경합 분산

6 Privatized Histogram Kernel

__global__ void histP(int* h,
    const uchar* x, int N, int B){
  __shared__ int s[256];
  for(int t = threadIdx.x; t < B; t += blockDim.x)
    s[t] = 0;
  __syncthreads();

  // 각 thread grid-stride
  for(int i = blockIdx.x*blockDim.x + threadIdx.x;
       i < N;
       i += gridDim.x*blockDim.x)
    atomicAdd(&s[x[i]], 1);
  __syncthreads();

  for(int t = threadIdx.x; t < B; t += blockDim.x)
    atomicAdd(&h[t], s[t]);
}

7 Aggregation 기법

thread local counter 유지, bin 이 같으면 누적
bin 이 바뀔 때만 atomic
image spatial locality 클수록 이득 ↑
MoE / sort-then-reduce 계열로 확장 ↗ V05 §5

int curBin = -1, count = 0;
for(i = ...; i < N; i += stride){
  int b = x[i];
  if(b == curBin) ++count;
  else{
    if(curBin >= 0)
      atomicAdd(&s[curBin], count);
    curBin = b; count = 1;
  }
}

8 Large Bin 전략

bin 수 > smem 용량 (예 65536) → 단일 block 에 못 넣음
해법 1: block 별 부분 bin 담당 (bin 범위 분할)
해법 2: 2-pass (coarse hist → fine hist)
해법 3: register privatization (적은 bin)

9 Atomic API 요약

API	용도
`atomicAdd`	+ (int/float/double)
`atomicMin/Max`	최소·최대
`atomicAnd/Or/Xor`	비트 연산
`atomicExch`	swap
`atomicCAS`	compare-and-swap

PTX scope qualifier (.cta / .gpu / .sys) ↗ V03 §11.

1 Reduction 정의

정의 배열 전체를 결합연산 (+, max, min, ⊕) 으로 단일 값으로 축소. 결합법칙 · 교환법칙 있어야 tree reduction 가능.

work = O(N), depth = O(log N) 이상: work-efficient · log-depth · bank conflict 0 · divergence 0

부동소수는 결합법칙 근사적 성립 — 순서 다르면 미세 차이. ↗ V09 §7

2 Stage 1 — Naive Divergent ★

// ❌ warp 내 절반씩 mask → divergence ↑
__shared__ float s[BS];
s[tid] = input[i]; __syncthreads();
for(int st = 1; st < blockDim.x; st *= 2){
  if(tid % (2*st) == 0)
    s[tid] += s[tid+st];
  __syncthreads();
}

% 조건 → warp 절반씩 꺼짐. 모든 warp 가 divergent 상태.

3 Stage 2 — Interleaved Addressing

// 연속 thread가 동작 — divergence ↓
for(int st = 1; st < blockDim.x; st *= 2){
  int idx = 2*st*tid;
  if(idx < blockDim.x)
    s[idx] += s[idx+st];
  __syncthreads();
}

tid 연속 → warp 앞쪽만 활성
단 bank conflict 발생 (stride 2, 4, ... 증가)

4 Stage 3 — Sequential Addressing ★

// ✓ divergence 0 + bank conflict 0
for(int st = blockDim.x/2; st > 0; st >>= 1){
  if(tid < st)
    s[tid] += s[tid+st];
  __syncthreads();
}

active thread 가 warp 의 앞쪽에 몰림 → divergence 없음
stride = 1 단위 → bank conflict 없음
thread 절반은 idle (해결 Stage 4)

5 Stage 4 — First Add on Load

// load 단계에서 2개 합산
int i = blockIdx.x*(blockDim.x*2)
      + threadIdx.x;
s[tid] = input[i] + input[i + blockDim.x];
__syncthreads();
// 이후 sequential reduction

grid 크기 ½ → block 수 반감
thread 0 번째 load 에서 ½ work 이미 수행
DRAM throughput ↑

6 Stage 5 — Unroll Last Warp

// warp (32 thread) 는 implicit sync
if(tid < 32){
  volatile float* vs = s;
  vs[tid] += vs[tid+32];
  vs[tid] += vs[tid+16];
  vs[tid] += vs[tid+8];
  vs[tid] += vs[tid+4];
  vs[tid] += vs[tid+2];
  vs[tid] += vs[tid+1];
}

Volta+ ITS: volatile 만으로 안전하지 않음 — __syncwarp() 또는 shuffle 사용.

7 Stage 6 — Warp Shuffle ★★

// smem 불필요 · 레지스터만
float v = s[tid];
if(tid < 32){
  v += __shfl_down_sync(0xffffffff, v, 16);
  v += __shfl_down_sync(0xffffffff, v, 8);
  v += __shfl_down_sync(0xffffffff, v, 4);
  v += __shfl_down_sync(0xffffffff, v, 2);
  v += __shfl_down_sync(0xffffffff, v, 1);
  if(tid == 0) out[blockIdx.x] = v;
}

lane 간 direct register exchange
smem 사용 0 · __syncthreads 불필요
__shfl_down_sync(mask, v, offset)

8 Stage 7 — Multi-block / Library

block level: warp shuffle 로 partial sum → smem
block 0 (warp 0) 이 block sum 들을 다시 reduce
또는 atomicAdd(&out[0], blockSum)
실무: cub::DeviceReduce / thrust::reduce

Cooperative Groups:
cg::reduce(tile, val, plus<>()) grid 전체 cooperative kernel ↗ V05 §2

9 7-Stage 진화 요약표

Stage	Div	Bank	Idle
1 Naive	대	—	—
2 Interleave	↓	대	—
3 Sequential	0	0	½
4 +LoadAdd	0	0	½
5 Unroll last warp	0	0	↓
6 Shuffle	0	n/a	↓
7 Multi-blk/CUB	0	n/a	0

Reduction 7: 나인시로언셔다 (나이브·인터리브·시퀀셜·로드합·언롤·셔플·다중블록)

1 Scan 정의 ★

정의 Inclusive: y[i] = x[0] ⊕ x[1] ⊕ ... ⊕ x[i]
Exclusive: y[i] = x[0] ⊕ ... ⊕ x[i−1] · y[0] = identity

x = [3, 1, 7, 0, 4, 1, 6, 3]
inc = [3, 4, 11, 11, 15, 16, 22, 25]
exc = [0, 3, 4, 11, 11, 15, 16, 22]

⊕ = 임의의 결합 연산 (+, max, *, string concat …)
stream compaction · radix sort · sparse 의 기반

2 Kogge-Stone work = N log N

__shared__ float s[BS];
s[tid] = x[i]; __syncthreads();
for(int o = 1; o < BS; o *= 2){
  float v = (tid >= o) ? s[tid - o] : 0;
  __syncthreads();
  if(tid >= o) s[tid] += v;
  __syncthreads();
}

work = N · log N, depth = log N 중복 작업 많음 · 병렬성 ↑ · 구현 단순

3 Brent-Kung ★ up·down

Up-sweep: pairwise 합을 binary tree 위로 (reduction 과 동일)
tree 맨 위 원소를 identity (= 0) 로 초기화
Down-sweep: 왼쪽·오른쪽 합 교환·누적하며 내려옴

work = 2N − 2, depth = 2 log N work-efficient · Kogge-Stone 보다 work ½ · depth 2배

4 Brent-Kung Up-Sweep

for(int st = 1; st < BS; st *= 2){
  int idx = (tid + 1) * 2 * st - 1;
  if(idx < BS)
    s[idx] += s[idx - st];
  __syncthreads();
}
// 맨 위 원소 초기화 (exclusive 의 경우)
if(tid == 0) s[BS-1] = 0;

5 Brent-Kung Down-Sweep

for(int st = BS/2; st > 0; st /= 2){
  __syncthreads();
  int idx = (tid + 1) * 2 * st - 1;
  if(idx < BS){
    float t = s[idx - st];
    s[idx - st] = s[idx];
    s[idx] += t;
  }
}

6 3-Phase Hierarchical Scan

각 block 이 local scan (shared mem)
block sum 수집 → aux[]
aux 를 scan
block i 의 모든 원소에 aux[i−1] 더함

┌ blk0 ┐┌ blk1 ┐┌ blk2 ┐
 local   local   local
   Σ0      Σ1      Σ2
   └────── scan aux ─────┘
   +0      +Σ0     +Σ0+Σ1

7 Scan 알고리즘 비교표 ★

알고리즘	work	depth	특징
Sequential	N	N	기준
Hillis-Steele	N log N	log N	단순 · 짧은 N
Kogge-Stone	N log N	log N	work ↑ · 병렬 ↑
Brent-Kung	2N	2 log N	work-efficient ★
Sklansky	(N/2)·log N	log N	divide & conquer

고급 변종 (segmented · warp-level · CUB) ↗ V05 §4.

8 Scan 응용 ★

Stream compaction: flag[i] ∈ {0,1} scan → 밀집 output index
Radix sort: bit 별 scan + scatter
CSR build: row 길이 → row_ptr (exclusive scan) ↗ §14
quicksort partition · histogram CDF
LLM MoE token permutation ↗ V08 §2

9 Warp-level Scan

// inclusive scan within warp
float v = x;
for(int o = 1; o < 32; o *= 2){
  float t = __shfl_up_sync(
             0xffffffff, v, o);
  if(lane >= o) v += t;
}

Scan 4 key: 인엑업다 (인clusive · 엑clusive · 업-sweep · 다운-sweep)

결합성 필수. 부동소수 합 순서 바뀌면 값 미세 변화 → 재현성 주의. 수치 안정성 ↗ V09 §7.

1 Parallel Merge 정의

정의 두 정렬 배열 A (|A|=m) , B (|B|=n) → 정렬된 C (|C|=m+n). 핵심: co-rank 함수로 각 thread 가 담당할 (i, j) 시작점을 binary search 로 찾음.

co-rank(k) = (i, j) such that i + j = k
and A[i−1] ≤ B[j] && B[j−1] < A[i] k = thread 담당 출력 index 범위의 시작

2 Co-rank Binary Search

int coRank(int k, const T* A, int m,
                   const T* B, int n){
  int iLow = max(0, k - n);
  int iHigh = min(k, m);
  while(1){
    int i = (iLow + iHigh)/2;
    int j = k - i;
    if(i > 0 && j < n && A[i-1] > B[j])
      iHigh = i - 1;
    else if(j > 0 && i < m && B[j-1] > A[i])
      iLow = i + 1;
    else return i;
  }
}

O(log(m+n)) per thread · total O((m+n) log(m+n)) work.

3 Merge Path

A →
│A[0] A[1] A[2] A[3]
│  ╱╱╱
B↓ ╱╱╱
B[0]  → 대각선 = merge path
B[1]
B[2]

각 대각선 위의 점 → 하나의 출력 구간
thread block 이 하나의 tile 담당
block 간 완전 독립 → grid scale

4 Sort 방식 비교

방법	복잡도	특징
Merge sort	O(N log N)	병합 기반 (§1-3)
Radix sort	O(N·k)	bit scan + scatter
Bitonic	O(N log²N)	정렬 네트워크 · 고정 size
Sample sort	O(N log N)	pivot 기반 · 분산

실무: cub::DeviceRadixSort · thrust::sort. 직접 구현은 특수 요건일 때만.

5 Radix Sort 원리 ★

LSB (또는 MSB) 부터 b-bit 씩
각 pass: 각 key 의 그 b-bit 추출
histogram (2^b bin) + scan → scatter index
key 를 output 위치에 scatter
상위 bit 로 반복

pass 수 = ⌈bits / b⌉ · 4-bit radix → 8 pass (32-bit int) 안정 정렬 필수 → 같은 bin 내 순서 유지

6 Radix Sort 핵심 Kernel

// 1 pass of b-bit radix
int key = in[i];
int digit = (key >> shift) & MASK;

// block-level hist + scan
atomicAdd(&hist[digit], 1);
__syncthreads();
blockExclusiveScan(hist);

// scatter
int dst = hist[digit]
          + blockOffset(digit);
out[dst] = key;

7 Bitonic Sort

정의 bitonic 수열 = 단조 증가 + 단조 감소 연결. log²N 단계의 고정 comparator 네트워크 로 정렬.

SIMT 친화적 (branch-free)
작은 N · power-of-2 · warp 단위 sort 에 유리
top-k selection 기반 구현

bitonic network log²N step
stage 1: compare (i, i±1)
stage 2: compare (i, i±2)
...
stage log N: compare (i, i±N/2)
각 stage = log N substep

8 Sort 선택 결정 트리

integer key · N 큼 → radix sort
float32 · stable 필요 없음 → radix (IEEE 754 비트 tweaking)
일반 compare-sort · 작은 N → bitonic / block sort
top-k · 큰 K → radix top-k ↗ V05 §6
top-k · 작은 K → bitonic top-k

9 Implementation Notes

라이브러리	특징
`cub::DeviceRadixSort`	keys / key-value · stable
`cub::BlockRadixSort`	block 내 radix
`thrust::sort`	STL-style · radix backend
`cub::DeviceMergeSort`	merge 기반 · custom compare

1 Sparse Matrix Format 비교 ★

format	구성	장단
COO	(row, col, val)	단순 · 랜덤 access ×
CSR	row_ptr, col_idx, val	표준 · SpMV · row-major
CSC	col_ptr, row_idx, val	col access 유리
ELL	고정 폭 padded	균일 row · waste
JDS	정렬 + jagged	load balance ↑
HYB	ELL + COO	혼합 sparsity

2 CSR 구조

A = [1 0 2 0]     rp=[0,2,3,5,6]
    [0 0 3 0]     ci=[0,2,2,1,3,0]
    [0 4 5 0]     val=[1,2,3,4,5,6]
    [6 0 0 0]
V=4, nnz=6
row r 의 원소 = ci, val [rp[r]..rp[r+1])

row_ptr 크기 = V+1 (exclusive scan of row 길이)
col_idx 크기 = nnz
val 크기 = nnz

3 SpMV (CSR)

y = A · x
y[r] = Σ_{k=rp[r]..rp[r+1]} val[k] · x[ci[k]] row 길이 불균등 → load imbalance 주의

__global__ void spmv_csr(
    float* y, const int* rp,
    const int* ci, const float* val,
    const float* x, int V){
  int r = blockIdx.x*blockDim.x + threadIdx.x;
  if(r >= V) return;
  float s = 0;
  for(int k = rp[r]; k < rp[r+1]; ++k)
    s += val[k] * x[ci[k]];
  y[r] = s;
}

vector · CSR-stream · merge-based kernel 전략 ↗ V05 §13.

4 CSR Build = Scan !

step 1: row 별 nnz 카운트
step 2: exclusive scan → row_ptr
step 3: (r, c) 쌍을 row_ptr 위치로 scatter
§12 scan 직접 활용 ★

row_ptr[r+1] = Σ_i≤r nnz(row_i)

5 Graph = Sparse Matrix

대응 adjacency matrix A: A[u][v] = 1 iff edge (u,v) 존재. sparse → CSR 직접 재사용. row_ptr[v] = v 의 이웃 범위.

V=5, E=7 (directed)
row_ptr: [0, 2, 3, 5, 6, 7]
col_idx: [1,2, 3, 0,4, 4, 2]
(v=0 neighbors: 1,2 · v=1: 3 · v=2: 0,4 · ...)

6 BFS — Level-Sync ★

source 로부터 level 단위 동심원 확장
각 level = 1 kernel launch (grid-level sync = kernel 경계)
depth = O(diameter) ≪ V

frontier_L+1 = {u | ∃v ∈ frontier_L, (v,u)∈E, level[u]=∞} level[u] = L+1 로 세팅

7 Vertex-Centric BFS (naive)

__global__ void bfsV(
    const int* rp, const int* ci,
    int* lvl, int L, int* done){
  int v = blockIdx.x*blockDim.x + threadIdx.x;
  if(lvl[v] == L){
    for(int k = rp[v]; k < rp[v+1]; ++k){
      int u = ci[k];
      if(lvl[u] == INF){
        lvl[u] = L + 1;
        *done = 0;
      }
    }
  }
}

degree 편차 큼 → load imbalance. 고차 vertex 가 warp 독점.

8 Frontier-Based BFS ★ 프론

현재 frontier 큐 유지 (크기 ≪ V)
thread 당 frontier 원소 → 이웃 탐색
새 vertex 는 atomic으로 next frontier push
frontier swap · level++

work = O(V+E) · 실제 access = Σ|frontier_L| 모든 vertex 검사 naive 보다 압도적 빠름

9 Direction-Optimizing (Beamer)

정의 frontier 크면 bottom-up (모든 미방문 vertex가 부모 있는지 확인), 작으면 top-down (기존 push).

social graph (small-world) 에서 큰 이득
heuristic: |frontier| / V 임계값 (≈ 1/20)
Privatization → block smem next-frontier → global atomic 1회

BFS 4 key: 레프엣방 (레벨 · 프론티어 · 엣지 중심 · 방향 전환)

1 학습 파이프라인

Forward: x → (Conv / FC / Norm / Act)… → ŷ
Loss: L(ŷ, y)
Backward: ∂L/∂w · ∂L/∂x (chain rule)
Update: w ← w − η · ∂L/∂w (SGD / Adam)

mini-batch = data-parallel 병렬 축 · distributed 학습은 ↗ V17

2 FC Layer = GEMM

Y = X · W + b
X: [B, K] · W: [K, M] · Y: [B, M] B = batch · K = in feature · M = out feature

§7 tiled matmul 그대로 사용
cuBLAS / CUTLASS ↗ V06
Tensor Core FP16/BF16 input · FP32 accumulate

3 Convolution Layer

Y[n,m,r,c] = Σ_k,i,j X[n,k,r+i,c+j] · W[m,k,i,j] + b[m] n = batch · m = out ch · k = in ch · (i,j) = 필터 위치

5-nested loop → naive kernel CGMA 낮음 → 재구성 필요.

4 im2col + GEMM ★ 펴서 곱하기

변환 conv = unfold input → 2D column matrix → W reshape 후 GEMM.

Conv = (unfold X) × (reshape W)
X_col : [C·Kh·Kw, N·Oh·Ow]  ← patch 중복 저장
W_row : [M, C·Kh·Kw]
Y     : W_row · X_col       ← GEMM!

X_col 중복 저장 → memory ↑, GEMM 재사용 큰 이득
cuDNN 이 내부 알고리즘 자동 선택

5 Conv 알고리즘 선택

방법	유리 조건
Direct	큰 C · 작은 spatial
im2col+GEMM	범용 · 안정
Implicit GEMM	memory ↓ · indexing 복잡
Winograd	3×3 · F(2,3), F(4,3)
FFT	큰 필터 (≥7×7)

cuDNN findAlgorithm 자동 벤치.

6 Backward = 3 GEMM

dW = X^T · dY
dX = dY · W^T
db = Σ_b dY (reduction) 세 GEMM · 동일 tile kernel 재사용

conv backward: transposed conv 로 구현
recompute (gradient checkpoint) → memory ↓ ↗ V17 §7

7 Mixed Precision ★

정의 FP16/BF16 저장 · 연산 + FP32 accumulate. A100 FP16 TC 312 TF (FP32 의 ~16×) · H100 989 TF.

BF16: FP32 와 동일 exponent → loss scaling 불필요
FP16: loss scaling 필요 (underflow 방지)
FP8: per-tensor amax history · delayed scaling
dtype 상세 ↗ V09 §2~5

accumulate 는 FP32 유지. 이유: 부분합이 길어 오차 누적.

8 Normalization Kernel

LayerNorm: y = γ · (x − μ) / √(σ² + ε) + β
RMSNorm: y = γ · x / √(mean(x²) + ε)

2-pass (mean → var) → 1-pass Welford
warp reduction 으로 row-wise stats
residual · dropout fused 가능 ↗ V08 §8

9 Softmax · Activation

softmax(x)_i = exp(x_i − max) / Σ exp(x_j − max) subtract max → overflow 방지 (log-sum-exp)

2-pass → online softmax 1-pass ↗ V07 §2
ReLU / GELU / SwiGLU: element-wise · fuse into prev GEMM epilogue ↗ V06 §16
attention 은 softmax + matmul 의 합성 kernel ↗ V07

DL primitive 3: 젬임믹 (GEMM · im2col · mixed precision)

1 왜 Multi-GPU?

단일 노드 GPU 메모리 한계 (A100/H100 80 GB)
대형 과학 시뮬 · LLM 학습 · 추론
노드 간: MPI (message passing)
노드 내: CUDA + NCCL (shared · NVLink)

Node0 [GPU0 GPU1 ...] ─┐
Node1 [GPU0 GPU1 ...] ─┼─ IB / NVLink Switch
...                    │
  └─ MPI_COMM_WORLD ───┘

2 Device Management

int n;
cudaGetDeviceCount(&n);        // 노드 내 GPU 수
cudaSetDevice(rank % n);       // rank → GPU 매핑

cudaDeviceProp p;
cudaGetDeviceProperties(&p, id);
// p.name, p.totalGlobalMem, p.multiProcessorCount ...

프로세스당 1 GPU 권장 (context 충돌 회피)
SLURM / PBS 가 rank → GPU 배정

3 Peer Access

int canAccess;
cudaDeviceCanAccessPeer(&canAccess,
                          src, dst);
if(canAccess){
  cudaSetDevice(src);
  cudaDeviceEnablePeerAccess(dst, 0);
  // 이후 dst 포인터 직접 read/write OK
}
cudaMemcpyPeer(d_dst, dst, d_src, src, n);

NVLink 통한 P2P DMA · PCIe 도 가능 (BW 낮음).

4 MPI 핵심 API

API	의미
`MPI_Init`	초기화
`MPI_Comm_rank`	내 rank
`MPI_Comm_size`	총 rank 수
`MPI_Send/Recv`	P2P blocking
`MPI_Isend/Irecv`	non-blocking + Wait
MPI_Allreduce	모든 rank reduce 결과
`MPI_Bcast`	rank 0 → all
`MPI_Barrier`	동기화
`MPI_Reduce_scatter`	reduce + scatter

5 Domain Decomposition ★

정의 격자를 rank 수만큼 분할 · 각 rank가 자신 영역 담당. 경계는 ghost / halo를 인접 rank와 교환.

┌──────┬──────┬──────┐
│rank0 │rank1 │rank2 │
│  ▒▒▒ │▒▒▒▒▒▒│ ▒▒▒  │  ← halo 교환
└──────┴──────┴──────┘
1D · 2D · 3D 분할 (surface/volume ratio 고려)

6 전형 Loop

for(step){
  MPI_Isend(halo_right, right_rank, ..., &req[0]);
  MPI_Irecv(halo_left,  left_rank,  ..., &req[1]);

  compute_interior<<<>>>(...);   // overlap

  MPI_Waitall(2, req, status);
  compute_boundary<<<>>>(...);
}

통신·계산 overlap 이 strong scaling 핵심. Isend/Irecv 필수.

7 CUDA-aware MPI ★

GPU 포인터를 MPI_Send 에 직접 전달 가능
내부적으로 GPUDirect RDMA → CPU 경유 ×
OpenMPI · MVAPICH2 · HPC-X 지원
컴파일 확인: ompi_info --all | grep mca_btl_openib_cuda

effective BW ≈ min(NVLink, IB/Ethernet) NVLink 600~900 GB/s · IB HDR 25 GB/s → inter-node 병목

8 NCCL vs MPI

항목	MPI	NCCL
범위	일반 HPC	DL 특화
원소	임의 datatype	GPU tensor
topology	수동/자동	자동 detect
API	많음 (수십)	좁음 (collective 중심)
사용처	과학 시뮬	PyTorch DDP · FSDP

collective · topology · ring/tree 상세 ↗ V15.

9 Scaling 개념

종류	고정	확장
Strong	총 문제 크기	rank ↑ → time ↓ (Amdahl)
Weak	rank당 크기	rank ↑ → 문제 ↑ (Gustafson)

α-β 통신 모델 · ring allreduce 알고리즘 ↗ V15 §3
TP / PP / DP / EP / SP / CP 병렬화 축 ↗ V15 §7~13

MPI+CUDA 4: 도할오어 (도메인 · 할로 · 오버랩 · 어웨어 MPI/NCCL)

1 7 Parallel Pattern 1-liner

패턴	핵심 1-line
Map	thread 1개 → output 1개 · embarrassingly parallel
Stencil	이웃 참조 · halo + smem tile
Reduction	tree · shuffle · multi-block
Scan	Brent-Kung · up/down-sweep
Histogram	privatization · atomic merge
Convolution	const filter + tiled halo
Sort/Merge	co-rank · radix scan/scatter

2 Indexing 공식 5종 ★

1D: i = bx·Bx + tx
2D row: r = by·By + ty · c = bx·Bx + tx · idx = r·W + c
3D: z·(H·W) + y·W + x
grid-stride: for(i; i<N; i += gridDim·blockDim)
warp id = tid/32 · lane = tid%32

3 Occupancy 3제약

warps_SM ≤ min(
max_warps = 64,
reg_SM/(reg_thr·32),
smem_SM/smem_block)
Occupancy = warps_active/64

4 메모리 6종 1-line

scope	특징
Register	1 cyc · thread · 255 상한
Shared	30 cyc · block · 32 bank
L2	200 cyc · device · 40/50 MB
Global	400 cyc · grid · 80 GB
Constant	broadcast · 64 KB
Texture	spatial cache · RO

5 흔한 실수 10선 ★

host 포인터를 kernel에 전달 → segfault
cudaMemcpy 방향 플래그 혼동
launch 직후 오류 체크 누락 (cudaGetLastError)
boundary if 빠짐 → OOB
__syncthreads() 를 분기 안쪽
shared mem 초기화 race (sync 누락)
strided global access → BW 낭비
shared mem bank conflict (32-way)
auto 배열 → local spill → DRAM
FP16 accumulate → overflow · 결과 망가짐

6 수식 Cheat

Amdahl: 1/((1−p) + p/s)
CGMA: FLOP / Byte_HBM
Roofline: min(peak, AI·BW)
Little: in-flight = BW · latency
Tiling reuse = TILE
Scan BK: work 2N, depth 2 log N
Reduction: work N, depth log N

7 니모닉 총정리

CUDA 5단계: 할복실복해
Compute 5: SWDLT
메모리 6: 전상지공레텍
Occupancy 4: BTRS
튜닝 5: 병코재분레

8 패턴 뱃지

Map Stencil Hist Reduce Scan Merge Sort SpMV BFS Conv GEMM

공통 변환: tile + smem + coalesce + atomic 최소화 + scan 기반.

9 18권 전체 의존 지도

V01 (PMPP)
 │
 ├─► V02 (HW)    ─► V03 (PTX) ─► V04 (Hopper)
 │                                   │
 ├─► V05 (Primitive)                 │
 │     │                             │
 │     └─► V06 (GEMM) ─► V07 (Attn)
 │                            │
 ├─► V09 (Numeric) ─► V10 (Quant)
 │                                   │
 └─► V15 (Distrib)                   │
           ├─► V16 (Serving)         │
           └─► V17 (Training) ◄──────┘

V11–V14: Triton · MLIR · Inductor · XLA/TVM
V18: Profiling (모든 권 뒤)

10 다음 권으로 이동

실리콘 정량 · 세대별 기능 → ↗ V02
PTX ISA · mma · cp.async → ↗ V03
Hopper TMA · WGMMA · SASS → ↗ V04
병렬 primitive 고급 · Stream-K → ↗ V05
CUTLASS · CuTe · layout algebra → ↗ V06
FlashAttention v1/v2/v3 · Paged → ↗ V07
MoE · RoPE · RMSNorm · speculative → ↗ V08
FP32/TF32/BF16/FP16/FP8/FP4 → ↗ V09
GPTQ · AWQ · SmoothQuant · KV quant → ↗ V10
Triton compiler · Layout inference → ↗ V11
MLIR · NVPTX backend → ↗ V12
TorchInductor · FX graph → ↗ V13
XLA · TVM · Polyhedral → ↗ V14
NCCL · TP/PP/DP/EP/SP/CP → ↗ V15
vLLM · SGLang · TRT-LLM → ↗ V16
FSDP · ZeRO · checkpointing → ↗ V17
Roofline · stall · metric → ↗ V18

11 완성 체크리스트 ★

점검 항목	OK
warp 정렬 분기 (div ↓)	□
Coalesced global access	□
Shared mem reuse (tile)	□
Bank conflict 회피 (pad/swizzle)	□
Occupancy 50%+ · reg 과다 ×	□
atomic → smem 먼저	□
boundary if 빠짐 없음	□
`__syncthreads` uniform 위치	□
CUDA_CHECK · memcheck 통과	□

실무: CUB · thrust · cuBLAS · cuSPARSE · cuDNN · CUTLASS 먼저 고려. 직접 구현은 특수 케이스 · fused activation/norm 수준만.