CUDA 18 · T1 HW & ISA · CONTENT-FIRST · A4 LANDSCAPE · 18p

PTX ISA 완전판 단권화

Ampere PTX: mma · ldmatrix · cp.async · Memory Model

Volume V03/18

Tier T1 HW & ISA

선행 V01, V02

용도 PTX 읽기·inline asm 이해의 정지된 지도

범례

핵심 용어 (kw · 노란 배경)

매우 중요 (kw2 · 분홍)

정의 / 수식 박스

예시 / PTX snippet

빨강주의 · 함정

★Ampere 핵심 (페이지당 ≤3)

(!)니모닉 (권당 ≤5)

↗cross-volume reference

→인과 · 흐름

∵∴이유 · 결론

인쇄 A4 가로 / 여백 없음 / 배경 그래픽 포함 · Ctrl(⌘)+P

Source: PTX ISA 8.4+ · CUDA C PG Appendix · CUTLASS inline asm

1 Virtual ISA 가상 명령어 집합

정의 PTX = Parallel Thread Execution. NVIDIA GPU용 architecture-independent 중간 표현. 실제 HW는 SASS (architecture-specific binary)에서 돈다.

PTX는 stable contract — 세대 건너 forward-compat
SASS는 세대마다 opcode·encoding 변경 ↗ V04 §12
LLVM IR과 유사하나 GPU-specialized (warp·shared·predicate)

2 Compilation Flow ★ .cu→PTX→SASS

.cu  --(nvcc front)-->  .ptx
.ptx --(ptxas)-->        .cubin (SASS)
.cubin ──pack──>         fatbinary
fatbin + embedded PTX ──>  application

// runtime (JIT) path:
driver.loadModule(ptx) → ptxas(JIT) → SASS

Source: CUDA C PG Compilation 섹션 (13.1)

3 Arch String -arch / -code

문자열	의미	대상
`compute_80`	PTX virtual arch	Ampere GA100
`sm_80`	SASS real arch	A100
`sm_86`	SASS real arch	GA10x 소비자
`sm_90`	SASS real arch	Hopper H100
`sm_90a`	arch-specific	TMA/WGMMA 용

4 Forward Compatibility

원칙 오래된 PTX는 새 GPU에서 JIT로 재-컴파일 가능. 반대로 새 PTX feature를 옛 GPU에서는 에뮬 불가.

fatbin = SASS(여러 arch) + PTX(1 virtual arch) 번들
일치하는 SASS 없으면 PTX가 드라이버에서 JIT
JIT 캐시: ~/.nv/ComputeCache

5 PTX Version ↔ Arch

PTX ver	도입 Arch	대표 기능
7.0	sm_80	cp.async, mma BF16/TF32
7.1	sm_86	ldmatrix 확장
7.5	sm_86	일반 개선
7.8	sm_90	TMA, WGMMA, cluster
8.0	sm_90	tensormap
8.3+	sm_90a	arch-specific 확장

cf. PTX ISA 8.4 (CUDA 12.4) 기준

6 왜 PTX를 읽는가

inline asm로 intrinsic 없는 명령 호출
컴파일러 출력 검증 (의도한 fma가 났는가)
CUTLASS 등 라이브러리 커널의 실제 동작 추적
register pressure·unroll 결과 확인

7 PTX ↔ SASS 대응

PTX 1 instr → SASS 0~N instr ptxas가 register 할당·instruction scheduling·peephole 최적화 수행. 1:1 대응 아님.

PTX mad.lo.s32 → SASS IMAD
PTX ld.global → SASS LDG.E.128
PTX mma.sync → SASS HMMA.16816.F32
상세 매핑 ↗ V04 §13

8 PTX 생성 트리거

# .ptx 파일 직접 보기
nvcc -ptx kernel.cu -arch=sm_80

# PTX + SASS 모두
nvcc -keep kernel.cu -arch=sm_80
# → kernel.ptx, kernel.cubin 생성

# cubin에서 SASS 역)
cuobjdump --dump-sass kernel.cubin

9 핵심 용어 한 줄

PTX: virtual ISA text (human-readable)
SASS: real ISA binary (GPU가 실제 실행)
ptxas: PTX → SASS compiler
fatbinary: multi-arch 번들
JIT: driver가 runtime에 PTX 컴파일

흔한 오해: PTX는 실제 ISA가 아니다. GPU는 SASS만 실행한다. 성능 튜닝은 SASS 관찰이 최종 근거.

1 파일 골격 ★ ver · target · addr · func

// PTX skeleton
.version 7.8
.target  sm_80
.address_size 64

.visible .entry mykernel(
  .param .u64 mykernel_param_0,
  .param .u32 mykernel_param_1
)
{
  .reg .b32 %r<8>;
  .reg .b64 %rd<4>;
  .reg .pred %p<4>;
  // body
  ret;
}

2 Directive . 으로 시작

directive	역할
`.version`	PTX ISA 버전
`.target`	대상 sm_XX
`.address_size`	32 / 64
`.entry`	kernel 함수
`.func`	device 함수
`.visible`	external 노출
`.param`	인자 선언
`.reg`	register 선언
`.shared`	shared mem 변수

3 Instruction 형식 ★

[@pred] op.modifier.type dst, src1, src2, …; pred: predicate guard · modifier: state space / rounding / cache hint · type: operand 폭·부호·부동

// 예
@%p1 add.s32 %r3, %r1, %r2;
     mad.lo.s32 %r5, %r3, %r4, %r5;
     ld.global.nc.f32 %f1, [%rd1];
     st.shared.v4.f32 [%rs0], {%f0,%f1,%f2,%f3};

4 Type Modifier (!) uspbf

type	크기	해석
`.u8 .u16 .u32 .u64`	1~8B	unsigned int
`.s8 .s16 .s32 .s64`	1~8B	signed int
`.b8 .b16 .b32 .b64 .b128`	1~16B	opaque bit
`.f16 .f32 .f64`	2~8B	IEEE float
`.bf16`	2B	brain float
`.tf32`	19-bit	TC-only, 32b 폭
`.f16x2 .bf16x2`	4B	packed 2-way
`.pred`	1-bit	predicate

5 Register Naming convention

prefix	type	관례
`%r`	.b32 / .u32 / .s32	general 32-bit
`%rd`	.b64 / .u64	address 64-bit
`%f`	.f32	float
`%fd`	.f64	double
`%h`	.f16 / .b16	half
`%p`	.pred	predicate
`%rs`	.b16	short

compiler 생성 PTX의 관례. inline asm은 자유.

6 Special Register %tid %ctaid %ntid %nctaid

sreg	대응
`%tid.{x,y,z}`	threadIdx
`%ntid.{x,y,z}`	blockDim
`%ctaid.{x,y,z}`	blockIdx
`%nctaid.{x,y,z}`	gridDim
`%laneid`	lane 0~31
`%warpid`	SM내 warp
`%smid`	SM 번호
`%clock %clock64`	cycle counter

주의: %warpid, %smid는 volatile. 스케줄러가 warp를 이동시키므로 안정적 식별자 아님.

1 State Space 정의

정의 State Space = 주소가 속한 메모리 영역. PTX는 type-qualified address를 쓴다. C의 storage class와 일대일 대응 ↗ V01 §6.

2 전체 목록 ★ reg · sreg · param · local · global · shared · const · tex

space	범위	접근 명령
`.reg`	thread	직접 피연산자
`.sreg`	thread (RO)	special reg
`.param`	kernel / func	`ld.param`
`.local`	thread	`ld.local/st.local`
`.shared`	CTA	`ld.shared/st.shared`
`.global`	device	`ld.global/st.global`
`.const`	device (RO)	`ld.const`
`.tex`	device (RO)	`tex` (legacy)
`.generic`	통합 주소	`ld / st` (no suffix)

3 물리 매핑 실제 HW

state space	physical	latency
.reg	RF 64K×32b/SM	0 cyc
.shared	SMEM 192KB/SM	~20 cyc
.local	HBM (per-thr)	~400 cyc
.global	HBM	~400 cyc
.const	const cache	~4 cyc (hit)
.param	const bank	~4 cyc

A100 기준. 정확 수치 ↗ V02 §14

4 .param 상세

kernel 인자는 constant memory bank에 배치
최대 4KB (sm_70~), sm_90+는 32KB
device 함수의 .param은 ABI에 따라 stack 또는 reg

ld.param.u64 %rd1, [mykernel_param_0];
cvta.to.global.u64 %rd2, %rd1;
ld.global.f32 %f1, [%rd2];

5 Generic vs Specific Addr cvta · cvta.to

cvta.to.global %rd, %rd → specific | cvta.global %rd, %rd → generic Unified Virtual Addressing: generic pointer 1개로 global/shared/local 모두 주소. state space는 HW가 상위 비트로 판별.

// shared → generic
cvta.shared.u64 %rd_gen, %rd_sh;
// generic → global
cvta.to.global.u64 %rd_g, %rd_gen;

6 .shared 선언

.shared .align 16 .b8
  smem_buf[16384];

// dynamic (extern shmem)
.extern .shared .align 16
  .b8 smem_dyn[];

정적 shmem: __shared__ float buf[N] 대응
동적 shmem: extern __shared__ ..., 런치 3번째 인자로 크기
CTA 내 모든 thread가 같은 주소 공간

함정: .local은 이름과 달리 HBM. register spill이 이곳으로 가면 latency ×20 ↑. 컴파일러의 "register pressure too high" 경고는 이걸 의미.

1 정수 산술 add sub mul mad div rem

op	형태	비고
`add.s32`	d = a + b	wrap
`add.sat.s32`	saturate	clamp
`add.cc.s32`	CC.CF set	multi-word
`mul.lo.s32`	하위 32b	wrap 허용
`mul.hi.s32`	상위 32b	64-bit 확장
`mul.wide.s32`	→ .s64	확장 결과
`mad.lo.s32`	a·b + c	IMAD 대응
`div.s32`	비싸다	~수십 cyc
`rem.s32`	비싸다	div+mul

2 Float 산술 ★

op	의미
`add.f32`	a + b
`mul.f32`	a · b
`fma.rn.f32`	a·b + c (1 rounding)
`sub.f32`	a − b
`neg.f32`	−a
`abs.f32`	\|a\|
`min.f32 max.f32`	순서
`rcp.rn.f32`	1/a (근사)
`sqrt.rn.f32`	root
`rsqrt.approx.f32`	1/√a (fast)

3 Rounding Mode (!) rn · rz · rm · rp

suffix	mode
`.rn`	round to nearest even (IEEE)
`.rz`	round toward zero (truncate)
`.rm`	round toward −∞
`.rp`	round toward +∞

fma/mul/add는 rounding 필수
.ftz: flush subnormal to zero
.sat: clamp to [0,1] (f32) 또는 type range

4 FMA의 의미 핵심

fma.rn.f32 d, a, b, c ≡ round( a·b + c ) 1회 rounding. mul+add 분리 시 2회 rounding → 수치 차이. 수치 안정성·Kahan sum에 중요. 상세 ↗ V09 §3.

5 Logical & Bit and or xor not shl shr

op	의미
`and.b32`	bitwise AND
`or.b32 xor.b32`	OR / XOR
`not.b32`	~a
`shl.b32`	left shift
`shr.s32 shr.u32`	signed / unsigned
`popc.b32`	popcount
`clz.b32 bfind.u32`	count leading zero / find MSB
`brev.b32`	bit reverse

6 setp / selp ★ compare → pred → select

// compare
setp.lt.s32 %p1, %r1, %r2;
// select: %r3 = %p1 ? %r4 : %r5
selp.s32 %r3, %r4, %r5, %p1;

// predicated
@%p1 add.s32 %r3, %r1, %r2;

cmp suffix	의미
`eq ne`	=, ≠
`lt le gt ge`	< ≤ > ≥
`lo ls hi hs`	unsigned ver
`equ neu ltu …`	float unordered
`num nan`	NaN test

7 Type 변환 cvt

// f32 → s32 (truncate)
cvt.rzi.s32.f32 %r, %f;
// f16 → f32
cvt.f32.f16 %f, %h;
// u32 → f32 (round-nearest)
cvt.rn.f32.u32 %f, %r;

float ↔ int: rounding mode 필수
float wide→narrow: rounding + optional .sat
packed: cvt.rn.f16x2.f32 (두 f32 → f16x2)

함정: mul.lo.s32는 wrap. overflow detect 필요 시 mul.wide 후 상위 비교. .sat은 float에만 meaningful.

1 ld / st 형식 ★

ld.space.cache.vec.type dst, [addr]; space: global/shared/local/const/param · cache: hint · vec: .v2 .v4 · type: 데이터 폭

ld.global.f32 %f1, [%rd1];
ld.global.v4.f32 {%f0,%f1,%f2,%f3}, [%rd1];
ld.global.nc.f32 %f1, [%rd1];  // non-coherent
st.global.wt.f32 [%rd1], %f1;  // write-through
st.shared.v2.b32 [%rs1], {%r0,%r1};

2 Cache Modifier ★ load ca · cg · cs · lu · cv · nc

mod	의미
`.ca`	cache all (L1+L2) · default
`.cg`	cache at global (L2 only, L1 bypass)
`.cs`	cache streaming (evict first)
`.lu`	last use (discard after)
`.cv`	cache volatile (re-fetch)
`.nc`	non-coherent (read-only cache)

sm_80+: .nc는 __ldg intrinsic 대응

3 Cache Modifier store

mod	의미
`.wb`	write-back (default)
`.cg`	cache global (L2만)
`.cs`	streaming
`.wt`	write-through (L2 통과)

4 Vector Load ★ .v2 .v4

// 128-bit load (4 × f32)
ld.global.v4.f32
  {%f0,%f1,%f2,%f3}, [%rd1];
// 128-bit load (2 × f64)
ld.global.v2.f64
  {%fd0,%fd1}, [%rd1];
// 64-bit load
ld.global.v2.f32
  {%f0,%f1}, [%rd1];

peak bandwidth에 필수: 128-bit transaction
주소 align: v4.f32 → 16B align
destination regs는 연속 번호 권장

5 Alignment 규칙

type / vec	align (byte)
.b8 / .u8 / .s8	1
.b16 / .f16	2
.b32 / .f32	4
.v2.b32	8
.v4.b32 / .v2.b64	16
.b128	16

misaligned → 여러 transaction으로 분해, BW 손실

6 Volatile & Weak

ld.volatile.global: 최적화 불가, cache bypass 의도
ld.weak: relaxed consistency 명시 (PTX 6+)
일반 ld는 relaxed (acquire/release 아님)
ordering 필요 시 ld.acquire / fence 병행 ↗ §12

흔한 실수: ld.nc(non-coherent)는 kernel 내에서 write가 있는 메모리에 쓰면 stale data. read-only 입력에만 사용.

주의: ld.global.cg는 Ampere에서 L1 bypass. working set > L1 인 대량 스트리밍에 적합, small reuse에는 부적합.

1 cp.async 정의 ★ L1 bypass · ld→reg 거치지 않음

의미 global → shared memory 비동기 복사. register를 거치지 않음 → RF pressure 해소. Ampere (sm_80)에서 신설. Hopper의 TMA와 대응 ↗ V04 §5.

기존: ld.global → reg → st.shared (2-hop)
신규: cp.async (1-hop, DMA-like)
async → prefetch 파이프라인 구성 가능

2 기본 형태

// 16B copy, cache global (L1 bypass)
cp.async.cg.shared.global
  [smem_addr], [gmem_addr], 16;

// 8B copy, cache all
cp.async.ca.shared.global
  [smem_addr], [gmem_addr], 8;

// 4B copy
cp.async.ca.shared.global
  [smem_addr], [gmem_addr], 4;

byte size: 4 / 8 / 16만 허용
.ca: L1+L2 · .cg: L2만 (큰 복사에 유리)
16B 경우 align 16 필수

3 commit / wait ★

// 현재까지 issue한 cp.async들을
// 하나의 async-group으로 묶음
cp.async.commit_group;

// 가장 오래된 그룹부터 최대 N개만 남기고 대기
cp.async.wait_group N;

// 모든 그룹 완료 대기
cp.async.wait_all;

wait_group N ⇔ in-flight group 수 ≤ N N=0: 모두 완료. N=stages−2: 앞 stage만 대기, 뒤 stage는 in-flight 유지. multi-stage pipeline 핵심.

4 Byte Size × Cache 매트릭스

size	.ca	.cg	용도
4B	✓	✓	int / f32
8B	✓	✓	f64 / v2.f32
16B	✓	✓ ★	v4.f32 · 권장

warp 32 × 16B = 512B/issue, 128B 트랜잭션 4개로 퍼짐

5 Double Buffering Loop ★ 전체 pseudo-code

// STAGES = 2: double buffer
// STAGES = 3~: multi-stage
const S = 3;
.shared .align 16 .b8 sA[S][TILE_BYTES];
.shared .align 16 .b8 sB[S][TILE_BYTES];

// prologue: S-1 stage prefetch
for (s = 0; s < S-1; ++s) {
  cp_async_tile(sA[s], gA[s]);
  cp_async_tile(sB[s], gB[s]);
  cp.async.commit_group;
}

// main loop
for (k = 0; k < K_TILES; ++k) {
  // issue next stage (overlap)
  if (k + S - 1 < K_TILES) {
    cp_async_tile(sA[(k+S-1) % S],
                  gA[k+S-1]);
    cp_async_tile(sB[(k+S-1) % S],
                  gB[k+S-1]);
  }
  cp.async.commit_group;

  // wait so [k] is ready
  cp.async.wait_group S-2;
  bar.sync 0;  // __syncthreads

  // compute on stage k
  ldmatrix(rA, sA[k % S]);
  ldmatrix(rB, sB[k % S]);
  mma.sync(rD, rA, rB, rD);
}

// epilogue
cp.async.wait_all;
bar.sync 0;

함정 3: ①commit_group 빠뜨리면 wait가 영원히 풀리지 않음. ②__syncthreads는 in-flight async를 기다리지 않음. ③ byte size ≠ 4/8/16 이면 illegal.

1 mma.sync 의미 ★

의미 warp 32 thread 협력으로 D = A·B + C 1회 실행. 각 thread는 fragment (register tile) 1조각만 보유. warp-wide collaboration, .sync는 warp 내 barrier 포함.

mma.sync.aligned.m{M}n{N}k{K}.row.col..Dtype.Atype.Btype.Ctype M/N/K: 행렬 shape · row/col: A/B layout · Dtype/Atype/Btype/Ctype: 4개 operand의 element type

2 Ampere 기본 형태

mma.sync.aligned.m16n8k16.row.col
  .f32.f16.f16.f32
  {%f0,%f1,%f2,%f3},   // D (f32 × 4)
  {%r0,%r1,%r2,%r3},   // A (f16 × 8)
  {%r4,%r5},           // B (f16 × 4)
  {%f4,%f5,%f6,%f7};   // C (f32 × 4)

A100 throughput: 1 mma / cycle / SM (FP16 m16n8k16)

3 shape × dtype 완전 표 ★ Ampere sm_80

shape	A	B	C/D	비고
m16n8k8	f16	f16	f16	Volta-style
m16n8k8	f16	f16	f32	mixed
m16n8k16	f16	f16	f16	Ampere std
m16n8k16	f16	f16	f32	★ DL 기본
m16n8k16	bf16	bf16	f32	★ BF16 training
m16n8k8	tf32	tf32	f32	★ FP32 acc
m16n8k16	s8	s8	s32	INT8
m16n8k16	u8	u8	s32	UINT8
m16n8k32	s8	s8	s32	larger K
m16n8k32	u8	s8	s32	mixed sign
m16n8k64	s4	s4	s32	INT4
m16n8k128	b1	b1	s32	binary (xor.popc)
m8n8k4	f64	f64	f64	FP64 TC

.row.col: A row-major, B col-major (standard). 다른 조합은 데이터 재배치 필요.

4 Fragment Layout m16n8k16

A: 16 × 16 (row-major, f16)
  thread t 의 소유:
    row = (t / 4)          (0..7)
    col = (t % 4) * 2 + j  (j=0,1)
    plus  row+8 행의 같은 col
  → t당 8 × f16 = 4 × .b32

B: 16 × 8 (col-major, f16)
  row = (t / 4)
  col = (t % 4) * 2 + j
  → t당 4 × f16 = 2 × .b32

D: 16 × 8 (row-major, f32)
  row = (t / 4), col = (t%4)*2 + j
  plus row+8
  → t당 4 × f32

5 중요 규칙

.sync = warp 내 암묵적 barrier (activemask 기준)
.aligned = 모든 32 thread가 같은 instruction 도달 필수
divergent 상태에서 호출 금지
accumulator D==C 허용 (in-place 누적)
fragment는 thread-private register; shmem 아님 (Hopper WGMMA에서 바뀜)

함정: fragment layout을 모르고 shmem에서 바로 reg로 옮기면 element 순서 어긋남. ldmatrix가 바로 이 layout에 맞춰 로드 ↗ §8.

1 ldmatrix 정의 ★

목적 warp 32 thread가 협력해 shmem 상의 8×8 fragment 1~4개를 mma 입력 layout에 맞춰 register로 로드. transpose variant 포함.

ldmatrix.sync.aligned.x{1|2|4}[.trans].m8n8.shared.b16 x1/x2/x4: fragment 개수 · .trans: 전치 load · m8n8: 고정 shape · .b16: element 폭

2 기본 사용

// x4: 4 fragment 한 번에
ldmatrix.sync.aligned.x4.m8n8.shared.b16
  {%r0,%r1,%r2,%r3}, [%rs_base];

// x4 + transpose (B matrix용)
ldmatrix.sync.aligned.x4.trans
  .m8n8.shared.b16
  {%r0,%r1,%r2,%r3}, [%rs_base];

// x2 (A: 16x16 = 2 x 8x8)
ldmatrix.sync.aligned.x2.m8n8.shared.b16
  {%r0,%r1}, [%rs_base];

3 Thread → Row Address 매핑

x1 (1 fragment, 8 rows):
  thread 0~7   : row 0~7 base 제공
  thread 8~31  : 주소 무시

x2 (2 fragment, 16 rows):
  thread 0~7   : frag0 row 0~7
  thread 8~15  : frag1 row 0~7
  thread 16~31 : 무시

x4 (4 fragment, 32 rows):
  thread 0~7   : frag0
  thread 8~15  : frag1
  thread 16~23 : frag2
  thread 24~31 : frag3

4 각 thread 부담

variant	t당 결과	용도
x1	.b32 × 1	8×8 한 조각
x2	.b32 × 2	16×8 A 반쪽
x4	.b32 × 4	A full 16×16
x2.trans	.b32 × 2	B 전치 load
x4.trans	.b32 × 4	B full

5 mma와의 정합 ★

ldmatrix.x4 → A(16×16 f16) 4개 fragment
ldmatrix.x2.trans → B(16×8 f16) 2개 fragment
로드 결과는 그대로 mma.sync에 입력
별도 shuffle / permute 불필요

6 shmem Swizzle과의 결합

이유 naive row-major shmem + ldmatrix → bank conflict 8×. shmem을 swizzled 레이아웃으로 저장 → conflict 0.

// XOR swizzle (8-way)
col_swz = col ^ ((row & 7) << 3);
// 쓸 때와 읽을 때 같은 swizzle
// → ldmatrix가 conflict 없이 읽음

CUTLASS의 SmemLayoutSwizzle 참고. 상세 ↗ V06 §9.

7 Ampere .x4 패턴

// A 16x16 f16 → fragment 4개
// B 16x8 f16 → fragment 2개 (trans)
ldmatrix.x4     {a0,a1,a2,a3}, [sA];
ldmatrix.x2.trans {b0,b1},     [sB];
mma.sync.m16n8k16.row.col
  .f32.f16.f16.f32
  {d0,d1,d2,d3},
  {a0,a1,a2,a3},
  {b0,b1},
  {d0,d1,d2,d3};

주의: 주소는 thread별로 다름. 각 thread가 자기가 맡을 row의 base를 제공. 나머지 thread는 값 무시. %laneid로 분기.

1 shfl.sync 4가지 ★ idx · up · down · bfly

mode	C intrinsic	의미
`.idx`	__shfl_sync	임의 src lane
`.up`	__shfl_up_sync	lane − delta
`.down`	__shfl_down_sync	lane + delta
`.bfly`	__shfl_xor_sync	lane ^ mask

shfl.sync.down.b32
  %r_out | %p_valid,
  %r_in, 0x10, 0x1f, 0xffffffff;
// delta=16, clamp=31, mask=full

2 인자 해석

shfl.sync.mode.b32 d|p, a, b, c, membermask; a: source value · b: lane delta/idx/mask · c: clamp/segment · membermask: 참여 lane bitmap · d: result · p: src lane 유효 predicate

3 vote.sync

op	결과
`vote.all.pred`	모두 true면 1
`vote.any.pred`	하나라도 true면 1
`vote.uni.pred`	모두 동일하면 1
`vote.ballot.b32`	32-bit mask

vote.sync.ballot.b32
  %r_mask, %p_cond, 0xffffffff;
// == __ballot_sync(0xff..., cond)

4 activemask

activemask.b32 %r_mask;
// 현재 실행 중인 lane bitmap
// == __activemask()

divergent branch 내부에서 현재 active lane 파악
다음 shfl.sync의 membermask로 사용
주의: convergence 보장 아님

5 match.sync Volta+

// 같은 값 가진 lane mask
match.any.sync.b32 %r_m, %r_v, %mask;
match.all.sync.b32 %r_m | %p,
                   %r_v, %mask;

6 membermask의 역할 ★

원칙 Volta 이후 warp는 independent thread scheduling (ITS). 모든 warp-primitive는 membermask로 참여 lane을 명시.

0xffffffff: 전체 32 lane
divergent branch 내부: activemask 결과 사용
참여 lane에서 값 unspecified → UB
동기화 contract를 HW가 검증 (warning/error)

7 Warp Reduction 예

// sum reduction
for (d = 16; d > 0; d >>= 1) {
  shfl.sync.bfly.b32 %t|%p,
    %sum, %d, 0x1f, 0xffffffff;
  @%p add.f32 %sum, %sum, %t;
}
// lane 0 에 총합

5 단계 = log₂32, 총 32개 값 합산

구버전 호환: shfl.b32 (no .sync) 는 Volta 이전 전용. sm_70+ 에서는 항상 .sync + membermask 사용.

1 Predication ★ @p / @!p

정의 instruction 앞에 @p 또는 @!p를 붙여 predicate가 참/거짓일 때만 실행. branch 없이 분기 → warp divergence 회피.

setp.lt.s32 %p1, %r1, %r2;
@%p1  add.s32 %r3, %r1, %r2;
@!%p1 sub.s32 %r3, %r1, %r2;

2 언제 predication이 유리한가

시나리오	선택
양 분기 모두 짧음	predication
한 분기 매우 김	branch
분기 비용 > 양쪽 실행 비용	predication
lane 모두 같은 경로	branch (uniform)

ptxas가 heuristic으로 선택. -maxrregcount·__builtin_expect 영향.

3 Branch & Jump

instr	의미
`bra LBL`	unconditional
`@p bra LBL`	predicated
`bra.uni LBL`	warp-uniform 선언
`call .func`	device func call
`ret`	return
`exit`	thread 종료
`trap`	fatal abort
`brkpt`	debugger halt

4 Divergence & Reconverge

개념 warp 내 lane이 다른 경로를 택하면 serialized 실행. reconvergence point에서 다시 합쳐짐. Volta 이후 ITS로 자동 수렴 보장되지 않음 → __syncwarp / bar.warp.sync 사용.

// 권장 패턴 (divergent region)
@%p bra TAKEN;
    // not-taken path
    bra DONE;
TAKEN:
    // taken path
DONE:
bar.warp.sync 0xffffffff;
// reconverge 명시

5 Early Exit 패턴

// boundary guard: thread 단위 early exit
setp.ge.s32 %p, %tid, %N;
@%p exit;

// 전 lane이 exit 하면 warp 제거
// 일부만 exit → 남은 lane만 진행

exit: thread 종료, warp 수축
ret: 함수 복귀
kernel entry의 ret = kernel 종료

6 Uniform Branch 선언

bra.uni: 모든 lane이 같은 target임을 컴파일러가 보장
HW가 warp 전체를 한 번에 점프
unify 못 하면 실행은 맞지만 스케줄링 손실
@p bra.uni: 같은 predicate 값을 모두 가짐 전제

7 Branch Metadata

.branchtargets L0, L1, L2;
bra.sync LB, %r_idx;
// indirect branch

함정: predication이 항상 빠르지 않음. 한쪽 분기가 무겁고 다른 lane은 idle → 낭비. if (laneIdx < 8) 같은 lane-partial은 branch가 유리한 경우 많음.

1 atom 형식 ★

          atom.space.scope.op.type  d, [addr], b, c;
          space: .global / .shared · scope: .cta / .gpu / .sys · op: add/cas/exch/min/max/and/or/xor · type: 폭 · c는 CAS에만
        

atom.global.add.u32 %r_old,
  [%rd_addr], %r_val;

atom.global.cas.b32 %r_old,
  [%rd_addr], %r_cmp, %r_new;

atom.shared.max.s32 %r_old,
  [%rs_addr], %r_val;

2 지원 op 매트릭스

op	의미	type
`add`	*a += v	u32 s32 u64 s64 f32 f64 f16 bf16
`cas`	compare-and-swap	b32 b64 b16
`exch`	*a ← v	b32 b64
`min / max`	clamp	u32 s32 u64 s64 (f*: sm_90)
`and / or / xor`	비트	b32 b64
`inc / dec`	wrap count	u32

3 atom vs red

차이 atom = old value 반환. red = old value 버림 (reduction). red가 가볍다.

atom.global.add.u32 %r_old,
  [%rd], %r_val;
// vs
red.global.add.u32 [%rd], %r_val;
// no dst

global histogram: red 선호
lock / counter 시퀀싱: atom (old 필요)

4 Scope Qualifier ★ cta · gpu · sys

scope	범위
`.cta`	CTA 내부 (shmem 범위)
`.gpu`	동일 GPU의 모든 CTA
`.sys`	전체 시스템 (CPU · peer GPU 포함)
`.cluster`	cluster 내부 (sm_90+) ↗ V04 §2

scope가 좁을수록 latency / trafic ↓

5 CAS 패턴 ★

// lock-free update template
LOOP:
  ld.global.u32 %old, [%rd];
  // compute new from old
  ...
  atom.global.cas.b32 %cur,
    [%rd], %old, %new;
  setp.ne.b32 %p, %cur, %old;
@%p bra LOOP;

CAS 실패 → 재시도
contention ↑ → throughput ↓
atomic FP32 add 없는 타입 구현에 사용

6 Ordering Qualifier sm_70+

suffix	의미
`.relaxed`	순서 보장 없음 (default)
`.acquire`	이후 op는 이 뒤
`.release`	이전 op는 이 앞
`.acq_rel`	양쪽

atom.acq_rel.global.add.u32
  %r_old, [%rd], %r_val;

주의: atomicAdd(float*)는 FP32에서 sm_20+, FP64에서 sm_60+ 지원. FP16 atomic은 subnormal·NaN 처리 조건 확인. 상세 ↗ V09 §4.

1 membar vs fence ★

instr	세대	의미
`membar.cta`	전통	CTA 내 memory ordering
`membar.gl`	전통	GPU 전역
`membar.sys`	전통	system-wide
`fence.acq_rel.cta`	sm_70+	모델 기반 acq_rel
`fence.acq_rel.gpu`	sm_70+	GPU scope
`fence.acq_rel.sys`	sm_70+	system scope
`fence.sc.cta`	sm_70+	seq-consistent
`fence.sc.gpu`	sm_70+	sc GPU
`fence.sc.sys`	sm_70+	sc system

fence는 memory consistency 모델 공식화(PTX 6+).

2 bar.sync / barrier

instr	의미
`bar.sync 0`	__syncthreads()
`bar.sync N, cnt`	partial barrier (N: 0~15)
`bar.arrive N, cnt`	arrive only (non-blocking)
`bar.red.and.pred`	barrier + vote
`bar.warp.sync mask`	warp 내 재수렴
`barrier.sync 0`	바뀐 이름(PTX 7.0+)

16개 named barrier: 0~15. CTA 내 역할 분담 가능.

3 의미적 구분 ★

구분

fence: ordering만. thread는 대기 안 함.
bar.sync: arrive + wait. CTA-wide 합류 지점.
bar.arrive: arrive만. wait는 다른 곳에서.
mbarrier: async object (Hopper) ↗ V04 §6

4 Producer-Consumer 예

// Producer
st.global.u32 [buf], %data;
fence.release.gpu;
st.global.u32 [flag], 1;

// Consumer
LOOP: ld.global.u32 %f, [flag];
      setp.eq.u32 %p, %f, 0;
@%p   bra LOOP;
fence.acquire.gpu;
ld.global.u32 %data, [buf];

relaxed ld로 spin. acquire/release로 순서 보장.

5 scope 선택 결정

통신 상대	scope
같은 CTA 내 thread	.cta
다른 CTA 같은 GPU	.gpu
CPU / peer GPU	.sys
같은 cluster (sm_90+)	.cluster

6 축약 매핑

__syncthreads() ≡ bar.sync 0
__threadfence() ≡ membar.gl ≈ fence.sc.gpu
__threadfence_block() ≡ membar.cta
__threadfence_system() ≡ membar.sys
__syncwarp(mask) ≡ bar.warp.sync mask

함정: bar.sync는 arrive count가 CTA 내 모든 thread여야 deadlock 안 남. divergent branch에서 일부만 도달하면 영원히 대기.

1 PTX Memory Model ★

개요 PTX는 weak consistency 기반. CUDA thread는 C++11 style acquire / release / seq_cst semantics 지원 (sm_70+). 각 ordering은 scope 한정자와 함께 지정.

default ld/st: .relaxed
ordering 필요 시 ld.acquire · st.release 또는 fence
scope = 가시성 범위 (cta / gpu / sys / cluster)

2 Ordering 5종

ordering	적용	의미
`.weak / .relaxed`	ld st atom	순서 보장 없음
`.acquire`	ld atom	이후 op를 넘어 이동 금지
`.release`	st atom	이전 op를 넘어 이동 금지
`.acq_rel`	atom fence	양쪽 장벽
`.sc (seq-cst)`	fence	전역 순서 존재

3 Scope × Ordering 완전 매트릭스 ★

scope \ ord	relaxed	acquire	release	acq_rel	sc
.cta	ld/st atom.relaxed.cta	ld.acquire.cta atom.acquire.cta	st.release.cta atom.release.cta	atom.acq_rel.cta fence.acq_rel.cta	fence.sc.cta
.gpu	atom.relaxed.gpu	ld.acquire.gpu atom.acquire.gpu	st.release.gpu atom.release.gpu	atom.acq_rel.gpu fence.acq_rel.gpu	fence.sc.gpu
.sys	atom.relaxed.sys	ld.acquire.sys atom.acquire.sys	st.release.sys atom.release.sys	atom.acq_rel.sys fence.acq_rel.sys	fence.sc.sys
.cluster sm_90	atom.relaxed.cluster	ld.acquire.cluster	st.release.cluster	fence.acq_rel.cluster	fence.sc.cluster

scope 확장 시 latency ↑ (cta < gpu < sys). 꼭 필요한 최소 scope 선택.

4 Acquire / Release 짝

st.release.X [a] ⟶ happens-before ⟶ ld.acquire.X [a] release 이전 모든 op은 acquire 이후 모든 op에 대해 가시화. 두 명령이 같은 주소에 있고 같은 scope X일 때 성립.

5 Sequential Consistency (sc)

의미 fence.sc.X는 전역 total order 생성. 모든 thread가 동일 interleaving 관찰. 비용 최대. 필요 최소로 사용.

acq_rel만으로 표현 불가한 3-thread IRIW 같은 패턴
Dekker's algorithm류 동기화에 필요

6 PTX ↔ CUDA C++ 매핑

CUDA C++	PTX
`memory_order_relaxed`	.relaxed
`memory_order_acquire`	.acquire
`memory_order_release`	.release
`memory_order_acq_rel`	.acq_rel
`memory_order_seq_cst`	.sc
`thread_scope_block`	.cta
`thread_scope_device`	.gpu
`thread_scope_system`	.sys

핵심: ordering과 scope는 독립 축. acquire 하면서 scope가 좁으면 해당 scope 내에서만 acquire 효과. 예: ld.acquire.cta는 다른 CTA 간 동기화에 무력.

1 축소 맥락

현황 legacy 그래픽스 경로. 현대 CUDA DL 워크로드에서는 사용 축소. read-only cache와 hardware sampler 접근이 필요한 이미징 kernel에서만 잔존.

DL은 ld.global.nc + 수동 interpolation으로 대체
volumetric rendering / CT 재구성 등에는 유용
PTX는 지원 유지, 새 기능은 없음

2 Texture PTX 명령

instr	의미
`tex.1d / .2d / .3d`	sampling
`tex.a1d / .a2d`	array
`tex.cube / .acube`	cube map
`tld4`	gather 4 texel
`txq`	query descriptor

3 기본 사용

// 2D bilinear sample
tex.2d.v4.f32.f32
  {%f0,%f1,%f2,%f3},
  [%rd_tex, {%f_u, %f_v}];

// gather (R channel, 4 texel)
tld4.r.2d.v4.f32.f32
  {%f0,%f1,%f2,%f3},
  [%rd_tex, {%f_u,%f_v}];

4 Surface (읽기 쓰기 모두)

instr	의미
`suld.b`	surface load (bypass)
`sust.b`	surface store
`suld.p`	format-converted load
`sured`	surface reduction

surface = 2D/3D array 이미지. texture와 달리 write 가능.

5 Sampler 기능

bilinear / trilinear / anisotropic 보간 (HW)
clamp / wrap / mirror address 모드
normalize coord [0,1]
automatic border / out-of-range 처리
CUDA array에 국한 (일반 linear memory 제한)

6 Bindless Texture modern

// CUDA C: cudaTextureObject_t
tex.2d.v4.f32.f32
  {%f0,%f1,%f2,%f3},
  [%rd_texobj, {%f_u,%f_v}];
// register에 handle 보유 → 함수 전달 OK

CUDA 5+ bindless: handle을 reg / param으로 전달
legacy: texture reference = 전역 심볼

선택 원칙: DL kernel에는 쓰지 않음. 이미지 처리 / 3D 볼륨에서 HW 보간이 이득을 줄 때만. 그 외는 ld.global.nc + manual interpolation이 일반적.

1 asm() 형식 ★

asm volatile("template" : outputs : inputs : clobbers); template: PTX 문자열 · %0 %1 …: operand slot · volatile: 최적화 방지 · clobbers: "memory" 등

asm volatile(
  "mov.u32 %0, %%laneid;"
  : "=r"(lane));

2 Constraint 문자 (!) r · l · f · d · h · n

문자	대응
`"r"`	.u32 / .s32 / .b32 register
`"l"`	.u64 / .s64 / .b64 register
`"f"`	.f32
`"d"`	.f64
`"h"`	.u16 / .b16
`"n"`	compile-time immediate
`"="`	output prefix
`"+"`	in-out

3 operand 배치

// ldmatrix 인라인
asm volatile(
  "ldmatrix.sync.aligned.x4.m8n8"
  ".shared.b16 "
  "{%0,%1,%2,%3}, [%4];"
  : "=r"(x0), "=r"(x1),
    "=r"(x2), "=r"(x3)
  : "r"(smem_addr));

%0~%3 → output regs (순서대로)
%4 → input reg
PTX의 %%는 실제 % 이스케이프

4 volatile의 효과

필요성 nvcc는 asm()을 다른 op와 reorder/elide 할 수 있음. volatile은 이동·제거 금지를 명시.

side-effect 있는 op (I/O, atomic) → volatile 필수
pure compute → volatile 불필요 (컴파일러 최적화 허용)
"memory" clobber: 메모리 부작용 있음을 알림

5 shmem 주소 변환

// generic → .shared smem_addr
uint32_t smem_addr;
asm("cvta.to.shared.u64 %0, %1;"
  : "=l"(tmp64) : "l"(ptr));
asm("cvt.u32.u64 %0, %1;"
  : "=r"(smem_addr) : "l"(tmp64));

ldmatrix / cp.async에 shared addr은 32-bit 필요.

6 mma 인라인 ★

asm volatile(
  "mma.sync.aligned.m16n8k16"
  ".row.col.f32.f16.f16.f32 "
  "{%0,%1,%2,%3}, "
  "{%4,%5,%6,%7}, "
  "{%8,%9}, "
  "{%10,%11,%12,%13};"
  : "=f"(d0), "=f"(d1),
    "=f"(d2), "=f"(d3)
  : "r"(a0), "r"(a1),
    "r"(a2), "r"(a3),
    "r"(b0), "r"(b1),
    "f"(c0), "f"(c1),
    "f"(c2), "f"(c3));

f16 fragment는 .b32에 2개 pack → "r" constraint 사용.

7 cp.async 인라인

asm volatile(
  "cp.async.cg.shared.global "
  "[%0], [%1], 16;"
  :: "r"(smem_addr),
     "l"(gmem_ptr));
asm volatile(
  "cp.async.commit_group;");
asm volatile(
  "cp.async.wait_group %0;"
  :: "n"(STAGES-2));

wait_group의 count는 immediate (n constraint).

함정: inline asm의 operand 갯수/순서 잘못 → 런타임에 garbage. PTX assembler 경고를 반드시 확인. nvcc -Xptxas -v로 상세 출력.

1 nvcc 옵션 ★ ptx · keep · src-in-ptx · Xptxas

옵션	효과
`-ptx`	.ptx 만 생성
`-cubin`	.cubin (SASS) 생성
`-keep`	중간 파일 유지
`-lineinfo`	source line 매핑
`--generate-line-info`	위와 동일
`-src-in-ptx`	PTX에 .cu 소스 삽입
`-Xptxas -v`	reg / smem 사용량 출력
`-Xptxas -O3`	ptxas 최적화 레벨

2 전형적 flow

# .cu → .ptx 만 보기
nvcc -O3 -arch=sm_80 \
     -src-in-ptx -lineinfo \
     -ptx kernel.cu -o kernel.ptx

# reg/smem 사용량
nvcc -arch=sm_80 -Xptxas -v \
     -c kernel.cu

# 전 중간 파일 유지
nvcc -arch=sm_80 -keep kernel.cu

3 -Xptxas -v 해석

// 예시 출력
ptxas info    : Compiling entry function
  '_Z6kerneli' for 'sm_80'
ptxas info    : Function properties for _Z6kerneli
  0 bytes stack frame,
  0 bytes spill stores,
  0 bytes spill loads
ptxas info    : Used 32 registers,
  4096 bytes smem,
  376 bytes cmem[0]

registers: 32 → occupancy 상한 결정
smem: shared memory 사용량
spill stores/loads: >0이면 local mem 사용 → 감속 위험
stack frame: device 함수 호출 스택

4 Name Mangling

의미 C++ 함수명은 PTX에서 Itanium ABI mangled name으로 변환. _Z로 시작.

kernel<float, 256>  →
_Z6kernelIfLi256EEvPKT_Pi

c++filt -p _Z...로 역변환
extern "C" 선언 시 mangling 제거

5 PTX 읽는 순서 루틴

.entry 찾기 → target kernel 확인
.param 선언 → argument layout
.reg 선언 → 사용된 register 종류/수
kernel body의 hot loop 위치 탐색
loop 내 mma / cp.async / ld.global 비율
spill signal: .local 언급 / st.local

6 pragma & hint의 영향

hint	PTX에 미치는 영향
`__launch_bounds__`	`.maxntid` 생성
`#pragma unroll N`	loop 전개
`__restrict__`	alias 제거 → reorder 가능
`__forceinline__`	device 함수 inline
`__noinline__`	call 유지
`-maxrregcount N`	reg 상한 강제

검증 루틴: inline asm을 의도한 instruction으로 방출했는지 grep mma.sync kernel.ptx · grep cp.async kernel.ptx로 확인. 누락 시 -arch 버전 부족 가능성.

1 자주 쓰는 20 instruction ★

instr	한 줄 요약
`ld.global.f32`	HBM → reg f32
`ld.global.v4.f32`	128-bit vector load
`ld.global.nc.f32`	read-only cache
`st.global.f32`	reg → HBM
`ld.shared.f32`	SMEM → reg
`st.shared.f32`	reg → SMEM
`cp.async.cg.shared.global`	async copy (Ampere)
`cp.async.commit_group`	async 그룹 마감
`cp.async.wait_group N`	async 대기
`ldmatrix.x4`	SMEM → mma fragment
`mma.sync m16n8k16`	Tensor Core fma
`shfl.sync.down.b32`	warp reduction
`vote.sync.ballot.b32`	warp mask 생성
`atom.global.add.u32`	HBM atomic
`fence.acq_rel.gpu`	GPU-scope fence
`bar.sync 0`	__syncthreads
`bar.warp.sync 0xff..`	warp 재수렴
`setp.lt.s32`	predicate 생성
`@%p bra L`	predicated branch
`ret`	kernel exit

2 Ampere 핵심 3종 (!) async · tile · mma

cp.async (load) → ldmatrix (stage) → mma.sync (compute) Ampere GEMM 파이프의 3단. 이 순서가 무너지면 throughput 급락. multi-stage로 중첩.

3 State Space 요약

space	한 글자	용도
.reg	R	레지스터
.shared	S	CTA shmem
.global	G	HBM
.const	C	const bank
.local	L	thread-private HBM
.param	P	kernel arg

4 Type Suffix 한 줄

.u/.s: unsigned/signed
.b: opaque bit
.f16/.bf16/.tf32/.f32/.f64: 부동
.f16x2: packed half pair
.pred: 1-bit predicate

5 Modifier 한 줄

.sync: warp-wide barrier 포함
.aligned: 모든 lane 동일 도달 보장
.nc: non-coherent RO cache
.cg: L1 bypass, L2만
.volatile: 최적화 차단

6 Memory Model 한 줄

          (scope: cta|gpu|sys|cluster) × (ord: relaxed|acquire|release|acq_rel|sc)
          필요한 최소 scope × 최소 ordering. 넓고 강할수록 latency ↑.
        

7 Out-of-scope (이 권에서 배제)

SASS 전반 · opcode 대응 ↗ V04 §12~16
Hopper TMA / WGMMA / mbarrier ↗ V04 §4~10
Cluster / DSM ↗ V04 §2~3
실전 inline asm 작성 팁 / 디버깅 수법
특정 커널 튜닝 사례 ↗ V18
실측 benchmark 수치 (vendor 공식 외)

8 다음 권으로

Hopper PTX 추가분: ↗ V04
CUTLASS에서 PTX 실사용: ↗ V06
Attention kernel의 PTX: ↗ V07
FP8 / 양자화 연산 PTX: ↗ V10
SASS 해석: ↗ V04 §12

9 기억 포인트 요약

PTX = virtual ISA, SASS = 실제
Ampere 3종: cp.async · ldmatrix · mma
memory model: scope × ordering 직교
inline asm: constraint + volatile + %% 이스케이프
검증은 -ptx + -Xptxas -v