| intrinsic | src lane | purpose |
|---|---|---|
| __shfl_sync | arbitrary idx | broadcast |
| __shfl_up_sync | lane − d | prefix scan |
| __shfl_down_sync | lane + d | reduction |
| __shfl_xor_sync | lane ^ d | butterfly |

Each intrinsic takes (mask, val, srcLane/delta, width=32); width subdivides the 32-lane warp into smaller logical warps (must divide 32).
```
// 32-wide warp sum; result identical in lanes 0..31
for (int d = 16; d > 0; d >>= 1)
    v += __shfl_xor_sync(0xffffffff, v, d);
```
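As a sanity check, the butterfly exchange can be modeled on the CPU with lanes as list indices (a Python sketch; `warp_butterfly_sum` is an illustrative name):

```python
# CPU model of the 32-lane butterfly sum: on each pass, lane i adds the
# value held by lane i ^ d, mirroring __shfl_xor_sync(0xffffffff, v, d).
def warp_butterfly_sum(vals):
    assert len(vals) == 32
    v = list(vals)
    d = 16
    while d > 0:
        v = [v[i] + v[i ^ d] for i in range(32)]
        d >>= 1
    return v  # every lane ends up holding the full sum

lanes = warp_butterfly_sum(list(range(32)))
assert lanes == [sum(range(32))] * 32
```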
| function | returns | meaning |
|---|---|---|
| __all_sync | int 0/1 | pred true on every lane |
| __any_sync | int 0/1 | pred true on any lane |
| __ballot_sync | uint32 | per-lane pred as a bit |
mask = __ballot_sync(0xffffffff, x > 0) → __popc(mask) = number of lanes with positive x.
__match_any_sync(mask, val) → bitmask of lanes holding the same val. __match_all_sync(mask, val, *pred) → mask if every lane holds the same val, else 0.
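The peer-grouping semantics of __match_any_sync are easy to model on the CPU (Python sketch; `match_any` is an illustrative name):

```python
# CPU model of __match_any_sync: for each lane, build the bitmask of
# all lanes holding the same value.
def match_any(vals):
    n = len(vals)
    masks = []
    for i in range(n):
        m = 0
        for j in range(n):
            if vals[j] == vals[i]:
                m |= 1 << j
        masks.append(m)
    return masks

m = match_any([7, 3, 7, 3])
assert m[0] == 0b0101   # lanes 0 and 2 hold 7
assert m[1] == 0b1010   # lanes 1 and 3 hold 3
```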
__activemask(): queries the bitmask of currently convergent lanes. __syncwarp(mask): barrier among lanes within a warp.

| situation | mask construction |
|---|---|
| full warp | 0xffffffff |
| divergent if | __ballot_sync(0xffffffff, cond) |
| active lanes only | __activemask() |
| 16-lane sub-warp | 0x0000ffff etc. |
```
float warp_sum(float v) {
    for (int d = 16; d > 0; d >>= 1)
        v += __shfl_down_sync(0xffffffff, v, d);
    return __shfl_sync(0xffffffff, v, 0);  // broadcast lane 0's total
}
```
PTX: shfl.sync.{idx,up,down,bfly}.b32 · vote.sync.{all,any,ballot}.pred · match.any.sync.b{32,64}. Use __ballot_sync to pass only the actually active lanes as the mask.
| type | size | sync scope |
|---|---|---|
| thread_block | ≤1024 | __syncthreads |
| thread_block_tile<N> | N ∈ {2,4,8,16,32} | warp-local |
| coalesced_group | varies | active lanes |
| cluster_group | ≤16 blocks | cluster.sync (sm_90) |
| grid_group | entire grid | cooperative launch |
Header: <cooperative_groups.h>; conventional alias: namespace cg = cooperative_groups.
```
namespace cg = cooperative_groups;
auto block = cg::this_thread_block();
auto warp  = cg::tiled_partition<32>(block);
auto quad  = cg::tiled_partition<4>(warp);
block.sync();        // == __syncthreads()
warp.thread_rank();  // 0..31
warp.size();         // 32
```
| function | meaning |
|---|---|
| cg::reduce(t, v, op) | reduction within the tile |
| cg::exclusive_scan | exclusive scan |
| cg::inclusive_scan | inclusive scan |
| t.ballot(pred) | returns bitmask |
| t.shfl(v, src) | tile-local shfl |
coalesced_group is created with cg::coalesced_threads(). grid_group requires cudaLaunchCooperativeKernel:

```
auto grid = cg::this_grid();
grid.sync();  // wait for every block
```
Enables persistent kernels and single-kernel multi-pass reductions. ↗ §9
Clusters: declare with __cluster_dims__(x,y,z) or via the launch API; cluster.sync() is a barrier across all blocks in the cluster. cg::labeled_partition(warp, key) → groups lanes by key, built on match_any.
Used for histogram privatization, sort-within-warp, etc.
| # | technique | key one-liner |
|---|---|---|
| 1 | Naive interleaved | if(tid%(2*s)==0) s[tid]+=s[tid+s] |
| 2 | Interleaved fix | i = 2*s*tid; s[i]+=s[i+s] |
| 3 | Sequential addr | if(tid<s) s[tid]+=s[tid+s] |
| 4 | First-add on load | s[tid]=g[i]+g[i+BS] |
| 5 | Unroll last warp | volatile unroll over the tid<32 range |
| 6 | Warp shuffle | v+=__shfl_down_sync(m,v,d) |
| 7 | Multi-block + CUB | cub::DeviceReduce::Sum |
Based on PMPP Ch. 10 / Harris's reduction talk. Step numbering follows convention.
```
for (int s = 1; s < BS; s *= 2) {
    if (tid % (2*s) == 0) sd[tid] += sd[tid + s];
    __syncthreads();
}
```
Half the lanes in each warp go idle → SIMT divergence.
```
for (int s = 1; s < BS; s *= 2) {
    int i = 2 * s * tid;
    if (i < BS) sd[i] += sd[i + s];
    __syncthreads();
}
```
divergence ↓ but bank conflict (stride 2,4,8…).
```
for (int s = BS/2; s > 0; s >>= 1) {
    if (tid < s) sd[tid] += sd[tid + s];
    __syncthreads();
}
```
Conflicts ↓ · divergence ↓; active threads are packed into the leading warps.
```
int i = blockIdx.x * (2*BS) + tid;
sd[tid] = g[i] + g[i + BS];  // grid halved
__syncthreads();
```
2× global-load efficiency, grid half the size.
```
if (tid < 32) {
    volatile float* v = sd;
    v[tid] += v[tid+32]; v[tid] += v[tid+16];
    v[tid] += v[tid+8];  v[tid] += v[tid+4];
    v[tid] += v[tid+2];  v[tid] += v[tid+1];
}
```
Eliminates __syncthreads inside the final warp; volatile is required so the stores bypass register caching.
```
float v = sd[tid];
for (int d = 16; d > 0; d >>= 1)
    v += __shfl_down_sync(0xffffffff, v, d);
if (tid == 0) out[blockIdx.x] = v;
```
No shared-memory round trip · registers only · Kepler+.
Cross-block finish: atomicAdd into a global sum, grid.sync() for a single-kernel version, or cub::DeviceReduce::Sum / thrust::reduce.

| algorithm | work W | step D | note |
|---|---|---|---|
| Sequential | O(N) | O(N) | baseline |
| Hillis-Steele | O(N log N) | O(log N) | warp scan |
| Kogge-Stone | O(N log N) | O(log N) | regular |
| Brent-Kung | O(N) | O(2 log N) | work-efficient ★ |
| Sklansky | O(N log N) | O(log N) | divide & conquer |
Hillis-Steele and Kogge-Stone belong to the same family; the structure is identical.
```
for (int o = 1; o < BS; o *= 2) {
    float v = 0;
    if (tid >= o) v = sd[tid - o];
    __syncthreads();
    if (tid >= o) sd[tid] += v;
    __syncthreads();
}
```
W = N log N · D = log N · all lanes active at every step.
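The same doubling-offset scan, sketched on the CPU (Python; `hillis_steele_scan` is an illustrative name):

```python
# CPU model of the Hillis-Steele inclusive scan: O(N log N) work,
# O(log N) steps; every element participates at every step.
def hillis_steele_scan(a):
    a = list(a)
    o = 1
    while o < len(a):
        a = [a[i] + (a[i - o] if i >= o else 0) for i in range(len(a))]
        o *= 2
    return a

assert hillis_steele_scan([1, 2, 3, 4]) == [1, 3, 6, 10]
```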
[ L ][ R ] → scan(L), scan(R)
→ R += last(L)
step = log N; all operations at a level are independent → watch for bank conflicts.
```
float v = x;
for (int d = 1; d < 32; d *= 2) {
    float u = __shfl_up_sync(0xffffffff, v, d);
    if (lane >= d) v += u;
}
```
inclusive warp scan · 32-wide · 5-step.
```
blk0    blk1    blk2
 ↓       ↓       ↓
local   local   local
 Σ0      Σ1      Σ2
 └── scan aux ───┘
 +0     +Σ0    +Σ0+Σ1
```
Applications: SpMV reduce, RLE, hierarchical reduction.
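The three-phase flow above (local scans, scan of the block sums, offset add) can be sketched on the CPU (Python; names illustrative):

```python
# CPU sketch of block-level scan: inclusive scan per block, exclusive
# scan of the block sums (the "aux" array), then add each block's offset.
def block_scan(data, bs):
    blocks = [data[i:i+bs] for i in range(0, len(data), bs)]
    local = []
    for blk in blocks:                 # phase 1: local inclusive scans
        s, out = 0, []
        for x in blk:
            s += x
            out.append(s)
        local.append(out)
    sums = [b[-1] for b in local]      # Σ0, Σ1, ...
    offs, acc = [], 0
    for s in sums:                     # phase 2: exclusive scan of aux
        offs.append(acc)
        acc += s
    # phase 3: add per-block offsets
    return [x + o for b, o in zip(local, offs) for x in b]

assert block_scan([1, 1, 1, 1, 1, 1], 2) == [1, 2, 3, 4, 5, 6]
```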
| level | storage | copies |
|---|---|---|
| L0 | global | 1 |
| L1 | shared | per block |
| L2 | register | per thread |
| L3 | hybrid | hot bin reg + cold smem |
```
__shared__ int sh[K];
// init sh[] to 0 ...
for (...) atomicAdd(&sh[b], 1);
__syncthreads();
// flush to global
if (tid < K) atomicAdd(&hist[tid], sh[tid]);
```
```
int b = bin(x);
int peers  = __match_any_sync(0xffffffff, b);
int cnt    = __popc(peers);
int leader = __ffs(peers) - 1;
if (lane == leader) atomicAdd(&sh[b], cnt);
```
Reduces contention from per-element to per-warp: one atomic per distinct bin per warp.
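The effect of warp aggregation on atomic traffic, modeled on the CPU (Python; `warp_aggregate` is an illustrative name):

```python
# CPU model of warp-aggregated atomics: lanes with the same bin form a
# peer group; only the leader issues one atomicAdd with the group count.
from collections import Counter

def warp_aggregate(bins):
    atomics = Counter()
    ops = 0
    for b, cnt in Counter(bins).items():  # one leader per peer group
        atomics[b] += cnt                 # atomicAdd(&sh[b], cnt)
        ops += 1
    return atomics, ops

hist, n_atomics = warp_aggregate([3, 3, 3, 7])
assert hist[3] == 3 and hist[7] == 1
assert n_atomics == 2   # 2 atomics instead of 4
```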
| situation | recommendation |
|---|---|
| small K (≤256) | shared privatization |
| large K (≫ smem) | global + aggregation |
| heavy skew | sort-then-reduce |
| large N · medium K | hybrid (hot bins in reg) |
| library | cub::DeviceHistogram |
cub::DeviceHistogram::HistogramEven · HistogramRange · MultiHistogramEven (multi-channel).

| method | complexity | traits |
|---|---|---|
| bitonic | O(N·K) | fast for small K |
| radix | O(N·b) | medium K · float |
| full sort + slice | O(N log N) | large K · multiple Tops |
b = 4 or 8 bits per digit. Floats need the IEEE monotone bit reorder (sign and exponent handling) before radix select.
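The standard monotone reorder for IEEE-754 floats: flip the sign bit of non-negatives, flip all bits of negatives, so unsigned integer order matches float order (Python sketch; `float_key` is an illustrative name):

```python
# Bit trick for radix-sorting floats: after this transform, comparing
# the 32-bit keys as unsigned ints gives the same order as the floats.
import struct

def float_key(f):
    u = struct.unpack('<I', struct.pack('<f', f))[0]
    # negative: flip all bits; non-negative: flip just the sign bit
    return u ^ 0xFFFFFFFF if u & 0x80000000 else u | 0x80000000

xs = [-1.5, -0.0, 0.0, 2.25, -3.0]
assert sorted(xs, key=float_key) == sorted(xs)
```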
| requirement | API |
|---|---|
| exact Top-K | cub::DeviceRadixSort::SortKeys + slice |
| small K · fast | cub::BlockRadixSort |
| approx · sampling | Gumbel top-K · categorical |
| Thrust | thrust::sort + copy_n |
```
K ≤ 32   ──► warp bitonic
K ≤ 256  ──► block bitonic
medium K ──► radix select
large K  ──► full sort
```
| scheme | writeback | cost |
|---|---|---|
| A. atomic | atomicAdd(C) | contention |
| B. buffer + reduce | C_part[S] → separate reduce kernel | 2× memory · deterministic |
```
// slice s, tile (m,n)
accum = 0;
for (k = s*kseg; k < (s+1)*kseg; k += BK)
    accum += A[m, k:k+BK] · B[k:k+BK, n];
atomicAdd(&C[m,n], accum);
```
One atomic per tile. FP32 can use the hardware atomic.
```
// stage 1: partials
for (s, m, n) C_part[s,m,n] = A_s · B_s;
// stage 2: reduce
C[m,n] = Σ_s C_part[s,m,n];
```
Stage 2 is an O(S·M·N) reduce kernel; the FP summation order is deterministic.
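One dot product under the two-stage scheme, sketched on the CPU (Python; `splitk` is an illustrative name):

```python
# CPU sketch of buffer-based split-K for a single output element:
# each slice computes its partial over a K segment (stage 1), then a
# fixed-order reduce combines them (stage 2) → deterministic FP order.
def splitk(A_row, B_col, S):
    K = len(A_row)
    seg = K // S                      # assume S divides K for simplicity
    partials = [sum(A_row[k] * B_col[k] for k in range(s*seg, (s+1)*seg))
                for s in range(S)]    # stage 1: C_part[s]
    return sum(partials)              # stage 2: fixed-order reduce

assert splitk([1, 2, 3, 4], [1, 1, 1, 1], S=2) == 10
```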
Pick S so that #CTA · S ≳ 2 · #SM.

| factor | recommendation |
|---|---|
| FP32 · large tiles | atomic |
| FP16/FP8 accum | buffer (HW atomic limited) |
| determinism required | buffer |
| memory tight | atomic |
| small S (2-4) | atomic |
| large S (≥8) | buffer + tree reduce |
CUTLASS: SplitKSerial (buffer + reduce) · SplitKParallel (atomic).

```
iter:  0 1 2 3 4 5 6 7 8 9 ...
CTA0: [ 0 .. 3 ]
CTA1:          [ 4 .. 6 ]
CTA2:                   [ 7 .. 9 ]
        ↑ splits at tile boundaries
```
Osama et al. 2023, Stream-K; officially integrated into CUTLASS 3.x.
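The flat split shown in the iteration diagram can be sketched on the CPU (Python; `streamk_ranges` is an illustrative name):

```python
# Sketch of Stream-K's flat partition: the MAC iterations of all output
# tiles are concatenated and split as evenly as possible over CTAs, so
# CTA boundaries may fall mid-tile (those tiles need a fix-up reduce).
def streamk_ranges(total_iters, n_ctas):
    base, rem = divmod(total_iters, n_ctas)
    ranges, start = [], 0
    for c in range(n_ctas):
        n = base + (1 if c < rem else 0)  # first `rem` CTAs take one extra
        ranges.append((start, start + n))
        start += n
    return ranges

# matches the diagram: CTA0 [0..3], CTA1 [4..6], CTA2 [7..9]
assert streamk_ranges(10, 3) == [(0, 4), (4, 7), (7, 10)]
```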
| mode | description |
|---|---|
| Data-parallel | classic tile-per-CTA (parallel over M·N) |
| Split-K | S-way split of the K dimension |
| Stream-K | flat split of the iteration space · hybrid |
In practice Stream-K often mixes data-parallel CTAs with Stream-K CTAs.
```
// fix-up per partial slab
for (tile) C[tile] = Σ partials[tile][*];
```
Partials live on-chip or in a global workspace; dependencies are tracked with semaphores/flags.
| situation | choice |
|---|---|
| M·N ≫ #SM, moderate K | Data-parallel |
| small M·N, large K | Split-K |
| tile count poorly matched to #SM | Stream-K |
| long tail wave | Stream-K |
| determinism + simplicity | Split-K buffer |
| generic dispatch | Stream-K (CUTLASS 3.x) |
CUTLASS: TileScheduler abstraction · StreamKScheduler dispatch policy.

```
__global__ void persist(..., int* ctr) {
    while (1) {
        int id = (tid == 0) ? atomicAdd(ctr, 1) : 0;
        id = __shfl_sync(0xffffffff, id, 0);
        if (id >= N_WORK) break;
        process_work(id);
    }
}
```
| pattern | description |
|---|---|
| Dynamic | pull via atomic counter |
| Static block | id = bid·stride + i |
| Hybrid | large chunks static + tail dynamic |
With cooperative launch, the grid must not exceed the number of blocks the GPU can run concurrently.
| situation | fit? |
|---|---|
| uniform work + many tiles | prefer DP |
| uneven work | persistent |
| tail-wave problem | persistent |
| too many launches (batch) | persistent |
| long decoding loop | persistent |
Reset the work counter each invocation with cudaMemsetAsync.

| stages | purpose |
|---|---|
| 2 | double buffer |
| 3-4 | standard pipelined GEMM |
| ≥5 | high HBM latency · Hopper TMA |
More stages → more smem → lower occupancy: a trade-off.
| mechanism | arch |
|---|---|
| __syncthreads | all archs |
| cp.async.commit/wait | Ampere+ |
| mbarrier async | Hopper ↗ V04 §6 |
```
// producer (warp P)
while ((head+1) % N == tail) ;   // full → spin
buf[head] = load_tile();
__threadfence_block();
head = (head+1) % N;

// consumer (warp C)
while (head == tail) ;           // empty → spin
t = buf[tail];
compute(t);
tail = (tail+1) % N;
```
head/tail live in shared memory; each is written only by its owning warp. Memory fences provide the ordering.
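The single-producer/single-consumer invariant can be exercised with CPU threads (Python sketch; the full/empty tests mirror the spin loops in the fragment above):

```python
# CPU model of the SPSC ring: head is written only by the producer,
# tail only by the consumer; one slot is left empty to distinguish
# full from empty.
import threading

N = 4
buf = [None] * N
head = tail = 0
out = []

def producer(items):
    global head
    for x in items:
        while (head + 1) % N == tail:   # full → spin
            pass
        buf[head] = x
        head = (head + 1) % N

def consumer(n):
    global tail
    for _ in range(n):
        while head == tail:             # empty → spin
            pass
        out.append(buf[tail])
        tail = (tail + 1) % N

items = list(range(10))
p = threading.Thread(target=producer, args=(items,))
c = threading.Thread(target=consumer, args=(len(items),))
p.start(); c.start(); p.join(); c.join()
assert out == items                      # FIFO order preserved
```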
```
warp 0-3  : producer (TMA load)
warp 4-7  : consumer A (WGMMA)  ┐
warp 8-11 : consumer B (WGMMA)  ┘ ping-pong
```
On Hopper, setmaxnreg reallocates registers: fewer for producers, more for consumers.
| family | complexity | GPU fit |
|---|---|---|
| Bitonic | O(N log²N) | regular pattern ○ |
| Merge (path) | O(N log N) | balanced ○ |
| Radix | O(N·b) | int/float best ★ |
```
stage k (k = 1..log N):
  substage j (j = k..1):
    compare-swap(i, i ^ (1 << (j-1)))
```
```
for (k = 2; k <= N; k *= 2)
  for (j = k/2; j > 0; j >>= 1) {
    p = tid ^ j;
    if (p > tid) {
      up = ((tid & k) == 0);
      if ((a[tid] > a[p]) == up) swap(a[tid], a[p]);
    }
    __syncthreads();
  }
```
Data-independent compare-swap schedule → low warp divergence. Requires power-of-2 sizes.
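The same network, ported to the CPU for verification (Python; `bitonic_sort` is an illustrative name):

```python
# CPU port of the bitonic network above: the compare-swap schedule is
# fixed by (k, j, i) alone, independent of the data; power-of-2 length.
def bitonic_sort(a):
    a = list(a)
    n = len(a)
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):
                p = i ^ j
                if p > i:
                    up = (i & k) == 0
                    if (a[i] > a[p]) == up:
                        a[i], a[p] = a[p], a[i]
            j //= 2
        k *= 2
    return a

assert bitonic_sort([7, 3, 5, 8, 6, 2, 1, 4]) == [1, 2, 3, 4, 5, 6, 7, 8]
```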
↗ V01 §13; PMPP Ch. 12 defines the co-rank function.
input → digit(input) → local hist
→ scan hist → global hist
→ scan global → offsets
→ scatter to output
Reuses Scan · Histogram ↗ §4, §5.
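The pipeline above, as a single LSD pass plus a driver loop (Python sketch; 4-bit digits, names illustrative):

```python
# One LSD radix pass: digit → histogram → exclusive scan → stable
# scatter; repeating the pass from low to high digits sorts the keys.
def radix_pass(keys, shift, bits=4):
    nbins = 1 << bits
    hist = [0] * nbins
    for k in keys:                               # histogram of digits
        hist[(k >> shift) & (nbins - 1)] += 1
    offs, acc = [0] * nbins, 0
    for b in range(nbins):                       # exclusive scan → offsets
        offs[b] = acc
        acc += hist[b]
    out = [0] * len(keys)
    for k in keys:                               # stable scatter
        d = (k >> shift) & (nbins - 1)
        out[offs[d]] = k
        offs[d] += 1
    return out

def radix_sort(keys, key_bits=16, bits=4):
    for shift in range(0, key_bits, bits):       # LSD: low digit first
        keys = radix_pass(keys, shift, bits)
    return keys

assert radix_sort([514, 3, 77, 3000, 12]) == [3, 12, 77, 514, 3000]
```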
The standard preprocessing for applying radix sort to float/double keys.
| factor | choice |
|---|---|
| int/float keys | Radix ★ |
| complex struct keys | Merge |
| stability required | both can be stable |
| key length ≫ 64b | Merge favored |
| GPU default | CUB Radix |
Also: cub::DeviceSegmentedRadixSort · the cub::Device*Sort family.
| op | formula | output |
|---|---|---|
| SpMV | y = A·x | dense vector |
| SpMM | Y = A·B | dense matrix |
| SDDMM | C = (A·B) ⊙ M | masked dense |
A: sparse · B,x,Y: dense · M: sparsity mask.
| format | layout | traits |
|---|---|---|
| COO | (row, col, val) | arbitrary order |
| CSR | row_ptr, col_idx, val | row-contiguous ★ |
| CSC | col_ptr, row_idx, val | col-contiguous |
| ELL | padded fixed-width rows | regular rows |
| strategy | granularity | fit |
|---|---|---|
| scalar | 1 thread / row | short rows |
| vector | 1 warp / row | medium rows |
| CSR-stream | block stride | uniform distribution |
| merge-based | even nnz split | heavy skew ★ |
```
// vector CSR SpMV: one warp (32 threads) per row
r = blockIdx.x;
float s = 0;
for (j = row_ptr[r] + lane; j < row_ptr[r+1]; j += 32)
    s += val[j] * x[col_idx[j]];
// warp reduce
for (d = 16; d > 0; d >>= 1)
    s += __shfl_down_sync(0xffffffff, s, d);
if (lane == 0) y[r] = s;
```
Merrill & Garland 2016; robust on heavily skewed graphs.
Row-by-row × B-column-tile is the common SpMM pattern; cuSPARSE's default path.
cuSPARSE: CSR/COO SpMV/SpMM · cuSPARSELt: 2:4 structured · CUB: block primitives · DGL/PyG: graph frontends.

| scope | prefix | use |
|---|---|---|
| Device | cub::Device* | full array |
| Block | cub::Block* | CTA cooperation |
| Warp | cub::Warp* | warp cooperation |
CUB = "CUDA UnBound". NVIDIA/cub on GitHub.
| op | API |
|---|---|
| reduce | DeviceReduce::{Sum,Min,Max,Reduce} |
| scan | DeviceScan::{Inclusive,Exclusive}Sum |
| sort | DeviceRadixSort::Sort{Keys,Pairs} |
| seg reduce | DeviceSegmentedReduce::Sum |
| select | DeviceSelect::{If,Unique,Flagged} |
| histogram | DeviceHistogram::Histogram{Even,Range} |
| run-length | DeviceRunLengthEncode::Encode |
```
void*  d_temp     = nullptr;
size_t temp_bytes = 0;
cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, N);  // query
cudaMalloc(&d_temp, temp_bytes);
cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, N);  // run
```
First call: queries the required temp size. Second call: actual execution.
| class | role |
|---|---|
| BlockReduce<T,BS> | block sum/min/max |
| BlockScan<T,BS> | in/exclusive scan |
| BlockRadixSort | block-local sort |
| BlockLoad/Store | coalesced transfer |
Warp level: WarpReduce<T> · WarpScan<T>.

| Thrust | CUB |
|---|---|
thrust::reduce | DeviceReduce::Sum |
thrust::inclusive_scan | DeviceScan::InclusiveSum |
thrust::sort | DeviceRadixSort |
thrust::copy_if | DeviceSelect::If |
thrust::unique | DeviceSelect::Unique |
thrust::transform | kernel (no CUB) |
| lib | strength |
|---|---|
| cuBLAS | dense GEMM ↗ V06 |
| cuSPARSE | sparse ↗ §12 |
| cuRAND | PRNG |
| cuDNN | DL primitive |
| cuSOLVER | dense solve |
| primitive | work | depth |
|---|---|---|
| Reduction | O(N) | O(log N) |
| Scan (BK) | O(N) | O(log N) |
| Scan (KS) | O(N log N) | O(log N) |
| Radix sort | O(N·b) | O(b) |
| Bitonic sort | O(N log²N) | O(log²N) |
| Merge path | O(N+M) | O(log N) |
| Histogram | O(N) | O(log N) |
| SpMV | O(nnz) | O(log(row)) |
| intrinsic | use |
|---|---|
| shfl_down | reduce |
| shfl_up | scan |
| shfl_xor | butterfly |
| ballot | mask collection |
| match_any | peer grouping |
| popc(mask) | bit count |
| algo | W | D |
|---|---|---|
| Hillis-Steele | N log N | log N |
| Kogge-Stone | N log N | log N |
| Brent-Kung | 2N | 2 log N |
| Sklansky | N log N | log N |
| shape | strategy |
|---|---|
| large M,N | Data-parallel |
| large K · small M,N | Split-K |
| tail imbalance | Stream-K |
| many batched | Grouped |
| uneven tiles | Persistent |
```
Task?
├ scalar out   ──► Reduction
├ per-elem out ──► Scan
├ bin count    ──► Histogram
├ sorted perm  ──► Radix sort
├ k largest    ──► Top-K (radix)
├ y = A·x      ──► SpMV (merge)
└ mask sel     ──► Compact (scan)
```