《GPU Mode》 L020 Scan Algorithm — Part 1 High priority transcript · slides · available

Scan Algorithm

Sequential 코드에서 가장 빼기 어려운 패턴 — out[i] = out[i-1] + in[i] 의 loop-carried dependency. 그 obvious 한 sequential 을 parallel 로 어떻게 바꾸는가. Beirut 미국대학교의 Izzat El Haj 가 단일 thread block 안에서 Hillis-Steele · Brent-Kung 두 알고리즘을 깐 part 1 의 학습 노트.

prefix sum Hillis-Steele Brent-Kung (Blelloch) work efficiency step efficiency double buffering __syncthreads() single-block scan

Speaker

Izzat El Haj

Professor · American University of Beirut · PMPP co-author

강의 번호

L020

스피커

Izzat El Haj

학습 우선순위

High · 정독

다시 볼 때

scan kernel 직접 구현

하위 목차 · 12개 섹션↓ 클릭해서 이동

01강의가 풀려는 문제why scan? 02scan 의 정의inclusive · exclusive 03sequential 의 loop-carried dep왜 어려운가 04Hillis-Steele scan단순한 doubling 05Brent-Kung (Blelloch)work-efficient tree 06work vs step efficiency두 비용 축 07double bufferingin-place 의 함정 08__syncthreads + memory modelindependent thread scheduling 09scan 의 응용 — radix · compaction왜 이 알고리즘이 중요한가 10기억할 메모와 코드 자료key takeaways 11다른 강의로 이어지는 길connections 12열린 질문open questions

§ 01강의가 풀려는 문제· why scan?

“가장 sequential 해 보이는 작업을 parallel 로 바꾸기”

Izzat 의 강의는 PMPP 교과서 7장 의 raw 영상 버전으로 보면 거의 정확하다. 다른 chapter 와 달리 scan 은 알고리즘이 직관과 정반대다 — sequential 로 짜면 5줄, parallel 로 잘 짜려면 한 chapter 가 든다.

강의가 깐 큰 질문 두 개.

왜 scan 이 중요한가 — radix sort, stream compaction, segmented reduction, dynamic memory allocation, sparse matrix 처리 등 거의 모든 GPU 데이터 알고리즘의 빌딩블록. L019 의 CSV 파서가 그 사용 사례 중 하나.
obvious 한 sequential 을 어떻게 parallel 로 짜는가 — 두 가지 핵심 알고리즘 (Hillis-Steele, Brent-Kung) 이 답을 준다. 같은 결과, 다른 trade-off.

강의의 인지적 frame

Izzat 은 part 1 에서 “single thread block 안에서만” 의 scan 을 다룬다. multi-block scan, decoupled lookback 같은 production 알고리즘은 다음 강의 L021 · 그리고 더 깊게 L024 의 영역. 이 part 1 은 알고리즘의 골격을 깐다.

“scan 은 GPU 컴퓨팅의 기본기다 — sort, compaction, reduce, parsing, sparse — 거의 모든 곳에 들어 있다.”Izzat El Haj (요약)

§ 02scan 의 정의· inclusive · exclusive

같은 데이터, 두 가지 출력 — 한 칸의 차이

scan = prefix sum 의 일반화. binary 결합 연산 ⊕ 와 입력 [x0, x1, ..., x_{n-1}] 가 주어지면, 출력은 위치 i 의 누적값. 두 변형이 있다.

FIG · inclusive vs exclusive scan (⊕ = +)같은 입력의 두 출력

입력 31704163

inclusive 34111115162225

exclusive 0341111151622

inclusive 의 i 번째 = x0 ⊕ … ⊕ x_i (자기 포함). exclusive 의 i 번째 = x0 ⊕ … ⊕ x_{i-1} (자기 제외, 첫 칸은 identity 0). 둘은 한 칸 shift 차이. radix sort 같은 알고리즘은 exclusive 가 자연스럽고, running average 같은 작업은 inclusive 가 자연스럽다.

일반적으로 ⊕ 는 associative (결합법칙) 면 충분하다 — commutative 일 필요는 없다. min, max, or, and, 행렬 곱 모두 가능. associative 하지 않은 op (예: subtraction) 는 scan 으로 표현할 수 없다.

§ 03sequential 의 loop-carried dep· 왜 어려운가

5줄짜리 sequential 코드 — 그게 곧 문제다

// sequential — 단순하다 못해 자명한 코드
void scan_seq(const float* in, float* out, int n) {
  out[0] = in[0];
  for (int i = 1; i < n; i++) {
    out[i] = out[i-1] + in[i];   // ← 의존
  }
}

loop iteration i 가 i-1 의 결과를 본다. iteration 들이 직렬로 흐를 수밖에 없다. 이게 GPU 가 못 푸는 패턴 — thread 들이 서로 다른 iteration 을 잡으려면 그 사이에 결과를 주고받아야 한다.

n 개 element 에 대해:

work = n-1 개 add 연산
step = n-1 개 (직렬)
parallelism = 0 (순차)

parallel 알고리즘의 목표 — 같은 work 를 유지하면서 step 을 log(n) 까지 내리는 것. 또는 work 를 약간 늘리는 대신 step 을 더 내리는 것.

왜 scan 이 “직관과 반대” 인가

Brent–Kung 같은 work-efficient scan 은 sequential 코드와 정확히 같은 work (n-1 op) 를 하면서 step 만 2log(n) 으로 줄인다. “같은 일을, 다른 순서로” 라는 GPU 알고리즘 디자인의 표준 패턴.

§ 04Hillis-Steele scan· 단순한 doubling

“멀리 있는 element 를 한 칸씩 더 멀리 본다” — log(n) step

가장 단순한 parallel scan. step k 에서 thread i 는 자기 값에 2^k 떨어진 element 를 더한다. log(n) step 후 모든 thread 가 자기 자리의 prefix sum 을 들고 있다.

// Hillis-Steele — block 단위 scan (shared memory 위)
__global__ void hillis_steele(const float* in, float* out, int n) {
  __shared__ float buf[2][BLOCK];     // double buffer
  int tid = threadIdx.x;
  buf[0][tid] = in[tid];
  __syncthreads();

  int read = 0, write = 1;
  for (int stride = 1; stride < n; stride *= 2) {
    if (tid >= stride)
      buf[write][tid] = buf[read][tid] + buf[read][tid - stride];
    else
      buf[write][tid] = buf[read][tid];
    __syncthreads();
    read ^= 1; write ^= 1;
  }
  out[tid] = buf[read][tid];
}

FIG · Hillis-Steele 의 step 들 (n=8)stride 1, 2, 4

step 0 abcdefgh

step 1 aa..bb..cc..dd..ee..ff..gg..h

step 2 aa..ba..ca..db..ec..fd..ge..h

step 3 aa..ba..ca..da..ea..fa..ga..h

3 step 만에 모두 자기 prefix sum. 일반화하면 step = log₂(n). 그런데 work 는 n·log(n) — 매 step 마다 거의 모든 thread 가 add 한다. sequential 의 n 보다 log(n) 배 많은 일.

언제 Hillis-Steele 가 이기는가

n 이 작을 때(예: 32, warp 크기). 코드가 매우 단순하고 control divergence 가 거의 없어서 instruction 효율이 높다. __shfl_up_sync 같은 warp shuffle 로 구현하면 shared memory 도 안 쓰고 warp 안에서 끝난다 — § 21 에서 다시 등장.

§ 05Brent-Kung (Blelloch) scan· work-efficient tree

tree 위로 reduce → tree 아래로 distribute — work O(n)

두 phase 알고리즘. up-sweep (reduce) — binary tree 를 위로 올라가면서 partial sum 을 모은다. down-sweep — 트리를 다시 내려오면서 누적값을 분배한다.

FIG · Brent-Kung up-sweep (n=8)partial sum 누적

log(n) step. 각 step 의 work 가 절반씩 줄어 — 총 work = n−1 (sequential 과 동일). 그 후 down-sweep 이 또 log(n) step 으로 누적값을 트리 아래로 뿌린다. 총 work O(n), 총 step 2log(n).

// Brent-Kung — 약식 (up-sweep)
for (stride = 1; stride < BLOCK; stride *= 2) {
  int idx = (tid + 1) * stride * 2 - 1;
  if (idx < BLOCK) buf[idx] += buf[idx - stride];
  __syncthreads();
}
// down-sweep (exclusive 의 경우 root 를 0 으로 set 후)
for (stride = BLOCK/2; stride > 0; stride /= 2) {
  int idx = (tid + 1) * stride * 2 - 1;
  if (idx + stride < BLOCK) {
    float t = buf[idx];
    buf[idx] += buf[idx - stride];
    buf[idx - stride] = t;
  }
  __syncthreads();
}

§ 06work vs step efficiency· 두 비용 축

같은 결과, 두 가지 가격표 — 알고리즘을 비교하는 정식 frame

PMPP 가 가르치는 정식 frame. work efficiency = 알고리즘이 sequential 과 같은 op 수를 쓰는가. step efficiency = critical path 의 길이가 얼마나 짧은가. 두 축은 trade-off — Hillis-Steele 은 step 이 짧지만 work 가 많다.

Hillis-Steele

total workO(n log n)

step (depth)log₂(n)

work-efficient?no

step-efficient?yes

코드 복잡도simple

언제 좋은가n 작을 때

Brent-Kung (Blelloch)

total workO(n)

step (depth)2·log₂(n)

work-efficient?yes

step-efficient?2× 의 step

코드 복잡도more

언제 좋은가n 클 때

실전의 진실

둘 다 “순수” 형태로는 거의 안 쓰인다. 대부분의 production 코드는 warp-level Hillis-Steele + block-level Brent-Kung + grid-level decoupled lookback 의 hierarchical 합성. 이 합성이 L021 의 main 주제. 그리고 L024 가 “speed of light” 까지 짜내는 영역.

“sequential 과 같은 work 를 유지하면서 step 만 줄이는 게 좋은 GPU 알고리즘이다 — Brent-Kung 이 그 정신을 가장 깔끔하게 보여준다.”Izzat El Haj (요약)

§ 07double buffering· in-place 의 함정

같은 buffer 에 read 와 write 가 섞이면 race

강의에서 Izzat 이 직접 보여준 작은 “함정”. naive 한 Hillis-Steele 구현은 buf[tid] = buf[tid] + buf[tid - stride] — 같은 buffer 에 읽고 쓴다. 어떤 thread 가 자기 자리를 update 한 뒤, 다른 thread 가 그 update 된 값을 읽으면 결과가 망가진다.

// 잘못된 in-place — race condition
if (tid >= stride)
  buf[tid] += buf[tid - stride];   // race!
__syncthreads();

해결은 double buffering — read 와 write 를 다른 buffer 로 분리, 매 step 마다 swap. buf[2][BLOCK] 으로 두 개를 두고 read, write index 를 toggle.

trade-off

shared memory 사용량이 2배. 작은 BLOCK 에서는 문제 없고, 큰 BLOCK (1024+) 에서는 occupancy 에 영향. Brent-Kung 은 in-place 가 안전하다 — read 와 write 의 자리가 step 별로 disjoint 하기 때문.

§ 08__syncthreads + memory model· independent thread scheduling

같은 warp 안 thread 도 “SIMD 처럼 동시에 도는 것” 이 더 이상 보장되지 않는다

강의의 미묘한 디테일. Volta 이전 GPU 에서는 같은 warp 의 32 thread 가 lockstep 으로 도는 게 자연스러운 가정이었다 — warp-internal 동기화가 implicit. Volta 부터 independent thread scheduling 이 들어가서 — warp 안 thread 들이 더 이상 lockstep 으로 도는 게 보장되지 않는다.

Volta+ 에서 안전하게 짜는 법

warp 안에서 다른 thread 가 쓴 결과를 읽으려면 — (1) __syncwarp() 로 명시적 barrier, (2) __shfl_*_sync 같은 sync 가 명시된 collective 만 사용. compiler 도 memory 명령 reorder 를 더 적극적으로 한다 — volatile 또는 __threadfence_block() 가 필요한 경우 늘어남.

그리고 강의의 또 한 자리 — 같은 warp 안 thread 들 사이에서도 ordering 보장이 약해서 “thread A 가 buf[5] 에 쓰고 thread B 가 그 직후 buf[5] 를 읽는다” 같은 패턴은 __syncthreads() 또는 __syncwarp() 없이는 안전하지 않다. scan 처럼 stride 마다 의존성이 바뀌는 알고리즘에서 매 step 후 반드시 sync.

// 안전한 패턴 (Volta+)
for (stride = 1; stride < BLOCK; stride *= 2) {
  float v = (tid >= stride) ? buf[read][tid - stride] : 0.0f;
  __syncthreads();
  buf[write][tid] = buf[read][tid] + v;
  __syncthreads();
  read ^= 1; write ^= 1;
}

§ 09scan 의 응용· radix · compaction

왜 이 알고리즘이 GPU 컴퓨팅의 “핵심 5개 primitive” 중 하나인가

강의 끝부분에서 짧게 짚는 자리지만, scan 이 어디에 쓰이는지 이해하지 않으면 “왜 이렇게 깊이 들어가는가” 의 동기가 약해진다. 대표 응용 5가지.

stream compaction

filter — 조건을 만족하는 element 만 모아서 dense array. 각 element 가 “포함됨? 0/1” 의 mask 를 만든 뒤, mask 에 exclusive scan → 결과의 i 번째 자리. L019 의 SQL filter.

radix sort

매 bit 단계마다 0-bit 와 1-bit 를 분리. 그 분리가 정확히 stream compaction 두 번. radix sort 의 cost 가 거의 다 scan.

segmented reduction

서로 다른 길이의 segment 들의 합을 한 번에. segment boundary 를 mask 로 두고 inclusive scan-with-reset.

CSV / JSON 파싱

L019 § 04 — newline 누적이 row index, comma 누적이 col index. 파싱이 곧 scan.

dynamic memory alloc

각 thread 가 “나는 N byte 가 필요” 라고 말하면 — 그 N 들의 exclusive scan 이 곧 각 thread 의 시작 offset.

sparse matrix · CSR

CSR row offset 이 곧 row 별 nnz 의 prefix sum. matrix construction 의 첫 단계.

§ 10기억할 메모와 코드 자료· key takeaways

다시 열었을 때 5분 안에 손으로 잡혀야 할 것

inclusive vs exclusive

같은 데이터의 한 칸 shift. 거의 모든 알고리즘은 exclusive 가 자연 (radix, compaction, alloc).

Hillis-Steele

stride doubling. work O(n log n), step log(n). 코드 단순. 작은 n 에 좋음. warp-level 의 표준.

Brent-Kung

up-sweep + down-sweep. work O(n), step 2log(n). work-efficient. block-level 의 표준.

work vs step

두 비용 축. 알고리즘 비교의 frame. Hillis-Steele = step 우선, Brent-Kung = work 우선.

double buffering

in-place race 회피. shared memory 2배. Brent-Kung 은 in-place 안전 (자리가 disjoint).

__syncthreads() 매 step

stride 마다 의존성 바뀜 — 반드시 sync. Volta+ 에서는 warp-level 도 __syncwarp().

single-block 한계

이 강의는 한 block 안 (≤1024 element). 더 큰 array 는 part 2 (L021) 의 multi-block scan / decoupled lookback.

scan 의 응용

compaction, radix, segmented reduce, parsing, alloc, sparse — GPU 컴퓨팅의 거의 모든 곳.

YouTube youtube.com/watch?v=ZKrWyEqqPVY

Slides Google Slides — Scan part 1+2

교재 PMPP (Programming Massively Parallel Processors), 4th ed., Ch. 11 — Prefix Sum (Scan)

참고 구현 CUB — DeviceScan / BlockScan · Thrust — exclusive_scan

손에 새기기 — 실습 시퀀스

Hillis-Steele 직접 짜기 — n=1024 의 single-block scan. shared memory + double buffer. CPU sequential 결과와 비교.
Brent-Kung up/down sweep — index 계산 (idx = (tid+1) * 2*stride − 1) 을 종이에 그려보고 코드로. exclusive 와 inclusive 두 버전.
work-vs-step 측정 — 두 알고리즘의 add 호출 수를 카운트해서 sequential 대비 비율 확인. n=1024 면 Hillis-Steele 약 10×, Brent-Kung 약 1×.
volatile/__syncwarp 실험 — Hillis-Steele 의 inner loop 에서 sync 없이 돌려보고 결과가 nondeterministic 한지 확인.
warp shuffle 버전 — __shfl_up_sync 로 32-thread Hillis-Steele. shared memory 안 쓰고 register-only.
compaction 만들기 — random array 의 “positive only” compaction 을 mask + exclusive scan + scatter 로 직접.
NCU 비교 — 두 알고리즘의 instruction count, achieved occupancy, scoreboard stall. work-efficient 가 항상 빠른지 직접 검증.

§ 11다른 강의로 이어지는 길· connections

scan 이 다음 강의들에서 어떻게 다시 등장하는지

L021

Scan Algorithm Part 2

Izzat El Haj — multi-block scan, warp-level + block-level 합성. 이 강의의 직접 후속

L024

Scan at the Speed of Light

Jake Hemstad — speed-of-light analysis, decoupled lookback, CUB internals

L009

Reductions

— scan 의 표면적이 작아진 sibling. up-sweep 만 하면 reduction

L019

Data Processing on GPUs

Devavret Makkar — § 04 의 CSV 파서가 scan 의 최대 응용 사례

L001

Profiling CUDA kernels

— 두 scan 알고리즘의 NCU 비교에서 그 도구를 씀

L005

Going Further with CUDA

Jeremy Howard — shared memory, double buffer, vectorized load 의 실전 패턴

§ 12열린 질문· open questions

다음에 다시 들었을 때 직접 검증해야 할 것들

shared memory bank conflict — Brent-Kung 의 index 패턴이 bank conflict 를 만드는 자리가 있다. PMPP 가 padding trick (CONFLICT_FREE_OFFSET) 으로 해결. 강의에서는 짧게 언급.
warp shuffle scan — 강의에서는 일반 shared memory 버전 위주. __shfl_up_sync 의 정식 코드는 L021 의 영역.
inclusive ↔ exclusive 변환의 비용 — exclusive 를 inclusive 로 바꿀 때 한 step 의 추가 work. 강의의 BLOCK 안에서는 무시할 수 있지만 multi-block 으로 가면 의미 있음.
thread coarsening — 한 thread 가 여러 element 처리해서 work 를 키우는 대신 BLOCK 의 thread 수를 줄이는 트릭. L021 의 본격 주제.
strict CCCL 비교 — CUB 의 BlockScan 이 정확히 어떤 알고리즘 (RAKING/WARP_SCANS/MATRIX) 을 default 로 쓰는지 직접 source 확인.
numerical stability — float scan 은 더하는 순서에 따라 round-off 누적 차이가 있다. associative 가 numerical 로는 정확히 성립하지 않음. 강의에서 언급 안 됨 — 후속 자료로 보강.

검증 메모

이 노트의 step 도식과 코드는 PMPP 의 표준 표현을 따른다. 자기 GPU 에서 직접 두 알고리즘을 짜고 NCU 로 work/step 을 측정해보면 그래프가 손에 박힌다.

← Lecture 019 Data Processing on GPUs — Devavret Makkar Lecture 021 → Scan Algorithm Part 2 — multi-block, warp-level, thread coarsening