《GPU Mode》 L081 2025 High priority transcript · failed

High-performance Purely Functional Data-Parallel Array Programming

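Alongside the Triton and ThunderKittens landscape lies an entirely different path: Futhark, created by Troels Henriksen. The user writes only SOACs (Second-Order Array Combinators) such as map/reduce/scan, and an AOT compiler lowers them to GPU code. These are study notes on a functional answer that expresses nested parallelism without CUDA's lifetime-management problems.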

Futhark SOAC functional data-parallel Copenhagen prefix sum flat parallelism array language

Speaker

Troels Henriksen

University of Copenhagen · Futhark development lead

Lecture number

L081

Study priority

High

Transcript

failed

Origin

academic / open

SUB-CHAPTERS · 12 SECTIONS

01 The problem the lecture tackles · why this lecture exists
02 The value of a functional array language · why functional
03 Compilation model · how Futhark lowers
04 GPU mapping · SOAC → kernel
05 Prefix sum and other building blocks · scan, reduce, segmented
06 ML applications · where Futhark fits
07 Adoption · production users
08 Comparison with other array languages · APL · Halide · JAX
09 Limitations · caveats
10 Notes to remember · key takeaways
11 Connections to other lectures · connections
12 Open questions · open questions

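§ 01 The problem the lecture tackles · why this lecture exists

Another answer to "why is GPU programming the same boilerplate every time?"

In the CUDA / Triton landscape, users almost always repeat the same chores: index arithmetic, tile partitioning, managing shared-memory lifetimes, and avoiding races. Is this boilerplate essential, or a by-product of the wrong abstraction?

Futhark bets on the latter. An algorithm expressed with functional abstractions (map, reduce, scan, segmented reduce) can be lowered by the compiler to efficient parallel code automatically. The user writes only the algorithm; lifetimes, races, and vectorization are the compiler's responsibility.

The lecture's frame.

Data parallelism is functional at its core: map/reduce/scan are all higher-order functions parameterized by a pure function. Without side effects, the compiler is free to reorder, fuse, and parallelize.
The nested-parallelism problem: nested patterns such as map (\x -> map f x) y are natural to write but do not map directly onto the GPU's grid/block model. Futhark's flattening resolves this.
AOT compilation: ahead-of-time rather than a Python JIT. The compiler generates C / CUDA / OpenCL code, and the result deploys like an ordinary binary.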
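
To make the first point of the frame concrete, here is a minimal Python sketch of a dot product written only as a composition of map and reduce, mirroring the shape of the Futhark expression reduce (+) 0 (map2 (*) xs ys). The helper names `map2` and `reduce_` are hypothetical stand-ins that model the semantics sequentially; they are not Futhark's implementation.

```python
from functools import reduce as fold
from operator import add, mul

def map2(f, xs, ys):
    """Element-wise combination of two equally long sequences."""
    if len(xs) != len(ys):
        raise ValueError("map2 requires equal lengths")
    return [f(x, y) for x, y in zip(xs, ys)]

def reduce_(op, neutral, xs):
    """Sequential reference semantics of reduce; op is assumed associative."""
    return fold(op, xs, neutral)

def dotprod(xs, ys):
    # No loop index and no shared state: just map2 followed by reduce.
    return reduce_(add, 0.0, map2(mul, xs, ys))

print(dotprod([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

Because neither stage mutates anything, a compiler is free to decide how the element-wise products and the summation are scheduled.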

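The lecture's cognitive frame

It recasts GPU programming as "compositions of reduce/scan over data structures". This may look awkward at first, but it is in fact a familiar paradigm (NumPy, JAX, and MapReduce all share this frame). Futhark is an attempt to bring it to the GPU natively.

"All fast GPU code can ultimately be written as a composition of reduce and scan, so let's make them first-class." Troels Henriksen · needs verification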

§ 02 The value of a functional array language · why functional

Without side effects, the compiler is free

Futhark stresses that it is "purely functional" because code without side effects can be reordered, fused, and parallelized by the compiler almost at will.

Automatic fusion: map f (map g xs) is rewritten to map (f . g) xs automatically, so the intermediate array never lands in memory (loop fusion + deforestation).
Explicit data dependencies: every dependency is expressed through function inputs and outputs. With no shared mutable state, data races cannot occur by construction.
Freedom to reorder: in map f xs ++ map g ys the two maps are independent, so the compiler may run them concurrently or merge them.
Natural autodiff: the compiler can derive the derivative of a pure function directly. Same idea as JAX's jax.grad, but ahead-of-time.
Uniqueness types, Futhark's escape hatch: when an in-place update is needed, it is expressed through uniqueness types. This permits efficient mutation while keeping the language's functional semantics.
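
The fusion rule in the list above can be checked on a small case. The Python sketch below is only a model of the rewrite map f (map g xs) → map (f . g) xs; the function names are hypothetical, and the equivalence holds because `f` and `g` are pure.

```python
def compose(f, g):
    """Return the function x -> f(g(x))."""
    return lambda x: f(g(x))

def unfused(f, g, xs):
    tmp = [g(x) for x in xs]        # the intermediate array is materialized
    return [f(t) for t in tmp]

def fused(f, g, xs):
    h = compose(f, g)               # map f (map g xs) == map (f . g) xs
    return [h(x) for x in xs]       # single pass, no intermediate array

def square(x):
    return x * x

def inc(x):
    return x + 1

xs = list(range(5))
print(unfused(inc, square, xs))  # [1, 2, 5, 10, 17]
print(fused(inc, square, xs))    # [1, 2, 5, 10, 17]
```

If `g` wrote to shared state that `f` later read, the two versions could differ, which is why the rewrite is only safe in a side-effect-free setting.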

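Why this suits GPUs in particular

A GPU is inherently a SIMD/SIMT machine: the same operation applied to different data at once. That is exactly the shape of map, so the functional abstraction matches the hardware almost 1:1. CUDA's thread-index arithmetic is that same mapping done by hand every time.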

§ 03 Compilation model · how Futhark lowers

The stages on the way down to C / CUDA / OpenCL

L0 Futhark source: .fut file; ML-family syntax; SOACs composed with ordinary functions

L1 SOACS IR: an IR in which high-level abstractions such as map / reduce / scan are still visible

L2 fusion + flattening: converts nested parallelism to flat parallelism and merges adjacent SOACs

L3 low-level kernel IR: flat kernels, i.e. segmented reduce / map in expanded form

L4 backend selection: CUDA · OpenCL · multicore C · HIP · sequential C

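L5 code generation: the final output is C source, with the GPU kernels embedded as program text and compiled at run time (NVRTC for the CUDA backend). A C compiler such as gcc or clang builds the binary.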

The core of this pipeline: fusion and flattening are the most valuable stages. Fusion removes intermediate arrays and saves memory bandwidth; flattening maps nested parallelism onto the GPU's grid/block structure.

Unpacking what flattening means.

Regular nesting: in map (\x -> map f x) xs every inner map has the same size, so the outer × inner plane unfolds into a single grid. Straightforward.
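Irregular nesting: when inner sizes differ (variable-length rows), the data cannot be a Futhark array of arrays, because Futhark arrays must be regular. The data is instead stored flat together with a flag or row-offset array and processed with segmented operations.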
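Nested scan: a scan inside an outer map becomes a segmented scan.

The difficulty at this stage is that full flattening is not always the most efficient choice. With small inner sizes, the overhead of exploiting the inner parallelism (extra synchronization, divergence) can outweigh the gain, and running the inner level sequentially is better. Futhark's incremental flattening generates several code versions and picks among them at run time using tunable size thresholds.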
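
The regular case described above comes down to index arithmetic. The following Python sketch is an illustration under the assumption of a row-major layout: it evaluates a nested map over one flat index space, where each `tid` plays the role of one GPU thread. The function names are hypothetical and this is not the code Futhark emits.

```python
def nested_map(f, xss):
    """Reference semantics of a map nested inside a map."""
    return [[f(x) for x in row] for row in xss]

def flattened_map(f, xss):
    """Same result computed over one flat index space of n * m 'threads'."""
    n = len(xss)
    m = len(xss[0]) if n else 0
    if any(len(row) != m for row in xss):
        raise ValueError("regular flattening needs equally long rows")
    flat_in = [x for row in xss for x in row]   # row-major layout
    flat_out = [None] * (n * m)
    for tid in range(n * m):                    # every tid is independent
        i, j = divmod(tid, m)                   # recover (row, column)
        flat_out[tid] = f(flat_in[i * m + j])
    return [flat_out[i * m:(i + 1) * m] for i in range(n)]

xss = [[1, 2, 3], [4, 5, 6]]
print(flattened_map(lambda x: x * 10, xss))  # [[10, 20, 30], [40, 50, 60]]
```

The `ValueError` branch marks exactly where the regular scheme stops working: rows of different lengths have no single `m` to divide by.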

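-- Futhark source: sparse matrix-vector product
-- Futhark arrays are regular, so every row holds k entries
-- (shorter rows are padded with value 0 and any valid column index).
def sparse_matmul [n][k][m] (rows: [n][k]i64) (vals: [n][k]f32) (x: [m]f32) : [n]f32 =
  map2 (\row vs ->
          reduce (+) 0f32
                 (map2 (\j v -> v * x[j]) row vs))
       rows vals

-- What the compiler produces (conceptually)
-- 1. nested map   -> segmented map
-- 2. inner reduce -> segmented reduce
-- 3. flattening   -> a single kernel

-- Result: one kernel launch; each row is one segment of a segmented reduce.
-- Rows of truly different lengths need padding or a hand-written flat encoding.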
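
Rows of genuinely different lengths do not fit a rectangular array of arrays. One common workaround is a flat layout plus a flag array, processed with a segmented scan. The Python sketch below models that idea sequentially; `segmented_scan`, `spmv_flat`, and the `row_lengths` encoding are hypothetical names chosen for illustration, not Futhark library functions.

```python
def segmented_scan(op, flags, vals):
    """Inclusive scan that restarts wherever flags[i] is True.

    It is an ordinary scan over (flag, value) pairs with a lifted operator,
    and the lifted operator is associative whenever op is.
    """
    def lifted(a, b):
        fa, va = a
        fb, vb = b
        return (fa or fb, vb if fb else op(va, vb))

    out, acc = [], None
    for pair in zip(flags, vals):
        acc = pair if acc is None else lifted(acc, pair)
        out.append(acc[1])
    return out

def spmv_flat(row_lengths, cols, vals, x):
    """Sparse matrix-vector product over flat (CSR-like) arrays."""
    flags = []
    for n in row_lengths:
        if n > 0:
            flags.extend([True] + [False] * (n - 1))    # True starts a row
    products = [v * x[j] for j, v in zip(cols, vals)]   # flat map
    sums = segmented_scan(lambda a, b: a + b, flags, products)
    result, end = [], 0
    for n in row_lengths:
        end += n
        # The last element of a segment holds that row's total.
        result.append(sums[end - 1] if n > 0 else 0.0)
    return result

# Rows: [(0, 1.0), (2, 2.0)], [(1, 3.0)], [], [(0, 4.0), (1, 5.0), (2, 6.0)]
print(spmv_flat([2, 1, 0, 3], [0, 2, 1, 0, 1, 2],
                [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 10.0, 100.0]))
# [201.0, 30.0, 0.0, 654.0]
```

The design choice is that all parallelism lives in one flat map and one scan, so row length imbalance no longer dictates how work is split.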

§ 04 GPU mapping · SOAC → kernel

How map / reduce / scan land on grid/block/thread

FIG · SOAC composition diagram · functional → CUDA grid/block/thread

SOAC composition patterns line up naturally with the GPU's three-level hierarchy: outer map → grid, middle scan → block (cooperative), inner map → thread. The user never writes this mapping by hand.
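
As one hedged illustration of how a single SOAC spreads over that hierarchy, the Python sketch below models a reduce in two stages: each "block" reduces its own chunk with a pairwise tree, then the per-block partial results are reduced again. `tree_reduce` and `blocked_reduce` are hypothetical names, and the scheme is only valid when the operator is associative.

```python
def tree_reduce(op, neutral, xs):
    """Pairwise (tree-shaped) reduction; O(log n) levels if run in parallel."""
    level = list(xs)
    if not level:
        return neutral
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])       # odd element carries over unchanged
        level = nxt
    return level[0]

def blocked_reduce(op, neutral, xs, block_size):
    """Two-stage reduce: one partial result per 'block', then combine them."""
    partials = [tree_reduce(op, neutral, xs[i:i + block_size])
                for i in range(0, len(xs), block_size)]
    return tree_reduce(op, neutral, partials)

print(blocked_reduce(lambda a, b: a + b, 0, list(range(10)), 4))     # 45
print(blocked_reduce(lambda a, b: a + b, "", list("abcdefg"), 3))    # abcdefg
```

String concatenation is associative but not commutative, so the second call shows that the tree preserves element order; a non-associative operator such as subtraction would give a different answer than a left-to-right fold.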

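map

map f xs

(a → b) → [a] → [b]

Fully independent, element-wise. The most natural GPU primitive.

reduce

reduce f e xs

(a → a → a) → a → [a] → a

If f is associative, a tree reduction gives O(log n) depth with O(n) work. f together with its neutral element e should form a monoid.

scan

scan f e xs

(a → a → a) → a → [a] → [a]

Generalization of prefix sum. Implemented as a work-efficient parallel scan.

filter

filter p xs

(a → bool) → [a] → [a]

Expressible with scan: accumulate target indices, then compact.

scatter

scatter dst is vs

*[a] → [i64] → [a] → *[a]

Indexed write. The destination is consumed through its uniqueness type, so the update can happen in place.

stream_red

stream_red op f xs

Chunk-wise reduce

Processes a large array in sequential chunks, which leaves room for register tiling inside each chunk. (Availability of this SOAC in current Futhark releases needs verification.)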
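
The filter card says it can be built from scan. A minimal Python sketch of that construction (map to 0/1, inclusive scan for target slots, then scatter) follows. The helper names are hypothetical; the `scatter` model ignores out-of-range indices, which is the behaviour this sketch relies on to drop rejected elements.

```python
from itertools import accumulate

def scatter(dst, indices, values):
    """Write values[k] to dst[indices[k]]; out-of-range indices are ignored."""
    out = list(dst)
    for i, v in zip(indices, values):
        if 0 <= i < len(out):
            out[i] = v
    return out

def filter_via_scan(pred, xs):
    keep = [1 if pred(x) else 0 for x in xs]                 # map
    pos = list(accumulate(keep))                             # inclusive scan (+)
    count = pos[-1] if pos else 0
    # Kept elements go to slot pos - 1; dropped ones get index -1 (ignored).
    targets = [(p - 1) if k else -1 for p, k in zip(pos, keep)]
    return scatter([None] * count, targets, xs)              # scatter

print(filter_via_scan(lambda x: x % 2 == 0, [3, 4, 7, 8, 10, 1]))  # [4, 8, 10]
```

Every step is a map, a scan, or a scatter, so the whole filter inherits their parallel implementations.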

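§ 05 Prefix sum and other building blocks · scan, reduce, segmented

"Work-efficient parallel scan" is the base of everything

Every Futhark algorithm ends up as a composition of map/reduce/scan. Among these, scan is the key: it parallelizes computations that look sequential because each step depends on an accumulated prefix.

Parallel scan is best known through Blelloch's work-efficient algorithm (circa 1989-1990). It computes the prefix sum of n elements with O(log n) depth and O(n) total work.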
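
To see where the O(n) work and O(log n) depth come from, here is a sequential Python model of the two-phase algorithm: an up-sweep that builds a reduction tree in place and a down-sweep that pushes prefixes back down. It is a sketch under two assumptions: the input is padded to a power of two, and the result is an exclusive scan (Futhark's own `scan` is inclusive). It is not a claim about the kernel Futhark actually generates.

```python
def blelloch_exclusive_scan(op, neutral, xs):
    """Work-efficient exclusive scan; op must be associative with identity neutral.

    Each inner for-loop would be a parallel-for on a GPU; the while-loops
    run O(log n) times, which gives the O(log n) depth.
    """
    n = len(xs)
    size = 1
    while size < n:
        size *= 2
    a = list(xs) + [neutral] * (size - n)

    # Up-sweep: partial reductions accumulate toward the last element.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            a[i] = op(a[i - stride], a[i])
        stride *= 2

    # Down-sweep: replace the root with the identity and push prefixes down.
    a[size - 1] = neutral
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = a[i - stride]
            a[i - stride] = a[i]          # left child receives the prefix
            a[i] = op(a[i], left)         # right child: prefix, then left sum
        stride //= 2
    return a[:n]

print(blelloch_exclusive_scan(lambda x, y: x + y, 0, [3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

The operand order in the down-sweep (prefix first, left subtree second) is what keeps the result correct for associative operators that are not commutative.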

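Up-sweep + down-sweep: two tree-shaped phases. The up-sweep computes partial reductions; the down-sweep accumulates prefixes.
Segmented scan: computes per-segment prefixes for many segments of one array in a single pass. A flag array marks segment boundaries.
Basis of radix sort: sorting, too, reduces to compositions of scan (each radix pass partitions elements using prefix sums). Futhark's radix sort is built on scan.
SpMV and graph algorithms: sparse matrix-vector products, BFS, and connected components are commonly written with segmented scans and reductions.
Regex matching / parsing: expressible as a scan over a semigroup, for example by composing state-transition functions. This is where the strength of the functional abstraction shows most clearly.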
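
The regex item in the list above can be made concrete with a small automaton. In the Python sketch below each character becomes a transition table, tables are combined by function composition (associative, so a parallel scan applies), and the scan yields the automaton state after every prefix. The DFA for (ab)* and all names are hypothetical examples.

```python
from itertools import accumulate

# Hypothetical DFA for the regex (ab)* over the alphabet {a, b}.
# States: 0 = expecting 'a' (accepting), 1 = expecting 'b', 2 = dead.
TRANSITIONS = {
    "a": (1, 2, 2),   # TRANSITIONS[c][s] is the state reached from s on c
    "b": (2, 0, 2),
}
START, ACCEPTING = 0, {0}

def compose(f, g):
    """Apply table f first, then table g. Composition is associative."""
    return tuple(g[f[s]] for s in range(len(f)))

def match_prefixes(text):
    """For every prefix of text, report whether the DFA accepts it."""
    tables = [TRANSITIONS[c] for c in text]             # map (KeyError if c is outside the alphabet)
    prefixes = accumulate(tables, compose)              # inclusive scan
    return [p[START] in ACCEPTING for p in prefixes]    # map

print(match_prefixes("abab"))  # [False, True, False, True]
print(match_prefixes("abb"))   # [False, True, False]
```

`accumulate` runs the scan sequentially here; the point is that `compose` never depends on evaluation order, so the same operator can be handed to a parallel scan.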

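Why scan suits GPUs in particular

A scan decomposes into local work + tree reduction + broadcast, which corresponds closely to CUDA's warp shuffles and block-level reductions. Libraries such as CUB and Thrust all ship carefully tuned scan implementations; Futhark abstracts them so that nobody has to call them by hand each time.

"Every algorithm that runs fast on a GPU is a composition of reduce and scan; Futhark takes that out of the user's hands." Study note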

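§ 06 ML applications · where Futhark fits

The answer to "can you write an LLM in Futhark?"

The applicability of Futhark to ML is something Troels addresses explicitly in the lecture. The conclusion: "possible, but for now it is hard to cover models as large as PyTorch / JAX do".

Autodiff: supported first-class in Futhark. The compiler generates derivatives of functional code automatically. Similar to JAX's jax.grad, but ahead-of-time.
General computational kernels: standard ML ops such as convolution, pooling, and batch norm. A reference implementation written in Futhark can reach roughly 80-90% of cuDNN performance (needs verification).
Non-standard ops: a place to prototype new research ops quickly. As in PyTorch/JAX, no low-level index arithmetic is required.
BSP-style distribution: expressing multi-GPU execution functionally. Distributed Futhark is in progress (needs verification).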
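
The autodiff item above rests on the fact that a pure function can be differentiated mechanically. The Python sketch below shows the forward-mode (jvp) half of that idea with dual numbers; the `Dual` class and this `jvp` helper are hypothetical teaching code supporting only + and *, not Futhark's or JAX's implementation.

```python
class Dual:
    """A value paired with its directional derivative (tangent)."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)
    __rmul__ = __mul__

def jvp(f, xs, direction):
    """Return (f(xs), derivative of f at xs along direction)."""
    out = f([Dual(x, d) for x, d in zip(xs, direction)])
    return out.val, out.tan

def sum_of_squares(xs):
    # A pure map + reduce: square every element, then sum.
    total = Dual(0.0)
    for x in xs:
        total = total + x * x
    return total

print(jvp(sum_of_squares, [1.0, 2.0, 3.0], [1.0, 0.0, 0.0]))  # (14.0, 2.0)
```

The user function never mentions derivatives; they fall out of re-interpreting the same pure code over a different number type, which is the property a compiler-level autodiff exploits.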

But the limits are also clear.

New Hopper / Blackwell instructions (TMA, WGMMA, tcgen05.mma) are absorbed slowly. Tensor-core code generation lags behind the rest of the CUDA backend.
Deep tricks such as FlashAttention / FA3 / FA4 are hard to express in Futhark's abstractions; they need side effects and fine-grained control.
There is no counterpart to the PyTorch / JAX ecosystem (model zoo, datasets, deployment).

"The practical niche"

Where Futhark shines: custom ML and scientific-computing algorithms. Examples: a reference for a new attention variant, an image-processing pipeline, a simulation. It is a good prototyping tool for things missing from PyTorch's default ops.

§ 07 Adoption · production users

The answer to "it came from academia, but is it used in real production?"

University of Copenhagen courses: the standard tool in array-programming / parallel-computing courses. Much undergraduate course material is public.
HIPERFIT center: high-performance financial computing, for example Monte Carlo simulation and option pricing.
Specific companies (financial / scientific): Futhark's ahead-of-time compilation fits production binaries well. The exact list of companies needs verification.
Research tool: a base for research on array languages and parallel programming, and a point of comparison for other functional languages such as Accelerate (Haskell, GPU) and SaC (Single-Assignment C).
Python integration: modules written in Futhark can be imported from Python and combine naturally with NumPy.

The big picture: Futhark is not an industry default but a strong tool in its niche, covering what JAX/PyTorch do not, or cases where a functional abstraction is decisive for code correctness.

§ 08 Comparison with other array languages · APL · Halide · JAX

Different routes up the same mountain

language	paradigm	compile	GPU	typical use
Futhark	pure functional	AOT	CUDA / OpenCL / HIP	HPC, custom ML, prototyping
APL / J / Q	array calculus	interpreted	limited	finance, data analysis
Halide	schedule-separated	JIT/AOT	CUDA / Metal	image processing
JAX (Pallas)	functional + tracing	JIT (XLA)	TPU / GPU	ML research
Accelerate (Haskell)	pure functional	JIT	CUDA	academic, niche
SaC	single-assignment C	AOT	limited	academic
Triton	tile DSL	JIT	CUDA	ML kernels
NumPy	imperative array	interpreted	separate (CuPy)	data analysis, ML

Key axes of comparison.

JAX vs Futhark: both are functional, both have autodiff. JAX is JIT (traced from Python); Futhark is AOT (produces a binary). JAX dominates the ML ecosystem; Futhark sits in an HPC niche.
Halide vs Futhark: Halide separates algorithm from schedule (the user also writes the schedule). In Futhark the compiler derives the schedule automatically. Both have been used for image processing.
APL/J vs Futhark: the APL family is interpreted, with terse expressions. Futhark is compiled, more verbose, and statically typed. The gap in parallel capability is large.
Triton vs Futhark: Triton is tile-level and imperative; Futhark is array-level and functional. The abstraction levels differ. Triton is the default for ML kernels, while Futhark targets HPC-style array code.

"Functional array languages have always been niche. They never went mainstream, but the existence of JAX shows that the abstraction was right." Study note

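§ 09 Limitations · caveats

What Futhark cannot do

No fine-grained GPU control: there is no way to invoke new instructions such as TMA, WGMMA, or tcgen05.mma directly. Code that needs ThunderKittens or hand-written CUDA cannot be written in Futhark.
Side-effecting algorithms are hard to express: race-based algorithms (Hogwild SGD, lock-free hash maps) do not fit functional abstractions naturally.
Limits of autodiff: fine for ordinary ML operators, but the backward pass of an algorithm like FlashAttention does not come out efficient.
No ML ecosystem: model zoo, datasets, and deployment infrastructure are thin compared with PyTorch/JAX.
The "ML" side of the learning curve: the ML-family syntax (ML as in the Standard ML / OCaml lineage, with a Haskell-like feel) is a barrier to entry for data scientists used to Python.
Community size: centered on academia. Industrial adoption is small, so production case studies are few.
Speed of absorbing new hardware: it takes time for a new chip's new instructions to reach the Futhark backends, with more lag than NVIDIA's own toolchain.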

§ 10 Notes to remember · key takeaways

What should be at your fingertips when you reopen this

SOAC

Second-Order Array Combinator: map/reduce/scan/scatter and so on. Futhark's first-class primitives.

pure functional

No side effects. The compiler is free to reorder, fuse, and parallelize. Data races are absent by construction.

flattening

Turns nested parallelism into flat parallelism over grid/block. Regular and irregular nesting are handled differently.

fusion + deforestation

map f (map g xs) → map (f . g) xs. No intermediate array is built, which saves memory bandwidth.

work-efficient scan

Blelloch up-sweep + down-sweep. O(n) work, O(log n) depth. A base for nearly every algorithm in these notes.

uniqueness type

The functional escape hatch for in-place updates. Expresses mutation while staying race-safe.

AOT compilation

Lowers to CUDA / OpenCL / HIP / multicore C. Can also be exported as a Python module. Deployed as a binary.

ML niche

New-op prototypes, custom scientific computing, research. Cannot match hand-tuned code such as cuDNN / FA4.

YouTube: lecture video (needs verification)

Site: futhark-lang.org

Repo: github.com/diku-dk/futhark

Book: Parallel Programming in Futhark

Related: Halide · Accelerate · SaC · Co-dfns · JAX

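Getting it into your hands: practice sequence

hello-world: install Futhark and write a 1-D dot product, map2 (*) xs ys |> reduce (+) 0. Compile with the CUDA backend and call it from Python.
Write a prefix sum yourself: express it with Futhark's scan (+) 0, then write the Blelloch scan in raw CUDA and compare code length.
Check the effect of fusion: dump the IR for map f (map g xs) and confirm directly whether the compiler merged the two maps.
Segmented reduce: row-wise sums of a matrix with variable row lengths, with a flag array marking segments. Compare performance with CUB's DeviceSegmentedReduce.
Mini ML: forward + backward of a 3-layer MLP in Futhark, using the autodiff vjp. Compare the results with PyTorch within an absolute tolerance (atol).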
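Read the raw CUDA backend output: dump the generated kernel source with the compiled program's --dump-cuda option (flag name needs verification) and see first-hand how it differs from hand-written CUDA.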

§ 11 Connections to other lectures · connections

Where this lecture leads within the series

L021

Scan

The base of parallel prefix scan, Futhark's core primitive

L077

DSLs for GPU Kernels

Where Futhark sits, at a different level of abstraction

L018

Fusing Kernels

The base of fusion, and how Futhark's automatic fusion absorbs it

L053

torch.compile

Graph-level compiler, a sibling idea to Futhark

L074

PaTH Attention

Prefix Householder products: a natural fit for a Futhark scan

L079

Mirage MPK

Super-optimizer, a different direction from Futhark's automatic scheduling

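§ 12 Open questions · open questions

Things to verify first-hand on the next viewing

Exact performance gap versus cuDNN: the 80-90% estimate in § 06. Measure per op (conv2d, matmul).
Timeline for absorbing new hardware: have Hopper TMA / WGMMA reached Futhark? What about Blackwell?
Distributed Futhark: how far has multi-GPU support progressed? How does it differ from abstractions like Iris (L078)?
Production maturity of autodiff: use cases of vjp / jvp, and comparison with JAX.
PyTorch integration: a standard way to wrap a Futhark module as a PyTorch custom op.
Changes in research adoption: has Futhark adoption in ML research grown since the lecture?
Efficiency curve of flat parallelism: how do the compiler's flattening decisions compare with hand-flattened code, especially for irregular workloads?

Verification memo

These notes are a reconstruction from the official Futhark site, domain knowledge (functional array programming, parallel scan), and the lecture metadata. All absolute numbers are time-dependent; consult futhark-lang.org and the book directly. The lecture's exact demo code and measurements must be checked against the video.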
