CUDA COMPILER 18VOL · TIER 4 · A4 LANDSCAPE · 18p

XLA · TVM · Polyhedral 비교 단권화

HLO · Relay/TE/TIR · Schedule Primitive · AutoTVM/Ansor · Halide · Polyhedral Model · 4-compiler Matrix

Volume 14 / 18

Tier T4 Compiler

선행 V11 (Triton), V12 (MLIR)

용도 대안 컴파일러 지도 · 결정 트리

목차 (Compiler 지형)

1. 컴파일러 지형도 (Triton/XLA/TVM/Inductor)p.2

2. XLA 개요 — HLO IR, JAX/TF 프론트엔드p.3

3. HLO Pass — simplify/layout/fusion/schedulep.4

4. XLA GPU backend — Triton · cuDNN · Cmd Bufferp.5

5. TVM 개요 — Relay → TE → TIR → codegenp.6

6. Tensor Expression (compute + reduce_axis)p.7

7. Schedule Primitive 11종 — split/bind/tensorizep.8

8. TIR — block · buffer · iter_var · intrinsicp.9

9. AutoTVM — template + XGBoost costp.10

10. Ansor / Meta Schedule — auto search spacep.11

11. Halide의 영향 — compute/schedule 분리 철학p.12

12. Polyhedral Model 기본 — domain/access/schedulep.13

13. Polyhedral 적용 — MLIR affine · ISL · Plutop.14

14. 4-Compiler 비교 매트릭스 (10 기준)p.15

15. Auto-tuning 방법론 — grid/BO/GA/RL/costp.16

16. Search Space 설계 원칙p.17

17. Cheat Sheet — 상황 → 컴파일러 결정 트리p.18

범례

핵심 용어 (노랑)

매우 중요·표 헤더

정의·수식 박스

예시·워크드

빨강주의·함정

★필수 핵심

(!)니모닉

↗권 간 참조 (V11·V12·V13)

S · Tschedule · tensor

Ωiteration domain (polyhedral)

Source XLA HLO docs · TVM papers (MLSys'18·OSDI'20) · Halide (SIGGRAPH'12) · ISL · Pluto · AutoTVM (NeurIPS'18) · Ansor (OSDI'20)

V14 · 18 pages · Out-of-scope: TVM API 상세, 비-GPU target

1 네 컴파일러 한 줄 정의 ★ 지도

정의 Triton: block-level Python DSL, GPU 단일 타겟 ↗ V11 §1.
XLA: HLO 중심 graph compiler, JAX/TF 프론트엔드.
TVM: Relay→TE→TIR 다층 IR, multi-backend.
Inductor: PyTorch 2 backend, ATen→Triton codegen ↗ V13 §2.

2 추상 수준 스펙트럼

graph-op ──────────────────── loop-nest
  XLA HLO    Inductor FX      TVM TE/TIR    Triton
  (coarse)   (mid)            (fine)        (tile)
    │          │                │              │
   fusion    Triton            schedule       직접
   passes    codegen           primitive      작성

추상 수준 ↑ → 자동화 ↑, 제어 ↓ / 추상 수준 ↓ → 제어 ↑, 자동화 ↓

3 지배 원칙

XLA: fusion-first — op을 묶어 memory traffic 축소
TVM: compute/schedule 분리 — Halide 계보
Triton: tile DSL — 사람이 tile 크기·layout 지정 ↗ V11 §3
Inductor: graph capture → codegen — Dynamo·AOT·TorchInductor

4 입출력·타겟 비교표 ★

컴파일러	입력	IR	타겟
Triton	Py DSL	TTIR/TTGIR	NV/AMD GPU
XLA	JAX/TF	HLO/StableHLO	GPU/TPU/CPU
TVM	Relay/ONNX	Relay→TE→TIR	GPU/CPU/FPGA/μC
Inductor	FX graph	Aten IR+Loop IR	GPU(Triton)/CPU(C++)

StableHLO = HLO의 portable subset (2023+)

5 자동화 축 ★

control = f(user_schedule_intent)
automation = 1 − control Triton: control ≈ 0.7 / XLA: ≈ 0.1 / TVM w-sched: ≈ 0.5 / TVM+Ansor: ≈ 0.1

TVM은 두 모드: 수동 schedule vs Ansor auto
Inductor = 수동 schedule 거의 없음, 휴리스틱 + Triton 튜너

6 생태계·진입점

컴파일러	주된 사용자	조직
Triton	kernel 작성자	OpenAI → LF
XLA	JAX/TF 프레임워크	Google · OpenXLA
TVM	배포·edge·연구	Apache
Inductor	PyTorch 사용자	Meta

7 중첩 사용 패턴 실제

JAX → XLA HLO → (일부 pattern) Triton kernel ↗ V11 §12
Inductor → Triton codegen (decode fusion) ↗ V13 §5
TVM Meta Schedule → TIR → CUDA C codegen
공통 트렌드: tile-level DSL이 최하위 공통 기반으로 수렴

8 선택 질문 3가지 제·어·생

제어: 내가 tile/layout 지정? → Triton / 수동 TVM
어디서: 프레임워크 어디 있음? → XLA(JAX) / Inductor(PT)
생태계: multi-backend? → TVM

지형도 4축: 추·자·입·생 (추상수준·자동화·입력·생태계)

1 XLA 파이프라인 ★ 전체

JAX / TF graph
   │ (tracing, jaxpr → StableHLO)
   ▼
StableHLO (portable)
   │ (legalize)
   ▼
HLO (XLA internal)
   │ (HLO passes: §3)
   ▼
GPU backend (LLVM + Triton emitters)
   │
   ▼
cubin / PTX

2 HLO 성격 graph IR

HLO High Level Optimizer IR. op 단위 (dot, convolution, reduce, broadcast, dynamic-slice). shape는 정적이 원칙 (dynamic op은 제한적).

functional · SSA 형태, side-effect 없음 (send/recv 제외)
모든 value = shaped array (f32[128,256] 등)
pure function인 computation 단위로 구성

3 HLO op 주요 카테고리

카테고리	대표 op
Elementwise	`add`·`mul`·`exp`·`tanh`
Reduce	`reduce`·`reduce-window`
Contraction	`dot`·`convolution`
Shape	`reshape`·`broadcast`·`transpose`
Data movement	`gather`·`scatter`·`slice`
Control	`while`·`conditional`·`call`
Collective	`all-reduce`·`all-gather`·`reduce-scatter`

collective는 SPMD partitioner가 삽입 ↗ V15 §1

4 JAX 프론트엔드 흐름

jax.jit(f)
  └─ tracing (abstract vals)
      └─ jaxpr (JAX IR)
          └─ jaxpr → StableHLO
              └─ XLA compile
                  └─ PjRt executable

5 HLO 텍스트 예시 Softmax snippet

// f(x) = softmax(x) along dim 1
HloModule softmax
ENTRY main (x: f32[B,N]) -> f32[B,N] {
  x  = f32[B,N] parameter(0)
  mx = f32[B]   reduce(x, max, dim=1)
  bx = f32[B,N] broadcast(mx, {0})
  sh = f32[B,N] subtract(x, bx)
  e  = f32[B,N] exponential(sh)
  s  = f32[B]   reduce(e, add, dim=1)
  bs = f32[B,N] broadcast(s, {0})
  ROOT y = f32[B,N] divide(e, bs)
}

6 SPMD · Sharding 개요

@jax.jit + PartitionSpec → sharding 주석 HLO
GSPMD partitioner가 collective op 자동 삽입
op별 sharding 규칙 propagation (local vs replicate)

실제 collective·topology는 V15에서 상세

7 TF vs JAX 차이

TF: graph 빌더 직접, tf2xla가 HLO 생성
JAX: functional, staging 명확, StableHLO 우선
둘 다 동일한 HLO 백엔드 경유

1 Pass 4대 축 간·레·융·배

간소화 — algebraic simplification
레이아웃 — layout assignment
융합 — fusion
배치 — scheduling / memory planning

2 Algebraic Simplification

규칙	예시
identity	`x·1 → x`, `x+0 → x`
fold const	`reshape(const) → const'`
slice(concat)	slice가 가리키는 one operand 선택
reshape chain	연속 reshape 병합
broadcast sink	elementwise 앞으로 이동
transpose fold	dot dimension 숫자 재지정

수십 개 규칙을 fixed-point로 반복 적용

3 Layout Assignment ★

의미 각 tensor의 minor-to-major 차원 순서 결정. contraction·convolution은 선호 layout이 있고, elementwise는 layout-agnostic.

conv → NHWC vs NCHW (cuDNN 입장에서 최적)
dot → lhs_contracting minor인 layout 선호
layout mismatch → copy op 삽입
goal: copy 최소, cuBLAS·cuDNN 친화적 layout 전파

4 Fusion 종류 ★ 핵심

종류	설명
kLoop	elementwise chain fusion
kInput	reduce + elementwise 앞
kOutput	elementwise epilogue
kCustom	Triton / cuDNN 등 external emitter

Fusion decision은 cost model 기반: reuse, register pressure 추정

5 Fusion 조건식

benefit = saved_memory_traffic
cost = extra_register + extra_compute
fuse iff benefit > cost ∧ pattern legal

softmax → reduce(max) → bcast → sub → exp → reduce(sum) → div을 1~2 fused HLO로 묶어 single kernel 생성.

6 Scheduling · Memory

HLO scheduler: sequential order 결정 → live-range 정함
대표 heuristic: memory-pressure-minimizing, ListScheduler
Buffer assignment: 같은 메모리 slot 재사용 (aliasing)
Rematerialization: memory 초과 시 재계산 ↗ V17 §7

7 Pass 순서 대략

StableHLO→HLO
  → SimplifyReshape
  → AlgSimplifier (fixpoint)
  → LayoutAssignment
  → HloCSE / DCE
  → FusionPipeline
  → HloScheduler
  → BufferAssignment
  → GPU emitter (LLVM+Triton)

layout assignment 후 algsimplifier 재호출 금지 — layout invariant 깨질 수 있음.

1 Emitter 계층 ★

fused HLO  ──►  GPU emitter 선택
   │            ├─ LLVM emitter (loop)
   │            ├─ Triton emitter (softmax/attn)
   │            ├─ cuBLAS call (dot)
   │            └─ cuDNN call (conv/attn)
   ▼
cubin + host 코드

2 Triton Integration H100+

특정 pattern (softmax, layernorm, attention 일부) → Triton emitter
TTIR 생성 후 Triton 컴파일러 호출 ↗ V11 §6
block size·num_warps autotune은 XLA 내부 grid search
LLVM emitter 대비 fused reduction이 빠른 케이스에서 선택

3 cuBLAS / cuDNN Fallback

HLO op	선호 라이브러리
dot (GEMM)	cuBLASLt · cuBLAS
convolution	cuDNN
fused attention	cuDNN FA / Triton
batched dot	cuBLASLt batched

4 Algorithm Picker

의미 cuBLAS·cuDNN에 여러 algorithm이 있음 (heuristic · explicit · winograd 등). XLA는 최초 실행 시 autotuning으로 가장 빠른 것 캐시.

cache key = (shape, dtype, layout, device)
결과는 XLA_FLAGS=--xla_gpu_autotune_level=N으로 조절
JIT 시 첫 실행이 느려지는 원인

5 Command Buffer ★

정의 여러 kernel launch + memcpy를 단일 submit으로 묶는 구조. CUDA Graph와 동일 개념.

장점: launch overhead 제거, 특히 decode 경로
조건: shape/algorithm 고정, dynamic control flow 없음
conditional graph로 일부 while·cond 지원

cf. PyTorch CUDA Graph ↗ V13 §10

6 Memory Layout 전략

BufferAssignment: live-range 기반 slot 할당
Input/output donation: donate_argnums → in-place 재사용
scratch/workspace: cuBLAS·cuDNN 전용 buffer 예약

7 PjRt Runtime

PjRt Portable JIT Runtime. device enumerate, buffer, executable 추상. JAX·PyTorch/XLA 공통 런타임 layer.

Client → Device → Buffer → Executable
async dispatch: H2D·compute·D2H 별도 stream

8 XLA GPU 요약 결정

상황	경로
대규모 dot	cuBLASLt
conv	cuDNN
softmax·LN	Triton emitter
elementwise chain	LLVM loop emitter
decode graph	Command Buffer

1 TVM 스택 ★ 3-layer IR

Relay / Relax       ◄── graph IR
  │  (op split, partition, quant)
  ▼
TE (Tensor Expression)
  │  compute + schedule
  ▼
TIR (Tensor IR)
  │  block-based low IR
  ▼
LLVM / CUDA / ROCm / Metal / C
       codegen

2 Relay · Relax

Relay functional graph IR. op = nn.conv2d·nn.dense·add. shape polymorphism.

Relax: 차세대 (2023+), dynamic shape symbolic, dataflow block
pattern matching, fusion rule DSL
ONNX·PyTorch frontend가 여기로 import

3 핵심 철학: compute/schedule 분리 ★

원칙 compute = 무엇을 계산. schedule = 어떻게. 하나의 compute에 여러 schedule을 bind해 다른 코드 생성.

C = compute(def_of_matmul)   ← WHAT
s = schedule(C)              ← HOW
  s[C].split(i, 32)
  s[C].bind(io, "blockIdx.x")
  ...
codegen(s, "cuda") → CUDA C

Halide 2012에서 유래 · §11

4 왜 분리하는가

correctness 변경 없음 · perf만 변화
target별 다른 schedule (CPU vs GPU)
auto-tuner가 schedule만 탐색 가능 (§9·§10)
scheduling language 그 자체로 연구 가치

5 Multi-backend 지원

타겟	codegen
NV GPU	CUDA C / NVRTC
AMD GPU	ROCm / HIP
CPU	LLVM IR
Apple	Metal
Edge/μC	microTVM / C
WebGPU	WGSL

본 권은 NV GPU만 다룸 (spec §Out-of-scope)

6 TVM vs XLA 위치

축	XLA	TVM
IR 계층	HLO 1층	3층 (Relay/TE/TIR)
스케줄	암시적 (pass)	명시적 (primitive)
프론트엔드	JAX/TF	ONNX/PyTorch
타겟	GPU/TPU/CPU	GPU/CPU/Edge 등

7 Relay fusion

op pattern 기반 (Opaque, Elemwise, Broadcast, Injective, CommReduce)
pattern table에 따라 fuse legality 판정
fused subgraph → TE compute로 lowering

1 TE 3요소 P·C·R

Placeholder = input tensor 선언
Compute = output tensor의 elemwise 정의
Reduce axis = reduction 전용 iter var

2 Vector Add 최소 예

import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(
  (n,),
  lambda i: A[i] + B[i],
  name="C")

3 Matmul reduce

M, N, K = te.var("M"), te.var("N"), te.var("K")
A = te.placeholder((M, K), "A")
B = te.placeholder((K, N), "B")
k = te.reduce_axis((0, K), name="k")
C = te.compute(
  (M, N),
  lambda i, j:
    te.sum(A[i, k] * B[k, j], axis=k),
  name="C")

4 iter_var 종류

종류	의미
Spatial	output 위치 loop (i, j)
Reduce	축약되는 loop (k)
Ordered	순서 있는 외부 loop
Thread	bind된 thread/block index

5 Compute 대수적 성질 ★

compute(shape, λ idx: expr)
· shape → iter domain
· expr → pure function of indices
· side-effect 없음 → schedule 변환이 안전 (correctness 불변)

6 Conv2D TE 축약

dy = te.reduce_axis((0, Kh))
dx = te.reduce_axis((0, Kw))
dc = te.reduce_axis((0, IC))
O = te.compute(
  (B, OC, Oh, Ow),
  lambda b, oc, h, w:
    te.sum(
      X[b, dc, h*sh+dy, w*sw+dx] *
      W[oc, dc, dy, dx],
      axis=[dc, dy, dx]))

7 Stage / Operation

Stage 한 compute가 한 stage를 만든다. schedule은 stage별 조작. s = te.create_schedule(C.op)로 시작.

stage = (op, iter_vars, attach_point)
pipeline은 stage들의 유향 그래프
producer → consumer 의존성 자동 추적

8 Tensor vs TIR Buffer

계층	표현
TE	Tensor (abstract, shape만)
TIR	Buffer (address, stride, scope)

TE → TIR lowering 시 buffer 생성 + load/store 구체화

9 TE 한계

control flow 빈약 (while·if 드물게)
irregular access 표현 제한
→ 그래서 TIR block이 등장 (§8)

1 전체 11 primitive 매트릭스 ★★

primitive	효과	용도
split	1 loop → 2 loop (factor)	tile, vectorize 준비
fuse	2 loop → 1 loop	bind 단순화
reorder	loop 순서 교체	locality, reuse
bind	loop → thread/block idx	GPU 실행축 지정
cache_read	producer 전에 shared buffer 삽입	smem tile load
cache_write	consumer 이후 register/shared buffer	register blocking
compute_at	stage를 다른 stage 안으로 이동	fusion, locality
compute_inline	stage를 사라지게 inline	elementwise 축약
tensorize	inner loop → HW intrinsic	mma, wmma, tensor core
vectorize	loop → SIMD / vector ld	float4 load 등
unroll	loop 풀어쓰기	ILP, 상수 addr
parallel	CPU multi-thread (참고)	CPU target

2 split · reorder 예

s = te.create_schedule(C.op)
i, j = C.op.axis
k,   = C.op.reduce_axis

# tile outer
io, ii = s[C].split(i, factor=64)
jo, ji = s[C].split(j, factor=64)
ko, ki = s[C].split(k, factor=8)
s[C].reorder(io, jo, ko, ii, ji, ki)

3 bind · GPU 실행축

# GPU mapping
s[C].bind(io, te.thread_axis("blockIdx.y"))
s[C].bind(jo, te.thread_axis("blockIdx.x"))
s[C].bind(ii, te.thread_axis("threadIdx.y"))
s[C].bind(ji, te.thread_axis("threadIdx.x"))

bind 순서가 coalescing 결정. threadIdx.x는 연속 mem dim에 bind.

4 cache_read → smem tile

AA = s.cache_read(A, "shared", [C])
BB = s.cache_read(B, "shared", [C])
s[AA].compute_at(s[C], ko)
s[BB].compute_at(s[C], ko)

5 tensorize 개념 ★

tensorize inner loop 패턴을 HW intrinsic 한 호출로 치환. NV Tensor Core wmma / mma에 대응. compute pattern과 intrinsic signature가 일치해야 함.

# 16x16x16 mma 대응 intrinsic 등록 후
s[C].tensorize(ii, wmma_sync_intrin)

cf. CUTLASS의 mma 직접 호출 ↗ V06 §3

6 compute_at 시멘틱

before                 after compute_at(s[B], s[C], ko)
for i, j:              for i, j:
  for ko:                for ko:
    B[i,j,ko] = ...        B[i,j,ko] = ...   ← 여기서 생성
  C[i,j] = f(B, ...)       ...
                         C[i,j] = f(B, ...)

B의 live-range가 ko 안으로 축소 → buffer 작아짐

7 Primitive 조합 규칙

split 먼저 → reorder → bind → cache_* → tensorize
tensorize는 항상 마지막 (shape 고정됨)
vectorize·unroll은 innermost axis에서
compute_at은 fusion 의도 명시

GPU matmul 순서 11: 분·순·바·캐·부·텐 (split·reorder·bind·cache·compute_at·tensorize)

1 TIR 위치 loop-nest IR

TE (compute + schedule)
   │ lower (ScheduleOps, StorageFlatten, ...)
   ▼
TIR (PrimFunc)
   │ pass: VectorizeLoop, InjectDoubleBuffer, ...
   ▼
codegen (LLVM / CUDA)

2 PrimFunc 최소 예

@T.prim_func
def matmul(A: T.Buffer((M,K), "float32"),
           B: T.Buffer((K,N), "float32"),
           C: T.Buffer((M,N), "float32")):
  for i, j, k in T.grid(M, N, K):
    with T.block("C"):
      vi, vj, vk = T.axis.remap("SSR", [i,j,k])
      with T.init():
        C[vi, vj] = 0.0
      C[vi, vj] += A[vi, vk] * B[vk, vj]

3 Block ★ 핵심 추상

block isolated computation unit. read/write region, iter_var 선언, optional init을 가진다. 스케줄러 분석 단위.

"SSR" = Spatial, Spatial, Reduce iter kind
T.init(): reduction 초기화 블록
block 경계 = schedule 변환 단위 경계

4 Buffer

속성	의미
shape·strides	다차원 view
dtype	float32/f16/int8
scope	global · shared · local · wmma.*
elem_offset	base offset (align)

5 iter_var 종류 ★

kind	표기
Spatial	S
Reduce	R
Opaque	O
Scan / Ordered	— (드물게)

analyzer가 dependency·parallelizability 판단에 사용

6 Intrinsic 표현

T.call_intrin(...): HW 전용 호출
예: tvm_mma_sync, tvm_load_matrix_sync (wmma)
target별 lowering: CUDA에서 mma.sync PTX로 치환 ↗ V03 §7

7 Meta Schedule Input

의미 Meta Schedule은 TIR PrimFunc을 직접 입력으로 받음. TE는 필요 없음. block 기반이어서 자동 검색이 가능.

block 주석이 풍부할수록 공간 축소 유효
transform을 block IR 위에서 bijective 하게 수행

8 TIR pass 주요

LoopPartition · VectorizeLoop · UnrollLoop
InjectDoubleBuffer · InjectPrefetch
StorageRewrite / StorageFlatten
CompactBufferAllocation · LowerDeviceStorageAccessInfo

1 AutoTVM 아이디어 ★ 1세대

정의 사람이 template(schedule skeleton)을 쓰고 그 안의 knob(tile size, unroll factor 등)만 auto search. 2018 NeurIPS Chen et al.

2 Template 예

@autotvm.template("matmul")
def matmul(M, N, K, dtype):
  cfg = autotvm.get_config()
  # knob 선언
  cfg.define_split("tile_i", M, num_outputs=2)
  cfg.define_split("tile_j", N, num_outputs=2)
  cfg.define_knob("unroll", [0, 1, 2])
  # schedule 적용
  io, ii = cfg["tile_i"].apply(s, C, i)
  ...

3 Search Space 크기

|S| = ∏_{knob k} |values_k| split(M,2): M의 divisor 수. unroll: 이산 factor.

matmul 기준 보통 10³~10⁶ 포인트. grid search 비현실적.

4 Tuning Loop ★

┌──────────────┐
│ Cost model   │◄── feature ← schedule
│  (XGBoost)   │
└───┬──────┬───┘
    │      │ predict best K configs
    │      ▼
    │  ┌───────────┐
    │  │ Measurer  │  real run on GPU
    │  └────┬──────┘
    │       │ (time)
    │   record
    └──── update model

5 XGBoost Cost Model

입력: feature vector of schedule (loop trip, mem access pattern)
출력: relative rank (not absolute time)
gradient boosted trees — cheap, interpretable
rank loss로 학습 → top-K 추천에 집중

6 Feature 예

feature	의미
loop extent	outer/inner trip count
ann type	unroll/vectorize/thread
stride	mem access 규칙
touched bytes	총 load/store 양

7 Measurement 세부

JIT build → upload → run → median time
timeout, retries, outlier filter
RPC-tracker로 원격 device farm 측정 가능
실측 ≫ cost model 신뢰 — model은 sampling 가이드일 뿐

8 탐색 알고리즘

알고리즘	특징
GridSearchTuner	작은 공간
RandomTuner	baseline
XGBTuner	model 기반 UCB
GATuner	evolutionary

9 AutoTVM 한계

template을 사람이 써야 함
template 바깥 최적화 불가 (공간 상한)
→ Ansor / Meta Schedule이 template 자체를 자동 생성 (§10)

1 Ansor 기여 ★ OSDI'20

정의 template 없이 compute def만으로 자동 search space 생성. Zheng et al. 2020.

sketch = 고수준 schedule 뼈대 자동 제안
annotation = 구체 factor (tile size 등) 채우기
evolutionary + cost model

2 Sketch → Annotation 2단계

compute def (TE)
  │  rule-based sketch gen
  ▼
Sketch 집합   ← 추상 skeleton
  │  annotation
  ▼
구체 schedule
  │  evolutionary search + cost model
  ▼
best schedule

3 Sketch Rule 예

MultiLevelTiling — output loop tile 3~4 레벨
AddCacheRead/Write — smem/reg 버퍼 삽입
RfactorReduction — reduction 분할
InlineTrivial — 간단한 stage inline

4 Search Space 비교 ★

방식	공간	사람 노력
Halide auto	좁음	낮음
AutoTVM	중간	template 작성
Ansor	넓음	compute만
Meta Schedule	Ansor+block기반	compute만

5 Evolutionary Loop

population P (candidate schedule들)
  ┌───────────────────────────┐
  │ mutate + crossover        │
  │ cost_model.predict        │
  │ select top-K              │
  │ measure top-K on GPU      │
  │ update cost_model         │
  └───────────────────────────┘

generation 수십~수백, 측정 수만 trial 가능

6 Cost Model 업그레이드

Ansor: MLP + XGBoost 혼합
Meta Schedule: MLP / TabNet 옵션
feature에 block 구조 추가 (Spatial/Reduce ratio)

7 Task Scheduler

정의 네트워크에 task가 수십~수백 개일 때, 얼마 tuning budget을 어느 task에 쓸지 결정. gradient bandit 류 알고리즘.

목표: 전체 latency 최소
task별 한계 이득(diminishing return) 추정

8 Meta Schedule (TVM Unity)

Ansor 후속 세대, TIR block 기반
ScheduleRule / PostprocessRule / Searcher 플러그인 구조
같은 framework로 CPU·GPU·tensor core 통합
Relax와 직접 통합

9 수렴 조건

stop iff
· best 개선률 < ε for N gen, or
· trial budget 소진, or
· wallclock 한계

Ansor는 측정 비용이 지배. 수천 trial = 시간·전력 큼. cost model이 나쁘면 더 커짐.

1 Halide 핵심 주장 ★ 2012

원제 Decoupling Algorithms from Schedules for Image Processing Pipelines — Ragan-Kelley et al., SIGGRAPH 2012.

같은 algorithm에 여러 schedule → 수십 배 성능 차
algorithm은 short·declarative, schedule은 별도 DSL
autoscheduler = schedule만 자동 생성 (2019+)

2 Halide 최소 예

// algorithm
Func blur;
Var x, y;
blur(x, y) = (in(x-1,y) + in(x,y) + in(x+1,y)) / 3;

// schedule
blur.split(x, xo, xi, 8)
    .vectorize(xi)
    .parallel(y);

3 Halide 용어 대응표

Halide	TVM
Func	compute/stage
Var	iter_var (spatial)
RDom	reduce_axis
split/reorder	split/reorder
compute_at / store_at	compute_at
vectorize / parallel	vectorize / parallel

4 왜 분리가 힘이 되는가 ★

correctness 분리 → 튜너 안전
target별 schedule → portability
search automate → human-free
문서화·재현성 향상

5 한계와 반론

control flow 복잡한 알고리즘 (IR, sparse 등) 표현 제한
schedule primitive 직교성이 완벽하지 않음 → 조합폭발
cost model 없으면 human exploration도 한계

6 계보 Halide → X

Halide (2012)
  ├─► TVM (2018) — compute/sched DSL
  ├─► Polymage / Tiramisu
  ├─► Exo
  └─► Triton (tile-level, 2021)
        ↑ 개념적 차이: 사용자는
          이미 "tile"이 schedule임

7 Halide autoscheduler 변천

세대	방법
Mullapudi'16	heuristic cost
Adams'19	ML cost model (tree)
Anderson'21	+ feature fusion

TVM Ansor의 직접 선조 아이디어

8 "Schedule language"라는 발상

코드 = alg + sched → 두 언어로 보기
sched는 loop transformation의 합성
각 primitive는 semantic-preserving 변환
→ 수학적 해석: schedule = functor on loop nests

9 Triton과의 철학적 차이

Halide/TVM: compute와 schedule을 분리해 각각 독립
Triton: 사용자가 tile을 이미 전제 — schedule 직접 write ↗ V11 §3
둘 다 "사람이 schedule을 쓴다"는 가정이지만 tile 기반 여부 차이

1 모델 요약 ★ Ω·A·S·D

4요소 Ω iteration domain, A access relation, S schedule, D dependence relation.

loop nest를 정수 affine 공간으로 모델
semantics-preserving 재일정(reschedule)이 수학적으로 기술됨

2 Iteration Domain Ω

Ω = { (i, j) ∈ ℤ² | 0 ≤ i < N ∧ 0 ≤ j ≤ i } loop 경계 + affine constraint의 교집합 = polytope

for i = 0..N-1:
  for j = 0..i:
    A[i,j] = f(i,j)

Ω = 삼각형 영역
     j
     │ ●
     │ ● ●
     │ ● ● ●
     └────── i

3 Access Relation A

A : Ω → Mem
A(i, j) = (i + j, i − j) 각 iteration이 touch하는 memory location. affine func of indices.

read access, write access 따로
파라미터 (N, M, …)도 포함 가능 → symbolic

4 Schedule S

정의 Ω의 각 점에 logical 실행 시점을 할당하는 affine 함수. lexicographic 순서가 실행 순서.

S(i, j) = (i, j) — original loop
S'(i, j) = (j, i) — loop interchange
S"(i, j) = (i+j, j) — skewing legal iff dependence D를 위반하지 않음

5 Dependence D ★

D ⊂ Ω × Ω: (src, dst) 쌍의 집합
RAW·WAR·WAW 세 종류
legality: ∀(s, d) ∈ D, S(s) < S(d) (lex)
dependence vector = d − s

6 Transformation 예

변환	S 변화
interchange	(i,j) → (j,i)
skewing	(i,j) → (i+j, j)
tiling	(i,j) → (i/B,j/B,i%B,j%B)
reversal	(i,j) → (-i, j)
fusion	두 schedule의 lex 결합

7 표현력의 강점

모든 affine loop 변환을 통합적 수식화
legality를 기계적으로 판정
optimal schedule을 ILP·LP로 탐색 가능
→ Pluto 같은 auto-parallelizer가 등장

8 한계

non-affine index (A[B[i]]) → 표현 밖
data-dependent control flow 표현 제한
실무 GPU kernel은 일부 영역만 polyhedral로 해석

affine ≠ linear. constant term 허용. 예: 2i + 3j + 5는 affine. i*j는 아님.

1 MLIR `affine` dialect ★

의미 MLIR의 built-in dialect. loop·map을 affine 제약으로 표현해 polyhedral 분석을 열어줌. ↗ V12 §5

op: affine.for, affine.if, affine.load/store, affine.apply
access index: affine map (d0,d1) -> (d0+d1, d0)
lowering 대상: scf/llvm/GPU

2 affine 예시 MLIR 조각

affine.for %i = 0 to %N {
  affine.for %j = 0 to (%i + 1) {
    %v = affine.load %A[%i, %j] : memref<?x?xf32>
    affine.store %v, %B[%j, %i]
      : memref<?x?xf32>
  }
}

3 ISL Integer Set Library

Presburger 연산 · polyhedra manipulation의 표준 라이브러리
set, map, union_set, union_map 등
MLIR의 Presburger util이 ISL과 유사 디자인
GCC graphite·LLVM Polly도 ISL 사용

4 Pluto ★ 자동 병렬화

정의 affine transformation을 ILP로 찾는 polyhedral compiler. 목적함수: outer loop 병렬성 + inner loop locality.

tiling·skewing·fusion 자동 결정
OpenMP pragma까지 생성
CPU 중심 태생, GPU 확장(PPCG)도 존재

5 자동 Tiling 조건

tileable iff ∀ dep vec d: S(d) ≥ 0 component-wise
(fully permutable band) skewing으로 fully permutable band를 만들어 tile 가능하게 하는 게 핵심

6 Stencil Skewing 예 고전

for t in 0..T:
  for i in 1..N-1:
    A[t+1,i] = A[t,i-1]+A[t,i]+A[t,i+1]

d1=(1,-1), d2=(1,0), d3=(1,1)
t축 tiling 불가 (d1 t+ but i-)
skew (t, i) → (t, i+t):
  새 dep = (1,0),(1,1),(1,2) → fully permut
  → time-tile 가능 (locality ↑)

7 GPU에의 적용

툴	타겟	설명
PPCG	CUDA	polyhedral → CUDA C
Polly	LLVM IR	loop opt
MLIR affine	GPU·CPU	모듈식 polyhedral
Tiramisu	다양	scheduling + polyhedral 혼합

8 Schedule primitive와의 관계

TVM schedule primitive는 대다수 affine 변환 — polyhedral의 부분집합
tensorize·cache_*는 poly 밖 (HW-aware)
Ansor 내부 일부 결정은 poly 관점 해석 가능

9 "자동화" 한계

GPU에서는 memory coalescing·bank conflict가 affine 밖 결정
cost model 없으면 "legal but slow"
→ polyhedral + learned cost 혼합이 실용 방향

polyhedral은 legality 보장은 강력하지만 GPU 성능은 별도의 휴리스틱·탐색 필요.

1 10-기준 매트릭스 ★★

기준	Triton	XLA
추상 수준	tile DSL	graph (HLO)
사용자 제어	높음	낮음
최적화 자동성	중 (autotune)	높음 (passes)
GPU 성숙도	NV高 · AMD中	NV高 · TPU高
프론트엔드	Py DSL	JAX / TF
IR 계층	TTIR→TTGIR	HLO 1층
Fusion	tile 내부	graph pass
Auto-tune	config sweep	algo picker
생태계	OpenAI · LF	OpenXLA · JAX
Hopper 지원	sm_90 · TMA	Triton emitter

2 10-기준 매트릭스 (cont.) ★★

기준	TVM	Inductor
추상 수준	Relay/TE/TIR 3층	FX graph + Loop IR
사용자 제어	중~높음 (수동 sched)	낮음
최적화 자동성	높음 (MetaSched)	높음
GPU 성숙도	중	NV高
프론트엔드	ONNX / Relay	PyTorch / Dynamo
IR 계층	3층	Aten + Loop IR
Fusion	op + schedule	scheduler + codegen
Auto-tune	AutoTVM·Ansor	heuristic + Triton tune
생태계	Apache · edge	Meta · PyTorch
Hopper 지원	제한적	Triton 경유

3 축별 1-liner 요약

추상: Triton 낮음, XLA 높음, TVM 가변, Inductor 중간
제어: 사용자가 원하면 Triton/TVM-수동
자동: 맡기면 XLA/Inductor/Ansor
생태: 프레임워크가 이미 XLA/Inductor를 품음

4 Fusion 정책 비교

툴	결정자
XLA	cost model + rule
TVM	Relay pattern + schedule
Inductor	scheduler (group ops)
Triton	사용자가 tile 내부에서 결합

5 선택 요약

Python framework → 그 framework의 default (XLA or Inductor)
kernel 단위 실성능 → Triton
multi-backend 배포 → TVM
edge/μC → microTVM

4 축 한 줄: 추·제·자·생 (추상·제어·자동·생태계)

1 6 방법 비교표 ★

방법	샘플 효율	전제
Grid	낮음	공간 작음
Random	낮음~중	baseline
Bayesian	중~높음	부드러운 함수
Evolutionary	중	조합/이산
RL	가변	많은 data
Cost model	높음	feature 가능

2 Grid Search

모든 조합 완전 탐색
차원 ↑ → 지수 폭발 (curse of dim)
3~4 knob × 적은 value면 실용

3 Random Search

p(miss) = (1 − f)^N
f = top-f% fraction, N = trials Bergstra 2012: 많은 공간에서 grid 대비 동등 이상

4 Bayesian Optimization ★

surrogate: GP · TPE · RF
acquisition: EI · UCB · PI
연속·저차원에 강
GPU schedule은 이산 많음 → TPE / SMAC 선호

x_next = argmax_x α(x | D)
α(EI) = 𝔼[max(f^* − f(x), 0)]

5 Evolutionary / GA

population N 초기화
cost·measure로 fitness
mutation + crossover
selection (elitism, tournament)
N generation 반복

Ansor·Meta Schedule은 이 경로 기본

6 RL 기반

state = partial schedule, action = primitive 적용
reward = −time (measure)
PPO·A3C 등 적용 시도
실무 대비 학습 비용·샘플 비용 높음

7 Cost Model Guided

핵심 모든 후보를 측정하지 않고 학습된 cost model로 top-K만 측정. measure → model update 루프.

XGBoost / MLP / Transformer
rank loss 선호 (절댓값 아닌 순위)
out-of-distribution 위험

8 Measure 비용 모델

T_total = N · (T_build + T_run + T_sync)
T_build ≫ T_run 인 경우 많음

실제로는 build 병렬화·cache가 절대 필요

9 실무 선택 가이드

공간 ≤ 10³: grid
공간 ~ 10⁴: random + top-K measure
공간 ≥ 10⁵: cost-model + evolutionary
RL은 연구용, 대규모 서비스에선 성능/비용 비 낮음

cost model은 shape distribution에 강하게 의존. 새 shape이면 재학습 필요.

1 이중 위험 ★

원칙 너무 넓으면 탐색 실패, 너무 좁으면 최적 배제. 목적: "좋은 해가 안에 있으면서 탐색이 끝나는" 폭.

넓이 ┐                 ┌ 좋은 해 포함
     │   optimal       │
     │   coverage ↑    │
     │                 │
     └─────────────────┘
탐색  ↑ 비용·시간 ↑

2 공간 축소 기법

divisor·power-of-2만 허용
HW 제약 사전 적용 (smem ≤ 228 KB) ↗ V02 §3
legality 기반 pruning (polyhedral dep)
cost model로 top-K만 측정

3 과소 축소 예시

tile size를 32·64·128만 → 96이 최적인 경우 miss
reorder 후보 고정 → K가 innermost가 아닐 때 miss
tensorize 강제 → 작은 shape에선 오히려 손해

4 좋은 primitive 설계

성질	의미
직교성	primitive가 독립 차원
합성성	순서·조합 가능
안전성	correctness 불변
관찰성	feature 추출 가능

TVM schedule primitive는 대체로 이 성질 만족

5 Locality 축 ★

reuse = accesses / unique_bytes
→ tile size가 reuse를 결정 tile ↑ → reuse ↑ · smem 포화 ↑

locality · parallelism · register pressure 3각 trade
3축 모두 탐색 가능해야 좋은 공간

6 Feature-friendly 설계

primitive 1회 적용 = feature delta 가시
cost model이 feature를 읽을 수 있게 annotation 풍부화
block(TIR) 구조가 이 관점에서 유리 (§8)

7 과도한 축소 신호

best trial이 boundary value (예: 최대 tile)
convergence가 너무 빨리 멈춤
shape 변화에 민감하지 않음 — 의심
→ 공간 확장 후 재측정

8 과도한 확장 신호

trial 수 많은데 cost model 분산 큼
measurement outlier 빈번
generation 간 best 개선 미미
→ legality·HW 제약으로 prune 강화

9 공간 설계 체크리스트 ★

좋은 해가 공간 내부에 있는가 (경계 아님)
cost model feature가 모든 primitive에 반응
HW 한계 상단 pruning 존재
trial budget 내 수렴 근거
shape 바뀌어도 rebuild 부담 적음

공간 3원칙: 포·반·수 (좋은 해 포함·feature 반응·수렴 가능)

1 결정 트리 ★★

START: 어떤 모델/커널인가?
│
├─ 프레임워크가 PyTorch?
│   ├─ Yes → Inductor (기본) ──► 부족하면 Triton
│   │        (↗ V13 §2)
│   └─ No
│       │
├─ 프레임워크가 JAX / TF?
│   ├─ Yes → XLA (기본) ──► 특수 pattern은 Triton emitter
│   └─ No
│       │
├─ multi-backend 배포 (CPU/GPU/Edge)?
│   ├─ Yes → TVM (Relax + Meta Schedule)
│   └─ No
│       │
├─ custom kernel 최고 성능이 목표?
│   └─ Triton (또는 CUTLASS/CuTe ↗ V06)
│
└─ 연구: 자동 tiling·polyhedral 탐구
    └─ MLIR affine + Pluto/PPCG

2 3-초 속도 요약

상황	답
JAX 학습	XLA
PyTorch 학습·추론	Inductor → Triton
vLLM/TRT 외 직접 kernel	Triton/CUTLASS
ONNX 서빙 multi-target	TVM
edge/μC	microTVM
새 HW 타겟 연구	MLIR + TVM

3 흔한 혼동 ★

"XLA = TPU 전용" ❌ → GPU/CPU도 주요 타겟
"TVM = 느림" ❌ → Ansor·MetaSched로 vendor 라이브러리 수준 가능
"Triton = XLA 대체" ❌ → Triton은 kernel, XLA는 graph
"polyhedral = 자동이니 최고" ❌ → GPU perf는 poly 밖

4 언제 Auto-tune 쓸까

shape distribution 고정·반복 생산
build·measure 시간 투자 가능
기준 비교 대상 존재 (cuBLAS 등)
결과를 캐시해 서빙에 재사용

5 Schedule Primitive 11 한 줄

split/fuse — loop 축 재구성
reorder — 축 순서
bind — GPU 실행축
cache_read/write — smem/reg
compute_at — stage 위치
tensorize — HW intrinsic
vectorize/unroll/parallel — inner

6 Polyhedral 4-원소 요약

기호	의미
Ω	iteration domain
A	access relation
S	schedule (affine)
D	dependence

7 권 간 연결

Triton 깊이: ↗ V11
MLIR/LLVM: ↗ V12
Inductor/PT2: ↗ V13
분산: ↗ V15
profiling: ↗ V18

8 마지막 한 줄

원리 컴파일러 선택은 프레임워크 · 제어 욕구 · 타겟 다양성의 곱이다. 네 개를 동시에 쓸 수도 있다.