CUDA COMPILER SERIES · CONTENT-FIRST · A4 LANDSCAPE · 18p

TorchInductor / torch.compile in One Volume

TorchDynamo · AOT Autograd · FX Graph · ATen Decomp · Inductor IR · Triton Codegen

Volume V13/18

Tier T4 Compilers

Prerequisite: V11 (Triton compiler)

Purpose: a map of torch.compile internals

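Legend

Key term (yellow background)

Table header / very important

Definition / concept box

Example / worked box

Red: caution, common mistake

★ essential, frequently referenced

(!) mnemonic (≤5 per volume)

↗ cross-reference to another volume

→ cause / flow

∵∴ reason / conclusion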


Source: PyTorch 2.x · torch/_inductor · torch/fx · torch/_dynamo · Meta blog

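1 What the abstraction is (why it exists)

Definition: torch.compile is a just-in-time path that intercepts eager PyTorch at the frame level, collects the tensor operations into a graph, and generates kernels through a backend compiler.

Removes eager overhead (Python dispatch, allocator) → enables kernel fusion
Graph extraction is partial: tolerating graph breaks is what sets it apart from TF graph mode
With AOT Autograd, backward is compiled from the same IR as forward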

2 Five-stage pipeline ★ Dy·AO·FX·IN·CG

Python fn (eager)
      │  ① TorchDynamo
      │      · CPython frame intercept
      │      · guard 생성
      ▼
  FX Graph  (torch operator level)
      │  ② AOT Autograd
      │      · split fwd + bwd
      │      · joint graph → partitioner
      ▼
  FX Graph  (aten / prims; two graphs: fwd, bwd)
      │  ③ Decomposition
      │      · apply _refs / _decomp
      ▼
  Inductor Input IR (lowered aten)
      │  ④ Inductor
      │      · Loops/Pointwise/Reduction
      │      · Scheduler fusion
      ▼
  Triton Python  /  C++/OpenMP
      │  ⑤ Codegen → ptxas / gcc
      ▼
  PTX + SASS  /  .so

source: torch/_dynamo/convert_frame.py · torch/_functorch/aot_autograd.py · torch/_inductor/compile_fx.py

3 Per-stage responsibility matrix

Stage	Input	Output	Core role
Dynamo	Python bytecode	FX Graph + guards	frame capture
AOT Autograd	FX (torch op)	fwd+bwd FX	builds the differentiated graph
Decomp	high-level aten	low-level prims/aten	canonical form
Inductor	lowered FX	Inductor IR	fusion·schedule
Codegen	Inductor IR	Triton/C++	kernel source

4 Entry API

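import torch

def f(x, y):
    return (x + y).relu() * 2

g = torch.compile(f,
       backend="inductor",        # default backend
       mode="reduce-overhead",    # adds CUDA graph capture on GPU
       fullgraph=False,           # graph breaks are allowed
       dynamic=None)              # let Dynamo choose static vs. dynamic shapes
x, y = torch.randn(8), torch.randn(8)   # example inputs
g(x, y)   # compile on first call
g(x, y)   # guard hit → cached kernel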

5 mode preset

mode	Meaning
`default`	Inductor fusion, no cudagraph
`reduce-overhead`	+ cudagraph capture (e.g. decode loop)
`max-autotune`	+ matmul/conv autotune, coordinate descent (also enables cudagraph, see the CUDA graph section)

6 Cache locations (JIT artifacts)

layer	Path / key
Dynamo	`torch._dynamo.config.cache_size_limit` (per-fn)
Inductor FX	`TORCHINDUCTOR_CACHE_DIR` · default `/tmp/torchinductor_$USER`
Triton	`~/.triton/cache` · key hash ↗ V11 §15
cubin	inside the Triton cache

7 Inductor ↔ other paths

Triton compiler internals ↗ V11: Inductor is a consumer of Triton
MLIR lowering in general ↗ V12
XLA / TVM comparison ↗ V14
distributed training path ↗ V17

Five stages: Dy · AO · FX · IN · CG (Dynamo → AOT → FX(prims) → Inductor → Codegen)

Out of scope: general PyTorch usage (model definition, dataloader, optimizer choice, etc.). This volume covers only the compilation path.

1 Intercepting frame evaluation ★

Definition: TorchDynamo uses the CPython PEP 523 frame-evaluation hook to analyze a Python function's bytecode just before it runs and convert it into an FX Graph.

target: CPython 3.8+ · PEP 523 _PyInterpreterState_SetEvalFrameFunc
Scope: only tensor-producing ops are traced; Python control flow stays in Python
Result: FX Graph + guard set + residual bytecode

source: torch/_dynamo/convert_frame.py (v2.x)

2 Symbolic Evaluator

Python frame
  bytecode = co_code
  for each op:
    if op produces Tensor:
      → trace into FX
      → record Guard(expr)
    elif op needs concrete value:
      → graph break
      → fallback to eager
    else:
      → emit residual bytecode

3 FakeTensor & shape prop

During tracing, a FakeTensor (shape + dtype + device only) stands in for each real tensor
No real compute happens; only metadata flows through
Dynamic shapes are made symbolic as SymInt ↗ §15

4 Guard system basics ★

guard kind	What it checks
`TENSOR_MATCH`	dtype · device · stride · size
`CONSTANT_MATCH`	Python int/bool/None constants
`TYPE_MATCH`	object type identity
`ID_MATCH`	identity of a specific object (module param)
`DUPLICATE_INPUT`	aliasing relationship (x is y)
`DICT_KEYS`	set of keys in a dict

Guards are lowered to C++, so each check costs on the order of nanoseconds per call (v2.3+)

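5 Graph break conditions

Data-dependent Python control flow (if x.item() > 0)
Unsupported builtins (print, input, etc.)
.numpy() / CPU scalar extraction
Unsupported torch.* ops (a list exists)
An explicit torch._dynamo.graph_break() call in user code (a function call, not a decorator)

With fullgraph=True, a graph break raises an error instead of splitting the graph. Useful for debugging.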

6 explain API (diagnostics)

from torch._dynamo import explain
exp = explain(f)(x, y)
print(exp.graph_count,
      exp.graph_break_count,
      exp.break_reasons)

Returns: the number of graphs, the number of breaks, and (op, reason, stack) for each break

7 Recompile triggers

Condition	Recompiles?
shape change (static)	yes
shape change (dynamic=True)	usually no
dtype change	yes
device change	yes
value change (Tensor)	no
Python constant change	yes
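
The table can be checked empirically by counting how often Dynamo hands a graph to the backend. The sketch below is a minimal illustration, assuming a recent PyTorch 2.x where `torch.compile` accepts a plain callable as `backend`; `counting_backend` and the two counter variables are hypothetical names introduced here, not part of PyTorch.

```python
import torch
import torch._dynamo

compile_count = 0  # incremented each time Dynamo asks the backend to compile a graph


def counting_backend(gm, example_inputs):
    """Hypothetical backend: count the compilation, then run the captured graph as-is."""
    global compile_count
    compile_count += 1
    return gm.forward


def f(x):
    return (x * 2).relu()


torch._dynamo.reset()                       # start from an empty Dynamo cache
g = torch.compile(f, backend=counting_backend)

g(torch.ones(4))                            # first call: compile
g(torch.zeros(4))                           # same shape/dtype, different values: guard hit
same_meta_count = compile_count

g(torch.ones(4, dtype=torch.float64))       # dtype change: guard fails, recompile
after_dtype_count = compile_count

print(same_meta_count, after_dtype_count)
```

Using a pass-through backend keeps the experiment independent of Inductor and of any C++/Triton toolchain, so only Dynamo's guard behaviour is being observed.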

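8 cache_size_limit

The number of recompiles per frame is capped by torch._dynamo.config.cache_size_limit (default 8). Once the cap is exceeded, that frame runs in eager mode. If many shape changes are expected, try dynamic=True first.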

1 Why it exists ★

Definition: AOT Autograd generates the backward graph ahead of time from the forward graph that Dynamo extracted, so that Inductor can optimize both fwd and bwd at the same IR level.

eager autograd: the autograd engine builds backward at runtime → not available for fusion
AOT: backward is built as an FX Graph in advance → fusion and decomp can be applied

source: torch/_functorch/aot_autograd.py

2 Joint Graph

inputs:   (x1, x2, ..., xk)          primals
labels:   (l1, ..., lm)               (loss inputs)
outputs:  (y1, ..., yn)               forward outputs
            + (g1, ..., gk)           backward grads

joint_fn(primals, tangents) -> (outs, grad_ins)
  = torch.autograd.Function.apply
     but traced into ONE FX graph

3 aot_function API

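from torch._decomp import core_aten_decompositions
from torch._functorch.aot_autograd import aot_function
from torch._functorch.partitioners import default_partition

# fn, my_fw and my_bw are placeholders; each compiler maps (FX GraphModule, inputs) -> callable
compiled = aot_function(
   fn,
   fw_compiler=my_fw,
   bw_compiler=my_bw,
   partition_fn=default_partition,
   decompositions=core_aten_decompositions())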

4 Partitioner ★

Definition: the pass that splits the joint graph into a forward subgraph and a backward subgraph. The boundary is made explicit as the set of saved tensors.

default_partition: saves the same activations as eager autograd
min_cut_rematerialization_partition: shrinks the saved-tensor set by recomputing (checkpointing) ↗ V17 §7

5 Choosing saved tensors

saved = {t : t ∈ fwd outputs ∧ t required by bwd}

The min-cut partitioner treats the joint graph as a directed graph and finds a minimum cut between the fwd side and the bwd side, minimizing the tensors passed across. Cut cost = tensor memory.

6 functionalization

in-place ops (add_, relu_) → rewritten as out-of-place ops
views are not materialized as copies; they are carried as alias info
Reason: fusion/decomp assume a purely functional IR

7 aot_module nn.Module

compiled = aot_module(
   mod,
   fw_compiler=inductor_fw,
   bw_compiler=inductor_bw)

module params/buffers are promoted to graph inputs
the module is compiled in a state-free functional form

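8 Inference-only path

Under torch.no_grad() or inference_mode, AOT generates only the forward graph
The partitioner is not called and the bwd compiler is unused

If an input has requires_grad=True in eager mode, the compile path always produces a bwd graph for it. To get the inference path, wrap the call in no_grad explicitly.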

9 Three paths compared

Path	fwd	bwd
eager	immediate	runtime tape
no_grad compile	FX	none
train compile	fwd FX	bwd FX

1 torch.fx.Graph ★

Definition: an FX Graph is a Python-level DAG in static single assignment form. A Node is an operation; an edge is a tensor value dependency.

A Python object, not a C++ IR
Pretty-printable: print(gm.graph) shows it immediately
Transforms are written as Python rewriters

2 Node kinds (full table)

op	Meaning	target
`placeholder`	graph input	arg name
`get_attr`	module attribute	attr path
`call_function`	free fn	callable
`call_method`	method call on the first arg	method name
`call_module`	sub-module	module path
`output`	graph return	tuple of nodes

3 Node fields

op, target, args, kwargs, name
meta['val']: FakeTensor (shape/dtype/device)
meta['stack_trace']: original source location
users, all_input_nodes: DAG edges
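
To make these fields concrete, the following sketch walks a traced graph and prints each node's `op`, `target`, and `users`. It relies only on `torch.fx.symbolic_trace`; the function `f` is an illustrative example, and `meta['val']` is not shown because plain symbolic tracing does not populate it.

```python
import operator

import torch
import torch.fx


def f(x, W, b):
    return ((x @ W) + b).relu()


gm = torch.fx.symbolic_trace(f)

ops = []
call_targets = []
for n in gm.graph.nodes:
    ops.append(n.op)
    if n.op == "call_function":
        call_targets.append(n.target)
    # users is a dict keyed by the consumer nodes; all_input_nodes lists the producers
    print(n.op, n.name, n.target,
          "inputs:", [i.name for i in n.all_input_nodes],
          "users:", [u.name for u in n.users])
```

Note that `x @ W` and `y + b` are recorded as `operator.matmul` and `operator.add` (Python operator calls), while `.relu()` becomes a `call_method` node.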

4 Example: f(x) = relu(x·W + b)

import torch

def f(x, W, b):
    y = x @ W
    z = y + b
    return z.relu()

gm = torch.fx.symbolic_trace(f)
print(gm.graph)

graph():
  %x : [num_users=1] = placeholder[target=x]
  %w : [num_users=1] = placeholder[target=W]
  %b : [num_users=1] = placeholder[target=b]
  %matmul : call_function[target=operator.matmul](args=(%x, %w))
  %add    : call_function[target=operator.add](args=(%matmul, %b))
  %relu   : call_method[target=relu](args=(%add,))
  return (%relu,)

5 변환 API

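import operator
import torch

# 1) node-by-node walk: rewrite operator.add into torch.add
for n in list(gm.graph.nodes):          # copy the node list: the graph is mutated in the loop
    if n.op == "call_function" and n.target is operator.add:
        with gm.graph.inserting_after(n):
            new = gm.graph.call_function(torch.add, n.args)
        n.replace_all_uses_with(new)
        gm.graph.erase_node(n)
gm.graph.lint()                         # check that the graph is still well-formed
gm.recompile()                          # regenerate forward() from the graph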

replace_pattern: subgraph rewrite
Interpreter: recomputes meta by re-executing the graph
Transformer: framework for node-by-node transforms

6 GraphModule

Definition: GraphModule = Graph + owning nn.Module. Calling .recompile() generates Python source from the Graph and installs it as forward().

Inspect the generated source via gm.code; it is central to eager fallback and debugging.

7 Symbolic Trace vs Dynamo

Axis	symbolic_trace	Dynamo
Mechanism	Proxy objects	bytecode
Control flow	not supported	graph break
dynamic shape	limited	SymInt
guard	none	yes

8 Pass ordering

Dynamo → FX (torch op level)
AOT Autograd → fwd+bwd FX
Decomposition → prims/aten core
Inductor lowering → Inductor IR ↗ §7

symbolic_trace cannot trace through a data-dependent if, list.append, and similar constructs. torch.compile uses Dynamo only.

1 Two layers of ops ★

Layer	Examples	Count
`aten::*`	`softmax`, `addmm`, `layer_norm`	~2000+
`prims::*`	`add`, `mul`, `broadcast_in_dim`, `reduction`	~100

source: torch/_refs · torch/_prims · torch/_decomp

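2 canonical form

Definition: canonical form = the IR obtained by expanding every high-level aten op into a combination of core aten ops or prims. It is the form that Inductor lowers from.

Lets Inductor avoid a case explosion → it only has to handle a reduced op set
Mathematically equivalent, but bit-for-bit numerical identity is not guaranteed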

3 Roles of _refs · _decomp

torch._refs.*: reference Python implementations of ops (built on prims)
torch._decomp.*: aten→aten decompositions (derivative-safe)
core_aten_decompositions(): the core-aten decomposition dict that Inductor's default table builds on

4 Example before/after ★ softmax

BEFORE (torch op level, right after Dynamo):
  %a  = placeholder[target=x]
  %sm = call_function[target=aten.softmax](
          args=(%a, -1))
  return (%sm,)

AFTER decomp (core/prims level):
  %a   = placeholder[target=x]
  %mx  = call_function[target=aten.amax](
          args=(%a, [-1], True))
  %sub = call_function[target=aten.sub](
          args=(%a, %mx))
  %ex  = call_function[target=aten.exp](
          args=(%sub,))
  %sum = call_function[target=aten.sum](
          args=(%ex, [-1], True))
  %div = call_function[target=aten.div](
          args=(%ex, %sum))
  return (%div,)

stable softmax: subtract amax for overflow safety ↗ V09 §9
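
The decomposed form can be written out by hand and compared against `torch.softmax`. This is only a sketch of the amax → sub → exp → sum → div chain listed above, not the decomposition code PyTorch actually registers; `softmax_decomposed` is a name made up for this illustration.

```python
import torch


def softmax_decomposed(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    mx = torch.amax(x, dim=dim, keepdim=True)    # row-wise maximum
    ex = torch.exp(x - mx)                       # shifted exponent: never overflows
    return ex / torch.sum(ex, dim=dim, keepdim=True)


x = torch.tensor([[1000.0, 1001.0, 1002.0],
                  [-3.0, 0.0, 3.0]])

ref = torch.softmax(x, dim=-1)
out = softmax_decomposed(x)

# Without the amax shift, exp(1000) overflows to inf in float32 and inf/inf gives nan.
naive = torch.exp(x) / torch.exp(x).sum(-1, keepdim=True)

print(torch.allclose(out, ref), naive[0])
```

The first row shows why the shift is part of the canonical form: the unshifted version produces nan for large inputs, while the shifted one matches the reference.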

5 Trade-offs before/after decomp

Aspect	before	after
Node count	few	many
Fusion opportunities	none	many
Semantic level	high-level	low-level
Backward generation	custom rule	auto

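6 Registering a decomp

import torch
from torch import Tensor
from torch._decomp import register_decomposition

aten = torch.ops.aten
my_decomps = {}   # private table: aten.mse_loss already has a built-in decomp

@register_decomposition(aten.mse_loss, registry=my_decomps)
def mse_loss(x: Tensor, t: Tensor,
             reduction: int = 1):
    d = (x - t)
    loss = d * d
    if reduction == 1:      # Reduction.MEAN
        return loss.mean()
    if reduction == 2:      # Reduction.SUM
        return loss.sum()
    return loss             # Reduction.NONE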

7 Decomp priority

user custom decomp
_refs (built on prims)
_decomp (aten→aten)
no decomp → Inductor fallback (extern kernel)

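8 Ops kept opaque

aten::mm, aten::convolution: left as extern kernel calls (cuBLAS/cuDNN) instead of being decomposed
Inductor's max-autotune mode also generates Triton matmul candidates

Pitfall: if a user-defined op has no decomp registered, Inductor falls back to calling it eagerly → the fusion opportunity is lost.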

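1 Entry point ★

Definition: Inductor is entered through compile_fx(gm, example_inputs). By the time the inner compile runs, gm is an FX GraphModule with decomps already applied, and example_inputs is the list of example tensors (FakeTensors during tracing).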

source: torch/_inductor/compile_fx.py :: compile_fx_inner

2 Input invariants

The graph is functional (no in-place ops)
Every node carries meta['val'] (a FakeTensor)
Ops are core aten + prims + a small allowed set
Dynamic shapes are expressed as SymInt

3 FakeTensor

Field	Use
`shape`	e.g. (s0, 64, 64)
`stride`	layout inference
`dtype`	fusion compatibility
`device`	codegen branch
`storage_offset`	view tracking

4 Shape Propagation

for node in graph.nodes:                  # pseudocode
  fake_args = [a.meta['val'] for a in node.all_input_nodes]
  out_fake  = node.target(*fake_args)     # FakeTensor op
  node.meta['val'] = out_fake             # record shape/dtype

FakeTensor ops call meta kernels: no real memory is touched
Dispatch is routed through the Meta key
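
The metadata-only behaviour can be observed directly with `FakeTensorMode`. The sketch below assumes the private module path `torch._subclasses.fake_tensor`, which is where this class lives in PyTorch 2.x but is not a stable public API.

```python
import torch
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode

# Inside the mode, factory functions return FakeTensors and ops run only
# their shape/dtype logic; no matmul is actually executed.
with FakeTensorMode():
    a = torch.empty(4, 8)
    w = torch.empty(8, 3)
    out = torch.relu(a @ w)

out_is_fake = isinstance(out, FakeTensor)
print(out_is_fake, out.shape, out.dtype)
```

This is the same propagation step that fills `meta['val']` during tracing: the output shape `(4, 3)` is derived from the input metadata alone.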

5 Symbolic Shape dynamic

s_i ∈ SymInt · constraint set C

Inductor keeps shapes symbolic as long as C stays free of contradictions. Examples: s0·s1 ≥ 1, s0 mod 8 == 0. The constraints are emitted as guards ↗ §15.

6 Layout Inference

Observe input strides → branch on contiguous/channels_last/custom
The output layout defaults to make_contiguous_strides_for(shape)
The layout decision affects what can be fused

7 Lowering 개요

for node in gm.graph.nodes:
  lower = LOWERINGS[node.target]
  ir_out = lower(*input_ir_tensors,
                 **kwargs)
  env[node] = ir_out                # TensorBox

Per-op lowering functions are registered in a dict
The result is a TensorBox (wrapping Loops IR) ↗ §7

8 Fallback policy

op kind	Inductor handling
pointwise	Pointwise IR
reduction	Reduction IR
matmul/conv	extern (cuBLAS/cuDNN) or Triton template
complex custom	eager fallback

9 Backend routing

FakeTensor.device == "cuda" → Triton codegen
FakeTensor.device == "cpu" → C++/OpenMP codegen ↗ §10
mixed device → split subgraph

When shape propagation fails, a guard is added and the graph is recompiled. Frequent failures are a sign that the dynamic-shape policy is too aggressive.

1 IR hierarchy ★

Definition: Inductor IR is a set of tensor-producing rules defined by index expressions. Each node is described by an output shape and an inner function f(idx) → value.

source: torch/_inductor/ir.py

2 Core classes (full table)

class	Role
`TensorBox`	wrapper around a lowering result
`Loops`	base class that carries an iteration space
`Pointwise`	f(idx) → v (no reduction)
`Reduction`	reduction axes + combine
`Scatter`	indirect store
`MatMul`	extern / Triton template
`Buffer`	realized storage
`ComputedBuffer`	Loops + Buffer pair

3 Pointwise example

def relu_add(a, b):
    def inner(idx):
        x = a.load(idx)
        y = b.load(idx)
        return ops.maximum(x + y, 0)
    return Pointwise(
       device=a.device,
       dtype=a.dtype,
       inner_fn=inner,
       ranges=a.size)

inner_fn(idx) maps a symbolic index → value
Composition is plain function composition (the basis of vertical fusion) ↗ §8
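
The idea that fusion falls out of function composition can be shown with a toy model in plain Python. `ToyPointwise`, `from_buffer`, `add`, `relu`, and `realize` are hypothetical stand-ins invented for this sketch; they mimic the shape of the Inductor classes but are not the real API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyPointwise:
    """Toy stand-in for a Pointwise node: a size plus an index -> value function."""
    size: int
    inner_fn: Callable[[int], float]

    def load(self, idx: int) -> float:
        # Not realized: "loading" just evaluates the producer's inner_fn inline.
        return self.inner_fn(idx)


def from_buffer(data: List[float]) -> ToyPointwise:
    return ToyPointwise(len(data), lambda i: data[i])


def add(a: ToyPointwise, b: ToyPointwise) -> ToyPointwise:
    return ToyPointwise(a.size, lambda i: a.load(i) + b.load(i))


def relu(a: ToyPointwise) -> ToyPointwise:
    return ToyPointwise(a.size, lambda i: max(a.load(i), 0.0))


def realize(node: ToyPointwise) -> List[float]:
    # One loop evaluates the whole composed expression: no intermediate buffer for add.
    return [node.inner_fn(i) for i in range(node.size)]


fused = relu(add(from_buffer([1.0, -5.0, 2.0]), from_buffer([1.0, 2.0, -4.0])))
result = realize(fused)
print(result)
```

Because `add` is never realized, the only loop that runs is the one in `realize`, which is the toy equivalent of a single fused kernel.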

4 Reduction example

Reduction(
   device=x.device,
   dst_dtype=torch.float32,
   src_dtype=x.dtype,
   inner_fn=lambda idx, ridx:
       x.load(idx + ridx),
   ranges=[M],
   reduction_ranges=[N],
   reduction_type="sum")

non-reduced axis = ranges
reduced axis = reduction_ranges
type: sum · prod · max · min · argmax · argmin · welford

5 ops namespace

Group	op
arith	`add · sub · mul · div`
math	`exp · log · sqrt · sin · cos`
relu-like	`maximum · minimum · where`
compare	`eq · lt · le · gt · ge`
cast	`to_dtype · to_dtype_bitcast`
memory	`load · store · index_expr`

ops is reinterpreted by each backend (Triton ops vs C++ ops)

6 Buffer kinds

InputBuffer: graph input / placeholder
ComputedBuffer: a tensor that Inductor will generate
ExternKernel: node that calls cuBLAS/cuDNN

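7 When realization happens

realize(node) = ∃ multiple consumers ∨ reduction boundary

A tensor that has several consumers or that crosses a reduction is realized, i.e. an actual buffer is allocated for it. Otherwise its inner_fn is inlined into the consumer, which is the mechanism behind pointwise fusion.

The IR is side-effect-free. In-place updates are not allowed, and stores appear only at the backend codegen stage.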

1 Scheduler role ★

Definition: the Scheduler wraps each IR node in a SchedulerNode, builds the dependency graph, then groups nodes into FusedSchedulerNodes to decide kernel boundaries.

source: torch/_inductor/scheduler.py :: Scheduler

2 Fusion kinds

Kind	Meaning
vertical	merge producer→consumer (inner_fn composition)
horizontal	merge independent nodes that share an iteration space
reduction+pointwise	merge the elementwise op that follows a reduction
epilogue	merge the pointwise op that follows a matmul (Triton template)

3 What fusion buys

one kernel launch → overhead ↓
intermediate buffers eliminated → memory BW ↓
values reused in registers → L2/HBM traffic ↓
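
The memory-traffic saving is easy to quantify for the running `relu(x + y)` example. The arithmetic below is a back-of-the-envelope sketch that counts only global-memory reads and writes of full tensors and ignores caches; the element count and float32 item size are illustrative choices.

```python
def bytes_moved(numel: int, itemsize: int, reads: int, writes: int) -> int:
    """Global-memory traffic of one elementwise kernel, counting whole-tensor accesses."""
    return numel * itemsize * (reads + writes)


N = 1 << 20          # elements per tensor (illustrative)
ITEMSIZE = 4         # float32

# Unfused: add kernel (read x, y; write tmp) then relu kernel (read tmp; write out).
unfused = bytes_moved(N, ITEMSIZE, reads=2, writes=1) + bytes_moved(N, ITEMSIZE, reads=1, writes=1)

# Fused: one kernel reads x and y and writes out; tmp lives in registers.
fused = bytes_moved(N, ITEMSIZE, reads=2, writes=1)

print(unfused, fused, unfused / fused)
```

For a memory-bound elementwise chain this 5:3 traffic ratio, plus one saved launch, is roughly the speedup one can hope for from fusing the two ops.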

4 Fusible / non-fusible conditions ★

producer	consumer	fuse?	Condition
pointwise	pointwise	yes	same shape · elementwise index
pointwise	reduction	yes	pointwise feeds the reduction
reduction	pointwise	conditional	reduction output used without broadcast
reduction	reduction	conditional	compatible axes · persistent possible
pointwise	matmul	yes (prologue)	Triton matmul template only
matmul	pointwise	yes (epilogue)	Triton matmul template only
scatter	any	no	indirect store → realize
any	mutation	no	in-place boundary
dtype A	dtype B	yes	cast op inserted automatically
device A	device B	no	kernel boundary

5 Limits on fusion

shared memory capacity (persistent reduction)
register pressure: fusing too much causes spills
iteration space mismatch
dtype·device boundaries
explicit realize() hint

6 Algorithm overview

nodes = topo_sort(IR graph)
for n in nodes:
  for pred in n.deps:
    if can_fuse(pred, n):
      fuse(pred, n)            # vertical
for n in nodes:
  for sibling in same_range(n):
    if can_fuse_h(n, sibling):
      fuse_h(n, sibling)       # horizontal

The cost model is a heuristic based on memory bytes
loop-order-aware (loop_ordering_after_fusion)

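7 Iterated fusion

After a round of fusion, fusion is attempted again, repeating until a fixed point is reached. One fusion can create new opportunities (e.g. a reduction boundary disappears after pointwise nodes are merged). The iteration cap is set in config.

Pitfall: with many graph breaks, each subgraph is independent and fusion opportunities are lost. Reduce graph breaks first ↗ §2.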

1 Entering the Triton backend ★

Definition: for CUDA codegen, Inductor's TritonKernel class synthesizes Triton Python source from a FusedSchedulerNode, writes it to a file, and compiles it through @triton.jit.

source: torch/_inductor/codegen/triton.py

2 Generated code skeleton

import triton
import triton.language as tl

@triton.jit
def triton_poi_fused_add_relu_0(
    in_ptr0, in_ptr1, out_ptr0,
    xnumel, XBLOCK: tl.constexpr):
  xoffset = tl.program_id(0) * XBLOCK
  xindex = xoffset + tl.arange(0, XBLOCK)
  xmask = xindex < xnumel
  x0 = tl.load(in_ptr0 + xindex, xmask)
  x1 = tl.load(in_ptr1 + xindex, xmask)
  r  = tl.maximum(x0 + x1, 0.0)
  tl.store(out_ptr0 + xindex, r, xmask)

kernel name: triton_{kind}_fused_{ops}_{index}
kind: poi pointwise · red reduction · per persistent

3 Choosing block sizes

Axis	Symbol	Default
x (linearized)	XBLOCK	256 / 1024
y (2D)	YBLOCK	16 / 32
r (reduction)	RBLOCK	8 / 16 / 32 ..

Autotune config candidates are generated in triton_heuristics.py, branching on size_hint.

4 Reduction forms

persistent: the whole reduction dimension is held in registers at once → no loop · R ≤ ~2048
looped: reduction loop + accumulator · no limit on R
split: two kernels (partial + final)
Selection criteria: reduction numel, shared memory capacity

5 Structure of the output code

call_fn:
  args: input ptr, output ptr, shape
  grid: launch config
  stream: current CUDA stream
  triton_kernel[grid](... , XBLOCK=256)

6 Horizontal fusion in codegen

Several outputs are written by one kernel over the same grid
tl.store is repeated into different output buffers
Launch overhead is amortized and loaded inputs are reused

7 Matmul Template

Definition: max-autotune uses the Triton mm template (kernel/mm.py), benchmarks the candidate tile sizes exhaustively, and caches the best config.

prologue: fuses load_scale, casts, etc.
epilogue: fuses bias, gelu, and other pointwise ops
Candidates differ across MI300/A100/H100

8 Kernel naming rules

prefix	Meaning
`triton_poi_*`	pointwise
`triton_red_*`	looped reduction
`triton_per_*`	persistent reduction
`triton_tem_*`	template kernel (mm/conv)

The name tells you at a glance which kind of kernel Inductor generated.

Generated kernels enter the Triton pipeline ↗ V11 §2. num_warps and num_stages are decided by Inductor's autotune configs.

1 Why it exists

Definition: for CPU devices, Inductor synthesizes C++ source, builds a .so with gcc/clang, and loads it with dlopen.

source: torch/_inductor/codegen/cpp.py · cpp_wrapper.py

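2 Output code skeleton

#include <ATen/cpu/vec/vec.h>
using Vec = at::vec::Vectorized<float>;

extern "C" void kernel(
    const float* in0,
    const float* in1,
    float* out,
    long N) {                        // sketch: assumes N % Vec::size() == 0
  #pragma omp parallel for
  for (long i = 0; i < N; i += Vec::size()) {
    auto a = Vec::loadu(in0 + i);
    auto b = Vec::loadu(in1 + i);
    // relu(a + b): Vectorized has no relu(), so clamp at zero instead
    at::vec::maximum(a + b, Vec(0.f)).store(out + i);
  }
}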

3 Vectorization paths

ISA	width	type
AVX2	8 × float32	`__m256`
AVX-512	16 × float32	`__m512`
NEON (ARM)	4 × float32	`float32x4_t`
SVE	scalable	`svfloat32_t`

at::vec::Vectorized<T> is the ISA abstraction.

4 OpenMP parallelization

#pragma omp parallel for on the outer loop
thread count = torch.get_num_threads()
reductions use #pragma omp parallel for reduction(+:acc)

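5 Reduction scheme

split → parallel partial → serial final

When N is large, each thread writes a partial accumulator and the partials are combined at the end. The numerical result depends slightly on the thread count (floating-point addition is not associative).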
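
The thread-count dependence comes purely from the order of additions, which a sequential simulation can reproduce. In this sketch `chunked_sum` is a hypothetical helper that imitates per-thread partial accumulators followed by a serial final sum; no real threads are involved.

```python
import math
import random


def serial_sum(xs):
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def chunked_sum(xs, num_threads):
    """Simulate 'parallel partial, serial final': one partial accumulator per thread."""
    chunk = math.ceil(len(xs) / num_threads)
    partials = [serial_sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return serial_sum(partials)


random.seed(0)
data = [random.uniform(-1e3, 1e3) for _ in range(100_000)]

s1 = chunked_sum(data, 1)     # identical to the plain serial sum
s8 = chunked_sum(data, 8)     # same values, different association order

# The two results agree to within rounding error but need not be bit-identical.
print(s1, s8, abs(s1 - s8))
```

When comparing compiled CPU output against eager, a tolerance-based check is therefore more appropriate than exact equality.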

6 Build path

.py source
  → write /tmp/torchinductor_*/xx.cpp
  → compile (gcc/clang) -> .so
  → dlopen -> ctypes fn ptr
  → call via cpp_wrapper

7 cpp_wrapper mode

config.cpp_wrapper = True → even the Python wrapper is emitted as C++
Removes dispatch overhead → good for deployment
Foundation of the AOT export path (torch.export + AOTInductor)

8 Inductor CPU limitations

Item	CPU	CUDA
matmul	MKL/oneDNN extern	Triton/cuBLAS
fusion depth	shallow	deep
dynamic shape	supported	supported
autotune	limited	rich

This section is for reference only. General CPU optimization is out of scope; the main text covers the GPU path.

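1 Purpose ★

Definition: CUDAGraph is an API that captures a sequence of kernel launches and then replays it with a single cudaGraphLaunch. The CPU-side launch overhead of a few μs per kernel is amortized into one graph launch.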

source: torch/_inductor/cudagraph_trees.py

2 When Inductor enables it

mode="reduce-overhead" or mode="max-autotune"
or config.triton.cudagraphs = True
CUDA devices only

3 Graph capture requirements

Requirement	Reason
static shape	kernel launches depend on shape
static pointer	input tensor addresses must stay fixed
no CPU sync	synchronizing mid-capture is forbidden
no mem alloc	no new allocations during capture
no host→device copy	not capturable

4 Static Input Copy

At run time Inductor copies inputs into static input buffers, then replays
Trade-off: copy cost vs. launch savings
config.triton.cudagraph_skip_dynamic_graphs skips cudagraph for graphs with dynamic shapes

5 CUDA Graph Trees

Definition: cudagraph_trees manages multiple graphs as a tree sharing one memory pool, so that the different branches taken during inference can be replayed efficiently.

default ON in v2.2+ when mode uses cudagraph

6 Re-capture triggers

input shape change → new graph
new tensor object → pointer mismatch → re-capture
warmup phase (the first 2-3 calls are plain runs)

7 Where it helps and where it hurts

Scenario	Benefit	Reason
LLM decode loop	★★★	many small kernels
training step	★★	shape nearly fixed
prefill	★	varied shapes → many re-captures
variable batch	△	graph explosion

8 Patterns that disable it

NCCL collectives (involve synchronization): stream-aware collectives are needed ↗ V15 §15
.cpu() / .item() calls
the dynamic allocator requesting a new block

9 Diagnostic env

env	Effect
`TORCH_LOGS=cudagraphs`	capture log
`TORCH_CUDAGRAPH_TRACE=1`	detailed trace
`TORCHINDUCTOR_CUDAGRAPHS=0`	disable

cudagraph integration on the inference serving path is covered in detail in ↗ V16 §10 (vLLM · SGLang pool).

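1 Configuration paths

torch._inductor.config.*: Python attributes
TORCHINDUCTOR_* env: mapped environment variables
Precedence: env variables set the initial value when the config module is imported; a Python attribute assigned afterwards overrides it; otherwise the built-in default applies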

source: torch/_inductor/config.py

2 Main bool flags

flag	Effect
`triton.unique_kernel_names`	include op names in the kernel name (debugging)
`triton.cudagraphs`	enable cudagraph
`coordinate_descent_tuning`	extra autotuning
`epilogue_fusion`	fuse matmul epilogue
`cpp_wrapper`	generate C++ wrapper
`freezing`	const folding · inference

3 Size-related

flag	Meaning
`max_fusion_size`	maximum number of nodes to fuse
`max_autotune_gemm`	enable autotuning of matmul candidates (bool)
`triton.max_block`	block size cap (x/y/r)

4 Debug flags ★

flag / env	Output
`TORCH_COMPILE_DEBUG=1`	creates an IR dump dir
`TORCH_LOGS="inductor"`	Inductor stage logs
`TORCH_LOGS="dynamo"`	Dynamo logs
`TORCH_LOGS="aot"`	AOT Autograd logs
`TORCH_LOGS="output_code"`	generated Triton/C++ source
`TORCH_LOGS="schedule"`	scheduler decisions

5 Cache dir layout

/tmp/torchinductor_$USER/
  └ <fxgraph_hash>/
      ├ output_code.py        # Python wrapper
      ├ fx_graph_readable.py  # pre-lowering
      ├ fx_graph_runnable.py  # re-executable
      ├ fx_graph_transformed.py
      ├ triton_<hash>.py      # kernel source
      └ fx_graph.aot_inductor.so (cpp_wrapper)

6 Shape policy

flag	Effect
`dynamic_shapes`	allow symbolic shapes
`assume_static_by_default`	treat the first call as static
`automatic_dynamic`	on recompile, make the changed axis symbolic automatically

For detailed behavior see ↗ §15 Guard.

7 Trace dump

TORCHINDUCTOR_TRACE=1 → an NVTX range on each kernel call
Loads and kernels can be observed separately in Nsight Systems ↗ V18 §12

8 Recommended combinations

Goal	flag
decode kernel names	`unique_kernel_names=True`
inspect IR	`TORCH_COMPILE_DEBUG=1`
kernel source	`TORCH_LOGS=output_code`
reproduce	run `fx_graph_runnable.py`

Five debugging envs: LOGS · DEBUG · TRACE · CACHE · NAMES

1 The backend argument

Definition: in torch.compile(fn, backend=...), backend is either a function that compiles the FX GraphModule extracted by Dynamo into a callable, or the name of a registered one.

2 Built-in list

name	Description
`inductor`	default · TorchInductor
`eager`	Dynamo only (graph capture test)
`aot_eager`	Dynamo + AOT Autograd, executed eagerly
`cudagraphs`	cudagraph without Inductor
`onnxrt`	ONNX Runtime
`tvm`	TVM ↗ V14

3 When to use eager?

Isolating the cause of a graph break → eager
Isolating the cause of a bwd error → aot_eager
Isolating an Inductor bug → bisect with aot_eager

4 Custom backend signature

from torch._dynamo import register_backend

@register_backend
def my_be(gm, example_inputs):
    # gm: torch.fx.GraphModule
    # example_inputs: list of example tensors (may be FakeTensors)
    return lambda *args: gm(*args)

signature: (gm, example_inputs) → callable
The callable must be equivalent to eager execution
The backend chooses whether to call AOT Autograd internally

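5 Combining with AOT Autograd

from functorch.compile import \
    aot_module_simplified

def my_be(gm, inputs):
    # my_compile is a placeholder: (FX GraphModule, inputs) -> callable
    return aot_module_simplified(
        gm, inputs,
        fw_compiler=my_compile,
        bw_compiler=my_compile)

For a backend to compile both fwd and bwd, it must call AOT explicitly.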

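6 Backend comparison axes

Axis	inductor	onnxrt	tvm
IR level	Inductor IR	ONNX	Relay/TIR
dynamic shape	supported	limited	limited
training	supported	not supported	not supported
cudagraph	built-in	external	external

For a detailed comparison see ↗ V14 §15.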

7 list_backends()

import torch
print(torch._dynamo.list_backends())
# names of the currently available backends

8 minifier path

TORCHDYNAMO_REPRO_AFTER=dynamo
TORCHDYNAMO_REPRO_AFTER=aot
Reproduces a bug as a minimal FX graph → emitted as fx_graph_runnable.py

If a custom backend returns a callable with its own control flow, Dynamo cannot tie it to the guards. The backend must return a callable equivalent to the FX Graph.

1 Why it is needed

Definition: to expose a hand-written CUDA/Triton kernel to the torch.compile path, op registration is required. A plain Python function may not be traceable by Dynamo.

2 Four kinds of impl

Kind	dispatch key	Use
CUDA impl	CUDA	actual GPU kernel
CPU impl	CPU	reference
Meta / Fake impl	Meta	shape inference
Autograd impl	AutogradCUDA	backward rule

3 custom_op API

@torch.library.custom_op(
  "my::scaled_add",
  mutates_args=())
def scaled_add(
  x: torch.Tensor,
  y: torch.Tensor,
  alpha: float) -> torch.Tensor:
    return x + alpha * y

4 Fake / Meta impl ★

@scaled_add.register_fake
def _(x, y, alpha):
    torch._check(x.shape == y.shape)
    return torch.empty_like(x)

The fake impl returns only shape/dtype: no real compute
It is called on the FakeTensor path (Dynamo·Inductor·AOT)
Without it, tracing fails → eager fallback

5 Backward impl

def setup_ctx(ctx, inputs, output):
    x, y, alpha = inputs
    ctx.alpha = alpha

def bwd(ctx, grad):
    return (grad,
            ctx.alpha * grad,
            None)

scaled_add.register_autograd(
    bwd, setup_context=setup_ctx)

6 Registering an Inductor lowering

from torch._inductor.lowering import register_lowering

@register_lowering(
    torch.ops.my.scaled_add)
def _(x, y, alpha):
    return x + alpha * y    # IR ops

Once registered, Inductor inlines the op so it can take part in fusion
Unregistered, it stays an extern kernel (not fusible)

7 Registration layers at a glance

Step	Required?
kernel impl (CUDA)	required
fake impl	required (compile path)
autograd rule	required for training
inductor lowering	optional (when fusion is wanted)
schema mutates_args	when the op mutates inputs in place

8 torch.library namespace

Library("my", "DEF"): declare new ops
Library("aten", "IMPL"): override existing aten ops
torch.library.opcheck(fn, args): checks that the registration is consistent

Pitfall: doing real compute in the fake impl instead of torch.empty_like → Inductor demands real data during shape prop, causing OOM and slowdowns.

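1 What guards are responsible for

Definition: a guard is a predicate that checks, on every call, whether a cached compile result is valid for the current call. If any guard is false, the result is a cache miss or a recompile.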
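
A toy model in plain Python may help fix the idea of "predicate per cache entry". Everything in this sketch (`ToyGuardedCache`, `Meta`, the `limit` parameter standing in for `cache_size_limit`) is hypothetical and far simpler than Dynamo's real guard machinery.

```python
from typing import Callable, List, Optional, Tuple

Meta = Tuple[Tuple[int, ...], str]      # (shape, dtype name): the metadata being guarded


class ToyGuardedCache:
    """Toy per-function compile cache keyed by guards (not Dynamo's real code)."""

    def __init__(self, compile_fn: Callable[[Meta], Callable], limit: int = 8):
        self.compile_fn = compile_fn
        self.limit = limit              # plays the role of cache_size_limit
        self.entries: List[Tuple[Callable[[Meta], bool], Callable]] = []
        self.compiles = 0

    def lookup(self, meta: Meta) -> Optional[Callable]:
        for guard, compiled in self.entries:
            if guard(meta):             # guards are re-evaluated on every call
                return compiled
        if len(self.entries) >= self.limit:
            return None                 # over the limit: the caller falls back to eager
        compiled = self.compile_fn(meta)
        self.compiles += 1
        self.entries.append((lambda m, expected=meta: m == expected, compiled))
        return compiled


cache = ToyGuardedCache(lambda meta: (lambda: f"kernel for {meta}"), limit=2)
cache.lookup(((4,), "float32"))                  # miss: compile
cache.lookup(((4,), "float32"))                  # guard passes: cache hit
cache.lookup(((4,), "float64"))                  # dtype guard fails: recompile
over_limit = cache.lookup(((8,), "float32"))     # would be a third entry: limit reached
print(cache.compiles, over_limit)
```

The static guard here is a plain equality on shape and dtype; a symbolic guard would replace that equality with a range or divisibility constraint so that more calls hit the same entry.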

2 Static vs Symbolic

Axis	static	symbolic
Meaning	specialized to a concrete value	generalized as SymInt
guard	equality	range / constraint
codegen	constant embedded	runtime arg
Performance	↑ (constant folding)	slightly ↓
Recompile	when the value changes	rarely

3 SymInt / SymFloat

SymInt = symbolic integer; supports algebraic operations
backing state: ShapeEnv holds the constraints
Shape comparisons are the main source of guards

4 When guards are created

trace:
  if x.shape[0] == 1024:     # guard: s0 == 1024  (static)
  if x.shape[0] > 0:         # guard: s0 > 0        (constraint)
  y = torch.empty(x.shape)   # no guard (shape-passthrough)
  z = x.view(-1, 8)          # guard: s0 mod 8 == 0

source: torch/fx/experimental/symbolic_shapes.py

5 Effect of dynamic=True

All shapes start out as SymInt
Constraints are added only when needed
Far fewer recompiles
At the cost of slightly fewer fusion opportunities

6 mark_dynamic / mark_static

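torch._dynamo.mark_dynamic(x, 0)
# only axis 0 is symbolic; the rest stay static
torch._dynamo.mark_static(y, 1)
# axis 1 is always compiled for its concrete value

Mixed policy: only some axes are dynamic
Call it once on the input tensor before the first compiled call → it acts as a hint to Dynamo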

7 Strategies to reduce recompiles

Align shapes to a fixed set of buckets via padding/bucketing
Make shapes symbolic with dynamic=True or mark_dynamic
Tune cache_size_limit
Remove graph breaks (recompilation happens per graph)
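
Bucketing, the first strategy above, can be sketched in a few lines of plain Python. The bucket sizes and the helper names `next_bucket` and `pad_to_bucket` are illustrative assumptions; a real pipeline would pad tensors and carry an attention or length mask alongside.

```python
from typing import List, Sequence

BUCKETS: Sequence[int] = (16, 32, 64, 128, 256, 512)   # illustrative bucket sizes


def next_bucket(n: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Smallest bucket that fits a sequence of length n."""
    for b in buckets:
        if n <= b:
            return b
    raise ValueError(f"length {n} exceeds the largest bucket {buckets[-1]}")


def pad_to_bucket(seq: List[int], pad_value: int = 0) -> List[int]:
    """Right-pad seq so that its length is one of the bucket sizes."""
    return seq + [pad_value] * (next_bucket(len(seq)) - len(seq))


# Every length from 1 to 512 maps onto one of only six shapes,
# so a statically compiled function sees at most six distinct input shapes.
distinct_shapes = {len(pad_to_bucket([1] * n)) for n in range(1, 513)}
print(sorted(distinct_shapes))
```

The cost is wasted compute on padding, so bucket spacing is a trade-off between recompile count and padding overhead.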

8 Observation APIs

Tool	Output
`TORCH_LOGS="guards"`	list of guards
`TORCH_LOGS="recompiles"`	why a recompile happened
`torch._dynamo.explain`	break/guard summary

9 ShapeEnv

ShapeEnv = (symbols, constraints, value hints)

ShapeEnv manages SymInt symbols and their constraints. Inductor reads ShapeEnv to decide kernel block sizes and what can be fused.

Symptom: a recompile on every call and many "recompiles" log entries → guard overspecialization. Try dynamic=True first.

1 TORCH_LOGS categories

category	Meaning
`dynamo`	frame capture · bytecode
`aot`	joint graph · partition
`inductor`	lowering · scheduler
`output_code`	final Triton/C++ source
`schedule`	fusion decisions
`guards`	guard list
`recompiles`	recompile reasons
`graph_breaks`	break reasons

Several categories at once: TORCH_LOGS="+inductor,output_code"

2 TORCH_COMPILE_DEBUG=1

Dumps all intermediate artifacts to the cache dir
Includes fx_graph, output_code, and fusion decisions
Step-by-step files go under the debug_trace/ subdirectory

3 Inductor dump path ★

$TORCHINDUCTOR_CACHE_DIR or /tmp/torchinductor_$USER/

  <fxhash>/
    ├ fx_graph_readable.py  ← right after Dynamo
    ├ fx_graph_transformed.py ← after decomp
    ├ fx_graph_runnable.py  ← runnable standalone
    ├ output_code.py        ← Python wrapper + triton src
    ├ triton_poi_*.py       ← per-kernel source
    └ debug_trace/
        ├ 0_before_pre_grad_graph.py
        ├ 1_after_decomp_graph.py
        ├ 2_after_post_grad_graph.py
        └ *_ir_pre/post_fusion.txt

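4 Using fx_graph_runnable

# standalone repro
import torch
exec(open("fx_graph_runnable.py").read())
# the args0 and mod variables are then available (names may vary by version)
mod(*args0)

Useful for shrinking a bug down to a minimal FX graph. Same format as the minifier output.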

5 Source directory map

path	Contents
`torch/_dynamo/`	frame capture
`torch/_functorch/`	AOT Autograd
`torch/fx/`	FX Graph
`torch/_decomp/`	decomp rules
`torch/_refs/`	prims refs
`torch/_inductor/ir.py`	Inductor IR
`torch/_inductor/scheduler.py`	fusion
`torch/_inductor/codegen/triton.py`	Triton codegen
`torch/_inductor/codegen/cpp.py`	C++ codegen
`torch/_inductor/config.py`	config · env

6 Linking with traces

NVTX markers label each kernel launch
Identify kernels by their triton_poi_* names on the Nsight Systems timeline
roofline and stall analysis ↗ V18

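7 Common symptoms → hints

Symptom	Check first
every call is slow	graph break / recompile logs
numerical mismatch	disable decomp + aot_eager
OOM	fusion failure → excessive realization
cudagraph not used	mode and whether inputs are static

File names and config keys change often across Inductor versions. Verify against the current source; this volume serves as a structural map.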

1 Pipeline stage × debug card

stage	dump env	source
Dynamo	`TORCH_LOGS=dynamo`	`_dynamo/`
AOT Autograd	`TORCH_LOGS=aot`	`_functorch/`
Decomp	`TORCH_COMPILE_DEBUG=1`	`_decomp/`
Inductor lowering	`TORCH_LOGS=inductor`	`_inductor/ir.py`
Scheduler	`TORCH_LOGS=schedule`	`scheduler.py`
Codegen	`TORCH_LOGS=output_code`	`codegen/`
Guard	`TORCH_LOGS=guards`	`symbolic_shapes.py`
Recompile	`TORCH_LOGS=recompiles`	`_dynamo/`

2 IR produced per stage

Python → FX(torch op) → FX(aten+prims) → Inductor IR → Triton/C++

Five IR snapshots exist. Each graph snapshot is saved under debug_trace/ as *_graph.py.

3 Fusion decisions, one line each

pointwise+pointwise → vertical (default)
pointwise+reduction → producer absorbed
reduction+reduction → only when persistent
around matmul → template prologue/epilogue
scatter/mutation → never fused

4 Five rules for minimizing recompiles

shape bucketing
dynamic=True or mark_dynamic
minimize Python constant args
remove graph breaks
raise cache_size_limit

5 mode decision tree

inference · decode loop      → reduce-overhead
training · large matmul      → max-autotune
debugging · accuracy bisect  → aot_eager
finding graph break causes   → eager

6 Custom op registration checklist

□ kernel impl (CUDA/CPU)
□ register_fake
□ register_autograd (for training)
□ register_lowering (when fusion is wanted)
□ verify consistency with opcheck

7 Six key envs

env	Use
`TORCH_LOGS`	per-stage logs
`TORCH_COMPILE_DEBUG`	full dump
`TORCHINDUCTOR_CACHE_DIR`	cache location
`TORCHINDUCTOR_UNIQUE_KERNEL_NAMES`	decode kernel names
`TORCHINDUCTOR_CUDAGRAPHS`	cudagraph on/off
`TORCHDYNAMO_REPRO_AFTER`	minifier

8 Links to other volumes

Triton internal pipeline·layout ↗ V11
MLIR lower-level lowering in general ↗ V12
XLA/TVM comparison matrix ↗ V14 §15
distributed training path ↗ V17
roofline and stall analysis ↗ V18

9 One-line summary

Summary: torch.compile = a five-stage pipeline in which Dynamo captures the frame, AOT Autograd ties in bwd, decomp lowers to canonical form, and Inductor fuses the IR and emits Triton/C++.

Pipeline in five codes: Dy · AO · FX · IN · CG