CUDA 18-VOL · TIER 4 COMPILER · A4 LANDSCAPE · 18p

MLIR & LLVM for GPU

Dialect 설계 · Progressive Lowering · NVPTX Backend · linalg → PTX 전 과정

Volume 12 / 18

Tier T4 Compiler

선행 V11 (Triton)

용도 dialect 읽기 · lowering 추적 · NVPTX backend 이해

범례

핵심 용어 (노랑)

매우 중요·표 헤더

정의·개념 박스

예시 / 워크드

빨강주의·함정

★실무 핵심

(!)니모닉 (권당 ≤5)

↗cross-ref (다른 권)

→인과·lowering

::dialect namespace 구분자

Out-of-scope non-GPU MLIR 상세 · CIRCT · 특정 프로덕션 컴파일러 (↗ V13 Inductor / ↗ V14 XLA·TVM)

MLIR docs · LLVM NVPTX · llvm-project/mlir · 18 pages

1 MLIR Multi-Level IR multi · dialect · progressive

정의 MLIR = 하나의 IR 프레임워크 안에서 여러 abstraction level이 공존한다. 각 level = dialect. 한 IR 모듈에 고수준 op와 저수준 op가 동시에 존재할 수 있다.

LLVM IR 단일 level → MLIR은 tensor · loop · hardware intrinsic 여러 level
dialect mix-in: 한 함수 body에 linalg·scf·memref 섞어 쓰기 가능
개발자는 새 dialect를 정의해 자기 domain 의미를 그대로 표현한다

2 왜 MLIR이 존재하는가 ★

계층	LLVM IR만 쓸 때 문제	MLIR 해결
Tensor	shape · layout 소실	tensor / memref type
Loop	scalar 루프만	affine / scf region
GPU	intrinsic 풀(flat)	gpu · nvgpu dialect
Domain	표현 불가 (FHE, sparse)	확장 dialect 작성

MLIR docs "Motivation" · 원 도입 2019 (Chris Lattner, Google)

3 Progressive Lowering ★ high → mid → low → target

개념 high-level op를 한 번에 target 코드로 바꾸지 않고, 점진적으로 (progressive) 여러 dialect를 거치며 하강시킨다. 각 단계에서 잃는 정보를 제어한다.

tensor/linalg          ← 고수준 의미 (행렬곱 그 자체)
    ↓ tiling
scf + memref           ← 중수준 루프·버퍼
    ↓ vectorize
vector                 ← SIMD-like tile
    ↓ gpu lowering
gpu + nvgpu            ← GPU 추상화 · Tensor Core
    ↓ nvvm
nvvm                   ← NVVM intrinsic 래핑
    ↓ translate
llvm (NVPTX)           ← LLVM IR
    ↓ NVPTX codegen
PTX assembly

4 "하나의 IR" 원칙

한 .mlir 파일이 lowering 단계마다 다른 모양이 된다
print/parse round-trip 가능 → dump 검증 쉽다
mlir-opt: pass pipeline 한 번에 적용
mlir-translate: MLIR ↔ LLVM IR 변환

5 설계 철학 3-key (!) 재사용 · 확장 · 검증

재사용 (reuse): Pass/Pattern infra는 공유. 각 dialect가 op·type·interface만 추가
확장 (extensibility): 새 dialect 작성 비용 낮음 (TableGen ODS)
검증 (verification): op마다 invariant 선언 → 자동 verifier 생성

6 사용 생태계

프로젝트	MLIR 사용
Triton	Triton IR · TritonGPU IR ↗ V11
IREE	모델 → CPU/GPU backend 전체 파이프라인
TorchInductor	일부 통합 탐색 (주로 Triton 경유) ↗ V13
ONNX-MLIR	ONNX → 여러 backend
CIRCT	hardware design (본권 out-of-scope)

혼동 금지: MLIR = 프레임워크, 특정 컴파일러 아님. "MLIR이 컴파일한다"는 문장은 부정확 — dialect + pass pipeline 조합이 컴파일한다.

1 6-key IR elements ★ Op·Attr·Ty·Rgn·Blk·Val

요소	역할
Operation	계산 단위 (LLVM의 instruction 일반화)
Attribute	컴파일-타임 상수 (shape, integer, string)
Type	Value의 타입 (`i32`, `tensor<4x4xf16>`)
Region	Block 리스트 (nested IR)
Block	Operation 시퀀스 + terminator
Value	SSA 결과 또는 Block argument

2 Operation 일반 형식 ★

// generic form
%r = "dialect.op_name"(%a, %b) <{attr = 42 : i32}>
     : (i32, i32) -> i32

// pretty form (등록된 printer)
%r = dialect.op_name %a, %b
     {attr = 42 : i32} : i32

operand: SSA Value (입력)
result: SSA Value (출력, 0개 이상)
attribute: 상수 metadata
region: nested body (선택)
successor: block 분기 대상 (선택)

3 Recursive IR 구조 ★

개념 Operation이 Region을 포함할 수 있고, Region 안에 Block이, Block 안에 다시 Operation이 있다. 재귀적 중첩이 핵심 — LLVM IR은 function → block → instr 고정 2단계뿐.

ModuleOp
  └─ Region
      └─ Block
          └─ func.func
              └─ Region
                  └─ Block (args = %arg0, %arg1)
                      ├─ scf.for
                      │   └─ Region
                      │       └─ Block
                      │           └─ ... ops ...
                      └─ func.return

4 SSA + Block argument

Value = OpResult 또는 BlockArgument
φ (phi) 대신 block argument 사용
분기: cf.br ^bb(%x : i32) → ^bb(%arg : i32)에 바인딩
정의-사용(def-use) 체인 자동 유지

5 Type 계층 예시

Type	예
builtin 정수	`i1, i8, i32, i64`
builtin 실수	`f16, bf16, f32, f64`
Vector	`vector<4xf32>`
Tensor	`tensor<?x16xf16>` (shape 포함)
MemRef	`memref<32x32xf16, 3>` (addr space)
LLVM	`!llvm.ptr<1>`, `!llvm.struct<...>`

? = dynamic dim · 마지막 정수 = memory space

6 Attribute 예시

// integer / float / string
{value = 3 : i32}
{scale = 1.25 : f32}
{name = "kernel_0"}

// dense array / affine map
{mask = dense<[1,0,1,0]> : vector<4xi1>}
{map = affine_map<(d0)->(d0 floordiv 4)>}

// symbol ref
{callee = @my_func}

7 Use vs Def

Operand = Use(Value)
Value.getUsers() → { Operation* } rewrite 시 use를 다 돌며 replace → dangling 없음

1 Dialect 정의 ★

정의 Dialect = 한 domain의 Op·Type·Attribute·Interface 모음 + 고유한 namespace prefix (arith.addi의 arith).

namespace로 op 이름 충돌 방지
dialect 간 경계 = lowering 단위
dialect는 의존 관계 선언 (upstream / downstream)

2 TableGen ODS op 선언 DSL

// .td 파일
def Arith_AddIOp : Arith_Op<"addi",
    [Pure, Commutative, SameOperandsAndResultType]> {
  let summary = "integer addition";
  let arguments = (ins SignlessIntegerLike:$lhs,
                       SignlessIntegerLike:$rhs);
  let results   = (outs SignlessIntegerLike:$result);
  let assemblyFormat =
      "$lhs `,` $rhs attr-dict `:` type($result)";
}

ODS = Operation Definition Specification
C++ 코드 자동 생성 (Op 클래스 · builder · verifier · printer · parser)

3 Trait · Interface ★

개념	의미
Trait	op 속성 tag (Pure, Commutative, Terminator)
OpInterface	동적 다형성 API (MemoryEffects, LoopLike)
TypeInterface	type 계열 공통 API (ShapedType)
AttrInterface	attribute 계열 공통 API

pass가 isa<LoopLikeOpInterface>로 dialect 모르고도 일반 처리

4 대표 dialect 일람 ★

dialect	level	주요 op
`func`	core	`func.func`, `func.return`, `func.call`
`arith`	core	`addi`, `mulf`, `cmpi`
`scf`	mid	`scf.for`, `scf.if`, `scf.yield`
`affine`	mid	`affine.for`, `affine.load`
`memref`	mid	`memref.alloc`, `memref.load`
`tensor`	high	`tensor.empty`, `tensor.extract`
`vector`	mid	`vector.contract`, `vector.transfer_read`
`linalg`	high	`linalg.matmul`, `linalg.generic`
`gpu`	mid	`gpu.launch`, `gpu.thread_id`
`nvgpu`	low	`nvgpu.mma.sync`, `nvgpu.ldmatrix`
`nvvm`	target	`nvvm.barrier0`, `nvvm.mma.sync`
`llvm`	target	`llvm.func`, `llvm.add`, `llvm.call`

5 Dialect 짝 in vs out

tensor ↔ memref: value semantics ↔ reference semantics
linalg on tensors ↔ linalg on memrefs (bufferization 전후)
affine ↔ scf: 정규 루프 ↔ 일반 structured control flow
gpu ↔ nvgpu: 벤더 중립 ↔ NVIDIA 특화

6 Dialect 등록 플로우

1. C++ DialectRegistry에 등록
2. mlir-opt 실행 시 필요한 dialect load
3. parse → verify → pass → print
4. 미등록 dialect op는 generic form으로만 print

7 In-tree vs Out-of-tree

구분	예
In-tree	llvm-project/mlir 내부 dialect
Out-of-tree	Triton · IREE · CIRCT · ONNX-MLIR

함정: assemblyFormat으로 pretty form을 지정했어도 dialect 미등록이면 generic form으로만 출력 — tool pipeline에 dialect 누락 자주 발생.

1 4-tier 계층 ★ high · mid · low · target

tier	관심사	대표 dialect
High	수학적 연산 · tensor 그 자체	`linalg`, `tensor`, `tosa`
Mid	loop · buffer · region	`scf`, `affine`, `memref`
Low	SIMD tile · device 추상	`vector`, `gpu`
Target	ISA intrinsic 근접	`nvgpu`, `nvvm`, `rocdl`, `llvm`

2 계층 트리 도식 ★★

          ┌─────────────────┐
          │ tensor · linalg │    HIGH  (value · math)
          │ (tosa, mhlo)    │
          └────────┬────────┘
                   │ tile / fuse / bufferize
          ┌────────▼────────┐
          │ affine / scf    │    MID   (control flow)
          │ memref          │          (buffer addr)
          └────────┬────────┘
                   │ vectorize
          ┌────────▼────────┐
          │ vector          │    LOW   (SIMD tile)
          │ gpu             │          (device abstr)
          └────────┬────────┘
                   │ nvgpu lowering
          ┌────────▼────────┐
          │ nvgpu           │    TARGET-near
          │   (mma · ldmat) │
          └────────┬────────┘
                   │ nvvm lowering
          ┌────────▼────────┐
          │ nvvm            │    intrinsic wrap
          │ (PTX-like)      │
          └────────┬────────┘
                   │ translate
          ┌────────▼────────┐
          │ llvm (NVPTX tgt)│    LLVM IR
          └────────┬────────┘
                   │ llc · ptxas
                   ▼
                PTX / SASS

3 Tier 간 변환 Pass 이름

from → to	Pass
linalg → loops	`convert-linalg-to-loops`
tensor → memref	`one-shot-bufferize`
affine → scf	`lower-affine`
scf → cf	`convert-scf-to-cf`
vector → gpu	`convert-vector-to-gpu`
gpu → nvvm	`convert-gpu-to-nvvm`
nvgpu → nvvm	`convert-nvgpu-to-nvvm`
nvvm → llvm	`convert-nvvm-to-llvm`
llvm → LLVM IR	`mlir-translate --mlir-to-llvmir`

4 Tensor ↔ Memref 차이

축	`tensor`	`memref`
의미	value (불변)	reference (mutable)
메모리	없음 (SSA)	명시적 alloc / load / store
addr space	없음	integer (GPU: 1, 3, ...)
전환	bufferization pass가 tensor → memref

5 Affine vs SCF

affine.for: 경계·step 이 affine 식만 허용 → polyhedral 분석 대상
scf.for: 일반 SSA 값 허용 (더 자유)
fusion / tiling 초기엔 affine, 이후 scf로 일반화
affine.load는 memref에서만 동작

6 Vector dialect 위치

역할 hardware SIMD (CPU AVX / GPU warp-lane)에 매핑되는 tile 단위 op. CPU·GPU 모두 공유한다.

vector.contract: GEMM-shape SIMD
vector.transfer_read/write: 경계 padding 포함
gpu lowering은 vector → nvgpu.mma.sync로 매핑

주의: 계층은 강제 순서가 아님. 프로덕션 파이프라인이 중간 단계를 뛰어넘거나 되돌아가기도 한다 (예: IREE는 linalg에서 바로 vector로).

1 Pass 구조 ★ scope · apply · preserve

base	scope
`OperationPass<ModuleOp>`	모듈 전체
`OperationPass<func::FuncOp>`	함수 하나
`OperationPass<>`	any op
`InterfacePass`	특정 interface

runOnOperation() 구현
preserve analysis 선언 → pass 간 결과 공유

2 Pass Pipeline 지정 mlir-opt

# 파이프라인 문자열
mlir-opt input.mlir \
  --pass-pipeline="builtin.module(\
    func.func(convert-linalg-to-loops,\
              convert-vector-to-gpu),\
    gpu.module(convert-gpu-to-nvvm))" \
  -o out.mlir

괄호 = nested pass manager · func.func(...) = 함수 단위 scope

3 Pattern Rewriting 기본 ★

정의 RewritePattern = IR 조각을 매칭해 다른 조각으로 교체하는 규칙. matchAndRewrite(op, rewriter)에서 rewriter.replaceOp / createOp 호출.

struct AddZeroFold : OpRewritePattern<AddIOp> {
  LogicalResult matchAndRewrite(
      AddIOp op, PatternRewriter &rw) const {
    if (matchPattern(op.getRhs(), m_Zero())) {
      rw.replaceOp(op, op.getLhs());
      return success();
    }
    return failure();
  }
};

4 Greedy vs Conversion

축	Greedy	Dialect Conversion
목적	canonicalize / folding	한 dialect → 다른 dialect
종료	고정점	legal op만 남을 때
type 변환	불가	TypeConverter 사용
원자성	op 단위	그래프 전체 커밋

5 Dialect Conversion 3-key ★ Target · TypeConv · Pattern

ConversionTarget: 어떤 op가 legal인지 선언
TypeConverter: 타입 변환 규칙 등록 (tensor → memref)
ConversionPattern: op rewrite 구현 (legal 한 op만 생성)

target.addLegalDialect<LLVM::LLVMDialect>();
target.addIllegalDialect<nvvm::NVVMDialect>();

if (failed(applyFullConversion(
    module, target, patterns)))
  signalPassFailure();

6 Legalization 모드

함수	동작
`applyFullConversion`	모든 illegal op 제거 필수
`applyPartialConversion`	변환 가능한 만큼만
`applyAnalysisConversion`	IR 수정 없이 분석

7 Canonicalization

-canonicalize pass = 각 op의 getCanonicalizationPatterns 모아 greedy 적용
dead code 제거 (-cse와 짝)
순서: canonicalize → cse → canonicalize 반복

1 gpu dialect 위치 ★

역할 gpu = 벤더 중립 GPU 추상. CUDA·ROCm·SPIR-V 공통 개념 (kernel · thread · block · memory space)을 표현. 이후 벤더별 dialect로 lowering된다.

NVIDIA: gpu → nvgpu → nvvm
AMD: gpu → rocdl
Khronos: gpu → spirv

2 gpu.module & gpu.func

gpu.module @kernels {
  gpu.func @add_kernel(
      %a : memref<?xf32>,
      %b : memref<?xf32>,
      %c : memref<?xf32>)
      kernel
      attributes {gpu.known_block_size = array<i32: 256,1,1>} {
    %tid = gpu.thread_id x
    %bid = gpu.block_id  x
    ...
    gpu.return
  }
}

kernel = entry point · known_block_size = compile-time hint

3 주요 gpu op 표 ★

op	의미	→ NVVM
`gpu.thread_id`	threadIdx.{x,y,z}	`nvvm.read.ptx.sreg.tid.*`
`gpu.block_id`	blockIdx.*	`nvvm.read.ptx.sreg.ctaid.*`
`gpu.block_dim`	blockDim.*	`nvvm.read.ptx.sreg.ntid.*`
`gpu.grid_dim`	gridDim.*	`nvvm.read.ptx.sreg.nctaid.*`
`gpu.barrier`	`__syncthreads`	`nvvm.barrier0`
`gpu.shuffle`	warp shuffle	`nvvm.shfl.sync.*`
`gpu.printf`	device printf	`vprintf`
`gpu.alloc`	device alloc	`cudaMalloc` 래퍼
`gpu.memcpy`	H↔D copy	`cudaMemcpy`
`gpu.launch_func`	kernel 호출	runtime launch

4 gpu.launch vs launch_func

gpu.launch: region 안에 kernel body를 직접 포함 (outline 전 초기 형태)
gpu.launch_func: 별도 gpu.func 참조 (outline 후 host side)
pass gpu-kernel-outlining이 전자 → 후자로 변환

5 예시 — gpu.launch

%grid = arith.constant 1 : index
%block = arith.constant 256 : index

gpu.launch
    blocks(%bx, %by, %bz) in (%grid, %c1, %c1)
    threads(%tx, %ty, %tz) in (%block, %c1, %c1) {
  %tid  = gpu.thread_id x
  %idx  = arith.addi %tid, %offset : index
  %v    = memref.load %a[%idx] : memref<?xf32>
  memref.store %v, %c[%idx] : memref<?xf32>
  gpu.terminator
}

6 Memory space 매핑 gpu → PTX

gpu space	memref addr-space	PTX
Global	1	`.global`
Workgroup (shmem)	3	`.shared`
Private (reg/local)	5 (local)	`.local`
Constant	4	`.const`

§10에서 LLVM NVPTX 관점으로 재정리 (완전판)

주의: gpu.launch의 host-side 인자가 memref이면 gpu-async-region 경로에서 lowering 중에 gpu.alloc / gpu.memcpy로 분해된다.

1 nvgpu 위치 ★

정의 nvgpu = NVIDIA 특화 고수준 op 집합. gpu보다 아래, nvvm보다 위. Tensor Core · async copy · TMA 같은 기능을 의미 단위로 표현 (NVVM intrinsic 1:1이 아님).

gpu로 표현 못 하는 NV 전용 동작
nvvm이 풀어헤친 intrinsic보다 의도가 보존됨 (fragment layout 등)
lowering: vector → nvgpu → nvvm

2 nvgpu.mma.sync ★★

// m16n8k16 FP16 → FP32
%d = nvgpu.mma.sync(%a, %b, %c)
     { mmaShape = [16, 8, 16] }
   : (vector<4x2xf16>,
      vector<2x2xf16>,
      vector<2x2xf32>)
     -> vector<2x2xf32>

operand 타입이 fragment shape 그대로
각 thread가 가지는 register tile 수가 결과 type 차원
mma 고유 layout이 type에 보존됨 ↗ V03 §7 mma

3 nvgpu.ldmatrix ★

%frag = nvgpu.ldmatrix
        %smem[%i, %j]
        { transpose = false,
          numTiles = 4 : i32 }
      : memref<?x?xf16, 3>
     -> vector<4x2xf16>

shmem tile → register fragment 병렬 로드
transpose = true: ldmatrix.trans 활용
memref<..., 3> = addr space 3 (shared)
하위에서 nvvm.ldmatrix로 하강

4 nvgpu.device_async_copy

%tok = nvgpu.device_async_copy
       %gmem[%i], %smem[%j], 16
     : memref<?xf16>  to memref<?xf16, 3>

%g = nvgpu.device_async_create_group %tok
nvgpu.device_async_wait %g { numGroups = 1 : i32 }

Ampere cp.async 추상 ↗ V03 §6
token 기반 ordering (group commit)
double-buffer 패턴 직접 표현

5 Hopper 확장 op

op	매핑
`nvgpu.tma.async.load`	TMA descriptor load ↗ V04 §4
`nvgpu.mbarrier.create`	smem mbarrier alloc
`nvgpu.mbarrier.init`	arrive count init
`nvgpu.mbarrier.arrive`	arrive signal
`nvgpu.mbarrier.test_wait`	try_wait.parity
`nvgpu.warpgroup.mma`	wgmma async ↗ V04 §5
`nvgpu.warpgroup.mma.store`	epilogue store

6 Type family

!nvgpu.mma.ldmatrix.fragment: fragment layout 보존
!nvgpu.tensormap.descriptor: TMA descriptor handle
!nvgpu.mbarrier.group: mbarrier 배열
vector / memref과 혼용

7 설계 목적 왜 따로?

고수준 의미 유지 = vectorizer가 최적화하기 쉬움 nvvm으로 바로 내리면 register layout 정보 손실 → fusion·reuse 분석 불가

함정: nvgpu.mma.sync operand type은 특정 shape와 dtype 조합만 유효. mismatch 시 verifier 에러 — mma.sync legal shape 표는 ↗ V03 §7.

1 nvvm 정체성 ★ intrinsic 1:1 wrapper

정의 nvvm dialect = LLVM NVPTX의 intrinsic을 MLIR op로 래핑한 계층. 의미는 낮고, 이름은 llvm.nvvm.* intrinsic과 대부분 1:1.

nvgpu 의미 층이 없을 때만 직접 사용
gpu 추상을 lowering한 결과도 nvvm에 착지
이후 convert-nvvm-to-llvm으로 LLVM IR

2 자주 쓰는 nvvm op

nvvm op	→ LLVM intrinsic
`nvvm.read.ptx.sreg.tid.x`	`llvm.nvvm.read.ptx.sreg.tid.x`
`nvvm.read.ptx.sreg.ctaid.x`	`llvm.nvvm.read.ptx.sreg.ctaid.x`
`nvvm.barrier0`	`llvm.nvvm.barrier0`
`nvvm.shfl.sync`	`llvm.nvvm.shfl.sync.*`
`nvvm.ldmatrix`	`llvm.nvvm.ldmatrix.sync.aligned.*`
`nvvm.mma.sync`	`llvm.nvvm.mma.*`
`nvvm.cp.async.shared.global`	`llvm.nvvm.cp.async.*`
`nvvm.wmma.mma.*`	`llvm.nvvm.wmma.mma.*`

3 nvvm.mma.sync 시그니처

// m16n8k16 FP16 → FP32
%d0, %d1, %d2, %d3 = nvvm.mma.sync
  A[%a0, %a1, %a2, %a3]
  B[%b0, %b1]
  C[%c0, %c1, %c2, %c3]
  { shape   = #nvvm.shape<m = 16, n = 8, k = 16>,
    layoutA = #nvvm.mma_layout<row>,
    layoutB = #nvvm.mma_layout<col>,
    multiplicandAPtxType = #nvvm.mma_type<f16>,
    multiplicandBPtxType = #nvvm.mma_type<f16>,
    intOverflowBehavior = #nvvm.mma_int_overflow<satfinite> }
: (vector<2xf16>, ..., f32, ...) -> !llvm.struct<(f32,f32,f32,f32)>

operand가 scalar · vector · struct 혼합 — 이미 LLVM IR의 ABI에 가깝다

4 Kernel entry annotation

llvm.func @my_kernel(%arg0: !llvm.ptr<1>)
    attributes { nvvm.kernel } {
  ...
  llvm.return
}

nvvm.kernel attr = entry point 표시
lowering 시 !nvvm.annotations metadata 자동 생성
Host에서 launch 가능한 심볼

5 nvvm 고유 attribute

attr	의미
`nvvm.maxntid`	block 당 최대 thread
`nvvm.minctasm`	SM 당 최소 CTA
`nvvm.reqntid`	정확한 block dim
`nvvm.cluster_dim`	Hopper cluster 크기
`nvvm.grid_constant`	kernel 인자 중 grid-const

→ LLVM metadata !nvvm.annotations로 flatten

6 nvgpu vs nvvm

축	nvgpu	nvvm
의미	mma 의도	intrinsic 그대로
type	fragment 보존	scalar 나열
verify	shape 강제	느슨
optimization	fusion 쉬움	끝단 (거의 안 건드림)

주의: nvvm op는 대부분 side-effect 있음으로 표시 — barrier · shuffle 등을 CSE가 합치지 않도록. 잘못 purity 표기하면 race 발생.

1 LLVM IR 구조 ★ Module·Func·BB·Instr

Module
  └─ Function (sig + attrs)
      └─ BasicBlock
          ├─ Instruction (SSA)
          └─ Terminator (br, ret, switch, ...)

SSA: 각 %tmp 정확히 한 번 정의
phi = predecessor별 값 병합
각 BB는 정확히 하나의 terminator로 끝

2 기본 instruction 예

define i32 @add(i32 %a, i32 %b) {
entry:
  %s = add nsw i32 %a, %b
  ret i32 %s
}

define i32 @loop(i32 %n) {
entry:
  br label %header
header:
  %i = phi i32 [0, %entry], [%i.n, %body]
  %c = icmp slt i32 %i, %n
  br i1 %c, label %body, label %exit
body:
  %i.n = add i32 %i, 1
  br label %header
exit:
  ret i32 %i
}

3 Type system 요약

종류	예
정수	`i1, i8, i16, i32, i64`
실수	`half, bfloat, float, double`
vector	`<4 x float>`
pointer	`ptr addrspace(1)` opaque ptr
struct	`{ i32, float }`
array	`[256 x i8]`

opaque pointer (2023~) → typed ptr 대신 ptr 한 종류

4 Intrinsic ★

정의 intrinsic = LLVM이 아는 특수 함수. @llvm.* prefix. backend가 단일 instruction으로 낮추거나 라이브러리 호출로 확장.

@llvm.memcpy.p0.p0.i64
@llvm.fma.f32
@llvm.nvvm.barrier0 ← NVPTX 전용
intrinsic은 선언만, 정의 없음

5 Metadata 시스템 ★

; 예: branch weight
%c = icmp slt i32 %i, %n, !prof !0
!0 = !{!"branch_weights", i32 99, i32 1}

; NVPTX annotation (kernel 표시)
!nvvm.annotations = !{!10}
!10 = !{ptr @my_kernel, !"kernel", i32 1}

metadata node = !N 으로 참조
backend · middle-end가 힌트로 사용
IR 의미엔 영향 없지만 codegen에 영향

6 Function / Arg attribute

attr	의미
`nounwind`	예외 없음
`readonly`	mem 수정 안 함
`noalias`	ptr 간 별칭 없음
`align N`	ptr 정렬
`dereferenceable(N)`	최소 N byte 역참조 가능

7 mlir→llvm 변환

llvm dialect IR → mlir-translate --mlir-to-llvmir
pure text .ll 생성
opt -O3로 추가 최적화 → NVPTX backend

주의: opaque ptr 이후 bitcast가 pointer 간엔 불필요 · 타입은 instruction이 결정한다 (load i32, ptr %p).

1 NVPTX backend 역할 ★

정의 NVPTX = LLVM의 NVIDIA target backend. LLVM IR (with NVPTX intrinsic/metadata) → PTX assembly 텍스트 출력. PTX는 virtual ISA, ptxas가 다시 SASS로 JIT/AOT 컴파일.

triple: nvptx64-nvidia-cuda
data layout: e-i64:64-i128:128-v16:16-v32:32-n16:32:64
출력 = 텍스트 .ptx, object 아님

2 Address Space 매트릭스 ★★

addr space	이름	용도	PTX state
0	generic	default · 어느 space든 참조	(cast 필요)
1	global	HBM · device-wide	`.global`
2	internal use	(예약)	—
3	shared	SMEM · block 내 공유	`.shared`
4	const	read-only 상수 뱅크	`.const`
5	local	per-thread stack	`.local`
6	(reserved)	—	—
7	param	kernel 인자	`.param`

src: LLVM NVPTXBaseInfo · PTX ISA §5.1 state spaces

3 Address space 주석 예

; shared memory buffer
@smem = internal addrspace(3)
        global [256 x i32] undef,
        align 16

; global pointer
define void @kern(ptr addrspace(1) %X) {
  %p = getelementptr i32,
       ptr addrspace(1) %X, i64 %tid64
  %v = load i32, ptr addrspace(1) %p
  ...
}

addrspace(N)으로 pointer type 수식
addrspacecast로 generic(0) ↔ specific 변환

4 !nvvm.annotations ★

; named metadata
!nvvm.annotations = !{!10, !11, !12}

; 각 항목 = (함수, 태그, 값)
!10 = !{ptr @kern, !"kernel",    i32 1}
!11 = !{ptr @kern, !"maxntidx",  i32 256}
!12 = !{ptr @kern, !"minctasm",  i32 2}

tag	의미
`kernel`	entry point 표시 (= `.entry`)
`maxntidx/y/z`	block 당 최대 thread
`reqntidx`	정확한 block dim (고정)
`minctasm`	SM 당 최소 CTA
`grid_constant`	unchanged param idx
`cluster_dim_x`	Hopper cluster (sm_90)

5 Target 선택

# LLVM level
llc -march=nvptx64 \
    -mcpu=sm_80 \
    -mattr=+ptx75 \
    input.ll -o out.ptx

# mlir-translate → llc
mlir-translate \
  --mlir-to-llvmir out.mlir | \
  llc -march=nvptx64 -mcpu=sm_90a

-mcpu: sm_70/80/86/89/90/90a
-mattr=+ptx80: PTX ISA 버전 (backward)
sm_90a = Hopper 전용 (TMA·WGMMA) ↗ V04

6 Codegen pipeline NVPTX

LLVM IR (NVPTX target)
  → NVPTXLower* passes
  → ISel (tablegen patterns)
  → NVPTXMachineFunction
  → NVPTXAsmPrinter
  → .ptx (text)
  → ptxas (외부)
  → cubin (SASS binary)

함정: generic(0) pointer에서 shared(3)로 addrspacecast 없이 직접 load하면 backend가 PTX ld를 generic load로 내보내 성능 저하. 반드시 specific addrspace로 캐스팅하고 작업.

1 intrinsic 카테고리

group	예
sreg (special reg)	`tid.x`, `ctaid.x`, `laneid`, `warpid`
barrier	`barrier0`, `bar.sync`, `membar`
shuffle	`shfl.sync.{idx,up,down,bfly}`
vote	`vote.sync.{all,any,ballot}`
atomic	`atom.{add,cas,min,max,and,or,xor}`
mma	`mma.m16n8k16.`, `wmma.mma.`
ldmatrix	`ldmatrix.sync.aligned.*`
cp.async	`cp.async.{ca,cg}.shared.global`
TMA (Hopper)	`cp.async.bulk.tensor.*`
tensor mem	`tcgen05.*` (Blackwell)

2 Special register 읽기

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.laneid()
declare i32 @llvm.nvvm.read.ptx.sreg.warpid()

%t = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()

PTX: mov.u32 %r, %tid.x;

3 Barrier · fence ★

; block 전체 sync (= __syncthreads)
declare void @llvm.nvvm.barrier0()

; partial barrier: arrive count 지정
declare void @llvm.nvvm.barrier(
    i32 %barId, i32 %threadCount)

; membar scope
declare void @llvm.nvvm.membar.cta()
declare void @llvm.nvvm.membar.gl()
declare void @llvm.nvvm.membar.sys()

scope 순서: cta ⊂ gl(GPU) ⊂ sys(system) ↗ V03 §12

4 Warp shuffle

declare i32 @llvm.nvvm.shfl.sync.idx.i32(
    i32 %mask, i32 %val,
    i32 %srcLane, i32 %packedCfg)

declare i32 @llvm.nvvm.shfl.sync.down.i32(
    i32 %mask, i32 %val,
    i32 %delta, i32 %packedCfg)

declare i32 @llvm.nvvm.shfl.sync.bfly.i32(
    i32 %mask, i32 %val,
    i32 %lane, i32 %packedCfg)

mask: 참여 lane bitmask (0xFFFFFFFF = 전체)
packedCfg: clamp/segment bits 포함

5 mma intrinsic 예시

; m16n8k16 FP16→FP32 sync mma
declare {float, float, float, float}
@llvm.nvvm.mma.m16n8k16.row.col.f16.f16(
  <2 x half>, <2 x half>,   ; A 4 frags → 8 half
  <2 x half>, <2 x half>,
  <2 x half>, <2 x half>,   ; B 2 frags → 4 half
  float, float, float, float) ; C 4 f32

vectorized <N x T> = fragment per-thread layout ↗ V03 §7

6 ldmatrix · cp.async

declare {i32,i32,i32,i32}
@llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16(
    ptr addrspace(3) %smem)

declare void
@llvm.nvvm.cp.async.cg.shared.global.16(
    ptr addrspace(3), ptr addrspace(1))

declare void
@llvm.nvvm.cp.async.commit.group()

declare void
@llvm.nvvm.cp.async.wait.group(i32)

주의: intrinsic 이름은 LLVM 버전별로 조금씩 다르다 (bf16 variant 추가 등). NVPTX RST와 IntrinsicsNVVM.td를 버전별로 확인.

1 Arch string 매트릭스 ★

sm_arch	GPU	PTX ISA	필수 기능
`sm_70`	V100	ptx60+	wmma
`sm_75`	T4 / Turing	ptx63+	mma INT8
`sm_80`	A100	ptx70+	mma FP16, cp.async
`sm_86`	A10/A40/RTX30	ptx71+	—
`sm_89`	L40 / RTX40	ptx78+	FP8 (Ada)
`sm_90`	H100 (공통)	ptx78+	cluster · TMA (일부)
`sm_90a`	H100 전용	ptx78+	WGMMA · TMA
`sm_100`	Blackwell	ptx86+	tcgen05 · FP4

sm_XXa suffix = architecture-specific (forward-incompatible)

2 llc 옵션 매핑

# 기본 PTX 출력
llc -march=nvptx64 -mcpu=sm_80 \
    -mattr=+ptx75 \
    in.ll -o out.ptx

# fast-math / denormal
llc -fp-contract=fast \
    -enable-unsafe-fp-math \
    -nvptx-f32ftz    \   # ftz denormal
    -nvptx-prec-divf32=0 \ # approx div
    -nvptx-prec-sqrtf32=0

3 ptxas 주요 플래그 ★

flag	의미
`-arch=sm_90a`	target architecture
`-O0..-O3`	opt level (ptxas 자체)
`-maxrregcount=N`	reg 상한 → spill 유도
`--gpu-name`	sm 이름 문자열
`-v`	register · spill · shmem 요약
`-dlcm=ca/cg`	default cache mode
`--compile-only`	link 생략
`-ftz=true`	denormal flush-to-zero
`-prec-div=false`	FP32 div 근사
`-prec-sqrt=false`	FP32 sqrt 근사

4 Fast-math 체인

ftz + approx_div + approx_sqrt + fma_contract
→ FP32 peak 도달 가능성 ↑ 정확도 ↓ · IEEE-754 non-compliant

level	효과
strict	IEEE · 느림
contract	fma 융합만 허용
fast	ftz + approx 전부

5 MLIR 경로 ★

# 1) MLIR → LLVM IR
mlir-translate \
  --mlir-to-llvmir kernel.mlir \
  -o kernel.ll

# 2) LLVM IR → PTX
llc -march=nvptx64 -mcpu=sm_90a \
    kernel.ll -o kernel.ptx

# 3) PTX → cubin
ptxas -arch=sm_90a -O3 \
      kernel.ptx -o kernel.cubin

# 4) cubin → fatbin (선택)
fatbinary --image=profile=sm_90,file=kernel.cubin \
          --create=kernel.fatbin

6 Compile-time vs JIT

PTX는 forward-compat: 새 GPU에서 드라이버가 JIT
SASS(cubin)는 arch-specific
fatbin = 여러 arch의 PTX/SASS 번들
-code=sm_80,sm_90 + -code=compute_80 조합

함정: sm_90과 sm_90a는 PTX 레벨부터 다름. sm_90으로 빌드한 커널은 WGMMA / TMA PTX instruction을 생성할 수 없다 — ptxas에서 verifier error.

1 Stage 1 — `linalg` ★

// 고수준 의미: D = A·B
func.func @matmul(
    %A : tensor<128x64xf16>,
    %B : tensor<64x128xf16>,
    %C : tensor<128x128xf32>)
    -> tensor<128x128xf32> {
  %D = linalg.matmul
       ins (%A, %B : tensor<128x64xf16>,
                     tensor<64x128xf16>)
       outs(%C    : tensor<128x128xf32>)
       -> tensor<128x128xf32>
  return %D : tensor<128x128xf32>
}

tensor 값 의미 · shape·layout 보존 · 아직 loop 없음

2 Stage 2 — `scf.for`

// tile (32,32,16) + bufferize
scf.for %m = %c0 to %c128 step %c32 {
 scf.for %n = %c0 to %c128 step %c32 {
  %acc = vector.transfer_read
         %Cbuf[%m, %n], %cst0
       : memref<128x128xf32>,
         vector<32x32xf32>
  %r = scf.for %k = %c0 to %c64 step %c16
        iter_args(%aI = %acc) -> vector<32x32xf32> {
     %a = vector.transfer_read %Abuf[%m, %k], %cst
     %b = vector.transfer_read %Bbuf[%k, %n], %cst
     %n = vector.contract ... %a, %b, %aI
     scf.yield %n : vector<32x32xf32>
  }
  vector.transfer_write %r, %Cbuf[%m, %n]
 }
}

3 Stage 3 — `vector.contract` → `nvgpu`

// 32x32 contract → 4개의 m16n8k16 mma
%a0 = nvgpu.ldmatrix %Asm[%i0, %k0]
      { numTiles = 4 } : ... -> vector<4x2xf16>
%b0 = nvgpu.ldmatrix %Bsm[%k0, %j0]
      { numTiles = 2, transpose = true }
   : ... -> vector<2x2xf16>

%d0 = nvgpu.mma.sync(%a0, %b0, %c0)
      { mmaShape = [16, 8, 16] }
    : (vector<4x2xf16>,
       vector<2x2xf16>,
       vector<2x2xf32>)
      -> vector<2x2xf32>

각 thread의 fragment 보존된 type 유지

4 Stage 4 — `nvvm`

%fa = nvvm.ldmatrix %Asm
      { num = 4 : i32, layout = #nvvm.mma_layout<row> }
    : (!llvm.ptr<3>)
      -> !llvm.struct<(i32, i32, i32, i32)>

%dd = nvvm.mma.sync
      A[%a0, %a1, %a2, %a3]
      B[%b0, %b1]
      C[%c0, %c1, %c2, %c3]
      { shape = #nvvm.shape<m=16, n=8, k=16>,
        layoutA = #nvvm.mma_layout<row>,
        layoutB = #nvvm.mma_layout<col>,
        multiplicandAPtxType = #nvvm.mma_type<f16>,
        multiplicandBPtxType = #nvvm.mma_type<f16> }
    : (...) -> !llvm.struct<(f32,f32,f32,f32)>

5 Stage 5 — LLVM IR (NVPTX)

define void @matmul_kernel(
    ptr addrspace(1) %A,
    ptr addrspace(1) %B,
    ptr addrspace(1) %C)
  !nvvm.annotations !0 {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  ...
  %r = call {float,float,float,float}
    @llvm.nvvm.mma.m16n8k16.row.col.f16.f16(
      <2 x half> %a0, ..., float %c0, ...)
  ...
  ret void
}
!0 = !{ptr @matmul_kernel, !"kernel", i32 1}

6 Stage 6 — PTX

.version 7.8
.target  sm_80
.address_size 64

.visible .entry matmul_kernel(
    .param .u64 A,
    .param .u64 B,
    .param .u64 C)
{
  .reg .f32   %f<128>;
  .reg .b32   %r<128>;
  ...
  ldmatrix.sync.aligned.x4.m8n8.shared.b16
       {%r0,%r1,%r2,%r3}, [%sA];
  mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
       {%f0,%f1,%f2,%f3},
       {%r0,%r1,%r2,%r3},
       {%rb0,%rb1},
       {%f0,%f1,%f2,%f3};
  ...
}

관찰: 각 stage는 같은 알고리즘이다. 단지 표현 계층이 달라진 것. type의 형태가 tensor → memref → vector → fragment → <N x half> → .reg로 점차 HW 친화로 수렴한다.

1 GPU MLIR 프로젝트 맵

프로젝트	입력	출력	GPU backend
IREE	MLIR (TOSA / StableHLO)	VM bytecode	CUDA · ROCm · Vulkan
Polygeist	C/C++/OpenMP	MLIR (affine)	CPU · CUDA (연구)
nvfuser	PyTorch JIT IR	CUDA kernel	NVIDIA (MLIR 미사용)
Triton	Python DSL	PTX ↗ V11	NVIDIA (MLIR 사용)
ONNX-MLIR	ONNX	LLVM / accel.	CPU 주, GPU 확장
CIRCT	HW IR	Verilog	(GPU 아님)

2 IREE 파이프라인 개념

TOSA / StableHLO / TorchMLIR
    ↓ input conversion
linalg on tensors
    ↓ dispatch region 형성 (CTA 단위)
flow.executable + hal.executable
    ↓ target-specific pipeline
    ├─ LLVM-CPU: vector · llvm
    └─ LLVM-GPU (CUDA): gpu · nvvm · llvm
IREE VM bytecode + cubin
    ↓ runtime
execute

3 IREE 특이점

dispatch region: linalg fuse 단위 = CTA 수준 kernel
HAL: runtime abstraction (device / command / allocator)
codegen을 TransformDialect script로 제어 (선언적 schedule)
target 별 전체 pass pipeline이 한 톨-체인에 포함

4 Polygeist

역할 Clang을 수정해 C/C++ 소스 → affine/scf MLIR로 직접 import. polyhedral 최적화 pass 적용 후 GPU로 내림.

legacy CPU 코드의 GPU 이식 연구용
affine scheduling + tiling pass가 주력
실 프로덕션 드물지만 "C → MLIR" reference 구현

5 nvfuser 참고

PyTorch의 pointwise/reduction fusion backend
MLIR 사용 안 함 — 자체 IR (Fusion IR)
CUDA 코드 직접 방출
TorchInductor와 역할 중복 (정리 중) ↗ V13

6 GPU lowering 전략 비교

전략	예	특징
dispatch-centric	IREE	dispatch = 독립 kernel
block-centric	Triton ↗ V11	프로그래머가 block 직접 제어
loop-centric	Polygeist	loop nest → GPU map
graph-centric	XLA ↗ V14	HLO op 단위 fusion

7 TransformDialect 한 줄

// 선언적 schedule 스크립트
transform.sequence %root : !any_op {
^bb(%arg0: !any_op):
  %mm = transform.structured.match
        ops{["linalg.matmul"]} in %arg0
  %tiled, %loops =
    transform.structured.tile_using_for %mm
      [32, 32, 16]
  transform.structured.vectorize_children_and_apply_patterns
    %tiled
}

IREE·MLIR에서 codegen 실험의 사실상 표준

scope 주의: 본권은 GPU 컴파일러 개념만 다룸. 실제 production 컴파일러 (TorchInductor · XLA · TVM) 특성은 각각 ↗ V13 · V14.

1 Triton의 dialect 2단 ★

dialect	level	핵심
`triton` (TT)	block-level DSL	`tt.load`, `tt.dot`, `tt.store`
`triton_gpu` (TTG)	GPU-aware	layout encoding 포함

자세한 설계·pipeline은 ↗ V11 · 여기선 MLIR 관점 특이점만

2 핵심 특이점 ★

Block = programmer 노출: Triton은 tl.program_id로 block idx를 직접 준다. MLIR gpu dialect의 gpu.launch를 쓰지 않음
Layout encoding: tensor type에 #triton_gpu.blocked<...> attribute로 분배 · register fragment · shared layout 삽입
upstream linalg 미사용: Triton IR이 GEMM을 이미 high-level로 표현하므로 linalg 단계 없음
자체 conversion: TT → TTG → LLVM (NVPTX) 직접 — nvgpu/nvvm dialect 이용 (최근 일부)

3 Triton Layout encoding 핵심

// blocked layout: CTA 내 thread 분배
#blocked = #triton_gpu.blocked<{
    sizePerThread  = [1, 8],
    threadsPerWarp = [8, 4],
    warpsPerCTA    = [4, 1],
    order          = [1, 0]}>

// mma layout: tensor core fragment
#mma = #triton_gpu.nvidia_mma<{
    versionMajor = 2,
    warpsPerCTA  = [4, 1],
    instrShape   = [16, 8]}>

// shared layout: smem swizzle
#shared = #triton_gpu.shared<{
    vec = 8, perPhase = 2,
    maxPhase = 8, order = [1, 0]}>

4 왜 layout 이 type에?

같은 shape tensor라도 layout이 다르면 다른 type
layout 불일치 = verifier error → 자동 convert_layout 삽입
optimization이 layout을 "안다": register ↔ shared ↔ mma 변환 최소화
upstream MLIR은 layout을 attribute로만 취급 (별개)

5 Pipeline 비교

축	upstream MLIR (IREE)	Triton
시작점	linalg on tensors	TT (DSL)
block 추상	gpu.launch → nvvm	TTG (내장)
layout	attribute(약함)	type(강제)
Tensor Core	nvgpu.mma.sync	tt.dot → mma layout
codegen	mlir-translate + llc	자체 + LLVM NVPTX
pipeline stage	많음 (유연)	적음 (직행)

6 Triton이 쓰는 upstream 기능

Pass infra · ConversionPattern · TableGen ODS
arith, scf 일부 (control flow)
LLVM dialect + NVPTX translate
DialectConversion legalization
최근: nvgpu/nvvm 일부 통합 (TMA / WGMMA)

핵심: Triton은 MLIR을 프레임워크로 재활용했을 뿐, upstream GPU dialect 계층과 독립된 자기만의 2-dialect 스택이다. Triton IR을 읽을 때 upstream MLIR의 linalg → scf 흐름을 기대하면 안 됨.

1 RewritePattern 계열 ★

base	용도
`OpRewritePattern<Op>`	한 op 루트 매칭
`RewritePattern`	일반 (op name·interface)
`ConversionPattern`	dialect conversion용
`OpConversionPattern<Op>`	conv + 특정 op

2 matchAndRewrite 표준형 ★

struct FoldAddZero
    : OpRewritePattern<arith::AddIOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(
      arith::AddIOp op,
      PatternRewriter &rewriter) const override {
    // match
    Value rhs = op.getRhs();
    if (!matchPattern(rhs, m_Zero()))
      return failure();

    // rewrite
    rewriter.replaceOp(op, op.getLhs());
    return success();
  }
};

3 Rewriter API 핵심

API	의미
`create<Op>(loc, args)`	새 op 삽입
`replaceOp(op, vals)`	op 제거 + 사용처 값 치환
`replaceOpWithNewOp<Op>(...)`	한 줄 치환
`eraseOp(op)`	삭제만 (no-use)
`startOpModification / finalize`	in-place 수정
`inlineRegionBefore(...)`	region 이식
`getRemappedValue(v)`	conversion 중 value 추적

4 ConversionPattern 차이

struct ConvertMyOp
    : OpConversionPattern<MyOp> {
  using OpConversionPattern::OpConversionPattern;

  LogicalResult matchAndRewrite(
      MyOp op, OpAdaptor adaptor,   // 변환된 operand
      ConversionPatternRewriter &rw) const override {
    Value x = adaptor.getInput();   // 이미 새 type
    Type ty = getTypeConverter()->convertType(
                op.getType());
    rw.replaceOpWithNewOp<LoweredOp>(op, ty, x);
    return success();
  }
};

OpAdaptor: 이미 converted된 operand 접근
getTypeConverter: 타입 변환 rule 조회

5 Pattern 등록 ★

RewritePatternSet patterns(ctx);
patterns.add<FoldAddZero,
             FoldMulOne,
             ConvertMyOp>(ctx);

// greedy
(void)applyPatternsGreedily(op, std::move(patterns));

// conversion
ConversionTarget target(*ctx);
target.addLegalDialect<LLVM::LLVMDialect>();
target.addIllegalOp<MyOp>();
if (failed(applyPartialConversion(
    op, target, std::move(patterns))))
  signalPassFailure();

6 PDL 선언적 rewrite

pdl.pattern @fold_add_zero : benefit(1) {
  %zero = pdl.attribute = 0 : i32
  %ty   = pdl.type
  %c0op = pdl.operation "arith.constant"
          {"value" = %zero} -> (%ty : !pdl.type)
  %z    = pdl.result 0 of %c0op
  %lhs  = pdl.operand
  %add  = pdl.operation "arith.addi"
          (%lhs, %z : !pdl.value, !pdl.value)
          -> (%ty : !pdl.type)
  pdl.rewrite %add {
    pdl.replace %add with (%lhs : !pdl.value)
  }
}

C++ 없이 pattern 작성
pdl_interp로 interp 또는 자동 코드 생성

주의: 같은 IR shape 위를 rewriter.create + old op 유지 조합으로 찍으면 infinite loop. 항상 replaceOp / eraseOp로 원 op 제거 확인.

1 전체 dialect 계층 맵 ★★

LEVEL           DIALECT         주요 op / type
────────────────────────────────────────────────
HIGH (value)    tensor          tensor.empty/extract
                linalg          linalg.matmul, generic
                (tosa, mhlo)    tosa.conv2d

MID  (loop)     scf             scf.for/if/yield
                affine          affine.for/load
                memref          memref.alloc/load

LOW  (tile)     vector          vector.contract
                                vector.transfer_read
                gpu             gpu.launch, thread_id
                                gpu.barrier, shuffle

TARGET (NV)     nvgpu           nvgpu.mma.sync
                                nvgpu.ldmatrix
                                nvgpu.device_async_copy
                nvvm            nvvm.barrier0
                                nvvm.mma.sync, shfl
                                nvvm.ldmatrix

LLVM            llvm            llvm.func, call
                                llvm.ptr<N>  (addrspace)
────────────────────────────────────────────────
mlir-translate → LLVM IR
llc (nvptx64)  → PTX
ptxas          → cubin (SASS)

2 Conversion pass 체인 (표준)

linalg → scf: convert-linalg-to-loops
tensor → memref: one-shot-bufferize
scf → cf: convert-scf-to-cf
vector → gpu: convert-vector-to-gpu
gpu → nvvm: convert-gpu-to-nvvm
nvgpu → nvvm: convert-nvgpu-to-nvvm
arith/cf → llvm: convert-arith/cf-to-llvm
func → llvm: convert-func-to-llvm

3 Address space · PTX 표 ★

AS	MLIR type	PTX
0	`ptr` / `memref<T>`	generic
1	`ptr addrspace(1)`	`.global`
3	`memref<T, 3>`	`.shared`
4	`addrspace(4)`	`.const`
5	`addrspace(5)`	`.local`
7	`addrspace(7)`	`.param`

4 자주 쓰는 NVVM intrinsic

op	PTX
`read.ptx.sreg.tid.x`	`%tid.x`
`read.ptx.sreg.ctaid.x`	`%ctaid.x`
`barrier0`	`bar.sync 0`
`shfl.sync.down`	`shfl.sync.down.b32`
`mma.m16n8k16.*`	`mma.sync.aligned.*`
`cp.async.cg.shared.global`	`cp.async.cg.*`
`ldmatrix.x4.b16`	`ldmatrix.sync.aligned.x4.*`

5 !nvvm.annotations tag

"kernel" → .entry
"maxntidx/y/z"
"reqntidx"
"minctasm"
"cluster_dim_x" (sm_90)

6 최소 빌드 체인 한 줄씩

# 1. MLIR lower
mlir-opt kern.mlir \
 --convert-linalg-to-loops \
 --one-shot-bufferize \
 --convert-scf-to-cf \
 --convert-vector-to-gpu \
 --convert-gpu-to-nvvm \
 --convert-nvgpu-to-nvvm \
 --convert-func-to-llvm \
 -o lowered.mlir

# 2. LLVM IR
mlir-translate --mlir-to-llvmir \
  lowered.mlir -o k.ll

# 3. PTX
llc -march=nvptx64 -mcpu=sm_90a \
    k.ll -o k.ptx

# 4. cubin
ptxas -arch=sm_90a -O3 \
      k.ptx -o k.cubin

7 참조 → 다음 권

Triton 컴파일러 전체 → ↗ V11
TorchInductor · FX → ↗ V13
XLA · TVM · polyhedral → ↗ V14
PTX / SASS 자체 → ↗ V03 · V04
Tensor Core shape → ↗ V03 §7

8 흔한 실수 6 (!) AS·legal·pure·loop·sm_a·layout

generic AS 남발 → slow global load
legal op 선언 누락 → 무한 rewrite
side-effect op에 Pure trait → CSE 삭제
infinite loop: 같은 op 재생성
sm_90 vs sm_90a 혼동
layout mismatch: Triton type error

핵심 문장: MLIR은 compiler가 아니라 IR 프레임워크. dialect 선택 + conversion pass 체인 + NVPTX backend 설정이 하나의 kernel을 PTX로 낸다.