《GPU Mode》 L104 2025 · Triton + Gluon High priority transcript · failed

Gluon and Linear Layouts

Gluon steps in exactly where Triton's greatest strength, automatic layout selection, becomes its limitation. The pitch: “the same syntax as Triton, but the user controls layouts explicitly”. A new abstraction working in the same space, Linear Layouts, represents layouts as binary matrices over 𝔽₂ and so unifies all of Triton's layouts (blocked, shared, MMA, swizzled) in one framework. The talk was given by Peter Bell, Mario Lezcano, and Keren Zhou (Triton core, PyTorch). The transcript extraction failed, so these notes are reconstructed from the slides (gluon.pdf), the paper “Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using 𝔽₂”, and the issues and experimental/gluon code in the Triton repo.

Gluon Triton Linear Layouts 𝔽₂ explicit layout TTGIR warp specialization producer-consumer

Speakers

Peter Bell · Mario Lezcano · Keren Zhou

Triton core · PyTorch · Linear Layouts paper authors

Lecture number

L104

Transcript

failed · reconstructed from slides

Priority

High · close reading

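Contents · 12 sections

01 The problem the lecture addresses · why Gluon
02 Where Triton's layout abstraction falls short · limits of heuristics
03 Where Gluon sits · a lower level inside Triton
04 Expressive power of Linear Layouts · binary matrices over 𝔽₂
05 Example · producer-consumer · warp-specialized pipeline
06 Backward compatibility · relation to existing Triton
07 Adoption · FlashAttention · vLLM
08 Comparison with CuTe DSL · trade-offs of the two stacks
09 What's next · AMD · TMA · per-thread
10 Notes to remember · key takeaways
11 Links to other lectures · connections
12 Open questions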

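§ 01 The problem the lecture addresses · why Gluon

“The one line where Triton's automation gets stuck, and a way to fix that line without moving to CUTLASS”

Triton's core appeal: “write only tile-level ops, and layout, scheduling, and register allocation are automatic”. That automation has shaped how GPU kernels are written since the time of L001. But as the new Hopper and Blackwell instructions (TMA, WGMMA, multicast) arrived, places appeared that the heuristics cannot cover. Gluon's role is to let the user control layouts directly in those places: “the same syntax as Triton, but with explicit layouts”.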

The lecture poses two questions.

Where exactly does Triton's automatic layout selection fall short, and what additional control does the user need there?
How does the new Linear Layouts abstraction unify Triton's existing layouts (blocked, shared, NVIDIA MMA), and how does that resolve the quadratic complexity of layout conversions?

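The lecture's mental frame

Gluon is “not a replacement for Triton but a lower-level layer inside Triton”. Where plain Triton is enough, keep using it; drop down to Gluon only where the automation gets stuck. The goal is to use Hopper's features from the same Python environment without moving to CUTLASS.

“Triton's automation was sufficient up to Ampere. Automatically lowering Hopper's producer-consumer pipeline is hard, hence Gluon.” (study note · conjecture)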

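§ 02 Where Triton's layout abstraction falls short · limits of heuristics

Four places automatic layout selection does not cover

Triton's user model is “choose the tile shape and the compiler decides the layout”. There are places this model does not reach, and all of them are tied to new hardware features.

Warp specialization: in Hopper's producer-consumer pattern, different warps do different work. One warp issues TMA loads while other warps run WGMMA. Triton's “every thread does the same work” model has no natural way to express this.
Fine control of TMA descriptors: detailed choices about multicast, swizzle mode, and prefetch. Automatic lowering does not always find the best option.
Exact shared-memory layout: choosing the swizzle mode used to avoid bank conflicts, such as (3, 0, 3) or (5, 0, 5). The automatic choice can be conservative or simply wrong.
Register-tile partitioning: aligning the user's tile with WGMMA's native shape (m64n*k16). The user cannot force the tile into the intended shape.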
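The bank-conflict item above can be made concrete with a small pure-Python model. This is a sketch under simplifying assumptions that are mine, not the lecture's: 32 banks of 4-byte words, a 32×32 fp32 tile stored row-major, and a plain `col ^ row` XOR swizzle rather than any specific hardware swizzle mode. It counts how many lanes of a warp land on the same bank when the warp reads one column of the tile.

```python
NUM_BANKS = 32  # assumption: 32 banks of 4-byte words

def bank(row, col, width=32, swizzle=False):
    """Bank hit by tile element (row, col) in a row-major fp32 tile."""
    if swizzle:
        col = col ^ (row % width)   # XOR-permute the column within its row
    word = row * width + col
    return word % NUM_BANKS

def max_conflict(swizzle):
    """Worst-case number of lanes hitting one bank when 32 lanes read one column."""
    worst = 0
    for col in range(32):
        hits = {}
        for lane in range(32):      # lane i reads element (i, col)
            b = bank(lane, col, swizzle=swizzle)
            hits[b] = hits.get(b, 0) + 1
        worst = max(worst, max(hits.values()))
    return worst

print(max_conflict(swizzle=False))  # 32: every lane hits the same bank
print(max_conflict(swizzle=True))   # 1: each lane hits a distinct bank
```

The swizzled case is conflict-free because, for a fixed column, `col ^ row` takes 32 distinct values as the row varies. Real swizzle modes operate on wider chunks than one word, so the numbers here illustrate the mechanism only.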

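Symptoms in practice

Reports along the lines of “the same GEMM shows a 30% throughput gap between Triton and CUTLASS” have appeared consistently since Hopper. The cause is almost always a difference in the automatic layout and scheduling decisions. Closing this gap is precisely Gluon's motivation.

The point is not that heuristics can never cover these places, but that today the compiler's automatic decisions are not as good as hand-tuned ones. Over time some of this may move back into the automated path. However, hardware evolves so quickly that a structural gap remains: automation is always one step behind.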

§ 03 Where Gluon sits · a lower level inside Triton

Same Python, same compiler backend, a different degree of explicitness

FIG · Where Gluon sits in the Triton compile stack (same backend, different surface)

L0 · USER · Triton DSL · tl.load · tl.dot · tl.store · automatic layout: the user gives only the tile shape
L0' · USER · Gluon DSL · tl.load + explicit layout argument · the user specifies the layout directly
L1 · Triton-IR · tile-level operations · Triton and Gluon lower to the same IR
L2 · TritonGPU-IR (TTGIR) · layout attribute carried on the type · layout = blocked / shared / MMA / linear
L3 · LLVM-IR / NVVM · register / SMEM allocation decided · layout conversion · pipelining
L4 · PTX → SASS · per-architecture machine code · the code that actually runs on the GPU

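Gluon is just a different surface at L0. From L1 (Triton-IR) down it is the same stack as Triton (needs verification: because Gluon code already carries explicit layouts, it may enter the stack at TTGIR rather than at Triton-IR). This is its biggest selling point: “no need to move to CUTLASS”.

Gluon's position in one line: “a variant of Triton's L0 layer with explicit layout arguments attached”. The same operation, such as tl.load, gets an automatic layout in Triton and a user-specified layout in Gluon.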

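Why sharing a backend matters

Sharing a backend means that (1) the learning curve is not steep, since Triton code can be carried over to Gluon and changed one line at a time; (2) the two stacks can be mixed in one codebase, with Triton where automation is enough and Gluon where precision is needed; and (3) Gluon inherits all of Triton's hardware support.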

§ 04 Expressive power of Linear Layouts · binary matrices over 𝔽₂

A view that brings all of Triton's layouts into one frame

The paper's main thesis: “a tensor layout can be expressed as a matrix over 𝔽₂ (binary), and doing so reduces every layout conversion to a matrix product”. The existing layouts (blocked, shared, MMA, dot operand) become special cases of such matrices.

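Why 𝔽₂

Modern GPU hardware favors powers of two: a tensor has 2^N elements, there are 2^M threads, and 2^P registers. In all of these, thinking at the level of bits is natural.

Definition of a Linear Layout: a layout is a mapping (layoutInputs, layoutOutputs). The bits of the input coordinates and the bits of the output are connected by a matrix over 𝔽₂.

Example: a simple rule such as “bit i of the thread index maps to bit j of the register index” is expressed by a single matrix entry.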
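The definition above can be sketched in a few lines of pure Python. The representation here (one basis vector per input bit, stored as an integer) and the name `apply_layout` are illustrative assumptions, not Triton's API; the only fact relied on is that addition in 𝔽₂ is XOR, so applying the matrix means XOR-ing the columns selected by the set bits of the input.

```python
def apply_layout(bases, index):
    """Apply an F2-linear map: XOR together the basis vector of every set input bit.

    bases[i] is the output (as an integer bit-vector) produced by input bit i
    alone, i.e. column i of the binary matrix.
    """
    out = 0
    for i, basis in enumerate(bases):
        if (index >> i) & 1:
            out ^= basis          # addition in F2 is XOR
    return out

# Example: 3 thread-index bits -> 3 element-index bits.
# Thread bit 0 -> element bit 2, bit 1 -> bit 1, bit 2 -> bit 0 (a bit reversal).
bit_reverse = [0b100, 0b010, 0b001]

mapping = [apply_layout(bit_reverse, t) for t in range(8)]
print(mapping)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Because the map is linear, `apply_layout(bases, a ^ b)` always equals `apply_layout(bases, a) ^ apply_layout(bases, b)`, which is what lets the whole layout be described by its values on single bits.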

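Layout conversion = matrix product

To convert from one layout to another, multiply the two layouts' matrices (over 𝔽₂, that is, with XOR and AND). The resulting matrix describes the cost and pattern of the conversion.

Why this matters: Triton used to carry special-case conversion code per pair of layouts. With N layouts that is N² conversions. Once everything is a Linear Layout, a single algorithm handles every conversion, and the quadratic complexity goes away.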
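A minimal sketch of one generic conversion routine, in pure Python. It assumes both layouts are invertible maps from a thread id to an element index, and it takes the conversion from layout A to layout B to be B⁻¹·A (which thread of B must receive the value that thread t holds under A). The helper names and the column-as-integer representation are hypothetical, and the brute-force inverse is only sensible for toy sizes.

```python
def apply_layout(bases, x):
    """XOR the columns (bases) selected by the set bits of x."""
    out = 0
    for i, b in enumerate(bases):
        if (x >> i) & 1:
            out ^= b
    return out

def compose(outer, inner):
    """Matrix product over F2: column i of (outer * inner) is outer applied to inner's column i."""
    return [apply_layout(outer, col) for col in inner]

def invert(bases):
    """Inverse of an invertible F2 map, found by tabulating all 2^n inputs."""
    n = len(bases)
    table = {apply_layout(bases, x): x for x in range(1 << n)}
    assert len(table) == 1 << n, "layout is not invertible"
    return [table[1 << i] for i in range(n)]

# Two layouts mapping a 3-bit thread id to a 3-bit element index.
layout_a = [0b001, 0b010, 0b100]   # identity: thread t holds element t
layout_b = [0b100, 0b010, 0b001]   # bit reversal

# Conversion A -> B: destination thread for the value thread t holds under A.
conv = compose(invert(layout_b), layout_a)
print([apply_layout(conv, t) for t in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```

The same three functions work for any pair of invertible layouts, which is the sense in which one algorithm replaces per-pair conversion code.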

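What the paper reports

“After introducing Linear Layouts, several layout-related bugs in Triton were fixed and the engineering effort to add new hardware went down.” The paper's main impact lies less in the expressive power of new layouts than in the unification of existing ones.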

FIG · Four layouts of the same 8-element vector as binary matrices (thread 0–7, register 0–0)

blocked: thread → reg 1:1
strided: thread bits reordered
swizzled: XOR-permuted
MMA: native instruction shape

The four layouts are different patterns for distributing one 8-element vector across 8 threads. Each layout is expressed as a 3×3 binary matrix over 𝔽₂ (3 bits of thread index → bits of the register/value index). One unified representation turns every conversion into a matrix product.
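To make the figure concrete, the sketch below writes out 3×3 binary matrices for three of the four patterns and prints which element each of the 8 threads ends up owning. The specific matrices are my own illustrative choices (identity for blocked, a bit reversal for strided, `t XOR (t >> 1)` for swizzled), not the ones shown in the lecture. The MMA case is left out because an 8-element toy vector does not pin down a real instruction shape.

```python
def matvec_f2(rows, x_bits):
    """Multiply a binary matrix (list of rows) by a bit vector over F2."""
    return [sum(r * x for r, x in zip(row, x_bits)) % 2 for row in rows]

def to_bits(x, n=3):
    return [(x >> i) & 1 for i in range(n)]      # least-significant bit first

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

def owners(matrix):
    """Element index owned by each of the 8 threads under the given layout matrix."""
    return [from_bits(matvec_f2(matrix, to_bits(t))) for t in range(8)]

LAYOUTS = {
    # rows index output bits, columns index thread-id bits (both LSB first)
    "blocked":  [[1, 0, 0], [0, 1, 0], [0, 0, 1]],   # identity
    "strided":  [[0, 0, 1], [0, 1, 0], [1, 0, 0]],   # bit-reversal permutation
    "swizzled": [[1, 1, 0], [0, 1, 1], [0, 0, 1]],   # out = t XOR (t >> 1)
}

for name, m in LAYOUTS.items():
    print(f"{name:9s}", owners(m))
```

Each printed row is a permutation of 0..7, which is the toy-sized version of “every thread owns exactly one element”. Only the swizzled matrix has off-diagonal XOR terms; the other two are pure bit permutations.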

§ 05 Example · producer-consumer · warp-specialized pipeline

The same GEMM in Triton and in Gluon, side by side

The most concrete part of the lecture: a direct contrast between writing Hopper's producer-consumer pattern in Triton and writing it in Gluon. The code is deliberately simplified pseudocode; for the exact API see experimental/gluon in the Triton repo.

Triton (automatic layout)

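import triton
import triton.language as tl

# Simple GEMM tile. Assumes row-major contiguous A (M×K), B (K×N), C (M×N)
# and M, N, K divisible by BM, BN, BK, so no masking is needed.
@triton.jit
def gemm_tt(A, B, C, M, N, K,
            BM: tl.constexpr,
            BN: tl.constexpr,
            BK: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BM + tl.arange(0, BM)
    offs_n = pid_n * BN + tl.arange(0, BN)
    offs_k = tl.arange(0, BK)
    a_ptrs = A + offs_m[:, None] * K + offs_k[None, :]
    b_ptrs = B + offs_k[:, None] * N + offs_n[None, :]
    acc = tl.zeros((BM, BN), dtype=tl.float32)
    for k in range(0, K, BK):
        a = tl.load(a_ptrs)        # layout chosen automatically
        b = tl.load(b_ptrs)        # layout chosen automatically
        acc += tl.dot(a, b)        # MMA layout chosen automatically
        a_ptrs += BK               # advance along K
        b_ptrs += BK * N
    c_ptrs = C + offs_m[:, None] * N + offs_n[None, :]
    tl.store(c_ptrs, acc)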

The user chooses only BM/BN/BK. The compiler decides the layouts.

Gluon (explicit layout)

# Explicit layouts · pseudocode. Every gl.* name, layout class, and
# argument below is illustrative, not the real Gluon API.
@gluon.jit
def gemm_gl(A, B, C, M, N, K, BM, BN, BK):
    # state the shared-memory swizzle explicitly
    smem_a = gl.alloc_shared(
        (BM, BK), dtype=fp16,
        layout=SwizzledShared(3,0,3))
    smem_b = gl.alloc_shared(
        (BK, BN), dtype=fp16,
        layout=SwizzledShared(3,0,3))
    # producer warp: TMA load
    if warp_id == 0:
        gl.tma_load(A_desc, smem_a, ...)
        gl.tma_load(B_desc, smem_b, ...)
        gl.barrier_arrive(bar)
    # consumer warps: WGMMA
    else:
        gl.barrier_wait(bar)
        acc = gl.wgmma(smem_a, smem_b,
            layout=MMALayout(m64n128k16))
        gl.store(C_ptrs, acc)

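Producer and consumer are explicit. The swizzle mode and the MMA shape are both under user control.

Comparing the two

The Triton version is short, and fast enough when the heuristics get it right. The Gluon version is longer but precise in the places the heuristics miss. Learning path: write it in Triton first, measure with NCU, then move only the parts where the heuristics fall short to Gluon. Writing everything in Gluon from the start is not recommended (needs verification).

For the real API, check the python/triton/experimental/gluon directory in the Triton repo. As the word experimental suggests, it is under active development as of 2025, and API details can change quickly.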
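Because the Gluon API is still moving, the synchronization pattern itself is easier to study on the CPU. The sketch below emulates the producer-consumer pipeline with Python threads: one thread stands in for the TMA-load warp, another for the MMA warps, and a pair of semaphores per buffer slot plays the role of the “full” and “empty” barriers. The two-stage depth, the tile data, and all names are assumptions made for illustration; nothing here is GPU code.

```python
import threading

NUM_STAGES = 2                      # double buffering (assumed depth)
K_TILES = 8

def run_pipeline(a_tiles, b_tiles):
    buffers = [None] * NUM_STAGES
    empty = [threading.Semaphore(1) for _ in range(NUM_STAGES)]  # slot is free
    full = [threading.Semaphore(0) for _ in range(NUM_STAGES)]   # slot holds data
    result = [0]

    def producer():                 # stands in for the TMA-load warp
        for k in range(len(a_tiles)):
            s = k % NUM_STAGES
            empty[s].acquire()      # wait until the consumer released the slot
            buffers[s] = (a_tiles[k], b_tiles[k])
            full[s].release()       # signal that the data has arrived

    def consumer():                 # stands in for the MMA warps
        for k in range(len(a_tiles)):
            s = k % NUM_STAGES
            full[s].acquire()       # wait for the tile
            a, b = buffers[s]
            result[0] += sum(x * y for x, y in zip(a, b))
            empty[s].release()      # hand the slot back to the producer

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]

a = [[k, k + 1] for k in range(K_TILES)]
b = [[1, 2] for _ in range(K_TILES)]
print(run_pipeline(a, b))  # 100
```

With more than one stage the producer can run ahead of the consumer by up to `NUM_STAGES` tiles, which is the overlap of loads and math that warp specialization is meant to expose.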

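§ 06 Backward compatibility · relation to existing Triton

Can the two stacks be mixed in one codebase?

The most common problem when a new abstraction arrives: “does all existing code have to be rewritten?” Gluon's answer is no. Triton code keeps running as is, and individual functions can be switched to Gluon.

Same .py file: Triton's @triton.jit and Gluon's @gluon.jit coexist in one file.
Same launcher: the kernel-call syntax is nearly identical (tile shape, grid, num_warps, and so on).
Lowering to the same IR: the two stacks are unified down through Triton-IR / TTGIR. The Linear Layout abstraction is what makes unifying the two surfaces possible.
Shared tooling: Triton's TRITON_INTERPRET, ~/.triton/cache, and similar facilities also apply to Gluon (needs verification).
Incremental migration: move only the mainloop of a GEMM kernel to Gluon and leave the epilogue in Triton. The notes consider this feasible (needs verification).

“Triton users get Gluon as an option, with no obligation to move. Where the heuristics are enough, stay where you are.” (study note)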

§ 07 Adoption · FlashAttention · vLLM

How the new stack is making its way into production code

As of 2025 Gluon is under active development, but some production uses are starting to appear. These are the cases most likely to have been mentioned at the time of the lecture.

Parts of FlashAttention's attention kernels

In parts of the Triton-based variant, producer-consumer is made explicit with Gluon, narrowing the throughput gap to the CUTLASS-based reference (needs verification).

vLLM's customizable kernels

Some PagedAttention variants. Gluon may be used for fine-tuning on Hopper (needs verification).

Fallback for PyTorch torch.compile

Where automatic lowering falls short, the user substitutes a kernel written in Gluon. An experimental path.

Kernel libraries such as Liger / AO

A Gluon path may be added to collections of Triton-based kernels.

Research kernels

Fast prototyping of new attention variants and new quantization schemes. Start in Triton, measure, fine-tune in Gluon.

Internal impact of Linear Layouts

Within Triton's own codebase, as the paper reports, several bugs were fixed and the effort to add new hardware went down. For users the value is indirect.

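§ 08 Comparison with CuTe DSL · trade-offs of the two stacks

The precise differences between two abstractions working in “the same space”

FIG · Gluon vs CuTe DSL · comparison on the Python side (the same job in a different frame)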

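Dimension | Gluon (Triton) | CuTe DSL (CUTLASS)
Layout algebra | Linear Layouts (𝔽₂) | Shape × Stride
User model | tile-level | tile + thread + warp
Backend | Triton MLIR | CUTLASS C++
Code length | short | long
Control precision | medium | high
Performance ceiling | close to CUTLASS | highest
Python-only | yes | possible through the DSL
Learning curve | gentle | steep
Integration | torch.compile | QuTLASS / direct
New hardware support | depends on the backend | provided directly by CUTLASS
Debugging | interpreter mode | printf · NCU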

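The table in one line: “CUTLASS has a slight edge in performance ceiling; Gluon has the edge in learning curve and productivity”. The two stacks are not competitors; they fit different user goals. It is natural for the same organization to use both at once.

One rule of thumb for choosing in practice: “how long is this kernel's maintenance window?” If short (research stage), Gluon. If long (a production stack that runs for years), CUTLASS's deep and exact control may be worth it. That said, once Gluon's Linear Layouts mature, this distinction may blur.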

§ 09 What's next · AMD · TMA · per-thread

The next steps in 2025

AMD support: Triton has reached AMD MI300. Gluon applies the same abstraction to layout decisions on AMD. Linear Layouts are vendor-neutral.
First-class API for TMA descriptors: currently partial. The aim is natural fine-grained control of multicast, swizzle mode, and prefetch from Python.
Per-thread operations: Triton issue #8580. Different operations per thread inside Gluon, taking warp specialization a step further.
Aggregates / OOP: issue #8781. Type abstractions (struct, class) inside Gluon, for modularity in large kernels.
Exposing TTGIR layout attributes: issue #9789. Direct access to sizePerThread and warpsPerCTA, for deeper user control.
New Blackwell instructions: the sm_100a MMA variants, native FP4, and TMEM (tensor memory). New layout abstractions may be added.
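For readers unfamiliar with the `sizePerThread` and `warpsPerCTA` attributes mentioned in the list, here is a small sketch of how a one-dimensional blocked layout assigns elements to threads. It also uses `threadsPerWarp`, the third field of Triton's blocked layout, and assumes the simplest 1-D case with no `order` permutation. The function name and return convention are hypothetical; the exact semantics in TTGIR should be checked against the Triton source.

```python
def blocked_owner(i, size_per_thread, threads_per_warp, warps_per_cta):
    """Return (warp, lane, register) holding element i of a 1-D tensor."""
    tile = size_per_thread * threads_per_warp * warps_per_cta  # elements covered once
    rep, j = divmod(i, tile)          # the layout repeats when the tensor is larger
    warp, j = divmod(j, size_per_thread * threads_per_warp)
    lane, reg = divmod(j, size_per_thread)
    return warp, lane, rep * size_per_thread + reg

# Example: 4 contiguous elements per thread, 32 lanes, 4 warps -> 512 elements per pass.
for i in (0, 5, 130, 517):
    print(i, blocked_owner(i, 4, 32, 4))
```

Every step is a division by a power of two, so the whole mapping is a fixed routing of index bits to (register, lane, warp) bits. That is why the blocked layout appears as a special case of a Linear Layout.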

Where Triton is heading

Gluon and Linear Layouts extend Triton from a “high-level tile DSL” to a “full stack from high level to low level”. Rather than encroaching on CUTLASS's territory, they form a frame that makes it possible to write production-grade kernels in Python.

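§ 10 Notes to remember · key takeaways · refs

What should be within reach in five minutes when reopening these notes

Where Gluon sits

A lower-level surface inside Triton. Same backend, different explicitness. Use Hopper's features without moving to CUTLASS.

Limits of Triton's heuristics

Warp specialization, fine control of TMA descriptors, swizzle mode, MMA shape alignment: places where the automatic decisions are not as good as hand-tuned ones.

Linear Layouts

Every layout expressed as a binary matrix over 𝔽₂. Blocked / shared / MMA / swizzled are special cases. Layout conversion = matrix product.

N² → N unification

Before: N layouts meant N² conversions. With Linear Layouts: one algorithm for every conversion. A large reduction in engineering effort.

Backward compatibility

Triton and Gluon coexist in the same file. Incremental migration is possible. No forced changes.

Comparison with CuTe DSL

Gluon = short and gentle, slightly lower ceiling. CUTLASS CuTe = long and steep, highest ceiling. Different roles within the same organization.

Learning path

Triton first, then after measuring with NCU move only the stuck parts to Gluon. Writing everything in Gluon from the start is not recommended.

The paper's main impact

New expressive power on the user side plus unification of layouts inside Triton. Both effects.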

YouTube youtube.com/watch?v=oYs_qtuk2Pg · transcript failed

Slides gluon.pdf · accompanies the lecture

Linear Layouts paper arxiv.org/abs/2505.23819 · Robust Code Generation Using 𝔽₂

Gluon code triton/python/triton/experimental/gluon · experimental API

Triton github.com/triton-lang/triton

L029 · Triton Internals · depth of the Triton compile stack

L103 · CuTe Layout Algebra · the layout algebra on the CUTLASS side (for comparison)

Hands-on · practice sequence

Triton GEMM baseline: write a simple GEMM in Triton. Measure with NCU. On a Hopper GPU (H100), measure the throughput gap to a CUTLASS reference.
Gluon experimental setup: build the code in Triton's experimental/gluon directory. Start with a simple vector add.
Gluon GEMM, one line at a time: convert the same GEMM to Gluon. Specify the shared layout and the MMA layout explicitly. Compare throughput.
Warp specialization: separate the producer (TMA load) from the consumer (WGMMA). Synchronize with mbarrier. The next step up in throughput.
Linear Layout by hand: write out the four layouts of an 8-element vector (blocked, strided, swizzled, MMA) as 3×3 matrices over 𝔽₂.
Conversion = matmul: multiply the conversion matrices between two layouts and derive the matrix of the resulting layout. By hand, without code.
Incremental migration: move only the mainloop of one of your own Triton kernels to Gluon and leave the rest alone. Check that both run from the same file.
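For the measurement steps in this sequence, the arithmetic that turns a kernel time into a throughput figure and a gap is worth writing down once. The helper names are hypothetical and the timings in the example are made-up placeholders, not measurements; the only fixed fact is that an M×N×K GEMM performs 2·M·N·K floating-point operations.

```python
def gemm_tflops(m, n, k, milliseconds):
    """Achieved TFLOP/s for one M x N x K GEMM (2*M*N*K floating-point operations)."""
    flops = 2 * m * n * k
    return flops / (milliseconds * 1e-3) / 1e12

def gap_percent(candidate_tflops, reference_tflops):
    """How far the candidate falls short of the reference, in percent."""
    return 100.0 * (1.0 - candidate_tflops / reference_tflops)

# Hypothetical timings, for illustration only (not measurements).
triton_ms, reference_ms = 0.25, 0.20
t = gemm_tflops(4096, 4096, 4096, triton_ms)
r = gemm_tflops(4096, 4096, 4096, reference_ms)
print(f"{t:.1f} vs {r:.1f} TFLOP/s, gap {gap_percent(t, r):.0f}%")
```

Reporting the gap relative to the reference, rather than as a raw time difference, keeps results comparable across problem sizes.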

§ 11 Links to other lectures · connections

How L104 ties in with other lectures

L029 · Triton Internals · the Triton compile stack: the base that shows what Gluon is built on
L001 · CUDA kernel profiling · measuring Triton vs Gluon: throughput comparison on the same backend
L103 · CuTe Layout Algebra · the layout algebra on the CUTLASS side, as a point of comparison with Linear Layouts
L101 · Learning CUTLASS · the CUTLASS path: the baseline for Gluon's “no need to go to CUTLASS” claim
L086 · FlashAttention 3 · the producer-consumer pattern in attention: an application area for Gluon
L014 · Practitioner's Guide to Triton · Triton's automatic layout: the baseline for what Gluon sets out to fix
L053 · torch.compile Q&A · torch.compile lowering: the possibility of falling back to Gluon
L102 · Quartet v2 · FP4 kernels: a leading candidate application for Gluon

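§ 12 Open questions

Things to verify directly on the next viewing

Gluon's exact surface API: needs checking against the slides and the live demo. The details of experimental/gluon are still changing quickly. (needs verification)
Concrete throughput gap: measurements of Triton vs Gluon vs CUTLASS for GEMM on Hopper were probably shown in the lecture. (needs verification)
Details of the matrix representation of Linear Layouts: which matrix examples the lecture showed, and where the paper and the slides agree or differ. (needs verification)
Layouts on the AMD side: how the MFMA layout of AMD MI300 is expressed as a Linear Layout. (needs verification)
Future API stability: the timeline for Gluon graduating from experimental to a stable Triton API. (needs verification)
Integration with torch.compile: whether torch.compile can use Gluon code directly as a lowering target. (needs verification)
Shape of per-thread operations: the status of issue #8580 and how far it had progressed at the time of the lecture. (needs verification)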

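Verification note

The Gluon code example in these notes (§ 05) is pseudocode; the real API must be checked in python/triton/experimental/gluon in the Triton repo. The explanation of Linear Layouts (§ 04) is put together from the abstract of the arXiv 2505.23819 paper and general reasoning about 𝔽₂. The transcript of the lecture itself failed. Confirming the exact syntax and matrix examples used in the talk requires a close reading of the slides (gluon.pdf) and rewatching the video.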
