GPU Mode · L096 · second half of 2025 · High priority · transcript: failed · supplemented from the TLX repo · speaker list needs confirmation

TLX: a hardware-near extension of Triton

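Triton's selling point has been that the compiler takes care of things: you pick BLOCK_SIZE, num_warps, and num_stages, and lowering works out the rest. The newer Hopper/Blackwell features (TMA, async tensor cores, warp specialization), however, cannot be scheduled well by compiler heuristics alone. TLX (expanded as either Tile-Level eXtensions or Triton Low-level eXtensions; see the naming note in § 01) is a layer that lets expert users take direct control at exactly those points. Because the transcript failed, this note is based on the tlx branch of facebookexperimental/triton and on general GPU kernel domain knowledge.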

TLX · warp specialization · async tensor core · TMA · Hopper · Blackwell · flash attention · pipelined GEMM

Speaker: to be confirmed (the notes mark the speaker as missing; presumably a TLX maintainer at facebookexperimental/triton)
Lecture number: L096
Study priority: High · close reading
Transcript status: failed · supplemented from the repo

Contents · 12 sections

01 · The problem the lecture addresses · why this lecture exists
02 · TLX's abstractions · core abstractions
03 · Lower-level control: local memory · async tensor core · primitives
04 · Comparison with Triton: what you lose and what you gain · trade-offs
05 · Example: pipelined GEMM · walkthrough
06 · Example: warp-specialized flash attention · walkthrough
07 · Adoption: who uses TLX · in the wild
08 · Limitations: portability · learning curve
09 · What's next: relationship to upstream Triton · future
10 · Notes to remember and exercises · key takeaways
11 · Paths to other lectures · connections
12 · Open questions

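§ 01 The problem the lecture addresses · why this lecture exists

Triton's automation cannot capture Hopper's new features

The lecture's starting point. Triton's lowering rests on the assumption that compiler heuristics are smart enough. But the real performance of the H100 comes from hand-combining newer features such as warp specialization, async tensor cores, TMA, and multi-stage pipelining. That is where compiler automation runs out.

The lecture's frame

“Triton gives 80% of users 80% of the performance. The remaining 20%, the last stretch of a frontier kernel, is beyond what the compiler can do. TLX lets an expert step in at that point.”

An example the lecture cites often: flash attention 3. The hand-tuned CUTLASS version reaches 700+ TFLOP/s on the H100. The same logic written in Triton lands at about 60–70% of that (a rough estimate; see § 12). The gap comes from direct control over warp specialization and the async tensor cores, and TLX makes that control available inside Triton.

“Triton's automation does not do everything for you, and you can only compete with hand-tuned code if you can switch that automation off selectively.” (study note)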

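Exact name: needs confirmation

TLX is expanded as either “Triton Low-level eXtensions” or “Tile-Level eXtensions”; this note follows the wording of the README on the tlx branch of facebookexperimental/triton. The exact name used at the time of the lecture still needs to be checked against the video. Both readings carry the same intent: a hardware-near layer of direct control on top of Triton's high-level automation.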

§ 02 TLX's abstractions · core abstractions

Four areas: local memory · async ops · sync · warp specialization

The four areas laid out in the TLX repo's README: hardware-near tools added on top of Triton's high-level abstractions.

A · local memory: tlx.local_alloc() allocates a buffer directly in shared or tensor memory, and tlx.local_load / tlx.local_store move data between memory levels explicitly. The user takes over a job the compiler used to handle on its own.

B · async ops: non-blocking memory transfers plus the async tensor core. Global → local copies and local → tensor core operations are tracked with tokens: token = tlx.async_copy(...) followed by tlx.wait(token).

C · synchronization: barriers, named barriers, and a phase-based protocol. Producer/consumer relationships between warpgroups are stated explicitly.

D · warp specialization: tlx.async_tasks(...) splits a block's threads by task. Some warps only load, others only compute. A core technique on Hopper/Blackwell.

What the four areas share: each switches off part of Triton's automation and hands explicit control to the user. TLX code is therefore longer and more complex than vanilla Triton, but it reaches the places the compiler cannot.
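The token pattern from area B (issue the transfer, keep working, wait only when the data is needed) has a close CPU-side analogue in Python futures. The sketch below is only that analogy: slow_copy and run are hypothetical names, the sleep stands in for memory latency, and nothing here is the TLX API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_copy(src, dst):
    """Stand-in for an asynchronous global-to-local copy."""
    time.sleep(0.01)      # pretend memory latency
    dst[:] = src          # the actual "transfer"
    return len(src)


def run():
    src = list(range(8))
    dst = [0] * 8
    with ThreadPoolExecutor(max_workers=1) as pool:
        token = pool.submit(slow_copy, src, dst)       # issue: returns at once
        independent = sum(x * x for x in range(100))   # work that overlaps the copy
        copied = token.result()                        # wait: block until the copy is done
    return dst, copied, independent


dst, copied, independent = run()
print(dst, copied, independent)
```

The point of the token is the gap between submit and result: any work placed there runs while the copy is in flight, which is the overlap TLX asks the user to schedule by hand.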

Where TLX lives: a branch of a Triton fork

TLX is not a standalone library. It is an extension that lives inside the tlx branch of facebookexperimental/triton, used roughly as import triton.language.extra.tlx as tlx next to import triton.language as tl. It is an experimental branch that evolves separately from upstream Triton.

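§ 03 Lower-level control · local memory · async tensor core

Hopper's new hardware features are exposed directly in code

The core of what TLX enables: calling the H100's hardware features directly from within Triton. The spirit of CUTLASS, in Python.

TMA (Tensor Memory Accelerator): Hopper's new DMA engine. It moves a large tile from global to shared memory with a single instruction. Multi-dimensional address generation is handled by the hardware from a tensor descriptor rather than computed by the threads.
WGMMA (Warp Group MMA): the H100's async tensor core instruction. It is launched asynchronously and its completion is awaited via a token. It is issued by a whole warpgroup (four warps) and covers a much larger tile per instruction than the ordinary per-warp mma.
warp specialization: the warps of a block are split into groups that do different jobs. One group loads (memory unit), another runs mma (tensor core), and another handles the epilogue. Memory and compute overlap at the hardware level.
multi-stage pipeline: the same buffer is kept in N copies, and each stage works on a different tile. The latency of an async transfer is hidden behind the compute of the stages that are already loaded.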
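The multi-stage pipeline item has a hard budget behind it: every extra stage costs one more copy of the A and B tiles in shared memory. A small sketch of that arithmetic, assuming fp16 tiles and a 227 KB per-block shared memory limit on the H100 (the figure given for compute capability 9.0 in the CUDA programming guide; treat it as an assumption and check it for your part). The function names are made up for this note.

```python
H100_MAX_SMEM_PER_BLOCK = 227 * 1024  # bytes; assumed limit for compute capability 9.0


def pipeline_smem_bytes(bm, bn, bk, stages, dtype_bytes=2):
    """Shared memory needed for `stages` copies of one A tile and one B tile."""
    a_tile = bm * bk * dtype_bytes
    b_tile = bk * bn * dtype_bytes
    return stages * (a_tile + b_tile)


def max_stages(bm, bn, bk, dtype_bytes=2, budget=H100_MAX_SMEM_PER_BLOCK):
    """Largest stage count whose buffers still fit in the budget."""
    return budget // pipeline_smem_bytes(bm, bn, bk, 1, dtype_bytes)


print(pipeline_smem_bytes(128, 128, 64, stages=3))  # 98304 bytes = 96 KiB
print(max_stages(128, 128, 64))                     # 7
```

With 128x128x64 fp16 tiles each stage costs 32 KiB, so a 3-stage pipeline already takes 96 KiB before any epilogue or barrier storage is counted.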

Limits of vanilla Triton

Vanilla Triton also generates PTX that runs on the H100, but that PTX does not fully exploit these four features because the compiler schedules conservatively. TLX lets the user write down directly: “this part is async, this is the producer, this is the consumer.”

§ 04 Comparison with Triton · trade-offs

Same job, different abstraction: vanilla is cleaner, TLX is faster

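| Dimension | vanilla Triton | TLX |
|---|---|---|
| memory management | tl.load · tl.store; the compiler decides how shared memory is used | tlx.local_alloc; buffers allocated directly, transfers between levels made explicit |
| async | implicit (partly async via compiler heuristics) | explicit and token-based: async_copy / wait |
| warp management | only num_warps is specified; every warp does the same work | async_tasks separates warp groups; each group does different work |
| code length | short (typically 50–100 lines) | longer (typically 200–400 lines) |
| performance | 60–80% of peak | 85–95% of peak (when well tuned) |
| debugging difficulty | interpreter mode and breakpoints work | deadlocks possible from async / sync |
| portability | A100 · H100 · Blackwell · MI300 (for the most part) | arch-specific (mainly Hopper and later) |
| target users | most ML engineers | kernel library authors, frontier kernels |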

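“TLX does not turn Triton into a ‘C-like’ DSL. It lets you mark, within the same DSL, that ‘this spot is hardware-near.’” (study note)

When TLX is worth using

1. frontier kernels (flash attention, GEMM-K): when the last 20% of performance matters.
2. Hopper/Blackwell only: when you want to exploit the features of the new hardware aggressively.
3. kernel library authors: code that is written once and called many times.
For ordinary model code, vanilla Triton is the more efficient choice.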

§ 05 Example: pipelined GEMM · walkthrough

Overlapping load and compute with a multi-stage buffer

The first example in the TLX repo: pipelined GEMM. Written in vanilla Triton, the compiler builds part of the pipeline for you; written directly in TLX, the user controls the timing of all N stages.

FIG · Timeline of a 3-stage pipelined GEMM · load / compute / store overlap

stage 0 buffer: load tile 0 → mma 0 → idle
stage 1 buffer: idle → load tile 1 → mma 1 → idle
stage 2 buffer: idle → load tile 2 → mma 2 → store
tensor core: idle (warmup) → mma 0 → mma 1 → mma 2 → … (continuous)

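After the warmup phase the tensor core is almost never idle. The async load fetches the next tile ahead of time and completes while the mma is still running on the previous one. Three stages are usually enough; four or more mostly add shared memory use for little gain.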

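# pseudo-code: skeleton of a TLX pipelined GEMM
# NOTE: the tlx.* names and signatures are this note's placeholders, not verified
# against the repo; the pointer arithmetic is schematic as well.
import triton
import triton.language as tl
import triton.language.extra.tlx as tlx  # import path as assumed in § 02

@triton.jit
def gemm_kernel(...):
    # 3-stage shared memory ring buffers for the A and B tiles
    a_buf = tlx.local_alloc(
        shape=(3, BM, BK),
        dtype=tl.float16,
    )
    b_buf = tlx.local_alloc(
        shape=(3, BK, BN),
        dtype=tl.float16,
    )
    num_k = K // BK

    # warmup: prefetch the first 2 tiles of A and B.
    # toks[slot] holds the (A token, B token) of the load in flight for that slot.
    toks = [None] * 3
    for s in range(min(2, num_k)):
        toks[s] = (
            tlx.async_copy(a_ptr + s*BK, a_buf[s]),
            tlx.async_copy(b_ptr + s*BK, b_buf[s]),
        )

    acc = tl.zeros(...)

    for k in range(num_k):
        slot = k % 3
        next_slot = (k + 2) % 3

        # start the async load of tile k+2, if there is one.
        # next_slot was last read by the mma of iteration k-1, so a real
        # kernel must make sure that mma has retired before overwriting it.
        if k + 2 < num_k:
            toks[next_slot] = (
                tlx.async_copy(a_ptr + (k+2)*BK, a_buf[next_slot]),
                tlx.async_copy(b_ptr + (k+2)*BK, b_buf[next_slot]),
            )

        # wait until both loads for the current tile have landed
        tlx.wait(toks[slot][0])
        tlx.wait(toks[slot][1])

        # mma on the async tensor core
        acc = tlx.async_mma(
            a_buf[slot], b_buf[slot], acc
        )

    # drain all outstanding async work before reading acc
    tlx.wait_all()
    tl.store(c_ptr, acc)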

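What this pattern teaches: async operations plus buffer rotation are the essence of multi-stage pipelining. The user expresses directly the timing the compiler could not see. Real code is longer (boundary handling, prologue/epilogue), but the core is the skeleton above.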
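How many stages are "enough" depends on the ratio of load latency to mma time, and a toy timeline model makes that visible. The sketch below is a simplified CPU-side simulation, not a measurement: simulate_pipeline is a made-up name, time is in arbitrary units, loads are assumed to overlap freely with each other, and a slot is assumed reusable as soon as the mma that read it finishes.

```python
def simulate_pipeline(num_tiles, stages, load_latency=2, mma_time=1):
    """Return (total_time, tensor_core_busy_fraction) for a simple model.

    Assumptions of the model:
      * loads are asynchronous and may overlap with each other,
      * tile k's load may start once the mma that last used its slot
        (tile k - stages) has finished,
      * mma calls run back to back on a single tensor core.
    """
    mma_end = []
    for k in range(num_tiles):
        slot_free_at = 0 if k < stages else mma_end[k - stages]
        load_done = slot_free_at + load_latency
        prev_mma_done = mma_end[k - 1] if k > 0 else 0
        start = max(load_done, prev_mma_done)   # need the data and a free tensor core
        mma_end.append(start + mma_time)
    total = mma_end[-1]
    return total, num_tiles * mma_time / total


for stages in (1, 2, 3, 4):
    total, busy = simulate_pipeline(num_tiles=30, stages=stages)
    print(f"stages={stages}: total={total:3d}  tensor core busy={busy:.0%}")
# stages=1: total= 90  tensor core busy=33%
# stages=2: total= 46  tensor core busy=65%
# stages=3: total= 32  tensor core busy=94%
# stages=4: total= 32  tensor core busy=94%
```

In this model the tensor core saturates once stages × mma_time covers load_latency + mma_time, which happens at three stages for the chosen 2:1 ratio; a fourth stage only costs more shared memory. With a longer load latency the break-even point moves to a higher stage count.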

§ 06 Example: warp-specialized flash attention · walkthrough

Splitting the warps of one block into producers and consumers: where async really pays off

The most important TLX example. A newer flash attention variant is expressed naturally with TLX's async_tasks, with the warps of one block split into two groups plus an optional third.

load warpgroup: fetches the Q, K, and V tiles asynchronously from global to shared memory using TMA. It never touches the tensor core.
compute warpgroup: runs attention (QK^T → softmax → AV) on the tiles the load warpgroup brought in, issuing WGMMA. It never issues loads.
epilogue warpgroup (optional): stores the compute result to global memory, and can run concurrently with compute.

The effect of this split: memory and compute run at the same time on different hardware units. It is the CUTLASS producer-consumer pattern, inside Triton.

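# skeleton of TLX warp specialization (pseudo-code)
# NOTE: the tlx.* names, the task decorator form, and the barrier methods are this
# note's placeholders, not verified against the repo; tile offsets are omitted.
@triton.jit
def attention(...):
    # shared memory + barrier setup
    q_buf = tlx.local_alloc(...)
    k_buf = tlx.local_alloc(...)
    v_buf = tlx.local_alloc(...)
    barrier = tlx.named_barrier()
    acc = tl.zeros(...)  # output accumulator, defined before the tasks use it

    with tlx.async_tasks() as tasks:
        @tasks.task(num_warps=4)
        def load_loop():
            # producer: issues TMA loads only
            tlx.tma_load(q_ptr, q_buf)  # Q is loaded once, before the K/V loop
            for i in range(N_BLOCKS):
                tlx.tma_load(k_ptr, k_buf[i])
                tlx.tma_load(v_ptr, v_buf[i])
                barrier.arrive(i)  # signal: K/V tile i is ready

        @tasks.task(num_warps=4)
        def compute_loop():
            # consumer: issues tensor core work only
            for i in range(N_BLOCKS):
                barrier.wait(i)  # block until tile i (and Q) has landed
                qk = tlx.async_mma(q_buf, k_buf[i])
                # softmax stands for the online-softmax step (running max and
                # sum, rescaling of acc), which is omitted in this skeleton
                p  = softmax(qk)
                acc = tlx.async_mma(p, v_buf[i], acc)

    tl.store(out_ptr, acc)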

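The mental model for this code:

There are two def functions inside async_tasks: the code for the two warpgroups.
Each warpgroup occupies the number of warps given by num_warps.
named_barrier provides the producer-consumer synchronization between the two warpgroups.
The load warpgroup issues only TMA and the compute warpgroup only WGMMA: a natural split along hardware units.

To get the same effect in vanilla Triton, the compiler would have to split the warps on its own, and that is hard to generalize.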
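The producer/consumer handshake in this mental model can be tried out on a CPU with two threads and a small ring buffer. This is an analogy only, assuming nothing about TLX: run_specialized is a hypothetical name, Python semaphores stand in for the "tile ready" and "slot free" signals, and a list sum stands in for the mma.

```python
import threading


def run_specialized(n_blocks=6, num_stages=2):
    slots = [None] * num_stages
    full = [threading.Semaphore(0) for _ in range(num_stages)]   # producer -> consumer
    empty = [threading.Semaphore(1) for _ in range(num_stages)]  # consumer -> producer
    results = []

    def load_loop():                      # "load warpgroup": only moves data
        for i in range(n_blocks):
            s = i % num_stages
            empty[s].acquire()            # wait until the slot may be overwritten
            slots[s] = list(range(i, i + 4))   # stand-in for a K/V tile
            full[s].release()             # arrive: tile i is ready

    def compute_loop():                   # "compute warpgroup": only does math
        for i in range(n_blocks):
            s = i % num_stages
            full[s].acquire()             # wait for tile i
            results.append(sum(slots[s])) # stand-in for the mma
            empty[s].release()            # hand the slot back to the producer

    threads = [threading.Thread(target=load_loop),
               threading.Thread(target=compute_loop)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_specialized())  # [6, 10, 14, 18, 22, 26]
```

Two signals per slot are needed, not one: the consumer must know a tile has arrived, and the producer must know the slot has been drained before it overwrites it. The attention skeleton above shows only the first of the two.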

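Why this pattern makes a big difference

On the H100, the tensor cores and the memory unit are separate units at the hardware level. With SPMD-style code (every warp doing the same thing), one of them tends to sit idle while the other works. Running both at once through warp specialization can lift hardware utilization from roughly 50% to 90%+. That is the essence of how flash attention 3 reaches 700+ TFLOP/s.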
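The 50% and 90% figures can be read as a back-of-the-envelope model: without overlap, the wall time per tile is load plus compute, and with overlap it is whichever of the two is longer. A minimal sketch with hypothetical per-tile times (compute_utilization is a made-up name, and the inputs are illustrative, not measured):

```python
def compute_utilization(t_load, t_compute, overlapped):
    """Share of wall time the compute unit is busy while processing one tile."""
    if overlapped:
        wall = max(t_load, t_compute)   # both units run at the same time
    else:
        wall = t_load + t_compute       # load first, then compute
    return t_compute / wall


print(compute_utilization(1.0, 1.0, overlapped=False))  # 0.5
print(compute_utilization(1.0, 0.9, overlapped=True))   # 0.9
```

When load and compute take about the same time, serial execution caps the compute unit at 50%; with full overlap the cap becomes the ratio of compute time to the slower of the two.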

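§ 07 Adoption · in the wild

Still experimental, but with a place in frontier kernels

As the facebookexperimental name suggests, TLX is still an experiment. In the frontier kernel space, though, it occupies an interesting position.

flash attention variants: attention written in TLX shows a large speedup over vanilla Triton, nearly on par with hand-tuned CUTLASS.
pipelined GEMM: the core op of any kernel library, and a place where competing with cuBLAS is feasible.
MoE GEMM: GEMM for MoE combined with sparse routing. TLX's explicit control helps where per-expert batch sizes vary.
academic research: papers on the lowering of Triton + TLX itself are starting to appear.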

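§ 08 Limitations · portability · learning curve

TLX is expensive: learning time, code length, portability

arch-specific: depends on the new Hopper/Blackwell features. On the A100 some functionality is emulated or does not work at all. On AMD the code is almost entirely different.
learning curve: the mental model of async / sync / barriers / warp groups sits one level above vanilla Triton. It feels familiar if you have used CUTLASS.
deadlock risk: barriers and waits can form circular dependencies. Debugging is hard, and even interpreter mode has its limits.
loss of portability: if TLX code does not run on other GPUs, fallback code has to be maintained separately. Triton's promise of “one code for every GPU” breaks.
not upstreamed: an experiment that has not landed in upstream Triton, so it can conflict with other upstream changes.
scarce learning material: few examples and tutorials compared with the wealth available for vanilla Triton.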
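The deadlock risk in the list above is easy to reproduce in miniature: a consumer waits at a barrier for a producer that never arrives. The CPU-side sketch below uses Python's threading.Barrier with a timeout so the hang becomes visible; run_with_missing_arrival is a hypothetical name, and a GPU barrier has no such timeout, so the real kernel would simply hang.

```python
import threading


def run_with_missing_arrival(timeout=0.2):
    barrier = threading.Barrier(2)        # expects two participants
    outcome = {}

    def consumer():
        try:
            barrier.wait(timeout=timeout)     # the producer never calls wait()
            outcome["consumer"] = "passed"
        except threading.BrokenBarrierError:
            outcome["consumer"] = "timed out"

    t = threading.Thread(target=consumer)
    t.start()
    t.join()
    return outcome["consumer"]


print(run_with_missing_arrival())  # timed out
```

The usual causes in a warp-specialized kernel have the same shape: a producer that exits its loop one iteration early, or an arrive placed on only one side of a branch.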

cost-benefit

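TLX pays for itself only where a single kernel is called very often. Something like flash attention sits on the hot path of LLM serving, so the payback is fast. For ordinary model code, vanilla Triton or torch.compile is the more reasonable choice. “Does your kernel run a million times? Then TLX. Otherwise, vanilla.”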

§ 09 What's next: relationship to upstream Triton · future

Some TLX abstractions may be absorbed into upstream Triton

TLX is currently a facebookexperimental fork. But as upstream Triton deepens its Hopper/Blackwell support, some TLX abstractions (especially the async tensor core and warp specialization) may be absorbed.

standardizing warp specialization: upstream Triton may support it natively, in a form such as tl.async_tasks.
TMA / WGMMA: partly available in Triton 3.x already. TLX's token-based async may join the standard.
teaching the compiler heuristics: the compiler may learn to recognize TLX code patterns and produce the same lowering from vanilla Triton code.
convergence with CUTLASS: Triton + TLX and CUTLASS Python are converging on the same territory and influencing each other.

§ 10 Notes to remember and exercises · key takeaways

What should come back within five minutes of reopening this note

4 areas of abstraction: local memory · async ops · sync · warp specialization. Hardware-near control on top of Triton's automation.
tlx.local_alloc: direct shared memory allocation. Multi-stage buffers express pipelining.
async_copy + token + wait: explicit async memory transfer. You state the timing the compiler cannot find.
async_tasks: splits a block's warps into groups, each doing a different job (load / compute / store).
named_barrier: producer-consumer synchronization between warpgroups. Phase-based protocol.
TMA + WGMMA: direct calls into Hopper's new hardware units, keeping memory and compute busy at the same time.
arch-specific: Hopper/Blackwell first. A100 / AMD need a separate fallback.
when to adopt: frontier kernel + called very often + Hopper or later. Ordinary model code stays vanilla.
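The phase-based protocol noted under named_barrier is commonly implemented by giving each ring-buffer slot a one-bit phase that flips every time the ring wraps around, so a waiter can tell "this round's arrival" from "last round's". Whether TLX exposes exactly this is still to be checked against the repo; the sketch below only shows the index arithmetic, and slot_and_phase is a hypothetical helper.

```python
def slot_and_phase(i, num_stages):
    """Ring-buffer slot and expected barrier phase bit for iteration i."""
    return i % num_stages, (i // num_stages) & 1


schedule = [slot_and_phase(i, 3) for i in range(8)]
print(schedule)
# [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (0, 0), (1, 0)]
```

Iterations 0 and 3 both use slot 0 but expect different phase bits, which is what keeps a fast consumer from mistaking the signal for tile 0 as the signal for tile 3.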

Exercise sequence

repo clone + build: the tlx branch of facebookexperimental/triton. Build from source (cmake), then run a hello-world example on an H100.
vanilla GEMM vs TLX pipelined GEMM: run the same GEMM size in both versions and compare SM occupancy and tensor core utilization with NCU (Nsight Compute).
flash attention comparison: throughput of the TLX attention example versus vanilla Triton attention, and against hand-tuned CUTLASS as well.
write warp specialization yourself: split a simple vector add into load and compute warpgroups. A small example to settle the mental model.
hardware utilization with NCU: compare the NCU dumps of the TLX and vanilla code, looking at the difference in tensor core idle time and memory unit busy time.
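For the GEMM comparisons in these exercises, timings have to be turned into TFLOP/s before they can be set against the figures quoted in this note. A small helper sketch: the 989 TFLOP/s peak is the dense FP16 figure from NVIDIA's H100 SXM datasheet as recalled here (an assumption to verify for your exact part), and the 1.5 ms timing is a made-up input, not a measurement.

```python
H100_SXM_DENSE_FP16_TFLOPS = 989.0  # assumed peak: dense FP16, no sparsity


def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOP/s of an (m x k) @ (k x n) GEMM: 2*m*n*k flops per call."""
    return 2 * m * n * k / seconds / 1e12


def fraction_of_peak(tflops, peak=H100_SXM_DENSE_FP16_TFLOPS):
    return tflops / peak


# hypothetical timing for an 8192^3 GEMM
achieved = gemm_tflops(8192, 8192, 8192, seconds=1.5e-3)
print(f"{achieved:.1f} TFLOP/s, {fraction_of_peak(achieved):.0%} of peak")
# 733.0 TFLOP/s, 74% of peak
```

The factor of 2 counts one multiply and one add per inner-product term. Time the kernel over many iterations after a warmup, since a single launch is dominated by launch overhead and clock ramp-up.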

YouTube: L096 lecture video

Repo: facebookexperimental/triton (tlx branch)

Reference: CUTLASS · flash attention 3 paper · Hopper architecture whitepaper

§ 11 Paths to other lectures · connections

Within the Triton / kernel series

L001 · How to profile CUDA kernels in PyTorch: Triton's lowering ladder, as background for where TLX fits in
L014 · Practitioner's Guide to Triton: the basics of vanilla Triton, required before studying TLX
L029 · Triton Internals: Triton's lowering passes, and how TLX is layered on top of them
L028 · Liger Kernel: Triton kernels applied to LLM training, a potential use case for TLX
L097 · HipKittens: a similar layer on AMD, and the multi-vendor landscape of kernel DSLs

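§ 12 Open questions

Things to verify first-hand on the next listen

speaker list to confirm: the notes say “speaker missing”. Identify the presenter from the video.
exact full name of TLX: “Triton Low-level eXtensions” vs “Tile-Level eXtensions”. Which expression the lecture used.
exact API of TLX code: the examples in this note are guesses based on the README plus general patterns. Take the real function signatures from the repo examples.
relationship to upstream Triton: which parts of TLX were planned to go upstream at the time of the lecture.
exact throughput ratio versus vanilla Triton: figures such as the 60–70% for flash attention are rough estimates. The lecture may have given exact numbers.
Blackwell support: how the lecture handled the new Blackwell (B100/B200) features, such as tensor memory and the FP4 tensor cores. (FP8 tensor cores, which an earlier draft listed here, already exist on Hopper.)
other lower-level Triton variants: whether the lecture compared TLX with other low-level approaches. (The note's earlier example, AITemplate, is a CUTLASS/Composable Kernel code generator rather than a Triton fork.)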

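Verification note

The four-area classification in this note (local memory · async ops · sync · warp specialization) and the code examples are based on the README of the tlx branch of facebookexperimental/triton. The lecture may have used a different framing or different names; this needs to be filled in after checking the video directly.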
