《The Stack》 · EP.43 · 2026 · APR · 20 · 42 MIN · IMAGINARY

What Triton hides,
and what it exposes

Spread seven files on the table and read them line by line. When the same kernels move from CUDA to Triton — how does the code fold? What details disappear under the compiler, and what stays in your hands?

Host

Sam

"OK, so why is this the comfortable one?"

Guest

Jensen

Wasn't thrilled about it internally, but has come to accept it

CHAPTERS · 7 ACTS · 7 FILES · 42:08

01 · 01:20 · Smallest Triton program · smoke_vector_add.py
02 · 06:48 · One tl.sum replaces an entire warp shuffle · reduction.py
03 · 13:02 · One program = one row · softmax.py
04 · 18:41 · One thing where Triton beats CUDA · matmul.py
05 · 26:10 · Flash Attention in 40 lines · flash_attention.py
06 · 31:35 · The magic of one constexpr · flash_attention_mha.py
07 · 37:20 · Climbing up to torch.ops.* · flash_attention_mha_op.py

00 · 00:28 · Cold Open

Seven files. One line at a time.

SAM

Last episode I asked "five years from now what else will you be optimizing besides matmul," and you punted to the next one.

JENSEN

(laughs) Still not telling. Something more fun instead. This time our friend rewrote the same kernels in Triton. Seven files — let's open them and read line by line.

FIG 0 · today's material · triton_kernels/ · 7 files

01 · 01:20 · Smallest Triton program

One `program` = one block, not one thread.

CUDA's threadIdx.x didn't vanish — it just slipped under the compiler. Triton is block-level SPMD: the thread parallelism inside a block is decided by the compiler from num_warps.

smoke_vector_add.py · Triton · block-level SPMD

import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n,
                      BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

Same address math · CUDA · thread-level

int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) out[idx] = x[idx] + y[idx];

SAM

Whatever I was doing with threadIdx.x in CUDA — where did it go in Triton?

JENSEN

Gone. More precisely, the compiler hid it. tl.arange(0, 1024) is a vector pointing to "all 1024 indices this block handles." Which thread handles which lane — Triton decides.

SAM

Sounds like SIMD.

JENSEN

Exactly that. "Block-level SPMD." The program is per block, not per thread. Thread parallelism inside is the compiler's problem.
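To make "the program is per block" concrete, here is a minimal CPU-only sketch in plain Python. It is an emulation, not Triton: the names `vector_add_program` and `launch` are hypothetical, and the `for` loop over lanes stands in for work a GPU would spread across threads. What it keeps is the index math of the kernel above: one program id, a whole vector of offsets, and a mask for the tail.

```python
def vector_add_program(pid, x, y, out, n, block_size):
    """One Triton-style program: owns a whole block of indices, not one index."""
    # Plays the role of pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offsets = [pid * block_size + lane for lane in range(block_size)]
    # Plays the role of mask = offsets < n
    mask = [off < n for off in offsets]
    for off, keep in zip(offsets, mask):
        if keep:  # masked-off lanes neither load nor store
            out[off] = x[off] + y[off]


def launch(x, y, block_size):
    """Hypothetical launcher: one program per block, like a 1-D grid."""
    n = len(x)
    out = [0.0] * n
    grid = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(grid):  # on a GPU these programs run concurrently
        vector_add_program(pid, x, y, out, n, block_size)
    return out, grid


x = [float(i) for i in range(10)]
y = [10.0 * i for i in range(10)]
result, grid = launch(x, y, block_size=4)
print(grid)    # 3 programs cover 10 elements; the last one has 2 masked lanes
print(result)
```

Nothing in `vector_add_program` names an individual thread, which is the point of the exchange above: the lane-to-thread mapping is left to the compiler.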

Both patterns lower to the same PTX.
But Triton lets you think at the "index space" level.— Jensen, 04:12

FIG 1 · abstraction height · where does threadIdx.x live?

↳ Lesson 08 · Triton port

02 · 06:48 · One tl.sum replaces an entire warp shuffle

But it charges you three things.

FIG 2 · same reduction, CUDA vs Triton · ~15 lines → 2 lines

SAM

So it really is free?

JENSEN

(laughs) Not quite. Three costs.

① launch overhead

Reducing 67M elements: CUDA v4 = 1.039 ms, Triton = 1.097 ms (roughly 5% slower). The path Python → autotune cache → JIT cache → argument binding → cuLaunchKernel eats ~50–100 µs per launch. For tiny kernels that overhead can exceed the compute time. → Launch 30 element-wise ops as 30 separate Triton kernels and you're sunk.
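A quick back-of-the-envelope check of those numbers, as a sketch. It assumes the 30 launches do not overlap, so their host-side overheads simply add up; that serialization is an assumption of this example, not something measured in the episode.

```python
# Figures quoted in the episode
cuda_ms = 1.039
triton_ms = 1.097
overhead_us = (50, 100)      # quoted per-launch overhead range

# The measured gap for a single launch sits inside the quoted range
gap_us = (triton_ms - cuda_ms) * 1000.0
print(f"measured gap: {gap_us:.0f} us")

# 30 separate element-wise kernels, overheads assumed to add up
n_kernels = 30
low_ms = n_kernels * overhead_us[0] / 1000.0
high_ms = n_kernels * overhead_us[1] / 1000.0
print(f"overhead for {n_kernels} launches: {low_ms} to {high_ms} ms")
```

Under that assumption the launch overhead alone (1.5 to 3 ms) exceeds the time of the entire 67M-element reduction, which is why fusing element-wise work into fewer kernels matters.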

② autotune footgun

Autotune runs configs sequentially against the same output buffer. Leftover stale partial sums from previous attempts mix into the result. Fix — reset_to_zero=["partial_ptr"]. It's faintly mentioned in the docs; miss it and you debug for hours.
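The failure mode can be reproduced in a few lines of plain Python. This is a toy model under stated assumptions: `reduce_kernel` accumulates into its output the way an atomic add would, and `toy_autotune` is a hypothetical stand-in that runs each config exactly once. How reduction.py actually writes its partials is not shown in the episode.

```python
def reduce_kernel(data, out, block_size):
    """Toy accumulating kernel: every program adds its block sum into out[0]."""
    for start in range(0, len(data), block_size):
        out[0] += sum(data[start:start + block_size])  # like an atomic add


def toy_autotune(data, out, block_sizes, reset_to_zero):
    """Hypothetical autotuner model: tries every config on the SAME buffer."""
    for bs in block_sizes:
        if reset_to_zero:
            out[0] = 0  # what reset_to_zero=[...] asks for before each run
        reduce_kernel(data, out, bs)
    return out[0]


data = list(range(1, 101))   # true sum is 5050
configs = [16, 32, 64]

stale = toy_autotune(data, [0], configs, reset_to_zero=False)
clean = toy_autotune(data, [0], configs, reset_to_zero=True)
print(stale, clean)          # 15150 5050
```

Without the reset, each trial adds on top of what the previous trial left behind. A real autotuner usually times each config more than once, so the contamination would be larger than this toy suggests.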

③ autotune + 2-pass reduce idiom

Change BLOCK_SIZE and num_programs changes too. Size the partial buffer for the smallest BLOCK_SIZE among the configs (the largest program count), then slice it to the prefix that matches the chosen config.
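The sizing rule can be sketched on the CPU as follows. The helper names (`cdiv`, `two_pass_sum`) are made up for this example, and NaN is used only to mark slots the chosen config never writes, so that reading past the prefix would be obvious.

```python
def cdiv(a, b):
    """Ceiling division."""
    return (a + b - 1) // b


def two_pass_sum(data, block_sizes, chosen):
    n = len(data)
    # Allocate once, sized for the smallest block, i.e. the most programs.
    max_programs = cdiv(n, min(block_sizes))
    partial = [float("nan")] * max_programs

    # Pass 1: the chosen config only fills the first num_programs slots.
    num_programs = cdiv(n, chosen)
    for pid in range(num_programs):
        partial[pid] = sum(data[pid * chosen:(pid + 1) * chosen])

    # Pass 2: reduce only the prefix that was actually written.
    return sum(partial[:num_programs]), max_programs, num_programs


data = list(range(1, 1001))  # true sum is 500500
total, max_programs, num_programs = two_pass_sum(data, [64, 128, 256], chosen=256)
print(total, max_programs, num_programs)  # 500500 16 4
```

Summing all 16 slots instead of the 4-slot prefix would pull in the untouched entries, which in the real buffer are whatever a previous config left there.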

"Triton is high-level but thin."
Its abstraction is shallow, so the internals keep leaking out. Handling the leaks well is the value of a Triton engineer.— Jensen, 12:04

reduction.py · footgun remedy · reset_to_zero

@triton.autotune(
    configs=AUTOTUNE_CONFIGS,
    key=["n_elements"],
    reset_to_zero=["partial_ptr"],  # ← matters
)

↳ Lesson 03 · Reduction

03 · 13:02 · One program = one row

Mask logic dissolves naturally into the data values.

SAM

So how do you handle rows with N=1000?

JENSEN

BLOCK_SIZE=1024, mask out the last 24. The trick is other=-float("inf"). If OOB lanes hold -inf, tl.max is unaffected and exp(-inf)=0 so sum doesn't get contributions either. Mask logic melts into the data values.

SAM

The autotune key is BLOCK_SIZE. Clever not to use N directly.

JENSEN

Right. With BLOCK_SIZE = _next_pow2(N), N=513–1024 all bucket into 1024. Cache-friendly autotune key design — the last skill you pick up when learning Triton.
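A small sketch of the bucketing effect. `next_pow2` here is a hypothetical reimplementation written for this example; the episode's `_next_pow2` helper itself is not shown.

```python
def next_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


print(next_pow2(513), next_pow2(1000), next_pow2(1024), next_pow2(1025))
# 1024 1024 1024 2048

# Every row length from 513 to 1024 maps to the same autotune key
bucket = {next_pow2(n) for n in range(513, 1025)}
print(bucket)  # {1024}

# Keying on N directly vs keying on the bucket, for N in 1..4096
keys_raw = len(set(range(1, 4097)))
keys_bucketed = len({next_pow2(n) for n in range(1, 4097)})
print(keys_raw, keys_bucketed)  # 4096 13
```

With N as the key, every new row length would trigger a fresh tuning run; with the power-of-two bucket there are only 13 distinct keys over that whole range.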

FIG 3 · handling OOB lanes at N=1000 · mask = data value

softmax.py · the three core lines

offs = tl.arange(0, BLOCK_SIZE)       # 0..1023
mask = offs < n_cols                  # only first 1000 True
x = tl.load(in_row + offs, mask=mask,
            other=-float("inf"))       # OOB → -inf
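Why -inf is the right fill value can be checked on the CPU. The sketch below pads a 1000-element row to 1024 in plain Python, once with -inf and once with 0.0 for contrast; the input values are arbitrary and chosen only for illustration.

```python
import math


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


n_cols, block_size = 1000, 1024
row = [math.sin(i) for i in range(n_cols)]
pad = block_size - n_cols

ref = softmax(row)                          # unpadded reference
out = softmax(row + [-math.inf] * pad)      # what other=-inf gives each OOB lane
bad = softmax(row + [0.0] * pad)            # what a zero fill would give

print(out[:n_cols] == ref)                  # True: padding changed nothing
print(all(v == 0.0 for v in out[n_cols:]))  # True: OOB lanes contribute 0
print(sum(bad[:n_cols]))                    # below 1: zero padding steals mass
```

The -inf lanes never win the max and add exactly 0.0 to the sum, so no separate masking step is needed in the max or sum reductions.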

↳ Lesson 05 · Softmax

04 · 18:41 · One thing where Triton beats CUDA

Grouped program-id swizzling — possible in CUDA, but nobody writes it.

Change only the order in which output C tiles are visited, and L2 reuse changes dramatically. Visit tiles in row-major order and B's columns keep getting evicted from cache. Visit them group by group and the same B columns are reused several times.

FIG 4A · row-major · B columns evicted every row · naive

Tile 0→1→2→… same row of A, different column of B. If B doesn't fit in L2, it gets swept out.

FIG 4B · grouped · GROUP_SIZE_M=4 · swizzled

Tiles 0→1→2→3 reuse the same B column four times. L2 efficiency ↑.

SAM

You can write this in CUDA too. Why say CUDA doesn't have it?

JENSEN

You can, but nobody does. Three reasons — ① In CUDA, blockIdx.x is just linear hardware order. You have to write the math by hand at the top of the kernel. ② That math is painful to read. ③ Change GROUP_SIZE_M and you recompile. In Triton it's an autotune parameter and a standard idiom.
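The "math at the top of the kernel" looks roughly like this. The sketch follows the grouped-ordering idiom from the Triton matmul tutorial, run here as plain Python; whether matmul.py uses exactly this form is an assumption. The 32×32 tile grid and the 16-program window are illustrative choices, not measurements.

```python
def row_major_pid(pid, num_pid_n):
    """Naive order: walk one row of C tiles, then the next."""
    return pid // num_pid_n, pid % num_pid_n


def grouped_pid(pid, num_pid_m, num_pid_n, group_size_m):
    """Grouped order: walk down group_size_m rows before moving one column over."""
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    rows_in_group = min(num_pid_m - first_pid_m, group_size_m)  # last group may be short
    pid_m = first_pid_m + (pid % num_pid_in_group) % rows_in_group
    pid_n = (pid % num_pid_in_group) // rows_in_group
    return pid_m, pid_n


def strips_touched(tiles):
    """Distinct A row-strips plus distinct B column-strips a set of tiles reads."""
    rows = {m for m, _ in tiles}
    cols = {n for _, n in tiles}
    return len(rows) + len(cols)


num_pid_m = num_pid_n = 32     # e.g. 4096 / 128, an assumed tile size
window = range(16)             # 16 programs assumed to be in flight together

first_four = [grouped_pid(p, num_pid_m, num_pid_n, 4) for p in range(4)]
row_cost = strips_touched([row_major_pid(p, num_pid_n) for p in window])
grouped_cost = strips_touched([grouped_pid(p, num_pid_m, num_pid_n, 4) for p in window])

print(first_four)              # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(row_cost, grouped_cost)  # 17 8
```

The first four programs all land on B column 0, matching FIG 4B, and the same 16 output tiles need 8 input strips instead of 17. Because `group_size_m` is just an integer argument here, sweeping it is a matter of trying different values.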

FIG 4C · measured · 4096³ matmul · L4 sm_89

variant	TFLOPS	note
our CUDA v3 (FMA only)	3.9	register blocking
torch.matmul (cuBLAS + TF32)	25.8	years of NVIDIA tuning
Triton fp32	28.9	cuBLAS + 12%
our CUDA v4 (WMMA fp16)	18.5	hand-written mma
cuBLAS fp16	51.8	—
Triton fp16	54.0	cuBLAS + 4% · 40 lines
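For readers who want to turn the TFLOPS column back into wall-clock time, here is the arithmetic as a sketch. It assumes the usual 2·N³ operation count for a square matmul (one multiply and one add per inner-product term); the episode does not state which convention was used.

```python
n = 4096
flop = 2 * n**3  # assumed convention: multiply + add per inner-product term


def time_ms(tflops):
    """Wall-clock time implied by a TFLOPS figure for this matmul size."""
    return flop / (tflops * 1e12) * 1e3


for name, tflops in [("cuBLAS fp16", 51.8), ("Triton fp16", 54.0)]:
    print(f"{name}: {time_ms(tflops):.3f} ms")

# The percentages in the note column follow from the TFLOPS ratios
gain_fp32 = (28.9 / 25.8 - 1) * 100
gain_fp16 = (54.0 / 51.8 - 1) * 100
print(f"fp32: +{gain_fp32:.0f}%  fp16: +{gain_fp16:.0f}%")
```

At this size the fp16 gap between Triton and cuBLAS is about a tenth of a millisecond per call.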

Twenty years of cuBLAS loses to 40 lines of Python.
Autotune explores the config space better than a human. A textbook case of measurement beating theory.— Jensen, 23:58

↳ Lesson 04 · Matmul
↳ Lesson 08 · Triton port

05 · 26:10 · Flash Attention in 40 lines

Half the code. And 6.1× faster.

FIG 5A · line count · kernel body only

acc update · online-softmax rescale + P@V in one line

acc = acc * alpha[:, None] + tl.dot(p.to(v.dtype), v)

In CUDA this logic stretches over 30+ lines. Complexity collapses when the abstraction is at the right height.
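What that one line is doing can be spelled out for a single query row in plain Python. This is a sketch of the online-softmax recurrence, not the episode's kernel: the names `m_i`, `l_i` and `alpha` mirror common Flash Attention write-ups and are assumptions here, as are the input values.

```python
import math


def attention_row_reference(scores, values):
    """Plain softmax(scores) @ values for one query row."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    denom = sum(p)
    d = len(values[0])
    return [sum(p[j] * values[j][k] for j in range(len(p))) / denom for k in range(d)]


def attention_row_online(scores, values, block_n):
    """Same result, but K/V are visited one block at a time."""
    d = len(values[0])
    m_i = -math.inf   # running max of the scores seen so far
    l_i = 0.0         # running softmax denominator
    acc = [0.0] * d   # running un-normalised output
    for start in range(0, len(scores), block_n):
        s_blk = scores[start:start + block_n]
        v_blk = values[start:start + block_n]
        m_new = max(m_i, max(s_blk))
        alpha = math.exp(m_i - m_new)          # rescales everything accumulated so far
        p = [math.exp(s - m_new) for s in s_blk]
        l_i = l_i * alpha + sum(p)
        # The one-liner: acc = acc * alpha + p @ v
        acc = [acc[k] * alpha + sum(p[j] * v_blk[j][k] for j in range(len(p)))
               for k in range(d)]
        m_i = m_new
    return [a / l_i for a in acc]


scores = [3.0 * math.sin(0.7 * j) for j in range(10)]
values = [[float(j), -0.5 * j] for j in range(10)]

ref = attention_row_reference(scores, values)
online = attention_row_online(scores, values, block_n=4)
max_err = max(abs(a - b) for a, b in zip(ref, online))
print(max_err < 1e-9)  # True
```

In the Triton line, `alpha[:, None]` applies this same rescale to a whole block of query rows at once, and `tl.dot` replaces the inner sum over `j`.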

FIG 5B · perf · N=8192 (seq) · single head · L4

impl	time (ms)	speedup
CUDA FA v1 (fp32)	3.045	1.00×
Triton FA (fp16)	0.496	6.14×

Where does the 6× come from

① tl.dot uses Tensor Cores (our CUDA v1 is fp32 FMA). ② Autotune sweeps 6 configs of (BLOCK_M, BLOCK_N, num_warps, num_stages) — sweeping that by hand in CUDA means 6 recompiles. ③ tl.trans, 2-D pointer broadcasts, swizzled smem layouts — all automatic.

↳ Lesson 06 · CUDA FA
↳ Lesson 08 · Triton port

06 · 31:35 · The magic of one constexpr

Causal mask isn't about "filling" — it's about "removing from the loop."

Filling with -inf still computes the full QKᵀ. FA-v2's real win comes from dropping the K tiles that lie entirely in the masked upper triangle from the iteration itself.

FIG 6A · causal loop · different end_n per pid_m · half the tiles skipped

constexpr · two kernels JIT-compiled · no runtime branch

def flash_attention_mha_fwd_kernel(...,
    IS_CAUSAL: tl.constexpr,     # ← key
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr):
    if IS_CAUSAL:
        end_n = tl.minimum(N, (pid_m+1) * BLOCK_M)
    else:
        end_n = N

→ Compile separate kernels for IS_CAUSAL=True and =False. The if doesn't exist at runtime. Equivalent to C++ template specialization.
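The effect of the `end_n` bound on the loop can be counted directly. The sketch below reuses the same formula in plain Python; the sequence length and tile sizes are illustrative assumptions, not the autotuned values from the episode.

```python
def cdiv(a, b):
    """Ceiling division."""
    return (a + b - 1) // b


def k_tiles_visited(n, block_m, block_n, is_causal):
    """Total K tiles iterated over, summed across all query-row programs."""
    total = 0
    for pid_m in range(cdiv(n, block_m)):
        # Same bound as in the kernel above
        end_n = min(n, (pid_m + 1) * block_m) if is_causal else n
        total += cdiv(end_n, block_n)
    return total


full = k_tiles_visited(2048, 64, 64, is_causal=False)
causal = k_tiles_visited(2048, 64, 64, is_causal=True)
print(full, causal, causal / full)  # 1024 528 0.515625
```

The causal loop visits a little over half the tiles, because the diagonal tiles are still needed; inside those the element-wise mask still has to be applied.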

FIG 6B · our Triton vs SDPA (cuDNN FA-2) · LLaMA-7B shape

(B, H, N, d)	ours	SDPA	ratio (SDPA ÷ ours)
(1, 32, 2048, 128)	0.784	0.613	0.78×
(1, 32, 4096, 128)	2.964	2.559	0.86×
(16, 12, 512, 64)	0.249	0.282	1.13×

One 268-line file at 78–90% of cuDNN. What's missing — async copy, persistent kernel, warp specialization — all still experimental in Triton.

↳ Lesson 09 · MHA causal FA

07 · 37:20 · Climbing up to torch.ops.*

70 lines. The last glue that turns a Triton kernel into a component.

flash_attention_mha_op.py · 70 lines

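import torch
from torch.library import custom_op

@custom_op(
    "triton_training::flash_attention_mha",
    mutates_args=(),
    device_types="cuda",
)
def flash_attention_mha_op(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                           is_causal: bool = False) -> torch.Tensor:
    # custom_op infers the op schema from these type annotations
    return triton_flash_attention_mha(q, k, v, is_causal=is_causal)

@flash_attention_mha_op.register_fake
def _fake(q, k, v, is_causal=False):
    return torch.empty_like(q)   # ← shape decl for Dynamo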

Why register_fake matters

When torch.compile traces a model, it uses FakeTensors (shape, dtype, device — no data). Our Triton kernel can't run on those. Instead, we declare "this op's output shape is this" → Dynamo doesn't break the graph. Without it, fullgraph=True fails.
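The idea can be shown with a toy model that has no PyTorch in it. Everything below is hypothetical and written for this example: the `FakeTensor` dataclass, the `FAKE_IMPLS` registry, and the local `register_fake` and `trace` functions only mimic the roles described above and are not how PyTorch implements them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Toy stand-in: metadata only, no data to run a kernel on."""
    shape: tuple
    dtype: str
    device: str


FAKE_IMPLS = {}  # hypothetical registry: op name -> metadata-only function


def register_fake(name):
    def wrap(fn):
        FAKE_IMPLS[name] = fn
        return fn
    return wrap


@register_fake("flash_attention_mha")
def _fake(q, k, v, is_causal=False):
    # Same role as torch.empty_like(q): output looks like q
    return FakeTensor(q.shape, q.dtype, q.device)


def trace(op_name, *args, **kwargs):
    """Toy tracer: it can only keep going if the op declares its output."""
    if op_name not in FAKE_IMPLS:
        raise RuntimeError(f"graph break: no fake impl for {op_name}")
    return FAKE_IMPLS[op_name](*args, **kwargs)


q = FakeTensor((1, 32, 2048, 128), "float16", "cuda")
out = trace("flash_attention_mha", q, q, q, is_causal=True)
print(out.shape)  # (1, 32, 2048, 128)
```

Tracing an op that has no registered entry raises in this toy, which is the analogue of the graph break described above.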

FIG 7 · what these 70 lines unlock · vLLM pattern

SAM

So what Brian built is essentially a vLLM-style production op.

JENSEN

Nearly. What's missing is backward (autograd) and GQA support. Both have clear designs — next-lesson material. Bolt those two on and the next goal is porting vLLM's PagedAttention.

↳ Lesson 07 · custom_op
↳ Lesson 09 · MHA

What Triton hides · what it exposes

Not a "high-level DSL." A language that sits at exactly the right abstraction height. Five things sink below the compiler; five stay above it, in your hands.

HIDES · under the compiler

Thread-level parallelism — threadIdx.x is gone
Warp shuffle, smem tree reduction — tl.sum
Tensor Core instruction selection — tl.dot picks by dtype
Smem layout swizzling — bank conflicts dodged automatically
Launch config tuning — delegated to autotune

EXPOSES · still in your hands

Block size, grid structure — program_id, grid=lambda meta
Memory-hierarchy awareness — HBM patterns of tl.load(mask=...)
Compile-time vs runtime boundary — tl.constexpr
Autotune key design — too broad explodes; too narrow misses
Numerical behavior — online softmax, fp16 vs fp32 accumulator

CUDA is assembly, Triton is C, PyTorch is Python.
Write most of it in C; drop to assembly only for hot paths.— Sam & Jensen, 41:02

Four reasons you still need CUDA

When Triton hits a wall — new mma (Blackwell FP4), persistent kernels, async copy fine control — still CUTLASS/CUDA.
To debug, you need to read the PTX Triton emits — *.ptx under TRITON_CACHE_DIR.
vLLM, FlashAttention-3, Mamba kernels are still CUDA-based. Reading them needs CUDA as first language.
Chasing "why is this Triton slow" lands on bank conflict, register spill, occupancy. Answers live in CUDA concepts.

Code ref · triton_kernels/ all 7 files · L4 sm_89, CUDA 13.0, PyTorch 2.11, Triton 3.6.
