GPU Mode · Lecture Archive
104 lectures · CUDA · Triton · CUTLASS · Compilers
L001 – L025
L001 · How to profile CUDA kernels in PyTorch · Mark Saroufim · High
L002 · PMPP book, Ch. 1–3 · Andreas Koepf · High
L003 · Getting Started With CUDA for Python Programmers · Jeremy Howard · High
L004 · Compute and Memory Basics · Thomas Viehmann · High
L005 · Going Further with CUDA for Python Programmers · Jeremy Howard · High
L006 · Optimizing Optimizers · Jane Xu · High
L007 · Advanced Quantization · Charles Hernandez · High
L008 · CUDA Performance Checklist · Mark Saroufim · High
L009 · Reductions · Mark Saroufim · High
L010 · Build a Production-Ready CUDA Library · Oscar Amoros Huguet · High
L011 · Sparsity · Jesse Cai · High
L012 · Flash Attention · Thomas Viehmann · High
L013 · Ring Attention · Andreas Koepf · High
L014 · Practitioner's Guide to Triton · Umer Adil · High
L015 · CUTLASS · Eric Auld · High
L016 · Hands-On Profiling · Taylor Robie · High
L017 · NCCL · Dan Johnson · High
L018 · Fusing Kernels · Kapil Sharma · High
L019 · Data Processing on GPUs · Devavret Makkar · High
L020 · Scan Algorithm · Izzat El Hajj · High
L021 · Scan Algorithm, Part 2 · Izzat El Hajj · High
L022 · Hacker's Guide to Speculative Decoding in vLLM · Cade Daniel · High
L023 · Tensor Cores · Vijay Thakkar & Pradeep Ramani · High
L024 · Scan at the Speed of Light · Jake Hemstad & Georgii Evtushenko · High
L025 · Speaking Composable Kernel (CK) · Haocong Wang · High
L026 – L050
L026 · SYCL Mode (Intel GPU) · Patric Zhao · High
L027 · gpu.cpp: Portable GPU compute using WebGPU · Austin Huang · High
L028 · Liger Kernel: Efficient Triton Kernels for LLM Training · Byron Hsu · High
L029 · Triton Internals · Kapil Sharma · High
L030 · Quantized Training · Thien Tran · High
L031 · Beginner's Guide to Metal · Nikita Shulga · High
L032 · Unsloth · Daniel Han · High
L033 · BitBLAS · Wang Lei · High
L034 · Low Bit Triton Kernels · Hicham Badri · High
L035 · SGLang · Yineng Zhang · High
L036 · CUTLASS and Flash Attention 3 · Jay Shah · High
L037 · Introduction to SASS & GPU Microarchitecture · Arun Demeure · High
L038 · Low Bit ARM Kernels · Scott Roy · High
L039 · Torchtitan · Mark Saroufim & Tianyu Liu · High
L040 · CUDA Docs for Humans · Charles Frye · High
L041 · FlashInfer · Zihao Ye · High
L042 · Mosaic GPU · Adam Paszke · High
L043 · int8 Tensor Core Matmul for Turing · Erik Schultheis · High
L044 · NVIDIA Profiling · — · Low
L045 · Outperforming cuBLAS on H100 · pranjalssh · Low
L046 · Distributed GEMM · Ali Hassani · High
L047 · KernelBot: Benchmark GPU Kernels on Discord · — · High
L048 · The Ultra-Scale Playbook · Nouamane Tazi · Low
L049 · Low Bit Metal Kernels · Manuel Candales · High
L050 · A Learning Journey: CUDA, Triton, Flash Attention · Umar Jamil · High
L051 – L075
L051 · Consumer GPU Performance · Jake Cannell · Low
L052 · Scaling Laws for Low Precision · Tanishq Kumar · Low
L053 · torch.compile Q&A · Richard Zou · Low
L054 · Small RL Models at the Speed of Light with LeanRL · — · Low
L055 · Modular's Unified Device Accelerator Language · — · Low
L056 · Kernel Benchmarking Tales · Georgii Evtushenko · High
L057 · CuTe · Cris Cecka · High
L058 · Disaggregated LLM Inference · Junda Chen · Low
L059 · FastVideo · — · Low
L060 · Optimizing Linear Attention · Songlin Yang · Low
L061 · d-Matrix Corsair · — · High
L062 · Exo 2: Growing a Scheduling Language · Yuka Ikarashi · High
L063 · Search-Based Deep Learning Compilers · Joe Fioti · Low
L064 · Multi-GPU Programming · Markus Hrywniak · Low
L065 · Neighborhood Attention · Ali Hassani · Low
L066 · Game Arena · Lanxiang Hu · Low
L067 · NCCL and NVSHMEM · Jeff Hammond · High
L068 · Landscape of GPU-Centric Communication · Didem Unat · Low
L069 · Quartet: 4-bit Training · Roberto Castro & Andrei Panferov · Low
L070 · PCCL: Fault-Tolerant Collectives · mike64t · High
L071 · [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use · Sewon Min · High
L072 · [ScaleML Series] Efficient & Effective Long-Context Modeling for Large Language Models · Guangxuan Xiao · High
L073 · [ScaleML Series] Quantization in Large Models · — · High
L074 · [ScaleML Series] Positional Encodings and PaTH Attention · Songlin Yang · High
L075 · [ScaleML Series] GPU Programming Fundamentals + ThunderKittens · William Brandon & Simran Arora · High
L076 – L104
L076 · BackendBench: Fixing the LLM Kernel Correctness Problem · Mark Saroufim · High
L077 · Domain-Specific Languages for GPU Kernels · Tri Dao · High
L078 · Iris: Multi-GPU Programming in Triton · Muhammad Awad, Muhammad Osama & Brandon Potter · High
L079 · Mirage (MPK): Compiling LLMs into Mega Kernels · Mengdi Wu & Xinhao Cheng · High
L080 · How FlashAttention 4 Works · Charles Frye · High
L081 · High-Performance Purely Functional Data-Parallel Array Programming · Troels Henriksen · Low
L082 · Helion: A High-Level DSL for ML Kernels · Jason Ansel, Oguz Ulgen & Will Feng · High
L083 · Formalized Kernel Derivation · — · High
L084 · Numerics and AI · Paulius Micikevicius · Low
L085 · Factorio Learning Environment · Jack Hopkins · Low
L086 · Getting Started with CuTe DSL · Vicki Wang · High
L087 · Low-Latency Communication Kernels with NVSHMEM · Prajwal Singhania · High
L088 · TinyTPU · William Zhang · Low
L089 · cuTile (from friends at NVIDIA) · Mehdi Amini & Jared Roesch · Low
L090 · Building Resilient ML Engineering Skills · Stas Bekman · Low
L091 · Mega Lecture 91: Reinforcement Learning, Agents & OpenEnv · — · Low
L092 · Smol Training Playbook · Loubna Ben Allal · Low
L093 · Cornserve: Easy, Fast and Scalable Multimodal AI · Jeff Ma · Low
L094 · tvm-ffi · Tianqi Chen · Low
L095 · Single-Controller Programming with Monarch · Allen Wang & Colin Taylor · Low
L096 · TLX · — · High
L097 · HipKittens · William Hu · Low
L098 · GPU Observability · Yusheng (郑昱笙) Zheng · Low
L099 · Distributed ML on Consumer Devices · Matt Beton · Low
L100 · InferenceX: Continuous OSS Inference Benchmarking · Kimbo Chen, Cam Quilici & Bryan Shan · High
L101 · Learning CUTLASS the Hard Way · Kapil Sharma · High
L102 · Quartet v2 · Andrei Panferov & Erik Schultheis · Low
L103 · Fundamentals of CuTe Layout Algebra and Category-Theoretic Interpretation · Jack Carlisle & Jay Shah · High
L104 · Gluon and Linear Layouts · Peter Bell, Mario Lezcano & Keren Zhou · Low