gpumode · Lecture Archive
GPU Mode Archive · 104 lectures · CUDA · Triton · CUTLASS · Compilers
L001 – L025
L001 · How to profile CUDA kernels in PyTorch · Mark Saroufim · High
L002 · Ch1-3 PMPP book · Andreas Koepf · High
L003 · Getting Started With CUDA for Python Programmers · Jeremy Howard · High
L004 · Compute and Memory Basics · Thomas Viehmann · High
L005 · Going Further with CUDA for Python Programmers · Jeremy Howard · High
L006 · Optimizing Optimizers · Jane Xu · High
L007 · Advanced Quantization · Charles Hernandez · High
L008 · CUDA Performance Checklist · Mark Saroufim · High
L009 · Reductions · Mark Saroufim · High
L010 · Build a Prod Ready CUDA library · Oscar Amoros Huguet · High
L011 · Sparsity · Jesse Cai · High
L012 · Flash Attention · Thomas Viehmann · High
L013 · Ring Attention · Andreas Koepf · High
L014 · Practitioners Guide to Triton · Umer Adil · High
L015 · CUTLASS · Eric Auld · High
L016 · On Hands Profiling · Taylor Robbie · High
L017 · NCCL · Dan Johnson · High
L018 · Fusing Kernels · Kapil Sharma · High
L019 · Data Processing on GPUs · Devavret Makkar · High
L020 · Scan Algorithm · Izzat El Haj · High
L021 · Scan Algorithm Part 2 · Izzat El Haj · High
L022 · Hacker's Guide to Speculative Decoding in vLLM · Cade Daniel · High
L023 · Tensor Cores · Vijay Thakkar & Pradeep Ramani · High
L024 · Scan at the Speed of Light · Jake Hemstad & Georgii Evtushenko · High
L025 · Speaking Composable Kernel (CK) · Haocong Wang · High
L026 – L050
L026 · SYCL Mode (Intel GPU) · Patric Zhao · High
L027 · gpu.cpp - Portable GPU compute using WebGPU · Austin Huang · High
L028 · Liger Kernel - Efficient Triton Kernels for LLM Training · Byron Hsu · High
L029 · Triton Internals · Kapil Sharma · High
L030 · Quantized Training · Thien Tran · High
L031 · Beginners Guide to Metal · Nikita Shulga · High
L032 · Unsloth · Daniel Han · High
L033 · Bitblas · Wang Lei · High
L034 · Low Bit Triton Kernels · Hicham Badri · High
L035 · SGLang · Yineng Zhang · High
L036 · CUTLASS and Flash Attention 3 · Jay Shah · High
L037 · Introduction to SASS & GPU Microarchitecture · Arun Demeure · High
L038 · Low Bit ARM kernels · Scott Roy · High
L039 · Torchtitan · Mark Saroufim and Tianyu Liu · High
L040 · FlashInfer · Zihao Ye · High
L041 · CUDA Docs for Humans · Charles Frye · High
L042 · Mosaic GPU · Adam Paszke · High
L043 · int8 tensorcore matmul for Turing · Erik Schultheis · High
L044 · NVIDIA Profiling · – · Low
L045 · Outperforming cuBLAS on H100 · pranjalssh · Low
L046 · Distributed GEMM · Ali Hassani · High
L047 · KernelBot: Benchmark GPU Kernels on Discord · – · High
L048 · The Ultra Scale Playbook · Nouamane Tazi · Low
L049 · Low Bit Metal Kernels · Manuel Candales · High
L050 · A learning journey: CUDA, Triton, Flash Attention · Umar Jamil · High
L051 – L075
L051 · Consumer GPU performance · Jake Cannell · Low
L052 · Scaling Laws for Low Precision · Tanishq Kumar · Low
L053 · torch.compile Q&A · Richard Zou · Low
L054 · Small RL Models at the Speed of Light with LeanRL · – · Low
L055 · Modular’s unified device accelerator language · – · Low
L056 · Kernel Benchmarking Tales · Georgii Evtushenko · High
L057 · CuTe · Cris Cecka · High
L058 · Disaggregated LLM Inference · Junda Chen · Low
L059 · FastVideo · – · Low
L060 · Optimizing Linear Attention · Songlin Yang · Low
L061 · D-Matrix Corsair · – · High
L062 · Exo 2: Growing a scheduling language · Yuka Ikarashi · High
L063 · Search-Based Deep Learning Compilers · Joe Fioti · Low
L064 · Multi-GPU programming · Markus Hrywniak · Low
L065 · Neighborhood Attention · Ali Hassani · Low
L066 · Game Arena · Lanxiang Hu · Low
L067 · NCCL and NVSHMEM · Jeff Hammond · High
L068 · Landscape of GPU Centric communication · Didem Unat · Low
L069 · Quartet: 4-bit training · Roberto Castro and Andrei Panferov · Low
L070 · PCCL: Fault tolerant collectives · mike64t · High
L071 · [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use · Sewon Min · High
L072 · [ScaleML Series] Efficient & Effective Long-Context Modeling for Large Language Models · Guangxuan Xiao · High
L073 · [ScaleML Series] Quantization in Large Models · – · High
L074 · [ScaleML Series] Positional Encodings and PaTH Attention · Songlin Yang · High
L075 · [ScaleML Series] GPU Programming Fundamentals + ThunderKittens · William Brandon; Simran Arora · High
L076 – L104
L076 · BackendBench: fixing the LLM kernel correctness problem · Mark Saroufim · High
L077 · Domain Specific Languages for GPU Kernels · Tri Dao · High
L078 · Iris: Multi-GPU Programming in Triton · Muhammad Awad, Muhammad Osama & Brandon Potter · High
L079 · Mirage (MPK): Compiling LLMs into Mega Kernels · Mengdi Wu, Xinhao Cheng · High
L080 · How FlashAttention 4 Works · Charles Frye · High
L081 · High-performance purely functional data-parallel array programming · Troels Henriksen · Low
L082 · Helion: A high-level DSL for ML kernels · Jason Ansel, Oguz Ulgen, Will Feng · High
L083 · Formalized Kernel Derivation · – · High
L084 · Numerics and AI · Paulius Micikevicius · Low
L085 · Factorio Learning Environment · Jack Hopkins · Low
L086 · Getting Started with CuTe DSL · Vicki Wang · High
L087 · Low Latency Communication Kernels with NVSHMEM · Prajwal Singhania · High
L088 · TinyTPU · William Zhang · Low
L089 · cuTile (from friends at NVIDIA) · Mehdi Amini, Jared Roesch · Low
L090 · Building resilient ML Engineering skills · Stas Bekman · Low
L091 · Mega Lecture 91: Reinforcement Learning, Agents & OpenEnv · – · Low
L092 · Smol Training Playbook · Loubna Ben Allal · Low
L093 · Cornserve: Easy, Fast and Scalable Multimodal AI · Jeff Ma · Low
L094 · tvm-ffi · Tianqi Chen · Low
L095 · Single controller programming with Monarch · Allen Wang and Colin Taylor · Low
L096 · TLX · – · High
L097 · HipKittens · William Hu · Low
L098 · GPU Observability · Yusheng (郑昱笙) Zheng · Low
L099 · Distributed ML on consumer devices · Matt Beton · Low
L100 · InferenceX: Continuous OSS Inference Benchmarking · Kimbo Chen, Cam Quilici, Bryan Shan · High
L101 · Learning CUTLASS the hard way · Kapil Sharma · High
L102 · quartet v2 · Andrei Panferov and Erik Schultheis · Low
L103 · Fundamentals of CuTe Layout Algebra and Category-theoretic Interpretation · Jack Carlisle and Jay Shah · High
L104 · Gluon and Linear Layouts · Peter Bell, Mario Lezcano, Keren Zhou · Low