《GPU Mode》 L031 2024 · OCT · 12 High priority transcript · available

Beginners Guide to Metal

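How to write and debug ML code on a MacBook's GPU. These are study notes on a talk by Nikita Shulga, given from the viewpoint of the person who built PyTorch's MPS backend. They cover the Metal compute pipeline, what the name MPS (Metal Performance Shaders) actually refers to, what unified memory means, and the GPU model of Apple Silicon. The talk explicitly points out where each concept maps onto CUDA and is candid about how far the debugging tools go.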

Metal MPS backend Apple Silicon unified memory Metal Performance Shaders PyTorch MPS JIT compile no profiler

Speaker

Nikita Shulga

Meta · PyTorch core · primary author of the MPS backend

Lecture number

L031

Study priority

High · read closely

When revisiting

Write a Metal shader by hand

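Contents · 12 sections

01 The problem the lecture tackles · from a CUDA-only world
02 Apple Silicon's GPU model · the units of M1 · M2 · M3
03 Metal compute pipeline · device · queue · shader
04 PyTorch MPS backend · the path inside device="mps"
05 Unified memory architecture · CPU and GPU on the same RAM
06 Debugging tools · what is missing · what exists
07 Comparison with CUDA · vocabulary mapping table
08 LLM inference case studies · llama.cpp · MLX
09 Limits and the future · the Tensor Core equivalent
10 Notes and code to remember · key takeaways
11 Paths to other lectures · connections
12 Open questions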

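§ 01 The problem the lecture tackles · from a CUDA-only world

“The GPU most students own is not an NVIDIA one”: making the MacBook GPU useful

The lecture starts from the observation that the Apple Silicon GPU is actually quite powerful (the M2 Max GPU has a theoretical 13.6 TFLOPS of FP32), yet most ML tutorials never cover it. Nikita's stated position is that this is “part of the path toward using GPUs in a vendor-agnostic way.”

The two questions the lecture sets out to answer:

How do you use the Apple GPU from PyTorch: what device="mps" is and which path it takes.
What do you do when you have to write a kernel yourself on top of that: how to plug a Metal shader into PyTorch as an extension, and where debugging is possible.

The framing is honest: “I'm not an advanced engineer either, I'm just someone who writes code” (Nikita's self-introduction). The tone is therefore a candid reconstruction from a beginner's point of view. Places that map onto CUDA are called out explicitly, and gaps are never hidden, as in “there is nothing like cudaProfiler here.”

The lecture's framing

The appeal of Apple Silicon is its unified memory architecture: the CPU and GPU share the same RAM. 64 GB or 128 GB of unified memory fills roughly the role of an 80 GB NVIDIA H100 at a far lower price. The lecture's big message is that “the first route for a student or researcher who can't own a GPU with 80 GB of VRAM to run LLM inference themselves” is on Apple hardware.

“The theoretical throughput of the Apple Silicon GPU is by no means small. But there are no specialized matrix instructions like Tensor Cores, so you can't compare it directly with NVIDIA running hand-written mma. The memory model is the other way around: unified memory is a big strength.” (Nikita Shulga · 09:11)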

§ 02 Apple Silicon's GPU model · the units of M1 · M2 · M3

The words that take the place of “NVIDIA SM”: `GPU core`, `thread group`, `SIMD group`

GPUs from M1 through M3 are broken down into units differently from NVIDIA. In the mapping Nikita laid out in the lecture, the names differ but each unit does the same job in the same place.

UNIT | CUDA / NVIDIA | METAL / Apple
Execution unit | Streaming Multiprocessor | GPU core
Warp / sub-warp | warp = 32 threads | SIMD group = 32 threads (typically)
Block | CTA / thread block | thread group
On-chip SRAM | shared memory (per block) | threadgroup memory
Device memory | HBM (dedicated GPU memory) | unified memory (shared with the CPU)
Tensor acceleration | Tensor Core (mma) | none (SIMD-group matrix on M3)
Stream | cudaStream | command queue / command buffer
Kernel launch | kernel<<<...>>> | commandEncoder.dispatch
Shader language | CUDA C++ | Metal Shading Language (MSL)

One level deeper on the M2 / M3 chips:

GPU core count: M2 Pro 19, M2 Max 38, M2 Ultra 76; M3 Max 40. One GPU core is roughly comparable to one partition of an NVIDIA SM.
SIMD group width: typically 32; on M3 it is dynamic for some workloads.
threadgroup memory: typically 32 KB per threadgroup on M-series. It plays the same role as NVIDIA's shared memory.
memory bandwidth: M2 Max 400 GB/s, M2 Ultra 800 GB/s, M3 Max 400 GB/s. The same memory pool is exposed to both the CPU and the GPU.
SIMD-group matrix (M3+): an 8×8 matrix multiply performed by one SIMD group. It sits where NVIDIA mma does, but the tile is smaller.

A striking detail from the lecture

The M2 Ultra's 800 GB/s of memory bandwidth is about half of the A100's 1.5–2 TB/s. But because memory is unified, there is no CPU↔GPU copy cost. For memory-bound workloads (the decode phase of LLM inference, KV cache handling) there is no PCIe round trip, which makes the comparison with NVIDIA subtle.
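
The 32 KB threadgroup memory figure above caps the tile size of any tiled kernel. As a rough sketch in Python (not from the lecture), assume a tiled GEMM keeps two square tiles resident in threadgroup memory and that tile sides are multiples of 8. The function name and both assumptions are illustrative only.

```python
# Assumption: 32 KB of threadgroup memory per threadgroup, the figure quoted in these notes.
THREADGROUP_MEMORY_BYTES = 32 * 1024


def max_square_tile(bytes_per_element: int,
                    tiles_resident: int = 2,
                    budget_bytes: int = THREADGROUP_MEMORY_BYTES,
                    multiple_of: int = 8) -> int:
    """Largest tile side T (a multiple of `multiple_of`) such that
    tiles_resident * T * T * bytes_per_element fits in the budget.
    Returns 0 if even the smallest tile does not fit."""
    best = 0
    t = multiple_of
    while tiles_resident * t * t * bytes_per_element <= budget_bytes:
        best = t
        t += multiple_of
    return best


if __name__ == "__main__":
    print("fp32 tile side:", max_square_tile(4))  # one A tile + one B tile in fp32
    print("fp16 tile side:", max_square_tile(2))  # same, in fp16
```

Under these assumptions an fp32 kernel tops out at 64×64 tiles. Real kernels usually pick smaller tiles to leave room for occupancy and register pressure.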

§ 03 Metal compute pipeline · device · queue · shader

Metal's five objects: the counterparts of CUDA's driver/context/stream/kernel/cudaMalloc

Metal is Apple's native GPU abstraction, bundling graphics and compute into one API. These are the objects Nikita showed one screen at a time, with what happens at each.

FIG · The five objects of Metal compute: device → queue → buffer → command buffer → encoder

L0 · DEVICE: MTLDevice, from MTLCreateSystemDefaultDevice(). The counterpart of a CUDA device handle. On M-series there is usually just one (the integrated GPU). ≡ cudaDevice

L1 · QUEUE: MTLCommandQueue, from device.makeCommandQueue(). A channel for asynchronous work. One device can have several queues. ≡ cudaStream

L2 · BUFFER: MTLBuffer, from device.makeBuffer(length:options:). With storageModeShared the CPU and GPU see the same memory (unified). With storageModePrivate the memory is GPU-only. storageModeManaged is the two-copy scheme from the Intel macOS era (not needed on M-series). ≡ a cudaMalloc variant

L3 · COMMAND BUFFER: MTLCommandBuffer, from queue.makeCommandBuffer(). A batch of dispatches submitted together; several encoders can go into one command buffer. ≡ a single instance of a cudaGraph

L4 · ENCODER: MTLComputeCommandEncoder, from cb.makeComputeCommandEncoder(). setComputePipelineState (the compiled shader), setBuffer (argument binding), dispatchThreadgroups (the launch). ≡ kernel launch

A single CUDA line (kernel<<<grid,block>>>(args)) breaks down in Metal into six steps: create the encoder → set the pipeline → bind buffers → dispatch → encoder.endEncoding → commandBuffer.commit. There is more boilerplate, but every step is explicit.

// Metal compute: vector add in Objective-C++ (abridged)
NSError* err = nil;               // receives shader-compile and pipeline errors
const NSUInteger N = 1 << 20;     // element count; a multiple of 256, so N/256 below is exact
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLCommandQueue> queue = [device newCommandQueue];

// 1. Compile the shader: JIT from a source string (kernel signature and body elided)
NSString* src = @"kernel void add(device float* a, ...) { ... }";
id<MTLLibrary> lib = [device newLibraryWithSource:src options:nil error:&err];
id<MTLFunction> fn  = [lib newFunctionWithName:@"add"];
id<MTLComputePipelineState> pipe = [device newComputePipelineStateWithFunction:fn error:&err];

// 2. Buffer: with unified memory the CPU can fill it directly through bufA.contents
id<MTLBuffer> bufA = [device newBufferWithLength:N*sizeof(float) options:MTLResourceStorageModeShared];

// 3. Encode + dispatch
id<MTLCommandBuffer> cb = [queue commandBuffer];
id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder];
[enc setComputePipelineState:pipe];
[enc setBuffer:bufA offset:0 atIndex:0];
[enc dispatchThreadgroups:MTLSizeMake(N/256,1,1)
    threadsPerThreadgroup:MTLSizeMake(256,1,1)];
[enc endEncoding];
[cb commit];
[cb waitUntilCompleted];          // ≡ cudaDeviceSynchronize
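
The dispatch above sizes the grid as N/256, which silently drops the tail when N is not a multiple of 256. A common remedy, sketched here in Python as plain arithmetic rather than Metal API calls, is ceiling division on the host combined with a bounds check inside the kernel. The helper name threadgroup_count is hypothetical.

```python
def threadgroup_count(n_elements: int, threads_per_group: int = 256) -> int:
    """Number of threadgroups needed to cover n_elements (ceiling division)."""
    if n_elements < 0 or threads_per_group <= 0:
        raise ValueError("n_elements must be >= 0 and threads_per_group > 0")
    return (n_elements + threads_per_group - 1) // threads_per_group


if __name__ == "__main__":
    for n in (1024, 1000, 1):
        truncated = n // 256            # what an integer N/256 gives
        rounded_up = threadgroup_count(n)
        launched = rounded_up * 256     # threads actually launched
        # Threads with index >= n must return early in the kernel (bounds check).
        print(f"N={n}: N/256={truncated}, ceil={rounded_up}, "
              f"extra threads to mask={launched - n}")
```

With the rounded-up grid, the kernel needs an early return for thread indices at or beyond N. Metal also provides dispatchThreads:threadsPerThreadgroup:, which takes the total thread count directly on GPUs that support non-uniform threadgroup sizes.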

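First impressions of MSL

Metal Shading Language is a dialect based on C++14 (a restricted subset plus GPU extensions). Its syntax is very close to CUDA C++. It adds the kernel function qualifier, the device, threadgroup, and thread address-space qualifiers, and built-in attributes such as thread_position_in_grid and threadgroup_position_in_grid. A small message of the lecture is that the learning curve is very short.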

§ 04 PyTorch MPS backend · the path inside device="mps"

One line of PyTorch code: what happens inside it

PyTorch has had an MPS backend since 1.12. The name is subtle: MPS stands for Metal Performance Shaders (Apple's vendor library), but PyTorch's device="mps" mixes direct Metal calls with MPS library calls.

Breaking down the dispatch of PyTorch code line by line:

x = torch.randn(...).to("mps"): an MTLBuffer is created in unified memory (storageModeShared).
y = x @ w: ATen's MPS dispatch key is invoked and the op maps to MPSMatrixMultiplication in the MPS library. User code is unchanged.
y.softmax(dim=-1): for an op like softmax that the vendor library lacks, a Metal shader hand-written by PyTorch is called.
torch.mps.synchronize(): the counterpart of CUDA's synchronize.

In other words, the MPS backend is a hybrid of direct Metal calls and MPS library calls. A striking detail from the lecture: “The MPS library alone can't cover every PyTorch op, so the work of writing Metal shaders by hand for the missing pieces and putting them into PyTorch is ongoing.”

# PyTorch MPS: the user's view
import torch

assert torch.backends.mps.is_available()
device = "mps"

x = torch.randn(8192, 8192, device=device)
y = torch.randn(8192, 8192, device=device)

# MPS backend dispatch: MPSMatrixMultiplication
z = x @ y

torch.mps.synchronize()

# Plugging in your own shader: the MPS flavor of torch.utils.cpp_extension
from torch.utils.cpp_extension import load
ext = load(name="my_mps",
            sources=["my_kernel.mm"],
            extra_cflags=["-fobjc-arc"])

MPSGraph: one level up

MPSGraph is the graph compiler Apple built on top of Metal for ML workloads, with operator-level fusion and lazy execution. Some PyTorch ops dispatch to MPSGraph (MPSGraph.matrixMultiplication). MLX is also described as living on top of this layer.

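§ 05 Unified memory architecture · CPU and GPU on the same RAM

“A GPU without cudaMemcpy”: a fundamental difference in the memory model

This is the biggest difference on Apple Silicon. An NVIDIA GPU has its own VRAM (HBM) and talks to host RAM over PCIe. On M-series, the CPU and GPU share the same LPDDR5 memory. That affects almost every design decision.

Traditional discrete GPU

Host (CPU) RAM ⇄ PCIe ⇄ Device (GPU) VRAM

memcpy H→D / D→H brackets every piece of work. PCIe 4.0 ≈ 32 GB/s, a bottleneck about 30× slower than 1 TB/s of VRAM.

A large model has to fit in VRAM. If it does not, it is swapped at page granularity (slow).

Apple Silicon unified

The CPU, GPU, and Neural Engine sit on the same LPDDR5 pool.

There is no memcpy at all. A storageModeShared buffer is accessed by the CPU and the GPU as the same memory.

Model size limit = all of unified RAM. With an M2 Ultra at 192 GB, in principle a model of up to 192 GB fits in one piece.
Bandwidth = shared through the same memory controller (M2 Ultra 800 GB/s).

How this difference flows into programming:

No memcpy code. The to in x.to("mps") is conceptually closer to changing the buffer's metadata (when the storage is storageModeShared).
Large models become possible. With an M2 Ultra at 192 GB, Llama 70B (BF16) lives entirely in RAM. That is impossible on a single NVIDIA H100 80GB.
Memory bandwidth is shared everywhere. While the CPU is accessing RAM, the GPU can stall briefly.
Cache coherence is expensive. If the GPU reads a buffer right after the CPU wrote it, cache invalidation is needed. storageModeShared is not free.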
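
The "does it fit" claim above is simple arithmetic, sketched below in Python. The helper names are made up for illustration, GB means 10^9 bytes, and only the weights are counted: KV cache, activations, and memory the OS keeps for itself are ignored, so real headroom is smaller.

```python
def weights_gb(n_params: float, bits_per_param: float) -> float:
    """Size of the model weights in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def fits(n_params: float, bits_per_param: float, memory_gb: float) -> bool:
    """True if the weights alone fit in memory_gb (ignores KV cache, activations, OS)."""
    return weights_gb(n_params, bits_per_param) <= memory_gb


if __name__ == "__main__":
    n = 70e9  # Llama 70B, parameter count rounded
    print("BF16 weights (GB):", weights_gb(n, 16))
    print("4-bit weights (GB):", weights_gb(n, 4))
    print("BF16 fits in 192 GB unified memory:", fits(n, 16, 192))
    print("BF16 fits in one 80 GB H100:", fits(n, 16, 80))
```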

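A striking moment from the lecture

In the Q&A someone asked, “Does CPU pinned memory still mean anything under unified memory?” Nikita's answer: “The meaning of pinned, ‘do not swap this page out,’ translates under unified memory into ‘the GPU is looking at this page, so the CPU should not invalidate its cache.’ The intent is the same.”

The practical shock: “a 192 GB M2 Ultra Mac Studio is the smallest form factor that can run inference on a 70B BF16 LLM.” Inference on a ChatGPT-class model becomes possible, at least in part, on your own desk.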

§ 06 Debugging tools · what is missing · what exists

“There is no NCU: what is there in its place?”

This is where Nikita was most candid. In the place of CUDA's ncu, nsys, and compute-sanitizer, what does Metal have and what does it lack?

Metal Frame Capture (Xcode)

Attach from Xcode, then “Capture GPU Frame.” It gives a timeline view similar to CUDA's nsys, showing GPU execution alongside shader code.

Metal Shader Debugger

A shader debugger inside Xcode, showing variable values and register usage. It sits where Triton's interpret mode does. The lecture's frank assessment, though, is that the experience is hard to call good.

os_signpost

A way to place markers in the host code around shader dispatches, instrumented by hand. The markers show up as a timeline in Apple's Instruments app.

Missing: an ncu equivalent

There are no actionable hints like NCU's (occupancy, scoreboard stalls). Tools deep enough to look at memory bandwidth and cache misses directly are thin.

Missing: compute-sanitizer

The tools for detecting memory races and out-of-bounds accesses are not as sophisticated as NVIDIA's compute-sanitizer. A bad memory access can end in silent corruption.

PyTorch profiler

torch.profiler captures MPS activity and gives a combined CPU + MPS timeline. It is the most common entry point.

“Honestly, Apple's GPU debugging tools do not go as deep as NVIDIA's NCU. Frame Capture works well for graphics workloads, but it is not shaped for ML workloads.” (Nikita Shulga · 47:05)

A practical debugging ladder

(1) Logic bug: run the same code on the CPU (in PyTorch, device='mps' and 'cpu' go through the same ATen ops). If the results differ, an MPS backend bug is likely.
(2) Numerical drift: make explicit where bf16 and where fp32 is used.
(3) Perf bug: torch.profiler plus Metal Frame Capture. To go deeper, instrument by hand with os_signpost.
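
Step (1) of the ladder can be sketched as a small parity check. This is an illustrative helper, not code from the lecture: the function name is made up, it compares a single op on one random input, and it returns None when PyTorch or an MPS device is unavailable so that it still runs on other machines.

```python
import importlib.util


def max_abs_diff_cpu_vs_mps(fn, shape=(64, 64), seed=0):
    """Apply fn to the same random input on CPU and on MPS.

    Returns the maximum absolute difference between the two results,
    or None if PyTorch or an MPS device is not available.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not hasattr(torch.backends, "mps") or not torch.backends.mps.is_available():
        return None

    torch.manual_seed(seed)
    x_cpu = torch.randn(*shape)          # reference input on the CPU
    out_cpu = fn(x_cpu)                  # CPU result
    out_mps = fn(x_cpu.to("mps"))        # same op through the MPS backend
    return (out_cpu - out_mps.cpu()).abs().max().item()


if __name__ == "__main__":
    diff = max_abs_diff_cpu_vs_mps(lambda t: t.softmax(dim=-1))
    if diff is None:
        print("skipped: PyTorch or MPS not available")
    else:
        print(f"max abs diff CPU vs MPS: {diff:.3e}")
```

A small nonzero difference is expected from floating-point reordering. What counts as "too large" depends on dtype and op, so any threshold is a judgment call.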

§ 07 Comparison with CUDA · vocabulary mapping table

The table a CUDA user should carry when reading Metal code

CUDA | Metal
__global__ void kernel | kernel void kernel(...): a function qualifier in MSL
__device__ helper | a plain function such as float helper(...); in MSL, device is an address-space qualifier on pointers, not a function qualifier
__shared__ float s[N] | threadgroup float s[N]: per-threadgroup SRAM
__syncthreads() | threadgroup_barrier(mem_flags::mem_threadgroup)
threadIdx.x | thread_position_in_threadgroup attribute
blockIdx.x | threadgroup_position_in_grid
blockDim.x | threads_per_threadgroup
__shfl_xor_sync | simd_shuffle_xor: shuffle within a SIMD group
cudaMalloc | device.makeBuffer(length:options:)
cudaMemcpy | direct pointer access with storageModeShared; blitCommandEncoder.copy with private storage
kernel<<<grid,block>>>() | encoder.dispatchThreadgroups(grid, threadsPerThreadgroup: block)
cudaStream | MTLCommandQueue + MTLCommandBuffer
cudaEvent | MTLEvent: synchronization between queues
cudaDeviceSynchronize | commandBuffer.waitUntilCompleted or torch.mps.synchronize
nvcc | xcrun -sdk macosx metal, or JIT via device.makeLibrary(source:)
cuBLAS / cuDNN | MetalPerformanceShaders / MPSGraph

The lecture's big message: “Keep this table beside you and a CUDA user can read Metal kernels within a week.” Only the words differ; the model is almost the same.
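
One row of the mapping that is worth checking by hand is thread indexing. CUDA code typically computes blockIdx.x * blockDim.x + threadIdx.x, while an MSL kernel can read thread_position_in_grid directly. The Python sketch below simulates a uniform 1D dispatch to show that the two agree. It is a host-side illustration only, and the generator name is hypothetical.

```python
def simulate_1d_dispatch(num_threadgroups: int, threads_per_threadgroup: int):
    """Yield (threadgroup_position_in_grid, thread_position_in_threadgroup,
    thread_position_in_grid) for every thread of a uniform 1D dispatch."""
    for group in range(num_threadgroups):            # ≡ blockIdx.x
        for local in range(threads_per_threadgroup): # ≡ threadIdx.x
            # CUDA-style global index; MSL exposes this value as thread_position_in_grid
            yield group, local, group * threads_per_threadgroup + local


if __name__ == "__main__":
    global_ids = [gid for _, _, gid in simulate_1d_dispatch(4, 256)]
    # Every index in [0, 1024) appears exactly once, in order.
    print(global_ids[:3], "...", global_ids[-1], "count:", len(global_ids))
```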

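§ 08 LLM inference case studies · llama.cpp · MLX

“LLMs already run well on the Apple GPU”: what made that possible

This is the core of the second half: how LLM inference on Apple Silicon concretely works as of 2024. There are two major stacks, llama.cpp (C++) and MLX (Apple's ML framework).

llama.cpp's Metal backend:

All of attention, RMSNorm, RoPE, and GEMM are hand-written Metal shaders.
Quantization formats (Q4_K, Q5_K, stored in GGUF files) are handled inside the Metal shaders. INT4 inference already runs reliably on the Apple GPU.
Llama 7B Q4 runs at roughly 30 tok/s on an M2 Max. Very practical.
The code is small enough to double as good example material for Metal shaders. Nearly everything is in ggml-metal.m (host side) and ggml-metal.metal (the shaders).

MLX: an array library built by Apple itself, a mix of NumPy and PyTorch. It is Metal-native with lazy evaluation. It targets Apple Silicon only and deliberately gives up portability to NVIDIA.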

FIG · LLM inference perf on M2 Max · Llama 7B · tokens/s

Stack | tokens/s
PyTorch MPS BF16 | ~14
PyTorch MPS Q4 (hand-written) | ~22
llama.cpp BF16 | ~24
llama.cpp Q4_K_M | ~38
MLX BF16 | ~30
MLX 4-bit | ~46

M2 Max 64GB. The quantization win dominates: the workload is memory-bound, so in these numbers 4-bit runs about 1.5–1.6× faster than BF16 within each stack. MLX, being Apple-native, is the fastest.
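
Why memory-bound decoding rewards quantization can be seen from a back-of-the-envelope ceiling: each generated token has to stream the weights once, so tokens/s cannot exceed bandwidth divided by weight bytes. The Python sketch below uses that simplification. The function name is illustrative, 7e9 is a rounded parameter count, and quantization scales, KV cache traffic, and dequantization cost are all ignored, so measured numbers need not track it closely.

```python
def decode_ceiling_tok_per_s(n_params: float,
                             bits_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/s if every token streams all weights once.

    Simplification: counts weight bytes only (no KV cache, no activations).
    """
    bytes_per_token = n_params * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    bw = 400  # GB/s, the M2 Max figure quoted in these notes
    for label, bits in (("BF16", 16), ("4-bit", 4)):
        ceiling = decode_ceiling_tok_per_s(7e9, bits, bw)
        print(f"7B {label}: at most about {ceiling:.1f} tok/s")
```

Weight bytes alone would predict a 4× gap between BF16 and 4-bit. A smaller measured gap is consistent with overheads this model leaves out.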

Where unified memory is decisive

Llama 70B BF16 ≈ 140 GB. On a 192 GB M2 Ultra Mac Studio the whole model loads and runs inference on a single machine. Running the same model on NVIDIA takes 2 × H100 80GB (multi-GPU). Apple's unified memory is the cheapest way to “run a 70B model at home.”

“The reason the Apple GPU can do smooth LLM inference is not that the Metal shaders are well written. It is that unified memory lets the whole model fit in memory. It is a memory-bandwidth game.” (Nikita Shulga · 1:18:42)

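§ 09 Limits and the future · the Tensor Core equivalent

Where the Apple GPU is “as fast as” NVIDIA, and where the two diverge

The lecture's honest close: where the Apple GPU shines and where it does not.

Memory-bound workloads

Apple's strength: the decode phase of LLM inference, attention, memory copies, and so on. Unified memory helps a lot.

Compute-bound GEMM

Apple's weakness. With no Tensor Cores (the M3's SIMD-group matrix is a small start), the same fp32/fp16 GEMM is an order of magnitude or more slower. Not a good fit for training.

Large-model inference

Apple's own niche. A 70B model can be loaded on a single laptop. There is no other route.

Multi-GPU training

Apple's limit. There is no high-speed GPU-to-GPU interconnect like NVLink. Multi-Mac training is effectively impossible.

Batch inference

NVIDIA's territory. When large-batch GEMM dominates, Tensor Core throughput is decisive.

Experiments and personal projects

Apple's territory. LLM inference, small fine-tuning runs, and prototypes are all possible on a laptop. The development cycle is fast.

M3 and Apple's next steps

From M3 on: SIMD-group matrix instructions, an 8×8 fp16 matmul performed by one SIMD group. It is the first equivalent of NVIDIA mma. The tile is small (NVIDIA Ampere mma is 16×8×16), but it is a meaningful start. The lecture speculates that “larger matrix units may arrive in M4/M5.”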
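
The tile-size comparison above can be made concrete by counting multiply-accumulates (MACs) per tile operation. The Python sketch below assumes the 8×8 SIMD-group operation multiplies two 8×8 tiles (m = n = k = 8) and compares it with the 16×8×16 Ampere mma shape quoted in the text. It counts operations only and says nothing about how many cycles each one takes. Function names are illustrative.

```python
from math import ceil


def macs_per_tile_op(m: int, n: int, k: int) -> int:
    """Multiply-accumulates performed by one m×n×k tile operation."""
    return m * n * k


def tile_ops_for_gemm(M: int, N: int, K: int, m: int, n: int, k: int) -> int:
    """Tile operations needed to cover an M×N×K GEMM with m×n×k tiles."""
    return ceil(M / m) * ceil(N / n) * ceil(K / k)


if __name__ == "__main__":
    shapes = {
        "SIMD-group matrix 8x8x8 (assumed shape)": (8, 8, 8),
        "Ampere mma 16x8x16": (16, 8, 16),
    }
    for name, shape in shapes.items():
        print(name,
              "| MACs per op:", macs_per_tile_op(*shape),
              "| ops for a 4096^3 GEMM:", tile_ops_for_gemm(4096, 4096, 4096, *shape))
```

Under these assumptions one Ampere mma covers four times the work of one 8×8×8 operation, so the same GEMM needs four times as many tile operations on the smaller tile.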

§ 10 Notes and code to remember · key takeaways

What should be at your fingertips within five minutes of reopening these notes

Apple GPU units

GPU core (≡ SM) · SIMD group = 32 (≡ warp) · thread group (≡ block) · threadgroup memory (≡ shared).

Metal's five objects

device → queue → buffer → command buffer → encoder. A 1:1 mapping onto CUDA's five objects.

MSL syntax

Based on C++14. kernel qualifier; device and threadgroup address spaces; thread_position_in_grid built-in.

PyTorch MPS

device="mps". ATen's MPS dispatch key. A hybrid of the MPS library and hand-written Metal shaders.

unified memory

CPU and GPU share the same RAM. No memcpy at all. Large models can be loaded on a laptop.

storage mode

Shared (CPU + GPU access), Private (GPU only), Managed (not needed on M-series).

Debugging tools

Xcode Frame Capture, Metal Shader Debugger, os_signpost, torch.profiler. No NCU equivalent.

Real-world examples

llama.cpp Metal backend (Q4_K), MLX (Apple-native). Both use quantization to tilt the memory-bandwidth game in their favor.

YouTube youtube.com/watch?v=cGtiaJjLkAI

Slides Beginner's guide to Metal kernels.pdf

Code gemm_perf_studies.mm

Apple docs Metal · MPS · MSL spec PDF

Real-world stacks llama.cpp · MLX

Hands-on: a practice sequence

First PyTorch MPS code: run LLaMA-2 7B inference on a Mac with device="mps" and measure tokens/s.
Measure the effect of storage mode: allocate the same buffer as Shared vs Private and observe the cache-coherence cost directly in a small sync-heavy workload.
Write a vector-add Metal shader by hand, in the style of gemm_perf_studies.mm: kernel void add(...), JIT compilation, dispatch.
Write a tiled GEMM, following gemm_perf_studies.mm from the lecture repo: naive → threadgroup-tiled → SIMD-group matrix (M3+).
Read the llama.cpp Metal code: go through ggml-metal.m and its shader file ggml-metal.metal closely, and see how Q4_K dequantization and GEMM are combined in one shader.
Same model in MLX: run the same LLaMA-2 7B in MLX and compare throughput against PyTorch MPS.
Timeline with Frame Capture: use Xcode's “Capture GPU Frame” together with the PyTorch profiler to see how time is distributed within one step.
os_signpost instrumentation: put markers around the key dispatches of your own shader and view them on the Instruments timeline.

§ 11 Paths to other lectures · connections

How the tools from this lecture reappear elsewhere in the series

L026

SYCL Mode (Intel GPU)

Patric Zhao: another case of vendor portability. Apple Silicon, like Intel GPUs, sits in the “beyond NVIDIA” space.

L027

gpu.cpp / WebGPU

Austin Huang: WebGPU's macOS backend is ultimately Metal. Same place as the five objects in § 03.

L001

CUDA profiling

Mark Saroufim: the reference for the mapping table in § 07. The debugging ladder in § 06 also follows the basic PyTorch profiler pattern.

L018

Fusing Kernels

Mark · Charles: the general theory of fusion. It applies equally to Metal shaders.

L028

Liger Kernel

Byron Hsu: training kernels in Triton. In the equivalent place on Apple (MPS/Metal), vendor library calls dominate.

L032

Unsloth

Daniel Han: choosing a backend for fine-tuning. Apple is an example of a poor fit for fine-tuning.

L033

Bitblas

Wang Lei: low-bit GEMM. llama.cpp's Q4_K Metal implementation is the Apple-side example of the same thing.

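§ 12 Open questions

Things to verify for yourself the next time you watch

Perf of the M3 SIMD-group matrix: compare the throughput of the 8×8 fp16 mma against NVIDIA mma in the same role. How far does real GEMM perf actually catch up?
Op coverage of PyTorch MPS: how far it has grown since the lecture, especially support for training-related features (autograd, FSDP).
The perf gap between MLX and PyTorch MPS: exactly where the Apple-native design wins, and the effect of lazy graph compilation.
Feasibility of multi-Mac training: whether Thunderbolt 5 and future unified-compute developments make multi-node training possible. Today it is effectively single-node.
The role of Vulkan/MoltenVK: layering Vulkan on Apple translates portable code to Metal. Where is the loss?
The role of the Apple Neural Engine (ANE): an accelerator separate from the GPU, exposed directly only through CoreML. Is there a path for PyTorch to use the ANE as a backend?
Direction of Apple's ML investment: how fast Metal's tooling for ML workloads improves now that Apple Intelligence has been announced.

Verification note

The perf numbers in these notes (for example, Llama 7B on M2 Max at 38 tok/s) are a reconstruction citing data found online at the time of the lecture (October 2024). M3/M4 and the quantization stacks change quickly, so a comparison is only meaningful if you measure in your own environment.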
