GPU Mode · L022 Speculative Decoding · vLLM · High priority · transcript and slides available

Hacker's Guide to Speculative Decoding in vLLM

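A trick built on the fact that LLM decoding is memory-bound at small batch sizes: a draft model proposes K tokens ahead, and the target model verifies and accepts them in a single forward pass. These are study notes on the workflow Cade Daniel (Anyscale) presents using the implementation inside vLLM: preserving algorithmic correctness, throughput accounting, and system integration.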

speculative decoding · vLLM · draft + verify · acceptance rate · memory-bound · PagedAttention · Medusa / Eagle · batching · latency vs throughput

Speaker

Cade Daniel

Anyscale · ex-vLLM speculative decoding maintainer

Lecture number

L022

Study priority

High · read closely

On revisit

Dig into the vLLM source

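Contents · 12 sections

01 The problem the lecture sets out to solve · why spec decoding
02 Structure of decoding latency · prefill vs decode
03 Memory-bound is the key premise · why guessing is fast
04 The draft + verify pattern · algorithm skeleton
05 Acceptance accounting · when it gets faster
06 Implementation inside vLLM · scheduler · proposer · scorer
07 The clash between batching and the KV cache · on top of PagedAttention
08 Medusa · Eagle variants · when the draft becomes a head
09 Throughput vs latency accounting · where spec decoding loses
10 Notes to remember and code resources · key takeaways
11 Paths to other lectures · connections
12 Open questions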

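§ 01 The problem the lecture sets out to solve · Why spec decoding

“Same model, same output, only faster”

The strongest promise of speculative decoding: it reduces latency while preserving the output distribution, so model quality does not degrade. How that promise can hold is the lecture's central question.

The lecture poses three big questions.

Why is small-batch LLM decoding memory-bound? This is the premise that makes spec decoding work (§ 02–03).
How exactly is draft + verify “fast without losing correctness”? The answer is the rejection sampling trick of Leviathan et al. (§ 04).
How do you integrate it into a production serving system like vLLM? This covers the subtle conflicts with the scheduler, PagedAttention, and batching (§ 06–07).

The lecture's mental frame

Spec decoding looks like a simple idea, “guess with a smaller model, verify with the big one”, but the reason it preserves a well-defined output distribution lies in the careful design of rejection sampling. In production, the harder part is not the mathematical correctness but keeping system overhead from eating the gain.

“Speculative decoding is easy as an algorithm and hard as a system: integrating it into vLLM with all the complexity that requires took a year.” Cade Daniel (paraphrased)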

§ 02 Structure of decoding latency · prefill vs decode

Two phases of the same model: one compute-bound, the other memory-bound

LLM inference always has two phases. Prefill runs the entire input prompt through one forward pass (long sequence, many tokens). Decode generates autoregressively, one token at a time (sequence length 1).

FIG · Cost structure of prefill vs decode: same model, different bottleneck

prefill, compute: large batch · GEMM
prefill, memory: weights read once
decode, compute: one token · GEMV
decode, memory: all weights read for every token

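Decode does little arithmetic, yet every weight must be read from HBM for each token. The time spent on arithmetic per token is far smaller than the time needed to stream the weights (model size ÷ HBM bandwidth), so decode is clearly memory-bound. This is the regime where spec decoding pays off.

Worked example: Llama-7B (FP16, ~14 GB of weights) on HBM3 (3 TB/s). The theoretical lower bound for one decode token is 14 GB ÷ 3 TB/s ≈ 4.7 ms. Real latency is higher because of batch size, KV cache size, attention overhead, and so on, but the arithmetic itself uses less than 1% of the GPU's peak compute.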
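The arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch: the parameter count, FP16 width, and 3 TB/s bandwidth come from the example, while the 1e15 FLOP/s peak and the "about 2 FLOPs per parameter per token" rule are assumptions added here for illustration.

```python
def weight_read_ms(n_params, bytes_per_param, hbm_bytes_per_s):
    """Time to stream all weights from HBM once, in milliseconds."""
    return n_params * bytes_per_param / hbm_bytes_per_s * 1e3


def compute_ms(n_params, n_tokens, peak_flops_per_s):
    """Matmul time at peak, assuming ~2 FLOPs per parameter per token."""
    return 2 * n_params * n_tokens / peak_flops_per_s * 1e3


N_PARAMS = 7e9      # Llama-7B
BYTES_FP16 = 2
HBM_BW = 3e12       # 3 TB/s, as in the example
PEAK_FLOPS = 1e15   # assumed order of magnitude for dense FP16, not a measured value

t_mem = weight_read_ms(N_PARAMS, BYTES_FP16, HBM_BW)
t_cmp = compute_ms(N_PARAMS, 1, PEAK_FLOPS)
print(f"weight read per token:     {t_mem:.2f} ms")
print(f"compute per token at peak: {t_cmp:.4f} ms")
print(f"compute / memory time:     {t_cmp / t_mem:.2%}")
```

Under these assumptions the compute time is a fraction of a percent of the weight-read time, which is the gap that spec decoding tries to fill with useful work.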

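§ 03 Memory-bound is the key premise · Why guessing is fast

One weight read costs about the same for 1 token or K tokens

This is why the magic of spec decoding works. When decode is memory-bound, K tokens can be processed as a batch along the sequence dimension for the cost of a single weight read. The arithmetic grows K-fold, but the elapsed time barely changes.

Key observation

In the memory-bound regime, a 1-token decode and a K-token batched forward of the target model take almost the same time. Verifying K candidate tokens at once is therefore nearly free, and much faster than running K single-token steps.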

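FIG · Verifying multiple tokens for the cost of one weight read: the memory-bound accounting trick

target 1-token decode: ~5 ms
target 4-token batch verify: ~6 ms
target 8-token batch verify: ~7 ms
draft 1-token (small model): ~1 ms

The target's time barely grows as K increases, and the draft is a small model, so one draft token is very cheap. With the numbers in this figure, K = 4 draft steps plus one batched verify take about 4 × 1 + 6 = 10 ms, roughly 2 target steps. If 3 of the K drafts are accepted on average, the round emits 4 tokens (3 drafts plus the target's own token) in 10 ms instead of 20 ms, about a 2× speedup.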
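The flat region in the figure is what an idealized roofline model predicts. The sketch below is not a measurement: `ridge_tokens` (the token count at which compute time catches up with one weight read) is a hypothetical parameter, and the real curve rises slightly with K, as the figure's 5, 6, 7 ms suggest.

```python
def forward_ms(n_tokens, weight_read_ms=4.7, ridge_tokens=300):
    """Idealized roofline: forward time = max(memory time, compute time).

    weight_read_ms: time to stream the weights once (from the worked example).
    ridge_tokens:   assumed token count where compute time equals weight_read_ms.
    """
    compute_ms = weight_read_ms * n_tokens / ridge_tokens
    return max(weight_read_ms, compute_ms)


for n in (1, 4, 8, 64, 600):
    print(f"{n:>4} tokens -> {forward_ms(n):.1f} ms")
```

In this model the time stays at one weight read until the token count reaches the ridge point and grows linearly beyond it, which is the behavior § 09 returns to for large batches.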

§ 04 The draft + verify pattern · Algorithm skeleton

One round of “guess → verify”

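FIG · Timeline of one spec decoding round: 4-token draft + verify + accept

draft model: proposes d1..d4
target model: waits for the draft, then batch-verifies d1..d4
result: accept d1, accept d2, accept d3, reject d4, emit the target's t4' instead

One round costs 4 draft steps + 1 verify forward. With a high acceptance rate (e.g. 75%), on average 3 of the 4 drafts are accepted, plus 1 “bonus” token from the target itself, for ~4 tokens per round. Ordinary decoding yields 1 token per target forward, so this is ~4× as many tokens per target forward. The wall-clock gain is smaller once draft time and the slightly longer verify pass are counted.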

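The secret of correctness: rejection sampling

The draft model q(x) and the target model p(x) are different distributions. At each token position, draw u uniformly from [0, 1]. If u ≤ p(x_d)/q(x_d), accept (keep the draft token x_d as is). Otherwise reject, and sample a new token at that position from the “corrected distribution” max(0, p(x) - q(x)) / Z, where Z normalizes it to sum to 1. Drafts after the first rejection are discarded. The core theorem of Leviathan et al. is that the output of this procedure is distributed exactly as p(x).

“The output distribution is preserved. That is the mathematical guarantee of quality, and the reason spec decoding is not just a lossy approximation.” Cade Daniel (paraphrased)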
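The round described in this section can be sketched as a toy in pure Python. Everything here is illustrative: `target_probs` and `draft_probs` are hypothetical stand-ins for the two models over a 4-token vocabulary, and the per-prefix scoring loop stands in for what a real system does in one batched forward.

```python
import random

VOCAB = 4  # toy vocabulary size (assumption)


def target_probs(context):
    """Hypothetical target p(. | context): strongly prefers (last token + 1) mod VOCAB."""
    nxt = (context[-1] + 1) % VOCAB
    return [0.7 if t == nxt else 0.1 for t in range(VOCAB)]


def draft_probs(context):
    """Hypothetical draft q(. | context): same preference, less peaked."""
    nxt = (context[-1] + 1) % VOCAB
    return [0.4 if t == nxt else 0.2 for t in range(VOCAB)]


def sample(weights, rng):
    return rng.choices(range(len(weights)), weights=weights)[0]


def speculative_round(context, k, rng):
    """One draft + verify round. Returns the emitted tokens (between 1 and k + 1)."""
    # 1) Draft k tokens autoregressively with the small model.
    drafts, q_dists, ctx = [], [], list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafts.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2) Score all k + 1 prefixes with the target (one batched forward in practice).
    p_dists = [target_probs(list(context) + drafts[:i]) for i in range(k + 1)]

    # 3) Accept or reject left to right.
    emitted = []
    for i, tok in enumerate(drafts):
        p, q = p_dists[i], q_dists[i]
        if rng.random() <= min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # First rejection: resample from max(0, p - q) and drop the remaining drafts.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(sample(residual, rng))
        return emitted

    # 4) Every draft accepted: take one bonus token from the target.
    emitted.append(sample(p_dists[k], rng))
    return emitted


rng = random.Random(0)
print(speculative_round([0], k=4, rng=rng))
```

A quick sanity check is to run many rounds from the same context and compare the frequency of the first emitted token with `target_probs(context)`; the two should agree up to sampling noise.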

§ 05 Acceptance accounting · When does it get faster

A function of three variables: K, draft latency, acceptance rate

The variables that determine the throughput gain of spec decoding.

K: number of tokens the draft produces per round.
α: per-token acceptance rate (depends on model and domain, typically 0.5–0.85).
r = T_draft / T_target: ratio of draft to target forward time (typically 0.05–0.2).

Expected tokens emitted per round, counting the target's own token (a truncated geometric series) ≈ (1 - α^(K+1)) / (1 - α). Round time = K·T_draft + T_target. Speedup ≈ (1 - α^(K+1)) / ((1 - α)(1 + K·r)).

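FIG · One round with K = 4, α = 0.7, r = 0.1

draft: K = 4 tokens
target verify (1 batch forward): K-token input, K+1 logits output
accept d1 (P = 0.7): 70%
accept d2 | d1 (P ≈ 0.7): 49%
accept d3 | d1,d2: 34%
accept d4 | d1,d2,d3: 24%

Average accepted per round ≈ 0.70 + 0.49 + 0.34 + 0.24 ≈ 1.8 draft tokens. Adding the target's bonus token gives ≈ 2.8 tokens per round, so the speedup is ≈ 2.8 / (1 + 4 × 0.1) ≈ 2.0×.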
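The accounting in this section can be checked numerically. This is a minimal sketch of the formulas above; it assumes each draft token is accepted independently with the same probability α and that a verify pass costs exactly one target step.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per round, including the target's own token."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha, k, r):
    """Tokens per round divided by round cost in target steps (1 + k * r)."""
    return expected_tokens(alpha, k) / (1.0 + k * r)


for k in (1, 2, 4, 8):
    print(f"K={k}: tokens/round={expected_tokens(0.7, k):.2f}, "
          f"speedup={speedup(0.7, k, 0.1):.2f}x")
```

With α = 0.7 and r = 0.1 the gain peaks around K = 4 and falls off for larger K, because the extra draft steps cost more than the rarely accepted tail tokens return.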

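When α is low, spec decoding is a loss

Below α ≈ 0.5, rejections become frequent and much of the draft time is wasted. Under the model above, the speedup falls below 1 (decoding gets slower) once (1 - α^(K+1)) / (1 - α) < 1 + K·r; for K = 1 this reduces to α < r. That model ignores system overhead, so the practical break-even point is higher. Turning on spec decoding without measuring α per domain is therefore risky.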
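To see where the idealized model crosses a speedup of 1, the break-even α can be found by bisection. This sketch reuses the same independence assumption as the formulas in this section and deliberately ignores system overhead, so treat the thresholds as optimistic lower bounds rather than operating guidance.

```python
def speedup(alpha, k, r):
    tokens = sum(alpha ** i for i in range(k + 1))  # 1 + a + ... + a^k
    return tokens / (1.0 + k * r)


def break_even_alpha(k, r, tol=1e-6):
    """Smallest alpha with speedup >= 1 (speedup is increasing in alpha)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if speedup(mid, k, r) >= 1.0:
            hi = mid
        else:
            lo = mid
    return hi


def best_k(alpha, r, k_max=16):
    """K in [1, k_max] that maximizes the idealized speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, r))


for k in (1, 2, 4, 8):
    print(f"K={k}, r=0.1: break-even alpha = {break_even_alpha(k, 0.1):.3f}")
print("best K for alpha=0.7, r=0.1:", best_k(0.7, 0.1))
```

The break-even α rises with K, which is one reason a dynamic choice of K per request is attractive.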

§ 06 Implementation inside vLLM · scheduler · proposer · scorer

Where the algorithm meets the components of production serving

The abstraction of the framework Cade built in vLLM breaks down into three components.

Proposer

Produces the draft tokens. There are several implementations: a small LLM (such as Llama-68M), prompt-lookup, n-gram, Medusa heads, Eagle, and others. All share one interface (propose(K) → token list).

Scorer

The target model's batched forward: K candidate tokens in → K+1 logits out. On top of PagedAttention it handles “multiple candidates for multiple sequences” in one pass.

Spec Engine

A sub-component of the scheduler. Per request, it decides whether to turn spec on or off right now, calls the Proposer, calls the Scorer, runs rejection sampling, and updates the KV cache.

vLLM's big trick: every proposer plugs in through the same interface. n-gram, draft model, Medusa, and Eagle can be swapped with a one-line change. When a new algorithm appears, only a new Proposer has to be written.
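The plug-in idea can be illustrated with a small interface sketch. The class and function names below are hypothetical and are not vLLM's actual classes; the verification step uses greedy matching (the temperature-0 special case) instead of full rejection sampling to keep the example short.

```python
from typing import List, Protocol


class Proposer(Protocol):
    def propose(self, token_ids: List[int], k: int) -> List[int]:
        ...


class CountUpProposer:
    """Toy proposer: guesses that tokens keep counting up modulo 10."""

    def propose(self, token_ids, k):
        return [(token_ids[-1] + i) % 10 for i in range(1, k + 1)]


class RepeatLastProposer:
    """Toy proposer: guesses that the last token repeats."""

    def propose(self, token_ids, k):
        return [token_ids[-1]] * k


def toy_scorer(token_ids, drafts):
    """Stand-in for the target's batched forward.

    Returns the target's greedy token at each of the K + 1 positions.
    The toy target always continues with (previous token + 1) mod 10.
    """
    previous = [token_ids[-1]] + list(drafts)
    return [(t + 1) % 10 for t in previous]


def run_round(proposer: Proposer, scorer, token_ids, k):
    drafts = proposer.propose(token_ids, k)
    targets = scorer(token_ids, drafts)  # K + 1 target tokens
    emitted = []
    for d, t in zip(drafts, targets):
        if d != t:
            emitted.append(t)  # first mismatch: take the target's token, stop
            return emitted
        emitted.append(d)
    emitted.append(targets[k])  # every draft accepted: bonus token
    return emitted


print(run_round(CountUpProposer(), toy_scorer, [3], k=4))     # [4, 5, 6, 7, 8]
print(run_round(RepeatLastProposer(), toy_scorer, [3], k=4))  # [4]
```

Swapping the proposer changes only the first argument of `run_round`; the scoring and acceptance logic is untouched, which is the property the section attributes to the shared interface.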

What the scheduler decides

(1) K chosen dynamically per request: smaller K for short sequences. (2) Enabling spec for only part of the batch, so other requests' throughput is not hurt. (3) Automatically disabling spec when the acceptance rate is too low (online α estimation).
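Decision (3) could look something like the sketch below. This is a hypothetical policy, not the algorithm vLLM uses: the exponential moving average, the 0.5 threshold, and the class name `AcceptanceGate` are all assumptions made for illustration.

```python
class AcceptanceGate:
    """Hypothetical on/off gate driven by a moving average of the acceptance rate."""

    def __init__(self, threshold=0.5, decay=0.9, initial_alpha=1.0):
        self.threshold = threshold
        self.decay = decay
        self.alpha = initial_alpha  # start optimistic so spec is tried at least once

    def update(self, n_accepted, n_proposed):
        """Fold the accepted fraction of one round into the running estimate."""
        if n_proposed > 0:
            observed = n_accepted / n_proposed
            self.alpha = self.decay * self.alpha + (1 - self.decay) * observed

    def spec_enabled(self):
        return self.alpha >= self.threshold


gate = AcceptanceGate()
for _ in range(20):
    gate.update(n_accepted=1, n_proposed=4)  # a poorly matched draft
print(round(gate.alpha, 3), gate.spec_enabled())
```

One design caveat: once the gate closes, no new observations arrive, so a real policy would need some form of periodic probing to reopen it.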

§ 07 The clash between batching and the KV cache · On top of PagedAttention

Where spec breaks most subtly: integration with the system

The algorithm assumes a single sequence, but vLLM puts hundreds of sequences in the same batch. How spec decoding works there is the hardest part of the lecture.

One sequence per request: simple

Adding K draft tokens → allocate K more KV slots
Verification fails → revoke those slots
No problem when the request runs alone

Continuous batching: hard

Other requests in the same batch do only a 1-token decode, while this request does a K-token verify
The attention query length differs per row: variable-length attention
How does a shared prefix (e.g. a system prompt) combine with spec?
What to do with KV entries already written to the cache when only some of the K tokens are accepted

vLLM's answer: “spec is just the general case where one batch row has query length K+1”. PagedAttention addresses the KV cache per token slot inside fixed-size blocks, so this fits naturally (it is enough to know the K slots within the blocks). A single attention kernel handles the different query lengths of different requests at the same time.

“Because PagedAttention pages at token granularity, the K-token verify in spec decoding is not a special case; it is just an ordinary row with query-length = K+1.” Cade Daniel (paraphrased)
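The "ordinary row with query length K+1" idea can be made concrete with a small layout sketch. The helper names and the block size are assumptions for illustration, not vLLM internals; the cumulative-offset layout is the usual convention for variable-length attention kernels.

```python
from itertools import accumulate


def cu_seqlens(query_lens):
    """Cumulative offsets into a flattened query tensor, one entry per request boundary."""
    return [0] + list(accumulate(query_lens))


def slot_mapping(block_table, start_pos, n_tokens, block_size):
    """Physical KV slot for each new token of one sequence.

    block_table[i] is the physical block that holds logical positions
    [i * block_size, (i + 1) * block_size).
    """
    slots = []
    for pos in range(start_pos, start_pos + n_tokens):
        block = block_table[pos // block_size]
        slots.append(block * block_size + pos % block_size)
    return slots


K = 4
query_lens = [1, K + 1, 1]  # two plain decode rows and one spec row in the same batch
print(cu_seqlens(query_lens))               # [0, 1, 6, 7]
print(slot_mapping([7, 2], 14, K + 1, 16))  # [126, 127, 32, 33, 34]
```

The spec row differs from its neighbors only in its query length and in the number of slots it writes, and its tokens may straddle a block boundary without any special handling.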

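§ 08 Medusa · Eagle variants · When the draft becomes a head

What if the “draft model” is actually part of the target model?

These are more advanced variants of spec decoding. Instead of keeping a separate draft model, extra heads are attached to the target model itself so that one forward pass produces K tokens at once.

Medusa

Adds K small prediction heads on top of the target model's last hidden state. head_k predicts “the token k steps ahead”. At inference, one forward yields K candidates. Tree-style speculation.

Eagle

A step beyond Medusa. A small transformer that takes the hidden state itself as input predicts the next hidden state, so speculation happens at the hidden-state level rather than the token level. Higher acceptance rate (~80–90%).

N-gram lookup

The draft is not a model but an n-gram search over the prompt. Works well in domains such as RAG or code completion, where tokens close to the input recur often. Very fast drafting.

prompt-lookup

Searches the prompt for the tail of the text generated so far and guesses the tokens that follow. Simple, with near-zero cost.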
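The lookup-style proposers are simple enough to sketch directly. The function below is a minimal illustration on integer token ids, with a hypothetical name and defaults; it is not the implementation in vLLM.

```python
def ngram_propose(token_ids, k, n=3):
    """Find the most recent earlier occurrence of the last n tokens and
    propose up to k tokens that followed it. Returns [] when nothing matches."""
    if len(token_ids) <= n:
        return []
    tail = token_ids[-n:]
    # Scan right to left over start positions, skipping the tail itself.
    for start in range(len(token_ids) - n - 1, -1, -1):
        if token_ids[start:start + n] == tail:
            return token_ids[start + n:start + n + k]
    return []


history = [5, 6, 7, 8, 9, 1, 5, 6, 7]
print(ngram_propose(history, k=2))    # [8, 9]
print(ngram_propose([1, 2, 3, 4], k=2))  # []
```

Because the draft costs only a list scan, r is close to 0; the acceptance rate then depends entirely on how often the output repeats earlier text.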

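The big picture of the trade-off

(1) The lighter the draft, the lower r (good), but α drops with it. (2) The heavier the draft, the higher α (good), but r rises. (3) Medusa/Eagle push draft compute close to 0 while lifting α to ~0.8, getting the best of both. The downside is that fine-tuning is required.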

§ 09 Throughput vs latency accounting · Where spec decoding loses

The subtlety of “it cuts latency but can also cut throughput”

This is the most candid part of the lecture. Spec decoding almost always reduces latency (time per token). Throughput (total tokens/s of the cluster), however, goes down when the batch is already large.

Why

With a large enough batch, the target model is no longer memory-bound (it becomes compute-bound). The premise that “verifying K tokens takes about as long as verifying 1 token” then breaks: K tokens cost K times the compute, and the compute spent on rejected tokens is wasted. In throughput terms, it is a loss.

FIG · Value of spec decoding by batch size: the memory-bound → compute-bound transition

batch=1, latency advantage: 3× speedup
batch=4: 2× speedup
batch=16, transition region: ~1.2×
batch=64, throughput loss: 0.85×
batch=256, spec decoding off: 0.6×

The multipliers are conceptual values reconstructed from the lecture narrative (to be verified). The point: the larger the batch, the more spec decoding costs. Low-latency interactive use cases and high-throughput batch use cases call for different answers, which is why vLLM's scheduler turns spec on and off automatically.
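The shape of this curve falls out of a simple roofline model. The sketch below is illustrative only: `ridge_tokens` is a hypothetical hardware parameter, the draft is assumed to scale like the target, and KV-cache reads and attention cost are ignored, so the transition lands at a larger batch size than in the conceptual figure.

```python
def step_time(n_tokens, ridge_tokens=300.0):
    """Target forward time in units of one weight read (idealized roofline)."""
    return max(1.0, n_tokens / ridge_tokens)


def spec_vs_plain(batch, k=4, alpha=0.7, r=0.1, ridge_tokens=300.0):
    """Throughput with spec decoding divided by plain decoding at the same batch size."""
    tokens_per_round = sum(alpha ** i for i in range(k + 1))
    plain_step = step_time(batch, ridge_tokens)
    verify = step_time(batch * (k + 1), ridge_tokens)  # K + 1 query tokens per row
    draft = k * r * plain_step  # assumption: draft cost scales like the target's
    plain_rate = batch / plain_step
    spec_rate = batch * tokens_per_round / (verify + draft)
    return spec_rate / plain_rate


for b in (1, 4, 16, 64, 256, 1024):
    print(f"batch={b:>4}: {spec_vs_plain(b):.2f}x")
```

While batch × (K+1) stays under the ridge point the ratio is the memory-bound speedup from § 05; past it, the verify pass pays for every speculated token and the ratio drops below 1.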

“Spec decoding is most effective when the token distribution is ‘guessable’ and the batch is small; outside that regime it should be turned off automatically.” Cade Daniel (paraphrased)

§ 10 Notes to remember and code resources · key takeaways

Memory-bound premise

Spec decoding is fast only when decode is memory-bound (small batch). When decode is compute-bound, it is a loss.

draft + verify

Draft K tokens, verify with 1 batched forward, accept or reject each token via rejection sampling. The output distribution is preserved.

rejection sampling

u ~ U(0,1), accept if u ≤ p(x)/q(x). On reject, sample a new token from max(0, p−q)/Z. The output distribution stays equal to p.

acceptance rate α

Depends on domain and model. Around 0.7 is typical. Enabling spec without measuring α is risky. Below about 0.5 the gain shrinks quickly and can turn into a loss once overhead is counted.

Proposer/Scorer/SpecEngine

vLLM's abstraction. Swapping the proposer changes the algorithm. n-gram, draft model, Medusa, and Eagle all share one interface.

PagedAttention compatibility

Spec's K-token verify = an ordinary row with query length K+1. KV slots are addressed per token, so it fits naturally.

Medusa / Eagle

Fold the draft into heads on the target. Compute cost near 0, α ~0.8. Requires fine-tuning.

Throughput loss

A large batch shifts decode to compute-bound, and spec decoding then reduces throughput. The scheduler turns it off dynamically.

YouTube youtube.com/watch?v=9wNAgpX6z_4

Slides Google Slides — Hacker's Guide to Spec Decoding

Foundational papers Leviathan et al. 2022, Fast Inference from Transformers via Speculative Decoding · Chen et al. 2023, Accelerating Large Language Model Decoding with Speculative Sampling

Variants Medusa · Eagle

Code vllm-project/vllm · the vllm/spec_decode/ directory

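Hands-on: practice sequence

Measure memory-bound directly: time a 1-token decode of Llama-7B and compare it with “14 GB ÷ HBM bandwidth”. Check whether the ratio is close to 1.
K-token verify time: with the same model, time a batched forward for K = 1, 4, 8, 16. See how far the nearly flat region extends.
Measure α: with Llama-7B (target) + Llama-1B (draft), measure the acceptance rate on a code domain and a natural-language domain. See how large the gap between domains is.
Write rejection sampling yourself: a small toy in Python. Uniform u, the p/q ratio, accept/reject. Use KL divergence to verify that the resulting distribution matches p up to sampling noise.
Turn on spec in vLLM: launch with vllm.LLM(speculative_model="...", num_speculative_tokens=K) (argument names vary across vLLM versions). Measure both throughput and latency.
n-gram proposer: the simplest proposer. Search the prompt for the last 4-gram of the current prompt + generated text. If the n-gram is found, use the tokens that follow it as the draft.
Medusa head fine-tune: add 4 heads on top of the last hidden state of a small model (Llama-1B). Fine-tune on a small dataset. Measure the acceptance rate.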
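For the α measurement step in the list above, the per-round accepted counts are enough to estimate α. The sketch below is a hypothetical helper, not part of any library; it assumes each draft token is accepted independently with the same probability, and uses a simulator only to check the estimator.

```python
import random


def estimate_alpha(accepted_counts, k):
    """Maximum-likelihood estimate of the per-token acceptance rate.

    A round with a < k accepted drafts contains a accepts and 1 reject (a + 1 trials).
    A round with a == k contains k accepts and no reject (k trials).
    """
    accepts = sum(accepted_counts)
    trials = sum(a + 1 if a < k else k for a in accepted_counts)
    return accepts / trials if trials else float("nan")


def simulate_round(alpha, k, rng):
    """Number of drafts accepted before the first rejection (at most k)."""
    accepted = 0
    while accepted < k and rng.random() < alpha:
        accepted += 1
    return accepted


rng = random.Random(1)
counts = [simulate_round(0.7, 4, rng) for _ in range(20000)]
print(f"estimated alpha: {estimate_alpha(counts, k=4):.3f}")
```

Dividing accepted tokens by all proposed tokens instead would underestimate α, because drafts after the first rejection are never actually tested.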

§ 11 Paths to other lectures · connections

L012

Flash Attention

The backbone of the variable-length attention in § 07. Running spec decoding's attention on FlashAttention is the standard.

L018

Fusing Kernels

Kapil Sharma. Relevant to where the sampling/rejection-sampling part of vLLM is written as fused kernels.

L028

Liger Kernel

Fused patterns for LLM training kernels. A tool for Medusa head fine-tuning.

L001

CUDA profiling

The tool for the prefill/decode bound analysis in § 02.

L023

Tensor Cores

The compute side of “compute-bound at large batch” in § 09.

L034

Low Bit Triton Kernels

Where to look for how spec decoding's acceptance changes on top of a quantized model.

L050

Learning journey: CUDA, Triton, Flash Attention

An overview of LLM inference systems, in which spec decoding is a standard component.

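§ 12 Open questions

Concrete acceptance rate numbers: the lecture does not give exact α values per model pair. Per-domain α data for standard pairs such as Llama-2 7B/70B needs to be tracked down separately (to be verified).
Batch transition point: the exact batch size of the “memory-bound → compute-bound transition” in § 09. It depends on model size and GPU, so measure it in your own environment with NCU.
Tree-attention variants: the K-branch tree produced by Medusa's K heads. Instead of one token at a time, “K^2 candidates” are verified at once. The lecture mentions that integration into vLLM was still in progress at the time.
Multi-step lookahead: pipelined processing of several rounds rather than one. Mentioned briefly in the lecture, not explored in depth.
Combination with FP8 / quantization: how acceptance changes on top of a quantized target model, and how much the distributions seen in training differ.
Online α estimation: the exact algorithm behind “automatically disabling” in § 06 (window length, threshold, etc.) should be checked in the vLLM source.

Verification note

The percentages and timings in these notes are conceptual values reconstructed from the lecture narrative and standard hardware specs. In real production, measurements vary widely with each combination of model, domain, and hardware.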
