| phase | op | AI (FLOP/byte) | bound |
|---|---|---|---|
| prefill | GEMM(QKV) | ~200 | FLOPs |
| prefill | FA forward | ~Np·d | FLOPs |
| decode | GEMM(proj) | ~2·B | HBM BW |
| decode | attn w/ KV | ~1 | HBM BW |
B = batch size. Decode AI can only be raised by increasing the batch ↗ V18
dist hist (actual traffic shape)
input  ▁▂▅█▇▄▂▁▁▁ ... (skewed)
output ▂▄█▅▃▂▁▁▁▁ ... (short)
p50 128 · p99 4096
cu_seqlens = [0, 2000, 2001, 2002] → use varlen FA ↗ V07 §4
Llama-3-70B FP16: L=80, Hkv=8, D=128, T=1 → 320 KB/token·req
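A quick sanity check of that 320 KB figure in plain Python (the leading 2 counts K and V):

```python
# KV bytes per token = 2 (K,V) * layers * kv_heads * head_dim * dtype_bytes
L, H_kv, D, dtype_bytes = 80, 8, 128, 2      # Llama-3-70B, FP16
kv_per_token = 2 * L * H_kv * D * dtype_bytes
print(kv_per_token // 1024, "KB")            # → 320 KB per token per request
```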
| metric | definition | dominant phase |
|---|---|---|
| TTFT | time until first token | prefill |
| TPOT | average inter-token interval | decode |
| e2e | TTFT + (No−1)·TPOT | both |
Detailed breakdown ↗ §19 p.20
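A worked instance of the e2e identity above, with made-up numbers:

```python
TTFT, TPOT, N_out = 0.5, 0.05, 256     # s, s/token, output tokens
e2e = TTFT + (N_out - 1) * TPOT        # 0.5 + 255*0.05 = 13.25 s
print(N_out / e2e)                     # ≈ 19.3 tok/s for this request
```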
t0: B = {A, B, C} (lock)
step: forward(pad_to(max_len(B)))
step: ...
tN: A, B, C all hit EOS → B = {}
tN+1: new batch {D, E, F}
A (out=10) ████▫▫▫▫▫▫▫▫▫▫▫▫▫▫▫▫
B (out=2000) ██████████████████████
C (out=50) █████▫▫▫▫▫▫▫▫▫▫▫▫▫▫▫▫
↑ GPU slots sit idle even after A and C finish
A varlen kernel can drive padding waste to 0, but the core problem of static batching is step-level synchronization.
| problem | static-era remedy | residual issue |
|---|---|---|
| padding waste | length buckets | waste at bucket boundaries |
| HoL blocking | early eviction | throughput ↓ |
| KV fragmentation | shrink max_len | truncation risk |
step = one forward pass
iteration = decoding 1 token (= 1 step)
batch = the set of requests grouped into the same step
cu_seqlens = cumulative sequence-length boundaries for packed inputs

```python
# Continuous batching main loop (Orca-style)
def engine_loop():
    while has_requests(waiting | running):
        # 1. schedule a batch for THIS iteration
        batch, budget = scheduler.schedule(
            waiting, running, swapped,
            max_num_batched_tokens, max_num_seqs)
        # 2. prepare packed inputs
        input_ids, cu_seqlens, block_tables = prepare_inputs(batch)
        # 3. one forward step (prefill + decode mixed)
        logits = model.forward(input_ids, cu_seqlens, block_tables)
        # 4. sample next tokens per seq
        next_tokens = sampler(logits, batch.params)
        # 5. update each seq & KV cache
        for seq, tok in zip(batch.seqs, next_tokens):
            seq.append(tok)
            if is_finish(seq, tok):
                finish(seq)
            else:
                return_to_running(seq)
        # 6. stream partial outputs to clients
        emit_stream_events(batch)
```
| metric | static | continuous |
|---|---|---|
| padding waste | 30–70% | ~0% |
| HoL wait p99 | longest req in batch | shortest req in batch |
| GPU util | 50–70% | 80–95% |
| effective batch | fixed B | variable 1..Bmax |
Source: Yu et al., Orca, OSDI '22. Numbers are a qualitative summary of the paper.
┌────────────────────────────────────┐
│ API Server (OpenAI-compat, async) │ Uvicorn/FastAPI
├─────────────▼──────────────────────┤
│ AsyncLLMEngine (event loop) │ req ingest
├─────────────▼──────────────────────┤
│ LLMEngine │ step() driver
│ ├─ Scheduler (queues, budget) │ §5
│ ├─ BlockMgr (KV allocator) │ §6
│ └─ SequenceGroup pool │
├─────────────▼──────────────────────┤
│ Executor (MP / Ray) │ multi-GPU
├─────────────▼──────────────────────┤
│ Worker × N (1 per GPU) │
│ └─ ModelRunner │
│ ├─ sampler │
│ └─ Attention Backend │ §7
│ (FA2/FA3/FlashInfer/XFM) │
├─────────────▼──────────────────────┤
│ CUDA kernels + NCCL (TP/PP) │ ↗ V07/V15
└────────────────────────────────────┘
1. API → AsyncEngine.add_request()
2. AsyncEngine.step() → LLMEngine.step()
3. Scheduler.schedule()
→ SchedulerOutput{batched_seqs,
blocks_to_swap_in,
blocks_to_swap_out,
blocks_to_copy}
4. Executor.execute_model(SchedOut)
→ broadcast to Workers (TP)
5. Worker.ModelRunner.execute_model()
a. prepare_inputs (gather tok,
positions, block_tables)
b. model.forward() ← graph?
c. sampler → next_tokens
d. return SamplerOutput
6. LLMEngine.process_outputs()
→ update SequenceGroup
→ BlockMgr.free / alloc
7. AsyncEngine yields stream chunks
| object | owns | role |
|---|---|---|
| Sequence | token ids | one individual generation |
| SequenceGroup | sampling params | beam · parallel n |
| BlockTable | logical→phys mapping | per Sequence |
| SchedulerOutput | batch plan | input to step() |
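A minimal sketch of this object model as Python dataclasses; field names are illustrative, not vLLM's exact attributes:

```python
from dataclasses import dataclass, field

@dataclass
class Sequence:
    seq_id: int
    token_ids: list[int]              # prompt + generated tokens
    block_table: list[int] = field(default_factory=list)  # logical → phys frames

@dataclass
class SequenceGroup:
    request_id: str
    seqs: list[Sequence]              # n > 1 for beam / parallel sampling
    sampling_params: dict             # temperature, top_p, ...
```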
Attention backends: FlashAttention (FA2/FA3) · FlashInfer · xFormers · TorchSDPA
Common interface: forward(q, kv_cache, block_tables, metadata)

| mode | traits | when |
|---|---|---|
| UniprocExec | single GPU | TP=PP=1 |
| MPExec | torchrun, fork | single-node TP |
| RayExec | remote actors | multi-node |
vLLM V1 design doc, 2024-09
┌─waiting──┐ new reqs, prefill not started
│ seq, seq │
└──┬───────┘
   │ can_allocate? budget?
   ▼
┌─running──┐ forward every step
│ seq, seq │
└──┬───────┘
   │ OOM / preempt-swap
   ▼
┌─swapped──┐ KV offloaded to CPU
│ seq, seq │
└──────────┘
Return path: back to running after swap-in
| from | to | condition |
|---|---|---|
| waiting | running | blocks allocatable + within budget |
| running | swapped | OOM + swap policy |
| running | waiting | preempt-recompute policy |
| swapped | running | swap-in possible |
| running | finished | EOS or max_tok |
| | swap | recompute |
|---|---|---|
| action | KV → CPU/pinned | discard KV, re-run prefill |
| cost | PCIe transfer | prefill FLOPs |
| suited for | many generated tokens | short prompt, little decode |
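A back-of-envelope version of the swap-vs-recompute decision; the bandwidth and FLOP-rate constants are illustrative, not measured:

```python
def prefer_swap(kv_bytes: float, prompt_flops: float,
                pcie_bw: float = 25e9,         # ~PCIe 4.0 x16 effective, B/s
                gpu_flops: float = 300e12) -> bool:
    swap_time = 2 * kv_bytes / pcie_bw         # swap out now + swap back in later
    recompute_time = prompt_flops / gpu_flops  # cost of re-running prefill
    return swap_time < recompute_time
```

Long generations accumulate KV that is expensive to recompute, which is why the table pairs swap with "many generated tokens".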
```python
def schedule():
    budget = Budget(max_tok, max_seq)
    out = []
    # 1. running first (decode continuity)
    for s in running:
        if not budget.can_add(s):
            preempt(s); continue
        out.append(s)
    # 2. bring swapped seqs back
    for s in swapped:
        if can_swap_in(s) and budget.ok():
            swap_in(s); out.append(s)
    # 3. new prefills from waiting
    for s in waiting:
        if can_alloc(s) and budget.ok(s):
            alloc(s); out.append(s)
    return out
```
seq A logical: [0][1][2][3]
↓
block_table_A: [12, 07, 03, 21] (frame ids)
HBM pool:
frame 03 ████ ← A.logical[2]
frame 07 ████ ← A.logical[1]
frame 12 ████ ← A.logical[0]
frame 21 ████ ← A.logical[3]
| Bp | pros | cons |
|---|---|---|
| 16 | fine-grained sharing, less internal fragmentation | larger tables |
| 32 | middle ground (a common default) | — |
| 128 | better HBM bursts | coarse sharing |
shared prefix (system prompt)
block_P: refcount = 3 ← shared by A, B, C

A diverges (writes) at the end of the prefix:
1. alloc new block_P'
2. copy block_P → block_P'
3. A.table[i] = P'; P.refcount -= 1
→ surfaced as "blocks_to_copy" in SchedulerOutput
KV pool size is governed by gpu_memory_utilization; vLLM sizes it via determine_num_available_blocks().
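Conceptually the block count is simple arithmetic; the real determine_num_available_blocks() profiles a dummy forward pass to measure the activation peak first. A sketch:

```python
def num_gpu_blocks(hbm_total: int, weight_bytes: int, act_peak: int,
                   gpu_mem_util: float, block_bytes: int) -> int:
    usable = hbm_total * gpu_mem_util - weight_bytes - act_peak
    return int(usable // block_bytes)

# block_bytes for the kv_cache layout a few lines below:
# 2 (K,V) * L * Bp * H_kv * D * dtype_bytes
```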
| type | conventional | paged |
|---|---|---|
| internal | reserves max_len → large | ≤ Bp−1 tokens, last block only |
| external | needs contiguous space | frame-granular → 0 |
BlockManager API:
- allocate(seq, num_blocks) · free(seq)
- append_slot(seq) — called per decode token; grabs a new block when crossing a block boundary
- swap_in/out(blocks) — move to/from the CPU pool
- fork(parent, child) — refcount++, share the table

Attention inputs:
- q — current step's queries (packed)
- kv_cache — the global pool tensor
- block_tables — [B, max_blocks] int32
- context_lens — T of each seq
- cu_seqlens_q — prefill varlen boundaries

```python
# per (seq_id, head, tok_idx)
phys = block_tables[seq_id, tok_idx // B_p]
off  = tok_idx % B_p
k = kv_cache[layer, 0, phys, off, head, :]
v = kv_cache[layer, 1, phys, off, head, :]
```
vLLM kv_cache shape: [L, 2, n_blocks, Bp, Hkv, D]
| | prefill | decode |
|---|---|---|
| Q tokens | Np per seq | 1 per seq |
| K/V source | current toks (new) | cache (old) + current |
| kernel | FA varlen | paged kernel |
| parallel axis | within a seq | across seqs |
| AI | high (FLOPs) | low (BW) |
Depends on max_context_len and B; the backend decides. Detailed split-K math ↗ V07 §17
On request entry:
1. scheduler: hash(prompt_prefix)
   → look up hit node N
2. prepend N.block_ids to
   seq.block_table (refcount++)
3. the Q kernel skips the "cached" span;
   only new toks are actually computed
4. TTFT drops here (shorter prefill)
More detail → §9 prefix caching p.10
Writes go through the reshape_and_cache kernel.

max_batched_tok = 512
req A: decode (1 tok)
req B: decode (1 tok)
req C: prefill 2000 tok (in progress, 800/2000)
→ this step's toks = [1, 1, 510]   chunk = 510 (512 − 2)
→ next step: req C chunk 510 ...
final step finishes C's prefill → C joins decode
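The per-step split above is just budget arithmetic; a sketch (helper name is hypothetical):

```python
def plan_step(num_decode_seqs: int, prefill_remaining: int,
              max_batched_tokens: int = 512) -> list[int]:
    # decodes get 1 token each; the prefilling req gets the remaining budget
    chunk = min(max_batched_tokens - num_decode_seqs, prefill_remaining)
    return [1] * num_decode_seqs + [chunk]

print(plan_step(2, 1200))   # → [1, 1, 510]
```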
Q: [chunk_k]
K,V: cache[0:chunk_k_start]
+ (chunk_k own K,V)
mask: causal within chunk_k
full on prior cached
| metric | no chunking | chunking |
|---|---|---|
| TPOT p99 spikes | large | small |
| TTFT | short | slightly ↑ (overhead) |
| throughput | idle during prefill | decode overlap ↑ |
Overhead source: the attention kernel is re-launched at every chunk boundary.
Each chunk advances num_computed_tokens; once num_computed_tokens == prompt_len, the seq switches to decode.

tok:   [t0..t15][t16..t31][t32..t47]...
hash:     h0        h1        h2
table: h0→blk_A  h1→blk_B  ...
No hit when tokens don't align to block boundaries; the choice of Bp matters.
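A sketch of chained block hashing: each key commits to the entire prefix up to that block, so only exact block-aligned prefixes can hit (helper is illustrative, not vLLM's hashing code):

```python
import hashlib

def block_hashes(token_ids, Bp=16, salt=("model_id", "kv_dtype", "lora_id")):
    h, keys = hashlib.sha256(repr(salt).encode()).hexdigest(), []
    for i in range(0, len(token_ids) - len(token_ids) % Bp, Bp):
        # chain: key_i depends on all blocks 0..i
        h = hashlib.sha256((h + repr(token_ids[i:i + Bp])).encode()).hexdigest()
        keys.append(h)             # key for block i // Bp
    return keys                    # a partial last block gets no key
```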
| policy | behavior | trade-off |
|---|---|---|
| LRU on leaves | evict refcount=0 leaves | fair · standard |
| LFU | by hit count | keeps hot prompts |
| TTL | time-based expiry | avoids staleness |
vLLM default: LRU; only blocks with refcount 0 are eligible to be freed.
The cache hash key also includes model_id, kv_dtype, and lora_id.

Capture once per batch size:
G[1], G[2], G[4], G[8], G[16], G[32]
... each with fixed input tensors
Execution:
B_actual = current batch size
B_padded = next_pow2_or_bucket(B_actual)
copy inputs into G[B_padded].buffers
G[B_padded].launch()
slice outputs[:B_actual]
A mask is needed so padding dummy seqs don't write into the KV cache.
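A sketch of the bucketed dispatch; the actual capture/replay would go through torch.cuda.CUDAGraph, omitted here:

```python
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]          # graphs captured at startup

def pick_bucket(batch_size: int) -> int:
    for b in CAPTURE_SIZES:
        if b >= batch_size:
            return b                          # smallest captured size that fits
    raise ValueError("exceeds largest graph; fall back to eager mode")

# run: copy live inputs into G[pick_bucket(B)]'s fixed buffers,
# replay the graph, then slice outputs[:B]
```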
| variable | handling |
|---|---|
| batch size | bucketed graph pool |
| seq len (cached) | written into a persistent tensor, passed via kernel args |
| block_tables | fixed max_blocks, rest padded with −1 |
| prefill | no graph (eager) · varlen |
enforce_eager=False, cudagraph_capture_sizes; can be turned off with --disable-cuda-graph.

Baseline decode step:
target.forward(1 tok) → 1 new tok
Spec decode step:
1. draft.forward(γ tok) → γ drafts
2. target.forward(γ+1 tok) → γ+1 logits
3. verify(drafts, target_logits)
→ accepted n ∈ [1, γ+1] tok
4. seq.append(accepted)
5. KV cache: accepted n만 commit
the rejected tail is discarded (truncated)
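The verify step uses the standard speculative-sampling acceptance rule (Leviathan et al. 2023); a scalar-probability sketch:

```python
import random

def verify(draft_tokens, p_draft, p_target):
    """p_draft[i] / p_target[i]: prob each model assigned to draft_tokens[i]."""
    accepted = []
    for tok, q, p in zip(draft_tokens, p_draft, p_target):
        if random.random() < min(1.0, p / q):   # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            break   # first rejection: resample from norm(max(p−q, 0)) and stop
    return accepted  # plus one bonus target token when all γ drafts are accepted
```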
| component | role |
|---|---|
| DraftWorker | draft model forward (a small LM) |
| TargetWorker | target model forward (γ+1 toks at once) |
| Scorer | rejection sampling · accept index |
| KVManager | target cache commit/rollback |
| variant | draft source | engine cost |
|---|---|---|
| Vanilla | external small model | +1 worker |
| Medusa | extra heads on the target | shares weights |
| EAGLE | hidden-state features | small auxiliary net |
| n-gram | prompt matching | no GPU cost |
| axis | target | engine impact |
|---|---|---|
| weight | GPTQ/AWQ W4A16 | dequant GEMM kernel |
| act+wgt | FP8 / INT8 | low-precision GEMM |
| KV cache | INT8/FP8 KV | paged layout + scales |
Detect quantization → swap Linear → QuantLinear (carries group_size, bits metadata), or Linear → Fp8Linear + per-tensor scale.

kv_dtype = "fp8_e4m3"
block layout: [B_p, H_kv, D] data + [per-block scale] side table
write path: k_fp16 → scale_compute → k_fp8; store scale (per block or per token)
read (attention): k_fp8 → scale·k → k_fp16 → QK
Accuracy risk: a badly placed scale collapses quality ↗ V07 §17 warn
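A per-block FP8 KV quant/dequant sketch in PyTorch (float8_e4m3fn needs a recent PyTorch; 448 is the e4m3 max):

```python
import torch

def quant_kv_block(k_fp16: torch.Tensor):      # [Bp, H_kv, D]
    scale = k_fp16.abs().amax().clamp(min=1e-12) / 448.0   # one scale per block
    k_fp8 = (k_fp16 / scale).to(torch.float8_e4m3fn)
    return k_fp8, scale                        # scale lives in the side table

def dequant_kv_block(k_fp8: torch.Tensor, scale: torch.Tensor):
    return k_fp8.to(torch.float16) * scale     # done inside the attention kernel
```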
| engine | W4A16 | FP8 | KV Q |
|---|---|---|---|
| vLLM | GPTQ/AWQ/Marlin | O (H100) | INT8/FP8 |
| SGLang | AWQ/FP8 | O | FP8 |
| TRT-LLM | AWQ/SQ | O (per-tensor) | INT8/FP8 |
PP: activations move between stages via send/recv (NCCL P2P).

engine (rank 0 driver)
  Scheduler / BlockMgr ← single instance
     ┌──────┬──────┐
     │      │      │
  Worker0 Worker1 WorkerN   (TP=N)
  ModelRunner × TP
  KV cache per worker (shard of H_kv, or full)
  allreduce, synchronous

engine.execute_model:
  broadcast SchedOut to all workers
  each worker: execute → gather output
| condition | executor |
|---|---|
| TP=1, PP=1 | Uniproc |
| single node, TP>1 | MP (torch.multiprocessing) |
| multi-node | Ray |
| TP + PP | Ray (recommended) |
--tensor-parallel-size · --pipeline-parallel-size
--distributed-executor-backend {mp, ray}
--enable-expert-parallel (separate MoE EP axis)

root
├─"You are a helpful"─┬─"assistant" [A,B,C] rc=3
│                     └─"expert"    [D]     rc=1
└─"###"─ ...
match(tokens) → (node, matched_len)
insert(tokens, block_ids) — split edge if needed
lock(node) / unlock(node) — refcount +/−
evict(size) — LRU-remove refcount=0 leaves

On request arrival:
1. radix.match(prompt) → (node, L)
2. block_tbl = [ node's ancestor blocks ]
3. prefill only tokens[L:] (the tail)
4. during generation, on each completed block:
   radix.insert(new_block, tokens)
5. on seq finish:
   unlock the ancestor path
   (a leaf becomes evictable once refcount hits 0)
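A minimal token-trie sketch of match(); the real implementation splits edges on partial matches and tracks refcounts and LRU timestamps per node:

```python
class Node:
    def __init__(self):
        self.edges = {}          # first token → (edge_tokens, child Node)
        self.block_ids = []      # KV blocks covering this node's span
        self.refcount = 0

def match(root: Node, tokens: list[int]):
    node, i = root, 0
    while i < len(tokens):
        hit = node.edges.get(tokens[i])
        if hit is None:
            break
        edge_toks, child = hit
        if tokens[i:i + len(edge_toks)] != edge_toks:
            break                # partial edge match → insert() would split here
        i += len(edge_toks)
        node = child
    return node, i               # (deepest fully matched node, matched_len)
```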
| | vLLM (hash) | SGLang (trie) |
|---|---|---|
| structure | block-hash → id | token trie |
| sharing unit | block-aligned | token-level |
| branching | explicit CoW | edge split |
| LRU | block list | per leaf |
| lookup | O(prefix/Bp) | O(prefix) tokens |
Zheng et al. SGLang, 2024
Core primitives: gen, select, fork, regex, json.
```python
@function
def multi_turn(s, q):
    s += system("You are...")
    s += user(q)
    s += gen("ans", max_tokens=256)
    if s["ans"].contains("?"):
        s += user("clarify")
        s += gen("clar")
```
```python
# per step:
state = fsm.current()
allowed = fsm.allowed_tokens(state)
logits[~allowed] = -inf
tok = sample(logits)
state = fsm.step(state, tok)
```
| type | support |
|---|---|
| regex | Outlines-style FSM |
| JSON Schema | CFG / Pydantic → FSM |
| EBNF/CFG | pushdown automaton |
| XGrammar | incremental token masking (high perf) |
The engine runs the FSM step on CPU and broadcasts the mask onto the GPU logits.
| engine | structured-generation mechanism |
|---|---|
| vLLM | Outlines/XGrammar plugin |
| TRT-LLM | logit processor (user-implemented) |
| LMDeploy | custom grammar |
[build time]
Python API (tensorrt_llm)
→ TRT Network Graph (DSL)
→ Plugin insertion (FA, GEMM, …)
→ TRT Engine optimizer
→ serialized .engine file

[runtime]
C++/Python runtime
Executor (in-flight batching)
GptSession / GptManager
KV Cache Manager (paged)
samplingConfig per request
Custom ops are wrapped as IPlugin: gpt_attention (FMHA/paged), gemm, rmsnorm, lookup.
Prefix sharing is enabled with the kv_cache_reuse flag.

| pros | cons |
|---|---|
| kernel autotune | build takes minutes to hours |
| static optimization | dynamic-shape constraints (profiles needed) |
| layer fusion | rebuild when model structure changes |
| lowest latency | separate engine per quant config |
Triton inference server integration (tensorrtllm_backend).

| axis | vLLM | SGLang | TensorRT-LLM | LMDeploy | MLC-LLM |
|---|---|---|---|---|---|
| language / stack | Python + C++ core | Python + Triton/CUDA | C++ runtime + Python build | C++ + Python | TVM / MLC IR |
| scheduler | 3-queue (wait/run/swap) budget | RadixAttention-aware scheduler | In-flight Batching (GptManager) | Persistent batching | TVM runtime · simple |
| KV structure | PagedAttention (block_table) | Paged + Radix trie | Paged KV plugin | Block-based (TurboMind) | contiguous · paged optional |
| Prefix sharing | hash prefix cache (auto) | Radix trie (token-level) | kv_cache_reuse flag | session-level cache | limited |
| Chunked prefill | O (default on) | O (Sarathi-style) | O (context chunking) | O | limited |
| CUDA Graph | decode, bucketed | decode, bucketed | build-time static | decode | TVM runtime |
| Spec decoding | Medusa/EAGLE/n-gram | EAGLE/Medusa | Medusa/EAGLE | none to limited | none |
| Quant (weight) | GPTQ·AWQ·Marlin·Machete | AWQ·FP8 | AWQ·SQ·FP8·INT4 | AWQ·SQ | GPTQ·AWQ (TVM codegen) |
| Quant (KV) | INT8/FP8 | FP8 | INT8/FP8 per-block | INT8 | limited |
| Parallelism (TP/PP) | TP, PP, EP (MoE) | TP, EP | TP, PP, EP | TP | TP (simple) |
| MoE | Mixtral · Qwen MoE · DeepSeek | DeepSeek MLA optimizations | Mixtral · DeepSeek | Mixtral | limited |
| Structured gen | Outlines / XGrammar | native SGL + XGrammar | logit processor (manual) | limited | limited |
| Sampling | temp/top-p/top-k/penalty/logitproc | same + logprob | same + bad_words | limited API | basic |
| Plug-ability | attention backends · model code open | kernel swapping · Python-centric | plugin system · build step | own kernel library | at the IR-lowering level |
| Sweet spot | general throughput · research workhorse | multi-turn · high RadixAttn prefix hit rate | production low latency · aggressive quant | non-NVIDIA NPUs/edge as well | cross-HW (Vulkan/Metal/AMD) |
Sources: each project's official docs and release notes (2024–2025). Only feature presence is listed; measured numbers are excluded (out of scope).
| endpoint | purpose |
|---|---|
| POST /v1/completions | text completion |
| POST /v1/chat/completions | chat, role-based |
| POST /v1/embeddings | embeddings |
| GET /v1/models | list models |
| POST /v1/rerank | reranker (some engines) |
HTTP req (JSON) → validate (pydantic)
→ apply chat template (jinja) → tokenize (prompt_ids)
→ build SamplingParams → engine.add_request()
→ (async) stream chunks → detokenize
→ SSE event write → HTTP close
Streaming: repeat data: {chunk}\n\n, then data: [DONE]\n\n; incremental detokenize (offset_mapping).

Jinja2 template (HF tokenizer_config.json):

```jinja
{% for m in messages %}
<|start|>{{m.role}}\n{{m.content}}<|end|>
{% endfor %}
<|start|>assistant\n
```
Varies per model; a wrongly applied template badly hurts output quality.
Tool calling: tools array in (JSON Schema), tool_calls field out. Cancellation: abort_request(req_id).

| metric | meaning |
|---|---|
| TTFT | first-token latency |
| TPOT | average inter-token time |
| e2e latency | total response time |
| GPU cache usage | free blocks / total |
| num running/waiting | queue depth |
| prefix hit rate | cached / prompt token |
| component | AI | driver |
|---|---|---|
| QKV proj | high | GEMM (FLOPs) |
| Attention | Np·d | FA kernel |
| O proj | high | GEMM |
| MLP | high | GEMM×2 |
| norm/residual | low | memory BW |
Prefill overall: compute-bound (most components).
| component | share | bound |
|---|---|---|
| attn (paged) | ~50% | HBM BW |
| GEMM (weight) | ~30% | HBM BW (W 로드) |
| norm + rotary | ~10% | BW / launch |
| sampler | ~5% | vocab scan |
| host overhead | ~5% | Python/schedule |
Decode is memory-bound. Raise AI by increasing the batch or lowering precision (quant).
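A speed-of-light estimate backing this up: at B=1, every decode step must stream the full weights (plus this request's KV) from HBM. Numbers are illustrative (70B FP16, H100-class bandwidth):

```python
weight_bytes = 70e9 * 2                        # FP16 weights
kv_bytes = 320e3 * 8192                        # 320 KB/token × 8K context
hbm_bw = 3.35e12                               # B/s
tpot_floor = (weight_bytes + kv_bytes) / hbm_bw
print(f"{tpot_floor * 1e3:.1f} ms/step at B=1")  # ≈ 42.6 ms; a larger batch
                                                 # amortizes the weight read
```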
tok/s │ ╱‾‾‾‾‾ saturation (FLOPs)
│ ╱
│ ╱ memory-bound linear (BW)
│ ╱
└─────────────────▶ B
TPOT │ ╱
│ ╱
│ ──────
└─────────────────▶ B
| knob | effect |
|---|---|
| max_num_seqs | batch ceiling → throughput ceiling |
| max_num_batched_tokens | total toks per step → TPOT spikes |
| max_model_len | KV reservation → fewer concurrent reqs |
| gpu_memory_utilization | KV pool size |
| block_size | sharing granularity vs overhead |
| swap_space | preempt-swap capacity |
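One plausible wiring of these knobs through vLLM's offline API (argument names match recent vLLM releases; verify against your installed version):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    max_num_seqs=256,                # batch ceiling
    max_num_batched_tokens=8192,     # per-step token budget
    max_model_len=8192,              # KV reservation per seq
    gpu_memory_utilization=0.90,     # share of HBM handed to weights + KV pool
    block_size=16,                   # Bp
    swap_space=4,                    # GiB of CPU swap per GPU
)
```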
| workload | priority | prescription |
|---|---|---|
| Chat interactive | TTFT | prefix cache · allow large chunks |
| Bulk batch | throughput | max out max_num_seqs · quant |
| Long doc QA | TTFT balance | chunked prefill · PP |
| Agent loop | TPOT + cache | radix · spec decoding |
Q1. Is the goal production low latency + aggressive quant?
    YES → TensorRT-LLM (accept the build cost)
    NO ↓
Q2. Centered on multi-turn / agents / tree search?
    YES → SGLang (RadixAttention)
    NO ↓
Q3. Need non-NVIDIA hardware too?
    YES → MLC-LLM (cross-HW) or LMDeploy
    NO ↓
Q4. General research / throughput, broad model coverage?
    YES → vLLM (the workhorse)
    NO ↓
Q5. Edge / mobile / WebGPU?
    YES → MLC-LLM
| metric | formula |
|---|---|
| TTFT | req arrival → first tok |
| TPOT | (e2e − TTFT) / (No−1) |
| e2e | TTFT + (No−1)·TPOT |
| tok/s | No / e2e, per req |
| goodput | Σ tok/s over SLO-satisfying reqs |
| prefix hit | cached_tok / prompt_tok |
| KV util | used_blocks / total_blocks |
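These fall straight out of per-request timestamps; a sketch with illustrative SLO thresholds:

```python
def goodput(reqs, slo_ttft=1.0, slo_tpot=0.1):
    """reqs: dicts with t_arrive, t_first, t_done, n_out."""
    total = 0.0
    for r in reqs:
        ttft = r["t_first"] - r["t_arrive"]
        e2e = r["t_done"] - r["t_arrive"]
        tpot = (e2e - ttft) / max(r["n_out"] - 1, 1)
        if ttft <= slo_ttft and tpot <= slo_tpot:
            total += r["n_out"] / e2e      # only SLO-satisfying reqs count
    return total                           # tok/s summed over good requests
```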
API → AsyncEngine → LLMEngine.step()
→ Scheduler.schedule()
[waiting|running|swapped]
budget(tok,seq,block)
→ Executor.execute_model()
broadcast to Workers (TP)
→ Worker.ModelRunner
prepare_inputs (packed, block_tables)
model.forward() [CUDAGraph?]
sampler
→ process_outputs (BlockMgr free/alloc)
→ stream chunks