CUDA 18VOL · TIER 5 DISTRIBUTED/SERVING · A4 LANDSCAPE · 22p

V16 Inference Serving 단권화

vLLM · SGLang · TensorRT-LLM engine internals — Scheduler · BlockManager · Continuous Batching

Volume 16 / 18

Tier T5 Distributed & Serving

선행 V07(attention) · V08(LLM kernels) · V15(distributed)

용도 engine path 전체 지도 — kernel 호출 궤적 추적

범례

핵심 용어 (노랑)

표 헤더 / 매우 중요

정의·공식 박스

예시·워크드 박스

★시험·실무 핵심

(!)니모닉 (권당 ≤5)

↗다른 권 참조 (xref)

→인과·흐름

인쇄 A4 가로 / 여백 없음 / 배경 그래픽 포함 · Ctrl(⌘)+P

vLLM · SGLang · TRT-LLM · Orca · PagedAttention · RadixAttention

1 Autoregressive 2-phase prefill · decode

정의 prefill: prompt N_p token 1-shot forward → O(N_p²) attention · FLOPs-bound. decode: 1 token/step autoregressive · O(N_p+t) attention · memory-bound.

prefill = parallel over seq, decode = sequential over time
한 request가 두 phase를 직렬 통과 → 스케줄러가 phase별 분기 필요

2 Arithmetic intensity 비대칭

phase	op	AI (FLOP/B)	bound
prefill	GEMM(QKV)	~200	FLOPs
prefill	FA forward	~N_p·d	FLOPs
decode	GEMM(proj)	~2·B	HBM BW
decode	attn w/ KV	~1	HBM BW

B = batch size. decode AI ≈ batch ↗로만 끌어올림 ↗ V18

3 Sequence 길이 분산 ★

production trace: input 길이 log-normal · output 길이 geometric · 상호 무관
long-tail: p50 ≪ p99 (e.g. 200 vs 4K token)
request 간 이질성이 serving의 핵심 난제

dist hist (실제 트래픽 모양)
 input ▁▂▅█▇▄▂▁▁▁ ... (skew)
 output▂▄█▅▃▂▁▁▁▁ ... (short)
 p50 128    p99 4096

4 Batch 내 이질성 mixed phase

같은 step에 R₁(prefill 2000 tok), R₂(decode 1 tok), R₃(decode 1 tok) 공존. 총 token = 2002, GPU에는 "flat token" 2002개만 보임 → 스케줄러가 2D 구조(batch×seq)를 1D packed로 평탄화.

cu_seqlens = [0, 2000, 2001, 2002] varlen FA 사용 ↗ V07 §4

5 KV cache 성장

KV_bytes(req) = 2·L·H_kv·D·T·dtype
total = Σ_r KV_bytes(r) L layer · H_kv KV head · D head dim · T 현재 seq 길이 · 2 = K+V

Llama-3-70B FP16: L=80, H_kv=8, D=128, T=1 → 320 KB/token·req

6 SLO 이원화

지표	정의	지배 phase
TTFT	첫 token 나올 때까지	prefill
TPOT	tok/토큰 평균 간격	decode
e2e	TTFT + (N_o−1)·TPOT	둘 다

상세 분해는 ↗ §19 p.20

7 Out-of-scope 재확인

attention kernel 자체 ↗ V07
분산 collective 내부 ↗ V15
quant 알고리즘 ↗ V10
일반 scheduler 알고리즘론 (CPU 이론) — 제외

1 Static batching 정의 request-level

정의 batch를 request 단위로 잠그고, 해당 batch 전체가 끝날 때까지 step을 반복. 새 request는 이전 batch 종료 후에만 합류.

t0: B = {A, B, C}   (lock)
 step: forward(pad_to(max_len(B)))
 step: ...
tN: A,B,C 모두 EOS → B = {}
tN+1: new batch {D, E, F}

2 Head-of-Line blocking ★

batch 내 최장 seq가 종료 지점을 결정
짧은 req (output 10 tok) 도 긴 req (output 2000 tok) 끝까지 대기
p99 latency = longest + queue wait

A (out=10)   ████▫▫▫▫▫▫▫▫▫▫▫▫▫▫▫▫
B (out=2000) ██████████████████████
C (out=50)   █████▫▫▫▫▫▫▫▫▫▫▫▫▫▫▫▫
             ↑ A,C 끝나도 GPU idle

3 Padding waste

waste = 1 − Σ N_i / (B · max_i N_i) N_i : 실제 길이. 길이 분산 ↑일수록 waste ↑

B=4, lengths=[100, 2000, 200, 50] → waste = 1 − 2350 / 8000 = 70.6%. padding 토큰이 FLOP을 낭비.

varlen kernel이면 waste=0 가능하지만, static batching은 step 단위 동기화가 핵심 문제

4 자원 점유 KV 슬롯

batch 고정 → KV 슬롯 B개 pre-allocate
짧은 req가 끝나도 슬롯 해제 불가 (다음 batch까지 잠김)
GPU 메모리는 항상 max batch 기준

5 Dynamic batching 중간 단계

정의 Triton/TF-Serving 계열. queue timeout까지 request를 모아 batch 구성. 한 batch 시작 후에는 여전히 static.

queue wait ↓하지만 HoL blocking 잔존
LLM 같이 output 길이가 가변이면 효과 제한

6 Static의 처방 공간

문제	static 처방	잔존
padding waste	length bucket	bucket 경계 낭비
HoL blocking	early eviction	throughput ↓
KV fragmentation	max_len 축소	잘림 위험

결론: LLM serving은 iteration-level scheduling 필요 → §3 Orca.

7 용어 정리

step = 한 번의 forward pass
iteration = decode 1 token (= 1 step)
batch = 같은 step에 묶인 request 집합

1 Iteration-level scheduling ★ step마다 재편성

정의 Orca (OSDI'22): 매 iteration마다 running set을 재구성. EOS 난 req는 즉시 빠지고, 새 req는 다음 step에 합류.

unit of scheduling = iteration, not request
padding 제거: packed 1D token stream + cu_seqlens
짧은 req wait ≈ 0, long req가 short req를 block하지 않음

2 Selective batching

token-wise op (GEMM, norm, MLP) → packed batch 가능
sequence-wise op (attention) → per-seq 실행 필요 (varlen FA)
layer 내부에서 두 모드 전환

token-wise : X[T_total, d] @ W
seq-wise : attn(Q_i, K_i, V_i) ∀ i

3 Core loop 의사코드 ★

# Continuous batching main loop (Orca-style)
def engine_loop():
  while has_requests(waiting | running):
    # 1. schedule a batch for THIS iteration
    batch, budget = scheduler.schedule(
      waiting, running, swapped,
      max_num_batched_tokens,
      max_num_seqs)

    # 2. prepare packed inputs
    input_ids, cu_seqlens, block_tables = \
      prepare_inputs(batch)

    # 3. one forward step (prefill+decode 섞임)
    logits = model.forward(
      input_ids, cu_seqlens, block_tables)

    # 4. sample next tokens per seq
    next_tokens = sampler(logits, batch.params)

    # 5. update each seq & KV cache
    for seq, tok in zip(batch.seqs, next_tokens):
      seq.append(tok)
      if is_finish(seq, tok): finish(seq)
      else: return_to_running(seq)

    # 6. stream partial outputs to clients
    emit_stream_events(batch)

4 효과 정량 개념적

지표	static	continuous
padding waste	30–70%	~0%
HoL wait p99	longest req	batch 내 최단
GPU util	50–70%	80–95%
effective batch	고정 B	가변 1..B_max

원문: Yu et al. Orca, OSDI'22. 수치는 논문 정성 요약

5 전제 조건

varlen attention kernel 지원 (FA varlen ↗ V07 §4)
KV cache가 seq 단위로 독립 할당·해제 가능 — PagedAttention 동기
sampler/decode logic이 per-seq parameter 지원

새 req가 prefill 진입 시 CUDA Graph 재캡처가 필요할 수 있음 → §10 p.11에서 해결.

continuous batching 3요소: ITER·PACK·PAGE (iteration 단위 · packed 입력 · paged KV)

1 6-Layer 구성 ★

┌────────────────────────────────────┐
│ API Server (OpenAI-compat, async)  │  Uvicorn/FastAPI
├─────────────▼──────────────────────┤
│ AsyncLLMEngine (event loop)        │  req ingest
├─────────────▼──────────────────────┤
│ LLMEngine                          │  step() driver
│  ├─ Scheduler  (queues, budget)    │  §5
│  ├─ BlockMgr   (KV allocator)      │  §6
│  └─ SequenceGroup pool             │
├─────────────▼──────────────────────┤
│ Executor (MP / Ray)                │  multi-GPU
├─────────────▼──────────────────────┤
│ Worker × N  (1 per GPU)            │
│  └─ ModelRunner                    │
│      ├─ sampler                    │
│      └─ Attention Backend          │  §7
│          (FA2/FA3/FlashInfer/XFM)  │
├─────────────▼──────────────────────┤
│ CUDA kernels + NCCL (TP/PP)        │  ↗ V07/V15
└────────────────────────────────────┘

2 Step 데이터 흐름 1 iteration

1. API → AsyncEngine.add_request()
2. AsyncEngine.step() → LLMEngine.step()
3. Scheduler.schedule()
    → SchedulerOutput{batched_seqs,
                      blocks_to_swap_in,
                      blocks_to_swap_out,
                      blocks_to_copy}
4. Executor.execute_model(SchedOut)
    → broadcast to Workers (TP)
5. Worker.ModelRunner.execute_model()
    a. prepare_inputs (gather tok,
       positions, block_tables)
    b. model.forward()        ← graph?
    c. sampler → next_tokens
    d. return SamplerOutput
6. LLMEngine.process_outputs()
    → update SequenceGroup
    → BlockMgr.free / alloc
7. AsyncEngine yields stream chunks

3 핵심 객체

객체	소유	역할
`Sequence`	token ids	1 개별 생성
`SequenceGroup`	sampling params	beam · parallel n
`BlockTable`	logical→phys	per Sequence
`SchedulerOutput`	batch plan	step 입력

4 Attention Backend 추상 plug-in

FlashAttention (FA2/FA3) · FlashInfer · xFormers · TorchSDPA
Backend 공통 interface: forward(q, kv_cache, block_tables, metadata)
metadata에 prefill/decode mask, varlen cumsum, causal 플래그 전달
hardware/dtype/head-dim별 auto-select

5 Executor 모드

mode	특징	when
`UniprocExec`	single GPU	TP=PP=1
`MPExec`	torchrun, fork	single node TP
`RayExec`	remote actor	multi-node

6 V1 엔진 재설계

V1 (2024+): 단일 스케줄러, prefill/decode 통합, async output proc
Python overhead ↓: C++ core 부분 확장
structured output · multi-modal 1급 지원

vLLM V1 design doc, 2024-09

1 3-Queue 모델 ★

┌─waiting──┐  new req, prefill 미시작
│ seq, seq │
└──┬───────┘
   │ can_allocate? budget?
   ▼
┌─running──┐  매 step forward
│ seq, seq │
└──┬───────┘
   │ OOM / preempt-swap
   ▼
┌─swapped──┐  KV를 CPU로 off
│ seq, seq │
└──────────┘
 상승 복귀: swap-in 후 running으로

2 Transition 조건

from	to	조건
waiting	running	block 할당 가능 + budget
running	swapped	OOM 발생 + swap 정책
running	waiting	preempt-recompute 정책
swapped	running	swap-in 가능
running	finished	EOS or max_tok

3 Budget 2축 token · seq

batch 진입 조건 s:
Σ tok(s) ≤ max_num_batched_tokens
|batch| ≤ max_num_seqs
free_blocks ≥ required(s) prefill seq는 tok = prompt_len, decode seq는 tok = 1

token budget — compute 상한 (one step 시간)
seq budget — sampler/dispatch overhead 상한
block budget — KV OOM 방지

4 Preemption 2가지 ★

	swap	recompute
동작	KV → CPU/pinned	KV 버림, prefill 재실행
비용	PCIe 전송	prefill FLOPs
적합	생성된 토큰 多	prompt 짧고 decode 적음

5 FCFS + priority

기본: first-come-first-serve (arrival time)
priority 지원 시 heap으로 순서 결정
starvation 방지: aging은 현재 명시적 미지원 (policy 책임)

6 Schedule 의사코드

def schedule():
  budget = Budget(max_tok, max_seq)
  out = []
  # 1. running 먼저 (decode 연속성)
  for s in running:
    if not budget.can_add(s):
      preempt(s); continue
    out.append(s)
  # 2. swapped 복귀
  for s in swapped:
    if can_swap_in(s) and budget.ok():
      swap_in(s); out.append(s)
  # 3. waiting 신규 prefill
  for s in waiting:
    if can_alloc(s) and budget.ok(s):
      alloc(s); out.append(s)
  return out

미묘함: chunked prefill 켜면 한 step 내 prefill·decode가 섞여 budget 계산이 token 단위로 통일됨 (§8 p.9).

1 Logical vs Physical ★

정의 logical block: seq 관점의 연속 토큰 창 (0, 1, 2, ...).
physical block: GPU HBM에 실제 할당된 page frame.
block table: logical idx → physical idx.

seq A logical: [0][1][2][3]
          ↓
block_table_A: [12, 07, 03, 21]  (frame ids)

HBM pool:
 frame 03 ████   ← A.logical[2]
 frame 07 ████   ← A.logical[1]
 frame 12 ████   ← A.logical[0]
 frame 21 ████   ← A.logical[3]

2 Block size B_p

B_p	장점	단점
16	공유 세밀, 내부단편화 ↓	table 크기 ↑
32	중간 (default 다수)	—
128	HBM burst ↑	공유 거침

3 Block 크기 바이트 공식

block_bytes = B_p · H_kv · D · 2 · sizeof(dtype) B_p token/block · 2 = K and V · per-layer 별도 pool

Llama-3-70B: H_kv=8, D=128, B_p=16, FP16 → block = 16·8·128·2·2 = 64 KiB per layer. 80 layer × 64 KiB = 5.12 MiB per block group.

4 Ref count & Copy-on-Write ★

shared prefix (system prompt)
 block_P: refcount = 3  ← A, B, C 공유

A가 prefix 끝에서 분기 (write)
  1. alloc new block_P'
  2. copy block_P → block_P'
  3. A.table[i] = P'; P.refcount-=1
"blocks_to_copy" in SchedulerOutput

beam search · parallel sampling · prefix cache에서 활용
CoW는 마지막 block에서만 발생 (완성된 block은 immutable)

5 메모리 프로파일링 기동시

gpu_mem_total − model_weight − activation
− workspace − safety = kv_pool
n_blocks = kv_pool / (L · block_bytes) activation은 dummy forward로 측정. safety = gpu_memory_utilization

vLLM determine_num_available_blocks()

6 Fragmentation 분석

종류	기존	paged
internal	max_len 예약 → 큼	≤ B_p-1 토큰 · 마지막 block만
external	연속 공간 요구	frame 단위 → 0

7 Allocator API 요약

allocate(seq, num_blocks) · free(seq)
append_slot(seq) — decode 1 tok당 호출, block 경계 넘으면 new block
swap_in/out(blocks) — CPU pool 이동
fork(parent, child) — refcount++, share table

CoW trigger를 놓치면 prefix 공유 seq끼리 서로의 KV를 덮어씀. 정확성 치명적.

1 Kernel 입력 구조

kernel 자체의 수식·내부 tiling은 ↗ V07 §16–17
engine이 kernel에 넘기는 것:
1. q — 현재 step query (packed)
2. kv_cache — 전역 pool tensor
3. block_tables — [B, max_blocks] int32
4. context_lens — 각 seq의 T
5. cu_seqlens_q — prefill varlen 경계

2 Gather pattern decode

# per (seq_id, head, tok_idx)
phys = block_tables[seq_id,
                     tok_idx // B_p]
off  = tok_idx % B_p
k = kv_cache[layer, 0,
              phys, off, head, :]
v = kv_cache[layer, 1,
              phys, off, head, :]

vLLM kv_cache shape: [L, 2, n_blocks, B_p, H_kv, D]

3 Prefill vs Decode path 분기

	prefill	decode
Q tokens	N_p per seq	1 per seq
K/V source	현재 tok (new)	cache (old) + 현재
kernel	FA varlen	paged kernel
parallel 축	seq 내부	seq 간
AI	high (FLOPs)	low (BW)

4 v1 vs v2 선택 split-K

v1: 1 CTA = (seq, head) · 짧은 T 효율
v2: K축 P partitions · 긴 T에서 SM 포화 ↑
threshold는 max_context_len과 B에 의존, backend가 결정

자세한 split-K 수식 ↗ V07 §17

5 Prefix cache hit path ★

req 진입:
  1. scheduler: hash(prompt_prefix)
       → hit 노드 N 탐색
  2. N.block_ids 를 seq.block_table에
       prepend (refcount++)
  3. Q kernel은 "cached" span을
       skip — 실제 compute는 새 tok만
  4. 이 단계에서 TTFT ↓ (prefill 단축)

더 상세 → §9 prefix caching p.10

6 KV write path per step

model.forward가 현재 step의 new K, V 계산
attention backend가 reshape_and_cache kernel 호출
block_tables의 "현재 slot"에 write (scatter)
attention 자체는 그 다음 kernel 호출에서 paged read

write → read 의존성 동일 kernel 내부 처리 시 race 가능. vLLM은 scatter cache와 paged attn을 분리한 2-kernel 패턴 사용.

1 동기 왜 쪼개는가

긴 prompt (16K~128K) 1-shot prefill → 그 step 전체를 독점 → decode 요청 멈춤
TTFT는 개선되어도 다른 req의 TPOT가 spike
해결: prefill을 C token chunk로 분할, decode 요청과 한 step에 mix

2 Token 예산 통일 공식

chunk_size(req) = min(
remaining_prefill(req),
max_num_batched_tokens − Σ decode_tok
) decode는 req당 1 tok이므로 남은 예산이 prefill chunk 한도

3 Step mix 예시

max_batched_tok = 512
req A: decode (1 tok)
req B: decode (1 tok)
req C: prefill 2000 tok (진행중 800/2000)
→ 이번 step tok = [1, 1, 510]
   chunk = 510  (512 − 2)
→ 다음 step: req C chunk 510 ...
최종 step에서 C prefill 완료 → decode 합류

4 Attention 처리 부분 prefill

chunk N_c token의 Q가 이전 chunk의 KV를 attend
현재 chunk 내부 causal + 이전 chunk full attend
kernel은 prefill varlen + paged cache 동시 — FA2 varlen이 이를 지원

Q: [chunk_k]
K,V: cache[0:chunk_k_start]
     + (chunk_k own K,V)
mask: causal within chunk_k
      full on prior cached

5 효과 정량

지표	no chunk	chunk
TPOT p99 spike	크다	작다
TTFT	짧다	약간 ↑ (overhead)
throughput	prefill 시 idle	decode 병행 ↑

overhead 원천: chunk 경계마다 attention kernel re-launch

6 Scheduler와 상호작용

prefill seq는 여러 step에 걸쳐 running에 머묾
step마다 num_computed_tokens 갱신
num_computed_tokens == prompt_len 시 decode 단계로 전환

7 Piggyback & Sarathi

정의 Sarathi-Serve (2024): decode step에 prefill chunk를 "piggyback"하여 decode의 메모리 bound 여유분 FLOPs를 prefill이 채움.

메모리 bound decode + FLOP bound prefill = 상보적
roofline 관점: 두 op AI가 다르므로 동시에 섞으면 효용 ↑ ↗ V18

chunked prefill 3키: CHUNK·MIX·PIGGY (prefill 분할 · decode와 혼합 · bound 상보)

1 동기

system prompt · few-shot · agent history — 요청 간 공통 prefix
prefix를 재연산하지 않고 기존 KV block 재사용
TTFT ↓, HBM 점유 ↓

2 Hash 체계 ★

h(block_k) = H(h(block_{k−1}) ∥ tokens_k) chained hash — token 수열이 동일하면 동일 block hash. block 단위(B_p 고정)로 경계 맞춤

tok: [t0..t15][t16..t31][t32..t47]...
hash:   h0      h1        h2
table:  h0→blk_A  h1→blk_B ...

block 경계 토큰 수 불일치 시 hit 불가 — B_p 선택 중요

3 Lookup path

새 req prompt를 B_p-block으로 분할
앞에서부터 chained hash 계산 · table lookup
최장 연속 hit span 결정 (첫 miss에서 break)
hit block의 refcount++, req.block_table에 삽입
miss span만 prefill 실행

4 Eviction 정책

정책	동작	trade
LRU on leaves	refcount=0 leaf 제거	공정 · 표준
LFU	hit 횟수 기반	hot prompt 유지
TTL	시간 만료	stale 방지

vLLM 기본: LRU · refcount 0인 block만 free 대상

5 Hit rate 감각

chat: system prompt 동일 → 60–90% prefix 비율
agent/tool use: multi-turn history 공유 → 매우 높음
RAG: context 매번 다름 → 낮음

6 정확성 조건

tokenizer·model·dtype·quant 설정이 동일해야 hit 가능. sampling seed 바뀌어도 OK (KV는 deterministic).

prefix hash에 model_id, kv_dtype, lora_id 포함
multi-tenant 시 tenant namespace 분리

7 RadixAttention과 관계

vLLM prefix cache: flat hash map
SGLang RadixAttention: trie 기반 (§14 p.15)
trie는 중간 분기 공유 가능, hash map은 block-aligned prefix만

1 동기 launch overhead

decode step = 70B 모델이면 layer × (qkv, attn, o-proj, norm, mlp) = 수백 kernel
각 launch ~5–10 μs · 누적 수 ms
decode 자체는 1 tok · step 내 ms 단위 → launch가 병목
CUDA Graph로 하나의 캡처된 DAG를 통째 launch

2 Capture 조건

모든 tensor shape·dtype·stride 고정
모든 주소 고정 (포인터 변하면 재캡처)
제어 흐름 데이터에 의존 금지 (static if only)
한 stream 안에서 실행 (cross-stream은 sync node 필요)

3 Graph pool batch size bucketing

capture once per batch size:
  G[1], G[2], G[4], G[8], G[16], G[32]
       ...각각 입력 tensor 고정
실행:
  B_actual = current batch size
  B_padded = next_pow2_or_bucket(B_actual)
  copy inputs into G[B_padded].buffers
  G[B_padded].launch()
  slice outputs[:B_actual]

padding dummy seq가 KV에 쓰지 않도록 mask 필요

4 Dynamic shape 회피

변수	처리
batch size	bucketed graph pool
seq len (cached)	상수 tensor에 넣고 kernel args로
block_tables	max_blocks 고정, 나머지 -1
prefill	graph 미사용 (eager) · varlen

5 적용 범위

decode only에 주로 적용 (prefill은 shape 변동 大)
chunked prefill mode는 eager 경로로 fallback하기도
spec decoding의 verify step도 graph 가능

6 비용

각 graph는 자체 workspace · 메모리 ↑
bucket 개수 × model size = 무시 못함
capture 시간이 warmup에 추가됨 (수 초~수십 초)

cudnn/cublas heuristic이 첫 호출에서 분기하면 그대로 박제됨. warmup을 충분히 돌려야 best kernel이 잡힘.

7 관련 설정

vLLM: enforce_eager=False, cudagraph_capture_sizes
TRT-LLM: engine build 자체가 static graph
SGLang: --disable-cuda-graph로 끌 수 있음

1 알고리즘 리마인드

수식·rejection sampling 증명 ↗ V08 §12–13
여기서는 engine path에 어떻게 꽂히는가만

expected_tok/step = (1 − α^γ+1) / (1 − α) α accept prob · γ draft length. α=0.7, γ=4 → ~2.5× throughput

2 Engine 경로 변화

기본 decode step:
  target.forward(1 tok) → 1 new tok

Spec decode step:
  1. draft.forward(γ tok) → γ drafts
  2. target.forward(γ+1 tok) → γ+1 logits
  3. verify(drafts, target_logits)
     → accepted n ∈ [1, γ+1] tok
  4. seq.append(accepted)
  5. KV cache: accepted n만 commit
     rejected tail은 discard (잘라냄)

3 Worker 구성

component	역할
DraftWorker	draft 모델 forward (작은 LM)
TargetWorker	target 모델 forward (γ+1 tok at once)
Scorer	rejection sampling · accept idx
KVManager	target cache commit/rollback

4 KV rollback ★

target이 γ+1 tok의 KV를 미리 cache에 씀
verify 후 accepted n만 유지, (γ+1−n) tok의 KV는 revoke
paged 구조상 "마지막 n slot만 유효" 마크 — refcount에 영향 없음
다음 step에 draft가 다시 이어감

5 Variant × engine

variant	draft 출처	engine 비용
Vanilla	외부 작은 모델	+1 worker
Medusa	target의 extra head	weight 공유
EAGLE	hidden-state feature	small auxiliary net
n-gram	prompt match	GPU 없음

6 Tree attention path

draft가 여러 branch → 동시 검증
target의 attention mask = tree causal mask
attention backend가 custom mask 지원해야 함 (FA2 packed causal + offset)

함정: spec decoding은 batch size가 작을 때만 throughput gain. batch이 이미 크면 memory-bound가 compute로 넘어가 gain이 사라진다.

7 Scheduler interaction

token budget에서 spec step은 γ+1 tok/seq로 계산
accept rate 미달 시 γ 동적 조정 가능 (adaptive γ)

1 양자화 대상 3축

알고리즘 (GPTQ/AWQ/SmoothQuant/FP8) 자체 ↗ V10
여기는 engine path에 들어가는 기계적 hook만

축	대상	engine 영향
weight	GPTQ/AWQ W4A16	dequant GEMM kernel
act+wgt	FP8 / INT8	low-prec GEMM
KV cache	INT8/FP8 KV	paged layout + scale

2 Model loading path

config에서 quantization 감지
GPTQ/AWQ: kernel-bound Linear → QuantLinear 치환 (group_size, bits 메타 포함)
FP8: Linear → Fp8Linear + per-tensor scale
weight checkpoint에서 pre-packed quantized tensor 로드

3 Weight-only quant (GPTQ/AWQ)

y = dequant(W_q, s, z) @ x
fused: "GEMM with dequant on-the-fly" s scale · z zero-point · group_size g (e.g. 128) · W_q INT4 packed

backend: Marlin · Machete · ExLlamaV2 · AWQ-cuda
선택 조건: batch size, dtype, arch (SM80+ vs SM90)

4 FP8 (H100 native)

weight · activation 둘 다 FP8 (E4M3 fwd)
per-tensor 또는 per-block scale
GEMM은 cuBLAS/CUTLASS FP8 kernel · BF16 accum
KV cache도 FP8 저장 가능 (per-page scale)

5 KV cache quant path ★

kv_dtype = "fp8_e4m3"
block layout:
  [B_p, H_kv, D]  data
  [per-block scale]  side table

write path:
  k_fp16 → scale_compute → k_fp8
  scale 저장 (per block or per token)

read (attention):
  k_fp8 → scale·k → k_fp16 → QK

정확도 위험: scale 위치 잘못 잡으면 붕괴 ↗ V07 §17 warn

6 Sampler · activation path

sampler는 보통 FP32 (수치 안정)
norm, residual, rotary는 FP16/BF16 유지 ↗ V08 §8–9
양자화는 GEMM과 KV에만 집중 (edge op는 FP16)

7 Engine별 지원

engine	W4A16	FP8	KV Q
vLLM	GPTQ/AWQ/Marlin	O (H100)	INT8/FP8
SGLang	AWQ/FP8	O	FP8
TRT-LLM	AWQ/SQ	O (per-tensor)	INT8/FP8

1 Scope 제한

collective 알고리즘·NVLink 토폴로지·NCCL 내부 ↗ V15
여기는 vLLM engine이 어떻게 TP/PP를 enable하는지만

2 TP 통신 지점 Megatron 방식

Y = gelu(X · W₁^col) · W₂^row
W₁ col-split, W₂ row-split
need allreduce after W₂ attention 내 Wo도 동일 패턴. allreduce ×2 per layer (attn, mlp)

3 PP 구성

layer 집합을 stage로 분할
inference는 micro-batch 없이 per-request forward
stage 간 send/recv activation (NCCL P2P)
PP > 1이면 pipeline bubble 대신 inter-request parallel 활용

4 Worker 간 역할

engine (rank 0 driver)
  Scheduler / BlockMgr  ← 단일
  ┌──────┬──────┐
  │       │       │
Worker0 Worker1 WorkerN    (TP=N)
  ModelRunner × TP
  KV cache per-worker (shard of H_kv or full)
  allreduce 동기식

engine.execute_model:
  broadcast SchedOut to all workers
  각 worker: execute → gather output

5 Executor 선택

condition	executor
TP=1, PP=1	Uniproc
single node, TP>1	MP (torch.multiprocessing)
multi-node	Ray
TP + PP	Ray (권장)

6 KV cache under TP

KV는 head 축으로 shard (H_kv/TP head per worker)
GQA인 경우 H_kv 작아서 TP가 H_kv 초과 불가
paged 구조는 per-worker 독립 pool

7 통신·compute overlap

inference는 학습보다 overlap 기회 적음 (forward only)
SP (sequence parallel)을 켜면 norm/residual에서 allgather-reduce_scatter 분할
TP=8 이상에서 allreduce가 병목 → NVLS ↗ V15 §5

주의: vLLM은 기본적으로 static TP. request 단위 dynamic TP는 지원 안 함 — 별도 replica로 처리.

8 관련 flag

--tensor-parallel-size · --pipeline-parallel-size
--distributed-executor-backend {mp, ray}
--enable-expert-parallel (MoE EP 별도 축)

1 Radix trie 자료구조 ★

정의 모든 active/cached 시퀀스의 prefix를 radix trie로 유지. edge에 token sequence 저장, 각 노드는 block id list · refcount · last_access 보유.

root
 ├─"You are a helpful"─┬─"assistant"  [A,B,C]  rc=3
 │                      └─"expert"     [D]      rc=1
 └─"###"─ ...           ...

2 연산 API

match(tokens) → (node, matched_len)
insert(tokens, block_ids) — split edge if needed
lock(node) / unlock(node) — refcount +/−
evict(size) — refcount=0 leaf LRU 제거

3 흐름 per request

req 도착:
  1. radix.match(prompt) → (node, L)
  2. block_tbl = [ node... 조상 블록들 ]
  3. prefill only tokens[L:] (tail)
  4. 생성 중 매 block 완성마다:
       radix.insert(new_block, tokens)
  5. seq 종료:
       unlock 조상 path
       (leaf refcount 0 되면 evict 대상)

4 vLLM prefix cache와 차이

	vLLM (hash)	SGLang (trie)
구조	block-hash → id	token trie
공유 단위	block-aligned	token-level
branching	명시적 CoW	edge split
LRU	block list	leaf 기준
lookup	O(prefix/B_p)	O(prefix) token

5 Multi-turn 이득

chat turn 2: prev all tokens가 prefix = trie path 완전 hit
agent tool loop: 같은 system + tool schema reuse
hit rate 60–90% 보고 사례

Zheng et al. SGLang, 2024

6 Branching · fork

parallel sampling n: 한 노드에서 n child leaf
beam search: branch 시 edge에 token 추가 → 자동 split
tree-of-thought: 다수 fork 탐색

7 Eviction 정책 세부

victim 후보: leaf with refcount=0
각 leaf last_access_time 최소 선택
leaf 제거 후 parent가 leaf-되고 refcount=0이면 재귀 제거
blocks_to_free 생성 → BlockManager.free

정확성: 같은 token 수열이라도 tokenizer·dtype·quant가 다르면 분리된 trie — namespace key 필수.

1 SGL 언어 embedded DSL

정의 Python 내장 DSL. prompt를 program으로 표현. gen, select, fork, regex, json 기본 primitive.

@function
def multi_turn(s, q):
  s += system("You are...")
  s += user(q)
  s += gen("ans", max_tokens=256)
  if s["ans"].contains("?"):
    s += user("clarify")
    s += gen("clar")

2 Compiled execution

SGL program은 engine이 직접 해석
prefix 반복은 RadixAttention이 자동으로 KV 재사용
parallel fork는 같은 prefix 공유 → 병렬 decode batch 구성

3 Constrained decoding ★

정의 logit에 mask를 씌워 허용되지 않은 token 확률 = 0 (−∞). grammar 자동기 (FSM/CFG)가 각 step에서 허용 token set 계산.

per step:
  state = fsm.current()
  allowed = fsm.allowed_tokens(state)
  logits[~allowed] = -inf
  tok = sample(logits)
  state = fsm.step(state, tok)

4 FSM 종류

유형	지원
regex	Outlines-style FSM
JSON Schema	CFG / Pydantic → FSM
EBNF/CFG	pushdown automaton
XGrammar	incremental token masking (high perf)

engine은 FSM step을 CPU에서 돌리고 GPU logit에 mask 브로드캐스트

5 Token masking 비용

vocab 100K+ — per-step mask 계산이 병목 가능
XGrammar: 가능 token set을 bit-packed + precompute
SGLang은 overlap (next-step mask 미리 계산)

6 Function calling · tool use

function schema → JSON CFG
model이 tool call → 구조 보장
tool 실행 결과를 새 user message로 append
RadixAttention이 이전 context 유지 · decode 이어감

7 Other engines의 대응

engine	구조 생성 수단
vLLM	Outlines/XGrammar plugin
TRT-LLM	logit processor (사용자 구현)
LMDeploy	custom grammar

FSM이 tokenizer의 BPE merge를 모르면 invalid token sequence 생성. tokenizer-aware FSM 필요.

1 2-Phase: Build → Runtime

[build time]
 Python API (tensorrt_llm)
   → TRT Network Graph (DSL)
   → Plugin insertion (FA, GEMM, …)
   → TRT Engine optimizer
   → serialized .engine file

[runtime]
 C++/Python runtime
   Executor (in-flight batching)
   GptSession / GptManager
   KV Cache Manager (paged)
   samplingConfig per request

2 Plugin 시스템 ★

TRT 기본 op으론 부족 → custom kernel을 IPlugin으로 wrap
핵심 plugin: gpt_attention (FMHA/paged), gemm, rmsnorm, lookup
plugin은 build 시 선택 · runtime에 고정

3 In-flight batching (IFB)

정의 TRT-LLM 용어. Orca의 iteration-level scheduling과 동일 개념. GptManager가 request queue를 매 step 재구성.

vLLM continuous batching과 1:1 대응
GptSession (batch=1), GptManager (IFB), Executor (최신)

4 KV cache 구성

Paged KV 지원 (vLLM과 유사)
kv_cache_reuse flag로 prefix 공유 on
FP8 KV · INT8 KV 지원 (per-block scale)

5 Engine build 유연성 trade

장점	단점
kernel autotune	build 수 분~시간
static optimization	dynamic shape 제약 (profile 필요)
layer fusion	모델 구조 변경 시 재build
lowest latency	quant 설정마다 별 engine

6 Quant 지원

W4A16 (AWQ), W8A8 (SmoothQuant), FP8, INT4 weights
ModelOpt (구 AMMO) 로 양자화 후 TRT-LLM에 전달
build 시 quant config 고정 → 전용 plugin 선택

7 Triton Inference Server 통합

TRT-LLM backend plugin (tensorrtllm_backend)
Triton의 Preprocessor / Postprocessor와 ensemble
production 배포 기본 패턴

version 간 engine 파일 비호환. GPU arch·TRT 버전·quant 전부 동일해야 load 가능. CI에서 파이프라인으로 재빌드 필요.

1 5-Engine × 14-Axis 비교 ★ 선택 결정표

axis	vLLM	SGLang	TensorRT-LLM	LMDeploy	MLC-LLM
언어·스택	Python + C++ core	Python + Triton/CUDA	C++ runtime + Python build	C++ + Python	TVM / MLC IR
스케줄러	3-queue (wait/run/swap) budget	RadixAttention-aware scheduler	In-flight Batching (GptManager)	Persistent batching	TVM runtime · 단순
KV 구조	PagedAttention (block_table)	Paged + Radix trie	Paged KV plugin	Block-based (TurboMind)	Contiguous · paged 옵션
Prefix sharing	Hash prefix cache (auto)	Radix trie (token-level)	kv_cache_reuse flag	Session-level cache	제한적
Chunked prefill	O (default on)	O (Sarathi-style)	O (context chunking)	O	제한적
CUDA Graph	decode bucketed	decode bucketed	build-time static	decode	TVM runtime
Spec decoding	Medusa/EAGLE/n-gram	EAGLE/Medusa	Medusa/EAGLE	미지원~제한	미지원
Quant (weight)	GPTQ·AWQ·Marlin·Machete	AWQ·FP8	AWQ·SQ·FP8·INT4	AWQ·SQ	GPTQ·AWQ (TVM codegen)
Quant (KV)	INT8/FP8	FP8	INT8/FP8 per-block	INT8	제한적
분산 (TP/PP)	TP, PP, EP (MoE)	TP, EP	TP, PP, EP	TP	TP (단순)
MoE	Mixtral · Qwen MoE · DeepSeek	DeepSeek MLA 최적화	Mixtral · DeepSeek	Mixtral	제한적
Structured gen	Outlines / XGrammar	native SGL + XGrammar	logit processor (수작업)	제한적	제한적
Sampling	temp/top-p/top-k/penalty/logitproc	같음 + logprob	같음 + bad_words	제한적 API	기본
Plug-ability	attention backend · model code open	kernel 교체 · Python-centric	plugin system · build step	자체 kernel library	IR lowering 레벨
Sweet spot	general throughput · 연구 workhorse	multi-turn · RadixAttn prefix hit 高	production low-latency · quant 극한	NVIDIA 외 NPU/edge 함께	cross-HW (Vulkan/Metal/AMD)

출처: 각 프로젝트 공식 문서 · 릴리스 노트 (2024~2025). 기능 유무만 정리, 실측 수치는 제외 (out-of-scope).

1 OpenAI-compatible 스펙

endpoint	용도
`POST /v1/completions`	text completion
`POST /v1/chat/completions`	chat role-based
`POST /v1/embeddings`	임베딩
`GET /v1/models`	list models
`POST /v1/rerank`	reranker (일부 engine)

2 요청 파이프라인

HTTP req (JSON)
  → validate (pydantic)
  → apply chat template (jinja)
  → tokenize (prompt_ids)
  → SamplingParams 생성
  → engine.add_request()
  → (async) stream chunks
  → detokenize
  → SSE event write
  → HTTP close

3 Streaming (SSE)

Server-Sent Events: data: {chunk}\n\n 반복 후 data: [DONE]\n\n
engine iteration마다 new tok 생성 → per-client chunk 전송
detokenize는 incremental (BPE 경계 보정: offset_mapping)

4 Chat template

# Jinja2 template (HF tokenizer_config.json)
{% for m in messages %}
<|start|>{{m.role}}\n{{m.content}}<|end|>
{% endfor %}
<|start|>assistant\n

모델마다 다름 · 잘못 적용 시 성능 큰 영향

5 Function calling

request에 tools 배열 (JSON schema)
schema를 prompt에 주입 (template)
constrained decoding (JSON grammar) 으로 구조 보장
response parsing → tool_calls 필드

6 Cancellation · abort

HTTP disconnect 감지 → engine에 abort_request(req_id)
scheduler가 해당 seq를 running에서 제거, block free
이미 dispatch된 step은 완료 후 버려짐 (낭비 최소화는 timing 문제)

7 관측성

metric	의미
TTFT	첫 토큰 지연
TPOT	토큰 간 평균
e2e latency	전체 응답 시간
GPU cache usage	free blocks / total
num running/waiting	queue depth
prefix hit rate	cached / prompt token

1 e2e 공식 ★

T_e2e = T_queue + T_TTFT + (N_o − 1) · T_TPOT
T_TTFT = T_prefill + T_sched + T_sampler N_o : output token 개수. T_TPOT은 평균, variance 별도 추적

2 Prefill 분해

component	AI	driver
QKV proj	high	GEMM (FLOPs)
Attention	N_p·d	FA kernel
O proj	high	GEMM
MLP	high	GEMM×2
norm/residual	low	memory BW

prefill 전체: compute-bound (대부분 구성요소)

3 Decode 분해

component	비율	bound
attn (paged)	~50%	HBM BW
GEMM (weight)	~30%	HBM BW (W 로드)
norm + rotary	~10%	BW / launch
sampler	~5%	vocab scan
host overhead	~5%	Python/schedule

decode: memory-bound 지배. batch ↑ or quant ↓로 AI 상승 필요

4 Bottleneck 결정 규칙

T_TTFT 크다 → prefill GEMM · long context attention (O(N²))
T_TPOT 크다 → weight load BW · KV read BW
T_queue 크다 → 용량 부족 · preemption 다발
variance 크다 → chunked prefill 미적용 · HoL

5 SLO 설계

목표: p99 TTFT ≤ S₁, p99 TPOT ≤ S₂
제약: max_concurrent ≤ C s.t. SLO 충족
최적: throughput(tok/s) 최대 · SLO 충족 하에 C를 올리면 throughput ↑ 하지만 latency ↑ — 반비례

6 Goodput 지표

정의 goodput = "SLO를 충족한 request의 toks/s". raw throughput보다 production에 의미.

SLO 미달 request는 goodput에서 제외
C 스윕하여 goodput 최대점 찾기

7 HW 관점 병목

HBM BW → decode 상한
SM FLOPs → prefill 상한
NVLink/PCIe → multi-GPU overhead
CPU host → Python overhead (V1에서 개선) ↗ V18

1 Fundamental trade ★

throughput(B) ≈ B · tok/s_decode(B)
latency_TPOT(B) ≈ t_step(B) B ↑: throughput 증가, step 시간 증가 → TPOT 증가

tok/s     │    ╱‾‾‾‾‾ saturation (FLOPs)
          │   ╱
          │  ╱  memory-bound linear (BW)
          │ ╱
          └─────────────────▶  B
TPOT      │              ╱
          │           ╱
          │ ──────
          └─────────────────▶  B

2 Memory-bound → compute-bound

작은 B: weight/KV 로드가 지배 → step 시간 ≈ const → throughput 선형 ↑
큰 B: FLOPs가 지배 → step 시간 선형 ↑ → throughput 포화
전환점 B* = HBM BW / (2 · weight_bytes_per_tok)

3 Tunable knobs

knob	영향
`max_num_seqs`	batch 상한 → throughput ceiling
`max_num_batched_tokens`	step당 tok 총량 → TPOT spike
`max_model_len`	KV 예약 → 동시 req ↓
`gpu_memory_utilization`	KV pool 크기
`block_size`	공유 세밀도 vs overhead
`swap_space`	preempt swap 용량

4 Chunked prefill ratio

token budget 중 prefill vs decode 점유 비율 조정
prefill 비율 ↑ → TTFT ↓, 기존 decode TPOT ↑
decode 비율 ↑ → TPOT 안정, prefill 대기 ↑

5 Workload별 sweet spot

workload	우선	처방
Chat interactive	TTFT	prefix cache · 큰 chunk 허용
Bulk batch	throughput	max_num_seqs 최대 · quant
Long doc QA	TTFT balance	chunked prefill · PP
Agent loop	TPOT + cache	radix · spec decoding

6 Admission control

들어오는 RPS가 capacity 초과 시 queue 폭발 → latency 급증
정책: reject · queue + timeout · shed low priority
goodput 최대점 이상 들어오면 전부 SLO 미달 위험

흔한 오해: throughput을 올리면 자동으로 사용자 경험도 좋아진다 — 아님. latency SLO를 먼저 고정하고 그 안에서 throughput 최대화.

trade 3키: SEQ·TOK·KV (max_seqs / max_batched_tokens / kv pool)

1 Engine 선택 결정 트리 ★

Q1. 목표가 production low-latency + quant 극한?
  YES → TensorRT-LLM (build 비용 감수)
  NO  ↓
Q2. multi-turn / agent / tree search 중심?
  YES → SGLang (RadixAttention)
  NO  ↓
Q3. Nvidia 외 HW도 필요?
  YES → MLC-LLM (cross-HW) or LMDeploy
  NO  ↓
Q4. 일반 research/throughput, 넓은 모델 커버?
  YES → vLLM (workhorse)
  NO  ↓
Q5. edge/mobile / WebGPU?
  YES → MLC-LLM

2 지표 정의 요약

지표	식
TTFT	req 도착 → 첫 tok
TPOT	(e2e − TTFT) / (N_o−1)
e2e	TTFT + (N_o−1)·TPOT
tok/s	N_o/e2e · per req
goodput	SLO 충족 req의 tok/s 합
prefix hit	cached_tok / prompt_tok
KV util	used_blocks / total_blocks

3 vLLM 경로 요약

API → AsyncEngine → LLMEngine.step()
  → Scheduler.schedule()
      [waiting|running|swapped]
      budget(tok,seq,block)
  → Executor.execute_model()
      broadcast to Workers (TP)
  → Worker.ModelRunner
      prepare_inputs (packed, block_tables)
      model.forward() [CUDAGraph?]
      sampler
  → process_outputs (BlockMgr free/alloc)
  → stream chunks

4 Paged KV 공식 빠른 참조

block_bytes = B_p·H_kv·D·2·sizeof(dtype)
n_blocks = (mem − W − act − ws − safety) / (L·block_bytes)
max_concurrent_seq ≈ n_blocks / ⌈avg_len / B_p⌉

5 Xref 정리

attention kernel 수식/내부 → ↗ V07
LLM 부속 kernel (RoPE/Norm/Sampler/MoE) → ↗ V08
양자화 수식·알고리즘 → ↗ V10
NCCL/5D 병렬 → ↗ V15
Roofline·stall 분석 → ↗ V18

6 흔한 함정 10선

prefix cache 켰는데 tokenizer/quant mismatch로 hit=0
CUDA Graph capture가 eager 동작을 박제 → dtype/backend 변경 미반영
chunked prefill off 상태에서 long prompt 한 개가 전체 batch 멈춤
CoW 미구현/미호출로 prefix 공유 seq의 KV 오염
spec decoding을 큰 batch에 적용 — gain 소실
FP8 KV scale 위치 오인 → 정확도 붕괴
max_model_len ≫ 실제 사용 → KV pool 과소, 동시성 ↓
TP size > H_kv 설정 → GQA 모델 load 실패
detokenize incremental 버그 → BPE 경계 깨짐
admission 없이 RPS 과잉 → queue 폭발, 전부 SLO 미달

vLLM 5-layer: API·ENG·SCH·BLK·WRK (API · Engine · Scheduler · BlockMgr · Worker)

단권화는 엔진 경로 지도일 뿐. 실제 튜닝 의사결정 · trace 분석은 코드·실측으로 체화 (out-of-scope).