GPU Mode · L093 · November 2025 · v0.1.0 · High priority · transcript failed · supplemented from the Cornserve repo and paper

Cornserve — Easy, Fast and Scalable Multimodal AI

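Multimodal LLM inference breaks the assumptions vLLM is built on: the vision encoder, projection, LLM, and decoder each have a different resource shape and a different latency pattern. Cornserve is a distributed serving platform that splits ("fissions") such a model and scales each component independently. In this lecture Jeff Ma presents the design, published at ACM CAIS 2026, to the GPU Mode audience. Because the transcript failed, this note is reconstructed from the cornserve repo, the paper, and general multimodal-serving domain knowledge.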

multimodal serving · model fission · disaggregation · any-to-any · KV cache sharing · vLLM · SGLang · Cornfigurator

Speaker

Jeff Ma

Cornserve maintainer · distributed systems design for multimodal serving

Lecture number

L093

Study priority

High · read closely

Transcript status

failed · supplemented from repo + paper

Contents · 12 sections

01 The problem the lecture sets out to solve · why this lecture exists
02 Why multimodal serving is hard · what makes it hard
03 Resource shape per model: vision · audio · LLM · component shapes
04 Applying disaggregation: model fission · core abstraction
05 KV cache sharing: reuse across applications · cross-app sharing
06 Routing decisions: where a request goes · request routing
07 Result: automatic placement with Cornfigurator · deployment planner
08 Comparison with vLLM / SGLang · positioning
09 Next: the any-to-any challenge · future
10 Notes to remember and exercises · key takeaways
11 Paths to other lectures · connections
12 Open questions · open questions

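§ 01 The problem the lecture sets out to solve · why this lecture exists

When an LLM and a vision encoder run together on the same GPU, vLLM's simplicity breaks down

The lecture starts from the gap between the world vLLM's design assumes and the world multimodal models actually live in. vLLM layers PagedAttention and continuous batching on top of “one Transformer decoder”. Multimodal models break that assumption across the board: the vision encoder, projector, LLM, and decoder each have different resource needs, different latency, and different batching patterns.

The lecture's frame

“A multimodal LLM is not one model but a combination of several. Placing them all on one GPU hurts efficiency, and placing them separately adds latency. How to resolve that trade-off automatically is where Cornserve begins.”

The word Jeff stresses is fission. Like nuclear fission, one model is split into components, and each component scales independently. Multiple applications share the same component. The shift in view is from “hosting one giant model” to “hosting a mesh of components”.

By the end of the lecture you should have a grip on three abstractions (component / application / route) and one automatic planner (Cornfigurator), plus a criterion for judging when fission pays for itself.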

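§ 02 Why multimodal serving is hard · what makes it hard

It is becoming common for 4–5 different models to run inside a single request

The multimodal LLMs of 2024–25 (LLaVA · Qwen2-VL · GPT-4o · Gemini) share a pattern: the input is a mix of image, audio, and text, and the output is text or, again, image or audio. When one request arrives, several models run sequentially or in parallel.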

FIG · The actual path of one multimodal request (image+text in, text out)
Components (timeline): vision encoder (ViT) → proj → LLM prefill → LLM decode (loop)
Resource pattern (batching/latency): batched · GPU-bound → tiny → PagedAttention → memory-bound

On the left, the vision encoder is compute-bound and batch friendly. On the right, LLM decode is memory-bound. When both sit on the same GPU, one of them is always running sub-optimally. This is the core problem of multimodal serving.
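The compute-bound versus memory-bound split can be made concrete with a rough arithmetic-intensity estimate. The sketch below is a simplification under stated assumptions: it models a single square matmul, counts only weight and activation traffic, and uses an illustrative ridge point of 300 FLOPs per byte that is not taken from the lecture or from any specific GPU datasheet.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_el: int = 2) -> float:
    """FLOPs per byte moved for one [n_tokens, d] x [d, d] matmul."""
    flops = 2 * n_tokens * d_model * d_model
    # Bytes moved: the weight matrix once, plus input and output activations.
    bytes_moved = bytes_per_el * (d_model * d_model + 2 * n_tokens * d_model)
    return flops / bytes_moved


# Assumed ridge point (peak FLOP/s divided by memory bandwidth), illustration only.
RIDGE_FLOPS_PER_BYTE = 300.0


def is_compute_bound(n_tokens: int, d_model: int) -> bool:
    return arithmetic_intensity(n_tokens, d_model) > RIDGE_FLOPS_PER_BYTE


# One decode step at batch size 1: a single token hits every weight.
decode_step = arithmetic_intensity(n_tokens=1, d_model=4096)
# A large ViT batch: many patch tokens reuse the same weights.
vit_batch = arithmetic_intensity(n_tokens=16384, d_model=1024)

print(f"decode: {decode_step:.2f} FLOPs/byte, ViT batch: {vit_batch:.0f} FLOPs/byte")
```

With these toy sizes the decode step stays below 1 FLOP per byte while the batched encoder lands in the hundreds, which is the asymmetry the figure describes.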

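Further complexity stacks on top of this picture.

Reuse of model components: one company serves LLaVA, Qwen2-VL, and two chat apps at the same time. If all of them can share the same ViT instance, no GPU is wasted.
Diversity of input modalities: text-only, image+text, video, and audio requests arrive mixed at one endpoint. Each path passes through different components.
Variants of the same model: Qwen2-VL-7B and Qwen2-VL-72B use the same ViT. The encoder runs once and only the LLM branches.
The any-to-any future: outputs that are image or audio rather than text only. The decoder is one more place for fission.

If you simply serve on top of vLLM

Every application has to load its whole model onto the GPU. The ViT is loaded twice, once for LLaVA-7B and once for LLaVA-13B. The same ViT takes 5–10% of GPU memory twice while doing the same job. This is the simplest motivation for fission.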
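The duplication cost is simple arithmetic. The sketch below assumes a ViT-L-class encoder of roughly 0.3B parameters stored in fp16; both numbers are illustrative assumptions, and the function names are hypothetical.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB: (params_billion * 1e9 params) * bytes / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def encoder_memory_gb(n_apps: int, encoder_params_billion: float, shared: bool) -> float:
    """Total encoder weight memory: one copy per app, or one shared copy."""
    copies = 1 if shared else n_apps
    return copies * weights_gb(encoder_params_billion)


# Assumption for illustration: ~0.3B-parameter encoder, two applications.
duplicated_gb = encoder_memory_gb(n_apps=2, encoder_params_billion=0.3, shared=False)
shared_gb = encoder_memory_gb(n_apps=2, encoder_params_billion=0.3, shared=True)

print(f"duplicated: {duplicated_gb:.1f} GB, shared: {shared_gb:.1f} GB")
```

The saving grows linearly with the number of applications that reuse the encoder. Activation memory is ignored here.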

§ 03 Resource shape per model · component shapes

Each component has a “GPU shape that fits it”

Each component of a multimodal pipeline runs most efficiently on different hardware. So the real value of fission comes from being able to place each component on the GPU that suits it.

Component | Characteristics | GPU | Scale
Vision Encoder (ViT) | compute-bound · batch friendly · small activations. Multiple instances can share a node. fp16/bf16 is enough. | L4 · A10 | width
Projector / Adapter | linear / MLP. Very small. No dedicated GPU: attach it to either the encoder or the LLM. | collocated | 0
LLM Prefill | compute-bound · long prompts are the main cost. A high-bandwidth GPU helps. The main output is KV cache writes. | H100 · H200 | depth
LLM Decode | memory-bound · KV cache reads dominate. Token throughput is capped even at small batch sizes. Needs high HBM bandwidth. | H100 / H200 | replicas
Audio Encoder (Whisper-like) | streaming friendly · chunked input. A CPU or a small GPU is enough. Can live on a separate node. | L4 · CPU | width
Image/Audio Decoder (output) | diffusion / vocoder. Iterative · long GPU time. Dedicated GPU. | A100 · H100 | replicas

What this table shows: serving every component on a single GPU type is almost always sub-optimal. An L4 is enough for the ViT, so putting it on an H100 leaves the H100 idle part of the time. LLM decode needs the H100's HBM bandwidth, so on an L4 it cannot reach its throughput. Placing each component on a suitable GPU is the first payoff of fission.

“GPU utilization in multimodal serving is not measured per instance. It is measured per component graph.” (study note)

§ 04 Applying disaggregation · model fission

application = one path through the component graph

This is Cornserve's core abstraction. The user registers components and defines an application as a graph of those components. Cornserve places that graph across a distributed environment and, when a request arrives, routes it along the graph.

One scenario: the same company runs four services.

chat-vlm: image + text → text. Uses ViT-L · Qwen2-7B.
chat-vlm-pro: the same ViT-L with Qwen2-72B.
doc-qa: image + text → text. Reuses the same ViT-L · Qwen2-7B.
tts: text → audio. Separate vocoder.

The traditional approach loads ViT-L into GPU memory separately for each of the three vision services. Cornserve lets the three applications share one ViT-L component: the same GPU memory, the same batched inference.

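# pseudo-code: Cornserve-style component / app definitions.
# Component and Application below are hypothetical stand-ins so the sketch runs
# on its own; the real API is in the repo's example files.
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    model: str
    gpu: str
    replicas: int
    backend: str = "native"    # assumed default for non-LLM components
    tensor_parallel: int = 1


@dataclass(frozen=True)
class Application:
    components: tuple[Component, ...]   # ordered path through the graph


vit_l = Component(
    model="openai/clip-vit-large-patch14",
    gpu="L4",
    replicas=4,
)

qwen_7b = Component(
    model="Qwen/Qwen2-7B-Instruct",
    gpu="H100",
    replicas=2,
    backend="vllm",
)

qwen_72b = Component(
    model="Qwen/Qwen2-72B-Instruct",
    gpu="H100",
    replicas=8,
    backend="vllm",
    tensor_parallel=4,
)

# the same vit_l object is shared by all three apps
chat_vlm     = Application((vit_l, qwen_7b))
chat_vlm_pro = Application((vit_l, qwen_72b))
doc_qa       = Application((vit_l, qwen_7b))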
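Effect of component sharing

When three applications share the same ViT-L, its GPU memory footprint is 1×, not 3×. Batched inference also becomes much more efficient: images from all three services go into one batch and pass through the ViT in a single forward. By this note's estimate, throughput on a single GPU improves by up to 2–3×.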
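The cross-application batching idea can be sketched as follows. This is a minimal illustration, not Cornserve's implementation: `EncodeRequest`, `fake_encoder`, and `encode_shared` are hypothetical names, and the encoder is a stand-in that returns strings instead of tensors.

```python
from dataclasses import dataclass


@dataclass
class EncodeRequest:
    app: str        # which application sent the image
    image_id: str


def fake_encoder(image_ids: list[str]) -> list[str]:
    """Stand-in for one batched ViT forward pass."""
    return [f"emb({i})" for i in image_ids]


def encode_shared(requests: list[EncodeRequest], max_batch: int = 8):
    """Batch requests from all apps together, then scatter results per app."""
    results: dict[str, list[str]] = {}
    forward_passes = 0
    for start in range(0, len(requests), max_batch):
        batch = requests[start:start + max_batch]
        embeddings = fake_encoder([r.image_id for r in batch])
        forward_passes += 1
        for req, emb in zip(batch, embeddings):
            results.setdefault(req.app, []).append(emb)
    return results, forward_passes


requests = [
    EncodeRequest("chat-vlm", "a"),
    EncodeRequest("chat-vlm-pro", "b"),
    EncodeRequest("doc-qa", "c"),
]
by_app, passes = encode_shared(requests)
print(passes, by_app)
```

Three requests from three applications cost one forward pass here, where three separate deployments would each run their own.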

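Fission also creates a new cost: communication between components. The ViT output has to travel to an LLM on another GPU, which calls for RDMA, NVLink SHARP, or InfiniBand (IB): NVLink within a node, IB across nodes. Trading this communication cost against the resource efficiency that fission buys is Cornfigurator's job (§ 07).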
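A back-of-the-envelope estimate shows the scale of that transfer. The sketch assumes 576 image tokens with hidden size 4096 in fp16, and the link bandwidths in the table are illustrative placeholders rather than measured values. Fixed per-message latency is ignored.

```python
def embedding_bytes(n_tokens: int, hidden: int, bytes_per_el: int = 2) -> int:
    """Size of one image embedding sent from encoder to LLM."""
    return n_tokens * hidden * bytes_per_el


def transfer_ms(n_bytes: int, gb_per_s: float) -> float:
    """Pure bandwidth time in milliseconds (no latency term)."""
    return n_bytes / (gb_per_s * 1e9) * 1e3


# Assumed, illustrative link bandwidths in GB/s.
LINKS_GB_PER_S = {"nvlink_intra_node": 400.0, "ib_inter_node": 50.0}

payload = embedding_bytes(n_tokens=576, hidden=4096)
for link, bw in LINKS_GB_PER_S.items():
    print(f"{link}: {transfer_ms(payload, bw):.4f} ms for {payload} bytes")
```

Under these assumptions a single embedding moves in well under a millisecond on either link, so the cost that matters is how it compares with the encoder and prefill time it sits between, and how it adds up under load.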

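§ 05 KV cache sharing: reuse across applications · cross-app sharing

When the same prompt prefix arrives from several applications, compute its KV cache only once

A large share of LLM inference cost is prefill, where attention over a long prompt is computed up front. When the same prefix (system prompt · few-shot examples · the same image embedding) appears in many requests, its KV cache can be computed once and reused. SGLang's RadixAttention implements this idea. In Cornserve, prefixes are defined per application, so cache reuse across applications is a first-class citizen.

per-application prefix cache: an application's system prompt is part of every request. The KV cache prefilled for the first request is used as-is by the next.
per-image cache: the ViT's image embedding becomes the input prefix of the LLM. When two applications ask different questions about the same image, they share the KV cache of the image embedding. This only works when both applications run the same LLM and the tokens before the image are identical, since KV entries depend on the model and on the preceding context.
session continuation: multi-turn conversation with the same user. The previous turn's KV cache is the starting point for the next turn's prefill.
cache eviction: GPU memory is finite. LRU / TTL, or per-application priority. This automates the decision to evict and accept a cold-start cost.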
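The lookup-and-evict behavior in the list above can be sketched with a small LRU cache. This is a toy under stated assumptions: `PrefixKVCache` is a hypothetical class, cached values are placeholder strings rather than KV tensors, and the key scheme (a scope plus content hash) is one possible design, not the one Cornserve uses.

```python
import hashlib
from collections import OrderedDict


class PrefixKVCache:
    """Toy LRU cache mapping a prefix key to a (placeholder) KV cache."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(scope: str, content: bytes) -> str:
        # scope is e.g. "app:chat-vlm" for a system prompt, or "image" for
        # an image embedding that several apps may share.
        return hashlib.sha256(scope.encode() + b"|" + content).hexdigest()

    def get_or_prefill(self, key: str, prefill):
        if key in self._entries:
            self._entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = prefill()                       # pay the prefill cost once
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict least recently used
        return value


cache = PrefixKVCache(capacity=2)
k_img = PrefixKVCache.key("image", b"image-A-bytes")
cache.get_or_prefill(k_img, lambda: "kv(image A)")   # first app: miss
cache.get_or_prefill(k_img, lambda: "kv(image A)")   # second app: hit
print(cache.hits, cache.misses)
```

Keying the image entry by content rather than by application is what lets a second application hit the entry the first one created.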

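Cornserve's own advantage: the application graph is explicit

vLLM decides “is this prefix the same as another request's” by comparing hashes. Cornserve knows from the application definition itself where each prefix comes from: the system prompt is application metadata, and the image embedding is a component output. Cache lookup can therefore be faster and more accurate. (To be confirmed: verify the exact implementation in the repo code.)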

§ 06 Routing decisions · where a request goes

Follow a path on the graph, but the same graph has several paths

When a request arrives, Cornserve follows the application's component graph and routes to a suitable instance of each component. It looks like a simple path, but in practice three trade-offs act at once: load balancing, cache locality (including KV cache sharing), and queue length.

load balancing: pick the most idle of the 4 ViT instances. Plain round-robin is the baseline.
cache locality: if some instance has the same application's same prefix cached, send the request there. A cache miss costs a large prefill.
KV cache sharing: if the KV cache of an image embedding lives on a particular LLM instance, the next request has to go there to reuse it.
queue length: even with cache locality in its favor, an instance whose queue is too long can cost more latency. A dynamic trade-off.
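One way to combine the considerations above is a per-instance cost estimate, sketched below. The cost model and its two weights (`prefill_cost`, `per_queued_cost`) are hypothetical; the note does not state how Cornserve scores instances.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    queue_len: int
    cached_prefixes: set[str] = field(default_factory=set)


def pick_instance(instances, prefix_key, prefill_cost=10.0, per_queued_cost=3.0):
    """Choose the instance with the lowest estimated cost for this request."""
    def cost(inst: Instance) -> float:
        # A cache miss pays the full prefill; every queued request adds waiting time.
        miss_penalty = 0.0 if prefix_key in inst.cached_prefixes else prefill_cost
        return miss_penalty + per_queued_cost * inst.queue_len
    return min(instances, key=cost)


warm = Instance("qwen-7b-1", queue_len=2, cached_prefixes={"image-A"})
cold = Instance("qwen-7b-2", queue_len=0)

# Cache locality wins while the warm instance's queue is short.
print(pick_instance([warm, cold], "image-A").name)

# Once the warm queue grows, the idle instance becomes cheaper despite the miss.
warm.queue_len = 5
print(pick_instance([warm, cold], "image-A").name)
```

The crossover point is set entirely by the ratio of the two weights, which in a real system would come from measured prefill and service times.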

FIG · Decision graph for request routing (path chosen per request)
request 1 · image A + Q1: ViT-L #2 · cache miss → proj → Qwen-7B #1 · prefill → decode #1
request 2 · image A + Q2 (same image): ViT cache hit · skip → skip → image KV cache hit · Q2 prefill only → decode #2
request 3 · image B + Q3: ViT-L #3 (idle) → proj → Qwen-7B #2 · prefill → decode #3

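Request 2 asks a different question about the same image A: it saves one ViT pass and, by sharing the KV cache of the image embedding, a large part of prefill. When routing follows cache locality, latency can drop to half or less. Request 3 has a different image, so it is a cache miss and is spread to another instance.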

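§ 07 Result: automatic placement with Cornfigurator · deployment planner

component graph + hardware inventory → optimal placement, with no manual work

Cornfigurator (arXiv 2512.14098) is a side project presented alongside Cornserve. It is a planner that takes the resource requirements of the components and the GPU inventory of the cluster as input and automatically decides which component goes where, and with how many replicas.

Input: a throughput model per component (input → latency / GPU mem), the application's SLO (P95 latency, target QPS), and the available GPU pool.
Objective: satisfy the SLO while minimizing total GPU cost.
Output: the GPU type and replica count of each component, and the collocation policy (which components share a node).
Algorithm: ILP or a metaheuristic over a queueing model. See the paper for the exact form.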
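The shape of the problem (inputs, objective, output) can be illustrated with a brute-force toy. This is explicitly not Cornfigurator's algorithm: it checks throughput capacity only, ignores latency SLOs, queueing, and collocation, and every number in the profile tables is made up for illustration.

```python
import math
from itertools import product

# Hypothetical profile: requests/s one replica sustains on each GPU type.
THROUGHPUT = {
    "vit": {"L4": 40.0, "H100": 120.0},
    "llm": {"H100": 10.0},
}
COST_PER_GPU = {"L4": 1.0, "H100": 4.0}   # relative cost units (assumed)
INVENTORY = {"L4": 4, "H100": 8}          # GPUs available (assumed)


def plan(target_qps: float):
    """Return (cost, {component: (gpu_type, replicas)}) or None if infeasible."""
    components = list(THROUGHPUT)
    best = None
    # Try every assignment of a GPU type to each component.
    for gpus in product(*(THROUGHPUT[c] for c in components)):
        replicas = [math.ceil(target_qps / THROUGHPUT[c][g])
                    for c, g in zip(components, gpus)]
        used: dict[str, int] = {}
        for g, r in zip(gpus, replicas):
            used[g] = used.get(g, 0) + r
        if any(n > INVENTORY[g] for g, n in used.items()):
            continue                       # does not fit in the cluster
        cost = sum(COST_PER_GPU[g] * n for g, n in used.items())
        if best is None or cost < best[0]:
            best = (cost, dict(zip(components, zip(gpus, replicas))))
    return best


print(plan(60))
print(plan(200))
```

Exhaustive search is fine for two components and two GPU types; the combinatorial growth with more components is the reason a real planner needs ILP or a heuristic.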

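Why automation matters

Once there are 5–10 components, placing them by hand is unrealistic. Which component on which GPU, how many replicas, what to collocate: the combinations explode. And when the traffic pattern changes, the problem has to be solved again. Cornfigurator takes graph + workload + cluster and produces an optimal plan, cutting days of manual work down to tens of seconds.

Plans are also updated dynamically. If traffic suddenly turns image-heavy, Cornfigurator regenerates the plan with more ViT replicas and fewer LLM replicas. It runs on top of K8s native scaling, a natural form of component-level autoscaling.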
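The replica shift described above can be sketched as a per-component sizing rule. The rule (size each component so replicas run at or below a target utilization), the 0.8 target, and the capacity numbers are all assumptions for illustration; the note does not describe how the regenerated plan is computed.

```python
import math


def replicas_needed(qps: float, per_replica_qps: float, target_util: float = 0.8) -> int:
    """Replicas so that each one runs at or below target_util of its capacity."""
    return max(1, math.ceil(qps / (per_replica_qps * target_util)))


# Hypothetical per-replica capacities in requests/s.
CAPACITY = {"vit": 40.0, "llm": 10.0}


def replan(demand: dict[str, float]) -> dict[str, int]:
    """Recompute replica counts per component from current demand."""
    return {name: replicas_needed(qps, CAPACITY[name]) for name, qps in demand.items()}


before = replan({"vit": 100.0, "llm": 20.0})
after = replan({"vit": 200.0, "llm": 6.0})   # traffic turns image-heavy

print("before:", before)
print("after: ", after)
```

Because each component is sized independently, the encoder pool can grow while the LLM pool shrinks, which a monolithic deployment cannot do.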

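§ 08 Comparison with vLLM / SGLang · positioning

vLLM is a backend for a single LLM; Cornserve is an orchestrator for a multimodal mesh

A question that comes up often in the lecture: “how is this different from vLLM?” The answer: Cornserve is not a replacement for vLLM but a layer above it.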

framework | scope | strengths | limitations
vLLM | single-GPU/multi-GPU backend for one LLM | PagedAttention · continuous batching · CUDA graph | no graph-level view of a multimodal pipeline
SGLang | structured generation + LLM backend | RadixAttention · regular expressions / JSON schema · continuous batching | distributed multi-component placement needs a separate layer
Cornserve | distributed orchestration of multimodal application graphs | model fission · component sharing · cross-app cache · auto deployment | the backend itself is delegated to vLLM/SGLang; not a replacement
Triton | NVIDIA's inference server, generic | variety of model formats · ensemble · multiple backends | weak on LLM/multimodal-specific optimization

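Cornserve's place is “the layer that ties several vLLM instances and several ViT instances together so they look like one application”. The backends stay vLLM/SGLang as they are, and graph routing, cache sharing, and auto deployment are added on top.

Why a new framework was needed

vLLM and SGLang grew up under a single-model assumption. The multimodal assumption of “a graph of several models” lies outside that design. Building a whole new layer is cleaner than extending vLLM/SGLang to multimodal. That is why Cornserve started as a separate project at v0.1.0.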

§ 09 Next: the any-to-any challenge · future

When the output is not text, the image/audio decoder is one more place for fission

Multimodal LLMs in 2026 have multimodal outputs too: GPT-4o's voice, Gemini's native image generation. Input-side fission is already understood; output-side fission is the new challenge.

token-level streaming: text output comes in units of LLM tokens, image output in units of diffusion steps, audio in units of vocoder chunks. Expressing all of them in one framework needs a generalized “step” abstraction.
pipelining: for text → audio, the audio decoder can start before the text is finished. Early streaming.
quality control loop: when the output is an image, a separate verifier evaluates it. An agentic direction, left as future work in the lecture.
edge inference: the encoder on the device, the LLM in the cloud. Fission extends naturally to a cloud-edge split.
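The pipelining item above (audio starting before text finishes) maps naturally onto chained generators. The sketch is a toy: `llm_tokens` and `audio_chunks` are hypothetical stand-ins that pass strings, and the chunk size of two tokens is arbitrary.

```python
from typing import Iterator


def llm_tokens(text: str) -> Iterator[str]:
    """Stand-in for an LLM emitting tokens one at a time."""
    yield from text.split()


def audio_chunks(tokens: Iterator[str], tokens_per_chunk: int = 2) -> Iterator[str]:
    """Emit an audio chunk as soon as enough tokens have arrived."""
    buffer: list[str] = []
    for tok in tokens:
        buffer.append(tok)
        if len(buffer) == tokens_per_chunk:
            yield f"audio[{' '.join(buffer)}]"
            buffer = []
    if buffer:                       # flush the tail when the text ends
        yield f"audio[{' '.join(buffer)}]"


stream = audio_chunks(llm_tokens("fission splits one model apart"))
first = next(stream)                 # available after only two tokens
rest = list(stream)
print(first, rest)
```

Because both stages are lazy, the first audio chunk is produced after two tokens rather than after the whole sentence, which is the early-streaming behavior described in the list.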

Cornserve's design fits this future well because the component graph can have arbitrary topology, so an output component is just “another node”. A text-only LLM is one path and multimodal output is another, under the same application abstraction.
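The "output component is just another node" point can be illustrated with a small adjacency map and a path enumerator. The node names and the graph itself are hypothetical, chosen only to show text-only and any-to-any requests as different paths through one graph.

```python
# Hypothetical component graph: keys are nodes, values are downstream nodes.
GRAPH = {
    "text_in": ["llm"],
    "image_in": ["vit"],
    "vit": ["projector"],
    "projector": ["llm"],
    "llm": ["text_out", "image_decoder", "vocoder"],
    "image_decoder": ["image_out"],
    "vocoder": ["audio_out"],
}


def paths(graph, src, dst, prefix=()):
    """Enumerate all paths from src to dst in an acyclic graph."""
    prefix = prefix + (src,)
    if src == dst:
        return [prefix]
    found = []
    for nxt in graph.get(src, []):
        found.extend(paths(graph, nxt, dst, prefix))
    return found


print(paths(GRAPH, "text_in", "text_out"))     # text-only request
print(paths(GRAPH, "image_in", "audio_out"))   # image in, audio out
```

Adding an output modality means adding a node and an edge; the existing paths are unchanged.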

§ 10 Notes to remember and exercises · key takeaways

What should come back within 5 minutes of reopening this note

model fission

Separate the vision encoder · projector · LLM · decoder and put each on a GPU that suits it. Share the same component across applications.

application = component graph

Three abstractions: Component, Application, Route. One path through the graph is the route of one request.

GPU shape per component

ViT on L4 / A10, LLM decode on the HBM bandwidth of an H100. Not everything runs on one GPU type.

cross-app KV cache sharing

Several applications share the same prefix and the same image embedding. Because the application definition is explicit, lookup can be faster than hash comparison (unverified, see § 05).

3 routing trade-offs

load balance · cache locality · queue length. The best instance is chosen dynamically.

Cornfigurator automatic placement

component graph + cluster inventory + SLO → optimal placement. Days of manual work → tens of seconds.

A layer above vLLM

Not a replacement. The backend stays vLLM/SGLang. Only graph orchestration goes on top.

v0.1.0 (2025-11)

Apache-2.0 · K8s native · single-command setup. Still early.

Exercise sequence

Cornserve K8s setup: the repo's single-command install. Bring it up on minikube or a small cluster and run a health check.
Define a LLaVA application: register ViT-L + Qwen2-7B as components, tie them into an application graph, and send a first image+text request.
Verify component sharing: make two applications (LLaVA-7B / LLaVA-13B) share the same ViT-L. Measure GPU memory usage.
Measure cache hits: ask two different questions about the same image. Check whether the second request's latency is half or less of the first.
Cornfigurator dry-run: generate a plan for a hypothetical cluster + application. Compare it with a hand-written plan.

YouTube: L093 lecture video

Repo: cornserve-ai/cornserve · v0.1.0 · Apache-2.0

Paper: Cornserve (ACM CAIS 2026) · Cornfigurator (arXiv 2512.14098)

§ 11 Paths to other lectures · connections

Within the multimodal serving series

L058

vLLM internals

PagedAttention · continuous batching: Cornserve's backend

L062

SGLang and structured generation

Prefix sharing in RadixAttention: background for § 05

L091

RL · Agents · OpenEnv

Another case of operating a multi-component system

L095

Single controller programming with Monarch

Distributed orchestration at the framework level

L080

Multimodal LLM architecture

ViT + projector + LLM at the model level

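§ 12 Open questions · open questions

Things to verify directly the next time I listen

Because the transcript failed, the code examples in this note combine patterns from the repo README with general multimodal-serving domain knowledge. For the actual API, see the example files in the repo.
The exact algorithm of Cornfigurator: ILP vs simulated annealing vs RL. Needs to be checked directly in the arXiv paper.
Throughput comparison numbers: the concrete ratio of Cornserve's gain over using vLLM directly probably appeared in the lecture video (to be confirmed).
Real-world usage: unconfirmed whether anyone had adopted it in production as of the repo's v0.1.0.
Current scope of any-to-any support: need to check whether v0.1.0 supports image output / audio output.
fault tolerance: how the mechanism for rerouting along another path of the graph when one component dies is implemented.

Verification memo

The component shape table in this note (ViT on L4, LLM decode on H100, and so on) is general multimodal-serving domain knowledge. The exact mapping Cornserve recommends needs to be confirmed in the repo examples or the lecture video.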
