《GPU Mode》 L072 2025 · ScaleML Series High priority transcript · failed

Efficient & Effective Long-Context Modeling for LLMs

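When context length grows from 4k to 1M tokens, three things break at once: KV cache memory grows linearly with sequence length, attention compute grows quadratically, and accuracy degrades once positions go beyond the trained range. This lecture by Guangxuan Xiao (author of StreamingLLM) works through the three failure modes in turn: (a) attention memory, (b) position handling, (c) KV cache management. It starts from StreamingLLM's attention sink finding and continues through sliding windows, position interpolation, and KV eviction, giving a full map of long-context engineering.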

long context · streaming attention · attention sink · RoPE scaling · YaRN · NTK · KV cache · FlashAttention

Speaker

Guangxuan Xiao

MIT · author of StreamingLLM and SmoothQuant

Lecture number

L072

Study priority

High · read closely

Series

ScaleML

Contents · 12 sections

01 The problem this lecture tackles · why long context
02 Why long context is being pushed · use cases
03 Saving attention memory · FlashAttention family
04 Streaming attention · attention sink
05 Position handling (RoPE family) · NTK · YaRN
06 KV cache management · eviction · paging
07 Retrieval vs in-context processing · RAG vs long context
08 Measured results · benchmark results
09 Limitations · where it breaks
10 Notes to remember · key takeaways
11 Paths to other lectures · connections
12 Open questions

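§ 01 The problem this lecture tackles · Why long context

Where the real cost of "4k → 1M context" lies

Frontier models in 2024-2025 advertise 200k-1M context windows as standard: Gemini 1.5 Pro at 1M, Claude 3 Sonnet at 200k. What do these numbers mean, and which tricks made them possible? The lecture frames the problem by separating three failure modes.

memory blow-up: the attention KV cache grows linearly with sequence length N, without bound. At 1M tokens it reaches tens to hundreds of GB depending on the model (these notes quote ~50GB; needs verification).
compute blow-up: attention itself is O(N²). Attention over 1M tokens costs about 1M× as much as over 1k tokens.
position generalization failure: a model trained at 4k hallucinates on 100k-token input, because the RoPE rotation angles at those positions are out of distribution (OOD).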
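
The scaling claims in the list above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and it counts only the N² token-pair term of attention and the linear growth of the KV cache, ignoring constants.

```python
def attention_cost_ratio(n_long: int, n_short: int) -> float:
    """Ratio of pairwise attention work, which scales as N**2."""
    return (n_long / n_short) ** 2


def kv_cache_ratio(n_long: int, n_short: int) -> float:
    """Ratio of KV cache size, which scales linearly in N."""
    return n_long / n_short


print(attention_cost_ratio(1_000_000, 1_000))   # 1M vs 1k tokens
print(attention_cost_ratio(1_000_000, 4_000))   # 1M vs 4k tokens
print(kv_cache_ratio(1_000_000, 4_000))         # KV cache growth, 4k -> 1M
```

The first two ratios come out to 1,000,000 and 62,500, while the KV cache grows only 250× over the same 4k → 1M step. That gap is why memory and compute are treated as separate failure modes.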

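The lecture's position: "All three must be solved for long context to work. Solve only one or two and it breaks along the remaining dimension."

The lecture's conceptual frame

Long context is not a single trick but the sum of three layers: memory (FlashAttention, sliding window), position (RoPE scaling), and KV management (eviction, paging). The three are not orthogonal; they interact, and one trick can break another's assumptions. That is why it pays to study them together.

"The magic of long context is not one trick. It holds only when memory, position, and KV management all work at the same time." (Reconstructed from lecture §1 · Guangxuan)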

§ 02 Why long context is being pushed · Use cases

Why people pay for 1M context: four categories of application

There is a reason "context length = model capability" has become a common association. It is not just marketing: new applications genuinely require long context.

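application 1 · codebase analysis

the whole repo

Feed every file of a repo (hundreds of thousands of tokens) to the model as input: code review, refactoring suggestions, bug hunting. IDE tools such as Cursor and Continue are making this standard. A 1M-token window holds a mid-sized repo in full.

application 2 · document summarization

long PDFs, books, bundles of papers

A 200-page PDF or an entire book: research synthesis, legal document review, financial filing analysis. This competes with RAG, since long context sidesteps RAG's retrieval errors.

application 3 · conversation memory

a user's entire history

The conversation log with a user is fed back in on every turn: personal assistants, long-running agents. It combines the in-memory model with the conversation history.

application 4 · multimodal

video, audio

When video is tokenized frame by frame, an hour of video becomes hundreds of thousands of tokens: video understanding, audio transcription plus analysis. This is where multimodal long context really comes into play.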

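Why bother when RAG exists

A common objection: "Why not retrieve with RAG and query with a short context?" The lecture gives two answers. (a) There are many cases where RAG's retrieval fails, such as cross-document reasoning. (b) Long context and RAG are complements: RAG narrows down the candidates and long context reasons over that candidate set. §07 covers this in depth.

"Long context and RAG are not competitors. They are two sides of the same problem (long tasks)." (Study notes · §2)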

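§ 03 Saving attention memory · FlashAttention family

The first trick that breaks "O(N²) memory": streaming via tiling

Standard attention needs O(N²) memory because it computes the N×N matrix QKᵀ and applies softmax to it. The core idea of FlashAttention (Dao et al., 2022, 2023) is to never materialize this matrix explicitly and instead stream over it tile by tile. Memory drops to O(N), and it also runs faster because it saves round trips between HBM and SRAM.

The core idea of FlashAttention (simplified).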

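import torch
# Standard attention: materializes the full [N, N] score matrix
def standard_attention(Q, K, V):
    S = Q @ K.T                        # [N, N]: memory blows up
    return torch.softmax(S, dim=-1) @ V

# FlashAttention-style tiling (simplified: no 1/sqrt(d) scaling, no causal mask)
def tiled_attention(Q, K, V, Br=64, Bc=64):
    O = Q.new_zeros((Q.shape[0], V.shape[1]))
    for i in range(0, Q.shape[0], Br):
        Q_i = Q[i:i + Br]
        m = Q.new_full((Q_i.shape[0], 1), float("-inf"))   # running row max
        l = Q.new_zeros((Q_i.shape[0], 1))                  # running softmax normalizer
        acc = Q.new_zeros((Q_i.shape[0], V.shape[1]))       # unnormalized output
        for j in range(0, K.shape[0], Bc):
            S_ij = Q_i @ K[j:j + Bc].T                      # [Br, Bc]: small
            m_new = torch.maximum(m, S_ij.max(dim=-1, keepdim=True).values)
            P_ij = torch.exp(S_ij - m_new)
            scale = torch.exp(m - m_new)                    # rescale old partial sums
            l = scale * l + P_ij.sum(dim=-1, keepdim=True)
            acc = scale * acc + P_ij @ V[j:j + Bc]
            m = m_new
        O[i:i + Br] = acc / l                               # final normalization
    return O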

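The key is that online softmax updates the normalizer in a streaming fashion. Only one tile is held in SRAM at a time and its result is accumulated immediately. As a result, the full N×N matrix is never materialized; SRAM only ever holds a block of one tile's size.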

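The evolution of FlashAttention

FA-1 (2022): tiling + online softmax. The paper reports roughly 2-4× speedups over standard attention, with memory that scales linearly rather than quadratically in sequence length.
FA-2 (2023): improved parallelism and work partitioning. Faster forward and backward, roughly 2× over FA-1.
FA-3 (2024): exploits new H100 hardware features (TMA, asynchronous WGMMA). Roughly 1.5-2× over FA-2.
FlashDecoding: single-query attention for inference, parallelized over the KV cache.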

What it fundamentally means

FlashAttention's algorithmic complexity is still O(N²): every token pair is computed. Only memory becomes O(N); compute is unchanged. So even with FlashAttention, wall-clock time at 1M context grows quadratically.
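
To make the memory side concrete, the sketch below compares the bytes needed to materialize a full score matrix with the bytes for a single tile. The 128×128 tile and 2-byte (fp16) elements are illustrative assumptions, not FlashAttention's actual block sizes, and the figures are per head and per batch element.

```python
def score_matrix_bytes(n_tokens: int, dtype_bytes: int = 2) -> int:
    """Bytes for one materialized [N, N] attention score matrix."""
    return n_tokens * n_tokens * dtype_bytes


def tile_bytes(block_rows: int = 128, block_cols: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes for one [Br, Bc] score tile held on-chip."""
    return block_rows * block_cols * dtype_bytes


for n in (4_096, 131_072, 1_000_000):
    print(n, score_matrix_bytes(n), tile_bytes())
```

At 1M tokens the full matrix would take 2 × 10¹² bytes (about 2 TB) per head, while the tile stays at 32 KiB regardless of N. The number of tiles to visit still grows as N², which is the compute cost that remains.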

FlashAttention alone does not get you to 1M, because compute is quadratic. The next step is the realization that not every token pair needs to be computed: sliding window / sparse attention. See §04.

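§ 04 Streaming attention · Attention sink discovery

StreamingLLM: the two-line trick that enables unbounded length

This is the core contribution of the lecture. StreamingLLM (Xiao et al., 2023) observed that a plain sliding window abruptly breaks a trained model, and found that the KV of the very first tokens (such as BOS) acts as an attention sink. The fix is simple: keep the KV of the first 4 tokens, apply a sliding window to the rest, and generation over effectively unbounded length becomes possible.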

FIG · Four attention patterns: full causal (O(N²) cost) · sliding window (naive version breaks) · sink + window ★ (StreamingLLM) · dilated (Longformer family)

A plain sliding window makes perplexity explode on a trained model. Adding back just the first 4 tokens as attention sinks restores normal behavior. That is the StreamingLLM finding.
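
A minimal sketch of the sink + window cache policy in plain Python. The function names and the small default sizes are illustrative assumptions, not the API of the StreamingLLM repo. The second helper reflects the paper's description that positions are assigned by slot within the cache rather than by index in the original text.

```python
def streaming_kv_indices(seq_len: int, n_sink: int = 4, window: int = 8) -> list:
    """Original token indices whose KV entries stay in the cache."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))            # nothing to evict yet
    sinks = list(range(n_sink))                # first tokens: attention sinks
    recent = list(range(seq_len - window, seq_len))  # sliding window of recent tokens
    return sinks + recent


def cache_positions(kept: list) -> list:
    """Position ids by slot in the cache, not by original token index."""
    return list(range(len(kept)))


kept = streaming_kv_indices(seq_len=20)
print(kept)
print(cache_positions(kept))
```

With 20 tokens seen, the cache holds tokens 0-3 plus tokens 12-19, and the cache size stays at `n_sink + window` however long the stream runs.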

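Why the attention sink works

The key is softmax normalization. Even when the current token has nowhere useful to attend, softmax must still distribute 100% of its weight somewhere. So during training the model learns to use the first token as a place to park attention that carries no meaning; the first token acts as a dump (a trash can). If it is removed, softmax is forced to over-attend to other, meaningful tokens and the distribution breaks.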
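
The redistribution effect can be seen on a toy example. The logits below are hypothetical numbers chosen for illustration, with the first entry standing in for a sink token that has learned a large logit.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: index 0 is the sink, the rest are ordinary tokens.
logits = [4.0, 0.1, 0.0, -0.2, 0.3]

with_sink = softmax(logits)
without_sink = softmax(logits[1:])   # sink evicted from the window

print(with_sink)
print(without_sink)
```

With the sink present, it absorbs over 90% of the weight and every other token gets a small share. Once it is evicted, the same tokens each receive several times more weight, although their logits did not change. That shift in the attention distribution is what the model never saw during training.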

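Training vs inference

StreamingLLM is an inference-time trick that needs no fine-tuning. With a setting such as 4 sinks + a 4k window, perplexity stays stable out to 1M tokens. However, it cannot capture true long-range dependencies (the limitation of a sliding window): anything evicted from the window is gone. It suits scenarios where the relevant content sits mostly in the recent part of the context.

"The model learns to use the first token as a trash can. That single fact unlocks unbounded length." (Lecture §4 · the StreamingLLM finding)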

§ 05 Position handling (RoPE family) · NTK · YaRN · linear scaling

Tricks for stretching a RoPE trained at 4k out to 100k

Most models use RoPE (Rotary Position Embedding). Beyond the maximum position seen in training, the RoPE rotation angles become OOD and the model suddenly breaks down. This is a family of tricks that address the problem at inference time or with light fine-tuning.

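FIG · Comparison of RoPE scaling methods (6 methods)

method | fine-tune | 2-4× extension | 10×+ extension | implementation
naive (no scaling) | N/A | breaks | breaks | default
linear (PI) | required | OK | weak | division
NTK-aware | optional | OK | medium | change base
YaRN | light | OK | strong | NTK + ramp
LongRoPE | required | OK | strong | search
ABF (adjusted base frequency) | required | OK | strong | retrain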

YaRN combines NTK-style scaling with a ramp applied across a subset of dimensions. It is among the most widely used long-context fine-tuning methods and holds up well at large extension ratios.

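Intuition: linear vs NTK

Position Interpolation (PI): compresses every RoPE frequency by the same ratio. For 4k → 16k, every frequency is slowed down 4×. Resolution is lost in every dimension.
NTK-aware: leaves high frequencies (short-range patterns) almost untouched and compresses mainly the low frequencies (long-range patterns). Short-range patterns are preserved while the range is extended.
YaRN: adds a ramp and a temperature on top of NTK. More robust.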
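
The difference between PI and NTK-aware scaling shows up directly in the per-dimension rotation frequencies. The sketch below uses the commonly cited NTK-aware rule of replacing the base with base · s^(d/(d-2)). The function names are illustrative, and YaRN's ramp and temperature are deliberately left out since these notes do not give the exact formulas.

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list:
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def pi_inv_freq(head_dim: int, scale: float, base: float = 10000.0) -> list:
    """Position Interpolation: dividing positions by `scale` is the same
    as dividing every frequency by `scale`."""
    return [f / scale for f in rope_inv_freq(head_dim, base)]


def ntk_inv_freq(head_dim: int, scale: float, base: float = 10000.0) -> list:
    """NTK-aware scaling: enlarge the base so high frequencies barely move
    and the lowest frequency is slowed by the full factor `scale`."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_inv_freq(head_dim, new_base)


orig = rope_inv_freq(128)
pi = pi_inv_freq(128, scale=4.0)
ntk = ntk_inv_freq(128, scale=4.0)
print(orig[0], pi[0], ntk[0])      # highest frequency
print(orig[-1], pi[-1], ntk[-1])   # lowest frequency
```

For the highest frequency, PI divides by 4 while NTK-aware leaves it unchanged. For the lowest frequency, both end up at one quarter of the original, so the long-range dimensions are stretched equally under either scheme.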

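Practical guide

For up to about 4× extension, NTK alone is often enough (e.g. Llama 2 4k → 16k). For 8×+ extension, use YaRN plus light fine-tuning. For 32×+ extension, use LongRoPE / ABF plus substantial fine-tuning. Comparing perplexity when the same target length is reached with different tricks is the standard benchmark in the papers.

"Position embedding is an OOD problem at inference time. Every RoPE scaling trick is a way of moving OOD back in-distribution." (Study notes · §5)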

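§ 06 KV cache management · Eviction · paging · compression

How to carry around the KV cache of a 1M-token context

In long context the largest memory burden is the KV cache. Its size is context length × num_layers × num_kv_heads × head_dim × 2 (K, V) × dtype_bytes; without GQA, num_kv_heads × head_dim equals hidden_size. For a Llama-2-70B-style configuration (80 layers, 8 KV heads, head_dim 128, fp16) that is about 0.33 MB per token, or roughly 330 GB at 1M tokens. It does not fit on a single GPU.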
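
A small calculator for the KV cache formula above. It takes the per-layer KV width as num_kv_heads × head_dim, which equals hidden_size only when every query head has its own KV head. The configuration values in the example (80 layers, 64 query heads, 8 KV heads, head_dim 128) are an assumed Llama-2-70B-like setup and should be checked against the actual model config.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Total KV cache size in bytes for one sequence (K and V, all layers)."""
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama-2-70B-like configuration, fp16 cache.
gqa = kv_cache_bytes(1_000_000, num_layers=80, num_kv_heads=8, head_dim=128)
mha = kv_cache_bytes(1_000_000, num_layers=80, num_kv_heads=64, head_dim=128)

print(f"GQA (8 KV heads):  {gqa / 1e9:.1f} GB")
print(f"MHA (64 KV heads): {mha / 1e9:.1f} GB")
```

Under these assumptions the cache is about 328 GB with GQA and about 2.6 TB if every head kept its own KV. This is why the architecture-level sharing in strategy 4 matters as much as the runtime strategies.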

strategy 1 · eviction

discard less important KV

H2O (Heavy-Hitter Oracle), TOVA, Scissorhands. Measure importance from attention scores, then remove the KV entries with low scores. The eviction policy is the algorithm.

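strategy 2 · paging (vLLM)

manage KV in fixed-size blocks, swap to CPU when needed

PagedAttention, block-level allocation: the GPU version of an OS's virtual memory. The KV cache lives in fixed-size blocks mapped through a per-sequence block table, so memory is allocated on demand with little fragmentation. vLLM can also swap the blocks of preempted sequences out to CPU memory. Swapping has a cost, and paging does not by itself shrink the cache: total length is still bounded by available memory.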
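
The block-table idea can be sketched as a toy allocator. `PagedKVCache` and its methods are hypothetical names for illustration only; this is not vLLM's implementation or API, and it tracks only block ids, not the actual K/V tensors.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps logical token slots to
    fixed-size physical blocks through its own block table."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:        # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")      # 3 tokens -> 2 blocks
cache.append_token("b")          # 1 token  -> 1 block
print(cache.block_tables, cache.free_blocks)
```

Blocks are handed out only when a sequence actually needs them and need not be contiguous, which is the property that keeps fragmentation low.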

strategy 3 · quantization

KV in INT4/INT8

KIVI, KVQuant. Store the KV itself at low precision, e.g. 4-bit: about 4× less memory than fp16 with accuracy largely preserved. Quantization itself is the main topic of L073.
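
As a rough illustration of what storing KV at low precision involves, here is symmetric absmax quantization of a single vector. This is a generic textbook scheme used as a stand-in; it is not the KIVI or KVQuant algorithm, and the function names are illustrative.

```python
def quantize_row(row, bits: int = 8):
    """Symmetric absmax quantization of one vector to signed integers."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8, 7 for int4
    absmax = max(abs(x) for x in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return q, scale


def dequantize_row(q, scale):
    """Map integers back to approximate float values."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]                   # hypothetical K or V entries
for bits in (8, 4):
    q, scale = quantize_row(row, bits)
    print(bits, q, dequantize_row(q, scale))
```

Each stored value costs `bits` bits plus one shared scale per vector, and the reconstruction error per element is at most half the scale. Fewer bits means a larger scale and therefore a larger error, which is the memory/accuracy trade-off.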

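strategy 4 · compression

low-rank, group sharing

GQA (grouped-query attention), MLA (multi-head latent attention, DeepSeek-V2). In GQA several query heads share one KV head; MLA compresses KV into a low-rank latent. Both are architecture choices made at training time.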

The core idea of H2O

Heavy-Hitter Oracle: not all tokens are equally important. A subset of tokens (heavy hitters) keeps receiving strong attention from other tokens. At eviction time, track the accumulated attention score each token has received and drop the tokens with the lowest scores.

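import torch

# Simplified H2O-style eviction (the real H2O also always keeps the most recent tokens)
def evict_kv(K_cache, V_cache, attention_scores, budget):
    # attention_scores: [num_queries, seq_len]; K_cache, V_cache: [seq_len, d]
    # Accumulated attention score each token has received (from all queries)
    accumulated = attention_scores.sum(dim=0)  # [seq_len]
    # Keep the `budget` highest-scoring tokens, i.e. evict the lowest-scoring ones
    budget = min(budget, accumulated.numel())
    keep_idx = accumulated.topk(budget).indices.sort().values  # restore original order
    return K_cache[keep_idx], V_cache[keep_idx]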

The trade-off of eviction

If eviction gets it wrong, information needed later is lost for good. On recall-sensitive tasks (needle in a haystack), attention-score eviction such as H2O can break. Paging loses no information but adds swap latency. Which is better depends on the use case.

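§ 07 Retrieval vs in-context processing · RAG vs long context

"We already have long context, so why do we need RAG?" And the reverse

The two approaches are often portrayed as competitors, but the lecture's position is that they are complements. Each is strong in a different area.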

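FIG · Strengths and weaknesses of RAG vs long context (6 axes)

axis | RAG | long context | hybrid | note
total cost | low | high | medium | RAG wins
cross-doc reasoning | weak | strong | strong | long ctx wins
recall (needle) | risk of retrieval miss | does not miss | does not miss | long ctx wins
latency (single Q) | low | high | medium | RAG wins
freshness | strong (real time) | trained data only | strong | RAG's added value
long reasoning chain | weak | strong | strong | long ctx wins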

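The current frontier trend is a hybrid of RAG + long context: RAG narrows the corpus down to about 100k tokens of candidates and long context reasons within them, combining the two dimensions.

This framing is the starting point for application design. Do not commit to "RAG only" or "long context only" from the outset. A hybrid mixed in proportions that fit the task is almost always best.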
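
The hybrid pattern can be sketched as "retrieve up to a token budget, then hand the whole candidate set to the long-context model". The scoring below is plain keyword overlap and the token count is a whitespace split, both crude stand-ins for a real retriever and tokenizer; `select_candidates` is a hypothetical helper name.

```python
def select_candidates(query: str, docs: list, token_budget: int) -> list:
    """Pick the best-matching documents that fit into the context budget."""
    q_terms = set(query.lower().split())

    def score(doc: str) -> int:
        return len(q_terms & set(doc.lower().split()))

    chosen, used = [], 0
    for doc in sorted(docs, key=score, reverse=True):
        n_tokens = len(doc.split())            # stand-in for a real tokenizer
        if score(doc) == 0 or used + n_tokens > token_budget:
            continue
        chosen.append(doc)
        used += n_tokens
    return chosen


docs = [
    "kv cache eviction policy",
    "rope scaling with yarn",
    "weather report for tuesday",
]
context = select_candidates("how does kv cache eviction work", docs, token_budget=8)
prompt = "\n\n".join(context) + "\n\nQuestion: how does kv cache eviction work"
print(prompt)
```

The design choice is in `token_budget`: a small budget behaves like classic RAG, while a budget near the model's effective context lets the model do the cross-document reasoning itself.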

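§ 08 Measured results · Benchmark results

"Does 1M context really work?": needle, RULER, LongBench

Real evaluation of long context is not plain perplexity. It asks whether the model can retrieve a specific piece of information and whether it can combine several facts. The standard benchmark set: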

Needle In A Haystack (NIAH): hide a single fact in a long context and ask a question about it. Measures retrieval. The simplest lower bound.
RULER: more complex variants of the needle test: multi-needle, multi-key, common-words. More realistic.
LongBench: long-context variants of diverse NLP tasks: summarization, QA, code completion, and so on.
InfiniteBench: tasks of 100k tokens and above. Truly long.
L-Eval, ZeroSCROLLS: standard academic benchmarks.
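
A needle-in-a-haystack prompt is simple enough to build by hand. The sketch below is an assumed minimal setup, not the format of any particular benchmark: it places the needle at a fractional depth in a list of filler sentences and scores an answer by substring match.

```python
def build_niah_prompt(filler_sentences: list, needle: str,
                      depth: float, question: str) -> str:
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    idx = round(depth * len(filler_sentences))
    sentences = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(sentences) + "\n\nQuestion: " + question


def is_correct(answer: str, expected: str) -> bool:
    """Lenient scoring: the expected string appears in the answer."""
    return expected.lower() in answer.lower()


filler = ["The sky was grey.", "Trains ran late.", "Coffee was served.", "Rain fell."]
prompt = build_niah_prompt(filler, "The passcode is 7421.", 0.5, "What is the passcode?")
print(prompt)
```

Sweeping `depth` over several values and the filler length over several context sizes gives the usual depth × length accuracy grid. Comparing depths near 0.0, 0.5, and 1.0 is also the simplest way to probe for position-dependent misses.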

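The overall shape of the results

Patterns commonly found in measurements (needs verification).

"Lost in the middle": models attend well to the start and end of the context but tend to miss information in the middle. Reported across many models.
Accuracy falls as context length grows: even when 1M is advertised, strong degradation beyond about 800k is common.
Faster decline on multi-needle: models that handle a single needle well can still fail at 5 needles. Real reasoning is still weak.
Frontier closed models (Gemini, Claude) are consistently better than open ones: long context in open models is still catching up.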
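
One way to condense a length sweep into a single number is to report the longest context at which accuracy still clears a threshold. The sketch below assumes a 90% threshold, and the accuracy values are hypothetical illustration data, not measurements.

```python
def effective_context(accuracy_by_length: dict, threshold: float = 0.9) -> int:
    """Longest tested length such that accuracy stays at or above
    `threshold` at that length and at every shorter tested length."""
    effective = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        effective = length
    return effective


# Hypothetical sweep results for a model advertised at 1M context.
sweep = {4_000: 0.98, 32_000: 0.95, 128_000: 0.91, 256_000: 0.82, 1_000_000: 0.60}
print(effective_context(sweep))
```

For this made-up sweep the function returns 128,000, well under the advertised 1M. Stopping at the first failure, rather than taking the maximum passing length, avoids crediting a model for a lucky result at one long length.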

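advertised context vs effective context

The advertised context length and the context length that actually works well are different things. On benchmarks such as RULER, the length up to which accuracy stays above a threshold (these notes use 90%) is treated as the "effective context". It is usually half the advertised length or less.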

§ 09 Limitations · Where it breaks

What the tricks so far do not solve

cross-document reasoning: multi-needle, multi-fact. Attention sinks and sliding windows alone are not enough. On tasks that need true long-range dependencies, RAG may even be stronger.
lost in the middle: attributed here to the frequency distribution seen during training. Still active research.
compute cost: FlashAttention solved memory, but compute is still quadratic. Attention over 1M tokens costs 62,500× that over 4k.
lack of training data: high-quality training data at 100k+ length is scarce. Concatenating short documents does not teach true long-range behavior.
evaluation is hard: long-context eval is hard for humans to verify. Synthetic evals such as NIAH are brittle.

"An advertised 1M context is only a 1M that works for single-fact retrieval. Reasoning over a true 1M tokens is still frontier." (Study notes · §9)

The next frontier: linear / sub-quadratic attention, with alternatives such as Mamba, RWKV, RetNet, and Linear Attention. Or a combination with MoE, so that only some experts are active as length grows.

§ 10 Notes and code to remember · Key takeaways

What should come back within 5 minutes of reopening these notes

three failure modes

memory · compute · position. All three must be solved for long context to work.

FlashAttention

Tiling + online softmax gives O(N) memory. Compute is still O(N²).

attention sink

The first 4 tokens act as a "trash can". Sliding window + sink gives unbounded length.

RoPE scaling

The ladder linear (PI) → NTK → YaRN. Large extension ratios need fine-tuning.

KV cache strategies

eviction (H2O) · paging (vLLM) · quantization (KIVI) · compression (GQA, MLA).

RAG vs long context

Not competitors. RAG narrows the candidates, long context reasons over them. Hybrid.

lost in the middle

Good at the start and end, weak in the middle. Seen across many models. Attributed to training frequency.

advertised vs effective

Advertising 1M context ≠ 1M context working. Measure effective length with evals such as RULER.

YouTube youtube.com/watch?v=DFcKFDt0QEg

Slides StreamingLLM.pdf

Repo gpu-mode/lectures/lecture_072 · mit-han-lab/streaming-llm

Related papers FlashAttention (Dao), StreamingLLM (Xiao), YaRN (Peng), H2O (Zhang), PagedAttention/vLLM

Hands-on: practice sequence

Run StreamingLLM yourself: start with the repo's hello world. Feed a 100k-token input with 4 sinks + a 4k window and check perplexity. Compare against a naive sliding window.
Build your own NIAH: insert a one-line needle into a long input and ask the model to retrieve it. Sweep context length and plot accuracy.
FlashAttention vs eager attention: swap the backend of PyTorch's scaled_dot_product_attention. Compare memory and latency on the same input.
YaRN fine-tune: lightly fine-tune a small Llama from 4k → 16k with YaRN. Measure NIAH scores before and after.
vLLM PagedAttention: run long-context inference once with vLLM and see first-hand how paged KV allocation helps avoid OOM.
KV cache profiling: measure KV cache memory usage on a long input, using nvidia-smi or the PyTorch memory profiler.
Reproduce lost-in-the-middle: place the same fact at the start / middle / end and compare retrieval accuracy. Check for the same pattern as in the paper.

§ 11 Paths to other lectures · Connections

The attention / inference / quantization family

L071

FlexOlmo

ScaleML series companion. The data side

L073

Quantization in LMs

ScaleML series companion. Applies directly to KV cache quantization

KV sharing in long-context inference

L018

Fusing kernels

The fused kernel behind FlashAttention, in depth

L001

profile CUDA kernels

Profiling tools for attention kernels

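§ 12 Open questions

Places these notes deliberately leave blank

Where Mamba / linear attention fits: need to check whether the lecture covers it. A natural next topic, as an alternative to quadratic attention.
The exact YaRN formulas: these notes stay at the level of analogy. The exact ramp/temperature formulas in the paper need separate reading.
Multimodal long context: long context for video/audio. The lecture touches on it as an application in §02, but the detailed mechanism is a separate topic.
Guangxuan's later work: what came after StreamingLLM, and related KV optimizations such as Quest and MInference. Only partly mentioned in these notes.
Measured numbers: perplexity by context length, NIAH accuracy, and so on. Checking the paper tables directly is recommended.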

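Verification note

These notes synthesize general long-context engineering knowledge with StreamingLLM's core findings. The exact wording of the lecture video should be checked once the transcript is restored. The RoPE scaling figures (4× / 8× / 32×) are general domain knowledge rather than numbers from the lecture.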
