LESSON 02 · 2026.04.18 · T4

The true cost of copies — pageable vs pinned

Same kernel, same GPU. Switching only the host-side allocation from pinned to pageable made end-to-end time 5.6× longer, and D2H bandwidth alone dropped 12.7×. This lesson traces that gap to its source using the measurements and the page table.

GPU: T4 (sm_75) · sweep: 16 runs · PCIe Gen3 x16 effective: ~13 GB/s

Experiment

I added --pinned / --pageable flags to src/vector_add.cu and branched host-buffer allocation between cudaMallocHost and new float[]. Same GPU path — identical kernel, only host memory changes.

Transfer bandwidth plateau

direction	pinned	pageable	ratio
H2D	~12.3 GB/s	~4.6 GB/s	2.7×
D2H	~13.1 GB/s	~1.03 GB/s	12.7×

End-to-end total time

n	pinned	pageable	ratio
2²⁰	1.12 ms	6.84 ms	6.1×
2²²	4.28 ms	24.09 ms	5.6×
2²⁴	16.87 ms	95.03 ms	5.6×
2²⁶	67.38 ms	375.69 ms	5.6×
2²⁸	270.56 ms	1519.26 ms	5.6×

Kernel time is identical in both modes. The entire gap comes from the host-side copies.
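As a rough cross-check, the end-to-end table can be approximated from the bandwidth table alone. The sketch below is a back-of-the-envelope model, not the lesson's benchmark code. It assumes the vector add copies two input arrays to the device and one output array back, and it treats 1 GB as 1e9 bytes. The names `HostMode` and `copy_time_ms` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Measured transfer bandwidths in GB/s (1 GB = 1e9 bytes), from the bandwidth table.
struct HostMode {
    double h2d_gbps;
    double d2h_gbps;
};

constexpr HostMode kPinned{12.3, 13.1};
constexpr HostMode kPageable{4.6, 1.03};

// Estimated copy time in milliseconds for a vector add over n floats.
// Assumption: two input arrays go host-to-device, one output array comes back.
inline double copy_time_ms(std::size_t n, HostMode mode) {
    assert(mode.h2d_gbps > 0.0 && mode.d2h_gbps > 0.0);
    const double bytes_per_array = static_cast<double>(n) * sizeof(float);
    const double h2d_s = 2.0 * bytes_per_array / (mode.h2d_gbps * 1e9);
    const double d2h_s = bytes_per_array / (mode.d2h_gbps * 1e9);
    return (h2d_s + d2h_s) * 1e3;
}
```

For n = 2²⁸ this model gives about 257 ms for pinned and about 1509 ms for pageable, against measured totals of 270.56 ms and 1519.26 ms. The leftover 10 to 14 ms is consistent with kernel time plus fixed overhead, although the model itself does not measure that.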

Why only D2H is 12.7× — page faults

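H2D reads pages that the init loop has already touched, so the copy triggers no faults. What remains is the extra CPU copy through the driver's pinned staging buffer, hence 2.7×.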

D2H is the opposite. For a large n, new float[n] only reserves virtual addresses; physical pages are not committed until the first write. When the driver CPU-memcpys from its staging buffer into the pageable output buffer, every page takes a demand-zero fault. With 4 KiB pages, a 1 GB buffer means roughly 262k page faults.
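The 262k figure is simple arithmetic once a page size is fixed. A minimal sketch, assuming 4 KiB base pages (the usual default on x86-64 Linux); `page_count` is a hypothetical helper, not part of the lesson's source:

```cpp
#include <cassert>
#include <cstddef>

// Number of pages spanned by a buffer of `bytes` bytes, rounding up.
// Assumption: 4 KiB pages unless the caller passes another page size.
inline std::size_t page_count(std::size_t bytes, std::size_t page_bytes = 4096) {
    assert(page_bytes > 0);
    return (bytes + page_bytes - 1) / page_bytes;
}
```

`page_count(std::size_t{1} << 30)` gives 262144 pages for a 1 GiB buffer. If the allocation were backed by 2 MiB huge pages instead, the same buffer would span only 512 pages, so the fault count depends heavily on the system's paging configuration.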

Even when pageable is unavoidable, pre-touch your output buffer. A single memset makes D2H several times faster.
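One way to pre-touch is sketched below. It swaps the memset for a single volatile store per page, so the touch cannot be optimized away and only one byte per page is written. A plain memset, as suggested above, achieves the same effect. The helper name `pretouch` and the 4 KiB default page size are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Write one byte in every page of [buf, buf + bytes) so the OS commits
// physical pages now instead of during the device-to-host copy.
// Call this only before the buffer holds data you need: touched bytes become 0.
inline void pretouch(void* buf, std::size_t bytes, std::size_t page_bytes = 4096) {
    assert(buf != nullptr || bytes == 0);
    assert(page_bytes > 0);
    volatile unsigned char* p = static_cast<volatile unsigned char*>(buf);
    for (std::size_t off = 0; off < bytes; off += page_bytes) {
        p[off] = 0;  // one store per page is enough to trigger the fault
    }
    if (bytes > 0) {
        p[bytes - 1] = 0;  // covers the final page when buf is not page-aligned
    }
}
```

For an output buffer allocated as `float* out = new float[n];`, the call would be `pretouch(out, n * sizeof(float));` right after allocation and before the first D2H copy.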

T4 with pinned memory hits PCIe Gen3's effective ceiling

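Pinned transfers reach 12.3 GB/s H2D and 13.1 GB/s D2H. PCIe Gen3 x16 is about 16 GB/s in theory (15.75 GB/s after 128b/130b encoding) and roughly 13 GB/s in practice, so there is nothing left to squeeze out of the copy itself. Three levers remain: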

Overlap — streams + async copy to overlap kernel and transfer
Elimination — persistent device buffer, unified memory, zero-copy
Reduction — kernel fusion shrinks the round trip itself

This is the background for why vLLM keeps its KV cache resident on the GPU and why operator fusion pays off so well.
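To get a feel for how much the overlap lever can buy, here is an idealized pipeline model. It is a sketch, not a measurement, and rests on several assumptions: the work splits into equal chunks, each stage scales linearly with chunk size, there is no per-chunk setup cost, and the device can run an H2D copy, a kernel, and a D2H copy concurrently. `overlapped_ms` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>

// Idealized makespan of a three-stage pipeline (H2D copy, kernel, D2H copy)
// when the work is split into `chunks` equal pieces and the stages can run
// concurrently on different pieces. Inputs are the unsplit stage times in ms.
inline double overlapped_ms(double h2d_ms, double kernel_ms, double d2h_ms, int chunks) {
    assert(chunks >= 1);
    const double serial = h2d_ms + kernel_ms + d2h_ms;
    const double slowest = std::max({h2d_ms, kernel_ms, d2h_ms});
    // The first chunk passes through all three stages; every later chunk
    // adds one more slot of the slowest stage.
    return (serial + (chunks - 1) * slowest) / chunks;
}
```

With illustrative stage times of 174.6 ms H2D, 14.0 ms kernel, and 82.0 ms D2H (roughly what the pinned bandwidths imply at n = 2²⁸ if two arrays go in and one comes out), 16 chunks gives about 180.6 ms instead of 270.6 ms serial. The floor is the slowest stage, here the H2D copy, which is why the other two levers matter once overlap is in place. In CUDA, asynchronous copies only overlap like this when the host buffers are pinned.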

Small n is a different world

For n ≤ 2¹⁸, the kernel is stuck at the launch-overhead floor (~4–7 µs), so "bandwidth %" stops meaning anything. At n = 2¹⁸, apparent bandwidth reads 140% of theoretical, a spurious figure produced by L2 cache hits. The crossover into truly DRAM-bound territory (the T4 uses GDDR6, not HBM) sits around n ≈ 2²⁰ (4 MB per array).
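The over-100% reading falls out of how apparent bandwidth is computed: nominal bytes divided by kernel time, regardless of where those bytes were actually served from. A minimal sketch, assuming two float reads and one float write per element; `apparent_gbps` is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>

// Apparent bandwidth of a vector add in GB/s: the bytes the kernel nominally
// moves divided by the measured kernel time.
// Assumption: two float reads and one float write per element.
inline double apparent_gbps(std::size_t n, double kernel_us) {
    assert(kernel_us > 0.0);
    const double bytes = 3.0 * static_cast<double>(n) * sizeof(float);
    return bytes / (kernel_us * 1e-6) / 1e9;
}
```

Taking an illustrative kernel time of 7 µs at n = 2¹⁸, this gives about 449 GB/s, which is roughly 140% of the T4's published 320 GB/s peak. The three arrays total only 3 MiB at that size, so the figure reflects cache traffic and launch overhead rather than DRAM throughput.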

Structural one-liner

For memory-bound workloads the biggest lever isn't the kernel, it's the data path. Your kernel can sit at 70% of theoretical bandwidth and end-to-end still runs 5× slower if the host side is pageable.
