Source: NVIDIA GA100 WP v1.0 Fig.5 / GH100 WP v1.04 Fig.7
| unit | role | count/partition |
|---|---|---|
| SP/FP32 | FP32 FFMA | 16 (A100) · 32 (H100) |
| INT32 | integer IMAD | 16 |
| FP64 | FFMA.F64 | 8 (A100) · 16 (H100) |
| Tensor Core | MMA/WGMMA | 1 (3rd/4th gen) |
| LSU | Load/Store | 8 |
| SFU | transcendental | 4 |
Source: GA100 WP Fig.6 · GH100 WP §3.1
┌──────────────── SM ────────────────┐
│ L1 I-Cache · const cache (shared)  │
├────┬────┬────┬────┬────────────────┤
│ P0 │ P1 │ P2 │ P3 │  4 partitions  │
├────┴────┴────┴────┴────────────────┤
│ ┌─────┐                            │
│ │warp │ scheduler · dispatch       │
│ │sched│ 1 warp/cyc                 │
│ └──┬──┘                            │
│ ┌──┴──┐                            │
│ │ RF  │ 16K × 32b (per partition)  │
│ └──┬──┘                            │
│ ┌──┴──┐                            │
│ │ OC  │ operand collector          │
│ └──┬──┘                            │
│ ┌──┴──────────────────┐            │
│ │FP32·INT·FP64·SFU·LSU│            │
│ │ Tensor Core (4th)   │            │
│ └─────────────────────┘            │
├──────────── L1/Smem 228 KB ────────┤
│ Tex unit · RT Core (GA10x/AD10x)   │
└────────────────────────────────────┘
Source: CUDA C PG 12.6 App. H "Compute Capability 8.0/9.0"
| cache | capacity | scope |
|---|---|---|
| L0 I-cache | per partition | 1 warp scheduler |
| L1 I-cache | ~32 KB | whole SM |
| constant $ | 8 KB / partition | __constant__ |
Same hierarchy on GA100 and GH100. Figures are estimated upper bounds from CUDA C PG App. H.
| arch | reg/SM | byte/SM | max reg/thread |
|---|---|---|---|
| Ampere sm_80 | 65,536 | 256 KB | 255 |
| Hopper sm_90 | 65,536 | 256 KB | 255 |
| Blackwell sm_100 | 65,536 | 256 KB | 255 |
Source: CUDA C PG 12.6 Table 21 "Technical Specifications per Compute Capability"
ptxas handles register allocation.
cycle t: FFMA r0, r1, r2, r3
│
┌──────────┴──────────┐
│ Operand Collector │
│ slot: A B C accum │
│ ↑ ↑ ↑ │
└───┬──┬──┬──┬────────┘
│ │ │ │
RF RF RF RF (bank select cycle-staggered)
│ │ │ │
└──┴──┴──┴─→ ALU FP32 FMA
Conceptual diagram · Source: Greg Smith GTC 2010 "Fermi SM" · structure essentially unchanged since Volta.
| storage | latency | BW |
|---|---|---|
| register | ~1 cyc | full |
| local (L1 hit) | ~30 cyc | SM L1 BW |
| local (HBM) | ~400 cyc | HBM BW |
Source: CUDA C PG App. H "Local Memory" + standard micro-benchmark reference values.
Source: CUDA Occupancy Calculator (deprecated), same formula · PG Table 21.
setmaxnreg ★
- setmaxnreg.inc.sync.aligned 232: consumer warps take a large register budget
- setmaxnreg.dec.sync.aligned 40: producer warps keep the minimum
- Cap registers with -maxrregcount or __launch_bounds__
- Check for spills with -Xptxas -v (stack frame bytes)
| arch | L1+Smem total/SM | Smem max/block |
|---|---|---|
| Volta (GV100) | 128 KB | 96 KB |
| Ampere (GA100) | 192 KB | 164 KB |
| Ada (AD102) | 128 KB | 100 KB |
| Hopper (GH100) | 256 KB | 228 KB |
| Blackwell (GB100) | 256 KB | 228 KB |
Source: GA100 WP Table 2 · GH100 WP §3.2 · GB100 WP v1.0 §3.1
cudaFuncSetAttribute(..., cudaFuncAttributePreferredSharedMemoryCarveout, %)
Source: CUDA C PG 12.6 §3.2.5 / GA100 WP Table 3
cudaDeviceSetSharedMemConfig
PTX details ↗ V04 §4·§7
cluster (1..16 CTA, default ≤8)
├── CTA0.smem 228 KB ─┐
├── CTA1.smem 228 KB ─┤ unified
├── CTA2.smem 228 KB ─┼── logical
└── CTA3.smem 228 KB ─┘ space
addr = (rank, offset)
cluster.map_shared_rank (PTX) → remote ptr
Source: GH100 WP §3.3 "Thread Block Cluster" · PTX ISA 8.x §9.7.12
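A device-side sketch of DSM access through the cooperative-groups cluster API (CUDA 12 / sm_90 assumed; the kernel name and access pattern are illustrative, not from the source):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each CTA writes its rank into its own smem slot, then reads the next
// CTA's slot through the cluster's unified shared-memory address space.
__global__ void __cluster_dims__(4, 1, 1) dsm_demo(int *out) {
    __shared__ int slot;
    cg::cluster_group cluster = cg::this_cluster();
    unsigned rank = cluster.block_rank();
    slot = (int)rank;
    cluster.sync();                       // all CTAs' smem writes visible
    unsigned peer = (rank + 1) % cluster.num_blocks();
    // map_shared_rank: translate our smem address into peer's CTA
    int *remote = (int *)cluster.map_shared_rank(&slot, peer);
    out[rank] = *remote;                  // reads the peer CTA's rank
    cluster.sync();                       // keep smem alive until peers finish
}
```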
| arch | L2 total | structure |
|---|---|---|
| Volta GV100 | 6 MB | unified |
| Ampere GA100 | 40 MB | 2 partition × 20 MB |
| Ada AD102 | 96 MB | unified |
| Hopper GH100 | 50 MB | 2 partition × 25 MB |
| Blackwell GB100 die | ~60 MB / die | dual-die via NV-HBI |
Source: GA100 WP Table 2 · GH100 WP §3.4 · GB200 WP §2.1 (die-partitioned)
Source: GH100 WP Fig.10 · GB200 WP "NV-HBI" section
// host
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr = p;
attr.accessPolicyWindow.num_bytes = N;
attr.accessPolicyWindow.hitRatio = 1.0f;
attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(s, cudaStreamAttributeAccessPolicyWindow, &attr);
Source: CUDA C PG 12.6 §5.2.6 "L2 Cache Set-Aside"
| arch | max L2 set-aside |
|---|---|
| GA100 | 30 MB (of 40 MB) |
| GH100 | ≤ 75% |
cudaDeviceGetLimit(cudaLimitPersistingL2CacheSize)
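A host-side sketch of reserving and later releasing the set-aside region (the 16 MB request is an arbitrary example; the driver may clamp it to the per-arch maximum above):

```cuda
#include <cuda_runtime.h>

size_t want = 16u << 20;                 // request a 16 MB persisting region
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, want);

size_t got = 0;                          // driver may clamp to the HW max
cudaDeviceGetLimit(&got, cudaLimitPersistingL2CacheSize);

// After the persistent phase, return the lines to normal L2 use:
cudaCtxResetPersistingL2Cache();
```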
Source: GA100 WP §2.4 "Compute Data Compression"
Lowering hitRatio (to avoid thrashing) also roughly halves the benefit.
| spec | HBM2e | HBM3 | HBM3e |
|---|---|---|---|
| stack height | 8-hi | 8/12-hi | 8/12-hi |
| stack capacity | 16 GB | 16/24 GB | 24/36 GB |
| ch/stack | 8 | 16 | 16 |
| pin/ch | 128-bit | 64-bit | 64-bit |
| data rate | 3.2 Gb/s | 6.4 Gb/s | 8~9.2 Gb/s |
| BW/stack | ~410 GB/s | ~820 GB/s | ~1.2 TB/s |
Source: JEDEC JESD238A (HBM3) · official NVIDIA GH100/GB200 WP figures
| GPU | stacks | capacity | peak BW |
|---|---|---|---|
| A100 SXM4 80GB | 5 (1 of 6 disabled) | 80 GB HBM2e | 2.039 TB/s |
| H100 SXM5 80GB | 5 (1 of 6 disabled) | 80 GB HBM3 | 3.35 TB/s |
| H200 SXM | 6 | 141 GB HBM3e | 4.8 TB/s |
| B200 (GB100) | 8 (2 die × 4) | 192 GB HBM3e | 8.0 TB/s |
Source: NVIDIA A100 WP Table 2 · H100 WP Table 2 · H200 datasheet · B200 GTC'24 spec
SM → L2 slice → MC → HBM stack
↑
per-stack controller
↑
channel pseudo-channel
↑
bank group · bank · row
Source: GH100 WP §3.4 "Memory Subsystem"
PG §5.3.2 "Device Memory Access"
| item | effect |
|---|---|
| capacity | user-visible size already accounts for ECC (the stated 80 GB) |
| BW | official peak figures assume ECC on |
| off option | A100+ supports nvidia-smi -e 0 (DC only) |
Source: NVIDIA HBM ECC technical docs · nvidia-smi man page.
| link | A100 | H100 | B200 |
|---|---|---|---|
| NVLink | 600 GB/s (3rd) | 900 GB/s (4th) | 1800 GB/s (5th) |
| PCIe | Gen4 x16 64 GB/s | Gen5 x16 128 GB/s | Gen5 x16 128 GB/s |
Source: A100/H100 WP · GB200 GTC'24 session
| gen | arch | CC | representative shape |
|---|---|---|---|
| 1st | Volta GV100 | 7.0 | m8n8k4 FP16→FP32 |
| 2nd | Turing TU102 | 7.5 | + INT8/INT4 |
| 3rd | Ampere GA100 | 8.0 | m16n8k16 + TF32/BF16/FP64 |
| 4th | Hopper GH100 | 9.0 | wgmma m64nNk16 + FP8 |
| 5th | Blackwell GB100 | 10.0 | + FP4/FP6 (micro-scale) |
Source: NVIDIA's official generation naming (per-arch WPs).
| dtype | V100 | T4 | A100 | H100 | B200 |
|---|---|---|---|---|---|
| FP64 TC | — | — | ✓ | ✓ | ✓ |
| TF32 | — | — | ✓ | ✓ | ✓ |
| FP16 | ✓ | ✓ | ✓ | ✓ | ✓ |
| BF16 | — | — | ✓ | ✓ | ✓ |
| INT8 | — | ✓ | ✓ | ✓ | ✓ |
| INT4 | — | ✓ | ✓ | (dep) | — |
| FP8 E4M3/E5M2 | — | — | — | ✓ | ✓ |
| FP6 E3M2/E2M3 | — | — | — | — | ✓ |
| FP4 E2M1 | — | — | — | — | ✓ |
Source: GV100/GA100/GH100/GB200 WP + PTX ISA 8.x "mma" supported-dtype list
| TF @ dtype | A100 | H100 | B200 |
|---|---|---|---|
| FP64 TC | 19.5 | 67 | 40 |
| TF32 | 156 | 495 | 1100 |
| FP16/BF16 | 312 | 989 | 2250 |
| FP8 | — | 1979 | 4500 |
| FP4 | — | — | 9000 |
Units TF/s · dense (not 2:4 sparsity) · Source: A100/H100 WP Table 2 · official B200 GTC'24 keynote slides
Source: GA100 WP §2.3 "Sparsity"
| gen | change | software impact |
|---|---|---|
| 1st(V100) | warp-level WMMA | C++ API only |
| 3rd(A100) | PTX mma.sync | direct thread-fragment control |
| 3rd+cp.async | direct global→Smem path on A100 | simpler double buffering |
| 4th(H100) | wgmma async + smem operand | warp-group programming |
| 5th(B200) | micro-scale + per-block scale | scale block included in dtype declaration |
PTX ISA 8.x §9.7.13 "Matrix multiply-accumulate"
| spec | value |
|---|---|
| process | TSMC N7 (7 nm) |
| die size | 826 mm² |
| transistors | 54.2 B |
| GPC / TPC / SM (full) | 8 / 64 / 128 |
| A100 enabled SM | 108 (SXM4) |
| FP32 core / SM | 64 |
| TC / SM | 4 (3rd gen) |
Source: NVIDIA GA100 WP v1.0 §2.1 / Table 2
| resource | value |
|---|---|
| FP32 core | 6,912 |
| FP64 core | 3,456 |
| Tensor Core | 432 |
| base clock | 1095 MHz |
| boost clock | 1410 MHz |
| level | cap | BW/latency |
|---|---|---|
| Reg/SM | 256 KB | ~1 cyc |
| L1+Smem/SM | 192 KB | ~20 cyc |
| L2 | 40 MB | ~200 cyc |
| HBM2e | 40/80 GB | 1.555 / 2.039 TB/s |
Source: GA100 WP Table 2 · latencies are public micro-benchmark reference values ↗ V02 §14
mbarrier (initial version). Source: GA100 WP §2
| dtype | dense TF | 2:4 TF |
|---|---|---|
| FP64 | 9.7 | — |
| FP64 TC | 19.5 | — |
| FP32 | 19.5 | — |
| TF32 TC | 156 | 312 |
| FP16/BF16 TC | 312 | 624 |
| INT8 TC | 624 TOPS | 1248 TOPS |
Source: A100 WP Table 2 · at boost clock
| model | mem | TDP | form |
|---|---|---|---|
| A100 SXM4 40 | 40 GB | 400 W | SXM4 |
| A100 SXM4 80 | 80 GB | 400 W | SXM4 |
| A100 PCIe 40 | 40 GB | 250 W | PCIe Gen4 |
| A100 PCIe 80 | 80 GB | 300 W | PCIe Gen4 |
NVIDIA A100 Datasheet (2021)
| spec | value |
|---|---|
| process | TSMC 4N |
| die size | 814 mm² |
| transistors | 80 B |
| GPC / TPC / SM (full) | 8 / 72 / 144 |
| H100 SXM5 enabled SM | 132 |
| FP32 core / SM | 128 |
| TC / SM | 4 (4th gen) |
Source: NVIDIA GH100 WP v1.04 §2.1 / Table 2
| resource | value |
|---|---|
| FP32 core | 16,896 |
| FP64 core | 8,448 |
| Tensor Core | 528 |
| base clock | 1590 MHz |
| boost clock | 1980 MHz |
| level | cap | BW/latency |
|---|---|---|
| Reg/SM | 256 KB | ~1 cyc |
| L1+Smem/SM | 256 KB | ~23 cyc |
| L2 | 50 MB (2×25) | ~260 cyc |
| HBM3 | 80 GB | 3.35 TB/s |
| NVLink 4.0 | 18 link | 900 GB/s |
Source: GH100 WP Table 2 · latencies from NVIDIA H100 micro-benchmarks (NVIDIA GTC'22)
Details ↗ V02 §9 · V04 §1–§7
| dtype | dense TF | 2:4 TF |
|---|---|---|
| FP64 | 34 | — |
| FP64 TC | 67 | — |
| FP32 | 67 | — |
| TF32 TC | 495 | 989 |
| FP16/BF16 TC | 989 | 1979 |
| FP8 TC | 1979 | 3958 |
| INT8 TC | 1979 TOPS | 3958 TOPS |
Source: H100 WP Table 2 · SXM5 / boost clock
| model | mem | TDP |
|---|---|---|
| H100 SXM5 | 80 GB HBM3 | 700 W |
| H100 PCIe | 80 GB HBM2e | 350 W |
| H100 NVL | 94 GB HBM3 | 400 W |
| H200 SXM | 141 GB HBM3e | 700 W |
H100 datasheet / H200 datasheet (2024)
vimax3·vimin3: 3-input SIMD max/min + relu
Source: GH100 WP §2.6 "DPX"
cuTensorMapEncodeTiled: descriptor built on the host
cp.async.bulk.tensor
Source: GH100 WP §2.4 · PTX 8.x §9.7.8 · details ↗ V04 §4·§5
| issue | sm_80 cp.async | sm_90 TMA |
|---|---|---|
| issuing | 32 threads cooperate | 1 thread |
| address calc | SW | HW (descriptor) |
| OOB | branch | HW fill |
| multi-D | manual | native |
| multicast | — | cluster-wide |
Source: PTX ISA 8.x §9.7.13.2 "wgmma" · CUTLASS 3.x Hopper kernels
__cluster_dims__(x,y,z) or a launch attribute
cluster.sync / cluster.arrive
Source: GH100 WP §3.3 · CUDA C PG §7.27
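The launch-attribute form can be sketched as follows (CUDA ≥11.8 host code; my_kernel and its argument are placeholders):

```cuda
#include <cuda_runtime.h>

cudaLaunchConfig_t cfg = {0};
cfg.gridDim  = dim3(64);                 // grid must divide evenly by cluster
cfg.blockDim = dim3(256);

cudaLaunchAttribute attr;
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim.x = 4;               // 4 CTAs per cluster
attr.val.clusterDim.y = 1;
attr.val.clusterDim.z = 1;
cfg.attrs = &attr;
cfg.numAttrs = 1;

cudaLaunchKernelEx(&cfg, my_kernel, arg0);  // my_kernel/arg0: placeholders
```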
st.shared::cluster · ld.shared::cluster · atom.shared::cluster
Source: PTX ISA 8.x §9.7.12 · GH100 WP §3.3
Source: GH100 WP §2.3 "Transformer Engine" · NVIDIA TE GitHub docs
TMA ─────┐ ┌─── smem ─── WGMMA
├─(mbar)──┤
producer │ │ consumer
warps ↓ ↓ (warp-specialized)
Cluster + DSM: smem shared across multiple CTAs
TE: wrapper around FP8 layers
| spec | B200 (GB100 dual-die) |
|---|---|
| process | TSMC 4NP |
| die size | 2 × 800 mm² (reticle limit) |
| transistors | 208 B (total) |
| die-to-die link | NV-HBI 10 TB/s |
| SM (enabled) | 2 × 80 = 160 |
| FP32 core / SM | 128 |
| TC / SM | 4 (5th gen) |
Source: NVIDIA GB200 NVL72 WP · GTC'24 keynote spec slide
| level | B200 |
|---|---|
| Reg/SM | 256 KB |
| L1+Smem/SM | 256 KB |
| L2/die | ~60 MB |
| HBM3e | 192 GB (8 stack) |
| HBM BW | 8.0 TB/s |
| NVLink 5.0 | 1800 GB/s |
Source: GB200 WP §2 · official B200 datasheet (2024)
Source: GB200 NVL72 WP §2.2 · GTC'24
| dtype | dense PF | 2:4 PF |
|---|---|---|
| FP64 TC | 0.04 | — |
| TF32 | 1.1 | 2.2 |
| FP16/BF16 | 2.25 | 4.5 |
| FP8 | 4.5 | 9 |
| FP6 | 4.5 | 9 |
| FP4 | 9 | 18 |
Units PF/s (PetaFLOP/s) · Source: official NVIDIA B200 spec sheet · dense = non-sparsity
| metric | A100 | H100 | B200 |
|---|---|---|---|
| process | N7 | 4N | 4NP |
| transistor | 54 B | 80 B | 208 B |
| SM | 108 | 132 | 160 |
| FP32 core | 6912 | 16896 | 20480 |
| TC gen | 3rd | 4th | 5th |
| Reg/SM | 256K | 256K | 256K |
| Smem/SM | 164K | 228K | 228K |
| L2 | 40 MB | 50 MB | ~120 MB |
| HBM | 80 HBM2e | 80 HBM3 | 192 HBM3e |
| BW | 2.04 TB/s | 3.35 TB/s | 8.0 TB/s |
| NVLink | 600 GB/s | 900 GB/s | 1800 GB/s |
| TDP (SXM) | 400 W | 700 W | 1000 W |
| FP16 TC | 312 TF | 989 TF | 2250 TF |
| FP8 TC | — | 1979 TF | 4500 TF |
Source: official A100/H100/GB200 WPs · SXM form · boost clock · dense
| reason | cause |
|---|---|
| Long Scoreboard | waiting on global / local / surface loads |
| Short Scoreboard | waiting on smem / TC pipeline |
| Wait | waiting for a fixed-latency inst to complete |
| Drain | synchronizing on the exit path |
| MIO Throttle | LSU/atomic/mem queue saturated |
| Tex Throttle | waiting on the Tex unit |
| Barrier | waiting at __syncthreads · mbarrier |
| Dispatch Stall | issue-slot contention |
| No Instruction | I-cache miss |
| Not Selected | eligible but not selected |
Source: Nsight Compute "Warp State Statistics" section · detailed measurement analysis ↗ V18 §6
Source: application of Little's law. See NVIDIA GTC'16 "GPU Performance Analysis".
sm__inst_issued section
[F]  fetch         I-cache → IB
[D]  decode        opcode + operand spec
[OC] operand coll. RF bank read staging
[EX] execute       ALU / LSU / TC / SFU
[WB] write-back    RF write
Source: Greg Smith "Fermi SM" GTC'10 · synthesized from the Volta/Turing whitepapers.
| pipe | example instruction | latency (cyc, A100) |
|---|---|---|
| FP32 FMA | FFMA | 4 |
| INT IMAD | IMAD | 4 |
| SFU | RCP·RSQ·EX2 | 16~32 |
| TC MMA | HMMA | 16~32 |
| LSU smem | LDS | ~20 |
| LSU L2 | LDG.L2 | ~200 |
| LSU HBM | LDG | ~400 |
Latencies are statistical reference values from NVIDIA micro-benchmarks (Jia et al. 2019/2021) · per-generation variation ↗ V02 §14
SASS details ↗ V04 §14·§15
cp.async goes to the retire queue immediately after issue; completion is waited on via mbarrier or wgmma.wait_group
Async proxy details ↗ V04 §10
Source: PTX ISA 8.x §8 "Memory Consistency Model" (2021–)
| scope | extent |
|---|---|
| .cta | threads within the block |
| .cluster | blocks within a Hopper cluster |
| .gpu | the whole device |
| .sys | host + multiple devices (UM) |
Scopes order from small to large; .sys is the most expensive and is required for UM/peer synchronization.
| qualifier | meaning |
|---|---|
| .relaxed | atomicity only, no ordering |
| .acquire | later memory ops may not be reordered before it (read side) |
| .release | earlier memory ops may not be reordered after it (write side) |
| .acq_rel | both |
| .sc | sequential consistency (acq_rel + total order) |
PTX ISA 8.x §8.4 "Memory Operation Ordering"
// producer
st.global data, %v;
st.release.gpu flag, 1;
// consumer
ld.acquire.gpu %f, flag;
@p bra spin_if_zero;
ld.global %r, data;   // sees %v
Use .gpu within a single device; use .sys for peer/host communication.
| instruction | meaning |
|---|---|
| fence.acq_rel.cta | ordering within the block |
| fence.acq_rel.gpu | whole device |
| fence.sc.gpu | SC + device |
| fence.proxy.async | generic ↔ async proxy |
| membar.gl | legacy global fence |
| __threadfence_block/system | C runtime wrappers |
Ordering across proxies must be secured with fence.proxy.async.
PTX ISA 8.x §9.7.13.3 "async proxy fence"
atom.global.add.s32 · .min·.max·.and·.or·.xor·.exch·.cas
scopes .cta/.gpu/.sys
atom.add.f32 · f64 on sm_60+
PTX atomic details ↗ V03 §11
| op | V100 | A100 | H100 |
|---|---|---|---|
| FFMA (FP32) | 4 | 4 | 4 |
| IMAD | 4 | 4 | 4 |
| FP64 FMA | 8 | 8 | 8 |
| SFU RCP | ~18 | ~16 | ~16 |
| HMMA (FP16) | ~22 | ~16 | ~16 |
Source: Jia et al. "Dissecting GPU Architectures" (2019, 2021) · reference values; HW details are not disclosed by NVIDIA.
| access | V100 | A100 | H100 |
|---|---|---|---|
| LDS (smem) | ~23 | ~22 | ~23 |
| LDG L1 hit | ~28 | ~30 | ~33 |
| LDG L2 hit | ~193 | ~200 | ~260 |
| LDG L2 miss (HBM) | ~400 | ~450 | ~490 |
| Atomic L2 | ~270 | ~260 | ~300 |
Source: Jia 2019/2021 · NVIDIA GTC'22 micro-benchmark slides. Units: clock cycles.
Little's law · formal treatment in Volkov 2010
| op | H100 cyc |
|---|---|
| wgmma.mma_async m64n256k16 | ~32 (issue+retire) |
| cp.async.bulk.tensor 128B | ~100+ |
| mbarrier.try_wait | ~1 (polling) |
Source: CUTLASS 3.x Hopper kernel timing · NVIDIA GTC'22
| level | A100 | H100 | B200 |
|---|---|---|---|
| Reg/SM | ~14 TB/s | ~31 TB/s | ~36 TB/s |
| Smem/SM | ~19 TB/s | ~33 TB/s | ~33 TB/s |
| L2 total | ~4 TB/s | ~5.5 TB/s | ~13 TB/s |
| HBM | 2.04 TB/s | 3.35 TB/s | 8.0 TB/s |
Source: A100/H100/GB200 WP + official GTC'22/GTC'24 slides. On-SM BW is the per-SM figure × SM count, aggregated.
| GPU | peak FP16 TC | HBM | ridge (FLOP/B) |
|---|---|---|---|
| A100 | 312 TF | 2.04 TB/s | 153 |
| H100 | 989 TF | 3.35 TB/s | 295 |
| H100 FP8 | 1979 TF | 3.35 TB/s | 591 |
| B200 FP8 | 4500 TF | 8.0 TB/s | 562 |
ridge = the minimum arithmetic intensity at which a workload becomes compute-bound · ↗ V18 §2
| gen | x16 per-dir | GPU |
|---|---|---|
| Gen4 | 32 GB/s | A100 PCIe |
| Gen5 | 64 GB/s | H100/B200 PCIe |
64 GB/s per direction = 128 GB/s bidirectional aggregate (a difference in notation conventions)
| gen | links | total BW |
|---|---|---|
| NVLink 3 (A100) | 12 | 600 GB/s |
| NVLink 4 (H100) | 18 | 900 GB/s |
| NVLink 5 (B200) | 18 | 1800 GB/s |
NVSwitch and NVLink topology ↗ V15 §3
| GPU | form | TDP |
|---|---|---|
| V100 SXM2 | SXM2 | 300 W |
| A100 SXM4 | SXM4 | 400 W |
| A100 PCIe | PCIe | 250~300 W |
| H100 SXM5 | SXM5 | 700 W |
| H100 PCIe | PCIe | 350 W |
| H200 SXM | SXM5 | 700 W |
| B200 SXM | SXM6 | 1000 W |
| GB200 Superchip | liquid | 2700 W (GPU+CPU) |
Source: per-generation datasheets · GB200 NVL72 WP §3.1
Clocks can be pinned with nvidia-smi -lgc (DC).
| state | trigger | response |
|---|---|---|
| T < T_slowdown | — | boost maintained |
| T ≥ T_slowdown | HW sensor | clocks stepped down |
| T ≥ T_shutdown | critical | drop or halt |
Source: NVIDIA DCGM user guide · datasheet thermal specs
nvidia-smi --query-gpu=power.draw,power.limit
nvmlDeviceGetPowerUsage
Measurement analysis / Nsight ↗ V18 §7
| item | GA100 (A100) | GA102 (RTX 3090) |
|---|---|---|
| target | DC / HPC / AI | Gaming / Workstation |
| process | N7 | Samsung 8N |
| SM | 108 | 82 |
| FP32/SM | 64 | 128 |
| FP64 ratio | 1:2 | 1:64 |
| Smem/SM max | 164 KB | 100 KB |
| L2 | 40 MB | 6 MB |
| NVLink | 600 GB/s | limited on consumer 3090 |
| RT Core | — | 2nd gen |
| ECC HBM | ✓ | — |
| MIG | ✓ | — |
Source: GA100 WP · GA102 WP (2020)
| item | H100 SXM | L40S |
|---|---|---|
| arch | Hopper GH100 | Ada AD102 |
| SM | 132 | 142 |
| FP64 TC | 67 TF | 1.4 TF |
| FP16 TC | 989 TF | 362 TF |
| FP8 TC | 1979 TF | 733 TF |
| Memory | 80 GB HBM3 | 48 GB GDDR6 |
| BW | 3.35 TB/s | 864 GB/s |
| NVLink | 900 GB/s | — |
| RT Core | — | 3rd |
| TDP | 700 W | 350 W |
Source: official H100/L40S datasheets
| arch | DC chip | RTX chip |
|---|---|---|
| Pascal | GP100 | GP102 |
| Volta | GV100 | — (split off as Turing) |
| Turing | T4 (TU104) | TU102 |
| Ampere | GA100 | GA102 |
| Ada | AD102 (L40) | AD102 (RTX 40) |
| Hopper | GH100 | — (DC only) |
| Blackwell | GB100/GB200 | GB202 (RTX 50) |
Within a generation the architecture name is shared; only the chip (die) differs. PMPP 4e App.
Source: NVIDIA MIG User Guide (2023) · "MIG" sections of the A100/H100 WPs
| profile | SM | mem | slot |
|---|---|---|---|
| MIG 1g.10gb | 14 | 10 GB | 1 |
| MIG 2g.20gb | 28 | 20 GB | 2 |
| MIG 3g.40gb | 42 | 40 GB | 3 |
| MIG 4g.40gb | 56 | 40 GB | 4 |
| MIG 7g.80gb | 98 | 80 GB | 7 |
Source: NVIDIA A100 MIG User Guide Table 2
| profile | SM | mem (80GB) |
|---|---|---|
| 1g.10gb | 16 | 10 GB |
| 1g.20gb | 16 | 20 GB |
| 2g.20gb | 32 | 20 GB |
| 3g.40gb | 60 | 40 GB |
| 4g.40gb | 60 | 40 GB |
| 7g.80gb | 132 | 80 GB |
Source: NVIDIA H100 MIG User Guide · 2024
Source: A100 MIG User Guide §2.3
| scenario | benefit |
|---|---|
| multi-tenant serving | QoS · isolation |
| small-batch inference | higher GPU utilization |
| Dev/Prod separation | safety |
| CI/CD testing | parallel queues |
| arch | year | CC | representative chips |
|---|---|---|---|
| Kepler | 2012 | 3.x | K20·K80 |
| Maxwell | 2014 | 5.x | M40 |
| Pascal | 2016 | 6.x | P100 (GP100) |
| Volta | 2017 | 7.0 | V100 (GV100) |
| Turing | 2018 | 7.5 | T4 · RTX 20 |
| Ampere | 2020 | 8.0/8.6 | A100 · RTX 30 |
| Ada | 2022 | 8.9 | L40 · RTX 40 |
| Hopper | 2022 | 9.0 | H100 · H200 |
| Blackwell | 2024 | 10.0 | B200 · GB200 · RTX 50 |
Source: NVIDIA CUDA C PG App. H generation table · release years from each WP
| level | cap | lat | BW |
|---|---|---|---|
| Reg | 256 KB/SM | ~1 | full |
| Smem | 164/228 KB | ~20 | ~19/33 TB/s |
| L1 | shared w/ smem | ~30 | — |
| L2 | 40/50 MB | ~200 | ~4/5.5 TB/s |
| HBM | 80/192 GB | ~400 | 2/3.35/8 TB/s |
In A100 / H100 / B200 order · units: cycles or TB/s
| arch | -arch string |
|---|---|
| Ampere A100 | sm_80 |
| Ampere RTX 30 | sm_86 |
| Ada L40/RTX 40 | sm_89 |
| Hopper H100 | sm_90 / sm_90a |
| Blackwell B200 | sm_100 / sm_100a |
TMA/WGMMA require sm_90a (architecture-specific features). Details ↗ V04 §1
| row | A100 | H100 | B200 |
|---|---|---|---|
| SM | 108 | 132 | 160 |
| TC gen | 3rd | 4th | 5th |
| HBM | 80 2e | 80 3 | 192 3e |
| BW TB/s | 2.0 | 3.35 | 8.0 |
| NVLink | 600 | 900 | 1800 |
| FP16 TC | 312 | 989 | 2250 |
| FP8 TC | — | 1979 | 4500 |
| FP4 TC | — | — | 9000 |
| TDP W | 400 | 700 | 1000 |
| Process | N7 | 4N | 4NP |
__syncthreads; TMA uses mbarrier