| format | e (exp bits) | bias | E range |
|---|---|---|---|
| FP64 | 11 | 1023 | [−1022, +1023] |
| FP32 | 8 | 127 | [−126, +127] |
| FP16 | 5 | 15 | [−14, +15] |
| BF16 | 8 | 127 | [−126, +127] |
| value | S | E | M |
|---|---|---|---|
| +0 | 0 | 0 | 0 |
| −0 | 1 | 0 | 0 |
| +Inf | 0 | all-1 | 0 |
| −Inf | 1 | all-1 | 0 |
| sNaN | — | all-1 | ≠0, MSB=0 |
| qNaN | — | all-1 | ≠0, MSB=1 |
31 30       23 22                      0
┌─┬──────────┬───────────────────────┐
│S│  E (8)   │        M (23)         │
└─┴──────────┴───────────────────────┘
 sign   exp      mantissa (fraction)
bias = 127
normalized: x = (-1)^S · 2^(E-127) · 1.M
63 62          52 51                    0
┌─┬─────────────┬───────────────────────┐
│S│   E (11)    │        M (52)         │
└─┴─────────────┴───────────────────────┘
bias = 1023
precision ≈ 15~17 decimal digits
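A minimal sketch (standard library only; the helper name is illustrative) that unpacks a float32 into the S/E/M fields of the FP32 layout above and rebuilds the value from the normalized formula:

```python
import struct

def decode_fp32(x: float):
    """Split a float32 into sign / exponent / mantissa and rebuild the value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                   # 1 sign bit
    e = (bits >> 23) & 0xFF          # 8-bit biased exponent
    m = bits & 0x7FFFFF              # 23-bit fraction
    value = (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2**23)  # normalized case only
    return s, e, m, value

print(decode_fp32(-6.25))            # (1, 129, 4718592, -6.25)
```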
εmach(FP64) = 2^−52 ≈ 2.22e−16 · ↗ V09 §8 Error
| format | m | εmach |
|---|---|---|
| FP64 | 52 | 2^−52 ≈ 2.22e−16 |
| FP32 | 23 | 2^−23 ≈ 1.19e−7 |
| FP16 | 10 | 2^−10 ≈ 9.77e−4 |
| BF16 | 7 | 2^−7 ≈ 7.81e−3 |
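These constants can be checked directly; a quick sketch assuming NumPy and PyTorch (torch only because NumPy has no bfloat16):

```python
import numpy as np
import torch

print(np.finfo(np.float64).eps)         # ≈ 2.22e-16  (2^-52)
print(np.finfo(np.float32).eps)         # ≈ 1.19e-07  (2^-23)
print(np.finfo(np.float16).eps)         # ≈ 9.77e-04  (2^-10)
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125   (2^-7)
```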
| | FP64 | FP32 |
|---|---|---|
| max norm | ~1.80e+308 | ~3.40e+38 |
| min norm | ~2.23e−308 | ~1.18e−38 |
| min subnorm | ~4.94e−324 | ~1.40e−45 |
| precision | 15~17 dd | 6~9 dd |
IEEE 754-2008 · dd = decimal digits
15 14    10 9              0
┌─┬──────┬───────────────┐
│S│E (5) │    M (10)     │
└─┴──────┴───────────────┘
bias = 15
range ~[6e-8, 6.5e+4]
ε_mach = 2^-10 ≈ 9.77e-4
15 14         7 6           0
┌─┬────────────┬─────────────┐
│S│   E (8)    │    M (7)    │
└─┴────────────┴─────────────┘
bias = 127
range ≈ FP32
ε_mach = 2^-7 ≈ 7.81e-3
18 17          10 9             0
┌─┬─────────────┬───────────────┐
│S│    E (8)    │    M (10)     │
└─┴─────────────┴───────────────┘
19-bit internal format · stored as 32-bit
bias = 127
range = FP32
| | FP16 | BF16 | TF32 |
|---|---|---|---|
| S/E/M | 1/5/10 | 1/8/7 | 1/8/10 |
| bias | 15 | 127 | 127 |
| max norm | 65504 | ~3.4e38 | ~3.4e38 |
| min norm | 6.1e−5 | 1.2e−38 | 1.2e−38 |
| εmach | 9.8e−4 | 7.8e−3 | 9.8e−4 |
| loss scale | required | not needed | not needed |
| storage | 16b | 16b | 32b |
precision (mantissa)
  ↑
23│                        FP32
10│        FP16            TF32
 7│                        BF16
 3│ FP8-E4M3
 2│ FP6-E3M2  FP8-E5M2
 1│ FP4-E2M1
  └─────────────────────────────────→ range (exp)
     e=4     e=5     e=8(FP32·BF16·TF32)     e=11(FP64)
y-axis: mantissa bits = precision · x-axis: exp bits = range · placement is qualitative
| dtype | A100 (3rd-gen TC) | H100 (4th-gen TC) |
|---|---|---|
| FP16 | 312 TF | 989 TF |
| BF16 | 312 TF | 989 TF |
| TF32 | 156 TF | 495 TF |
| FP8 | — | 1979 TF |
| FP64 TC | 19.5 TF | 67 TF |
NVIDIA A100/H100 whitepapers · without sparsity · TC = Tensor Core
| | E4M3 | E5M2 |
|---|---|---|
| S/E/M | 1/4/3 | 1/5/2 |
| bias | 7 | 15 |
| max | 448 | 57344 |
| min norm | 2^−6 | 2^−14 |
| min subn | 2^−9 | 2^−16 |
| εmach | 2^−3 = 0.125 | 2^−2 = 0.25 |
| Inf/NaN | NaN only (S.1111.111) | IEEE standard |
| use | weight · activation | gradient |
OCP FP8 spec · H100 whitepaper · E4M3: NaN only (no Inf → larger range)
7 6    3 2   0
┌─┬─────┬─────┐
│S│E (4)│M (3)│
└─┴─────┴─────┘
bias = 7 · max = 448
NaN: S.1111.111 · no Inf
→ reclaiming the all-ones exponent extends the max (448 vs 240 IEEE-style)
7 6       2 1 0
┌─┬─────────┬───┐
│S│  E (5)  │M 2│
└─┴─────────┴───┘
bias = 15 · max = 57344
IEEE standard (Inf/NaN)
→ shares the FP16 exponent; only the mantissa is shorter
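The NaN-only convention is easy to verify by brute force; a hypothetical decoder sketch enumerating all 256 E4M3 codes under the OCP rules above:

```python
def decode_e4m3(byte: int) -> float:
    """OCP FP8 E4M3: bias 7, no Inf; NaN only at S.1111.111."""
    s = (byte >> 7) & 1
    e = (byte >> 3) & 0xF
    m = byte & 0x7
    if e == 0xF and m == 0x7:
        return float("nan")                  # the single NaN code per sign
    if e == 0:                               # subnormal: 0.M × 2^(1-7)
        return (-1.0) ** s * (m / 8) * 2.0 ** -6
    return (-1.0) ** s * (1 + m / 8) * 2.0 ** (e - 7)

vals = [decode_e4m3(b) for b in range(256)]
finite = [v for v in vals if v == v]         # NaN != NaN filters the 2 NaNs
print(max(finite))                           # 448.0 = 1.75 × 2^8  (S.1111.110)
print(min(v for v in finite if v > 0))       # 0.001953125 = 2^-9  (min subnormal)
```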
| granularity | metadata | outlier tolerance |
|---|---|---|
| per-tensor | 1 | low |
| per-channel | C | medium |
| per-block G=32 | N/32 | high |
| per-element | N | (=FP32) |
for concrete algorithm choices, see ↗ V10 §1~§4
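To make the metadata/outlier trade-off concrete before jumping to V10, a hypothetical per-block absmax quantizer sketch (G=32, symmetric; the actual FP8 cast is elided):

```python
import numpy as np

def quant_per_block(x: np.ndarray, G: int = 32, qmax: float = 448.0):
    """One absmax scale per block of G elements (metadata: N/G scales)."""
    blocks = x.reshape(-1, G)                      # requires N % G == 0
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-12)               # guard all-zero blocks
    q = np.clip(blocks / scale, -qmax, qmax)       # FP8 cast would go here
    return q, scale

x = np.random.randn(1024).astype(np.float32)
x[0] = 100.0                                       # outlier poisons only block 0
q, s = quant_per_block(x)
dequant = (q * s).reshape(-1)                      # blocks 1..31 keep full resolution
```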
E=0001 (2^-6): 0.015625, 0.017578, 0.019531, ..., 0.029296
E=0010 (2^-5): 0.03125, 0.03515, 0.0390, ...
E=1111 (2^8, special): M ≠ 111 → normal (up to 448)
                       M = 111 → NaN only
→ spacing doubles from one exponent to the next · 8 levels per exponent (3 mantissa bits)
C[m,n] = Σ_k A[m,k] · B[k,n]
A, B    : FP8
product : FP16 intermediate
acc     : FP32 ← required
output  : BF16/FP16/FP8 after dequant
| | E2M1 | E3M0 |
|---|---|---|
| S/E/M | 1/2/1 | 1/3/0 |
| bias | 1 | 3 |
| levels | 16 | 16 |
| range | ±6 | ±128 |
| primary use | weight block-scale | (rarely used) |
E2M1: 0b SEEM
values: 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6
→ 16 lattice points (sign + magnitude)
| bits | |x| | bits | |x| |
|---|---|---|---|
| S000 | 0 | S100 | 2.0 |
| S001 | 0.5* | S101 | 3.0 |
| S010 | 1.0* | S110 | 4.0 |
| S011 | 1.5 | S111 | 6.0 |
E∈{0,1,2,3} · M∈{0,1} · *subnorm/norm boundary · MX spec
| | E2M3 | E3M2 |
|---|---|---|
| S/E/M | 1/2/3 | 1/3/2 |
| bias | 1 | 3 |
| max | 7.5 | 28 |
| εmach | 2^−3 | 2^−2 |
| trait | precision↑ | range↑ |
| name | elem | scale | G |
|---|---|---|---|
| MXFP8 | FP8 E4M3 / E5M2 | UE8M0 | 32 |
| MXFP6 | FP6 E2M3 / E3M2 | UE8M0 | 32 |
| MXFP4 | FP4 E2M1 | UE8M0 | 32 |
| MXINT8 | INT8 | UE8M0 | 32 |
OCP MX spec v1.0 · 2023
| | MXFP4 | NVFP4 |
|---|---|---|
| element | FP4 E2M1 | FP4 E2M1 |
| scale | UE8M0 (pow-2) | FP8 E4M3 |
| block G | 32 | 16 |
| scale/elem | 8/32 = 0.25b | 8/16 = 0.5b |
| standard | OCP open | NVIDIA only |
NVFP4 whitepaper · same FP4 element, but lower quantization error
block (G=32 elements, MXFP8):
┌─────────────────────────────────────┐
│ s₀ (E8M0, 1B) │ e₀ e₁ ... e₃₁ (32B) │
└─────────────────────────────────────┘
  scale            32 FP8 elements
→ total 33 B per block
→ overhead 8b / 32 elem = 0.25 b/elem

tensor [N elements] =
┌────────┬────────┬────────┬─── ...
│block 0 │block 1 │block 2 │
│s₀│data │s₁│data │s₂│data │
└────────┴────────┴────────┴─── ...
each block has its own scale · blocks laid out sequentially
block (G=16 elements, NVFP4):
┌──────────────────────────────────────┐
│ s₀ (FP8 E4M3, 1B) │ e₀..e₁₅ (packed) │
└──────────────────────────────────────┘
scale 16 FP4 = 8 B
→ total 9 B per block
→ overhead 8b / 16 elem = 0.5 b/elem
FP4 packing (2 elem / byte):
byte: [ e₂ᵢ₊₁ | e₂ᵢ ] ← nibble
high low
16 elements = 8 bytes
scale precision:
MXFP4 UE8M0: s ∈ {2^-127, ..., 2^127} (pow-2 only)
NVFP4 FP8: s = (-1)^S · 2^(E-7) · 1.M (full FP8)
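A sketch of the packing and E2M1 decode described above (helper names are illustrative; a real kernel may pick a different nibble order):

```python
# E2M1 magnitudes indexed by the 3-bit EEM field (from the lattice above)
E2M1_MAG = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(code: int) -> float:
    return (-1.0) ** (code >> 3) * E2M1_MAG[code & 0x7]

def pack_fp4(codes: list[int]) -> bytes:
    """Two 4-bit codes per byte: even index → low nibble, odd → high."""
    assert len(codes) % 2 == 0
    return bytes((codes[i + 1] << 4) | codes[i] for i in range(0, len(codes), 2))

def unpack_fp4(buf: bytes) -> list[int]:
    out = []
    for b in buf:
        out.append(b & 0xF)          # e_{2i}   (low nibble)
        out.append(b >> 4)           # e_{2i+1} (high nibble)
    return out

codes = [0b0111, 0b1001, 0b0010, 0b0000]   # 6.0, -0.5, 1.0, 0.0
assert unpack_fp4(pack_fp4(codes)) == codes
print([decode_e2m1(c) for c in codes])     # [6.0, -0.5, 1.0, 0.0]
```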
| format | elem b | scale b | eff b/elem |
|---|---|---|---|
| FP16 | 16 | — | 16 |
| BF16 | 16 | — | 16 |
| FP8 (per-tensor) | 8 | ≈0 | 8 |
| MXFP8 (G=32) | 8 | 0.25 | 8.25 |
| MXFP4 (G=32) | 4 | 0.25 | 4.25 |
| NVFP4 (G=16) | 4 | 0.5 | 4.5 |
| mode | abbrev | rule |
|---|---|---|
| Round Nearest Even | RN | nearest value; ties go to the even mantissa |
| Round toward Zero | RZ | decrease magnitude (truncate) |
| Round toward −∞ | RD | floor (direction down) |
| Round toward +∞ | RU | ceil (direction up) |
decimal example (round to integer):
2.5 → 2 (even)    3.5 → 4 (even)
4.5 → 4 (even)    5.5 → 6 (even)
→ zero bias over long sums (round up/down equally likely)
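Python's built-in round() already rounds half to even, so the decimal example reproduces directly:

```python
print([round(v) for v in (2.5, 3.5, 4.5, 5.5)])   # [2, 4, 4, 6]
```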
| scenario | RN | SR |
|---|---|---|
| w += 0.01 (FP16) | dropped to 0 | +1 ULP with probability 1/100 |
| expected value | 0 | 0.01 |
| long-run drift | biased | unbiased |
| determinism | yes | no (unless seeded) |
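PyTorch has no stochastic-rounding builtin; the following is a sketch for positive normal values that picks between the two enclosing FP16 codes so the expectation equals the input (`stochastic_round_fp16` is a hypothetical helper):

```python
import torch

def stochastic_round_fp16(x: torch.Tensor) -> torch.Tensor:
    """SR sketch for positive normal FP32 inputs: choose between the two
    neighboring FP16 values with probability equal to the fractional
    distance, so E[result] == x."""
    near = x.to(torch.float16)                      # RN neighbor of x
    near32 = near.to(torch.float32)
    step = torch.ones_like(x, dtype=torch.int16)    # +1 ULP in bit space
    step[near32 > x] = -1                           # RN rounded up → step down
    other = (near.view(torch.int16) + step).view(torch.float16).to(torch.float32)
    p_other = (x - near32) / (other - near32)       # in [0, 0.5]; sign-safe
    pick = torch.rand_like(x) < p_other
    return torch.where(pick, other, near32).to(torch.float16)

w = torch.full((100_000,), 2048.0)                  # FP16 ULP here is 2.0
print((w.half() + 0.01).float().mean().item())      # 2048.0 — RN always drops it
print(stochastic_round_fp16(w + 0.01).float().mean().item())  # ≈ 2048.01
```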
fma.rn.f32: RN is the default
__float2half: RN by default; __float2half_rn/_rz/_rd/_ru variants
cvt.{rn|rz|rm|rp|rni|rzi|rmi|rpi} suffixes
PTX ISA · cvt instruction · ↗ V03 §5
large:  1.00 × 2^20
small:  1.00 × 2^0
align:   1.00 × 2^20
       + 0.00...01 × 2^20   (20-bit shift)
       ─────────────────
→ bits shifted past the mantissa width are lost
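The same swamping effect in NumPy:

```python
import numpy as np

big = np.float16(2048.0)                   # 1.00 × 2^11; FP16 ULP here is 2.0
print(big + np.float16(1.0))               # 2048.0 — the 1 is shifted out
print(np.float32(big) + np.float32(1.0))   # 2049.0 — wider mantissa keeps it
```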
serial (left-to-right):
(((a₀ + a₁) + a₂) + a₃)
→ O(N) steps · error grows O(N)ε worst case; lower if summed in ascending order
tree (pairwise):
a₀ a₁ a₂ a₃
\ / \ /
s₁ s₂
\ /
S
→ O(log N) steps · error O(log N)ε worst case (≈ O(√log N)ε under a random-error model)
warp-shuffle:
stride 16, 8, 4, 2, 1
→ order fixed by the hardware pattern → run-to-run consistent
Higham, "Accuracy and Stability of Numerical Algorithms", Ch. 4 · pairwise is generally more accurate
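A sketch comparing the two orders (np.sum itself already uses pairwise blocking internally, so the serial loop is written by hand):

```python
import numpy as np

def pairwise_sum(x: np.ndarray) -> np.float32:
    """Tree reduction: split, sum halves, combine (O(log N) depth)."""
    if len(x) <= 2:
        return x.sum(dtype=np.float32)
    mid = len(x) // 2
    return pairwise_sum(x[:mid]) + pairwise_sum(x[mid:])

x = np.random.rand(100_000).astype(np.float32)
exact = x.astype(np.float64).sum()

serial = np.float32(0.0)
for v in x:                          # strict left-to-right in FP32
    serial = np.float32(serial + v)
print(abs(serial - exact))           # typically orders of magnitude larger
print(abs(pairwise_sum(x) - exact))  # than the pairwise error
```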
| op | input | acc |
|---|---|---|
| GEMM | FP16/BF16/FP8 | FP32 |
| reduction | FP16 | FP32 |
| layer norm stat | BF16 | FP32 |
| softmax denom | FP16 | FP32 |
rule: accumulator dtype ≥ input dtype · a too-narrow accumulator is catastrophic
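A minimal demonstration of the rule: summing 4096 copies of 0.1 held in FP16 with a narrow vs. a wide accumulator:

```python
import numpy as np

x = np.full(4096, 0.1, dtype=np.float16)   # fp16(0.1) ≈ 0.0999756 → true sum ≈ 409.5

acc16 = np.float16(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)          # FP16 accumulator
acc32 = np.float32(0.0)
for v in x:
    acc32 = acc32 + np.float32(v)          # FP32 accumulator, same FP16 inputs

print(acc16)   # 256.0 — stalls once 0.1 < half a ULP of the running sum
print(acc32)   # ≈ 409.5
```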
| type | meaning |
|---|---|
| forward | |ŷ − y| / |y| |
| backward | smallest |x̂ − x| / |x| such that ŷ = f(x̂) |
| stability | small backward error ⇒ "stable" |
a stable algorithm on an ill-conditioned problem is still inaccurate
| problem | safer rewrite |
|---|---|
| √(1+x) − 1 | x / (√(1+x) + 1) |
| 1 − cos(x) | 2·sin²(x/2) |
| log(1+x), x→0 | log1p(x) |
| exp(x)−1 | expm1(x) |
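Two of these rewrites checked in float64 with the standard math module:

```python
import math

x = 1e-12
print(math.sqrt(1 + x) - 1)            # ≈ 5.0004e-13 — half the digits are noise
print(x / (math.sqrt(1 + x) + 1))      # ≈ 5.0000e-13 — accurate
print(math.log(1 + 1e-17), math.log1p(1e-17))   # 0.0 vs 1e-17
print(math.exp(1e-17) - 1, math.expm1(1e-17))   # 0.0 vs 1e-17
```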
CUDA: log1pf, expm1f

Kahan compensated summation:
s = 0; c = 0              # s: running sum, c: carries the lost low-order bits
for x in xs:
    y = x - c             # undo the previous step's loss
    t = s + y             # add; may lose the LSBs of y
    c = (t - s) - y       # recover what was lost
    s = t
return s
adopted by some std::reduce implementations
caveat: -ffast-math / -O3 can optimize Kahan's (t − s) − y away to 0, defeating the compensation → use volatile or disable fast-math
| version | passes | memory |
|---|---|---|
| naive | 2 | logits + denom |
| stable 3-pass | 3 | +max |
| online (Milakov) | 1 | (m, l) state |
online softmax: streaming update of the (m, l) pair · the core of FlashAttention · ↗ V07 §2~§3
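A sketch of the one-pass (m, l) recurrence (the function name is illustrative):

```python
import math

def online_softmax_stats(xs):
    """One-pass (m, l): m = running max, l = Σ exp(x_i - m)."""
    m, l = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        l = l * math.exp(m - m_new) + math.exp(x - m_new)  # rescale the old sum
        m = m_new
    return m, l                      # softmax_i = exp(x_i - m) / l

xs = [1.0, 3.0, 2.0, 8.0]
m, l = online_softmax_stats(xs)
assert abs(l - sum(math.exp(v - max(xs)) for v in xs)) < 1e-12
```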
F.cross_entropy uses LSE (log-sum-exp) internally

master W (FP32) ──┐
▼ cast
W_fp16 ── forward ──► loss
│
◄── backward ── grad_fp16
│
▼ cast to FP32 + scale down
Δ (FP32) ──► update master W
Micikevicius et al. "Mixed Precision Training" ICLR 2018
| | FP16 | BF16 |
|---|---|---|
| range | narrow | = FP32 |
| precision | 10 bit | 7 bit |
| loss scaling | required | not needed |
| master W | FP32 required | FP32 recommended |
| divergence risk | grad underflow | update precision |
| HW support | V100+ | A100+ |
| variable | BF16 path | FP16 path |
|---|---|---|
| master W | FP32 | FP32 |
| W (compute) | BF16 | FP16 |
| activation | BF16 | FP16 |
| gradient | BF16 | FP16 (scaled) |
| accumulator | FP32 | FP32 |
| optimizer m/v | FP32 | FP32 |
| loss scale | — | dynamic |
torch.amp.autocast + GradScaler

| | static | dynamic |
|---|---|---|
| s | fixed hyperparameter | adapts at runtime |
| tuning | manual (pick a 2^k) | automatic |
| overflow | periodic manual checks | handled automatically |
| pro | predictable | robust |
| con | needs tuning | a few wasted steps |
init: s = 2^15
for step:
scaled_loss = loss * s
scaled_loss.backward()
# gradient = grad * s
if any(isinf(g) or isnan(g)): # overflow
s = s / 2 # shrink
skip optimizer step
reset counter
else:
unscale grad: g = g / s
optimizer.step()
counter += 1
if counter >= N: # stable window
s = s * 2 # try larger
counter = 0
PyTorch GradScaler default behavior · N = 2000 steps by default
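The same loop through the built-in API, a standard PyTorch AMP skeleton (model, optimizer, criterion, and loader are assumed to exist; init_scale set to the 2^15 above):

```python
import torch

scaler = torch.amp.GradScaler("cuda", init_scale=2.0**15, growth_interval=2000)

for inputs, targets in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast("cuda", dtype=torch.float16):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()    # backward runs on loss * s
    scaler.step(optimizer)           # unscales grads; skips the step on Inf/NaN
    scaler.update()                  # s /= 2 on overflow, s *= 2 after N clean steps
```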
overflow check: torch.isinf(g).any() / torch.isnan(g).any()
CUDA: __isnanf, __isinff intrinsics

distributed: independent per-rank scales → weight divergence
→ the scale is synchronized globally; if any rank overflows, every rank skips

protocol:
1. each rank checks Inf/NaN locally
2. allreduce-OR the overflow flags
3. if any rank overflowed: shrink s, skip the step
4. else: run the optimizer step
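Step 2 of the protocol as a torch.distributed sketch (MAX over {0,1} flags acts as the logical OR):

```python
import torch
import torch.distributed as dist

def sync_overflow(local_overflow: bool) -> bool:
    """Return True on every rank if any rank saw Inf/NaN (sketch;
    assumes the default process group is initialized)."""
    flag = torch.tensor([1.0 if local_overflow else 0.0], device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)   # MAX over {0,1} == OR
    return bool(flag.item())
```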
te.Linear, te.TransformerLayer as drop-in replacements (Transformer Engine)

history[N] (ring buffer):
step t : amax_t
step t-1 : amax_{t-1}
...
step t-N+1: amax_{t-N+1}
amax_used = max(history)
s = amax_used / FP8_max
step t:
compute x_t
compute amax_t (over the current tensor)
quant(x_t) with s_{t-1} ← the previous step's scale
push amax_t to history
step t+1:
s_t = max(history) / FP8_max
quant with s_t
cost: the scale lags one step behind · harmless for convergence (history size N ≥ 16)
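A compact sketch of the mechanism (the DelayedScale class is illustrative; Transformer Engine implements this internally):

```python
from collections import deque
import torch

FP8_E4M3_MAX = 448.0

class DelayedScale:
    """Illustrative amax-history scaling: step t is quantized with the
    scale derived from steps < t, so no extra pass over x is needed."""
    def __init__(self, history_len: int = 16):
        self.history = deque(maxlen=history_len)   # ring buffer of amax values
        self.scale = 1.0                           # s_{t-1}; arbitrary warm start

    def quantize(self, x: torch.Tensor) -> torch.Tensor:
        q = torch.clamp(x / self.scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
        # (cast of q to FP8 elided)
        self.history.append(x.abs().max().item())          # record amax_t
        self.scale = max(self.history) / FP8_E4M3_MAX      # s_t for step t+1
        return q
```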
| op | input | output |
|---|---|---|
| forward linear | FP8 E4M3 | BF16 |
| activation | BF16 | BF16 |
| backward wgrad | FP8 E5M2 | BF16 |
| backward dgrad | FP8 E5M2 | BF16 |
| master update | FP32 | FP32 |
forward:  W (E4M3) × X (E4M3) → Y (FP32 acc → BF16)
backward (grad output gY):
  dgrad: gY (E5M2) × W^T (E4M3) → gX
  wgrad: X^T (E4M3) × gY (E5M2) → gW
why: W, X have normalized, narrow distributions → E4M3
     gY can contain outliers (loss gradient) → E5M2
| format | S | E | M | bias | max |
|---|---|---|---|---|---|
| FP64 | 1 | 11 | 52 | 1023 | 1.8e308 |
| FP32 | 1 | 8 | 23 | 127 | 3.4e38 |
| TF32 | 1 | 8 | 10 | 127 | 3.4e38 |
| BF16 | 1 | 8 | 7 | 127 | 3.4e38 |
| FP16 | 1 | 5 | 10 | 15 | 65504 |
| FP8 E4M3 | 1 | 4 | 3 | 7 | 448 |
| FP8 E5M2 | 1 | 5 | 2 | 15 | 57344 |
| FP6 E3M2 | 1 | 3 | 2 | 3 | 28 |
| FP6 E2M3 | 1 | 2 | 3 | 1 | 7.5 |
| FP4 E2M1 | 1 | 2 | 1 | 1 | 6 |
| FP4 E3M0 | 1 | 3 | 0 | 3 | 128 |
IEEE 754 · OCP FP8/MX · NVFP4 whitepaper
| target | recommended dtype |
|---|---|
| master weight | FP32 |
| optimizer m/v | FP32 |
| forward activation | BF16 (or FP16) |
| forward weight (TC) | BF16 / FP8 E4M3 |
| gradient | BF16 / FP8 E5M2 |
| matmul accumulator | FP32, always |
| reduction | FP32 |
| layer norm stat | FP32 |
| softmax denom | FP32 |
| attention P out | BF16 (or FP8 [0,1] clamp) |
| KV cache | FP8 / INT8 (↗ V10) |
| loss | FP32 |
| strategy | master | compute | scale |
|---|---|---|---|
| FP32 baseline | FP32 | FP32 | — |
| FP16 AMP | FP32 | FP16 | dynamic loss scale |
| BF16 | FP32 | BF16 | none |
| FP8 (TE) | FP32 | FP8 | per-tensor delayed |
| MXFP8 (Blackwell) | FP32 | FP8 | per-block G=32 |
| NVFP4 | FP32 | FP4 | per-block G=16 |