Summary of "NVIDIA Ampere (RTX 30) Architecture Deep-Dive: RT Cores, GDDR6X vs. GDDR6, & More"
Overview / context
- This is an educational deep‑dive based on NVIDIA's first‑party disclosures (architects' presentation, press tech day, and the ~42‑page Ampere whitepaper). It is not a performance review. Reviewers (e.g., GamersNexus) are preparing RTX 3080 benchmarks and advise waiting for full reviews before making purchase decisions.
- Sponsor: Thermaltake Core P3 case. Memory partner noted: Micron (GDDR6X).
Process and silicon
- Process node
- Ampere (GA102) is built on Samsung “8N”.
- Turing (TU102) used TSMC 12 nm. Ampere has higher transistor density.
- Die and transistor counts (examples)
- GA102: ~28 billion transistors on ~628 mm².
- TU102: ~18.6 billion transistors on ~754 mm².
- Power
- GA102 reference TDP ≈ 320 W vs TU102 ≈ 250 W.
- NVIDIA states improved performance‑per‑watt despite the higher TDP.
Topology and interfaces
- Block/container changes
- Ampere reorganizes the GPC / TPC / SM layout. GA102 supports up to 7 GPCs and up to 84 SMs in a full configuration; the RTX 3080 Founders Edition uses 6 of 7 GPCs.
- PCIe and NVLink
- PCIe updated to Gen4.
- NVLink lane grouping changed from 2×8 to 4×4 (same total lanes but different allocation).
- ROPs and rasterization
- More ROP units, and ROPs are decoupled from memory/L2 (now tied to GPC). NVIDIA claims this improves rasterization performance in games.
Streaming Multiprocessors (SM)
- FP32 data path
- SMs can now execute 128 FP32 operations per clock (up from 64 on Turing): one datapath remains FP32‑only, while the second datapath, which previously handled only INT32 work, can now execute either FP32 or INT32. This boosts throughput for FP‑heavy workloads.
- L1 cache
- Increased per SM from 96 KB → 128 KB, with doubled bandwidth and larger cache partitions.
- NVIDIA cites a combined ~2.7× performance uplift for the SM subsystem (this refers to SM‑level subsystem improvements, not whole‑card speed).
- CUDA and FP64
- More flexible handling of FP32 vs integer work.
- FP64: GA102 has 168 FP64 units (~2 per SM); consumer cards remain suboptimal for heavy double‑precision compute.
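The FP32 and FP64 figures above can be sanity‑checked with simple arithmetic. A minimal sketch, assuming the RTX 3080's published spec‑sheet numbers (68 SMs and a 1.71 GHz boost clock, which are not stated in this summary):

```python
# Peak FP32 throughput sketch for the RTX 3080 (GA102 with 68 active SMs).
# Assumed figures: 68 SMs and a 1.71 GHz boost clock (spec-sheet values).
SMS = 68
FP32_PER_SM_PER_CLK = 128      # up from 64 per SM on Turing
BOOST_CLOCK_HZ = 1.71e9
FLOPS_PER_FMA = 2              # a fused multiply-add counts as 2 FLOPs

fp32_tflops = SMS * FP32_PER_SM_PER_CLK * FLOPS_PER_FMA * BOOST_CLOCK_HZ / 1e12
print(f"Peak FP32: {fp32_tflops:.1f} TFLOPS")   # ~29.8 TFLOPS

# FP64: 2 FP64 units vs. 128 FP32 lanes per SM -> 1/64 the FP32 rate
print(f"FP64:FP32 rate = 1/{128 // 2}")
```

The ~29.8 TFLOPS result lines up with NVIDIA's launch materials for the RTX 3080, and the 1/64 FP64 rate is why consumer Ampere remains a poor fit for heavy double‑precision compute.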
Third‑generation Tensor Cores (AI)
- Sparsity workflow (high level)
- Train networks densely.
- Prune to create sparse matrices while retaining near‑original accuracy.
- Use compressed sparse representations on hardware‑accelerated tensor cores.
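Ampere's hardware sparsity targets a 2:4 fine‑grained structured pattern (2 non‑zero values in every group of 4 weights). The pruning step above can be sketched in NumPy; this illustrates the pattern only and is not NVIDIA's actual pruning tooling:

```python
import numpy as np

# 2:4 fine-grained structured pruning: in every group of 4 weights,
# keep the 2 largest-magnitude values and zero out the other 2.
def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    w = weights.reshape(-1, 4).copy()
    # indices of the 2 smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

dense = np.array([0.9, -0.1, 0.05, -0.7, 0.2, 0.8, -0.6, 0.01])
sparse = prune_2_of_4(dense)
print(sparse)  # exactly half the entries are now zero
```

After pruning, the non‑zero values plus small per‑group indices form the compressed representation the tensor cores operate on; a fine‑tuning pass is typically used to recover near‑original accuracy.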
- Per‑SM tensor core changes
- Turing: 8 tensor cores/SM × 64 FP16 FMA → 512 FP16 FMA/SM (dense).
- Ampere: 4 tensor cores/SM × 128 FP16 FMA (dense) → 512 FP16 FMA/SM (parity on dense).
- With sparsity: Ampere can do 4 × 256 FP16 FMA → 1,024 FP16 FMA/SM (2× improvement when sparsity is used).
- Real‑world impact
- NVIDIA quotes substantial tensor TFLOPS uplift when using hardware sparsity. Actual gains depend on workload and software integration (for example, DLSS 2.0 benefits from these improvements).
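The per‑SM figures above reduce to simple arithmetic, which makes the dense‑parity and 2× sparsity claims easy to verify:

```python
# Per-SM FP16 tensor throughput (FMA per clock), from the summary's figures:
turing_dense = 8 * 64      # 8 tensor cores/SM x 64 FP16 FMA each
ampere_dense = 4 * 128     # 4 tensor cores/SM x 128 FP16 FMA each
ampere_sparse = 4 * 256    # with 2:4 sparsity, each core doubles its rate

print(turing_dense, ampere_dense, ampere_sparse)  # 512 512 1024
print(ampere_sparse / turing_dense)               # 2.0x, but only with sparsity
```

Note that dense workloads see parity per SM; the 2× figure applies only when the network has been pruned to the supported sparse format.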
Second‑generation RT (ray tracing) cores
- Role
- RT cores remain dedicated for BVH traversal and intersection handling.
- Hardware improvements
- Improved triangle intersection unit: reported ~2× triangle intersection throughput to reduce a common bottleneck.
- Added triangle position interpolation unit: supports motion blur in ray tracing by interpolating triangle positions over time.
- Performance claims
- NVIDIA reports RT TFLOPS increases (marketing examples show ~1.7× RT TFLOPS in some slides).
- Combined pipeline claims include up to ~44% reduction in frame production time when SM, Tensor, and RT improvements are considered together. Real gains depend on the full pipeline and the specific workload.
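Conceptually, the new triangle position interpolation unit evaluates a moving triangle at the time carried by each ray. A minimal sketch using linear interpolation between shutter‑open and shutter‑close vertex positions (the lerp scheme and function names are illustrative assumptions, not NVIDIA's hardware logic):

```python
# Motion-blur triangle interpolation sketch: each ray carries a time t in
# [0, 1]; the triangle's vertices are blended between their positions at
# shutter-open (t=0) and shutter-close (t=1) before intersection testing.
Vec3 = tuple[float, float, float]

def lerp(a: Vec3, b: Vec3, t: float) -> Vec3:
    return tuple(a_i + (b_i - a_i) * t for a_i, b_i in zip(a, b))

def triangle_at_time(v0_pair, v1_pair, v2_pair, t: float):
    """Interpolate each vertex between its (start, end) positions."""
    return (lerp(*v0_pair, t), lerp(*v1_pair, t), lerp(*v2_pair, t))

# A triangle moving +1 unit in x over the shutter interval, sampled at t=0.5:
tri = triangle_at_time(
    ((0, 0, 0), (1, 0, 0)),
    ((1, 0, 0), (2, 0, 0)),
    ((0, 1, 0), (1, 1, 0)),
    t=0.5,
)
print(tri)  # each vertex is halfway along its motion path
```

Doing this per ray in hardware is what lets ray‑traced motion blur run without the software fallback cost Turing incurred.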
Memory (GDDR6X) and signaling
- Configuration and bandwidth
- GA102 uses GDDR6X (Micron) on a 320‑bit memory bus running at 19 Gb/s per pin, yielding ~760 GB/s of bandwidth (roughly 23% more than the RTX 2080 Ti's 616 GB/s of GDDR6).
- Signaling: PAM4
- GDDR6X uses PAM4 signaling (4 voltage levels per symbol) to transmit 2 bits per symbol, versus 1 bit per symbol with conventional binary (NRZ) signaling.
- Micron reliability/throughput techniques
- Maximum Transition Avoidance (MTA) coding: avoids voltage jumps greater than one level to reduce errors.
- Adaptive sampling/training: centers the sample point within the multilevel “eye” for reliable reads.
- Unit clarification
- Common confusion: Gb/s (gigabits per second) ≠ GB/s (gigabytes per second). For example, 19 Gb/s per pin works out to 2.375 GB/s per pin, not 19 GB/s.
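The bandwidth and unit points above reduce to a short calculation, using only figures from this section:

```python
import math

# Bandwidth arithmetic for the summary's GDDR6X configuration:
# a 320-bit bus at 19 Gb/s per pin (gigabits, not gigabytes).
BUS_WIDTH_BITS = 320
DATA_RATE_GBPS = 19

bandwidth_gb_s = BUS_WIDTH_BITS * DATA_RATE_GBPS / 8   # bits -> bytes
print(f"{bandwidth_gb_s:.0f} GB/s")                    # 760 GB/s

# PAM4 uses 4 voltage levels, so each symbol encodes log2(4) = 2 bits,
# halving the symbol rate needed for a given data rate vs. binary (NRZ).
print(math.log2(4))   # 2.0 bits per symbol
```

The divide‑by‑8 step is exactly the Gb/s‑vs‑GB/s distinction the video calls out.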
Other topics and future coverage
- L0/L2 cache and asynchronous compute were mentioned (e.g., NVIDIA overlapping DLSS for frame N with ray tracing for frame N+1 to reach low frame‑time figures like 6.7 ms).
- The whitepaper and the hour‑long architect presentation contain many lower‑level details; further deep dives (cache behavior, concurrency, microarchitectural specifics) are implied as future content.
Product & review notes
- This architectural deep‑dive precedes full benchmarking and reviews. GamersNexus and other outlets warn not to judge final product performance solely from architecture disclosures.
- RTX 3080 FE uses GA102 with 6 of 7 GPCs. Additional SKUs (e.g., a possible 3080 Ti) could appear between the 3080 and 3090.
Key takeaways
- Ampere is an architectural evolution with several notable hardware changes:
- Wider FP32 data paths in SMs and larger/faster L1 cache.
- Tensor cores optimized for sparse math, improving AI inferencing throughput when sparsity is used.
- Faster RT core triangle handling and added motion‑blur support for ray tracing.
- Much faster GDDR6X memory using PAM4 signaling with Micron reliability measures.
- NVIDIA’s quoted uplift numbers typically refer to specific subsystems or synthetic scenarios; whole‑system performance requires validation by reviews and real benchmarks.
Main sources / speakers
- Primary sources: NVIDIA (architects’ presentation, press tech day, Ampere whitepaper).
- Video presenter / publisher: GamersNexus (produced the deep‑dive and is preparing the RTX 3080 review).
- Technology partners referenced: Micron (GDDR6X memory).
- Sponsor: Thermaltake (Core P3 case).
This summary captures the technical content and sources presented in the video.