The A100 defined the modern AI data center; the H100 is the generation that made large language models economical to train and serve at scale. With H100 supply normalized and A100s abundant on the secondary market, this comparison has shifted from an availability question to a genuine value decision—and the right answer depends entirely on which decade of workload you are running.
Key Takeaways
- Hopper's generational gains concentrate in transformer workloads: the Transformer Engine with FP8 is the headline, not the raw FLOPS line.
- Memory bandwidth (~3.35TB/s vs ~2TB/s) and fourth-generation NVLink (900GB/s vs 600GB/s) widen the gap as jobs span more GPUs.
- For LLM training and serving, H100 advantages compound into multiples, not percentages; for classic ML and CV, discounted A100s remain superb value.
- Run the decision on cost per unit of your work—per token, per epoch, per batch—not on list price or benchmark charts.
01Two generations, one decision
When NVIDIA shipped the A100 in 2020, it reset expectations for what a data-center accelerator was: a single card that trained vision models, served recommenders, and partitioned itself into seven isolated slices for inference tenants. Four years of software maturation made it the most thoroughly understood GPU in production history—every framework optimized for it, every failure mode documented, every ops team fluent in it.
The H100 arrived in 2022 into a different world: transformer architectures had eaten machine learning, model sizes were doubling on a cadence, and the economics of AI had become the economics of attention layers. Hopper was designed for exactly that world, and the design shows. The comparison below is therefore less “old versus new” than “general-purpose excellence versus transformer specialization”—and your workload mix decides which philosophy pays.
02Architecture deep dive
Ampere: the all-rounder that built the era
The A100's Ampere architecture introduced third-generation Tensor Cores with TF32—a precision that accelerated FP32-era code without changes—plus structural sparsity support and Multi-Instance GPU (MIG) partitioning. Its 80GB HBM2e variant at roughly 2TB/s of bandwidth became the workhorse configuration of the era. Ampere's genius was breadth: vision, speech, recommenders, scientific computing, and the first wave of large language models all ran well, which is why the install base became enormous and the software stack became bulletproof.
Hopper: built for the attention economy
The H100's defining feature is the Transformer Engine: fourth-generation Tensor Cores paired with software that dynamically selects FP8 or FP16 precision layer by layer, preserving accuracy while roughly doubling throughput wherever 8-bit math is safe. Around the cores, every pipe widened—HBM3 memory at ~3.35TB/s, fourth-generation NVLink at 900GB/s per GPU, PCIe Gen5 to the host, and a dedicated Tensor Memory Accelerator that moves data asynchronously so compute units stop waiting on addresses. Hopper also added DPX instructions for dynamic-programming workloads and confidential-computing support for regulated deployments.
03Spec-by-spec comparison
| Specification | A100 (SXM, 80GB) | H100 (SXM5, 80GB) |
|---|---|---|
| Architecture / process | Ampere, 7nm | Hopper, 4N |
| Tensor Cores | 3rd generation (TF32/BF16/FP16/INT8) | 4th generation + Transformer Engine (adds FP8) |
| Memory | 80GB HBM2e | 80GB HBM3 |
| Memory bandwidth | ~2.0 TB/s | ~3.35 TB/s |
| NVLink (per GPU) | 600 GB/s (gen 3) | 900 GB/s (gen 4) |
| Host interface | PCIe Gen4 | PCIe Gen5 |
| FP8 Tensor throughput | — (not supported) | ~4× A100's FP16 rate (with sparsity) |
| MIG partitions | Up to 7 | Up to 7 (2nd-gen MIG) |
| TDP (SXM) | 400W | 700W |
| Confidential computing | — | Supported |
| Typical market position (2024) | Strong secondary-market value | Premium, supply normalized |
Read the table with one eye on your workload: almost every H100 win above—FP8, bandwidth, NVLink—compounds specifically in large-transformer training and serving. The A100's wins—power draw and price—compound in everything that does not need Hopper's specialization.
04Training performance in practice
For transformer pre-training and fine-tuning, published results and field experience agree on the shape: H100 delivers roughly 2–3× A100 throughput per GPU on large language models when FP8 is engaged, with the multiple growing at cluster scale because faster NVLink and NVSwitch keep gradient exchange off the critical path. A 70B-parameter fine-tune that occupies an A100 cluster for a week finishes in roughly half the wall-clock time or less on an equivalent H100 count—which compounds into faster iteration, not just lower cost.
Outside transformers the gap narrows honestly. Convolutional vision training, classic recommender models, and FP32-heavy scientific workloads see meaningful but unspectacular gains—typically tens of percent rather than multiples—because they cannot exploit FP8 and were rarely memory-bandwidth-bound on Ampere to begin with. If this is your workload census, the H100 premium buys you little that an A100 discount does not buy better.
The scaling story
- Single node (8 GPUs): H100's NVLink advantage shows in tensor-parallel layers; expect the per-GPU multiple to hold.
- Multi-node: H100 systems pair with 400Gb-class fabrics; at equal fabric tiers the H100's compute density means fewer nodes per job—less fabric, fewer failure points, simpler parallelism maps.
- Software maturity: new optimizations land Hopper-first—FP8 recipes, attention kernels, inference engines. A100 performance is stable; H100 performance keeps improving after purchase, a real if unquantifiable dividend.
05Inference economics
Inference is where the comparison becomes a business-model question. LLM serving is dominated by two resources—memory bandwidth (token generation is a bandwidth workload) and KV-cache capacity (context length lives in VRAM). The H100 improves the first by roughly 70% and, via FP8 weights and KV-cache, effectively expands the second. In practice that means roughly 2× or better tokens-per-second per GPU on popular open-weight models, longer feasible context windows, and more concurrent sessions per card.
The A100 counters with price. At secondary-market rates, two A100s can cost less than one H100—and for small-to-mid models in the 7B–13B class, batch-tolerant workloads, or internal tools without aggressive latency SLOs, A100 serving remains thoroughly economical. The crossover arrives with scale and model size: once you are provisioning fleets for 30B+ models or chasing cost-per-million-tokens at volume, the H100 wins the spreadsheet despite the sticker.


06Total cost of ownership
List-price comparisons mislead in both directions; the honest model prices the system and the work:
- Power and cooling: the H100's 700W versus the A100's 400W matters at fleet scale—but throughput-per-watt on transformer work still favors Hopper decisively. Price per delivered unit of work, not per rack-watt.
- Node and fabric amortization: if H100s finish the same job with half the nodes, you also bought half the switches, cables, host CPUs, and failure surface. This routinely closes most of the sticker gap on training clusters.
- Utilization risk: premium silicon idling is premium waste. Below roughly 50% sustained utilization, the cheaper fleet usually wins regardless of generation.
- Resale and lifecycle: A100s bought today are bridge assets—price them with a planned second life as an inference or development tier. H100s carry longer relevance but also a successor (H200) already reshaping their premium.
07The decision framework
- Census your workloads. What share is transformer training/serving versus classic ML, vision, and scientific compute? The H100 case strengthens linearly with the first number.
- Check the memory walls. Models or contexts that do not fit A100 VRAM at your serving precision settle the argument immediately.
- Model cost per unit of work. Tokens, epochs, batches—quote both fleets against the same monthly workload, including power and fabric.
- Mind utilization honestly. Bursty demand favors the cheaper fleet or rented burst capacity; steady pipelines justify premium silicon.
- Consider the mixed fleet. The mature answer is rarely all-or-nothing: H100s for the transformer tier, A100s absorbing everything else, routed by scheduler policy. Most well-run estates we operate land exactly here.
08Frequently asked questions
Is the A100 obsolete now that the H100 is widely available?
Does FP8 hurt model accuracy?
Can I mix A100 and H100 nodes in one cluster?
Should I wait for the H200 instead of buying either?
09Verdict
Choose the H100 when…
- Transformers pay your bills: LLM pre-training, fine-tuning, high-volume serving
- Context lengths and model sizes are pressing against A100 memory
- Cluster scale makes NVLink/fabric efficiency compound
- Cost per token at volume is the metric your CFO reads
Choose the A100 when…
- Workloads are vision, recommenders, classic ML, or FP32 scientific compute
- You are serving small-to-mid models without aggressive latency SLOs
- Budget pressure meets a strong secondary market
- You are building dev/experimentation tiers under a premium training fleet
The honest summary: if transformers are your business, the H100 already won—buy it for the cost curve, not the bragging rights. If they are not, the A100's discount is doing more work than most procurement teams realize. And if your estate is large enough to contain both answers, run both tiers deliberately—that is not indecision; that is portfolio management.
Sizing an H100 or A100 deployment?
Talk to the Semifly team about workload profiling, fleet design, and managed GPU infrastructure.
Contact Us

