2025-12-01

When the Shovel Seller Meets the Power Company: Nvidia’s Reign, Google’s Countermove, and the AI Hardware Endgame

Dr. Pongsak Wonglertkunakorn

  • Workplace Consultant
  • Ph.D. in Management from National Institute of Development Administration
  • M.S. in Computer and Information Science from University of Pennsylvania
  • B.Eng. in Computer Engineering from Chulalongkorn University

Executive summary

  • Nvidia turned the AI boom into a GPU empire by owning the full stack from CUDA to DGX systems.
  • Google isn’t just “catching up”: it has been iterating on Tensor Processing Units (TPUs) for roughly a decade and is leaning into vertical integration (chip → system → compiler → model → cloud).
  • The next competitive frontier isn’t raw FLOPs—it’s energy efficiency per trained/served token, interconnect bandwidth, and software portability.
  • AWS (Trainium/Inferentia), AMD (MI300/MI325), and Alibaba are pushing credible alternatives, shifting the market from monopoly to optionality.
  • Buyers will rebalance CapEx vs OpEx as silicon refresh cycles compress; the safest bet is architectures that deliver end-to-end efficiency rather than just peak benchmarks.

1. From Gold Rush to Grid Constraints

The last year made one truth impossible to ignore: Nvidia became the default seller of “picks and shovels” for the AI gold rush. If your roadmap said “train a frontier-scale model” or “serve billions of tokens a day,” you probably queued for H100s.

But markets never allow a single king forever. And the real bottleneck ahead isn’t only supply; it’s power. Hyperscale AI campuses are measured in hundreds of megawatts, not racks. The winners won’t just be faster; they’ll be the ones who turn joules into tokens most efficiently.

2. Google’s Quiet Decade and the TPU Bet

Google didn’t wake up yesterday. It has shipped seven TPU generations internally, with commercial access in Google Cloud since 2018. The strategy difference is key:

  • Then: TPUs sold as a service (you don’t buy chips; you rent them in GCP).
  • Now: With AI demand exploding, Google is advertising a full chip→system→compiler→model stack: TPUs + XLA/PJRT compilers, JAX/TF, Gemini, and data center fabric tightly tuned for ML.

This is classic vertical integration. Teams building models (DeepMind/Gemini) sit next to teams building chips and compilers. If the modelers need a new primitive (say, better attention kernels or sparsity patterns), the silicon and software can co-evolve. That tight loop is hard for a merchant-silicon model to match.

3. Tech Deep Dive: GPU vs TPU (and why efficiency is the real moat)

Compute cores & math

  • Nvidia GPUs: Highly parallel multipurpose cores with Tensor Cores accelerating FP16/BF16/FP8; great for AI and still flexible for graphics/HPC.
  • Google TPUs: Matrix-multiply engines (think systolic arrays) optimized for dense linear algebra with BF16/FP8-class formats and sparsity support; lean and purpose-built for ML.

Interconnect & scale
  • Nvidia: NVLink/NVSwitch + InfiniBand/Ethernet fabrics; strong collective-ops performance across multi-GPU nodes.
  • Google: Custom TPU interconnect (ICI) and datacenter fabric designed with compiler awareness; the scheduling stack knows the network, not just the chip.

Software stacks
  • Nvidia: The CUDA ecosystem (cuBLAS, cuDNN, NCCL, Triton compiler) is the moat—decade-deep libraries, kernels, profilers, and a massive developer base.
  • Google: XLA/PJRT, JAX/TensorFlow, and increasingly robust PyTorch pathways aim to make model graphs portable. The pitch: same code path, better efficiency on TPUs inside GCP.

Why this matters now

As models grow, so does the share of time in collective comms, optimizer steps, parameter sharding, and KV-cache movement. The stack that coordinates chip + memory + interconnect + compiler with the fewest wasted joules wins. That’s not marketing; that’s physics and scheduling.
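
To make that concrete, here is a minimal JAX sketch (hypothetical shapes, plain data parallelism, not anyone’s production code). The matmul inside loss_fn is the “useful math” that Tensor Cores or systolic arrays accelerate; the pmean all-reduce is the collective whose cost is set by NVLink/ICI/fabric and the scheduling stack, and whose share of step time grows as you shard across more accelerators:

```python
from functools import partial

import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    # Dense linear algebra: the part Tensor Cores / systolic arrays accelerate.
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="devices")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    # Cross-device all-reduce: the part whose cost is set by the interconnect,
    # not by peak FLOPs.
    grads = jax.lax.pmean(grads, axis_name="devices")
    return w - 1e-3 * grads

n = jax.local_device_count()                 # 1 on a laptop; 8+ on a GPU/TPU host
w = jnp.zeros((512, 256))
w_rep = jnp.broadcast_to(w, (n,) + w.shape)  # replicate weights across devices
x = jnp.ones((n, 32, 512))                   # shard the batch across devices
y = jnp.ones((n, 32, 256))
w_rep = train_step(w_rep, x, y)
print(w_rep.shape)                           # (n, 512, 256)
```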

4. The Energy Equation: Joules per Token

Training and serving LLMs at scale is turning into a power-budgeting problem (a quick back-of-the-envelope sketch follows the list below):

  • Cost per trained parameter and cost per 1M served tokens are bounded by energy per FLOP and network efficiency.
  • If Google can prove lower $/token (end-to-end) for mainstream LLM workloads, CFOs and governments will listen—especially as grid constraints tighten and carbon policies bite.
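
A quick back-of-the-envelope illustration, where every number is an assumption chosen for round math rather than a measurement: small improvements in joules per token flow straight into the electricity line of $/1M tokens (hardware amortization, networking, and margin then sit on top of this):

```python
# Purely illustrative assumptions, not measurements.
JOULES_PER_TOKEN = 0.5        # assumed end-to-end serving energy per token
PUE = 1.2                     # assumed facility overhead (cooling, power delivery)
USD_PER_KWH = 0.08            # assumed industrial electricity price

JOULES_PER_KWH = 3.6e6
tokens_per_kwh = JOULES_PER_KWH / (JOULES_PER_TOKEN * PUE)
energy_usd_per_million_tokens = (1e6 / tokens_per_kwh) * USD_PER_KWH

print(f"{tokens_per_kwh:,.0f} tokens per kWh")
print(f"${energy_usd_per_million_tokens:.4f} of electricity per 1M served tokens")
```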

Nvidia knows this. Architectures like Blackwell (and what comes after) are laser-focused on FP8/FP4 efficiency, memory bandwidth, and interconnect gains—and on keeping developers inside CUDA’s gravity well.
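
A rough, assumption-only illustration of why those shrinking datatypes matter, using a generic 70B-parameter dense model rather than any specific product: at low batch sizes, decoding reads roughly the whole weight set per generated token, so bytes per weight translate directly into memory-capacity and bandwidth pressure:

```python
# Generic 70B dense model; numbers are illustrative, not a benchmark.
PARAMS = 70e9
BYTES_PER_WEIGHT = {"FP16/BF16": 2.0, "FP8": 1.0, "FP4": 0.5}

for fmt, nbytes in BYTES_PER_WEIGHT.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{fmt:>10}: {gb:,.0f} GB of weights to store and stream from memory")
```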

5. The Market Is Moving From Monopoly to Optionality

It’s not just Google:

  • AWS rolled out Trainium (training) and Inferentia (inference), now in later generations, with Neuron SDK improving PyTorch pathways.
  • AMD has momentum with MI300/MI325 and a fast-maturing ROCm stack; several hyperscalers are qualifying AMD for both training and inference.
  • Alibaba and other clouds in Asia are signaling vertical stacks of their own.

Result: customers can mix and match by workload—Nvidia for max flexibility, TPUs/Trainium for cost-optimized LLMs, AMD where price/perf and availability line up. That re-introduces pricing power to buyers, not just vendors.

6. CapEx vs OpEx: Don’t Marry a Chip, Marry an Outcome

Buying racks of GPUs felt smart when demand outstripped supply. But silicon depreciates fast as data types shrink (FP8→FP4), memory stacks improve, and kernels get smarter. Many CTOs are now asking:

  • For volatile workloads, do we rent efficiency in the cloud and avoid being stuck with last year’s silicon?
  • For steady, predictable training runs, do we own to control cost and schedule?
  • Can we make our training graph portable (via Triton/XLA/PJRT) so we can arbitrage vendors without rewriting the model? (A minimal sketch follows below.)

The pragmatic answer is a hybrid: anchor capacity where you’re most efficient, burst elsewhere, and keep your model code as vendor-agnostic as you can.
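
Here is a minimal sketch of what “vendor-agnostic model code” can look like, assuming JAX (a similar idea applies to PyTorch via Triton or XLA backends): nothing in the function names a vendor, and XLA/PJRT compiles it for whichever backend the process sees at deploy time.

```python
import jax
import jax.numpy as jnp

@jax.jit
def mlp_forward(params, x):
    # Two-layer MLP; identical source on CPU, CUDA GPUs, or Cloud TPUs.
    h = jax.nn.relu(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = {
    "w1": jax.random.normal(k1, (128, 256)) * 0.02,
    "b1": jnp.zeros(256),
    "w2": jax.random.normal(k2, (256, 10)) * 0.02,
    "b2": jnp.zeros(10),
}
x = jnp.ones((32, 128))

print("backend:", jax.default_backend())          # "cpu", "gpu", or "tpu"
print("output shape:", mlp_forward(params, x).shape)
```

Run the same file on a CUDA box, a Cloud TPU VM, or a laptop and only the reported backend changes; switching vendors becomes a deployment decision rather than a rewrite.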

7. For Engineering Leaders: What to Measure Next

  • End-to-end $/token (train + serve), not just peak TFLOPs.
  • Utilization: time spent doing useful math vs. waiting on memory/collectives.
  • Compiler wins: XLA/Triton graph-level fusions, kernel autotuning, quantization.
  • Interconnect health: all-reduce efficiency at your target scale (hundreds to thousands of accelerators).
  • Portability: how painful is it to land the same model on CUDA, XLA, and Neuron?
  • Power envelope: tokens per kWh, not just tokens per second (a measurement sketch follows this list).
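
As a worked example of the first, second, and last bullets, here is a small measurement sketch. All inputs are assumed log values rather than real benchmarks, and the 6-FLOPs-per-parameter-per-token rule of thumb applies to dense transformers only:

```python
# Assumed log values for illustration; substitute your own measurements.
tokens_per_step = 2_000_000      # global batch size, in tokens
step_time_s = 8.0                # measured wall-clock time per step
n_params = 70e9                  # dense model size
n_chips = 256
peak_flops_per_chip = 9.9e14     # vendor peak for your datatype (e.g. BF16), per chip
avg_power_kw = n_chips * 0.7     # measured accelerator draw only (excludes host/cooling)

# ~6 FLOPs per parameter per token is the usual rule of thumb for dense transformers.
achieved_flops_per_s = 6 * n_params * tokens_per_step / step_time_s
mfu = achieved_flops_per_s / (peak_flops_per_chip * n_chips)

tokens_per_s = tokens_per_step / step_time_s
tokens_per_kwh = tokens_per_s * 3600 / avg_power_kw

print(f"MFU: {mfu:.1%}")                     # ~41% with these assumed numbers
print(f"tokens/kWh: {tokens_per_kwh:,.0f}")
```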

8. And for Investors: Beyond the Hype Curve

This isn’t “Nvidia loses, everyone else wins.” It’s a broadening of the field:

  • Nvidia still has unmatched developer mindshare and a juggernaut software stack.
  • Google’s edge is integrated efficiency across chip, compiler, model, and cloud.
  • AWS and AMD provide credible price/perf and supply alternatives.
  • Apple won’t fight in hyperscale training, but its silicon leadership and on-device AI may steer edge inference expectations that ripple back to the cloud.

The capital cycle will favor firms that translate watts into tokens most efficiently—and do so with a stack that developers actually like to use.

Bottom line

The AI era isn’t just beginning; it’s evolving. The battlefield is shifting from raw speed to orchestrated efficiency—the harmony of silicon, interconnect, compiler, and model. Nvidia’s CUDA moat remains deep, but Google’s vertically integrated TPU stack—and the rise of AWS/AMD/Alibaba options—means customers finally have leverage.
