Gradient Brief

Issue No. 01 • April 28, 2026

MLOps & AI Infrastructure — for the engineers building it


Google Cloud Next '26 Unveils AI Infrastructure for the Agentic Era

The shift from conversational AI to autonomous, multi-step "agentic" workflows is forcing changes in the underlying infrastructure. At Google Cloud Next '26, the company unveiled its eighth-generation Tensor Processing Units (TPUs), explicitly designed to handle the complex reasoning and reinforcement learning loops required by AI agents.

The new lineup splits into two specialized chips. The TPU 8t targets training: a single superpod packs 9,600 chips to deliver 121 exaflops of compute and two petabytes of shared memory. According to Google, this architecture provides nearly 3x higher compute performance than previous generations, allowing teams to compress months of training into weeks. For inference and reinforcement learning, the TPU 8i triples on-chip SRAM to 384 MB and increases high-bandwidth memory to 288 GB, a design intended to break the memory wall by hosting large KV caches entirely on silicon. This design yields an 80% improvement in performance per dollar for inference.
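To put the superpod figures in per-chip terms, here is a back-of-envelope sketch. It assumes decimal (SI) units and an even split across chips, which is an idealization. The KV-cache sizing uses the standard keys-plus-values estimate with a hypothetical model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) that does not come from Google's announcement, and it ignores the memory taken by model weights.

```python
# Back-of-envelope figures; decimal (SI) units assumed throughout.
EXA, PETA, GIGA, MEGA = 10**18, 10**15, 10**9, 10**6


def per_chip(total: float, chips: int) -> float:
    """Even share of a pod-level total across its chips (an idealization)."""
    return total / chips


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of keys plus values (the factor 2) kept for one token, all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Superpod figures quoted in the article: 121 exaflops, 2 PB, 9,600 chips.
flops_per_chip = per_chip(121 * EXA, 9_600)
mem_per_chip = per_chip(2 * PETA, 9_600)

# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)

# How many tokens of KV cache fit if the whole memory held nothing else.
tokens_in_sram = (384 * MEGA) // per_token
tokens_in_hbm = (288 * GIGA) // per_token

print(f"{flops_per_chip / PETA:.1f} petaflops per chip")
print(f"{mem_per_chip / GIGA:.0f} GB of shared memory per chip")
print(f"{per_token} bytes of KV cache per token")
print(f"{tokens_in_sram} tokens fit in SRAM, {tokens_in_hbm} in HBM")
```

Swapping in your own model's layer count, KV head count, and precision changes the token capacity linearly, so the same three lines of arithmetic work for any deployment you are sizing.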

To connect this compute, Google introduced the Virgo Network, a data center fabric capable of linking 134,000 TPUs in a single facility and over one million across multiple sites. The rollout aligns with Alphabet's capital expenditure guidance of $175 billion to $185 billion for 2026, nearly double its 2025 spend, which signals a sustained bet on the physical layer of AI. For MLOps teams, this means the hardware bottleneck is shifting from raw compute availability to network fabric and memory bandwidth optimization.

Tool of the Week: llm-d (v0.5)

Use Case: Kubernetes-native distributed LLM inference scheduling.

As teams scale LLM deployments on Kubernetes, simple round-robin load balancing quickly breaks down, leading to poor GPU utilization and wasted KV cache. llm-d is an open-source inference scheduler that addresses this by integrating with KServe and vLLM. Its prefix-cache-aware routing directs each request to the vLLM instance that already holds the relevant context in memory. It also supports prefill/decode disaggregation, allowing teams to assign compute-heavy prefill tasks and latency-sensitive decode tasks to different GPU groups.
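The idea behind prefix-cache-aware routing can be shown with a toy router. This is a minimal sketch of the general technique, not llm-d's actual scheduler or API: `Replica`, `route`, `prefix_blocks`, and the block size of 4 tokens are all hypothetical names and values chosen for illustration. Each replica remembers which prompt prefixes it has served, and a new request goes to the replica with the longest cached prefix, with ties broken by the lowest number of in-flight requests.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block; a made-up value for this demo


def prefix_blocks(tokens: list[int]) -> list[tuple[int, ...]]:
    """Cumulative prefixes cut at block boundaries, one per cacheable block."""
    return [tuple(tokens[:end]) for end in range(BLOCK, len(tokens) + 1, BLOCK)]


@dataclass
class Replica:
    """Hypothetical stand-in for one inference server instance."""
    name: str
    cached: set[tuple[int, ...]] = field(default_factory=set)
    in_flight: int = 0

    def hit_blocks(self, tokens: list[int]) -> int:
        """Count leading blocks of this prompt already held in the cache."""
        hits = 0
        for block in prefix_blocks(tokens):
            if block not in self.cached:
                break
            hits += 1
        return hits


def route(replicas: list[Replica], tokens: list[int]) -> Replica:
    """Pick the replica with the most cached prefix blocks, then the least load."""
    best = max(replicas, key=lambda r: (r.hit_blocks(tokens), -r.in_flight))
    best.cached.update(prefix_blocks(tokens))  # the prompt is now cached there
    best.in_flight += 1
    return best


replicas = [Replica("a"), Replica("b")]
system_prompt = list(range(8))  # shared 8-token prefix, e.g. a system prompt

first = route(replicas, system_prompt + [100, 101, 102, 103])
second = route(replicas, system_prompt + [200, 201, 202, 203])
third = route(replicas, [900, 901, 902, 903])  # unrelated prompt

print(first.name, second.name, third.name)
```

Round-robin would have sent the second request to replica `b` and recomputed the shared prefix there. Here it follows the cached prefix to `a`, while the unrelated third request falls back to the less loaded replica `b`.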

In production, for example at Tesla, adopting llm-d's cache-aware routing resulted in a 3x improvement in output tokens per second and a 2x reduction in time to first token.

Quick Hits

  • VAST Data Hits $30B Valuation: The Nvidia-backed AI data platform raised ~$1 billion in a Series F round led by Drive Capital and Access Industries. VAST provides the storage architecture for major players like xAI and CoreWeave.
  • Cursor Eyes $50B Valuation: Anysphere, the startup behind the popular AI coding assistant Cursor, is reportedly in talks to raise $2 billion in a round co-led by a16z and Thrive Capital, highlighting the explosive growth of AI developer tools.
  • Snowflake's Agentic Push: Snowflake announced expansions to Snowflake Intelligence and Cortex Code, positioning its platform as the "control plane for the agentic enterprise" by allowing developers to build and orchestrate AI directly within their data ecosystem.

Gradient Brief is published for ML engineers, data scientists, and technical founders. Forward to a colleague who should be reading this.
