
Issue No. [XX] • April 23, 2026
Gradient Brief
MLOps & AI Infrastructure — for the engineers building it
Lead Story
Google Cloud Next '26 Rewires the AI Infrastructure Stack
Google Cloud Next '26 in Las Vegas this week was not a product announcement event so much as an architectural declaration. The headline move was the formal retirement of Vertex AI as a standalone brand. In its place, Google has launched the Gemini Enterprise Agent Platform, which it describes as "the evolution of Vertex AI" — absorbing all existing Vertex services, roadmap items, and custom training capabilities under a single umbrella. The platform introduces Agent Studio (the renamed and now generally available Agent Designer), Agent Simulation for stress-testing multi-step agentic workflows against thousands of synthetic scenarios before deployment, and Agent Registry for governed versioning and access control. Critically, Anthropic's Model Context Protocol (MCP) now spans every Google Cloud and Workspace service, meaning any agent built on the platform can address the entire GCP surface through a single protocol.
On the hardware side, Google announced its eighth-generation Tensor Processing Units with a notable architectural split. The TPU 8t is the training variant: 9,600 chips per superpod, 121 exaflops of compute, two petabytes of shared memory, and doubled inter-chip interconnect (ICI) bandwidth. The TPU 8i is purpose-built for inference and reinforcement learning, tripling on-chip SRAM to 384 MB, raising HBM to 288 GB, and introducing a dedicated Collectives Acceleration Engine that reduces on-chip latency by up to 5x. Google claims the TPU 8i delivers 80% better performance per dollar for inference than the prior generation. Underpinning both is the new Virgo Network, a collapsed fabric architecture with 4x the bandwidth of previous generations, capable of connecting 134,000 TPUs in a single data center or more than one million TPUs across sites into a single logical training cluster.
For teams running workloads on Kubernetes, the week also brought the release of Kubernetes v1.36 "Haru" (April 22). The release's 70 enhancements include two features of direct relevance to ML platform engineers. First, Workload Aware Scheduling (WAS) enters Alpha, introducing a decoupled PodGroup API that allows the scheduler to treat distributed job pods as a single logical entity. Unlike the gang scheduling in v1.35, which required a minimum number of schedulable pods, v1.36's PodGroup cycle evaluates the entire group atomically: all pods bind together or none do. Second, the allocatedResourcesStatus field on Pod status graduates to Beta, giving platform teams a native way to surface hardware health — including GPU failures from both Device Plugin and DRA-provisioned hardware — directly via kubectl describe pod. The Dynamic Resource Allocation (DRA) extended resource feature also graduates to Beta in this release, expanding native ResourceClaim support for specialized hardware.
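For platform teams that want to act on the new health signal, here is a minimal sketch of one way to surface it: a small Python helper that shells out to kubectl and walks allocatedResourcesStatus on each container status. The nested layout assumed here (a resource name plus per-device resourceID/health entries) follows the upstream resource-health enhancement and may differ in detail from what v1.36 ships; the pod and namespace names are illustrative.

```python
import json
import subprocess

def device_health(pod: str, namespace: str = "default") -> None:
    """Print per-device health from status.containerStatuses[].allocatedResourcesStatus.

    Assumes the Beta layout from the upstream resource-health enhancement:
    each entry carries a resource name plus a list of {resourceID, health} devices.
    """
    raw = subprocess.run(
        ["kubectl", "get", "pod", pod, "-n", namespace, "-o", "json"],
        capture_output=True, check=True, text=True,
    ).stdout
    status = json.loads(raw).get("status", {})
    for cs in status.get("containerStatuses", []):
        for res in cs.get("allocatedResourcesStatus") or []:
            for dev in res.get("resources", []):
                print(f"{cs['name']}: {res.get('name')} "
                      f"{dev.get('resourceID')} -> {dev.get('health')}")

if __name__ == "__main__":
    # Illustrative names: flag unhealthy GPUs on a training pod before deciding
    # whether to restart the job or drain the node.
    device_health("llama-train-worker-0", namespace="ml-training")
```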
Taken together, the week's announcements reflect a maturing AI infrastructure market. Cloud providers are no longer competing solely on raw GPU capacity; they are competing on the depth of the full stack — from custom silicon and networking fabric to scheduling primitives and agentic orchestration layers. For ML engineers and platform teams, the practical implication is that the tooling required to run production inference and training workloads is becoming significantly more opinionated, and significantly more capable.
Tool of the Week
llm-d
Kubernetes-native distributed LLM inference scheduler — open source, Apache 2.0
The problem it solves. A single vLLM instance handles KV-cache reuse elegantly, but the moment you scale to multiple replicas behind a standard load balancer, that efficiency collapses. Round-robin routing sends requests to whichever pod has capacity, with no awareness of where the relevant KV-cache already exists. The result is duplicate computation, inflated GPU memory pressure, and unpredictable tail latency under load. This is not a theoretical concern: it is the first bottleneck that teams hit when moving from a proof-of-concept deployment to production scale.
How llm-d works. Developed by Red Hat and IBM Research, llm-d is a Kubernetes-native inference scheduler that operates as a layer above the serving runtime. It integrates with Envoy and the Gateway API Inference Extension to implement prefix-cache aware routing, directing each incoming request to the specific vLLM pod where the matching KV-cache already resides. It also supports prefill/decode disaggregation — separating the compute-heavy prompt processing phase from the memory-bandwidth-bound token generation phase — and cache-aware LoRA routing for multi-adapter deployments. The latest release, llm-d 0.5, adds hierarchical KV offloading to any filesystem, resilient networking via UCCL, and scale-to-zero autoscaling.
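To make the routing idea concrete, here is a toy Python sketch contrasting round-robin with prefix-cache aware routing. It is not llm-d's implementation: the pod names, the character-level prefix hash, and the least-loaded fallback are all simplifying assumptions (a real scheduler hashes token blocks and scores pods on cache state, queue depth, and load).

```python
import hashlib
from collections import defaultdict
from itertools import cycle

PODS = ["vllm-0", "vllm-1", "vllm-2", "vllm-3"]  # hypothetical replica names

def prefix_key(prompt: str, block_chars: int = 256) -> str:
    """Hash the leading block of the prompt; a real scheduler hashes token blocks."""
    return hashlib.sha256(prompt[:block_chars].encode()).hexdigest()[:16]

class RoundRobinRouter:
    """Baseline: ignores cache locality, so identical prefixes scatter across pods."""
    def __init__(self, pods):
        self._next = cycle(pods)
    def route(self, prompt: str) -> str:
        return next(self._next)

class PrefixAwareRouter:
    """Sticky routing: requests sharing a prefix land on the pod that already
    holds the matching KV-cache; unseen prefixes go to the least loaded pod."""
    def __init__(self, pods):
        self.prefix_to_pod: dict[str, str] = {}
        self.load = defaultdict(int, {p: 0 for p in pods})
    def route(self, prompt: str) -> str:
        key = prefix_key(prompt)
        pod = self.prefix_to_pod.get(key)
        if pod is None:
            pod = min(self.load, key=self.load.get)  # cold prefix: pick least loaded
            self.prefix_to_pod[key] = pod
        self.load[pod] += 1
        return pod

system_prompt = "You are a support agent for ExampleCo. " * 8
router = PrefixAwareRouter(PODS)
for user_turn in ["reset my password", "cancel my order", "update billing"]:
    print(router.route(system_prompt + user_turn))  # all land on the same pod
```

The point of the sketch is the one made above: once routing is sticky on the shared prefix, a repeated system prompt stops being recomputed on every replica it happens to hit.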
Measured results. A production deployment detailed jointly by Red Hat and Tesla engineers this week reported a 3x improvement in output tokens per second and a 2x reduction in time-to-first-token (TTFT) after enabling prefix-cache aware routing, measured on Llama 3.1 70B running on four AMD MI300X GPUs (tensor-parallel-size=4, gpu-memory-utilization=0.90, max-model-len=65536). The project currently has approximately 3,000 GitHub stars and holds weekly community calls on Wednesdays at 12:30 PM ET.
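For reference, the backend configuration behind those numbers maps directly onto vLLM's engine arguments. The sketch below instantiates the offline engine with the cited parameters; in the benchmark these would instead be flags on each serving replica, with llm-d routing across replicas on top, and the checkpoint name is the assumed public Hugging Face repo for Llama 3.1 70B Instruct.

```python
from vllm import LLM, SamplingParams

# Engine configuration matching the figures quoted above.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed public checkpoint
    tensor_parallel_size=4,        # shard across the four MI300X GPUs
    gpu_memory_utilization=0.90,   # leave headroom for activation spikes
    max_model_len=65536,           # 64k context, as in the cited deployment
)

out = llm.generate(
    ["Summarize the llm-d routing model in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(out[0].outputs[0].text)
```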
How to get started. llm-d is designed to complement KServe rather than replace it. KServe handles the model lifecycle, autoscaling, and API governance; llm-d handles runtime-aware routing and cache locality. The recommended stack for production LLM serving on Kubernetes is KServe + llm-d + vLLM. Documentation and getting-started guides are available at llm-d.ai.
| Attribute | Detail |
|---|---|
| License | Apache 2.0 (free for commercial use) |
| Latest Version | 0.5 (hierarchical KV offloading, cache-aware LoRA routing, scale-to-zero) |
| Supported Runtimes | vLLM, Hugging Face TGI (via KServe) |
| Key Dependency | KServe v0.16+ (LLMInferenceService CRD) |
| Pricing | Free and open source |
Quick Hits
VAST Data closes $1B Series F at a $30B valuation. The AI infrastructure company, which builds a Disaggregated Shared Everything (DASE) architecture that unifies storage and compute for large-scale AI workloads, raised the round led by Drive Capital with participation from NVIDIA, Fidelity, and NEA. VAST reports more than $500M in committed ARR and a Rule of X score of 228%, and claims to support environments spanning millions of GPUs globally. The valuation represents a 3x increase from its $9.1B Series E in late 2023.
Factory raises $150M Series C at a $1.5B valuation. The New York-based startup, which builds autonomous AI coding agents ("Droids") for enterprise engineering teams, closed the round led by Khosla Ventures with participation from Sequoia Capital, Insight Partners, and Blackstone. Factory positions itself as a full-stack AI software development platform rather than a copilot, targeting the replacement of manual engineering workflows at the task and sprint level.
OpenAI open-sources Privacy Filter. The company released an open-weight, 1.5B parameter (50M active) bidirectional token-classification model for PII detection and masking under an Apache 2.0 license. The model supports 128k context windows, processes inputs in a single forward pass using a constrained Viterbi decoder, and detects eight span categories including private_person, account_number, and secret (passwords, API keys). It is available on Hugging Face and GitHub and is designed to run locally, making it directly applicable to data pipelines that feed external LLM APIs.
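A plausible integration point is a local pre-processing step in front of any external LLM call. The sketch below wires the model into a standard Hugging Face token-classification pipeline and masks detected spans; the repo ID is a placeholder (the release's actual model name is not given here), and the pipeline's simple aggregation strategy stands in for the model's own constrained Viterbi decoding.

```python
from transformers import pipeline

PII_MODEL = "openai/privacy-filter"  # placeholder ID; substitute the released repo name

# Generic token-classification pipeline; the released model ships its own
# constrained Viterbi decoder, which this simple aggregation does not replicate.
detector = pipeline("token-classification", model=PII_MODEL,
                    aggregation_strategy="simple")

def mask_pii(text: str) -> str:
    """Replace detected spans (private_person, account_number, secret, ...)
    with category tags before the text leaves the local environment."""
    redacted, offset = text, 0
    for span in sorted(detector(text), key=lambda s: s["start"]):
        tag = f"[{span['entity_group'].upper()}]"
        start, end = span["start"] + offset, span["end"] + offset
        redacted = redacted[:start] + tag + redacted[end:]
        offset += len(tag) - (end - start)
    return redacted

prompt = "Email Jane Doe at jane@example.com, account 4485-9921."
print(mask_pii(prompt))  # scrub locally, then send the redacted text to an external API
```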
Gradient Brief is published for ML engineers, data scientists, and technical founders. Forward to a colleague who should be reading this.