Issue No. 07 • May 21, 2026
Gradient Brief
MLOps & AI Infrastructure — for the engineers building it
OpenAI Lands on AWS Bedrock — and the Real Story Is 2 Gigawatts of Trainium
The most consequential AI infrastructure shift of the year happened this week: OpenAI's models, Codex, and Managed Agents are now available on AWS Bedrock in limited preview. This marks the definitive end of Microsoft Azure's exclusivity over OpenAI workloads — a relationship that dated to a $1 billion investment in 2019 and grew to a reported $13 billion commitment. For ML engineers already building on AWS, the practical effect is immediate: GPT-5.5 and GPT-5.4 are now accessible through the same Bedrock APIs used for Anthropic's Claude, inheriting IAM-based access management, AWS PrivateLink connectivity, CloudTrail logging, and existing compliance frameworks. Customers can apply OpenAI usage toward existing AWS cloud commitments, with no separate procurement required.
The architectural centerpiece of the announcement is Amazon Bedrock Managed Agents, powered by OpenAI. The product is built on the OpenAI agent harness — the same runtime, environment, and inference API that OpenAI uses internally. AWS VP Anthony Liguori described it on stage as "engineered for faster execution, sharper reasoning, and reliable steering of long-running tasks." The harness gives agents persistent memory across sessions, identity-based permission enforcement, and AgentCore as the default compute environment with authorization and observability layered on top. This is a tighter coupling between model and runtime than other agent platforms have offered. Codex is also now available within the AWS security boundary, accessible via the Codex CLI, desktop app, and VS Code extension, allowing development teams to authenticate with native AWS credentials and route all inference through Bedrock.
The silicon story beneath the software is the more structurally significant development. As part of the February deal that this week's event productized, OpenAI committed to consuming approximately 2 gigawatts of Trainium capacity, spanning Trainium3 and Trainium4. This mirrors Anthropic's parallel commitment — spanning Graviton and Trainium2 through Trainium4 — announced eight days earlier. Two frontier AI labs that compete on every benchmark and every architectural decision have now made parallel, multi-year bets on the same custom silicon roadmap. AWS's custom silicon business already generates more than $20 billion in annual revenue, with over $225 billion in total revenue commitments for Trainium. For ML platform teams evaluating their inference substrate, this is a durable signal: Trainium is no longer an alternative to NVIDIA for AWS workloads, it is becoming the primary substrate for frontier model inference on AWS.
The competitive read for Microsoft is uncomfortable but not fatal. Azure AI Foundry already hosts both Claude and GPT, and Microsoft retains a large installed base of enterprise customers. But the structural claim Bedrock can now make — that both top labs run on AWS-managed infrastructure, on AWS-designed silicon, within the AWS security boundary — is a differentiated position that Foundry cannot replicate. For developers choosing where to build, the inference cloud landscape has shifted from a model-choice question to a silicon and infrastructure question. Custom silicon is the new layer of competition, and AWS got both top labs to commit to its roadmap in the same week.
Tool of the Week: Kthena Router
Open-source | Apache 2.0 | Free | github.com/volcano-sh/kthena
As Kubernetes becomes the de facto standard for deploying AI/ML workloads, the traffic management layer has become a genuine operational problem. Kthena Router, a project from the Volcano-SH organization, is an open-source inference traffic manager that this week shipped support for both the Kubernetes Gateway API and the Gateway API Inference Extension — a CNCF-standardized spec for AI/ML routing resources.
The problem it solves: In traditional inference routing configurations, the modelName field in ModelRoute resources is global. When multiple teams deploy routes with the same model name — a common occurrence in multi-tenant clusters where different teams use "deepseek-r1" or "llama-3" for different purposes — the result is undefined routing behavior. Kthena solves this by leveraging the Gateway API's concept of Gateway resources, which define independent routing spaces. Each Gateway can listen on a different port, and ModelRoutes bound to different Gateways are completely isolated even if they share the same model name.
| Feature | Native ModelRoute/ModelServer | Gateway API + Inference Extension |
|---|---|---|
| Multi-tenant isolation | Via port-based routing | Via Gateway resource scoping |
| Prefill-decode disaggregation | Supported | Not available |
| Weighted routing / A/B testing | Supported | Via InferenceObjective |
| Interoperability with other gateways | Limited | Full (standard Kubernetes API) |
| Vendor lock-in risk | Higher (custom CRDs) | Lower (CNCF standard) |
Getting started: Kthena Router is installed via Helm. Enable Gateway API support by setting --enable-gateway-api=true at deploy time. The project is available at oci://ghcr.io/volcano-sh/charts/kthena and is free under the Apache 2.0 license.
Who should look at this: ML platform engineers managing multi-team Kubernetes clusters where multiple teams are deploying inference services against the same model names. Also relevant for teams evaluating the Gateway API Inference Extension standard as a path away from proprietary routing CRDs.
Quick Hits
- DeepInfra closes $107M Series B backed by NVIDIA The cloud inference platform — which hosts over 190 open-weight models on owned hardware at prices that undercut hyperscaler rates by a factor of five to ten — closed a round co-led by 500 Global and Georges Harik, with NVIDIA, Felicis, and Samsung Next also participating. The company has scaled its processing volume 8,000x since its 2022 seed round. NVIDIA's participation continues a pattern of backing inference cloud providers — CoreWeave, Lambda, Baseten — that deepens customer dependency on Blackwell GPU hardware.
- IBM Think 2026: watsonx Orchestrate becomes an agentic control plane At its annual Think conference in Boston, IBM announced the next generation of watsonx Orchestrate (private preview), repositioning it as a centralized control plane for the multi-agent era. The platform allows organizations to deploy and govern agents from any source — including third-party agents — with consistent policy enforcement and audit trails.
- IREN acquires Mirantis for $625M to strengthen its AI cloud software layer IREN Limited (NASDAQ: IREN), a vertically integrated AI cloud provider operating large-scale GPU clusters in renewable-energy-rich regions of the US and Canada, announced a definitive agreement to acquire Mirantis in an all-stock deal valued at approximately $625 million. Mirantis brings Kubernetes-based orchestration, the k0rdent AI platform for managing AI infrastructure across bare metal, VMs, and Kubernetes environments, and a customer base of over 1,500 enterprises.
Gradient Brief is published for ML engineers, data scientists, and technical founders. Forward to a colleague who should be reading this.