Issue No. 11 • June 4, 2026
Gradient Brief
MLOps & AI Infrastructure — for the engineers building it
Google I/O 2026 Reveals the Scale of Agentic Infrastructure
Google I/O 2026 (May 19-20) was less about new frontier models and more about the infrastructure required to run them at scale. The most striking number from Sundar Pichai's keynote was token volume: Google now processes 3.2 quadrillion tokens per month, a roughly 7x increase from the 480 trillion reported at I/O 2025. That volume comes from 8.5 million developers, plus more than 375 Google Cloud customers that each processed over 1 trillion tokens in the past 12 months. To support it, Google confirmed that its 2026 capex will reach $180-190 billion, up from $31 billion in 2022 [1].
For ML engineers, the most operationally significant announcements centered on the agent developer stack. Google launched the Managed Agents API, a configuration-first approach in which developers define agentic behavior and Google provisions an ephemeral, secure sandbox with skills and Model Context Protocol (MCP) servers. For teams building custom agent meshes, the Agent Development Kit (ADK) v2.0 is now generally available. ADK 2.0 introduces "Collaborative Workflows" with explicit operating modes for sub-agents (chat, task, and single-turn) and dynamic, graph-free workflows defined with decorators. Google also confirmed that IBM's Agent Communication Protocol (ACP) has merged into the A2A protocol, a sign that the industry is consolidating around a single agent communication standard [2].
The hardware and model layers were also updated to support this agentic shift. Google detailed its eighth-generation TPU strategy, which splits the silicon into the TPU 8t for training (3x the raw compute of the previous generation) and the TPU 8i, optimized specifically for inference latency. On the model side, Gemini 3.5 Flash was introduced as the new default engine. Google claims it produces outputs four times faster than competing frontier models while costing less to run. In an internal test meant to demonstrate this speed, 93 parallel sub-agents running on Gemini 3.5 Flash built a working operating system core from scratch in 12 hours, making over 15,000 model requests and processing 2.6 billion tokens [1] [3].
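To put the internal test in operational terms, the quoted figures can be reduced to per-request and per-second rates. This is a back-of-the-envelope sketch only: it treats "over 15,000" as exactly 15,000 requests, and it assumes the 2.6 billion tokens cover the whole 12-hour run (the source does not say how they split between input and output).

```python
# Figures quoted above for the internal test.
# "Over 15,000" requests is treated as a lower bound of exactly 15,000.
sub_agents = 93
requests = 15_000
tokens = 2.6e9
hours = 12

tokens_per_request = tokens / requests          # upper bound, since requests is a lower bound
requests_per_agent = requests / sub_agents      # lower bound, for the same reason
tokens_per_second = tokens / (hours * 3600)     # average over the full 12 hours

print(f"{tokens_per_request:,.0f} tokens per request")
print(f"{requests_per_agent:.0f} requests per sub-agent")
print(f"{tokens_per_second:,.0f} tokens per second, sustained")
```

Under those assumptions the run averages at most about 173,000 tokens per request, at least about 161 requests per sub-agent, and roughly 60,000 tokens per second across the fleet.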
Tool of the Week: Gemini CLI
Open Source (Apache 2.0) | Local | Google
An open-source AI agent that runs Gemini models directly in the terminal. It is not a thin API wrapper; it is a full agentic tool with file system access, shell command execution, web fetching, and MCP server support.
The Gemini CLI (released alongside Google I/O) brings agentic capabilities directly to the developer's local environment. With a 1 million token context window, engineers can feed entire codebases into a single session. MCP support means it can connect to existing ML toolchains, such as databases, experiment trackers, or model registries. Its non-interactive mode (gemini -p "prompt" --output-format json) also lets it run inside CI/CD pipelines for automated testing and code analysis.
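A CI step built on the non-interactive mode could look roughly like the sketch below. Only the gemini -p "prompt" --output-format json invocation comes from the text above; the helper names (build_command, parse_output, run_review), the timeout, and the assumption that stdout is a single JSON object are illustrative and should be checked against the CLI's own documentation.

```python
import json
import shutil
import subprocess


def build_command(prompt: str) -> list[str]:
    """Build the non-interactive invocation quoted in the text."""
    return ["gemini", "-p", prompt, "--output-format", "json"]


def parse_output(stdout: str) -> dict:
    """Parse stdout as JSON. Assumes (unverified) a single top-level JSON object."""
    try:
        parsed = json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise ValueError(f"gemini did not return valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed


def run_review(prompt: str, timeout_s: int = 300) -> dict:
    """Run the CLI once and return its parsed output; raises if the CLI is missing or fails."""
    if shutil.which("gemini") is None:
        raise RuntimeError("gemini CLI not found on PATH")
    result = subprocess.run(
        build_command(prompt),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # a non-zero exit code fails the CI step
    )
    return parse_output(result.stdout)
```

A pipeline job would call run_review("Summarize the risks in this diff") and then decide pass or fail from fields in the returned dict. Which fields exist is not stated in the source, so the sketch deliberately stops at parsing.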
The CLI is Apache 2.0 licensed. The free tier offers 60 requests per minute and 1,000 requests per day with a personal Google account via OAuth, which makes it viable for daily use without incurring API costs. Paid tiers use standard Gemini API or Vertex AI pricing.
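Scripts that call the free tier in a loop can stay under both caps with a small client-side gate. The sketch below is a generic sliding-window limiter, not anything shipped with the CLI: the QuotaGate class is hypothetical, and whether Google enforces the limits with sliding or fixed windows is an assumption the source does not settle.

```python
from collections import deque

DAY_S = 86_400.0
MINUTE_S = 60.0


class QuotaGate:
    """Client-side sliding-window gate for a per-minute and a per-day request cap."""

    def __init__(self, per_minute: int = 60, per_day: int = 1000):
        self.limits = [(MINUTE_S, per_minute), (DAY_S, per_day)]
        self.sent = deque()  # timestamps (seconds) of recorded requests, oldest first

    def wait_time(self, now: float) -> float:
        """Seconds to wait at time `now` before one more request fits both windows."""
        # Forget requests that have left the longest window.
        while self.sent and now - self.sent[0] >= DAY_S:
            self.sent.popleft()
        wait = 0.0
        for window, cap in self.limits:
            recent = [t for t in self.sent if now - t < window]
            if len(recent) >= cap:
                # The oldest of the last `cap` requests must age out of the window.
                wait = max(wait, recent[-cap] + window - now)
        return wait

    def record(self, now: float) -> None:
        """Record that a request was sent at time `now`."""
        self.sent.append(now)
```

In a real loop the caller would pass time.monotonic() as `now`, sleep for wait_time(now) seconds, then send the request and call record. Note that at the full 60 requests per minute, the 1,000-request daily cap is used up in under 17 minutes, so the daily limit is the one that shapes batch jobs.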
Quick Hits
- Durantic Launches to Manage Heterogeneous GPU Fleets: Founded by infrastructure veterans from Meta and Hudson River Trading, Durantic emerged from stealth to offer a managed operating layer for fragmented AI compute. The company uses a proprietary bare-metal-native control plane to handle provisioning, network configuration, and hardware maintenance across mixed GPU generations and orchestration systems like Kubernetes and Slurm.
- JFrog 2026 Software Supply Chain Security Report: The report highlights a severe "AI governance gap." While 97% of organizations claim to have certified model governance, 53% self-host models from sources where malicious payloads have been detected. Most notably, JFrog tracked malicious AI agent skills for the first time, identifying 969 carrying high-impact payloads, alongside a 451% year-over-year surge in malicious npm packages.
- LMCache at MLSys 2026: Presented at the MLSys conference in Santa Clara, LMCache is an open-source KV caching solution that extracts and stores KV caches outside of GPU memory. By enabling cache reuse across different queries and inference engines (compatible with vLLM and SGLang), it demonstrated up to a 15x improvement in throughput for enterprise-scale LLM inference workloads.
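For readers new to the idea behind the LMCache item, the toy sketch below shows how KV entries held outside GPU memory can be reused when a new query shares a prefix with an earlier one. It is not LMCache's API or implementation: the PrefixKVStore class, the 4-token chunk size, and the string payloads standing in for KV tensors are all hypothetical, and it covers prefix reuse only.

```python
import hashlib

CHUNK = 4  # tokens per cached chunk (tiny, for illustration only)


def chunk_keys(tokens: list[int]) -> list[str]:
    """One key per full chunk; each key hashes the entire prefix up to that chunk."""
    keys = []
    running = hashlib.sha256()
    for end in range(CHUNK, len(tokens) + 1, CHUNK):
        running.update(repr(tokens[end - CHUNK:end]).encode())
        keys.append(running.copy().hexdigest())
    return keys


class PrefixKVStore:
    """Toy off-GPU store mapping prefix-chunk keys to KV payloads."""

    def __init__(self):
        self.store = {}  # key -> payload (stand-in for KV tensors in CPU RAM or on disk)

    def put(self, tokens: list[int], kv_chunks: list) -> None:
        """Store one payload per full chunk of `tokens`."""
        for key, kv in zip(chunk_keys(tokens), kv_chunks):
            self.store[key] = kv

    def longest_hit(self, tokens: list[int]) -> tuple[int, list]:
        """Return (tokens covered, payloads) for the longest cached prefix of `tokens`."""
        hits = []
        for key in chunk_keys(tokens):
            if key not in self.store:
                break
            hits.append(self.store[key])
        return len(hits) * CHUNK, hits


store = PrefixKVStore()
store.put([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ["kv0", "kv1"])
covered, reused = store.longest_hit([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102])
print(covered, reused)  # 8 ['kv0', 'kv1']
```

Because each key hashes the whole prefix, a chunk is reused only when every token before it also matches, which is what keeps the cached attention state valid. An engine would then run prefill only on the tokens after the covered prefix.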
Gradient Brief is published for ML engineers, data scientists, and technical founders. Forward to a colleague who should be reading this.