Issue No. 05 • May 12, 2026
Gradient Brief
MLOps & AI Infrastructure — for the engineers building it
Lead Story
GPT-5.5 Shifts the Frontier Toward Agentic Efficiency
OpenAI released GPT-5.5 this week, marking a distinct shift in how frontier models are positioned for enterprise workloads. Rather than focusing solely on raw capability gains, the release emphasizes token efficiency and agentic reliability. The model matches GPT-5.4's per-token latency in real-world serving while posting stronger results, particularly on multi-step coding and computer-use tasks. On the Terminal-Bench 2.0 evaluation, GPT-5.5 scored 82.7%, outperforming both Claude Opus 4.7 (69.4%) and Gemini 3.1 Pro (68.5%). However, Claude Opus 4.7 still holds a slight edge on the SWE-Bench Pro benchmark at 64.3%.
The economics of the new model reflect this focus on efficiency. API pricing is set at $5 per million input tokens and $30 per million output tokens, with a 1-million-token context window. While the list price is higher than GPT-5.4's, OpenAI notes that GPT-5.5 uses approximately 40% fewer output tokens to complete the same Codex tasks, bringing the effective cost increase down to roughly 20%. For workloads requiring maximum reliability, a GPT-5.5 Pro variant will be available at $30 per million input tokens and $180 per million output tokens.
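The effective-cost figure can be sanity-checked with a few lines of arithmetic. This is only a sketch: the article gives neither GPT-5.4's list price nor exact token counts, so the price ratio is solved for rather than assumed, and the model covers output-token spend only (input tokens are ignored).

```python
def effective_cost_multiplier(price_ratio: float, token_reduction: float) -> float:
    """Relative output-token spend per task: (new price / old price) x tokens kept."""
    return price_ratio * (1.0 - token_reduction)


def implied_price_ratio(effective_multiplier: float, token_reduction: float) -> float:
    """Invert the formula: which price ratio yields the given effective multiplier?"""
    return effective_multiplier / (1.0 - token_reduction)


# Figures quoted in the article: ~40% fewer output tokens, ~20% higher effective cost.
TOKEN_REDUCTION = 0.40
EFFECTIVE_MULTIPLIER = 1.20

ratio = implied_price_ratio(EFFECTIVE_MULTIPLIER, TOKEN_REDUCTION)
print(f"Implied output price ratio: {ratio:.2f}x")
print(f"Check: {effective_cost_multiplier(ratio, TOKEN_REDUCTION):.2f}x effective cost")
```

Under this simplified model, the two quoted figures are consistent with each other if the per-token output price roughly doubled. Workloads with a different input-to-output mix would see a different effective change.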
The infrastructure ecosystem is already adapting to support the new model's agentic capabilities. Databricks announced immediate support for GPT-5.5 and Codex, routing all usage through its newly rebranded Unity AI Gateway. This integration provides enterprise ML teams with a single control plane for permissions, rate limits, PII guardrails, and Model Context Protocol (MCP) governance. Every agent tool call and model request is logged with identity, latency, and cost metrics directly into customer-owned Delta tables, solving one of the primary observability challenges in deploying autonomous agents at scale.
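To make the logging idea concrete, here is a minimal sketch of per-call audit records carrying identity, latency, and cost. Everything in it is hypothetical: `CallRecord`, `audited`, and the in-memory `AUDIT_LOG` are illustrative stand-ins, not the Unity AI Gateway API, and a real gateway would write to a governed table rather than a Python list.

```python
import time
from dataclasses import asdict, dataclass
from functools import wraps
from typing import Any, Callable, Dict, List


@dataclass
class CallRecord:
    identity: str      # who (or which agent) made the call
    tool: str          # name of the tool that was invoked
    latency_s: float   # wall-clock duration of the call
    cost_usd: float    # cost attributed to the call (assumed flat here)
    ok: bool           # whether the call returned without raising


AUDIT_LOG: List[Dict[str, Any]] = []  # hypothetical stand-in for an audit table


def audited(identity: str, cost_usd: float) -> Callable:
    """Wrap a tool function so every call appends one record to AUDIT_LOG."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            ok = False
            try:
                result = fn(*args, **kwargs)
                ok = True
                return result
            finally:
                # Runs on success and on failure, so failed calls are logged too.
                AUDIT_LOG.append(asdict(CallRecord(
                    identity=identity,
                    tool=fn.__name__,
                    latency_s=time.perf_counter() - start,
                    cost_usd=cost_usd,
                    ok=ok,
                )))
        return wrapper
    return decorator


@audited(identity="agent-42", cost_usd=0.002)
def search_docs(query: str) -> List[str]:
    return [f"result for {query}"]


search_docs("kv cache")
print(AUDIT_LOG[-1])
```

Recording in a `finally` block is a deliberate choice: an agent's failed tool calls are often the ones an auditor most needs to see.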
This release underscores a broader trend in AI infrastructure: the bottleneck is moving from model intelligence to orchestration and governance. As models become more capable of executing long-horizon tasks autonomously, the tooling required to monitor, secure, and audit those actions—like the Unity AI Gateway—is becoming just as critical as the models themselves.
Tool of the Week
vLLM v0.20.0
High-throughput and memory-efficient LLM inference engine — open source, Apache 2.0
The problem it solves. Serving large language models in production requires balancing throughput, latency, and GPU memory utilization. As context windows expand and models grow more complex, managing the Key-Value (KV) cache becomes the primary bottleneck. Inefficient memory allocation leads to fragmented GPU memory, limiting the number of concurrent requests a server can handle and driving up infrastructure costs.
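A back-of-the-envelope calculation shows why the KV cache dominates. The sketch below uses the standard idealized formula for multi-head or grouped-query attention (one key and one value vector per KV head, per layer, per token). The model shape and the 16 GiB budget are hypothetical, and the estimate ignores paging and quantization-metadata overhead, so real capacity will be lower.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bits_per_value: int) -> float:
    """Idealized KV cache footprint of one token, in bytes."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * bits_per_value / 8


def max_cached_tokens(budget_bytes: int, bytes_per_token: float) -> int:
    """How many tokens of KV cache fit in a fixed memory budget."""
    return int(budget_bytes // bytes_per_token)


# Hypothetical model shape, chosen only for illustration.
MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128)
BUDGET = 16 * 2**30  # 16 GiB reserved for the KV cache (assumed)

for bits in (16, 8):
    per_token = kv_cache_bytes_per_token(bits_per_value=bits, **MODEL)
    tokens = max_cached_tokens(BUDGET, per_token)
    print(f"{bits:>2}-bit cache: {per_token / 1024:.0f} KiB/token, {tokens:,} tokens")
```

In this idealized model, halving the bits per value doubles the number of tokens that fit, and that token budget is shared across every concurrent request on the server.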
How vLLM v0.20.0 works. Released on April 27, v0.20.0 is the largest update in the project's history, with 752 commits from 320 contributors. The headline feature is TurboQuant, a new attention backend that compresses the KV cache to 2 bits, effectively quadrupling cache capacity and enabling long context windows (up to 125K tokens) on constrained hardware. The release also re-enables FlashAttention 4 as the default Multi-Head Latent Attention (MLA) prefill backend, supporting head-dim 512 and paged-KV on SM90+ architectures.
Measured results. Beyond memory efficiency, v0.20.0 makes zero-bubble asynchronous scheduling work together with speculative decoding, so speculative decoding runs without scheduling bubbles (idle gaps between steps), improving overall throughput. The release also reports a 2.1% end-to-end latency improvement from using a fused RMS norm in batch-invariant mode. For teams deploying the latest models, v0.20.0 adds day-zero support for DeepSeek V4, the Gemma 4 series (including fast prefill and quantized MoE), and Granite 4.1 Vision.
How to get started. The new release defaults to CUDA 13.0 and upgrades the underlying framework to PyTorch 2.11. It also introduces an initial Intermediate Representation (IR) skeleton, laying the foundation for future custom kernel work. You can install the latest version via pip (pip install vllm). Full documentation and release notes are available on the project's GitHub repository.
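After installing, it can be worth confirming which release is actually in the environment. The sketch below uses only the standard library. The helper names `parse_release` and `meets_minimum` are hypothetical, and the parser is deliberately simple: it treats pre-release tags such as `rc1` as equal to the final release.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def parse_release(v: str) -> Tuple[int, ...]:
    """Turn '0.20.0' or '0.20.0+cu130' into (0, 20, 0); suffixes are dropped."""
    parts = []
    for piece in v.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '0.20' compares equal to '0.20.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.20.0") -> bool:
    """True if the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


try:
    installed = version("vllm")
    print(f"vllm {installed}: at least 0.20.0 -> {meets_minimum(installed)}")
except PackageNotFoundError:
    print("vllm is not installed; try `pip install vllm`")
```

For strict PEP 440 comparisons, including pre-releases, the third-party `packaging` library is the more robust choice.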
| Attribute | Detail |
|---|---|
| License | Apache 2.0 |
| Latest Version | 0.20.0 (TurboQuant 2-bit KV cache, FA4 default MLA prefill) |
| New Model Support | DeepSeek V4, Gemma 4 series, Granite 4.1 Vision |
| Frameworks | PyTorch 2.11, CUDA 13.0, Transformers v5 |
| Pricing | Free and open source |
Quick Hits
Ineffable Intelligence raises $1.1B seed round to build a "superlearner." Founded by former Google DeepMind researcher David Silver (creator of AlphaGo), the UK-based startup closed Europe's largest-ever seed round at a $5.1 billion valuation. Backed by Sequoia Capital, NVIDIA, and Google, Ineffable is abandoning the standard LLM pre-training paradigm. Instead, the company is using reinforcement learning to build AI systems that discover knowledge autonomously through experience and environment interaction, without relying on human-generated text data.
ComfyUI secures $30M Series B at a $500M valuation. The open-source, node-based workflow platform for generative AI raised the round, led by Craft Ventures, bringing its total funding to $47 million. ComfyUI has rapidly become the standard infrastructure layer for professional AI artists and developers building complex, multi-step image and video generation pipelines, challenging proprietary, closed-ecosystem creative tools.
MLflow 3.11.1 introduces AI-powered issue identification for agents. The latest release of the open-source MLOps platform adds a "Detect Issues" feature to its traces table. The system uses AI to automatically analyze selected execution traces and surface potential problems across correctness, safety, and performance categories. The update also includes configurable budget policies and alerts for the AI Gateway, allowing teams to set daily or monthly spending limits and automatically block requests to prevent runaway inference costs.
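The budget-policy behavior described above can be sketched in a few lines. This is a hypothetical illustration of the general pattern (allow, alert near the limit, block past it), not MLflow's API: the `DailyBudget` class, its `check` method, and the 80% alert threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DailyBudget:
    limit_usd: float
    alert_fraction: float = 0.8  # assumed threshold for raising an alert
    _day: date = field(default_factory=date.today, init=False)
    _spent: float = field(default=0.0, init=False)

    def check(self, cost_usd: float, today: Optional[date] = None) -> str:
        """Return 'allow', 'alert', or 'block' for a request of the given cost."""
        today = today or date.today()
        if today != self._day:
            # A new day starts with a fresh budget.
            self._day, self._spent = today, 0.0
        if self._spent + cost_usd > self.limit_usd:
            return "block"  # blocked requests are not charged to the budget
        self._spent += cost_usd
        if self._spent >= self.alert_fraction * self.limit_usd:
            return "alert"  # request goes through, but spend is near the limit
        return "allow"


budget = DailyBudget(limit_usd=1.00)
day = date(2026, 5, 12)
for cost in (0.50, 0.40, 0.20):
    print(cost, budget.check(cost, today=day))
print(0.20, budget.check(0.20, today=date(2026, 5, 13)))
```

A monthly limit would follow the same shape, with the reset keyed on year and month instead of the calendar day.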
Gradient Brief is published for ML engineers, data scientists, and technical founders. Forward to a colleague who should be reading this.