Issue No. 14 • June 16, 2026
Gradient Brief
MLOps & AI Infrastructure — for the engineers building it
Production Inference Is Moving to Private Cloud
Broadcom's Private Cloud Outlook 2026, published June 9, reports that 56% of enterprises now run or plan to run production AI inference on private cloud, while public cloud use for the same workloads fell to 41% from 56% a year earlier. The report attributes the shift to three pressures: cost, complexity, and control.
For engineering teams, that move puts the serving problem back in house. Running inference on owned hardware means owning the throughput, latency, and memory-efficiency work that managed endpoints used to absorb. The result is a pull toward tunable open serving stacks and local accelerators rather than turnkey APIs.
The buildout underneath is visible at the silicon and network layers, where this week's announcements point to a private-inference market large enough to justify custom hardware. The practical question for platform teams is no longer whether to self-host, but how to make self-hosted serving efficient.
Tool of the Week: vLLM on DGX Spark
Open Source | Local (DGX Spark) | vLLM Project
A June 1 vLLM blog post documents how to run vLLM efficiently on NVIDIA's GB10-based DGX Spark, the desk-side system (sm_121 Blackwell) built for local large-model inference. vLLM exposes an OpenAI-compatible endpoint with the batching, KV-cache, and telemetry controls a team needs to serve NVFP4 models on a single box.
The guidance is specific to the hardware's unified-memory design. Leave headroom in the memory pool with --gpu-memory-utilization, keep --max-num-seqs low because the Spark favors small-batch over high-concurrency serving, and rely on CUDA graphs, FP4 kernels, async scheduling, and MTP speculative decoding for throughput. Prometheus metrics ship in for local observability.
The post closes with a local evaluation on a Nemotron-3-Super deployment, giving teams a concrete reference for standing up private inference on their own hardware.
Quick Hits
- Broadcom networking at OFC 2026 Broadcom introduced Taurus, a 400G-per-lane optical DSP it calls an industry first, alongside 102.4T Ethernet switches with co-packaged optics and PCIe Gen6, the interconnect layer large inference clusters depend on.
- OpenAI joins Broadcom's silicon roster CEO Hock Tan confirmed OpenAI as Broadcom's sixth major hyperscaler customer, with a $10B-plus custom inference accelerator slated for mass production in late 2026; Broadcom's AI backlog now stands at $73B.
- Inference economics keep compounding Cost per unit of capability continues to fall roughly 10x per year, shifting engineering attention toward inference-time scaling and serving efficiency rather than raw model size.
Gradient Brief is published for ML engineers, data scientists, and technical founders. Forward to a colleague who should be reading this.