Gradient Brief

MLOps & AI Infrastructure — for the engineers building it

Production Inference Is Moving to Private Cloud

Broadcom's Private Cloud Outlook 2026, published June 9, reports that 56% of enterprises now run or plan to run production AI inference on private cloud, while public cloud use for the same workloads fell to 41% from 56% a year earlier. The report attributes the shift to three pressures: cost, complexity, and control.

For engineering teams, that move puts the serving problem back in house. Running inference on owned hardware means owning the throughput, latency, and memory-efficiency work that managed endpoints used to absorb. The result is a pull toward tunable open serving stacks and local accelerators rather than turnkey APIs.

The buildout underneath is visible at the silicon and network layers, where this week's announcements point to a private-inference market large enough to justify custom hardware. The practical question for platform teams is no longer whether to self-host, but how to make self-hosted serving efficient.

Tool of the Week: vLLM on DGX Spark

Open Source | Local (DGX Spark) | vLLM Project

A June 1 vLLM blog post documents how to run vLLM efficiently on NVIDIA's GB10-based DGX Spark, the desk-side system (sm_121 Blackwell) built for local large-model inference. vLLM exposes an OpenAI-compatible endpoint with the batching, KV-cache, and telemetry controls a team needs to serve NVFP4 models on a single box.

The guidance is specific to the hardware's unified-memory design. Leave headroom in the memory pool with --gpu-memory-utilization, keep --max-num-seqs low because the Spark favors small-batch over high-concurrency serving, and rely on CUDA graphs, FP4 kernels, async scheduling, and MTP speculative decoding for throughput. Prometheus metrics ship in for local observability.

The post closes with a local evaluation on a Nemotron-3-Super deployment, giving teams a concrete reference for standing up private inference on their own hardware.

Quick Hits

Broadcom networking at OFC 2026 Broadcom introduced Taurus, a 400G-per-lane optical DSP it calls an industry first, alongside 102.4T Ethernet switches with co-packaged optics and PCIe Gen6, the interconnect layer large inference clusters depend on.
OpenAI joins Broadcom's silicon roster CEO Hock Tan confirmed OpenAI as Broadcom's sixth major hyperscaler customer, with a $10B-plus custom inference accelerator slated for mass production in late 2026; Broadcom's AI backlog now stands at $73B.
Inference economics keep compounding Cost per unit of capability continues to fall roughly 10x per year, shifting engineering attention toward inference-time scaling and serving efficiency rather than raw model size.

Gradient Brief is published for ML engineers, data scientists, and technical founders. Forward to a colleague who should be reading this.

vLLM on DGX Spark, Broadcom Private Cloud Outlook 2026, Broadcom at OFC 2026, AI inference cost trends

Gradient Brief

Production Inference Is Moving to Private Cloud

Tool of the Week: vLLM on DGX Spark

Quick Hits

Keep Reading

Quick Links

Subscription

Socials