Morph LLM shows three ways to run coding AI faster on cheaper GPUs

A task-trained speculator, automated kernel search, and a TCP-based prefix cache together cut inference costs without requiring frontier-grade hardware.

Alessandro Benigni

PUBLISHED JUN 22, 2026

4 MIN READ

Follow on Google

2 HR AGO

Morph LLM shows three ways to run coding AI faster on cheaper GPUs — featured image for AI Insiders

Inference economics, not model quality, is increasingly deciding which coding tools survive. Morph LLM, a startup that serves open models including Qwen, GLM, and DeepSeek for coding workloads, published a technical breakdown on June 20 describing three separate optimizations it built for exactly that constraint. The numbers it claims are striking: 3.07x decoding speedup, 162 tokens per second on a $7K GPU card, and an 84% reduction in time-to-first-token over full recompute.

The first technique targets speculative decoding, a method where a smaller draft model predicts upcoming tokens and a larger target model validates them in a single pass. The key variable is acceptance rate: how often the large model keeps the draft’s guesses. Morph’s finding is that a generic off-the-shelf draft model achieves 1.93x speedup, while a draft trained specifically on the target model’s coding output reaches 3.07x on the same setup. The gap exists because code reuses templates, variable names, and diff patterns at rates that a model trained on general internet text cannot anticipate. A model that has read a large volume of code diffs predicts those patterns; a model trained on news articles does not.

The second technique addresses GPU kernel tuning. Coding agent workloads share a high proportion of prefix tokens between requests, because the system prompt, tool definitions, and repository files repeat across turns. Morph reports that across real workloads, programming traffic shares 97% of its prefix tokens, with prompts ranging from 37x to more than 2,000x longer than the outputs. Caching those prefixes is the obvious optimization, but the published cache abstractions are tuned for high-end NVIDIA cards that frontier labs actually buy. Porting them to other hardware without retuning can drop performance to 7% of optimal. Morph’s approach treats kernel selection as a search problem: propose a candidate, verify correctness against reference output, benchmark, keep the winners. Using this automated loop on lower-cost NVIDIA and AMD hardware, its warp-decode kernels reached 162 tokens per second on an 80-billion-parameter mixture-of-experts model running on a $7K RTX PRO 6000 card, up from 97 tokens per second and past the 120 tokens per second Morph reports for a $25K H100.

The third technique replaces NVLink with custom interconnect code. NVLink, NVIDIA’s proprietary GPU-to-GPU fabric, moves roughly 900 GB/s between devices. PCIe Gen5, the interconnect on affordable multi-GPU boxes, moves 64 GB/s per direction. That 14x bandwidth gap is tolerable for single-GPU inference and catastrophic for tensor parallelism across multiple GPUs, where an all-reduce operation on every layer consumes 8 to 11% of compute time on NVLink hardware and 40 to 75% on PCIe. Morph’s solution is bare-metal kernels that overlap the all-reduce with compute to hide much of the gap, combined with a prefix cache that shares cached prefixes across machines over plain TCP. The 84% reduction in time-to-first-token is measured against full recompute; the TCP transport is slower than RDMA, so the gain depends on driving cache hit rates high enough that fetching a cached prefix over TCP is still faster than recomputing it locally.

One frame for why this matters: the tools market for coding agents is converging on similar model weights. Qwen, DeepSeek, and other open models are freely available to any team. The differentiation is shifting to the serving layer, where latency directly affects user experience and compute cost directly affects margin. A team that achieves 3x higher throughput on hardware that costs one-third as much has a structural cost advantage that compounds over time.

The skeptic’s note is necessary here. All three figures come from Morph’s own engineering blog, published without independent reproduction or third-party audit. The 3.07x speedup is reported on Vicuna-13B in a controlled comparison; real-world gains on production models at scale may differ. The 162 tokens-per-second result applies to a specific 80B MoE configuration on a specific card. The 84% time-to-first-token reduction assumes a system with high cache hit rates already established by the other two techniques working in concert. None of this makes the techniques implausible, but teams evaluating this stack should benchmark against their own workloads rather than assuming the headline figures transfer directly.

Morph publishes its warp-decode kernel code publicly. Teams running 80-billion-parameter models on mid-range NVIDIA hardware have the clearest immediate path to reproduce the throughput claim and verify whether it holds for their use case.

Source: Morph LLM company blog, “Optimizing Models to Be Fast at Codegen,” authored by Tejas Bhakta, published June 20, 2026.

Morph LLM shows three ways to run coding AI faster on cheaper GPUs

The morning brief for people inside the AI industry.

More in Tools

Mistral's Le Chat gains a Code tab and an app-building surface

NVIDIA Ships Open XR AI Stack for AR Glasses Agents

OpenAI Retires ChatGPT Pulse, Unifies Tasks Into One Hub