The Cost of Trust

The contradictions in enterprise AI are getting harder to ignore: who you pay, who you compete with, and what your agents can actually finish.

The Platform Contradiction: Paying a Rival You Built In

Salesforce’s relationship with a competing AI product inside Slack raises a question that most platform companies will face soon: when your distribution helps a rival more than your own product, what exactly is the strategy?

Salesforce is paying $300M a year to a rival it built into its own platform. Salesforce promoted a competing AI assistant inside Slack even though the product goes head-to-head with Agentforce and Slackbot. The structural conflict was so visible that Salesforce's own employees could not explain it.
The reward problem: why RL stalls outside math and code. Reinforcement learning dominates coding and math benchmarks because those domains produce clean, checkable answers. Everywhere else in the economy, the verification problem has not been solved and progress slows sharply.

Agents in Production: Cost, Continuity, and the Phone in Your Pocket

Three developments this week show agents moving out of demos and into the conditions that production actually demands: failure recovery, cost control, and the ability to run from anywhere.

Cognition ships a dual-agent cost cutter for Devin. A multi-model harness routes tasks between models by complexity, cutting per-task cost by 35% while holding frontier-level performance. The approach offers a template for any team running autonomous agents at volume.
Mistral launches Workflows, a durable orchestration layer for agent pipelines. The gap Mistral is targeting is agent pipelines that survive failures and resume without restarting from scratch. The product is aimed at production teams that have hit the limits of stateless orchestration.
Cursor brings agent control to iOS with public beta launch. Developers can now launch agents, monitor progress, and merge pull requests from their phones. The shift moves on-call and async review workflows off the desk entirely.
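
The complexity-based routing in the Devin item can be pictured with a small sketch. This is an illustration of the general idea only, not Cognition's harness: the model names, the flat per-task costs, the keyword heuristic, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_task: float  # hypothetical flat cost in dollars


# Hypothetical models and prices, chosen only for illustration.
CHEAP = Model("small-model", 0.20)
FRONTIER = Model("frontier-model", 1.00)

# Hypothetical keywords that mark a task as risky enough for the larger model.
HARD_KEYWORDS = ("refactor", "migrate", "architecture")


def estimate_complexity(task: str) -> float:
    """Toy heuristic: score in [0, 1] from description length and keywords."""
    score = min(len(task.split()) / 50, 1.0)
    if any(word in task.lower() for word in HARD_KEYWORDS):
        score = max(score, 0.8)
    return score


def route(task: str, threshold: float = 0.5) -> Model:
    """Send a task to the frontier model only if it looks complex enough."""
    return FRONTIER if estimate_complexity(task) >= threshold else CHEAP


def total_cost(tasks: Iterable[str], threshold: float = 0.5) -> float:
    """Sum the per-task cost of whichever model each task is routed to."""
    return sum(route(task, threshold).cost_per_task for task in tasks)
```

In this toy version, a short task such as "fix typo in README" goes to the cheap model, while anything mentioning a migration goes to the frontier model. Any real saving depends on how well the complexity estimate predicts which tasks the cheaper model can actually finish.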

Model Specialization Takes Hold: Science, Code, and Personalization

Three model stories this week share the same underlying logic: general-purpose inference is becoming a baseline, and the competition is shifting toward what the model knows and whom it knows about.

Sakana AI ships Fugu Ultra at 93.2 on LiveCodeBench after losing Claude. Forced off a top frontier model by a US government directive, the Tokyo lab rebuilt its coding stack around multi-model routing and scored 93.2 on LiveCodeBench. The result turns a supply-chain vulnerability into a product argument.
Google Cloud will sell SandboxAQ’s science models alongside Gemini. Enterprise researchers gain access to models trained on equations and lab data rather than web text, distributed through the same channel as a general-purpose frontier model. The move defines a lane that general models cannot fill.
Google opens Gemini’s personalized image generation to all US free users. Gemini now reads Google account data to generate images without explicit prompts, bringing zero-shot personalization to a free tier. The feature raises the stakes in a personalization race where Apple and OpenAI are also competing.

The Inference Gap: Speed Records vs. Task Reality

Inference is getting dramatically faster and dramatically cheaper, yet a new evaluation shows the best coding agents still clear fewer than 4 in 10 real-world upgrade tasks. The two stories belong together.

DeepSeek open-sources DSpark to cut LLM inference time. A speculative-decoding framework claims up to 85% faster generation while preserving output fidelity. The open-source release puts the technique within reach of any team running their own inference stack.
Best coding agent clears under 40% of real upgrade tasks in RoadmapBench. RoadmapBench tests agents on actual open-source version upgrades across 17 repositories. Even the top-performing model solves fewer than 4 in 10 tasks, a result that separates benchmark performance from production usefulness.
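
For readers unfamiliar with speculative decoding, the technique named in the DSpark item, here is a minimal greedy sketch of the general idea. It is not DSpark's implementation: the two toy "models" are hypothetical arithmetic functions, and the verification loop stands in for what a real system would do in one batched pass of the large model.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: the target model generates one token per call."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # Step 1: the cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        ctx = list(out)
        for _ in range(min(k, limit - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Step 2: the target checks each proposed position in order.
        # A real system scores all positions in a single batched forward pass.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the proposal
    return out


# Hypothetical toy models: the draft agrees with the target except after token 5.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else toy_target(ctx)
```

Because every emitted token is the target's own choice, the output matches plain greedy decoding from the target exactly, which is the sense in which fidelity is preserved. The speedup in practice comes from verifying several drafted tokens in one pass of the large model, which this sequential toy does not reproduce.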

Quick Hits

AI2’s DiScoFormer cuts density error 37x over KDE at 100 dimensions. The Allen Institute for AI built a single transformer that estimates both density and score for any data distribution in one forward pass, with no retraining required.
The transformer becomes a commodity layer. Standardized inference APIs are doing to AI what TCP/IP did to networking: disaggregating the stack and redistributing value away from the model layer.