Open Weights Land While Returns Stay Missing

Three threads pull through today’s issue. A Chinese open-weight model just shipped at frontier parity on agentic browsing, the disclosure norms around safety and evaluation are tightening, and the bills for last year’s AI deployments are coming due without the promised savings.

Frontier Open Weights Catch Up Fast

Two open-weight launches today change what teams can run without a hyperscaler in the loop, one at the frontier and one on a phone.

MiniMax M3 beats Opus 4.7 on BrowseComp as open weights ship — M3 scores 83.5 on BrowseComp versus Claude Opus 4.7’s 79.3, combines frontier coding with a 1M-token context, and drops weights on Hugging Face within ten days.
Prism ML ships 4B-parameter diffusion that runs on an iPhone — Bonsai Image 4B’s 1-bit and ternary variants make on-device generative image inference real, changing the unit economics for any mobile product currently calling a Stable Diffusion API.

Disclosure Becomes the Bar

Two documents from frontier labs raise the implicit standard for what counts as a credible safety disclosure and a credible third-party evaluation.

What a 244-page safety card actually discloses — Zvi Mowshowitz argues the Opus 4.8 system card sets a disclosure floor that competing labs have not cleared, and that the Mythos gap it implies is the more important signal.
OpenAI publishes an evaluations playbook for third-party auditors — The playbook codifies what an external evaluation report should contain (claim, representativeness, harness, evidence) at the moment the harness is becoming the variable that decides whether a capability appears at all.

Agents Settle Into Real Workflows

Four stories show the agentic stack maturing past the demo, with the operational questions now being how agents get permitted, how their work gets verified, and which surface owns the document.

Microsoft’s Copilot super app leaks before Build 2026 — Screenshots leaked ahead of Microsoft’s June 2 keynote reveal a unified app with GitHub Copilot, Cowork, and an always-on Scout agent. It reads as a consolidation play after years of Copilot positioning confusion.
Workday’s agent bet: permissioning, not performance, is the wall — Workday is wiring Google’s Gemini through its system of record so HR and finance agents inherit existing user permissions, betting the bottleneck is governance, not model quality.
Devin crossed a threshold: automation now triggers more sessions than humans do — Cognition’s Ido Pesok says async Devin sessions now outnumber interactive ones, with 10 to 20 parallel Devins per dev server becoming the standard verification pattern before merge.
NotebookLM’s Canvas feature puts Google in the document-ownership fight — Leaked screenshots show a Canvas workspace mode plus Connectors and Personal Preferences, closing the feature-parity gap with ChatGPT Canvas and Claude Artifacts.

The Builder’s Stack

Four developer-facing releases that change something concrete in the stack, from the laptop you compile on to the model card you submit for audit.

Nvidia’s N1X laptop chip targets Intel’s last stronghold — The N1X (or RTX Spark Superchip per Bloomberg) bundles 20 Arm cores with an RTX 5070-class GPU and is expected to ship in Dell and Lenovo laptops this fall, opening a real path for on-device AI workflows.
Open-source harnesses get Claude Code’s workflow primitive — The pi-dynamic-workflows extension brings declarative fan-out-plus-synthesize agent orchestration to the open-source Pi harness, removing vendor lock-in for multi-perspective code review and large refactors.
TRL’s token buffer fix closes a silent RLHF correctness hole — Quentin Gallouedec lays out the detokenize-then-tokenize drift that quietly corrupts RL gradient updates and the prefix-preserving template fix that resolves it, a quiet upgrade most teams running RLHF will want.
Nvidia ships a tool that auto-generates EU AI Act model cards — The MCG Toolkit pulls model lineage from MLflow, Weights & Biases, and NeMo to pre-fill Annex IV documentation in the Model Card++ format, cutting compliance work from days to minutes.
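
For readers who want to see the detokenize-then-tokenize drift from the TRL item above in miniature, here is a minimal sketch. It does not use TRL or any real tokenizer: the three-token vocabulary and the greedy `encode` function are hypothetical stand-ins, and the sketch only shows why a text round trip can change token IDs, not the prefix-preserving template fix itself.

```python
# Toy vocabulary (hypothetical): "ab" exists as a merged token next to "a" and "b".
VOCAB = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def decode(token_ids):
    """Join token strings back into text."""
    return "".join(ID_TO_TEXT[i] for i in token_ids)


def encode(text):
    """Greedy longest-match tokenizer over VOCAB (illustrative only)."""
    ids, pos = [], 0
    pieces = sorted(VOCAB, key=len, reverse=True)  # try longer pieces first
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids


# Suppose the policy sampled "a" and then "b" as two separate tokens.
sampled_ids = [0, 1]

# Round trip through text: the tokenizer merges them into the single token "ab".
round_trip_ids = encode(decode(sampled_ids))

# Same text, different token IDs. Log-probabilities computed on round_trip_ids
# would not correspond to the tokens the policy actually sampled.
drifted = round_trip_ids != sampled_ids
```

The text is identical before and after the round trip, which is why this kind of mismatch is easy to miss. Keeping the sampled token IDs around instead of re-tokenizing the decoded text avoids it in this toy setting.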

The Money Question

Two stories about where AI capital is going and whether it is coming back.

Corporate AI budgets are rising on returns that haven’t arrived — A Bain global survey finds AI automation cost savings are broadly below projections, and 60 percent of large companies still lack the data foundation to scale AI, even as 91 percent forecast higher growth in 2026.
Ex-DeepMind lab Inherent raises $50M seed for science question-prioritization AI — The London-based lab is building Faraday, a platform that decides which scientific questions are worth investigating, the latest in a wave of ex-DeepMind science-AI spinouts.

Today’s Quick Hits

xAI opens grok-build-0.1 API at $1 in / $2 out per million tokens — Two days after the Grok Build CLI launch, the underlying coding model is now an open API with integrations confirmed for Cursor and OpenClaw, priced below GPT-5 mini.
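
To make the per-million pricing concrete, here is a small back-of-the-envelope sketch. The two rates come from the item above. The token counts in the example are hypothetical, and the function ignores any caching discounts or minimum charges a real bill might include.

```python
# Rates quoted in the item above, in USD per million tokens.
PRICE_IN_PER_M = 1.00
PRICE_OUT_PER_M = 2.00


def request_cost_usd(input_tokens, output_tokens):
    """Estimate the cost of one request from its token counts."""
    total = input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M
    return total / 1_000_000


# Hypothetical coding session: 200k tokens of context in, 50k tokens out.
example_cost = request_cost_usd(200_000, 50_000)
print(f"${example_cost:.2f}")  # prints "$0.30"
```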