DeepReinforce released Ornith-1.0, an open-weight family of coding models that goes beyond answering coding tasks to generating the reinforcement-learning scaffolds that guide its own training. The weights and a technical report are published on Hugging Face as of June 25.

The family comes in four sizes: a 9B Dense model aimed at resource-constrained hardware, a 31B Dense, a 35B MoE, and a 397B MoE flagship intended for frontier-scale deployments. The variants are trained on top of Gemma 4 and Qwen 3.5 pretrained checkpoints, which means teams already running those foundations can compare directly against their current stack.

The novelty is in how RL training works. Most reinforcement-learning pipelines rely on hand-written task harnesses: fixed scaffolds that tell the model how to approach a problem. Ornith-1.0 eliminates that dependency. Each training step runs in two stages: the model first proposes a refined scaffold for the task at hand, then generates a solution conditioned on that scaffold. Reward from the solution flows back through both stages, so the model is simultaneously trained to orchestrate and to answer. Over many training steps, per-task strategies emerge without engineers designing them by hand.
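To make the two-stage step concrete, here is a toy sketch in Python. It is not DeepReinforce's implementation: the tabular policies, the `toy_reward` function, the running baseline, and every name below are assumptions chosen for illustration, and a plain REINFORCE update stands in for whatever RL algorithm the lab actually uses. The only point it illustrates is that one reward from the solution updates both the scaffold choice and the solution choice.

```python
import math
import random

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def reinforce_update(logits, action, advantage, lr):
    """In-place REINFORCE step: grad of log-prob is onehot(action) - probs."""
    probs = softmax(logits)
    for i in range(len(logits)):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * advantage * grad

def toy_reward(scaffold, solution):
    """Hypothetical task: only scaffold 2 paired with solution 1 succeeds."""
    return 1.0 if (scaffold == 2 and solution == 1) else 0.0

def train(steps=3000, lr=0.2, seed=0):
    rng = random.Random(seed)
    n_scaffolds, n_solutions = 3, 3
    scaffold_logits = [0.0] * n_scaffolds
    # One solution policy per scaffold, i.e. the solution is conditioned on it.
    solution_logits = [[0.0] * n_solutions for _ in range(n_scaffolds)]
    baseline = 0.0  # running average of reward, used to reduce variance
    for _ in range(steps):
        s = sample(softmax(scaffold_logits), rng)        # stage 1: propose scaffold
        a = sample(softmax(solution_logits[s]), rng)     # stage 2: solve given scaffold
        reward = toy_reward(s, a)
        advantage = reward - baseline
        # The same reward signal updates both stages.
        reinforce_update(scaffold_logits, s, advantage, lr)
        reinforce_update(solution_logits[s], a, advantage, lr)
        baseline += 0.05 * (reward - baseline)
    return scaffold_logits, solution_logits
```

Calling `train()` and passing the returned scaffold logits through `softmax` shows which scaffold the toy policy has come to prefer. In a real system both stages would be text generations from the same language model rather than table lookups.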

That design creates a real attack surface. A model that authors its own scaffolds could, in principle, write scaffolds that game the reward signal without solving the underlying task. DeepReinforce addresses this with three layered controls: a fixed outer trust boundary that the model cannot modify; a deterministic monitor that detects attempts to access withheld paths or alter verification scripts; and a frozen LLM judge that can veto the verifier when gaming occurs within the allowed tool surface. Whether these controls hold across adversarial fine-tunes or more capable future variants is a question the release announcement does not address.
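The three controls can be pictured as successive gates on the reward. The sketch below is a guess at the general shape, not the release's code: `TrustBoundary`, `ToolCall`, `monitor`, `gated_reward`, the path layout, and the stub judge are all hypothetical names and assumptions, and the frozen LLM judge is reduced to a plain callable.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, List, Sequence, Tuple

@dataclass(frozen=True)  # frozen: the policy-side code cannot mutate it
class TrustBoundary:
    allowed_tools: frozenset
    withheld_paths: Tuple[str, ...]
    verifier_paths: Tuple[str, ...]

@dataclass(frozen=True)
class ToolCall:
    tool: str
    path: str
    writes: bool = False

def _is_under(path: str, roots: Sequence[str]) -> bool:
    """True if path equals or sits below any root, after collapsing '..'."""
    p = PurePosixPath(posixpath.normpath(path))
    return any(p == PurePosixPath(r) or PurePosixPath(r) in p.parents
               for r in roots)

def monitor(trace: Sequence[ToolCall], boundary: TrustBoundary) -> List[str]:
    """Deterministic checks; returns a list of violation descriptions."""
    violations = []
    for call in trace:
        if call.tool not in boundary.allowed_tools:
            violations.append(f"tool outside boundary: {call.tool}")
        if _is_under(call.path, boundary.withheld_paths):
            violations.append(f"withheld path accessed: {call.path}")
        if call.writes and _is_under(call.path, boundary.verifier_paths):
            violations.append(f"verifier script altered: {call.path}")
    return violations

def gated_reward(trace: Sequence[ToolCall],
                 verifier_passed: bool,
                 boundary: TrustBoundary,
                 judge: Callable[[Sequence[ToolCall]], bool]) -> float:
    """Reward is granted only if every layer agrees."""
    if monitor(trace, boundary):
        return 0.0          # deterministic monitor tripped
    if not verifier_passed:
        return 0.0          # task verifier failed
    if judge(trace):
        return 0.0          # judge vetoes a pass it considers gamed
    return 1.0
```

The ordering is a design choice in this sketch: the cheap deterministic checks run first, and the judge is consulted only on traces that already passed the verifier, since its role as described is to veto passes rather than to grade failures.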

On benchmark performance, DeepReinforce claims the 397B flagship scores 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, which the company says matches Claude Opus 4.7 and outperforms open peers including MiniMax M3 and DeepSeek-V4-Pro. The 35B model is reported to beat similarly sized Qwen and Gemma builds. At the small end, the 9B variant is said to reach 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified, matching models several times its size. These figures come from DeepReinforce directly; the release does not include independent benchmark replication.

The lab is not a first-time entrant to open RL research. Prior releases include CUDA-L1 and the IterX optimization loop for code agents, so Ornith-1.0 extends a pattern rather than arriving from nowhere.

What makes this release worth watching is not the benchmark numbers but the training method. The open-source coding-model field has been crowded since DeepSeek-Coder-V2, and raw benchmark improvements are common enough that they warrant a cautious reading. Self-authoring scaffolds, if the approach holds up at scale, would let a model specialize its problem-solving strategy to each task class without human harness engineering. That shifts the labor from scaffold design to reward specification, which is both easier to automate and easier to get wrong.

Teams building coding agents on open weights should pull the 35B MoE variant and run it against their own task distribution before accepting the benchmark claims at face value. If the scaffold-mutation mechanism transfers to their domain, it is a qualitatively different capability from the prior generation of RL-tuned coding models.

Reported by TestingCatalog on June 25, 2026.