Framework lock-in for agent training is a weaker constraint than it looks. In a post on X, Mishra strips away the abstractions of TRL, Unsloth, and PRIME-RL to argue that every agent-training system resolves to the same five-step loop: prompt the model, receive an action, pass it to an environment, compute a reward, and apply a gradient update.
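To see how little the loop needs, here is a minimal sketch of the five steps in pure Python. It is not Mishra's code, nor how TRL, Unsloth, or PRIME-RL implement training. Every name in it (`train`, `env_step`, `reward_fn`) is hypothetical, the "model" is assumed to be two logits rather than a language model, the environment rule is made up, and the update is a plain REINFORCE step chosen for brevity.

```python
import math
import random

# All names here are hypothetical; this is an illustration, not Mishra's code.
ACTIONS = ["create_shape", "connect"]


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def env_step(action):
    """Toy environment: accepts only create_shape (an assumed rule)."""
    return {"accepted": action == "create_shape"}


def reward_fn(observation):
    """Map the environment's verdict to a scalar reward."""
    return 1.0 if observation["accepted"] else 0.0


def train(steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # the entire "model": one logit per action
    for _ in range(steps):
        # 1. Prompt the model: here, just read its action distribution.
        probs = softmax(logits)
        # 2. Receive an action by sampling from that distribution.
        idx = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        # 3. Pass the action to the environment.
        observation = env_step(ACTIONS[idx])
        # 4. Compute a reward from the environment's response.
        reward = reward_fn(observation)
        # 5. Gradient update (REINFORCE): for a softmax policy,
        #    d log pi(a) / d logit_i = 1[i == a] - probs[i].
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == idx else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)
```

Calling `train()` should return a distribution that heavily favors `create_shape`, since that is the only action this toy environment rewards. Swapping the two logits for a real model changes the cost of steps 1 and 5, not the shape of the loop.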
To make the point concrete, Mishra built a toy text-to-diagram agent in pure Python. The model emits JSON actions of two types: create_shape and connect. A validating canvas environment receives those actions and checks them. The reward function combines four signals: JSON validity, schema compliance, layout quality, and semantic coverage of prompt keywords. Nothing in that stack requires a framework.
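A validating canvas and a four-signal reward of this kind might look roughly like the sketch below. The post as summarized here does not give the action schema, the layout metric, or the weighting, so all of those are assumptions: the field names (`id`, `label`, `x`, `y`, `from`, `to`), the "no two shapes share a position" layout check, the keyword rule (prompt words longer than three characters), and the equal weights are all hypothetical.

```python
import json


class Canvas:
    """Hypothetical validating canvas that tracks shapes and connections."""

    def __init__(self):
        self.shapes = {}  # shape id -> {"label": str, "x": number, "y": number}
        self.edges = []   # list of (from_id, to_id) pairs

    def apply(self, action):
        """Apply one parsed action; return True only if it passes the schema."""
        if not isinstance(action, dict):
            return False
        kind = action.get("type")
        if kind == "create_shape":
            sid, label = action.get("id"), action.get("label")
            x, y = action.get("x"), action.get("y")
            if not isinstance(sid, str) or sid in self.shapes:
                return False
            if not isinstance(label, str):
                return False
            if not all(isinstance(v, (int, float)) for v in (x, y)):
                return False
            self.shapes[sid] = {"label": label, "x": x, "y": y}
            return True
        if kind == "connect":
            src, dst = action.get("from"), action.get("to")
            if not (isinstance(src, str) and isinstance(dst, str)):
                return False
            if src == dst or src not in self.shapes or dst not in self.shapes:
                return False
            self.edges.append((src, dst))
            return True
        return False  # unknown action type


def score(prompt, raw_actions, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine four signals into one reward; each signal lies in [0, 1]."""
    canvas = Canvas()
    parsed = valid = 0
    for raw in raw_actions:
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against JSON validity
        parsed += 1
        if canvas.apply(action):
            valid += 1

    json_validity = parsed / len(raw_actions) if raw_actions else 0.0
    schema = valid / parsed if parsed else 0.0

    # Layout (assumed metric): fraction of shapes at a unique position.
    positions = {(s["x"], s["y"]) for s in canvas.shapes.values()}
    layout = len(positions) / len(canvas.shapes) if canvas.shapes else 0.0

    # Coverage (assumed metric): share of prompt keywords found in labels.
    keywords = {w for w in prompt.lower().split() if len(w) > 3}
    labels = " ".join(s["label"].lower() for s in canvas.shapes.values())
    coverage = (
        sum(1 for k in keywords if k in labels) / len(keywords)
        if keywords else 0.0
    )

    parts = (json_validity, schema, layout, coverage)
    return {
        "json_validity": json_validity,
        "schema": schema,
        "layout": layout,
        "coverage": coverage,
        "total": sum(w * p for w, p in zip(weights, parts)),
    }
```

Under these assumptions, a model that emits two well-formed `create_shape` actions labeled "Client" and "Server" plus a valid `connect` for the prompt "client server" earns the maximum total, while unparseable output earns zero. Returning the individual signals alongside the total makes it easier to see which one a policy is exploiting.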
The consequence for builders is practical. Teams that have invested engineering time in a specific training framework because it felt like load-bearing infrastructure should re-examine that assumption. If the loop is the same across frameworks, the choice comes down to ergonomics rather than capability. On this view, picking TRL over a custom loop does not change what the agent learns; it changes how much boilerplate you maintain.
Posted on X by Mishra on 2026-05-20.