RadixArk has open-sourced Miles, a reinforcement learning framework built for post-training large language models at cluster scale. PyTorch announced the release in a blog post describing the framework as a way to keep RL post-training reproducible even as it turns into a distributed-systems problem. The stakes are practical: teams running RL on frontier-scale models now have to coordinate rollout, training, and weight synchronization across thousands of GPUs, and most existing tooling was not built for that.

Miles composes four existing systems rather than replacing them. It uses SGLang for high-throughput rollout generation, NVIDIA’s Megatron-LM for distributed training, Ray for cluster orchestration, and PyTorch as the shared programming layer for models, autograd, and profiling. According to the PyTorch blog post, this composition matters because rollout and training have different performance profiles: rollout is memory-bandwidth-bound during decoding, while training is compute-bound and communication-heavy. Keeping the two phases synchronized, especially for weight transfer and routing consistency, is where the engineering effort concentrates.

The framework follows what its authors call a small-core, many-edges design. The trainer itself stays compact, and the parts teams most often want to change (rollout logic, reward computation, loss functions, sample filtering, and training-loop hooks) attach through user-supplied Python modules at launch time. That structure lets infrastructure teams adapt Miles to new algorithms without forking the codebase, which sets it apart from RL frameworks that require modifying a shared trainer directly.
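The launch-time plug-in pattern described above can be illustrated with a minimal sketch. It assumes hooks are named by a "module:attribute" string and resolved with Python's standard importlib; the `load_hook` helper, the `HOOK_SPECS` table, and the hook names are hypothetical and are not taken from Miles, whose actual flags and interfaces the announcement does not spell out.

```python
import importlib
from typing import Callable, Dict


def load_hook(spec: str) -> Callable:
    """Resolve a 'package.module:attribute' string to a callable."""
    module_name, sep, attr_name = spec.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    hook = getattr(module, attr_name)
    if not callable(hook):
        raise TypeError(f"{spec!r} does not resolve to a callable")
    return hook


# Hypothetical hook table. Standard-library functions stand in for the
# user-written reward and filter modules a real launch would point at.
HOOK_SPECS: Dict[str, str] = {
    "reward_fn": "math:tanh",
    "sample_filter": "operator:truth",
}

# The core only ever sees resolved callables, never the user's module layout.
hooks = {name: load_hook(spec) for name, spec in HOOK_SPECS.items()}
```

Because the core depends only on the resolved callables, swapping a reward function in a design like this means changing a launch argument rather than editing trainer code.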

Every long-lived process in a Miles run, from trainer ranks to rollout servers, is represented as a Ray actor. That gives Miles Ray’s existing operator surface (job submission, worker supervision, log aggregation, and dashboard visibility) without additional bolt-on infrastructure. Ray’s GPU-aware scheduler also supports both disaggregated layouts, where rollout and training run on separate nodes, and colocated layouts on shared nodes. In addition, Miles can run in a fully asynchronous mode where rollout actors stream samples into a queue that the trainer drains independently, removing the blocking dependency between generation and training that slows down synchronous pipelines.
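The asynchronous mode is a producer-consumer arrangement, which can be sketched with nothing but the Python standard library. In the sketch below, threads stand in for Ray actors and a `queue.Queue` stands in for the sample queue; the function names, the sentinel convention, and the bounded queue size are all assumptions made for illustration, not details of how Miles implements it.

```python
import queue
import threading
from typing import List

SENTINEL = None  # hypothetical end-of-stream marker


def rollout_worker(out_q: queue.Queue, worker_id: int, num_samples: int) -> None:
    """Stand-in for a rollout actor: stream generated samples into the queue."""
    for i in range(num_samples):
        out_q.put({"worker": worker_id, "sample": i})


def trainer(in_q: queue.Queue, batch_size: int) -> List[list]:
    """Stand-in for the trainer: drain the queue and batch samples as they arrive."""
    batches, current = [], []
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        current.append(item)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:  # flush the final partial batch
        batches.append(current)
    return batches


def run(num_workers: int = 3, samples_per_worker: int = 4, batch_size: int = 5) -> List[list]:
    # Assumption: a bounded queue, so rollout cannot run arbitrarily far ahead.
    q: queue.Queue = queue.Queue(maxsize=8)
    result: List[list] = []
    consumer = threading.Thread(target=lambda: result.extend(trainer(q, batch_size)))
    producers = [
        threading.Thread(target=rollout_worker, args=(q, w, samples_per_worker))
        for w in range(num_workers)
    ]
    consumer.start()
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    q.put(SENTINEL)  # all producers are done, so tell the trainer to stop
    consumer.join()
    return result


batches = run()
```

Neither side waits for the other to finish a full phase: the trainer forms a batch as soon as enough samples exist, and producers block only when the queue is full.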

On the training side, Miles plugs directly into Megatron-LM’s argument parser, model construction, and distributed checkpoint format rather than wrapping it as a black box. New model architectures are added through plug-in specs, small files that insert custom PyTorch components into Megatron’s pipeline. PyTorch’s post states this lets Miles support architectures such as DeepSeek-V3/V4, GLM-4.7, and Qwen3 MoE variants without maintaining a long-lived Megatron fork.

Mixture-of-experts models introduce a specific failure mode that Miles addresses directly: routing decisions made during rollout can drift from the ones computed during training, destabilizing the policy update. Miles counters this with what it calls Rollout Routing Replay, which preserves routing decisions across the rollout and training boundary. The framework also ships a unified low-precision pipeline spanning BF16, FP8, MXFP8, and INT4-QAT, applied consistently across both rollout and training rather than as isolated backend features.
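The routing-drift problem and the replay idea can be shown on a single token. The sketch below assumes a plain top-k softmax router and made-up logits; `route` and its `replay` parameter are hypothetical names, and the code illustrates the general idea of reusing rollout-time expert choices during training rather than Miles's actual implementation.

```python
import math
from typing import List, Optional, Tuple


def top_k(logits: List[float], k: int) -> List[int]:
    """Indices of the k largest router logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def route(
    logits: List[float], k: int, replay: Optional[List[int]] = None
) -> Tuple[List[int], List[float]]:
    """Choose experts and gate weights for one token.

    When `replay` is given, the recorded expert indices are reused, and only
    the gate weights are recomputed from the current logits.
    """
    experts = replay if replay is not None else top_k(logits, k)
    peak = max(logits[e] for e in experts)
    exps = [math.exp(logits[e] - peak) for e in experts]
    total = sum(exps)
    return experts, [x / total for x in exps]


# Made-up router logits for one token over four experts. The training engine
# sees slightly different values for experts 2 and 3 than the rollout engine.
rollout_logits = [0.10, 2.00, 1.50, 1.49]
train_logits = [0.10, 2.00, 1.49, 1.50]

recorded, _ = route(rollout_logits, k=2)                        # chosen at rollout
drifted, _ = route(train_logits, k=2)                           # recomputed: differs
replayed, weights = route(train_logits, k=2, replay=recorded)   # replay: matches
```

A difference of 0.01 in two logits is enough to send the token to a different expert when routing is recomputed, so the training step would score a different computation than the one that generated the sample. With replay, the expert set matches the rollout and only the gate weights reflect the training-side logits.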

The release announcement lists ready-to-run recipes for DeepSeek-V4, Kimi K2.5 and K2.6, GLM-5 and 5.1, and Qwen3.5 and 3.6, alongside support for NVIDIA’s Hopper and Blackwell GPU generations. The blog post does not include independent benchmarks comparing Miles against other RL frameworks such as OpenRLHF or veRL, so its throughput and cost advantages remain unverified outside RadixArk’s own claims.

For infrastructure teams already running Megatron-LM training jobs, Miles offers a path to add RL post-training without adopting a separate, incompatible stack. Teams evaluating RL frameworks for MoE models in particular should test Rollout Routing Replay against their own routing-drift failures before committing a training budget to it.

Source: PyTorch's blog post announcing the Miles framework.