Lujun Gui∗, Qingnan Ren

∗: Equal contribution

Github: ShadeCloak/ADORA

Wandb report: ADORA & ADORA_VL

Huggingface: AdoraRL

Mar 20, 2025

<aside> 💡 This blog presents ADORA (Advantage Dynamics via Online Rollout Adaptation), a reinforcement learning framework that dynamically adjusts advantage values during training based on the model's rollout distribution. We demonstrate how ADORA substantially enhances long Chain-of-Thought (CoT) reasoning and reflective capabilities in Large Language Models (LLMs) and Vision-Language Models (VLMs) through training on logic puzzles and geometry problems, respectively. The framework is plug-and-play and, in principle, applicable to any advantage-based RL method. Notably, we have fully open-sourced our training code and implementation details, aiming to inspire further advances in post-training research.

For LLMs, our ADORA implementation in the Logic-RL framework reaches an AMC score of 40 with only 100 training steps, versus the original paper's 39 at 1200 steps, while maintaining a comparable AIME score of 7. For VLMs, using only 2K samples from the Geometry3K training set and starting from Qwen2.5-VL-7B-Instruct, we attain 73.5% accuracy on MathVista alongside steadily increasing response lengths, establishing state-of-the-art performance among multimodal reproductions of DeepSeek-R1-Zero.

</aside>

<aside> 🛠 ADORA can be implemented in verl or OpenRLHF by modifying only a single function. You still need to define, according to your specific training objective, a method that generates advantage weights from the actor's rollout results, and you can choose to apply ADORA only during certain stages of RL training (a hypothetical sketch of such stage-gating follows this note). Notably, ADORA is orthogonal to other techniques and integrates seamlessly with cold-start setups and the recently proposed DAPO. We welcome feedback, improvements, and collaboration opportunities to further explore ADORA's potential implementations.

</aside>
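To illustrate the stage-gating mentioned in the note above, the sketch below shows one way to apply ADORA's rescaling only within a chosen window of training steps. The helper name `maybe_apply_adora`, its signature, and the step-window arguments are illustrative assumptions, not part of the verl or OpenRLHF API.

    import torch

    def maybe_apply_adora(advantages: torch.Tensor,
                          adora_weights: torch.Tensor,
                          global_step: int,
                          start_step: int = 0,
                          end_step: int = 10_000) -> torch.Tensor:
        """Rescale advantages by per-prompt ADORA weights, but only inside a
        configurable window of RL training steps (hypothetical helper)."""
        if start_step <= global_step < end_step:
            return advantages * adora_weights
        return advantages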


Introduction

The DeepSeek-R1 report is highly informative: its simple and efficient rule-based RL enables the model to generate longer responses and markedly strengthens its reflective reasoning as RL training progresses. Recently, many efforts to reproduce DeepSeek-R1 have explored curriculum learning or prompt filtering to control learning trajectories and prompt quality. However, methods aligned with reinforcement learning's inherent update dynamics remain largely under-explored.

This blog proposes ADORA (Advantage Dynamics via Online Rollout Adaptation), an RL framework that dynamically adjusts advantage values based on the model's rollout distribution. Through experiments focused on enhancing the reasoning capabilities of LLMs and VLMs, we demonstrate the effectiveness of the ADORA framework.

Advantage Dynamics via Online Rollout Adaptation

The core challenge lies in aligning RL training objectives with the capabilities we want to improve. Current methods apply uniform weighting to all training prompts when optimizing the KL-regularized, advantage-based objective. This homogeneous treatment, however, overlooks how the importance of different prompt types shifts over the course of training: certain prompts matter more during particular phases of capability development, while others serve better as validation checks once foundational skills are in place. Effective RL alignment should therefore weight prompts adaptively according to their current relevance to the model's evolving competency, rather than maintaining static parity across all training instances.

ADORA addresses this by dynamically adjusting advantage values at per-prompt granularity based on an analysis of the rollout responses. We abstract ADORA into the following core operation:

    advantages *= weight_func(sequences_per_prompt, rewards, **aux_metrics)

This operation applies the computed weights to the original advantage values, reshaping the prompt-specific advantage distribution as training progresses.
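
As a concrete but purely illustrative example, the sketch below fills in one possible `weight_func`. The rule shown here (emphasizing prompts whose rollout group mixes correct and incorrect responses, and down-weighting prompts the model solves or fails uniformly) is an assumption chosen for illustration, not ADORA's actual criterion; as noted in the setup aside, the weighting rule is something you define to match your training objective.

    import torch

    def weight_func(sequences_per_prompt, rewards: torch.Tensor, **aux_metrics) -> torch.Tensor:
        """Illustrative per-prompt weighting (not ADORA's actual rule).

        `rewards` is assumed to hold one correctness reward per rollout, shaped
        (num_prompts, group_size); `sequences_per_prompt` and any extra
        `aux_metrics` are unused by this toy rule but kept to match the one-liner above.
        """
        solve_rate = rewards.float().mean(dim=-1, keepdim=True)  # (num_prompts, 1)
        # 4 * p * (1 - p) peaks for mixed outcomes (p = 0.5) and vanishes when
        # the group is uniformly solved or uniformly failed.
        weights = 4.0 * solve_rate * (1.0 - solve_rate)
        # Broadcast each prompt's weight over every response in its group.
        return weights.expand_as(rewards)

Under the one-liner above, `advantages *= weight_func(...)` then scales every response's advantage for a given prompt by that prompt's weight, assuming advantages are likewise stored per response.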