FlowR2A: Learning Reward-to-Action Distribution
for Multimodal Driving Planning

Xirui Li1, Zhe Liu1†, Xiaoqing Ye2*, Wenhua Han2, Yifeng Pan2, Junyu Han2, Hengshuang Zhao1*
† Project lead   * Corresponding author

Highlights

  • Rewards as a condition, not a target. We reframe simulation rewards from discriminative scores into a generative condition, and learn the reward-conditioned action distribution p(a|r) with a flow-matching decoder.
  • Dense training supervision. Every simulated trajectory–reward pair becomes usable training signal, unifying the dense supervision of scoring-based planners with the dynamic proposal generation of anchor-based planners in a single generative model.
  • Fine-grained reward signals. Rewards cover safety, progress, comfort, and rule compliance, exposing rich signals for the model to internalize action–reward correlations.
  • Controllable test-time sampling. A reward target and an initial noise level expose an interpretable 2D sampling space, steering proposals via reward guidance and anchored sampling.
  • State-of-the-art, high-quality proposals. FlowR2A tops the NAVSIM v1 / v2 benchmarks under a lightweight backbone, with multimodal proposals of substantially higher quality than prior methods.

Framework

FlowR2A unifies the dense reward supervision of scoring-based methods with the dynamic proposal generation of anchor-based methods, all within a single generative model.

[Figure: Training pipeline of FlowR2A]
Training. A flow-matching action decoder is conditioned on fine-grained reward signals (safety, progress, comfort, rule compliance) injected via AdaLN. Every action–reward pair from simulation becomes a valid training sample, so the model internalizes the correlation between an action and its outcomes rather than imitating a single ground-truth trajectory.
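
As a rough illustration of the training recipe above, the following pure-Python sketch builds one flow-matching training example and applies an AdaLN modulation derived from a reward vector. It is a toy under stated assumptions, not the authors' implementation: the linear interpolation path, the convention that t = 1 is pure noise and t = 0 is the clean action, the linear reward-to-modulation map, and all helper names (`layer_norm`, `adaln`, `reward_to_modulation`, `flow_matching_sample`, `mse`) are hypothetical choices made for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(x, scale, shift):
    """AdaLN: layer-normalize x, then apply a condition-dependent scale and shift."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


def reward_to_modulation(reward, w_scale, w_shift):
    """Hypothetical linear map from a reward vector to per-feature (scale, shift)."""
    scale = [sum(w * r for w, r in zip(row, reward)) for row in w_scale]
    shift = [sum(w * r for w, r in zip(row, reward)) for row in w_shift]
    return scale, shift


def flow_matching_sample(action, rng):
    """Build one training example: noisy input x_t, noise level t, target velocity.

    Assumed path: x_t = (1 - t) * action + t * noise, so d x_t / dt = noise - action.
    """
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    x_t = [(1.0 - t) * a + t * n for a, n in zip(action, noise)]
    target_v = [n - a for a, n in zip(action, noise)]
    return x_t, t, target_v


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


# Toy usage: one (action, reward) pair from a hypothetical simulator rollout.
rng = random.Random(0)
action = [0.5, -1.0, 2.0, 0.0]          # flattened toy trajectory
reward = [1.0, 0.8, 0.9, 1.0]           # safety, progress, comfort, rule compliance
w_scale = [[0.1] * 4 for _ in action]   # hypothetical modulation weights
w_shift = [[0.0] * 4 for _ in action]
x_t, t, target_v = flow_matching_sample(action, rng)
scale, shift = reward_to_modulation(reward, w_scale, w_shift)
hidden = adaln(x_t, scale, shift)       # reward-modulated features for a decoder block
loss = mse(hidden, target_v)            # stand-in: a real decoder would map `hidden` to a velocity
```

Because the reward enters only as a condition, any simulated trajectory, good or bad, can be passed through `flow_matching_sample` with its own reward vector. In this sketch that is the sense in which every action–reward pair is a usable training sample.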
[Figure: Inference pipeline of FlowR2A]
Inference. Classifier-free reward guidance plus zero-shot anchored sampling span a 2D space of (score target, initial noise level). This produces a diverse set of high-quality proposals that a lightweight mode selector ranks for the final action.
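
The inference procedure described above can be sketched as follows, again as a toy under stated assumptions rather than the authors' code. The sketch assumes the standard classifier-free guidance combination of conditional and unconditional velocities, a linear noising path where t = 1 is pure noise, and plain Euler integration from t_init down to 0. The names `guided_velocity`, `anchored_sample`, and `toy_velocity` are hypothetical, and `toy_velocity` is a closed-form stand-in for a trained decoder.

```python
import random


def guided_velocity(velocity_fn, x, t, reward, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional velocity
    toward the reward-conditioned one."""
    v_cond = velocity_fn(x, t, reward)
    v_uncond = velocity_fn(x, t, None)   # None stands for the dropped condition
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def anchored_sample(velocity_fn, anchor, reward, t_init,
                    guidance_scale=2.0, steps=20, rng=None):
    """Noise an anchor trajectory to level t_init, then integrate back to t = 0."""
    if t_init <= 0.0:
        return list(anchor)              # no noise added: the anchor is returned as-is
    rng = rng or random.Random()
    # Partially noise the anchor: small t_init stays close to it, large t_init diversifies.
    x = [(1.0 - t_init) * a + t_init * rng.gauss(0.0, 1.0) for a in anchor]
    dt = t_init / steps
    for k in range(steps):
        t = t_init * (1.0 - k / steps)   # current noise level, decreasing toward 0
        v = guided_velocity(velocity_fn, x, t, reward, guidance_scale)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, reward):
    """Hypothetical stand-in for a trained decoder: a field whose flow ends at a
    single target point (the origin if unconditional, else the mean reward)."""
    m = 0.0 if reward is None else sum(reward) / len(reward)
    return [(xi - m) / t for xi in x]


# Sweep a small grid over the assumed 2D space (reward target, initial noise level).
rng = random.Random(0)
anchor = [1.0, -1.0]
proposals = {
    (r_high, t_init): anchored_sample(toy_velocity, anchor, [r_high, r_high],
                                      t_init, guidance_scale=1.0, steps=10, rng=rng)
    for r_high in (0.80, 0.95)
    for t_init in (0.75, 0.95)
}
```

In this toy field every sample collapses onto a single target, so the anchor only affects the starting point. With a learned decoder, the expectation stated in the text is that a lower initial noise level keeps proposals closer to the anchor. A mode selector, not sketched here, would then rank the entries of `proposals`.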

Video comparisons

Each planner's proposals are colored by PDMS, from red (0) to green (1).

Per-frame comparisons

Proposal quality

FlowR2A produces consistently high-quality proposal candidates, surpassing the prior multimodal planner iPad on both single and average proposal quality.

[Figure: top-K comparison]

FlowR2A's top-K proposals dominate prior planners across K.

NAVSIM Performance

FlowR2A achieves state-of-the-art performance on the NAVSIM v1 navtest benchmark under a lightweight backbone.

Interactive sampling space

FlowR2A offers flexible sampling control through two intuitive knobs: a reward target r_high steers proposals toward higher-PDMS regions, and an initial noise level t_init trades anchor fidelity for sampling diversity.

[Interactive figure: sampling space proposals]

Higher r_high guides proposals toward higher-reward regions; higher t_init introduces more sampling diversity around the anchor.

Training reward visualization

Fine-grained reward labels used in training partition dense trajectories in complementary ways, exposing rich signals for the model to internalize action–reward correlations.

BibTeX

@article{flowr2a2026,
  title         = {FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning},
  author        = {Li, Xirui and Liu, Zhe and Ye, Xiaoqing and Han, Wenhua and Pan, Yifeng and Han, Junyu and Zhao, Hengshuang},
  journal       = {arXiv preprint arXiv:2606.24231},
  year          = {2026}
}