Bibliographic Details
Main Authors: Kong, Zicheng, Ma, Dehua, Xu, Zhenbo, Yang, Alven, Ru, Yiwei, Wang, Haoran, Zhou, Zixuan, Bie, Fuqing, Xiang, Liuyu, Wu, Huijia, Zhao, Jian, He, Zhaofeng
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2602.00846
contents Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce Omni-RRM, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across text, image, video, and audio. At the core of our approach is Omni-Preference, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to reconcile and filter preferences while providing a modality-aware rubric-grounded rationale for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-N selection and transfers to text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08.
format Preprint
id arxiv_https___arxiv_org_abs_2602_00846
institution arXiv
publishDate 2026
record_format arxiv
title Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis
topic Computation and Language
url https://arxiv.org/abs/2602.00846
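
Illustrative note: the abstract mentions structured, dimension-wise preference judgments and Best-of-N selection. The sketch below shows one way those two ideas could fit together. It is not the authors' implementation. The names Judgment, best_of_n, and toy_judge, the rubric dimensions "relevance" and "completeness", the majority-vote aggregation, and the sequential knockout strategy are all assumptions made for illustration; a real setup would call the reward model in place of toy_judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Judgment:
    """Hypothetical structured output of a pairwise, rubric-grounded judge."""
    verdicts: Dict[str, str]    # dimension -> "A", "B", or "tie"
    rationales: Dict[str, str]  # dimension -> short justification

    def preferred(self) -> str:
        # Assumed aggregation: simple majority vote over rubric dimensions.
        a = sum(v == "A" for v in self.verdicts.values())
        b = sum(v == "B" for v in self.verdicts.values())
        if a == b:
            return "tie"
        return "A" if a > b else "B"


# A judge takes (prompt, response_a, response_b) and returns a Judgment.
Judge = Callable[[str, str, str], Judgment]


def best_of_n(prompt: str, candidates: List[str], judge: Judge) -> str:
    """Pick one of N candidates by sequential pairwise knockout.

    The current best is kept unless the judge prefers the challenger,
    so ties favor the earlier candidate.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(prompt, best, challenger).preferred() == "B":
            best = challenger
    return best


def _compare(x: int, y: int) -> str:
    return "A" if x > y else "B" if y > x else "tie"


def toy_judge(prompt: str, a: str, b: str) -> Judgment:
    """Stand-in judge using word overlap and length; NOT a real reward model."""
    keywords = set(prompt.lower().split())

    def overlap(response: str) -> int:
        return len(keywords & set(response.lower().split()))

    verdicts = {
        "relevance": _compare(overlap(a), overlap(b)),
        "completeness": _compare(len(a.split()), len(b.split())),
    }
    rationales = {dim: f"{dim} verdict: {v}" for dim, v in verdicts.items()}
    return Judgment(verdicts, rationales)


if __name__ == "__main__":
    responses = ["a dog", "the cat sleeps", "the cat sleeps on the warm mat"]
    print(best_of_n("describe the cat", responses, toy_judge))
```

The knockout loop needs N-1 judge calls for N candidates. A round-robin comparison would be more robust to inconsistent pairwise verdicts but costs a quadratic number of calls.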