Staff View: Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis :: Library Catalog

Bibliographic Details
Main Authors:	Kong, Zicheng; Ma, Dehua; Xu, Zhenbo; Yang, Alven; Ru, Yiwei; Wang, Haoran; Zhou, Zixuan; Bie, Fuqing; Xiang, Liuyu; Wu, Huijia; Zhao, Jian; He, Zhaofeng
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.00846

_version_	1866908801927479296
author	Kong, Zicheng; Ma, Dehua; Xu, Zhenbo; Yang, Alven; Ru, Yiwei; Wang, Haoran; Zhou, Zixuan; Bie, Fuqing; Xiang, Liuyu; Wu, Huijia; Zhao, Jian; He, Zhaofeng
author_facet	Kong, Zicheng; Ma, Dehua; Xu, Zhenbo; Yang, Alven; Ru, Yiwei; Wang, Haoran; Zhou, Zixuan; Bie, Fuqing; Xiang, Liuyu; Wu, Huijia; Zhao, Jian; He, Zhaofeng
contents	Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce Omni-RRM, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across text, image, video, and audio. At the core of our approach is Omni-Preference, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to reconcile and filter preferences while providing a modality-aware rubric-grounded rationale for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-N selection and transfers to text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00846
institution	arXiv
publishDate	2026
record_format	arxiv
title	Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis
topic	Computation and Language
url	https://arxiv.org/abs/2602.00846

Similar Items