Library Catalog

Bibliographic Details
Main Authors:	Liang, Yongyuan; Chow, Wei; Li, Feng; Ma, Ziqiao; Wang, Xiyao; Mao, Jiageng; Chen, Jiuhai; Gu, Jiatao; Wang, Yue; Huang, Furong
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2511.01163

Similar Items

Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding
by: Liang, Yongyuan, et al.
Published: (2025)

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
by: Chow, Wei, et al.
Published: (2025)

MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
by: Cai, Zikui, et al.
Published: (2025)

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
by: Wang, Xiyao, et al.
Published: (2024)

Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval
by: Huang, Hailang, et al.
Published: (2024)

Premier-TACO is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss
by: Zheng, Ruijie, et al.
Published: (2024)

Make-An-Agent: A Generalizable Policy Network Generator with Behavior-Prompted Diffusion
by: Liang, Yongyuan, et al.
Published: (2024)

Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies
by: Liu, Xiangyu, et al.
Published: (2024)

Is poisoning a real threat to LLM alignment? Maybe more so than you think
by: Pathmanathan, Pankayaraj, et al.
Published: (2024)

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
by: Ma, Menghe, et al.
Published: (2026)

Generalization Bounds via Conditional $f$-Information
by: Wang, Ziqiao, et al.
Published: (2024)

Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
by: Wang, Ziqiao, et al.
Published: (2022)

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
by: Wang, Xiyao, et al.
Published: (2025)

Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning
by: Zhang, Zhicheng, et al.
Published: (2026)

LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding
by: Wang, Xiaodong, et al.
Published: (2026)

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
by: Wang, Xiyao, et al.
Published: (2025)

SeFA-Policy: Fast and Accurate Visuomotor Policy Learning with Selective Flow Alignment
by: Xue, Rong, et al.
Published: (2025)

Generalization in Federated Learning: A Conditional Mutual Information Framework
by: Wang, Ziqiao, et al.
Published: (2025)

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
by: Lv, Guannan, et al.
Published: (2026)

ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
by: Schroeder, Philip, et al.
Published: (2025)

On $f$-Divergence Principled Domain Adaptation: An Improved Framework
by: Wang, Ziqiao, et al.
Published: (2024)

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
by: Wang, Xiyao, et al.
Published: (2024)

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
by: Oh, Yeongtak, et al.
Published: (2026)

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
by: Zhong, Hao, et al.
Published: (2025)

Agentic Critical Training
by: Liu, Weize, et al.
Published: (2026)

ROVER: Robust Loop Closure Verification with Trajectory Prior in Repetitive Environments
by: Yu, Jingwen, et al.
Published: (2025)

Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models
by: Wu, Yanru, et al.
Published: (2026)

ROVER: A Multi-Season Dataset for Visual SLAM
by: Schmidt, Fabian, et al.
Published: (2024)

PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction
by: Feng, Qiao, et al.
Published: (2025)

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
by: Ju, Yuanchen, et al.
Published: (2025)

Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
by: Ying, Kaining, et al.
Published: (2025)

A Language Agent for Autonomous Driving
by: Mao, Jiageng, et al.
Published: (2023)

Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation
by: Zhou, Yuhang, et al.
Published: (2024)

WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
by: Chow, Wei, et al.
Published: (2025)

COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL
by: Wang, Xiyao, et al.
Published: (2023)

Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate
by: Xu, Yuancheng, et al.
Published: (2023)

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
by: Wang, Xiyao, et al.
Published: (2025)

Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation
by: Shang, Yingjia, et al.
Published: (2025)

PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios
by: Lu, Xudong, et al.
Published: (2026)

UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
by: Zhang, Guozhen, et al.
Published: (2025)