| Main Authors: | Liang, Yongyuan, Chow, Wei, Li, Feng, Ma, Ziqiao, Wang, Xiyao, Mao, Jiageng, Chen, Jiuhai, Gu, Jiatao, Wang, Yue, Huang, Furong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2511.01163 |
Similar Items
Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding
by: Liang, Yongyuan, et al.
Published: (2025)
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
by: Chow, Wei, et al.
Published: (2025)
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
by: Cai, Zikui, et al.
Published: (2025)
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
by: Wang, Xiyao, et al.
Published: (2024)
Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval
by: Huang, Hailang, et al.
Published: (2024)
Premier-TACO is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss
by: Zheng, Ruijie, et al.
Published: (2024)
Make-An-Agent: A Generalizable Policy Network Generator with Behavior-Prompted Diffusion
by: Liang, Yongyuan, et al.
Published: (2024)
Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies
by: Liu, Xiangyu, et al.
Published: (2024)
Is poisoning a real threat to LLM alignment? Maybe more so than you think
by: Pathmanathan, Pankayaraj, et al.
Published: (2024)
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
by: Ma, Menghe, et al.
Published: (2026)
Generalization Bounds via Conditional $f$-Information
by: Wang, Ziqiao, et al.
Published: (2024)
Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
by: Wang, Ziqiao, et al.
Published: (2022)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
by: Wang, Xiyao, et al.
Published: (2025)
Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning
by: Zhang, Zhicheng, et al.
Published: (2026)
LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding
by: Wang, Xiaodong, et al.
Published: (2026)
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
by: Wang, Xiyao, et al.
Published: (2025)
SeFA-Policy: Fast and Accurate Visuomotor Policy Learning with Selective Flow Alignment
by: Xue, Rong, et al.
Published: (2025)
Generalization in Federated Learning: A Conditional Mutual Information Framework
by: Wang, Ziqiao, et al.
Published: (2025)
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
by: Lv, Guannan, et al.
Published: (2026)
ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
by: Schroeder, Philip, et al.
Published: (2025)
On $f$-Divergence Principled Domain Adaptation: An Improved Framework
by: Wang, Ziqiao, et al.
Published: (2024)
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
by: Wang, Xiyao, et al.
Published: (2024)
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
by: Oh, Yeongtak, et al.
Published: (2026)
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
by: Zhong, Hao, et al.
Published: (2025)
Agentic Critical Training
by: Liu, Weize, et al.
Published: (2026)
ROVER: Robust Loop Closure Verification with Trajectory Prior in Repetitive Environments
by: Yu, Jingwen, et al.
Published: (2025)
Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models
by: Wu, Yanru, et al.
Published: (2026)
ROVER: A Multi-Season Dataset for Visual SLAM
by: Schmidt, Fabian, et al.
Published: (2024)
PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction
by: Feng, Qiao, et al.
Published: (2025)
MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
by: Ju, Yuanchen, et al.
Published: (2025)
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
by: Ying, Kaining, et al.
Published: (2025)
A Language Agent for Autonomous Driving
by: Mao, Jiageng, et al.
Published: (2023)
Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation
by: Zhou, Yuhang, et al.
Published: (2024)
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
by: Chow, Wei, et al.
Published: (2025)
COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL
by: Wang, Xiyao, et al.
Published: (2023)
Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate
by: Xu, Yuancheng, et al.
Published: (2023)
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
by: Wang, Xiyao, et al.
Published: (2025)
Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation
by: Shang, Yingjia, et al.
Published: (2025)
PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios
by: Lu, Xudong, et al.
Published: (2026)
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
by: Zhang, Guozhen, et al.
Published: (2025)