| Main Authors: | Liu, Yuhong, Zhang, Beichen, Zang, Yuhang, Cao, Yuhang, Xing, Long, Dong, Xiaoyi, Duan, Haodong, Lin, Dahua, Wang, Jiaqi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.27606 |
Similar Items
Think Visually, Reason Textually: Vision-Language Synergy in ARC
by: Zhang, Beichen, et al.
Published: (2025)
Visual-RFT: Visual Reinforcement Fine-Tuning
by: Liu, Ziyu, et al.
Published: (2025)
Visual Agentic Reinforcement Fine-Tuning
by: Liu, Ziyu, et al.
Published: (2025)
Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
by: Li, Jinsong, et al.
Published: (2026)
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
by: Xing, Long, et al.
Published: (2025)
SPARK: Synergistic Policy And Reward Co-Evolving Framework
by: Liu, Ziyu, et al.
Published: (2025)
ETCHR: Editing To Clarify and Harness Reasoning
by: Zhang, Beichen, et al.
Published: (2026)
MM-IFEngine: Towards Multimodal Instruction Following
by: Ding, Shengyuan, et al.
Published: (2025)
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
by: Zhang, Beichen, et al.
Published: (2025)
Long-CLIP: Unlocking the Long-Text Capability of CLIP
by: Zhang, Beichen, et al.
Published: (2024)
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
by: Liu, Ziyu, et al.
Published: (2024)
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
by: Qian, Rui, et al.
Published: (2025)
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
by: Sun, Zeyi, et al.
Published: (2025)
Streaming Long Video Understanding with Large Language Models
by: Qian, Rui, et al.
Published: (2024)
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
by: Ding, Shengyuan, et al.
Published: (2025)
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
by: Ding, Shuangrui, et al.
Published: (2024)
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
by: Cao, Yuhang, et al.
Published: (2024)
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
by: Wei, Xilin, et al.
Published: (2025)
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
by: Xing, Long, et al.
Published: (2024)
2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC
by: Zhang, Zhixiong, et al.
Published: (2025)
ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
by: Bu, Jiazi, et al.
Published: (2024)
Advancing Complex Video Object Segmentation via Progressive Concept Construction
by: Zhang, Zhixiong, et al.
Published: (2025)
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
by: Sun, Zeyi, et al.
Published: (2025)
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
by: Huang, Qidong, et al.
Published: (2024)
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
by: Zang, Yuhang, et al.
Published: (2025)
Are We on the Right Way for Evaluating Large Vision-Language Models?
by: Chen, Lin, et al.
Published: (2024)
HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
by: Bu, Jiazi, et al.
Published: (2025)
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
by: Xing, Long, et al.
Published: (2025)
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
by: Chen, Lin, et al.
Published: (2024)
Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition
by: Dong, Yuhao, et al.
Published: (2026)
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
by: Sun, Zeyi, et al.
Published: (2024)
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
by: Li, Yifei, et al.
Published: (2025)
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
by: Dai, Xuanlang, et al.
Published: (2026)
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
by: Liu, Ziyu, et al.
Published: (2024)
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
by: Liu, Ziyu, et al.
Published: (2024)
Unified Scene Representation and Reconstruction for 3D Large Language Models
by: Chu, Tao, et al.
Published: (2024)
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
by: Sun, Zeyi, et al.
Published: (2024)
Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models
by: Li, Jinsong, et al.
Published: (2025)
DiCache: Let Diffusion Model Determine Its Own Cache
by: Bu, Jiazi, et al.
Published: (2025)
SIM-CoT: Supervised Implicit Chain-of-Thought
by: Wei, Xilin, et al.
Published: (2025)