| Main Authors: | Wang, Junjie; Lou, Xinghua; Li, Jason; Tian, Ye; Chen, Keyu; Li, Yulin; Kang, Bin; Mai, Jacky; Li, Yanwei; Tian, Zhuotao; Nie, Liqiang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.19639 |
Similar Items
Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
by: Wang, Junjie, et al.
Published: (2025)
CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval
by: Kang, Bin, et al.
Published: (2025)
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
by: Wang, Junjie, et al.
Published: (2025)
Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior
by: Li, Yulin, et al.
Published: (2025)
FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
by: Fan, Ziyang, et al.
Published: (2026)
AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech
by: Kang, Bin, et al.
Published: (2026)
LISA: Reasoning Segmentation via Large Language Model
by: Lai, Xin, et al.
Published: (2023)
SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation
by: Li, Wei, et al.
Published: (2025)
Rectifying Latent Space for Generative Single-Image Reflection Removal
by: Li, Mingjia, et al.
Published: (2025)
A Visual-inertial Localization Algorithm using Opportunistic Visual Beacons and Dead-Reckoning for GNSS-Denied Large-scale Applications
by: Zhang, Liqiang, et al.
Published: (2024)
Efficient Reasoning with Balanced Thinking
by: Li, Yulin, et al.
Published: (2026)
Mitigating Object Hallucinations via Sentence-Level Early Intervention
by: Peng, Shangpin, et al.
Published: (2025)
Unified Language-driven Zero-shot Domain Adaptation
by: Yang, Senqiao, et al.
Published: (2024)
Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation
by: Ning, Zhenhua, et al.
Published: (2025)
MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions
by: Zhang, Haoyu, et al.
Published: (2026)
Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark
by: Yuan, Haobo, et al.
Published: (2025)
Video-ToC: Video Tree-of-Cue Reasoning
by: Tan, Qizhong, et al.
Published: (2026)
Tracking Reflected Objects: A Benchmark
by: Guo, Xiaoyu, et al.
Published: (2024)
CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
by: Shao, Shitong, et al.
Published: (2025)
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
by: Liu, Zhihang, et al.
Published: (2025)
SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents
by: Yang, Yu, et al.
Published: (2026)
Towards Reflected Object Detection: A Benchmark
by: Wu, Yiquan, et al.
Published: (2024)
SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain
by: Zhou, Jiawei, et al.
Published: (2025)
Edit360: 2D Image Edits to 3D Assets from Any Angle
by: Huang, Junchao, et al.
Published: (2025)
TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity
by: Yang, Zheyuan, et al.
Published: (2026)
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
by: Chen, Sixiang, et al.
Published: (2026)
Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft
by: Huang, Junchao, et al.
Published: (2025)
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
by: Tian, Keyu, et al.
Published: (2024)
SJD-VP: Speculative Jacobi Decoding with Verification Prediction for Autoregressive Image Generation
by: Shan, Bingqi, et al.
Published: (2026)
Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space
by: Chen, Chao, et al.
Published: (2025)
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
by: Zhang, Fengji, et al.
Published: (2024)
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
by: Khalid, Umar, et al.
Published: (2024)
Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners
by: Liu, Qingyang, et al.
Published: (2026)
Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
by: Shao, Tong, et al.
Published: (2024)
Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning
by: Guo, Hao, et al.
Published: (2026)
LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model
by: Yang, Senqiao, et al.
Published: (2023)
ReflectCAP: Detailed Image Captioning with Reflective Memory
by: Min, Kyungmin, et al.
Published: (2026)
A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding
by: Zhang, Yue, et al.
Published: (2026)
Reflection Generation for Composite Image Using Diffusion Model
by: Zhao, Haonan, et al.
Published: (2026)
A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation
by: Wang, Liping, et al.
Published: (2026)