| Main Authors: | Liu, Qing'an, Feng, Juntong, Wang, Yuhao, Han, Xinzhe, Cheng, Yujie, Zhu, Yue, Diao, Haiwen, Zhuge, Yunzhi, Lu, Huchuan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.04802 |
Similar Items
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
by: Diao, Haiwen, et al.
Published: (2024)
Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching
by: Diao, Haiwen, et al.
Published: (2024)
KARST: Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission for Visual Classification
by: Zhu, Yue, et al.
Published: (2025)
Complementary and Contrastive Learning for Audio-Visual Segmentation
by: Gong, Sitong, et al.
Published: (2025)
3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding
by: Xiong, Haomiao, et al.
Published: (2025)
Regularizing Subspace Redundancy of Low-Rank Adaptation
by: Zhu, Yue, et al.
Published: (2025)
LLMs Can Evolve Continually on Modality for X-Modal Reasoning
by: Yu, Jiazuo, et al.
Published: (2024)
Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding
by: Zhang, Wenbo, et al.
Published: (2024)
MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models
by: Li, Xiaomin, et al.
Published: (2024)
Unveiling Encoder-Free Vision-Language Models
by: Diao, Haiwen, et al.
Published: (2024)
Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation
by: Zhuge, Yunzhi, et al.
Published: (2025)
Towards Cross-Platform Generalization: Domain Adaptive 3D Detection with Augmentation and Pseudo-Labeling
by: Feng, Xiyan, et al.
Published: (2026)
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
by: Xiong, Haomiao, et al.
Published: (2025)
Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
by: Yu, Jiazuo, et al.
Published: (2024)
AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation
by: Gong, Sitong, et al.
Published: (2025)
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
by: Gong, Kaixiong, et al.
Published: (2024)
Do Vision-Language Models Really Understand Visual Language?
by: Hou, Yifan, et al.
Published: (2024)
Parameter Aware Mamba Model for Multi-task Dense Prediction
by: Yu, Xinzhuo, et al.
Published: (2025)
Reinforcing Video Reasoning Segmentation to Think Before It Segments
by: Gong, Sitong, et al.
Published: (2025)
Learning Universal Features for Generalizable Image Forgery Localization
by: Zhao, Hengrun, et al.
Published: (2025)
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
by: Gong, Sitong, et al.
Published: (2025)
End-to-End Vision Tokenizer Tuning
by: Wang, Wenxuan, et al.
Published: (2025)
SUPQA: LLM‐based Geo‐Visualization for Subjective Urban Performance Question‐Answering
by: Haiwen Huang, et al.
Published: (2025)
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
by: Wang, Yuhao, et al.
Published: (2025)
LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification
by: Zhang, Pingping, et al.
Published: (2025)
FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning
by: Zhang, Lu, et al.
Published: (2025)
StableIdentity: Inserting Anybody into Anywhere at First Sight
by: Wang, Qinghe, et al.
Published: (2024)
Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking
by: Zheng, Zirui, et al.
Published: (2025)
GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning
by: Diao, Haiwen, et al.
Published: (2024)
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory
by: Diao, Haiwen, et al.
Published: (2023)
TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?
by: Guo, Xinyu, et al.
Published: (2026)
Do MLLMs Really Understand the Charts?
by: Zhang, Xiao, et al.
Published: (2025)
VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
by: Zhou, Junjie, et al.
Published: (2024)
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
by: Diao, Haiwen, et al.
Published: (2025)
Rethinking Text-based Protein Understanding: Retrieval or LLM?
by: Wu, Juntong, et al.
Published: (2025)
Extracting Abstraction Dimensions by Identifying Syntax Pattern from Texts
by: Zhou, Jian, et al.
Published: (2025)
Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation
by: Ye, Chengyang, et al.
Published: (2024)
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
by: Han, Jiaming, et al.
Published: (2025)
VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning
by: Li, Baolu, et al.
Published: (2025)
LitVISTA: A Benchmark for Narrative Orchestration in Literary Text
by: Lu, Mingzhe, et al.
Published: (2026)