Library Catalog

Bibliographic Details
Main Authors:	Liu, Qing'an; Feng, Juntong; Wang, Yuhao; Han, Xinzhe; Cheng, Yujie; Zhu, Yue; Diao, Haiwen; Zhuge, Yunzhi; Lu, Huchuan
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.04802

Similar Items

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
by: Diao, Haiwen, et al.
Published: (2024)

Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching
by: Diao, Haiwen, et al.
Published: (2024)

KARST: Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission for Visual Classification
by: Zhu, Yue, et al.
Published: (2025)

Complementary and Contrastive Learning for Audio-Visual Segmentation
by: Gong, Sitong, et al.
Published: (2025)

3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding
by: Xiong, Haomiao, et al.
Published: (2025)

Regularizing Subspace Redundancy of Low-Rank Adaptation
by: Zhu, Yue, et al.
Published: (2025)

LLMs Can Evolve Continually on Modality for X-Modal Reasoning
by: Yu, Jiazuo, et al.
Published: (2024)

Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding
by: Zhang, Wenbo, et al.
Published: (2024)

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models
by: Li, Xiaomin, et al.
Published: (2024)

Unveiling Encoder-Free Vision-Language Models
by: Diao, Haiwen, et al.
Published: (2024)

Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation
by: Zhuge, Yunzhi, et al.
Published: (2025)

Towards Cross-Platform Generalization: Domain Adaptive 3D Detection with Augmentation and Pseudo-Labeling
by: Feng, Xiyan, et al.
Published: (2026)

Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
by: Xiong, Haomiao, et al.
Published: (2025)

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
by: Yu, Jiazuo, et al.
Published: (2024)

AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation
by: Gong, Sitong, et al.
Published: (2025)

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
by: Gong, Kaixiong, et al.
Published: (2024)

Do Vision-Language Models Really Understand Visual Language?
by: Hou, Yifan, et al.
Published: (2024)

Parameter Aware Mamba Model for Multi-task Dense Prediction
by: Yu, Xinzhuo, et al.
Published: (2025)

Reinforcing Video Reasoning Segmentation to Think Before It Segments
by: Gong, Sitong, et al.
Published: (2025)

Learning Universal Features for Generalizable Image Forgery Localization
by: Zhao, Hengrun, et al.
Published: (2025)

The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
by: Gong, Sitong, et al.
Published: (2025)

End-to-End Vision Tokenizer Tuning
by: Wang, Wenxuan, et al.
Published: (2025)

SUPQA: LLM‐based Geo‐Visualization for Subjective Urban Performance Question‐Answering
by: Huang, Haiwen, et al.
Published: (2025)

IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
by: Wang, Yuhao, et al.
Published: (2025)

LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification
by: Zhang, Pingping, et al.
Published: (2025)

FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning
by: Zhang, Lu, et al.
Published: (2025)

StableIdentity: Inserting Anybody into Anywhere at First Sight
by: Wang, Qinghe, et al.
Published: (2024)

Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking
by: Zheng, Zirui, et al.
Published: (2025)

GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning
by: Diao, Haiwen, et al.
Published: (2024)

UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory
by: Diao, Haiwen, et al.
Published: (2023)

TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?
by: Guo, Xinyu, et al.
Published: (2026)

Do MLLMs Really Understand the Charts?
by: Zhang, Xiao, et al.
Published: (2025)

VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
by: Zhou, Junjie, et al.
Published: (2024)

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
by: Diao, Haiwen, et al.
Published: (2025)

Rethinking Text-based Protein Understanding: Retrieval or LLM?
by: Wu, Juntong, et al.
Published: (2025)

Extracting Abstraction Dimensions by Identifying Syntax Pattern from Texts
by: Zhou, Jian, et al.
Published: (2025)

Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation
by: Ye, Chengyang, et al.
Published: (2024)

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
by: Han, Jiaming, et al.
Published: (2025)

VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning
by: Li, Baolu, et al.
Published: (2025)

LitVISTA: A Benchmark for Narrative Orchestration in Literary Text
by: Lu, Mingzhe, et al.
Published: (2026)