:: Library Catalog

Bibliographic Details
Main Authors:	Cui, Fangming; Zhang, Yonggang; Wang, Xuan; Tian, Xinmei; Yu, Jun
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2505.03414

Similar Items

A Similarity Paradigm Through Textual Regularization Without Forgetting
by: Cui, Fangming, et al.
Published: (2025)

Generalizable Prompt Learning of CLIP: A Brief Overview
by: Cui, Fangming, et al.
Published: (2025)

Advancing Prompt Learning through an External Layer
by: Cui, Fangming, et al.
Published: (2024)

Detecting Generated Images by Fitting Natural Image Distributions
by: Zhang, Yonggang, et al.
Published: (2025)

Epistemic Uncertainty for Generated Image Detection
by: Nie, Jun, et al.
Published: (2024)

TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
by: Li, Yanshu, et al.
Published: (2025)

Linking Representations with Multimodal Contrastive Learning
by: Arora, Abhishek, et al.
Published: (2023)

MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
by: Joshi, Siddharth, et al.
Published: (2025)

MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation
by: Loo, Gowen, et al.
Published: (2025)

Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models
by: Li, Changqun, et al.
Published: (2024)

Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation
by: Tian, Yuanhe, et al.
Published: (2025)

Computed Tomography Visual Question Answering with Cross-modal Feature Graphing
by: Tian, Yuanhe, et al.
Published: (2025)

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)

Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
by: Jia, Mengzhao, et al.
Published: (2024)

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
by: Zhao, Zhiyuan, et al.
Published: (2023)

Learning Speaker-Invariant Visual Features for Lipreading
by: Li, Yu, et al.
Published: (2025)

CLIP-Adapter: Better Vision-Language Models with Feature Adapters
by: Gao, Peng, et al.
Published: (2021)

Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
by: Wang, Zhenhailong, et al.
Published: (2025)

HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task
by: Tian, Yu, et al.
Published: (2024)

Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information
by: Zhao, Junbo, et al.
Published: (2025)

CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks
by: Wang, Yanan, et al.
Published: (2025)

Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow
by: Zhang, Chengsheng, et al.
Published: (2026)

Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation
by: Zhang, Jia-Chen, et al.
Published: (2026)

Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models
by: Shao, Zhenwei, et al.
Published: (2025)

Enhancing Sentiment Analysis through Multimodal Fusion: A BERT-DINOv2 Approach
by: Zhao, Taoxu, et al.
Published: (2025)

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
by: Chen, Zeren, et al.
Published: (2023)

TASO: Task-Aligned Sparse Optimization for Parameter-Efficient Model Adaptation
by: Miao, Daiye, et al.
Published: (2025)

Superpixel Semantics Representation and Pre-training for Vision-Language Task
by: Zhang, Siyu, et al.
Published: (2023)

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
by: Xu, Zhiyang, et al.
Published: (2024)

Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
by: Zhang, Xuan, et al.
Published: (2025)

EFLNet: Enhancing Feature Learning for Infrared Small Target Detection
by: Yang, Bo, et al.
Published: (2023)

FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval
by: Xie, Jingyou, et al.
Published: (2024)

Progressive Feature Fusion Network for Enhancing Image Quality Assessment
by: Wu, Kaiqun, et al.
Published: (2024)

Drawing the Line: Enhancing Trustworthiness of MLLMs Through the Power of Refusal
by: Wang, Yuhao, et al.
Published: (2024)

From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
by: Liu, Yulong, et al.
Published: (2024)

The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning
by: Chen, Renmiao, et al.
Published: (2026)

GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning
by: Siingh, Shikhhar, et al.
Published: (2025)

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
by: Zhang, Wenqi, et al.
Published: (2025)

FastPerson: Enhancing Video Learning through Effective Video Summarization that Preserves Linguistic and Visual Contexts
by: Kawamura, Kazuki, et al.
Published: (2024)

Enhancing Chest X-ray Classification through Knowledge Injection in Cross-Modality Learning
by: Yan, Yang, et al.
Published: (2025)