| Main Authors: | Peng, Wujian, Meng, Lingchen, Chen, Yitong, Xie, Yiweng, Liu, Yang, Gui, Tao, Xu, Hang, Qiu, Xipeng, Wu, Zuxuan, Jiang, Yu-Gang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.03565 |
Similar Items
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
by: Chen, Yitong, et al.
Published: (2025)
Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
by: Liu, Zhuohan, et al.
Published: (2026)
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
by: Chen, Yitong, et al.
Published: (2026)
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
by: Peng, Wujian, et al.
Published: (2023)
Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection
by: Chen, Yitong, et al.
Published: (2024)
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
by: Xie, Yiweng, et al.
Published: (2026)
MedINST: Meta Dataset of Biomedical Instructions
by: Han, Wenhan, et al.
Published: (2024)
Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue
by: Lin, Xingyao, et al.
Published: (2025)
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
by: Meng, Lingchen, et al.
Published: (2024)
FOCUS: Towards Universal Foreground Segmentation
by: You, Zuyao, et al.
Published: (2025)
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation
by: Meng, Lingchen, et al.
Published: (2023)
Visual Instance-aware Prompt Tuning
by: Xiao, Xi, et al.
Published: (2025)
ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection
by: Sun, Zhihao, et al.
Published: (2024)
Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning
by: Chen, Haoran, et al.
Published: (2025)
FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions
by: Li, Peng, et al.
Published: (2026)
Multi-Prompt Alignment for Multi-Source Unsupervised Domain Adaptation
by: Chen, Haoran, et al.
Published: (2022)
Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy
by: Gao, Shujian, et al.
Published: (2026)
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
by: Wang, Junke, et al.
Published: (2024)
PromptFusion: Decoupling Stability and Plasticity for Continual Learning
by: Chen, Haoran, et al.
Published: (2023)
Boosting Visual Instruction Tuning with Self-Supervised Guidance
by: Sirko-Galouchenko, Sophia, et al.
Published: (2026)
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
by: Zhou, Ziwei, et al.
Published: (2025)
Hint-before-Solving Prompting: Guiding LLMs to Effectively Utilize Encoded Knowledge
by: Fu, Jinlan, et al.
Published: (2024)
Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation
by: Chen, Haoran, et al.
Published: (2025)
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
by: Xing, Zhen, et al.
Published: (2024)
MouSi: Poly-Visual-Expert Vision-Language Models
by: Fan, Xiaoran, et al.
Published: (2024)
Osprey: Pixel Understanding with Visual Instruction Tuning
by: Yuan, Yuqian, et al.
Published: (2023)
The NavINST Dataset for Multi-Sensor Autonomous Navigation
by: de Araujo, Paulo Ricardo Marques, et al.
Published: (2025)
Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning
by: Wang, Qian-Wei, et al.
Published: (2026)
ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding
by: Zhong, Qihuang, et al.
Published: (2024)
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
by: Zhao, Hang, et al.
Published: (2024)
Unify Robot Actions in Camera Frame
by: Xie, Sicheng, et al.
Published: (2025)
Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning
by: Wu, Ruiqi, et al.
Published: (2024)
Boosting Adversarial Transferability with Low-Cost Optimization via Maximin Expected Flatness
by: Qiu, Chunlin, et al.
Published: (2024)
Embedded Visual Prompt Tuning
by: Zu, Wenqiang, et al.
Published: (2024)
NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models
by: Zhang, Jiaming, et al.
Published: (2025)
Boosting Private Domain Understanding of Efficient MLLMs: A Tuning-free, Adaptive, Universal Prompt Optimization Framework
by: Liu, Jiang, et al.
Published: (2024)
CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
by: Zhang, Hui, et al.
Published: (2024)
Explicit Multi-head Attention for Inter-head Interaction in Large Language Models
by: Peng, Runyu, et al.
Published: (2026)
Adversarial Prompt Tuning for Vision-Language Models
by: Zhang, Jiaming, et al.
Published: (2023)
Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning
by: Liu, Xixi, et al.
Published: (2026)