| Main Authors: | Singh, Akshit, Marjit, Shyam, Lin, Wei, Gavrikov, Paul, Yeung-Levy, Serena, Kuehne, Hilde, Feris, Rogerio, Doveh, Sivan, Glass, James, Mirza, M. Jehanzeb |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.06783 |
Similar Items
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
by: Gavrikov, Paul, et al.
Published: (2025)
CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models
by: Mehta, Videet, et al.
Published: (2026)
PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
by: Selch, Lukas, et al.
Published: (2025)
Teaching VLMs to Localize Specific Objects from In-context Examples
by: Doveh, Sivan, et al.
Published: (2024)
Towards Audio Token Compression in Large Audio Language Models
by: Bhati, Saurabhchand, et al.
Published: (2025)
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
by: Araujo, Edson, et al.
Published: (2026)
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
by: Mirza, M. Jehanzeb, et al.
Published: (2024)
TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning
by: Jahagirdar, Soumya Shamarao, et al.
Published: (2026)
Comparison Visual Instruction Tuning
by: Lin, Wei, et al.
Published: (2024)
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
by: Rouditchenko, Andrew, et al.
Published: (2025)
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
by: Mirza, M. Jehanzeb, et al.
Published: (2024)
State-Space Large Audio Language Models
by: Bhati, Saurabhchand, et al.
Published: (2024)
DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners
by: Bhati, Saurabhchand, et al.
Published: (2024)
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
by: Huang, Irene, et al.
Published: (2024)
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
by: Rouditchenko, Andrew, et al.
Published: (2024)
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
by: Rouditchenko, Andrew, et al.
Published: (2025)
Towards Multimodal In-Context Learning for Vision & Language Models
by: Doveh, Sivan, et al.
Published: (2024)
Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation
by: Spoecklberger, Johannes, et al.
Published: (2025)
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
by: Hansen, Jacob, et al.
Published: (2025)
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
by: Chen, Brian, et al.
Published: (2023)
Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models
by: Sui, Elaine, et al.
Published: (2024)
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
by: Araujo, Edson, et al.
Published: (2025)
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
by: Bousselham, Walid, et al.
Published: (2025)
When LLaVA Meets Objects: Token Composition for Vision-Language-Models
by: Jahagirdar, Soumya, et al.
Published: (2026)
DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
by: Bousselham, Walid, et al.
Published: (2026)
Zero-shot Action Localization via the Confidence of Large Vision-Language Models
by: Aklilu, Josiah, et al.
Published: (2024)
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
by: Endo, Mark, et al.
Published: (2024)
Tool Verification for Test-Time Reinforcement Learning
by: Liao, Ruotong, et al.
Published: (2026)
TimeLogic: A Temporal Logic Benchmark for Video QA
by: Swetha, Sirnam, et al.
Published: (2025)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
by: Shabtay, Nimrod, et al.
Published: (2024)
Can We Talk Models Into Seeing the World Differently?
by: Gavrikov, Paul, et al.
Published: (2024)
FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation
by: Yashwanth, M, et al.
Published: (2025)
NegVQA: Can Vision Language Models Understand Negation?
by: Zhang, Yuhui, et al.
Published: (2025)
TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models
by: Weijler, Lisa, et al.
Published: (2024)
LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models
by: Pathak, Priyank, et al.
Published: (2025)
CLIPDraw++: Text-to-Sketch Synthesis with Simple Primitives
by: Mathur, Nityanand, et al.
Published: (2023)
O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model
by: Gupta, Rishi, et al.
Published: (2025)
Overflow Prevention Enhances Long-Context Recurrent LLMs
by: Ben-Kish, Assaf, et al.
Published: (2025)
$\texttt{BATCLIP}$: Bimodal Online Test-Time Adaptation for CLIP
by: Maharana, Sarthak Kumar, et al.
Published: (2024)
DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Models
by: Marjit, Shyam, et al.
Published: (2024)