:: Library Catalog

Bibliographic Details
Main Authors:	Singh, Akshit; Marjit, Shyam; Lin, Wei; Gavrikov, Paul; Yeung-Levy, Serena; Kuehne, Hilde; Feris, Rogerio; Doveh, Sivan; Glass, James; Mirza, M. Jehanzeb
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2510.06783

Similar Items

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
by: Gavrikov, Paul, et al.
Published: (2025)

CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models
by: Mehta, Videet, et al.
Published: (2026)

PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
by: Selch, Lukas, et al.
Published: (2025)

Teaching VLMs to Localize Specific Objects from In-context Examples
by: Doveh, Sivan, et al.
Published: (2024)

Towards Audio Token Compression in Large Audio Language Models
by: Bhati, Saurabhchand, et al.
Published: (2025)

AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
by: Araujo, Edson, et al.
Published: (2026)

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
by: Mirza, M. Jehanzeb, et al.
Published: (2024)

TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning
by: Jahagirdar, Soumya Shamarao, et al.
Published: (2026)

Comparison Visual Instruction Tuning
by: Lin, Wei, et al.
Published: (2024)

mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
by: Rouditchenko, Andrew, et al.
Published: (2025)

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
by: Mirza, M. Jehanzeb, et al.
Published: (2024)

State-Space Large Audio Language Models
by: Bhati, Saurabhchand, et al.
Published: (2024)

DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners
by: Bhati, Saurabhchand, et al.
Published: (2024)

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
by: Huang, Irene, et al.
Published: (2024)

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
by: Rouditchenko, Andrew, et al.
Published: (2024)

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
by: Rouditchenko, Andrew, et al.
Published: (2025)

Towards Multimodal In-Context Learning for Vision & Language Models
by: Doveh, Sivan, et al.
Published: (2024)

Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation
by: Spoecklberger, Johannes, et al.
Published: (2025)

Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
by: Hansen, Jacob, et al.
Published: (2025)

What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
by: Chen, Brian, et al.
Published: (2023)

Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models
by: Sui, Elaine, et al.
Published: (2024)

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
by: Araujo, Edson, et al.
Published: (2025)

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
by: Bousselham, Walid, et al.
Published: (2025)

When LLaVA Meets Objects: Token Composition for Vision-Language-Models
by: Jahagirdar, Soumya, et al.
Published: (2026)

DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
by: Bousselham, Walid, et al.
Published: (2026)

Zero-shot Action Localization via the Confidence of Large Vision-Language Models
by: Aklilu, Josiah, et al.
Published: (2024)

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
by: Endo, Mark, et al.
Published: (2024)

Tool Verification for Test-Time Reinforcement Learning
by: Liao, Ruotong, et al.
Published: (2026)

TimeLogic: A Temporal Logic Benchmark for Video QA
by: Swetha, Sirnam, et al.
Published: (2025)

LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
by: Shabtay, Nimrod, et al.
Published: (2024)

Can We Talk Models Into Seeing the World Differently?
by: Gavrikov, Paul, et al.
Published: (2024)

FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation
by: Yashwanth, M, et al.
Published: (2025)

NegVQA: Can Vision Language Models Understand Negation?
by: Zhang, Yuhui, et al.
Published: (2025)

TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models
by: Weijler, Lisa, et al.
Published: (2024)

LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models
by: Pathak, Priyank, et al.
Published: (2025)

CLIPDraw++: Text-to-Sketch Synthesis with Simple Primitives
by: Mathur, Nityanand, et al.
Published: (2023)

O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model
by: Gupta, Rishi, et al.
Published: (2025)

Overflow Prevention Enhances Long-Context Recurrent LLMs
by: Ben-Kish, Assaf, et al.
Published: (2025)

BATCLIP: Bimodal Online Test-Time Adaptation for CLIP
by: Maharana, Sarthak Kumar, et al.
Published: (2024)

DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Models
by: Marjit, Shyam, et al.
Published: (2024)