| Main Authors: | Mirza, M. Jehanzeb, Zhao, Mengjie, Mao, Zhuoyuan, Doveh, Sivan, Lin, Wei, Gavrikov, Paul, Dorkenwald, Michael, Yang, Shiqi, Jha, Saurav, Wakaki, Hiromi, Mitsufuji, Yuki, Possegger, Horst, Feris, Rogerio, Karlinsky, Leonid, Glass, James |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.06154 |
Similar Items
Comparison Visual Instruction Tuning
by: Lin, Wei, et al.
Published: (2024)
TTRV: Test-Time Reinforcement Learning for Vision Language Models
by: Singh, Akshit, et al.
Published: (2025)
PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
by: Selch, Lukas, et al.
Published: (2025)
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
by: Mirza, M. Jehanzeb, et al.
Published: (2024)
Towards Multimodal In-Context Learning for Vision & Language Models
by: Doveh, Sivan, et al.
Published: (2024)
Teaching VLMs to Localize Specific Objects from In-context Examples
by: Doveh, Sivan, et al.
Published: (2024)
DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
by: Mao, Zhuoyuan, et al.
Published: (2025)
Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation
by: Spoecklberger, Johannes, et al.
Published: (2025)
State-Space Large Audio Language Models
by: Bhati, Saurabhchand, et al.
Published: (2024)
CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models
by: Mehta, Videet, et al.
Published: (2026)
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
by: Hansen, Jacob, et al.
Published: (2025)
Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models
by: Jha, Saurav, et al.
Published: (2024)
OpenMU: Your Swiss Army Knife for Music Understanding
by: Zhao, Mengjie, et al.
Published: (2024)
DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners
by: Bhati, Saurabhchand, et al.
Published: (2024)
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
by: Gavrikov, Paul, et al.
Published: (2025)
Cross-Modal Learning for Music-to-Music-Video Description Generation
by: Mao, Zhuoyuan, et al.
Published: (2025)
Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling
by: Jha, Saurav, et al.
Published: (2025)
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
by: Huang, Irene, et al.
Published: (2024)
Self-Specialization: Uncovering Latent Expertise within Large Language Models
by: Kang, Junmo, et al.
Published: (2023)
Latent Implicit Visual Reasoning
by: Li, Kelvin, et al.
Published: (2025)
Learning to Route Languages for Multilingual Policy Optimization
by: Guo, Geyang, et al.
Published: (2026)
CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory
by: He, Zexue, et al.
Published: (2024)
DiffuCOMET: Contextual Commonsense Knowledge Diffusion
by: Gao, Silin, et al.
Published: (2024)
MAEDAY: MAE for few and zero shot AnomalY-Detection
by: Schwartz, Eli, et al.
Published: (2022)
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
by: Kang, Junmo, et al.
Published: (2024)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
by: Shabtay, Nimrod, et al.
Published: (2024)
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
by: Rouditchenko, Andrew, et al.
Published: (2024)
NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning
by: Schwartz, Eli, et al.
Published: (2024)
Overflow Prevention Enhances Long-Context Recurrent LLMs
by: Ben-Kish, Assaf, et al.
Published: (2025)
Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems
by: Borazjanizadeh, Nasim, et al.
Published: (2024)
Into the Fog: Evaluating Robustness of Multiple Object Tracking
by: Kirillova, Nadezda, et al.
Published: (2024)
Towards Audio Token Compression in Large Audio Language Models
by: Bhati, Saurabhchand, et al.
Published: (2025)
Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion
by: Hayakawa, Satoshi, et al.
Published: (2025)
Distillation of Discrete Diffusion through Dimensional Correlations
by: Hayakawa, Satoshi, et al.
Published: (2024)
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
by: Araujo, Edson, et al.
Published: (2026)
KV Cache Steering for Controlling Frozen LLMs
by: Belitsky, Max, et al.
Published: (2025)
$\texttt{BATCLIP}$: Bimodal Online Test-Time Adaptation for CLIP
by: Maharana, Sarthak Kumar, et al.
Published: (2024)
ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark
by: Wakaki, Hiromi, et al.
Published: (2024)
Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs
by: Borazjanizadeh, Nasim, et al.
Published: (2025)
Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features
by: Mitra, Chancharik, et al.
Published: (2024)