| Main Authors: | Suissa, Omri, Ali, Muhiim, Azarbal, Ariana, Shen, Hui, Pradhan, Shekhar |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.13021 |
Similar Items
Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision
by: Suissa, Omri, et al.
Published: (2025)
Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time
by: Berger, Uri, et al.
Published: (2025)
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
by: Beňová, Ivana, et al.
Published: (2024)
Generalised Medical Phrase Grounding
by: Zhang, Wenjun, et al.
Published: (2025)
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis
by: Berger, Uri, et al.
Published: (2024)
Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering
by: Yao, Louie Hong, et al.
Published: (2025)
FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
by: Tang, Zihan, et al.
Published: (2026)
Token Sequence Compression for Efficient Multimodal Computing
by: Omri, Yasmine, et al.
Published: (2025)
Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning
by: Wei, Xinyu, et al.
Published: (2025)
FlairGPT: Repurposing LLMs for Interior Designs
by: Littlefair, Gabrielle, et al.
Published: (2025)
GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
by: Omri, Yasmine, et al.
Published: (2025)
$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
by: Fan, Yingqi, et al.
Published: (2025)
Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching
by: Chen, Wenjing
Published: (2024)
Unveiling the Invisible: Captioning Videos with Metaphors
by: Kalarani, Abisek Rajakumar, et al.
Published: (2024)
Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey
by: Kuang, Jiayi, et al.
Published: (2024)
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
by: Xing, Long, et al.
Published: (2025)
CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration
by: Kasaei, Seyed Amir, et al.
Published: (2025)
Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models
by: Wang, Zehao, et al.
Published: (2024)
Few-shot Adaptation to Distribution Shifts By Mixing Source and Target Embeddings
by: Xue, Yihao, et al.
Published: (2023)
RAVEL: Rare Concept Generation and Editing via Graph-driven Relational Guidance
by: Venkatesh, Kavana, et al.
Published: (2024)
MIEB: Massive Image Embedding Benchmark
by: Xiao, Chenghao, et al.
Published: (2025)
Lost in Embeddings: Information Loss in Vision-Language Models
by: Li, Wenyan, et al.
Published: (2025)
Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation
by: Zhang, Xiaofeng, et al.
Published: (2024)
brat: Aligned Multi-View Embeddings for Brain MRI Analysis
by: Kayser, Maxime, et al.
Published: (2025)
Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention
by: Zhang, Zhendong
Published: (2025)
Inference Compute-Optimal Video Vision Language Models
by: Wang, Peiqi, et al.
Published: (2025)
KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution
by: Zhang, Junzhe, et al.
Published: (2025)
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
by: Shin, Philip Wootaek, et al.
Published: (2026)
MatFormer: Nested Transformer for Elastic Inference
by: Devvrit, et al.
Published: (2023)
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
by: Wang, Shan, et al.
Published: (2025)
FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding
by: Bourigault, Emmanuelle, et al.
Published: (2025)
Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
by: Agrawal, Aakriti, et al.
Published: (2025)
DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings
by: Liu, Hong, et al.
Published: (2024)
Natural Language Inference Improves Compositionality in Vision-Language Models
by: Cascante-Bonilla, Paola, et al.
Published: (2024)
Inference-Time Structural Reasoning for Compositional Vision-Language Understanding
by: Bhattacharya, Amartya
Published: (2026)
Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning
by: Li, Siwei, et al.
Published: (2024)
REBEL: Reinforcement Learning via Regressing Relative Rewards
by: Gao, Zhaolin, et al.
Published: (2024)
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
by: Meng, Rui, et al.
Published: (2025)
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
by: Vo, Khang H. N., et al.
Published: (2025)
Improving Arabic Multi-Label Emotion Classification using Stacked Embeddings and Hybrid Loss Function
by: Aslam, Muhammad Azeem, et al.
Published: (2024)