| Main Authors: | Pramanick, Shraman; Chellappa, Rama; Venugopalan, Subhashini |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.09413 |
Similar Items
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
by: Pramanick, Shraman, et al.
Published: (2023)
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
by: Pramanick, Shraman, et al.
Published: (2025)
ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
by: Wei, Guoyizhe, et al.
Published: (2025)
DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning
by: Yilmaz, Abdurrahim, et al.
Published: (2026)
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering
by: Wang, Haibo, et al.
Published: (2024)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding
by: Li, Zekun, et al.
Published: (2024)
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
by: Pang, Wei, et al.
Published: (2025)
ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering
by: Wu, Yifan, et al.
Published: (2024)
Synthetic Document Question Answering in Hungarian
by: Li, Jonathan, et al.
Published: (2025)
RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
by: Butsanets, Léo, et al.
Published: (2025)
MMToM-QA: Multimodal Theory of Mind Question Answering
by: Jin, Chuanyang, et al.
Published: (2024)
Hallucination Benchmark in Medical Visual Question Answering
by: Wu, Jinge, et al.
Published: (2024)
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
by: Cocchi, Federico, et al.
Published: (2024)
SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
by: Kim, Wonjoong, et al.
Published: (2024)
SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation
by: Kendre, Shrikant, et al.
Published: (2025)
MapIQ: Evaluating Multimodal Large Language Models for Map Question Answering
by: Srivastava, Varun, et al.
Published: (2025)
Top-down Activity Representation Learning for Video Question Answering
by: Wang, Yanan, et al.
Published: (2024)
LOVA3: Learning to Visual Question Answering, Asking and Assessment
by: Zhao, Henry Hengyuan, et al.
Published: (2024)
DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images
by: Yim, Wen-wai, et al.
Published: (2025)
SciEGQA: A Dataset for Scientific Evidence-Grounded Question Answering and Reasoning
by: Yu, Wenhan, et al.
Published: (2025)
Multi-object event graph representation learning for Video Question Answering
by: Wang, Yanan, et al.
Published: (2024)
Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering
by: Fu, Xingyu, et al.
Published: (2023)
DiffRegCD: Integrated Registration and Change Detection with Diffusion Features
by: Madani, Seyedehanita, et al.
Published: (2025)
Scientific Reasoning: Assessment of Multimodal Generative LLMs
by: Dreyer, Florian, et al.
Published: (2025)
PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
by: Saxena, Rohit, et al.
Published: (2025)
MediFact at MEDIQA-M3G 2024: Medical Question Answering in Dermatology with Multimodal Learning
by: Saeed, Nadia
Published: (2024)
Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations
by: Yeh, Yahsin, et al.
Published: (2025)
UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
by: Ghosh, Shiv, et al.
Published: (2026)
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
by: Winata, Genta Indra, et al.
Published: (2024)
Exploring Diverse Methods in Visual Question Answering
by: Li, Panfeng, et al.
Published: (2024)
SciMDR: Advancing Scientific Multimodal Document Reasoning
by: Chen, Ziyu, et al.
Published: (2026)
QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering
by: Jiang, Zhuohang, et al.
Published: (2025)
Federated Document Visual Question Answering: A Pilot Study
by: Nguyen, Khanh, et al.
Published: (2024)
What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
by: Liu, Dongqi, et al.
Published: (2025)
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
by: Compagnoni, Alberto, et al.
Published: (2025)
POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
by: Xu, Yichen, et al.
Published: (2025)
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models
by: Han, Wei, et al.
Published: (2023)
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
by: Parmar, Paritosh, et al.
Published: (2024)
ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
by: Kaur, Rachneet, et al.
Published: (2025)
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
by: Romero, David, et al.
Published: (2024)