| Main Authors: | Mo, Wentao, Liu, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.15933 |
Similar Items
Find The Gap: Knowledge Base Reasoning For Visual Question Answering
by: Barezi, Elham J., et al.
Published: (2024)
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
by: Huang, Chengyue, et al.
Published: (2025)
Exploring Diverse Methods in Visual Question Answering
by: Li, Panfeng, et al.
Published: (2024)
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering
by: Jiang, Bowen, et al.
Published: (2024)
Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs
by: Mo, Wentao, et al.
Published: (2026)
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
by: Romero, David, et al.
Published: (2024)
Federated Document Visual Question Answering: A Pilot Study
by: Nguyen, Khanh, et al.
Published: (2024)
ProtoVQA: An Adaptable Prototypical Framework for Explainable Fine-Grained Visual Question Answering
by: Diao, Xingjian, et al.
Published: (2025)
Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering
by: Gupta, Akash, et al.
Published: (2025)
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
by: Feng, Chun, et al.
Published: (2024)
Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering
by: Lagos, Maximiliano Hormazábal, et al.
Published: (2025)
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
by: Man, Yunze, et al.
Published: (2024)
UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
by: Ghosh, Shiv, et al.
Published: (2026)
VQA-Levels: A Hierarchical Approach for Classifying Questions in VQA
by: Madaka, Madhuri Latha, et al.
Published: (2025)
Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering
by: Guo, Danfeng, et al.
Published: (2024)
MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors
by: Tang, Yuan, et al.
Published: (2024)
TinyVQA: Compact Multimodal Deep Neural Network for Visual Question Answering on Resource-Constrained Devices
by: Rashid, Hasib-Al, et al.
Published: (2024)
Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations
by: Yeh, Yahsin, et al.
Published: (2025)
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
by: Sinha, Neelabh, et al.
Published: (2024)
PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science
by: Sakib, Syed Nazmus, et al.
Published: (2025)
MediFact at MEDIQA-M3G 2024: Medical Question Answering in Dermatology with Multimodal Learning
by: Saeed, Nadia
Published: (2024)
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
by: Min, Juhong, et al.
Published: (2024)
Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis
by: Fan, Lin, et al.
Published: (2024)
MMToM-QA: Multimodal Theory of Mind Question Answering
by: Jin, Chuanyang, et al.
Published: (2024)
BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence
by: Lin, Xuewu, et al.
Published: (2024)
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
by: Parmar, Paritosh, et al.
Published: (2024)
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
by: Yang, Jianing, et al.
Published: (2024)
An Embodied Generalist Agent in 3D World
by: Huang, Jiangyong, et al.
Published: (2023)
Language-Image Models with 3D Understanding
by: Cho, Jang Hyun, et al.
Published: (2024)
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
by: Singh, Shubhankar, et al.
Published: (2024)
MapIQ: Evaluating Multimodal Large Language Models for Map Question Answering
by: Srivastava, Varun, et al.
Published: (2025)
Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
by: Dong, Junnan, et al.
Published: (2024)
RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
by: Butsanets, Léo, et al.
Published: (2025)
Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation
by: Li, Kailing, et al.
Published: (2026)
BERT-VQA: Visual Question Answering on Plots
by: Vu, Tai, et al.
Published: (2025)
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering
by: Shang, Chuyi, et al.
Published: (2024)
Situational Awareness Matters in 3D Vision Language Reasoning
by: Man, Yunze, et al.
Published: (2024)
Bridging the Gap Between Multimodal Foundation Models and World Models
by: He, Xuehai
Published: (2025)
Visual Question Decomposition on Multimodal Large Language Models
by: Zhang, Haowei, et al.
Published: (2024)
Improving Automatic VQA Evaluation Using Large Language Models
by: Mañas, Oscar, et al.
Published: (2023)