Main Authors: Movva, Prahitha; Marupaka, Naga Harshita
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2507.06183
author Movva, Prahitha
Marupaka, Naga Harshita
author_facet Movva, Prahitha
Marupaka, Naga Harshita
contents Technical reports and articles often contain valuable information in semi-structured forms such as charts and figures. Interpreting these and using the information they convey is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, performing multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. We also developed an ensemble of multiple vision language models (VLMs). Error analysis on the validation split showed that the ensemble improved performance over most individual models, though InternVL3 remained the strongest standalone performer. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning, and ensemble modeling in improving model performance on visual question answering.
format Preprint
id arxiv_https___arxiv_org_abs_2507_06183
institution arXiv
publishDate 2025
record_format arxiv
title Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2507.06183