Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Movva, Prahitha, Marupaka, Naga Harshita
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.06183
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866911045682987008
author	Movva, Prahitha Marupaka, Naga Harshita
author_facet	Movva, Prahitha Marupaka, Naga Harshita
contents	Technical reports and articles often contain valuable information in the form of semi-structured data like charts, and figures. Interpreting these and using the information from them is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of \textbf{0.740} and a BERTScore of \textbf{0.983} on the SciVQA test split. We also developed an ensemble model with multiple vision language models (VLMs). Through error analysis on the validation split, our ensemble approach improved performance compared to most individual models, though InternVL3 remained the strongest standalone performer. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning and ensemble modeling in improving the model's ability in visual question answering.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_06183
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling Movva, Prahitha Marupaka, Naga Harshita Computer Vision and Pattern Recognition Technical reports and articles often contain valuable information in the form of semi-structured data like charts, and figures. Interpreting these and using the information from them is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of \textbf{0.740} and a BERTScore of \textbf{0.983} on the SciVQA test split. We also developed an ensemble model with multiple vision language models (VLMs). Through error analysis on the validation split, our ensemble approach improved performance compared to most individual models, though InternVL3 remained the strongest standalone performer. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning and ensemble modeling in improving the model's ability in visual question answering.
title	Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2507.06183

Similar Items