Bibliographic Details
Main Authors: Wang, Kesen; Toibazar, Daulet; Alfulayt, Abdulrahman; Albadawi, Abdulaziz S.; Alkahtani, Ranya A.; Ibrahim, Asma A.; Alhomoud, Haneen A.; Mohamed, Sherif; Moreno, Pedro J.
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2507.20145
Table of Contents:
  • Document Understanding (DU) in long-context scenarios with complex layouts remains a significant challenge in vision-language research. Although Large Vision-Language Models (LVLMs) excel at short-context DU tasks, their performance declines in long-context settings. A key limitation is the scarcity of fine-grained training data, particularly for low-resource languages such as Arabic. Existing state-of-the-art techniques rely heavily on human annotation, which is costly and inefficient. We propose a fully automated, multi-agent interactive framework to generate long-context questions efficiently. Our approach generates high-quality single- and multi-page questions for extensive English and Arabic documents, covering hundreds of pages across diverse domains. This facilitates the development of LVLMs with enhanced long-context understanding ability. Experimental results show that our generated English and Arabic questions (AraEngLongBench) are challenging for major open- and closed-source LVLMs. The code and data are available at https://github.com/wangk0b/Multi_Agentic_QA_Long_Doc.git. Sample Question and Answer (QA) pairs and structured system prompts can be found in the Appendix.