Staff View: Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Tong; Markchom, Thanet
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.03073

_version_	1866918274131820544
author	Wu, Tong; Markchom, Thanet
author_facet	Wu, Tong; Markchom, Thanet
contents	Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative-driven context, which are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, a multi-agent LLM framework is introduced, specifically designed for VQA tasks in cartoon imagery. The proposed architecture consists of three specialised agents: visual agent, language agent and critic agent, which work collaboratively to support structured reasoning by integrating visual cues and narrative context. The framework was systematically evaluated on two cartoon-based VQA datasets: Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA and multimodal inference.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_03073
institution	arXiv
publishDate	2026
record_format	arxiv
title	Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.03073