| Main Authors: | Montes, Tony; Lozano, Fernando |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition; Computation and Language; I.4.8 |
| Online Access: | https://arxiv.org/abs/2505.15928 |
| _version_ | 1866915297713192960 |
|---|---|
| author | Montes, Tony; Lozano, Fernando |
| author_facet | Montes, Tony; Lozano, Fernando |
| contents | Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making, both of which help align object references with language model outputs, as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state of the art in VideoQA and Video Understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at https://github.com/t-montes/viqagent. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2505_15928 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation |
| topic | Computer Vision and Pattern Recognition; Computation and Language; I.4.8 |
| url | https://arxiv.org/abs/2505.15928 |