Bibliographic Details
Main Authors:	Tilak, Advait; Choi, Jiwon; Mouli, Nazifa; Le, Wei
Format:	Preprint
Published:	2026
Subjects:	Multimedia; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.00873

_version_	1866913082586955776
author	Tilak, Advait; Choi, Jiwon; Mouli, Nazifa; Le, Wei
author_facet	Tilak, Advait; Choi, Jiwon; Mouli, Nazifa; Le, Wei
contents	The rapid advancement of photorealistic Text-to-Video (T2V) generation creates an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlook implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they exhibit significant degradation in object-action binding and audio-visual synchronization. BRITE offers the community a reliable, interpretable benchmark and evaluation framework that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_00873
institution	arXiv
publishDate	2026
record_format	arxiv
title	BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios
topic	Multimedia; Artificial Intelligence; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.00873
