Bibliographic Details
Main Authors: Tilak, Advait; Choi, Jiwon; Mouli, Nazifa; Le, Wei
Format: Preprint
Published: 2026
Subjects: Multimedia; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.00873
author Tilak, Advait
Choi, Jiwon
Mouli, Nazifa
Le, Wei
contents The rapid advancement of photorealistic Text-to-Video (T2V) generation creates an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlook implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they exhibit significant degradation in object-action binding and audio-visual synchronization. Our framework offers the community a reliable, interpretable benchmark that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts.
format Preprint
id arxiv_https___arxiv_org_abs_2605_00873
institution arXiv
publishDate 2026
record_format arxiv
title BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios
topic Multimedia
Artificial Intelligence
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.00873