| Main Authors: | Tilak, Advait; Choi, Jiwon; Mouli, Nazifa; Le, Wei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Multimedia; Artificial Intelligence; Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2605.00873 |
| _version_ | 1866913082586955776 |
|---|---|
| author | Tilak, Advait; Choi, Jiwon; Mouli, Nazifa; Le, Wei |
| author_facet | Tilak, Advait; Choi, Jiwon; Mouli, Nazifa; Le, Wei |
| contents | The rapid advancement of photorealistic Text-to-Video (T2V) generation creates an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlook implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they exhibit significant degradation in object-action binding and audio-visual synchronization. BRITE offers the community a reliable, interpretable benchmark and evaluation framework that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2605_00873 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios |
| topic | Multimedia; Artificial Intelligence; Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2605.00873 |