MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Yi, Han; Pan, Yulu; He, Feihong; Liu, Xinyu; Zhang, Benjamin; Oguntola, Oluwatumininu; Bertasius, Gedas
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2506.06277

_version_	1866917138273402880
author	Yi, Han; Pan, Yulu; He, Feihong; Liu, Xinyu; Zhang, Benjamin; Oguntola, Oluwatumininu; Bertasius, Gedas
author_facet	Yi, Han; Pan, Yulu; He, Feihong; Liu, Xinyu; Zhang, Benjamin; Oguntola, Oluwatumininu; Bertasius, Gedas
contents	We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing GPT-4o model achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at https://texaser.github.io/exact_project_page/
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_06277
institution	arXiv
publishDate	2025
record_format	arxiv
title	ExAct: A Video-Language Benchmark for Expert Action Analysis
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2506.06277
