Staff View: Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models :: Library Catalog

Bibliographic Details
Main Authors:	Padlewski, Piotr; Bain, Max; Henderson, Matthew; Zhu, Zhongkai; Relan, Nishant; Pham, Hai; Ong, Donovan; Aleksiev, Kaloyan; Ormazabal, Aitor; Phua, Samuel; Yeo, Ethan; Lamprecht, Eugenie; Liu, Qi; Wang, Yuqi; Chen, Eric; Fu, Deyu; Li, Lei; Zheng, Che; d'Autume, Cyprien de Masson; Yogatama, Dani; Artetxe, Mikel; Tay, Yi
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2405.02287

_version_	1866929334866935808
author	Padlewski, Piotr; Bain, Max; Henderson, Matthew; Zhu, Zhongkai; Relan, Nishant; Pham, Hai; Ong, Donovan; Aleksiev, Kaloyan; Ormazabal, Aitor; Phua, Samuel; Yeo, Ethan; Lamprecht, Eugenie; Liu, Qi; Wang, Yuqi; Chen, Eric; Fu, Deyu; Li, Lei; Zheng, Che; d'Autume, Cyprien de Masson; Yogatama, Dani; Artetxe, Mikel; Tay, Yi
author_facet	Padlewski, Piotr; Bain, Max; Henderson, Matthew; Zhu, Zhongkai; Relan, Nishant; Pham, Hai; Ong, Donovan; Aleksiev, Kaloyan; Ormazabal, Aitor; Phua, Samuel; Yeo, Ethan; Lamprecht, Eugenie; Liu, Qi; Wang, Yuqi; Chen, Eric; Fu, Deyu; Li, Lei; Zheng, Che; d'Autume, Cyprien de Masson; Yogatama, Dani; Artetxe, Mikel; Tay, Yi
contents	We introduce Vibe-Eval, a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging, with dual objectives: (i) vibe-checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, more than 50% of the questions in our hard set are answered incorrectly by all frontier models. We explore the nuances of designing, evaluating, and ranking models on ultra-challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates with human judgment. We offer free API access for the purpose of lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on Vibe-Eval's automatic scores. We release the evaluation code and data; see https://github.com/reka-ai/reka-vibe-eval
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_02287
institution	arXiv
publishDate	2024
record_format	arxiv
title	Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
topic	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2405.02287

Similar Items