Bibliographic Details
Main Authors: Padlewski, Piotr, Bain, Max, Henderson, Matthew, Zhu, Zhongkai, Relan, Nishant, Pham, Hai, Ong, Donovan, Aleksiev, Kaloyan, Ormazabal, Aitor, Phua, Samuel, Yeo, Ethan, Lamprecht, Eugenie, Liu, Qi, Wang, Yuqi, Chen, Eric, Fu, Deyu, Li, Lei, Zheng, Che, d'Autume, Cyprien de Masson, Yogatama, Dani, Artetxe, Mikel, Tay, Yi
Format: Preprint
Published: 2024
Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2405.02287
author Padlewski, Piotr
Bain, Max
Henderson, Matthew
Zhu, Zhongkai
Relan, Nishant
Pham, Hai
Ong, Donovan
Aleksiev, Kaloyan
Ormazabal, Aitor
Phua, Samuel
Yeo, Ethan
Lamprecht, Eugenie
Liu, Qi
Wang, Yuqi
Chen, Eric
Fu, Deyu
Li, Lei
Zheng, Che
d'Autume, Cyprien de Masson
Yogatama, Dani
Artetxe, Mikel
Tay, Yi
contents We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging, with dual objectives: (i) vibe-checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, more than 50% of the questions in our hard set are answered incorrectly by all frontier models. We explore the nuances of designing, evaluating, and ranking models on ultra-challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates with human judgment. We offer free API access for the purpose of lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on Vibe-Eval's automatic scores. We release the evaluation code and data; see https://github.com/reka-ai/reka-vibe-eval
format Preprint
id arxiv_https___arxiv_org_abs_2405_02287
institution arXiv
publishDate 2024
record_format arxiv
title Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
topic Computation and Language
Artificial Intelligence
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2405.02287