Bibliographic Details
Main Authors: Li, Wenyan; Zhang, Xinyu; Li, Jiaang; Peng, Qiwei; Tang, Raphael; Zhou, Li; Zhang, Weijia; Hu, Guimin; Yuan, Yifei; Søgaard, Anders; Hershcovich, Daniel; Elliott, Desmond
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2406.11030
_version_ 1866929521369808896
author Li, Wenyan
Zhang, Xinyu
Li, Jiaang
Peng, Qiwei
Tang, Raphael
Zhou, Li
Zhang, Weijia
Hu, Guimin
Yuan, Yifei
Søgaard, Anders
Hershcovich, Daniel
Elliott, Desmond
contents Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions. FoodieQA comprises three multiple-choice question-answering tasks where models need to answer questions based on multiple images, a single image, and text-only descriptions, respectively. While LLMs excel at text-based question answering, surpassing human accuracy, the open-sourced VLMs still fall short by 41% on multi-image and 21% on single-image VQA tasks, although closed-weights models perform closer to human levels (within 10%). Our findings highlight that understanding food and its cultural implications remains a challenging and under-explored direction.
format Preprint
id arxiv_https___arxiv_org_abs_2406_11030
institution arXiv
publishDate 2024
record_format arxiv
title FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture
topic Computation and Language
url https://arxiv.org/abs/2406.11030