Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Gan, Ziliang; Lu, Yu; Zhang, Dong; Li, Haohan; Liu, Che; Liu, Jian; Liu, Ji; Wu, Haipang; Fu, Chaoyou; Xu, Zenglin; Zhang, Rongjunchen; Dai, Yong
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2411.03314

_version_	1866910686096916480
author	Gan, Ziliang; Lu, Yu; Zhang, Dong; Li, Haohan; Liu, Che; Liu, Jian; Liu, Ji; Wu, Haipang; Fu, Chaoyou; Xu, Zenglin; Zhang, Rongjunchen; Dai, Yong
contents	In recent years, multimodal benchmarks for general domains have guided the rapid development of multimodal models on general tasks. However, the financial field has its own peculiarities: it features distinctive graphical images (e.g., candlestick charts and technical indicator charts) and relies on a wealth of specialized financial knowledge (e.g., futures and turnover rate). As a result, general-domain benchmarks often fail to measure the performance of multimodal models in the financial domain and therefore cannot effectively guide the development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, a bilingual, open-ended, and practical usage-oriented Visual Question Answering (VQA) benchmark. The benchmark is characterized by its financial focus and expertise: its charts reflect users' actual usage scenarios (e.g., computer screenshots and mobile photography), its questions follow the preferences of real financial-domain inquiries, and its questions are annotated by experts with more than 10 years of experience in the financial industry. Additionally, we develop a custom-designed financial evaluation system in which visual information is, for the first time, introduced into the multimodal evaluation process. We conduct extensive experimental evaluations of 19 mainstream MLLMs to test their perception, reasoning, and cognition capabilities. The results indicate that models performing well on general benchmarks do not perform well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. Their performance is particularly poor in the categories most relevant to finance, such as candlestick charts and technical indicator charts. In addition, we propose a Chinese version of the benchmark, which helps compare the performance of MLLMs in a Chinese context.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_03314
institution	arXiv
publishDate	2024
record_format	arxiv
title	MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
topic	Computer Vision and Pattern Recognition; Computation and Language
url	https://arxiv.org/abs/2411.03314