Bibliographic Details
Main Authors: Xue, Siqiao, Li, Xiaojing, Zhou, Fan, Dai, Qingyang, Chu, Zhixuan, Mei, Hongyuan
Format: Preprint
Published: 2024
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2410.04526
author Xue, Siqiao
Li, Xiaojing
Zhou, Fan
Dai, Qingyang
Chu, Zhixuan
Mei, Hongyuan
contents In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). The benchmark evaluates the ability of large language models (LLMs) to answer complex reasoning questions that require advanced financial knowledge. It has two versions: FAMMA-Basic consists of 1,945 questions extracted from university textbooks and exams, along with human-annotated answers and rationales; FAMMA-LivePro consists of 103 novel questions created by human domain experts, with answers and rationales held out from the public for contamination-free evaluation. The questions cover advanced knowledge in 8 major subfields of finance (e.g., corporate finance, derivatives, and portfolio management). Most are in English, while some are in Chinese or French. Each question includes non-text data such as charts, diagrams, or tables. Our experiments reveal that FAMMA poses a significant challenge to LLMs, including reasoning models such as GPT-o1 and DeepSeek-R1. Additionally, we curated 1,270 reasoning trajectories of DeepSeek-R1 on the FAMMA-Basic data and fine-tuned a series of open-source Qwen models on them. We found that training a model on these reasoning trajectories can significantly improve its performance on FAMMA-LivePro. We released our leaderboard, data, code, and trained models at https://famma-bench.github.io/famma/.
format Preprint
id arxiv_https___arxiv_org_abs_2410_04526
institution arXiv
publishDate 2024
record_format arxiv
title FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2410.04526