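Staff View: GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond :: Library Catalog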

Bibliographic Details
Main Authors:	Zheng, Shen; Zhang, Yuyu; Zhu, Yijie; Xi, Chenguang; Gao, Pengyang; Zhou, Xun; Chang, Kevin Chen-Chuan
Format:	Preprint
Published:	2023
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2309.16583

_version_	1866914737801920512
author	Zheng, Shen; Zhang, Yuyu; Zhu, Yijie; Xi, Chenguang; Gao, Pengyang; Zhou, Xun; Chang, Kevin Chen-Chuan
author_facet	Zheng, Shen; Zhang, Yuyu; Zhu, Yijie; Xi, Chenguang; Gao, Pengyang; Zhou, Xun; Chang, Kevin Chen-Chuan
contents	With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improved into GPT-4, including technical details such as whether adding code data improves an LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, and how large the alignment tax is. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2309_16583
institution	arXiv
publishDate	2023
record_format	arxiv
title	GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
topic	Computation and Language
url	https://arxiv.org/abs/2309.16583

Similar Items