Staff View: Sci2Pol: Evaluating and Fine-tuning LLMs on Scientific-to-Policy Brief Generation :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Weimin; Furnas, Alexander C.; Yang, Eddie; Liu, Gefei; Akella, Akhil Pandey; Song, Xuefeng; Wang, Dashun; Liu, Han
Format:	Preprint
Published:	2025
Subjects:	Computational Engineering, Finance, and Science
Online Access:	https://arxiv.org/abs/2509.21493

_version_	1866914337962065920
author	Wu, Weimin; Furnas, Alexander C.; Yang, Eddie; Liu, Gefei; Akella, Akhil Pandey; Song, Xuefeng; Wang, Dashun; Liu, Han
author_facet	Wu, Weimin; Furnas, Alexander C.; Yang, Eddie; Liu, Gefei; Akella, Akhil Pandey; Song, Xuefeng; Wang, Dashun; Liu, Han
contents	We propose Sci2Pol-Bench and Sci2Pol-Corpus, the first benchmark and training dataset for evaluating and fine-tuning large language models (LLMs) on policy brief generation from a scientific paper. We build Sci2Pol-Bench on a five-stage taxonomy to mirror the human writing process: (i) Autocompletion, (ii) Understanding, (iii) Summarization, (iv) Generation, and (v) Verification. It features 18 tasks in multiple-choice and open-ended formats. Specifically, for the Generation stage, we show that BERTScore and ROUGE scores fail to capture the quality of brief writing, and introduce a new LLM-based evaluation metric aligned with expert judgement. Using this benchmark, we evaluate 13 leading open-source and commercial LLMs to uncover key limitations. To improve LLM performance on brief writing, we curate the Sci2Pol-Corpus for fine-tuning. We start by linking each cited scientific paper to its corresponding policy document, drawn from 5.6 million policy records. This produces 140,000 candidate pairs. We then employ an LLM-as-a-judge to filter high-quality examples, followed by in-context polishing using three expert-written samples as references. This process yields a final set of 639 new pairs. Finally, we fine-tune three models on Sci2Pol-Corpus: LLaMA-3.1-8B, Gemma-12B, and Gemma-27B. Fine-tuning leads to consistent performance improvements across Sci2Pol-Bench. Notably, after fine-tuning, Gemma-27B surpasses the much larger GPT-4o and DeepSeek-V3 (671B). These results demonstrate the effectiveness of our corpus in bridging the gap between science and policy.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_21493
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Sci2Pol: Evaluating and Fine-tuning LLMs on Scientific-to-Policy Brief Generation Wu, Weimin Furnas, Alexander C. Yang, Eddie Liu, Gefei Akella, Akhil Pandey Song, Xuefeng Wang, Dashun Liu, Han Computational Engineering, Finance, and Science We propose Sci2Pol-Bench and Sci2Pol-Corpus, the first benchmark and training dataset for evaluating and fine-tuning large language models (LLMs) on policy brief generation from a scientific paper. We build Sci2Pol-Bench on a five-stage taxonomy to mirror the human writing process: (i) Autocompletion, (ii) Understanding, (iii) Summarization, (iv) Generation, and (v) Verification. It features 18 tasks in multiple-choice and open-ended formats. Specifically, for the Generation stage, we show that BERTScore and ROUGE scores fail to capture the quality of brief writing, and introduce a new LLM-based evaluation metric aligned with expert judgement. Using this benchmark, we evaluate 13 leading open-source and commercial LLMs to uncover key limitations. To improve LLM performance on brief writing, we curate the Sci2Pol-Corpus for fine-tuning. We start by linking each cited scientific paper to its corresponding policy document, drawn from 5.6 million policy records. This produces 140,000 candidate pairs. We then employ an LLM-as-a-judge to filter high-quality examples, followed by in-context polishing using three expert-written samples as references. This process yields a final set of 639 new pairs. Finally, we fine-tune three models on Sci2Pol-Corpus: LLaMA-3.1-8B, Gemma-12B, and Gemma-27B. Fine-tuning leads to consistent performance improvements across Sci2Pol-Bench. Notably, after fine-tuning, Gemma-27B surpasses the much larger GPT-4o and DeepSeek-V3 (671B). These results demonstrate the effectiveness of our corpus in bridging the gap between science and policy.
title	Sci2Pol: Evaluating and Fine-tuning LLMs on Scientific-to-Policy Brief Generation
topic	Computational Engineering, Finance, and Science
url	https://arxiv.org/abs/2509.21493

Similar Items