Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Jeong, So Won, Ročková, Veronika
Format:	Preprint
Published:	2025
Subjects:	Computation and Language Machine Learning
Online Access:	https://arxiv.org/abs/2503.01869
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866915236636786688
author	Jeong, So Won Ročková, Veronika
author_facet	Jeong, So Won Ročková, Veronika
contents	For a long time, the authorship of the Federalist Papers had been a subject of inquiry and debate, not only by linguists and historians but also by statisticians. In what was arguably the first Bayesian case study, Mosteller and Wallace (1963) provided the first statistical evidence for attributing all disputed papers to Madison. Our paper revisits this historical dataset but from a lens of modern language models, both small and large. We review some of the more popular Large Language Model (LLM) tools and examine them from a statistical point of view in the context of text classification. We investigate whether, without any attempt to fine-tune, the general embedding constructs can be useful for stylometry and attribution. We explain differences between various word/phrase embeddings and discuss how to aggregate them in a document. Contrary to our expectations, we exemplify that dimension expansion with word embeddings may not always be beneficial for attribution relative to dimension reduction with topic embeddings. Our experiments demonstrate that default LLM embeddings (even after manual fine-tuning) may not consistently improve authorship attribution accuracy. Instead, Bayesian analysis with topic embeddings trained on ``function words" yields superior out-of-sample classification performance. This suggests that traditional (small) statistical language models, with their interpretability and solid theoretical foundation, can offer significant advantages in authorship attribution tasks. The code used in this analysis is available at github.com/sowonjeong/slm-to-llm
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_01869
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	From Small to Large Language Models: Revisiting the Federalist Papers Jeong, So Won Ročková, Veronika Computation and Language Machine Learning For a long time, the authorship of the Federalist Papers had been a subject of inquiry and debate, not only by linguists and historians but also by statisticians. In what was arguably the first Bayesian case study, Mosteller and Wallace (1963) provided the first statistical evidence for attributing all disputed papers to Madison. Our paper revisits this historical dataset but from a lens of modern language models, both small and large. We review some of the more popular Large Language Model (LLM) tools and examine them from a statistical point of view in the context of text classification. We investigate whether, without any attempt to fine-tune, the general embedding constructs can be useful for stylometry and attribution. We explain differences between various word/phrase embeddings and discuss how to aggregate them in a document. Contrary to our expectations, we exemplify that dimension expansion with word embeddings may not always be beneficial for attribution relative to dimension reduction with topic embeddings. Our experiments demonstrate that default LLM embeddings (even after manual fine-tuning) may not consistently improve authorship attribution accuracy. Instead, Bayesian analysis with topic embeddings trained on ``function words" yields superior out-of-sample classification performance. This suggests that traditional (small) statistical language models, with their interpretability and solid theoretical foundation, can offer significant advantages in authorship attribution tasks. The code used in this analysis is available at github.com/sowonjeong/slm-to-llm
title	From Small to Large Language Models: Revisiting the Federalist Papers
topic	Computation and Language Machine Learning
url	https://arxiv.org/abs/2503.01869

Similar Items