Staff View: Low-Perplexity LLM-Generated Sequences and Where To Find Them :: Library Catalog

Bibliographic Details
Main Authors:	Wuhrmann, Arthur; Kucherenko, Anastasiia; Kucharavy, Andrei
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2507.01844

_version_	1866916822595403776
author	Wuhrmann, Arthur; Kucherenko, Anastasiia; Kucharavy, Andrei
author_facet	Wuhrmann, Arthur; Kucherenko, Anastasiia; Kucharavy, Andrei
contents	As Large Language Models (LLMs) become increasingly widespread, understanding how specific training data shapes their outputs is crucial for transparency, accountability, privacy, and fairness. To explore how LLMs leverage and replicate their training data, we introduce a systematic approach centered on analyzing low-perplexity sequences - high-probability text spans generated by the model. Our pipeline reliably extracts such long sequences across diverse topics while avoiding degeneration, then traces them back to their sources in the training data. Surprisingly, we find that a substantial portion of these low-perplexity spans cannot be mapped to the corpus. For those that do match, we quantify the distribution of occurrences across source documents, highlighting the scope and nature of verbatim recall and paving the way toward a better understanding of how LLMs' training data impacts their behavior.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_01844
institution	arXiv
publishDate	2025
record_format	arxiv
title	Low-Perplexity LLM-Generated Sequences and Where To Find Them
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2507.01844