Staff View: Entropy and type-token ratio in gigaword corpora :: Library Catalog

Bibliographic Details
Main Authors:	Rosillo-Rodes, Pablo; Miguel, Maxi San; Sanchez, David
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Information Retrieval; Physics and Society
Online Access:	https://arxiv.org/abs/2411.10227

_version_	1866911056680452096
author	Rosillo-Rodes, Pablo; Miguel, Maxi San; Sanchez, David
author_facet	Rosillo-Rodes, Pablo; Miguel, Maxi San; Sanchez, David
contents	There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. We here investigate both diversity metrics in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a varied testbed for a quantitative approach to lexical diversity. We unveil an empirical functional relation between entropy and type-token ratio of texts of a given corpus and language, which is a consequence of the statistical laws observed in natural language. Further, in the limit of large text lengths we find an analytical expression for this relation relying on both Zipf and Heaps laws that agrees with our empirical findings.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_10227
institution	arXiv
publishDate	2024
record_format	arxiv
title	Entropy and type-token ratio in gigaword corpora
topic	Computation and Language; Information Retrieval; Physics and Society
url	https://arxiv.org/abs/2411.10227

Similar Items