Staff view: Rank dynamics of word usage at multiple scales :: Library Catalog

Bibliographic details
Main authors:	Morales, José A.; Colman, Ewan; Sánchez, Sergio; Sánchez-Puig, Fernanda; Pineda, Carlos; Iñiguez, Gerardo; Cocho, Germinal; Flores, Jorge; Gershenson, Carlos
Format:	Preprint
Published / Created:	2018
Subjects:	Physics and Society
Online access:	https://arxiv.org/abs/1802.07258

_version_	1866914304143392768
author	Morales, José A.; Colman, Ewan; Sánchez, Sergio; Sánchez-Puig, Fernanda; Pineda, Carlos; Iñiguez, Gerardo; Cocho, Germinal; Flores, Jorge; Gershenson, Carlos
author_facet	Morales, José A.; Colman, Ewan; Sánchez, Sergio; Sánchez-Puig, Fernanda; Pineda, Carlos; Iñiguez, Gerardo; Cocho, Germinal; Flores, Jorge; Gershenson, Carlos
contents	The recent dramatic increase in online data availability has allowed researchers to explore human culture with unprecedented detail, such as the growth and diversification of language. In particular, it provides statistical tools to explore whether word use is similar across languages, and if so, whether these generic features appear at different scales of language structure. Here we use the Google Books $N$-grams dataset to analyze the temporal evolution of word usage in several languages. We apply measures proposed recently to study rank dynamics, such as the diversity of $N$-grams in a given rank, the probability that an $N$-gram changes rank between successive time intervals, the rank entropy, and the rank complexity. Using different methods, results show that there are generic properties for different languages at different scales, such as a core of words necessary to minimally understand a language. We also propose a null model to explore the relevance of linguistic structure across multiple scales, concluding that $N$-gram statistics cannot be reduced to word statistics. We expect our results to be useful in improving text prediction algorithms, as well as in shedding light on the large-scale features of language use, beyond linguistic and cultural differences across human populations.
format	Preprint
id	arxiv_https___arxiv_org_abs_1802_07258
institution	arXiv
publishDate	2018
record_format	arxiv
title	Rank dynamics of word usage at multiple scales
topic	Physics and Society
url	https://arxiv.org/abs/1802.07258
