Staff View: DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain :: Library Catalog

Bibliographic Details
Main Authors:	Cruz, Walter Hernandez; Devine, Peter; Vadgama, Nikhil; Tasca, Paolo; Xu, Jiahua
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.22045

_version_	1866914610918981632
author	Cruz, Walter Hernandez Devine, Peter Vadgama, Nikhil Tasca, Paolo Xu, Jiahua
author_facet	Cruz, Walter Hernandez Devine, Peter Vadgama, Nikhil Tasca, Paolo Xu, Jiahua
contents	We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus' utility by analyzing patterns of technology emergence and market-innovation correlations. Findings reveal that technologies first appear in our scientific literature subset before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grows less tied to short-term sentiment, tracking overall market expansion in a virtuous cycle in which research precedes and enables economic growth that, in turn, funds further innovation. We release the DLT-Corpus and companion artifacts: LedgerBERT (+23% over BERT-base on a DLT-specific Named Entity Recognition (NER) task), a sentiment analysis dataset of 23,301 crypto news headlines and descriptions, tools, and code.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_22045
institution	arXiv
publishDate	2026
record_format	arxiv
title	DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain
topic	Computation and Language
url	https://arxiv.org/abs/2602.22045

Similar Items