| Main Authors: | Cruz, Walter Hernandez; Devine, Peter; Vadgama, Nikhil; Tasca, Paolo; Xu, Jiahua |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2602.22045 |
| _version_ | 1866914610918981632 |
|---|---|
| author | Cruz, Walter Hernandez; Devine, Peter; Vadgama, Nikhil; Tasca, Paolo; Xu, Jiahua |
| author_facet | Cruz, Walter Hernandez; Devine, Peter; Vadgama, Nikhil; Tasca, Paolo; Xu, Jiahua |
| contents | We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution.
We demonstrate DLT-Corpus' utility by analyzing patterns of technology emergence and market-innovation correlations. Findings reveal that technologies first appear in our scientific literature subset before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grows less tied to short-term sentiment, tracking overall market expansion in a virtuous cycle in which research precedes and enables economic growth that, in turn, funds further innovation.
We release the DLT-Corpus and companion artifacts: LedgerBERT (+23% over BERT-base on a DLT-specific Named Entity Recognition (NER) task), a sentiment analysis dataset of 23,301 crypto news headlines and descriptions, tools, and code. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_22045 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2602.22045 |