| Main Author: | Church, Kenneth |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.14721 |
Similar Items
Building and Aligning Comparable Corpora
by: Saad, Motaz, et al.
Published: (2025)
Since the Scientific Literature Is Multilingual, Our Models Should Be Too
by: Ebrahimi, Abteen, et al.
Published: (2024)
New Textual Corpora for Serbian Language Modeling
by: Škorić, Mihailo, et al.
Published: (2024)
Bias in News Summarization: Measures, Pitfalls and Corpora
by: Steen, Julius, et al.
Published: (2023)
AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts
by: Milička, Jiří, et al.
Published: (2025)
From Outliers to Topics in Language Models: Anticipating Trends in News Corpora
by: Zve, Evangelia, et al.
Published: (2025)
Data Augmentation for Code Translation with Comparable Corpora and Multiple References
by: Xie, Yiqing, et al.
Published: (2023)
Reap the Wild Wind: Detecting Media Storms in Large-Scale News Corpora
by: Markus, Dror K., et al.
Published: (2024)
CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation
by: Ljubešić, Nikola, et al.
Published: (2024)
Identifying Emerging Concepts in Large Corpora
by: Ma, Sibo, et al.
Published: (2025)
Validating and Exploring Large Geographic Corpora
by: Dunn, Jonathan
Published: (2024)
Are Generative Language Models Multicultural? A Study on Hausa Culture and Emotions using ChatGPT
by: Ahmad, Ibrahim Said, et al.
Published: (2024)
Mathematical Entities: Corpora and Benchmarks
by: Collard, Jacob, et al.
Published: (2024)
The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora
by: Pungeršek, Taja Kuzman, et al.
Published: (2026)
Active Learning for Multilingual Fingerspelling Corpora
by: Wang, Shuai, et al.
Published: (2023)
Unsupervised Location Mapping for Narrative Corpora
by: Wagner, Eitan, et al.
Published: (2025)
Guylingo: The Republic of Guyana Creole Corpora
by: Clarke, Christopher, et al.
Published: (2024)
Disambiguating Numeral Sequences to Decipher Ancient Accounting Corpora
by: Born, Logan, et al.
Published: (2025)
Wiki Dumps to Training Corpora: South Slavic Case
by: Škorić, Mihailo, et al.
Published: (2026)
On Translating Technical Terminology: A Translation Workflow for Machine-Translated Acronyms
by: Yue, Richard, et al.
Published: (2024)
TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora
by: Kargupta, Priyanka, et al.
Published: (2025)
When Should Dense Retrievers Be Updated in Evolving Corpora? Detecting Out-of-Distribution Corpora Using GradNormIR
by: Ko, Dayoon, et al.
Published: (2025)
Multilingual and Explainable Text Detoxification with Parallel Corpora
by: Dementieva, Daryna, et al.
Published: (2024)
Attributing Culture-Conditioned Generations to Pretraining Corpora
by: Li, Huihan, et al.
Published: (2024)
Is Peer-Reviewing Worth the Effort?
by: Church, Kenneth, et al.
Published: (2024)
Data Caricatures: On the Representation of African American Language in Pretraining Corpora
by: Deas, Nicholas, et al.
Published: (2025)
MASRAD: Arabic Terminology Management Corpora with Semi-Automatic Construction
by: Nasser, Mahdi, et al.
Published: (2025)
Multilingual Embedding Probes Fail to Generalize Across Learner Corpora
by: Lyngbaek, Laurits, et al.
Published: (2026)
HistLens: Mapping Idea Change across Concepts and Corpora
by: Jing, Yi, et al.
Published: (2026)
AcrosticSleuth: Probabilistic Identification and Ranking of Acrostics in Multilingual Corpora
by: Fedchin, Aleksandr, et al.
Published: (2024)
Beyond Line-Level Filtering for the Pretraining Corpora of LLMs
by: Park, Chanwoo, et al.
Published: (2025)
Unsupervised Cross-Lingual Part-of-Speech Tagging with Monolingual Corpora Only
by: Zheng, Jianyu
Published: (2026)
ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople
by: Paul, Shounak, et al.
Published: (2026)
WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora
by: Wang, Pengyu, et al.
Published: (2026)
Unearthing Large Scale Domain-Specific Knowledge from Public Corpora
by: Fei, Zhaoye, et al.
Published: (2024)
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models
by: Lin, Peiqin, et al.
Published: (2024)
Building Corpora for Single-Channel Speech Separation Across Multiple Domains
by: Maciejewski, Matthew, et al.
Published: (2018)
Semiparametric Latent Topic Modeling on Consumer-Generated Corpora
by: Dayta, Dominic B., et al.
Published: (2021)
CAMEO: Collection of Multilingual Emotional Speech Corpora
by: Christop, Iwona, et al.
Published: (2025)
Identifying Quantum Mechanical Statistics in Italian Corpora
by: Aerts, Diederik, et al.
Published: (2024)