:: Library Catalog

Bibliographic Details
Main Authors:	Clarke, Christopher; Daynauth, Roland; Wilkinson, Charlene; Devonish, Hubert; Mars, Jason
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2405.03832

Similar Items

Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments
by: Daynauth, Roland, et al.
Published: (2024)

Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat
by: Daynauth, Roland, et al.
Published: (2024)

SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models
by: Daynauth, Roland, et al.
Published: (2025)

PEFT-U: Parameter-Efficient Fine-Tuning for User Personalization
by: Clarke, Christopher, et al.
Published: (2024)

CreoleVal: Multilingual Multitask Benchmarks for Creoles
by: Lent, Heather, et al.
Published: (2023)

AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts
by: Milička, Jiří, et al.
Published: (2025)

Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios
by: Joshi, Aditya, et al.
Published: (2024)

Multilingual and Explainable Text Detoxification with Parallel Corpora
by: Dementieva, Daryna, et al.
Published: (2024)

Attributing Culture-Conditioned Generations to Pretraining Corpora
by: Li, Huihan, et al.
Published: (2024)

Bottom-Up and Top-Down Analysis of Values, Agendas, and Observations in Corpora and LLMs
by: Friedman, Scott E., et al.
Published: (2024)

Beyond Line-Level Filtering for the Pretraining Corpora of LLMs
by: Park, Chanwoo, et al.
Published: (2025)

API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs
by: Basu, Kinjal, et al.
Published: (2024)

A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias
by: Xu, Yuemei, et al.
Published: (2024)

AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora
by: Zhou, Zhihan, et al.
Published: (2025)

Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora
by: Abbas, Chaymaa, et al.
Published: (2026)

Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora
by: Majurski, Michael, et al.
Published: (2025)

A First Context-Free Grammar Applied to Nawatl Corpora Augmentation
by: Guzmán-Landa, Juan-José, et al.
Published: (2025)

Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
by: Hennara, Khalil, et al.
Published: (2025)

Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies
by: Artemova, Ekaterina, et al.
Published: (2025)

SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora
by: Qarah, Faisal
Published: (2024)

From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora
by: Shen, Yingli, et al.
Published: (2025)

Mitigating Stylistic Biases of Machine Translation Systems via Monolingual Corpora Only
by: Gao, Xuanqi, et al.
Published: (2025)

Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement
by: Kersting, Nicholas S., et al.
Published: (2026)

MegaMath: Pushing the Limits of Open Math Corpora
by: Zhou, Fan, et al.
Published: (2025)

Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models
by: Ramos, Luis, et al.
Published: (2025)

GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages
by: Gyamfi, Lawrence Adu, et al.
Published: (2026)

Preference Consistency Matters: Enhancing Preference Learning in Language Models with Automated Self-Curation of Training Corpora
by: Lee, JoonHo, et al.
Published: (2024)

CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems
by: Bhattacharjee, Soham, et al.
Published: (2025)

Discovering Multi-Scale Semantic Structure in Text Corpora Using Density-Based Trees and LLM Embeddings
by: Haschka, Thomas, et al.
Published: (2025)

Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation
by: Kim, Ireh, et al.
Published: (2026)

MTP: A Meaning-Typed Language Abstraction for AI-Integrated Programming
by: Dantanarayana, Jayanaka L., et al.
Published: (2024)

Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora
by: Kim, Yungi, et al.
Published: (2024)

EmbGen: Teaching with Reassembled Corpora
by: Lenin, Arun K, et al.
Published: (2026)

AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
by: Bai, Jiaxin, et al.
Published: (2025)

The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
by: Amiraz, Chen, et al.
Published: (2025)

OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora
by: Flynt, Jeffrey
Published: (2026)

Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction
by: Gomes, Juliana Resplande Sant'anna, et al.
Published: (2025)

Building a Chinese Medical Dialogue System: Integrating Large-scale Corpora and Novel Models
by: Wang, Xinyuan, et al.
Published: (2024)

What Makes a Reward Model a Good Teacher? An Optimization Perspective
by: Razin, Noam, et al.
Published: (2025)

Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting
by: Mühlenbernd, Roland
Published: (2026)