:: Library Catalog

Bibliographic Details
Main Authors:	Held, William; Paranjape, Bhargavi; Koura, Punit Singh; Lewis, Mike; Zhang, Frank; Mihaylov, Todor
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2501.11747

Similar Items

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
by: Kim, Joongwon, et al.
Published: (2024)

Revisiting Multilingual Data Mixtures in Language Model Pretraining
by: Foroutan, Negar, et al.
Published: (2025)

HorizonBench: Long-Horizon Personalization with Evolving Preferences
by: Li, Shuyue Stella, et al.
Published: (2026)

MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
by: Wen, Bingbing, et al.
Published: (2026)

Distilling an End-to-End Voice Assistant Without Instruction Training Data
by: Held, William, et al.
Published: (2024)

Tracing Persona Vectors Through LLM Pretraining
by: Moskvoretskii, Viktor, et al.
Published: (2026)

daVinci-LLM: Towards the Science of Pretraining
by: Qin, Yiwei, et al.
Published: (2026)

IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining
by: Feng, Dawei, et al.
Published: (2024)

In-context Pretraining: Language Modeling Beyond Document Boundaries
by: Shi, Weijia, et al.
Published: (2023)

Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning
by: Zhao, Yibo, et al.
Published: (2025)

Assessing and Verifying Task Utility in LLM-Powered Applications
by: Arabzadeh, Negar, et al.
Published: (2024)

Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
by: Schaeffer, Rylan, et al.
Published: (2025)

Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining
by: Feng, Steven, et al.
Published: (2024)

Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data
by: Zhang, Xuemiao, et al.
Published: (2025)

BTS: Harmonizing Specialized Experts into a Generalist LLM
by: Zhang, Qizhen, et al.
Published: (2025)

Scaling LLM Inference with Optimized Sample Compute Allocation
by: Zhang, Kexun, et al.
Published: (2024)

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
by: Du, Xinrun, et al.
Published: (2024)

Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
by: Wang, Xinyi, et al.
Published: (2024)

A Continued Pretrained LLM Approach for Automatic Medical Note Generation
by: Yuan, Dong, et al.
Published: (2024)

Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining
by: Guo, Ping, et al.
Published: (2025)

On Linear Representations and Pretraining Data Frequency in Language Models
by: Merullo, Jack, et al.
Published: (2025)

Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment
by: Gan, Woody Haosheng, et al.
Published: (2026)

LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation
by: Zhang, Hengran, et al.
Published: (2025)

Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation
by: Zhang, Hengran, et al.
Published: (2025)

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
by: Ye, Jiasheng, et al.
Published: (2024)

MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models
by: Li, Jiazheng, et al.
Published: (2025)

MEDIC: Comprehensive Evaluation of Leading Indicators for LLM Safety and Utility in Clinical Applications
by: Kanithi, Praveenkumar, et al.
Published: (2024)

Output Embedding Centering for Stable LLM Pretraining
by: Stollenwerk, Felix, et al.
Published: (2026)

Efficient Streaming Language Models with Attention Sinks
by: Xiao, Guangxuan, et al.
Published: (2023)

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
by: Cen, Zhepeng, et al.
Published: (2025)

Estimating LLM Uncertainty with Evidence
by: Ma, Huan, et al.
Published: (2025)

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
by: Luo, Kairong, et al.
Published: (2025)

Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval
by: Singh, Karanbir, et al.
Published: (2025)

FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model
by: Wu, Xiaobao, et al.
Published: (2024)

HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization
by: Chen, Yurun, et al.
Published: (2025)

OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
by: Hallinan, Skyler, et al.
Published: (2026)

AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents
by: Su, Zhe, et al.
Published: (2024)

Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications
by: Arabzadeh, Negar, et al.
Published: (2024)

Collab: Controlled Decoding using Mixture of Agents for LLM Alignment
by: Chakraborty, Souradip, et al.
Published: (2025)

Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour
by: Norris, Max, et al.
Published: (2025)