:: Library Catalog

Bibliographic Details
Main Author:	Merrick, Luke
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2407.18887

Similar Items

Arctic-Embed 2.0: Multilingual Retrieval Without Compromise
by: Yu, Puxuan, et al.
Published: (2024)

Improving Pretraining Data Using Perplexity Correlations
by: Thrush, Tristan, et al.
Published: (2024)

NoteContrast: Contrastive Language-Diagnostic Pretraining for Medical Text
by: Kailas, Prajwal, et al.
Published: (2024)

BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
by: DatologyAI, et al.
Published: (2025)

Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining
by: Feng, Steven, et al.
Published: (2024)

Language Models Improve When Pretraining Data Matches Target Tasks
by: Mizrahi, David, et al.
Published: (2025)

Detecting Pretraining Data from Large Language Models
by: Shi, Weijia, et al.
Published: (2023)

Diffusion-Pretrained Dense and Contextual Embeddings
by: Eslami, Sedigheh, et al.
Published: (2026)

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
by: Luo, Kairong, et al.
Published: (2025)

An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets
by: Sutrakar, Vijay Kumar, et al.
Published: (2025)

Data-Centric Lessons To Improve Speech-Language Pretraining
by: Udandarao, Vishaal, et al.
Published: (2025)

Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences
by: Schneider, Johannes
Published: (2023)

Output Embedding Centering for Stable LLM Pretraining
by: Stollenwerk, Felix, et al.
Published: (2026)

Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
by: Grangier, David, et al.
Published: (2024)

Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free
by: Li, Ziyue, et al.
Published: (2024)

SDEC: Semantic Deep Embedded Clustering
by: Rahman, Mohammad Wali Ur, et al.
Published: (2025)

Can GRPO Help LLMs Transcend Their Pretraining Origin?
by: Ni, Kangqi, et al.
Published: (2025)

Group-Level Data Selection for Efficient Pretraining
by: Yu, Zichun, et al.
Published: (2025)

Improving Clustering on Occupational Text Data through Dimensionality Reduction
by: García, Iago Xabier Vázquez, et al.
Published: (2025)

Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings
by: Xu, Binbin, et al.
Published: (2025)

Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification
by: Mamtani, Sumit, et al.
Published: (2025)

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
by: Ma, Xuezhe, et al.
Published: (2024)

TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs
by: Wang, Cheng, et al.
Published: (2024)

Self-Adaptive Reconstruction with Contrastive Learning for Unsupervised Sentence Embeddings
by: Liu, Junlong, et al.
Published: (2024)

DataDecide: How to Predict Best Pretraining Data with Small Experiments
by: Magnusson, Ian, et al.
Published: (2025)

Scaling Laws for Mixture Pretraining Under Data Constraints
by: Sedova, Anastasiia, et al.
Published: (2026)

Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples
by: Sato, Soma, et al.
Published: (2024)

Improving Language Plasticity via Pretraining with Active Forgetting
by: Chen, Yihong, et al.
Published: (2023)

Repetition Improves Language Model Embeddings
by: Springer, Jacob Mitchell, et al.
Published: (2024)

T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
by: Yuan, Yi, et al.
Published: (2024)

Luxical: High-Speed Lexical-Dense Text Embeddings
by: DatologyAI, et al.
Published: (2025)

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
by: Yu, Zichun, et al.
Published: (2024)

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
by: Niklaus, Joel, et al.
Published: (2026)

Perturb Your Data: Paraphrase-Guided Training Data Watermarking
by: Shetty, Pranav, et al.
Published: (2025)

Modeling Caption Diversity in Contrastive Vision-Language Pretraining
by: Lavoie, Samuel, et al.
Published: (2024)

HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?
by: Dipta, Shubhashis Roy, et al.
Published: (2024)

Procedural Pretraining: Warming Up Language Models with Abstract Data
by: Jiang, Liangze, et al.
Published: (2026)

Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
by: Bethune, Louis, et al.
Published: (2025)

Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
by: Messmer, Bettina, et al.
Published: (2025)

Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
by: Sam, Dylan, et al.
Published: (2025)