| Main Author: | Merrick, Luke |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.18887 |
Similar Items
Arctic-Embed 2.0: Multilingual Retrieval Without Compromise
by: Yu, Puxuan, et al.
Published: (2024)
Improving Pretraining Data Using Perplexity Correlations
by: Thrush, Tristan, et al.
Published: (2024)
NoteContrast: Contrastive Language-Diagnostic Pretraining for Medical Text
by: Kailas, Prajwal, et al.
Published: (2024)
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
by: DatologyAI, et al.
Published: (2025)
Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining
by: Feng, Steven, et al.
Published: (2024)
Language Models Improve When Pretraining Data Matches Target Tasks
by: Mizrahi, David, et al.
Published: (2025)
Detecting Pretraining Data from Large Language Models
by: Shi, Weijia, et al.
Published: (2023)
Diffusion-Pretrained Dense and Contextual Embeddings
by: Eslami, Sedigheh, et al.
Published: (2026)
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
by: Luo, Kairong, et al.
Published: (2025)
An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets
by: Sutrakar, Vijay Kumar, et al.
Published: (2025)
Data-Centric Lessons To Improve Speech-Language Pretraining
by: Udandarao, Vishaal, et al.
Published: (2025)
Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences
by: Schneider, Johannes
Published: (2023)
Output Embedding Centering for Stable LLM Pretraining
by: Stollenwerk, Felix, et al.
Published: (2026)
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
by: Grangier, David, et al.
Published: (2024)
Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free
by: Li, Ziyue, et al.
Published: (2024)
SDEC: Semantic Deep Embedded Clustering
by: Rahman, Mohammad Wali Ur, et al.
Published: (2025)
Can GRPO Help LLMs Transcend Their Pretraining Origin?
by: Ni, Kangqi, et al.
Published: (2025)
Group-Level Data Selection for Efficient Pretraining
by: Yu, Zichun, et al.
Published: (2025)
Improving Clustering on Occupational Text Data through Dimensionality Reduction
by: García, Iago Xabier Vázquez, et al.
Published: (2025)
Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings
by: Xu, Binbin, et al.
Published: (2025)
Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification
by: Mamtani, Sumit, et al.
Published: (2025)
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
by: Ma, Xuezhe, et al.
Published: (2024)
TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs
by: Wang, Cheng, et al.
Published: (2024)
Self-Adaptive Reconstruction with Contrastive Learning for Unsupervised Sentence Embeddings
by: Liu, Junlong, et al.
Published: (2024)
DataDecide: How to Predict Best Pretraining Data with Small Experiments
by: Magnusson, Ian, et al.
Published: (2025)
Scaling Laws for Mixture Pretraining Under Data Constraints
by: Sedova, Anastasiia, et al.
Published: (2026)
Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples
by: Sato, Soma, et al.
Published: (2024)
Improving Language Plasticity via Pretraining with Active Forgetting
by: Chen, Yihong, et al.
Published: (2023)
Repetition Improves Language Model Embeddings
by: Springer, Jacob Mitchell, et al.
Published: (2024)
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
by: Yuan, Yi, et al.
Published: (2024)
Luxical: High-Speed Lexical-Dense Text Embeddings
by: DatologyAI, et al.
Published: (2025)
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
by: Yu, Zichun, et al.
Published: (2024)
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
by: Niklaus, Joel, et al.
Published: (2026)
Perturb Your Data: Paraphrase-Guided Training Data Watermarking
by: Shetty, Pranav, et al.
Published: (2025)
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
by: Lavoie, Samuel, et al.
Published: (2024)
HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?
by: Dipta, Shubhashis Roy, et al.
Published: (2024)
Procedural Pretraining: Warming Up Language Models with Abstract Data
by: Jiang, Liangze, et al.
Published: (2026)
Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
by: Bethune, Louis, et al.
Published: (2025)
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
by: Messmer, Bettina, et al.
Published: (2025)
Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
by: Sam, Dylan, et al.
Published: (2025)