| Main Authors: | Bommarito II, Michael J, Bommarito, Jillian, Katz, Daniel Martin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.07854 |
Similar Items
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
by: Bommarito, Michael J, et al.
Published: (2025)
Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary
by: Bommarito, Michael J, et al.
Published: (2025)
OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
by: Bommarito II, Michael J.
Published: (2025)
Natural Language Processing in the Legal Domain
by: Hartung, Dirk, et al.
Published: (2023)
Needles at Scale: LLM-Assisted Target Selection for Windows Vulnerability Research
by: Bommarito II, Michael J.
Published: (2026)
Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection
by: Bommarito II, Michael J.
Published: (2025)
Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis
by: Bommarito II, Michael J.
Published: (2025)
Cultural Fidelity in Large-Language Models: An Evaluation of Online Language Resources as a Driver of Model Performance in Value Representation
by: Kazemi, Sharif, et al.
Published: (2024)
Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?
by: Xu, Naen, et al.
Published: (2025)
Large Language Models as Planning Domain Generators
by: Oswald, James, et al.
Published: (2024)
Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare
by: Barai, P., et al.
Published: (2024)
Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models
by: Dhamani, Dhruv, et al.
Published: (2025)
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
by: Katz, Shahar, et al.
Published: (2024)
SoK: Large Language Model Copyright Auditing via Fingerprinting
by: Shao, Shuo, et al.
Published: (2025)
Data Management For Training Large Language Models: A Survey
by: Wang, Zige, et al.
Published: (2023)
Measuring Copyright Risks of Large Language Model via Partial Information Probing
by: Zhao, Weijie, et al.
Published: (2024)
Multilingual Training and Evaluation Resources for Vision-Language Models
by: Baiamonte, Daniela, et al.
Published: (2026)
SUV: Scalable Large Language Model Copyright Compliance with Regularized Selective Unlearning
by: Xu, Tianyang, et al.
Published: (2025)
Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention
by: Lv, Xingtai, et al.
Published: (2024)
Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
by: Liu, Xinyue, et al.
Published: (2026)
Regurgitative Training: The Value of Real Data in Training Large Language Models
by: Zhang, Jinghui, et al.
Published: (2024)
Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment
by: Yeh, Samuel, et al.
Published: (2025)
Batayan: A Filipino NLP benchmark for evaluating Large Language Models
by: Montalan, Jann Railey, et al.
Published: (2025)
Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data
by: Cook, John, et al.
Published: (2026)
Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models
by: Zhang, Shuo, et al.
Published: (2025)
Extracting Training Dialogue Data from Large Language Model based Task Bots
by: Zhang, Shuo, et al.
Published: (2026)
Continual-learning for Modelling Low-Resource Languages from Large Language Models
by: K, Santosh Srinath, et al.
Published: (2026)
A Study on Hidden Layer Distillation for Large Language Model Pre-Training
by: Guigon, Maxime, et al.
Published: (2026)
Evolving Subnetwork Training for Large Language Models
by: Li, Hanqi, et al.
Published: (2024)
On the Effectiveness of Incremental Training of Large Language Models
by: Li, Miles Q., et al.
Published: (2024)
Can Large Vision-Language Models Detect Images Copyright Infringement from GenAI?
by: Xu, Qipan, et al.
Published: (2025)
STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models
by: Ma, Mingyu Derek, et al.
Published: (2023)
"According to ...": Prompting Language Models Improves Quoting from Pre-Training Data
by: Weller, Orion, et al.
Published: (2023)
On the Semantics of Large Language Models
by: Schuele, Martin
Published: (2025)
Training Optimal Large Diffusion Language Models
by: Ni, Jinjie, et al.
Published: (2025)
Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research
by: Zhong, Tianyang, et al.
Published: (2024)
Training-Free Activation Sparsity in Large Language Models
by: Liu, James, et al.
Published: (2024)
Escaping Collapse: The Strength of Weak Data for Large Language Model Training
by: Amin, Kareem, et al.
Published: (2025)
Unlearning Traces the Influential Training Data of Language Models
by: Isonuma, Masaru, et al.
Published: (2024)
Balanced Data Sampling for Language Model Training with Clustering
by: Shao, Yunfan, et al.
Published: (2024)