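Staff View: Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain :: Library Catalog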

Bibliographic Details
Main Authors:	Uğur, Özgür; Göksu, Mahmut; Çimen, Mahmut; Yılmaz, Musa; Şavirdi, Esra; Demir, Alp Talha; Güllüce, Rumeysa; Çetin, İclal; Sağbaş, Ömer Can
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.16018

_version_	1866909997932216320
author	Uğur, Özgür; Göksu, Mahmut; Çimen, Mahmut; Yılmaz, Musa; Şavirdi, Esra; Demir, Alp Talha; Güllüce, Rumeysa; Çetin, İclal; Sağbaş, Ömer Can
author_facet	Uğur, Özgür; Göksu, Mahmut; Çimen, Mahmut; Yılmaz, Musa; Şavirdi, Esra; Demir, Alp Talha; Güllüce, Rumeysa; Çetin, İclal; Sağbaş, Ömer Can
contents	This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve their best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving performance comparable to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring fewer computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables a gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves a 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_16018
institution	arXiv
publishDate	2026
record_format	arxiv
title	Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
topic	Computation and Language
url	https://arxiv.org/abs/2601.16018

Similar Items