| Main Authors: | Nadas, Mihai Dan; Diosan, Laura; Tomescu, Andreea; Piscoran, Andrei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.10410 |
Similar Items
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
by: Nadas, Mihai, et al.
Published: (2025)
Building Large-Scale English-Romanian Literary Translation Resources with Open Models
by: Nadas, Mihai, et al.
Published: (2025)
Synthetic Data Generation Using Large Language Models: Advances in Text and Code
by: Nadas, Mihai, et al.
Published: (2025)
Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study
by: Nadas, Mihai, et al.
Published: (2025)
LLMic: Romanian Foundation Language Model
by: Bădoiu, Vlad-Andrei, et al.
Published: (2025)
"Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models
by: Masala, Mihai, et al.
Published: (2026)
FuLG: 150B Romanian Corpus for Language Model Pretraining
by: Bădoiu, Vlad-Andrei, et al.
Published: (2024)
HistNERo: Historical Named Entity Recognition for the Romanian Language
by: Avram, Andrei-Marius, et al.
Published: (2024)
RELATE: A Modern Processing Platform for Romanian Language
by: Păiş, Vasile, et al.
Published: (2024)
"Vorbeşti Româneşte?" A Recipe to Train Powerful Romanian LLMs with English Instructions
by: Masala, Mihai, et al.
Published: (2024)
RoQLlama: A Lightweight Romanian Adapted Language Model
by: Dima, George-Andrei, et al.
Published: (2024)
Neural Grammatical Error Correction for Romanian
by: Cotet, Teodor-Mihai, et al.
Published: (2026)
Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction
by: Manzanarez, Gerardo Aleman, et al.
Published: (2025)
Value-Aware Numerical Representations for Transformer Language Models
by: Dutulescu, Andreea, et al.
Published: (2026)
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
by: Li, Haoran, et al.
Published: (2024)
Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models
by: Dima, George-Andrei, et al.
Published: (2025)
PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
by: Yin, Shangjian, et al.
Published: (2025)
RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
by: Diaconu, Alexandra, et al.
Published: (2026)
MuSaRoNews: A Multidomain, Multimodal Satire Dataset from Romanian News Articles
by: Smădu, Răzvan-Alexandru, et al.
Published: (2025)
Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering
by: Negoita, Vlad, et al.
Published: (2025)
F5-TTS-RO: Extending F5-TTS to Romanian TTS via Lightweight Input Adaptation
by: Chivereanu, Radu-Gabriel, et al.
Published: (2025)
LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch
by: Pfister, Jan, et al.
Published: (2024)
Improving Legal Judgement Prediction in Romanian with Long Text Encoders
by: Masala, Mihai, et al.
Published: (2024)
A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian
by: Rogoz, Ana-Cristina, et al.
Published: (2025)
RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation
by: Avram, Andrei-Marius, et al.
Published: (2024)
A Retrieval-Based Approach to Medical Procedure Matching in Romanian
by: Niculae, Andrei, et al.
Published: (2025)
OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
by: Masala, Mihai, et al.
Published: (2024)
SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
by: Tukenov, Saken
Published: (2026)
SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset
by: Smădu, Răzvan-Alexandru, et al.
Published: (2025)
RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian
by: Avram, Andrei-Marius, et al.
Published: (2026)
MoRoVoc: A Large Dataset for Geographical Variation Identification of the Spoken Romanian Language
by: Avram, Andrei-Marius, et al.
Published: (2025)
Training Language Models with homotokens Leads to Delayed Overfitting
by: Cosma, Adrian, et al.
Published: (2026)
Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
by: Uğur, Özgür, et al.
Published: (2026)
An Analysis of Multi-Task Architectures for the Hierarchic Multi-Label Problem of Vehicle Model and Make Classification
by: Manole, Alexandru, et al.
Published: (2026)
Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian
by: Niculae, Andrei, et al.
Published: (2025)
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
by: Feuer, Benjamin, et al.
Published: (2025)
Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch
by: Di, Donglin, et al.
Published: (2024)
From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
by: Vacareanu, Robert, et al.
Published: (2024)
RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams
by: Man, Andrei Vlad, et al.
Published: (2025)
Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference
by: Akoju, Sushma Anand, et al.
Published: (2023)