:: Library Catalog

Bibliographic Details
Main Authors:	Nadas, Mihai Dan; Diosan, Laura; Tomescu, Andreea; Piscoran, Andrei
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.10410

Similar Items

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
by: Nadas, Mihai, et al.
Published: (2025)

Building Large-Scale English-Romanian Literary Translation Resources with Open Models
by: Nadas, Mihai, et al.
Published: (2025)

Synthetic Data Generation Using Large Language Models: Advances in Text and Code
by: Nadas, Mihai, et al.
Published: (2025)

Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study
by: Nadas, Mihai, et al.
Published: (2025)

LLMic: Romanian Foundation Language Model
by: Bădoiu, Vlad-Andrei, et al.
Published: (2025)

"Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models
by: Masala, Mihai, et al.
Published: (2026)

FuLG: 150B Romanian Corpus for Language Model Pretraining
by: Bădoiu, Vlad-Andrei, et al.
Published: (2024)

HistNERo: Historical Named Entity Recognition for the Romanian Language
by: Avram, Andrei-Marius, et al.
Published: (2024)

RELATE: A Modern Processing Platform for Romanian Language
by: Păiş, Vasile, et al.
Published: (2024)

"Vorbeşti Româneşte?" A Recipe to Train Powerful Romanian LLMs with English Instructions
by: Masala, Mihai, et al.
Published: (2024)

RoQLlama: A Lightweight Romanian Adapted Language Model
by: Dima, George-Andrei, et al.
Published: (2024)

Neural Grammatical Error Correction for Romanian
by: Cotet, Teodor-Mihai, et al.
Published: (2026)

Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction
by: Manzanarez, Gerardo Aleman, et al.
Published: (2025)

Value-Aware Numerical Representations for Transformer Language Models
by: Dutulescu, Andreea, et al.
Published: (2026)

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
by: Li, Haoran, et al.
Published: (2024)

Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models
by: Dima, George-Andrei, et al.
Published: (2025)

PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
by: Yin, Shangjian, et al.
Published: (2025)

RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
by: Diaconu, Alexandra, et al.
Published: (2026)

MuSaRoNews: A Multidomain, Multimodal Satire Dataset from Romanian News Articles
by: Smădu, Răzvan-Alexandru, et al.
Published: (2025)

Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering
by: Negoita, Vlad, et al.
Published: (2025)

F5-TTS-RO: Extending F5-TTS to Romanian TTS via Lightweight Input Adaptation
by: Chivereanu, Radu-Gabriel, et al.
Published: (2025)

LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch
by: Pfister, Jan, et al.
Published: (2024)

Improving Legal Judgement Prediction in Romanian with Long Text Encoders
by: Masala, Mihai, et al.
Published: (2024)

A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian
by: Rogoz, Ana-Cristina, et al.
Published: (2025)

RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation
by: Avram, Andrei-Marius, et al.
Published: (2024)

A Retrieval-Based Approach to Medical Procedure Matching in Romanian
by: Niculae, Andrei, et al.
Published: (2025)

OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
by: Masala, Mihai, et al.
Published: (2024)

SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
by: Tukenov, Saken
Published: (2026)

SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset
by: Smădu, Răzvan-Alexandru, et al.
Published: (2025)

RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian
by: Avram, Andrei-Marius, et al.
Published: (2026)

MoRoVoc: A Large Dataset for Geographical Variation Identification of the Spoken Romanian Language
by: Avram, Andrei-Marius, et al.
Published: (2025)

Training Language Models with homotokens Leads to Delayed Overfitting
by: Cosma, Adrian, et al.
Published: (2026)

Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
by: Uğur, Özgür, et al.
Published: (2026)

An Analysis of Multi-Task Architectures for the Hierarchic Multi-Label Problem of Vehicle Model and Make Classification
by: Manole, Alexandru, et al.
Published: (2026)

Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian
by: Niculae, Andrei, et al.
Published: (2025)

WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
by: Feuer, Benjamin, et al.
Published: (2025)

Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch
by: Di, Donglin, et al.
Published: (2024)

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
by: Vacareanu, Robert, et al.
Published: (2024)

RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams
by: Man, Andrei Vlad, et al.
Published: (2025)

Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference
by: Akoju, Sushma Anand, et al.
Published: (2023)