Bibliographic Details
Main Authors: Wu, Jie; Li, Haoling; Zhang, Xin; Guo, Jiani; Luo, Jane; Liu, Steven; Huang, Yangyu; Chu, Ruihang; Li, Scarlett; Yang, Yujiu
Format: Preprint
Published: 2026
Subjects: Computation and Language; Machine Learning
Online Access: https://arxiv.org/abs/2601.06953
author Wu, Jie
Li, Haoling
Zhang, Xin
Guo, Jiani
Luo, Jane
Liu, Steven
Huang, Yangyu
Chu, Ruihang
Li, Scarlett
Yang, Yujiu
contents Competitive programming poses a significant challenge for Code LLMs. While recent models have shown promise, they heavily rely on finite real-world data, raising concerns about scalability and contamination. In this paper, we investigate a critical question: Can we elevate models to expert-level reasoning performance using fully synthetic data? In response, we first observe that off-the-shelf synthesis methods yield suboptimal results in this domain. To address this, we systematically investigate the key factors governing synthetic data quality. Leveraging these findings, we significantly advance the feature-based synthesis paradigm via domain-specific evolution and a dual-verification strategy, promoting task solvability, solution correctness, and test accuracy. Using this high-quality synthetic data, we train the X-Coder model series under an SFT-then-RL paradigm. X-Coder-7B shows significant performance gains on the challenging LiveCodeBench v5 (62.9% avg@8) and v6 (55.8% avg@8), outperforming larger models trained on real-world data. Extensive analysis distills valuable insights into synthetic data scaling, the necessity of domain-adapted feature evolution, and code-centric reinforcement.
format Preprint
id arxiv_https___arxiv_org_abs_2601_06953
institution arXiv
publishDate 2026
record_format arxiv
title X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2601.06953