Staff View: X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Jie; Li, Haoling; Zhang, Xin; Guo, Jiani; Luo, Jane; Liu, Steven; Huang, Yangyu; Chu, Ruihang; Li, Scarlett; Yang, Yujiu
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2601.06953

_version_	1866908801575157760
author	Wu, Jie; Li, Haoling; Zhang, Xin; Guo, Jiani; Luo, Jane; Liu, Steven; Huang, Yangyu; Chu, Ruihang; Li, Scarlett; Yang, Yujiu
author_facet	Wu, Jie; Li, Haoling; Zhang, Xin; Guo, Jiani; Luo, Jane; Liu, Steven; Huang, Yangyu; Chu, Ruihang; Li, Scarlett; Yang, Yujiu
contents	Competitive programming poses a significant challenge for Code LLMs. While recent models have shown promise, they heavily rely on finite real-world data, raising concerns about scalability and contamination. In this paper, we investigate a critical question: Can we elevate models to expert-level reasoning performance using fully synthetic data? In response, we first observe that off-the-shelf synthesis methods yield suboptimal results in this domain. To address this, we systematically investigate the key factors governing synthetic data quality. Leveraging these findings, we significantly advance the feature-based synthesis paradigm via domain-specific evolution and a dual-verification strategy, promoting task solvability, solution correctness, and test accuracy. Using this high-quality synthetic data, we train the X-Coder model series under an SFT-then-RL paradigm. X-Coder-7B shows significant performance gains on the challenging LiveCodeBench v5 (62.9% avg@8) and v6 (55.8% avg@8), outperforming larger models trained on real-world data. Extensive analysis distills valuable insights into synthetic data scaling, the necessity of domain-adapted feature evolution, and code-centric reinforcement.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_06953
institution	arXiv
publishDate	2026
record_format	arxiv
title	X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2601.06953
