Bibliographic Details
Main Authors:	Zhao, Xiying; Wen, Zhoufutu; Chen, Zhixuan; Ding, Jingzhe; Jiao, Jianpeng; Li, Shuai; Li, Xi; Liang, Danni; Long, Shengda; Liu, Qianqian; Wu, Xianbo; Gao, Hongwan; Gao, Xiang; Hu, Liang; Liu, Jiashuo; Liu, Mengyun; Shi, Weiran; Yang, Chenghao; Yang, Qianyu; Zhang, Xuanliang; Zhang, Ge; Huang, Wenhao; Tang, Yuwen
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.10984

Table of Contents:

The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.

Similar Items