| Main Authors: | Domhan, Tobias, Zhu, Dawei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.01761 |
Similar Items
Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English
by: Zhang, Yue, et al.
Published: (2026)
Less is more: Not all samples are effective for evaluation
by: Song, Wentang, et al.
Published: (2025)
MT-Ranker: Reference-free machine translation evaluation by inter-system ranking
by: Moosa, Ibraheem Muhammad, et al.
Published: (2024)
IsoChronoMeter: A simple and effective isochronic translation evaluation metric
by: Rozanov, Nikolai, et al.
Published: (2024)
Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results
by: Peter, Jan-Thorsten, et al.
Published: (2025)
Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned
by: Jacobs, Cassandra L., et al.
Published: (2024)
Better & Faster Large Language Models via Multi-token Prediction
by: Gloeckle, Fabian, et al.
Published: (2024)
LBPE: Long-token-first Tokenization to Improve Large Language Models
by: Lian, Haoran, et al.
Published: (2024)
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
by: Thompson, Brian, et al.
Published: (2024)
Recurrent babbling: evaluating the acquisition of grammar from limited input data
by: Pannitto, Ludovica, et al.
Published: (2020)
Contextual effects of sentiment deployment in human and machine translation
by: Comstock, Lindy, et al.
Published: (2025)
Source framing triggers systematic evaluation bias in Large Language Models
by: Germani, Federico, et al.
Published: (2025)
Re-evaluating Open-ended Evaluation of Large Language Models
by: Liu, Siqi, et al.
Published: (2025)
Batayan: A Filipino NLP benchmark for evaluating Large Language Models
by: Montalan, Jann Railey, et al.
Published: (2025)
Assessing "Implicit" Retrieval Robustness of Large Language Models
by: Shen, Xiaoyu, et al.
Published: (2024)
MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
by: Juraska, Juraj, et al.
Published: (2025)
Optimizing example selection for retrieval-augmented machine translation with translation memories
by: Bouthors, Maxime, et al.
Published: (2024)
Mining experimental data from Materials Science literature with Large Language Models: an evaluation study
by: Foppiano, Luca, et al.
Published: (2024)
To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models
by: Lin, Junyan, et al.
Published: (2024)
DLLMQuant: Quantizing Diffusion-based Large Language Models
by: Xu, Chen, et al.
Published: (2025)
Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation
by: Sánchez-Cartagena, Víctor M., et al.
Published: (2024)
CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model
by: Li, Jiangtong, et al.
Published: (2025)
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
by: In, Yeonjun, et al.
Published: (2025)
COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain
by: Panagoulias, Dimitrios P., et al.
Published: (2024)
Escaping the sentence-level paradigm in machine translation
by: Post, Matt, et al.
Published: (2023)
Multilingual Language Model Pretraining using Machine-translated Data
by: Wang, Jiayi, et al.
Published: (2025)
A Preference-driven Paradigm for Enhanced Translation with Large Language Models
by: Zhu, Dawei, et al.
Published: (2024)
An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses
by: Chandra, Rohitash, et al.
Published: (2025)
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
by: Zhu, Junda, et al.
Published: (2025)
Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?
by: Zhu, Dawei, et al.
Published: (2024)
Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models
by: Liu, Xiyu, et al.
Published: (2026)
Byte-token Enhanced Language Models for Temporal Point Processes Analysis
by: Kong, Quyu, et al.
Published: (2025)
Feeding Two Birds or Favoring One? Adequacy-Fluency Tradeoffs in Evaluation and Meta-Evaluation of Machine Translation
by: Shayegh, Behzad, et al.
Published: (2025)
A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science
by: Yang, Zonglin, et al.
Published: (2026)
Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation
by: Wysocka, Magdalena, et al.
Published: (2023)
The first open machine translation system for the Chechen language
by: Umishov, Abu-Viskhan A., et al.
Published: (2025)
More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models
by: Chen, Evan, et al.
Published: (2025)
Investigating the translation capabilities of Large Language Models trained on parallel data only
by: Gilabert, Javier García, et al.
Published: (2024)
A Report on the llms evaluating the high school questions
by: Jiawei, Zhu, et al.
Published: (2025)
An evaluation of DeepSeek Models in Biomedical Natural Language Processing
by: Zhan, Zaifu, et al.
Published: (2025)