Bibliographic Details
Main Authors: Liu, Tianbo, Lu, Chixiang, Hao, Jing, Zhang, Hengyu, Wang, Lifei, Jiang, Haibo, Qi, Xiaojuan
Format: Preprint
Published: 2026
Subjects: Machine Learning, Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.12980
_version_ 1866911679347949568
author Liu, Tianbo
Lu, Chixiang
Hao, Jing
Zhang, Hengyu
Wang, Lifei
Jiang, Haibo
Qi, Xiaojuan
contents Molecular structure elucidation from tandem mass spectra (MS/MS) remains challenging, particularly for de novo generation beyond database coverage. A common approach decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding, which enables the use of large-scale molecular corpora. At deployment, however, the decoder relies on predicted rather than oracle fingerprints, and the structured errors in those predictions propagate into generation. The result is a fundamental condition mismatch: models trained on clean inputs must operate on noisy, biased predictions, especially for long-tail substructures. We present CoRe-Gen, which explicitly addresses this gap. CoRe-Gen improves the intermediate condition via synthetic-spectrum pretraining of the encoder, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and mitigates residual errors using structure-aware autoregressive decoding with compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. Experiments on standard benchmarks show that CoRe-Gen establishes a new state of the art on NPLIB1, achieving 19.54% Top-1 and 29.92% Top-10 exact-match accuracy, while remaining competitive on the more challenging MassSpecGym benchmark. CoRe-Gen also preserves the efficiency advantages of autoregressive decoding, providing a practical and scalable solution for robust spectrum-to-structure generation under realistic conditions.
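The abstract mentions frequency-aware fingerprint corruption during decoder training but does not give the scheme. The sketch below is one possible reading, not the authors' method: it assumes binary fingerprints and a hypothetical linear schedule in which bits that are rarely set in the training corpus are flipped more often, imitating a predictor that is less reliable on long-tail substructures. The function names and the `p_min`/`p_max` values are illustrative assumptions.

```python
import random


def bit_frequencies(fingerprints):
    """Fraction of training fingerprints in which each bit is set."""
    n = len(fingerprints)
    width = len(fingerprints[0])
    return [sum(fp[i] for fp in fingerprints) / n for i in range(width)]


def flip_probabilities(freqs, p_min=0.01, p_max=0.20):
    """Hypothetical schedule (an assumption, not taken from the paper):
    flip probability falls linearly from p_max for a bit that is never set
    to p_min for a bit that is always set."""
    return [p_max - (p_max - p_min) * f for f in freqs]


def corrupt(fp, flip_probs, rng):
    """Flip each bit independently with its own per-bit probability."""
    return [bit ^ 1 if rng.random() < p else bit
            for bit, p in zip(fp, flip_probs)]


# Toy corpus of 4-bit fingerprints; bit 0 is common, bits 1-3 are rare.
train = [[1, 0, 0, 1], [1, 1, 0, 0], [1, 0, 0, 0], [1, 0, 1, 0]]
freqs = bit_frequencies(train)
probs = flip_probabilities(freqs)

# During decoder training, the clean fingerprint would be replaced by a
# corrupted copy so the decoder sees deployment-like noise.
rng = random.Random(0)
noisy = corrupt(train[0], probs, rng)
```

In a real pipeline the flip probabilities would more plausibly be estimated from the encoder's observed per-bit error rates on held-out data; the linear schedule here only illustrates the idea of per-bit, frequency-dependent noise.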
format Preprint
id arxiv_https___arxiv_org_abs_2605_12980
institution arXiv
publishDate 2026
record_format arxiv
title CoRe-Gen: Robust Spectrum-to-Structure Generation under Imperfect Fingerprint Conditions
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2605.12980