| Main Authors: | Liu, Tianbo; Lu, Chixiang; Hao, Jing; Zhang, Hengyu; Wang, Lifei; Jiang, Haibo; Qi, Xiaojuan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2605.12980 |
| _version_ | 1866911679347949568 |
|---|---|
| author | Liu, Tianbo; Lu, Chixiang; Hao, Jing; Zhang, Hengyu; Wang, Lifei; Jiang, Haibo; Qi, Xiaojuan |
| author_facet | Liu, Tianbo; Lu, Chixiang; Hao, Jing; Zhang, Hengyu; Wang, Lifei; Jiang, Haibo; Qi, Xiaojuan |
| contents | Molecular structure elucidation from tandem mass spectra (MS/MS) remains challenging, particularly for de novo generation beyond database coverage. A common approach decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding, which enables the use of large-scale molecular corpora. At deployment, however, the decoder relies on predicted rather than oracle fingerprints, introducing structured errors that propagate into generation. The result is a fundamental condition mismatch: models trained on clean inputs must operate on noisy, biased predictions, especially for long-tail substructures.
We present CoRe-Gen, which explicitly addresses this gap. CoRe-Gen improves the intermediate condition via synthetic-spectrum pretraining of the encoder, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and mitigates residual errors using structure-aware autoregressive decoding with compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. Experiments on standard benchmarks show that CoRe-Gen establishes a new state of the art on NPLIB1, achieving 19.54% Top-1 and 29.92% Top-10 exact-match accuracy, while remaining competitive on the more challenging MassSpecGym benchmark. Importantly, CoRe-Gen preserves the efficiency advantages of autoregressive decoding, providing a practical and scalable solution for robust spectrum-to-structure generation under realistic conditions. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2605_12980 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | CoRe-Gen: Robust Spectrum-to-Structure Generation under Imperfect Fingerprint Conditions |
| topic | Machine Learning; Artificial Intelligence |
| url | https://arxiv.org/abs/2605.12980 |