Saved in:
| Main Authors: | , , , , , , |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.21964 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866918314198958080 |
|---|---|
| author | Yang, Qianwei Xu, Dong Yang, Zhangfan Yuan, Sisi Zhu, Zexuan Li, Jianqiang Ji, Junkai |
| author_facet | Yang, Qianwei Xu, Dong Yang, Zhangfan Yuan, Sisi Zhu, Zexuan Li, Jianqiang Ji, Junkai |
| contents | Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLM) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2601_21964 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation Yang, Qianwei Xu, Dong Yang, Zhangfan Yuan, Sisi Zhu, Zexuan Li, Jianqiang Ji, Junkai Machine Learning Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLM) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol |
| title | From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2601.21964 |