Enregistré dans:
Détails bibliographiques
Auteurs principaux: Zhang, Mingxu, Shen, Dazhong, Sun, Ying
Format: Preprint
Publié: 2025
Sujets:
Accès en ligne:https://arxiv.org/abs/2512.03080
Tags: Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
_version_ 1866914178227240960
author Zhang, Mingxu
Shen, Dazhong
Sun, Ying
author_facet Zhang, Mingxu
Shen, Dazhong
Sun, Ying
contents Advances in large language models (LLMs) are accelerating discovery in molecular science. However, adapting molecular information to the serialized, token-based processing of LLMs remains a key challenge. Compared to other representations, molecular graphs explicitly encode atomic connectivity and local topological environments, which are key determinants of atomic behavior and molecular properties. Despite recent efforts to tokenize overall molecular topology, there still lacks effective fine-grained tokenization of local atomic environments, which are critical for determining sophisticated chemical properties and reactivity. To address these issues, we introduce AtomDisc, a novel framework that quantizes atom-level local environments into structure-aware tokens embedded directly in LLM's token space. Our experiments show that AtomDisc, in a data-driven way, can distinguish chemically meaningful structural features that reveal structure-property associations. Equipping LLMs with AtomDisc tokens injects an interpretable inductive bias that delivers state-of-the-art performance on property prediction and molecular generation. Our methodology and findings can pave the way for constructing more powerful molecular LLMs aimed at mechanistic insight and complex chemical reasoning.
format Preprint
id arxiv_https___arxiv_org_abs_2512_03080
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle AtomDisc: An Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure--Property Associations
Zhang, Mingxu
Shen, Dazhong
Sun, Ying
Chemical Physics
Artificial Intelligence
Computational Engineering, Finance, and Science
Advances in large language models (LLMs) are accelerating discovery in molecular science. However, adapting molecular information to the serialized, token-based processing of LLMs remains a key challenge. Compared to other representations, molecular graphs explicitly encode atomic connectivity and local topological environments, which are key determinants of atomic behavior and molecular properties. Despite recent efforts to tokenize overall molecular topology, there still lacks effective fine-grained tokenization of local atomic environments, which are critical for determining sophisticated chemical properties and reactivity. To address these issues, we introduce AtomDisc, a novel framework that quantizes atom-level local environments into structure-aware tokens embedded directly in LLM's token space. Our experiments show that AtomDisc, in a data-driven way, can distinguish chemically meaningful structural features that reveal structure-property associations. Equipping LLMs with AtomDisc tokens injects an interpretable inductive bias that delivers state-of-the-art performance on property prediction and molecular generation. Our methodology and findings can pave the way for constructing more powerful molecular LLMs aimed at mechanistic insight and complex chemical reasoning.
title AtomDisc: An Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure--Property Associations
topic Chemical Physics
Artificial Intelligence
Computational Engineering, Finance, and Science
url https://arxiv.org/abs/2512.03080