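| Main authors: | Yan, Keqiang; Li, Xiner; Ling, Hongyi; Ashen, Kenna; Edwards, Carl; Arróyave, Raymundo; Zitnik, Marinka; Ji, Heng; Qian, Xiaofeng; Qian, Xiaoning; Ji, Shuiwang |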
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Materials Science |
| Online access: | https://arxiv.org/abs/2503.00152 |
| _version_ | 1866913713005527040 |
|---|---|
| author | Yan, Keqiang; Li, Xiner; Ling, Hongyi; Ashen, Kenna; Edwards, Carl; Arróyave, Raymundo; Zitnik, Marinka; Ji, Heng; Qian, Xiaofeng; Qian, Xiaoning; Ji, Shuiwang |
| author_facet | Yan, Keqiang; Li, Xiner; Ling, Hongyi; Ashen, Kenna; Edwards, Carl; Arróyave, Raymundo; Zitnik, Marinka; Ji, Heng; Qian, Xiaofeng; Qian, Xiaoning; Ji, Shuiwang |
| contents | We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2503_00152 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation |
| topic | Machine Learning; Materials Science |
| url | https://arxiv.org/abs/2503.00152 |
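
The abstract's point that a file-style serialization "may not lead to unique sequence representations" can be illustrated with a toy example. The sketch below is not Mat2Seq and not a CIF parser: the two-atom cell, the `naive_sequence` serializer, and the `shift` helper are all hypothetical names introduced here for illustration. It only shows that writing atoms out in the order and origin they happen to be given in yields different strings for the same crystal.

```python
from fractions import Fraction


def naive_sequence(atoms):
    """Serialize atoms in the given order: element, then fractional coordinates."""
    return " ".join(f"{el} {x} {y} {z}" for el, (x, y, z) in atoms)


def shift(atoms, t):
    """Translate every atom by t and wrap the result back into the cell [0, 1)."""
    return [(el, tuple((c + d) % 1 for c, d in zip(pos, t))) for el, pos in atoms]


half = Fraction(1, 2)

# Hypothetical two-atom cell in fractional coordinates (illustrative only).
crystal = [
    ("Cs", (Fraction(0), Fraction(0), Fraction(0))),
    ("Cl", (half, half, half)),
]

# Same crystal, different description: origin moved by (1/2, 1/2, 1/2).
shifted = shift(crystal, (half, half, half))

# Same crystal, different description: atoms listed in reverse order.
reordered = list(reversed(crystal))

seq_a = naive_sequence(crystal)
seq_b = naive_sequence(shifted)
seq_c = naive_sequence(reordered)

# All three strings describe the same periodic structure, yet they differ.
print(seq_a)
print(seq_b)
print(seq_c)
```

A language model trained on such strings would see three unrelated token sequences for one material. Per the abstract, an invariant tokenization instead maps every equivalent description to a single sequence.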