Bibliographic Details
Main Authors: Yan, Keqiang; Li, Xiner; Ling, Hongyi; Ashen, Kenna; Edwards, Carl; Arróyave, Raymundo; Zitnik, Marinka; Ji, Heng; Qian, Xiaofeng; Qian, Xiaoning; Ji, Shuiwang
Format: Preprint
Published: 2025
Subjects: Machine Learning, Materials Science
Online Access: https://arxiv.org/abs/2503.00152
_version_ 1866913713005527040
author Yan, Keqiang
Li, Xiner
Ling, Hongyi
Ashen, Kenna
Edwards, Carl
Arróyave, Raymundo
Zitnik, Marinka
Ji, Heng
Qian, Xiaofeng
Qian, Xiaoning
Ji, Shuiwang
contents We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods.
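The invariance issue described in the abstract can be illustrated with a small sketch. This is not the Mat2Seq algorithm, only a toy example under stated assumptions: the function names (`rotate_z`, `lattice_parameters`, `naive_sequence`, `invariant_sequence`) and the orthorhombic test lattice are hypothetical, and only the rotation part of SE(3) is covered (periodic invariance is not addressed). It shows that serializing raw Cartesian lattice vectors gives different strings for two orientations of the same lattice, whereas serializing lattice lengths and angles does not.

```python
import math


def rotate_z(v, theta):
    """Rotate a 3D vector about the z-axis by theta radians."""
    x, y, z = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)


def lattice_parameters(lattice):
    """Return (a, b, c, alpha, beta, gamma), with angles in degrees."""

    def norm(u):
        return math.sqrt(sum(x * x for x in u))

    def angle(u, v):
        cosine = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        # Clamp to [-1, 1] to guard against floating-point drift.
        return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

    a, b, c = lattice
    return (norm(a), norm(b), norm(c), angle(b, c), angle(a, c), angle(a, b))


def naive_sequence(lattice):
    """Serialize raw Cartesian lattice vectors (orientation dependent)."""
    return " ".join(f"{x:.3f}" for vec in lattice for x in vec)


def invariant_sequence(lattice):
    """Serialize lengths and angles, which rigid rotations leave unchanged."""
    return " ".join(f"{p:.3f}" for p in lattice_parameters(lattice))


# Hypothetical orthorhombic lattice and a copy rotated by 30 degrees about z.
lattice = [(4.0, 0.0, 0.0), (0.0, 5.0, 0.0), (0.0, 0.0, 6.0)]
rotated = [rotate_z(v, math.radians(30)) for v in lattice]

print(naive_sequence(lattice) == naive_sequence(rotated))
print(invariant_sequence(lattice) == invariant_sequence(rotated))
```

A complete invariant tokenization would also need to handle translations, the choice of unit cell, and atom ordering, which this sketch leaves out.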
format Preprint
id arxiv_https___arxiv_org_abs_2503_00152
institution arXiv
publishDate 2025
record_format arxiv
title Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation
topic Machine Learning
Materials Science
url https://arxiv.org/abs/2503.00152