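| Main Authors: | He, Haonan; Ren, Yuchen; Tang, Yining; Xu, Ziyang; Li, Junxian; Yang, Minghao; Zhang, Di; Yuan, Dong; Chen, Tao; Zhang, Shufei; Li, Yuqiang; Dong, Nanqing; Ouyang, Wanli; Zhou, Dongzhan; Ye, Peng |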
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Biomolecules; Artificial Intelligence; Machine Learning |
| Online Access: | https://arxiv.org/abs/2412.19191 |
| _version_ | 1866915507191414784 |
|---|---|
| author | He, Haonan; Ren, Yuchen; Tang, Yining; Xu, Ziyang; Li, Junxian; Yang, Minghao; Zhang, Di; Yuan, Dong; Chen, Tao; Zhang, Shufei; Li, Yuqiang; Dong, Nanqing; Ouyang, Wanli; Zhou, Dongzhan; Ye, Peng |
| contents | Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. Biology-Instructions is available at: https://github.com/hhnqqq/Biology-Instructions. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2412_19191 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models |
| topic | Biomolecules Artificial Intelligence Machine Learning |
| url | https://arxiv.org/abs/2412.19191 |