Saved in:
Bibliographic Details
Main Authors: He, Haonan, Ren, Yuchen, Tang, Yining, Xu, Ziyang, Li, Junxian, Yang, Minghao, Zhang, Di, Yuan, Dong, Chen, Tao, Zhang, Shufei, Li, Yuqiang, Dong, Nanqing, Ouyang, Wanli, Zhou, Dongzhan, Ye, Peng
Format: Preprint
Published: 2024
Subjects:
Online Access:https://arxiv.org/abs/2412.19191
Tags: Add Tag
No Tags, Be the first to tag this record!
Table of Contents:
  • Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. The Biology-Instructions is publicly available at: https://github.com/hhnqqq/Biology-Instructions.