Bibliographic Details
Main Authors: Yu, Tong, Cheng, Lei, Khalitov, Ruslan, Olsson, Erland Brandser, Yang, Zhirong
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2405.08538
author Yu, Tong
Cheng, Lei
Khalitov, Ruslan
Olsson, Erland Brandser
Yang, Zhirong
contents Self-supervised pretraining (SSP) has been recognized as a method to enhance prediction accuracy in various downstream tasks. However, its efficacy for DNA sequences remains somewhat constrained. This limitation stems primarily from the fact that most existing SSP approaches in genomics focus on masked language modeling of individual sequences, neglecting the crucial aspect of encoding statistics across multiple sequences. To overcome this challenge, we introduce an innovative deep neural network model, which incorporates collaborative learning between a "student" and a "teacher" subnetwork. In this model, the student subnetwork employs masked learning on nucleotides and progressively adapts its parameters to the teacher subnetwork through an exponential moving average approach. Concurrently, both subnetworks engage in contrastive learning, deriving insights from two augmented representations of the input sequences. This self-distillation process enables our model to effectively assimilate both contextual information from individual sequences and distributional data across the sequence population. We validated our approach with preliminary pretraining using the human reference genome, followed by applying it to 20 downstream inference tasks. The empirical results from these experiments demonstrate that our novel method significantly boosts inference performance across the majority of these tasks. Our code is available at https://github.com/wiedersehne/FinDNA.
format Preprint
id arxiv_https___arxiv_org_abs_2405_08538
institution arXiv
publishDate 2024
record_format arxiv
title Self-Distillation Improves DNA Sequence Inference
topic Machine Learning
url https://arxiv.org/abs/2405.08538
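
The abstract says the student and teacher subnetworks are coupled through an exponential moving average (EMA) but gives no formula. The sketch below shows one plausible reading, assuming the convention common in self-distillation where the teacher's parameters track an EMA of the student's. The function name `ema_update`, the `decay` parameter, and the flat dictionary of scalar parameters are all hypothetical and are not taken from the FinDNA code.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher parameters moved toward the student's by an EMA step.

    Assumption: the teacher is the averaged network. The record's abstract
    does not specify the direction of the update or the decay value.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must have the same parameter names")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    # new_teacher = decay * old_teacher + (1 - decay) * student, per parameter
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Usage: one hypothetical parameter, updated over a few steps.
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
```

A decay close to 1 makes the averaged network change slowly, which is the usual reason for using an EMA in this kind of setup; the value used by the authors is not stated in this record.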