EvoLSTM
Sequence to Sequence LSTM based evolution simulator
EvoLSTM: Sequence-to-Sequence LSTM-Based Evolution Simulator
Overview
EvoLSTM is a deep learning framework that simulates DNA sequence evolution using Long Short-Term Memory (LSTM) networks. Its sequence-to-sequence model captures mutational patterns that depend on sequence context, which yields more realistic evolutionary simulations. Both training and simulation require an NVIDIA GPU because of the computational demands of the LSTM architecture.
For a comprehensive understanding of the methodology and results, please refer to our publication in Bioinformatics: EvoLSTM: Context-dependent models for sequence evolution using LSTM neural networks
Requirements
- NVIDIA GPU (required for training and simulation)
- Python 3.6+
- TensorFlow 2.0+
- Additional dependencies listed in requirements.txt
Getting Started
1. Cloning the Repository
git clone https://github.com/DongjoonLim/EvoLSTM.git
cd EvoLSTM
2. Setting Up Directories
Create the necessary directories for storing data, preprocessed files, models, and simulation outputs:
mkdir data
mkdir prepData
mkdir models
mkdir simulation
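When the setup is scripted, the same layout can be created from Python. The sketch below is only an illustration and is not part of the repository: the helper name create_project_dirs is hypothetical, and only the four directory names come from the step above.

```python
from pathlib import Path

# Directory names taken from the setup step above.
PROJECT_DIRS = ("data", "prepData", "models", "simulation")


def create_project_dirs(root="."):
    """Create the four working directories under root and return their paths."""
    created = []
    for name in PROJECT_DIRS:
        path = Path(root) / name
        # exist_ok=True makes the call safe to repeat, unlike a plain mkdir.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_project_dirs() from the repository root should be equivalent to the four mkdir commands, with the difference that rerunning it does not fail if the directories already exist.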
3. Downloading Training Data
The training data consists of sequence alignment files and phylogenetic tree information:
- Sequence Alignment MAF Files: Download from the McGill University Repository and place them in the data directory.
- Phylogenetic Tree Structure: Access the tree structure and species nomenclature from the UCSC Genome Browser.
Note: Ancestral sequences are labeled with an underscore prefix (_) followed by the first letters of the descendant species names. For example, the most recent common ancestor of hg38 (human) and panTro4 (chimpanzee) is labeled _HP.
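The naming rule in the note can be illustrated with a few lines of Python. This is a sketch under a strong assumption: it takes the uppercased first letter of each descendant name, which reproduces the _HP example but has not been checked against deeper nodes of the tree. The function name ancestor_label is hypothetical, and the labels in the UCSC tree file remain the authoritative source.

```python
def ancestor_label(*descendants):
    """Build a label such as _HP from descendant assembly names.

    Assumption: the label is "_" plus the uppercased first letter of each
    descendant name. This matches the documented _HP example only.
    """
    if not descendants:
        raise ValueError("at least one descendant name is required")
    return "_" + "".join(name[0].upper() for name in descendants)


print(ancestor_label("hg38", "panTro4"))  # prints "_HP"
```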
4. Installing Dependencies
pip install -r requirements.txt
Workflow
1. Preprocessing Sequences
Generate meta-nucleotide sequences for training:
python3 prep_insert2.py <chromosome> <ancName> <desName>
Parameters:
- chromosome: The chromosome number
- ancName: The name of the ancestral sequence
- desName: The name of the descendant sequence
Example: To preprocess chromosome 2 for the branch from _HP (the most recent common ancestor of hg38 and panTro4) to hg38:
python3 prep_insert2.py 2 _HP hg38
2. Training EvoLSTM
Train the EvoLSTM model with preprocessed sequences:
python3 insert2Train_general.py <ancName> <desName> <train_size> <seq_length>
Parameters:
- ancName: The name of the ancestral sequence
- desName: The name of the descendant sequence
- train_size: The length of the training sequence (recommended starting point: 100,000)
- seq_length: The context length of the sequence (recommended: 15)
Example:
python3 insert2Train_general.py _HP hg38 100000 15
3. Simulating Sequence Evolution
Simulate sequence evolution with the trained model:
python3 simulate.py <ancName> <desName> <sample_size> <gpu_index> <chromosome>
Parameters:
- ancName: The name of the ancestral sequence
- desName: The name of the descendant sequence
- sample_size: Desired input sequence length
- gpu_index: GPU card index (use nvidia-smi to find available GPUs; set to 0 if only one GPU is available)
- chromosome: Chromosome number for the simulation
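On a machine with several cards, choosing gpu_index by hand can be replaced by a small helper that reads the CSV output of nvidia-smi. The sketch below is not part of EvoLSTM. The function names are hypothetical, and "least memory in use" is just one possible selection rule. It also assumes that the memory fields are reported as plain integers in MiB, which may not hold on every device.

```python
import subprocess

# Query flags supported by nvidia-smi; nounits strips the "MiB" suffix.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, name, used, total' lines into a list of dicts."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(parts[0]),
            # Rejoin the middle fields in case the name contains a comma.
            "name": ", ".join(parts[1:-2]),
            "used_mib": int(parts[-2]),
            "total_mib": int(parts[-1]),
        })
    return gpus


def list_gpus():
    """Run nvidia-smi and return the parsed GPU list."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_csv(output)


def least_used_gpu(gpus):
    """Return the index of the GPU with the least memory currently in use."""
    return min(gpus, key=lambda gpu: gpu["used_mib"])["index"]
```

The value returned by least_used_gpu(list_gpus()) can then be passed as the gpu_index argument of simulate.py.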
Example: To simulate the first 100,000 base pairs of the _HP sequence in chromosome 2:
python3 simulate.py _HP hg38 100000 0 2
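The three workflow steps can be chained in a small driver script. The following is a minimal sketch, not a script shipped with the repository: build_pipeline and run_pipeline are hypothetical names, the default values mirror the examples above, and the script names and argument order are taken from the commands in this section.

```python
import subprocess


def build_pipeline(anc_name, des_name, chromosome,
                   train_size=100000, seq_length=15,
                   sample_size=100000, gpu_index=0):
    """Return the preprocessing, training and simulation commands in order."""
    return [
        ["python3", "prep_insert2.py", str(chromosome), anc_name, des_name],
        ["python3", "insert2Train_general.py", anc_name, des_name,
         str(train_size), str(seq_length)],
        ["python3", "simulate.py", anc_name, des_name,
         str(sample_size), str(gpu_index), str(chromosome)],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


# Dry run: only prints the three commands for the _HP -> hg38 example.
run_pipeline(build_pipeline("_HP", "hg38", 2))
```

With dry_run=True the script only prints the commands, which makes it easy to verify the arguments before committing GPU time. Passing dry_run=False runs the steps from the repository root in sequence.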
4. Reading Simulation Output
The simulation output will be saved as simulated_{ancName}_{desName}_{chromosome}.npy. To read this file:
import numpy as np
simulation_data = np.load('simulated__HP_hg38_2.npy')
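Before further analysis it helps to check what the loaded array actually contains. The sketch below builds the file name from the documented pattern and reports the array's basic properties. The helper names and the directory argument are hypothetical, and no assumption is made here about the array's layout or encoding, which should be checked against the simulation script.

```python
from pathlib import Path

import numpy as np


def simulation_path(anc_name, des_name, chromosome, directory="."):
    """Build the output path from the documented file name pattern."""
    name = f"simulated_{anc_name}_{des_name}_{chromosome}.npy"
    return Path(directory) / name


def summarize(path):
    """Load a .npy file and return its shape, dtype and element count."""
    data = np.load(str(path))
    return {"shape": data.shape, "dtype": str(data.dtype), "size": data.size}
```

For the example above, summarize(simulation_path("_HP", "hg38", 2)) reports the dimensions of the simulated data. If the file stores Python objects rather than a plain numeric or string array, np.load raises a ValueError unless allow_pickle=True is passed, which should only be done for files from a trusted source.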
Citation
If you use EvoLSTM in your research, please cite:
@article{10.1093/bioinformatics/btaa440,
author = {Lim, Dongjoon and Kılıç, Ayşe and Liò, Pietro and Won, Kyoung-Jae},
title = "{EvoLSTM: Context-dependent models for sequence evolution using LSTM neural networks}",
journal = {Bioinformatics},
volume = {36},
number = {Supplement_1},
pages = {i353-i361},
year = {2020},
doi = {10.1093/bioinformatics/btaa440}
}
Contact
For questions, issues, or contributions, please open an issue on the GitHub repository.
