Important

We ask you to cite the main publication related to this software whenever you use any part of this software in any scientific publication.

You may use the following .bibtex to cite the main publication of this software:

@article{10.1093/bib/bbaa185,
    author = {da Cruz, Murilo Horacio Pereira and Domingues, Douglas Silva and Saito, Priscila Tiemi Maeda and Paschoal, Alexandre Rossi and Bugatti, Pedro Henrique},
    title = "{TERL: classification of transposable elements by convolutional neural networks}",
    journal = {Briefings in Bioinformatics},
    year = {2020},
    month = {09},
    issn = {1477-4054},
    doi = {10.1093/bib/bbaa185},
    url = {https://doi.org/10.1093/bib/bbaa185},
    note = {bbaa185},
    eprint = {https://academic.oup.com/bib/advance-article-pdf/doi/10.1093/bib/bbaa185/33724804/bbaa185.pdf},
}

Topics

Here are all the topics:

Instalation
TERL Usage
Pre-trained models
Databases used in this work

Instalation

To install TERL you need to clone the repository into your local machine. First you need to have git installed in your local machine. You can follow these steps to install git. Once you have git installed, you can clone this repository with the following command:

git clone https://github.com/muriloHoracio/TERL

After the clone, you have a directory named TERL which contains all codes to train TERL and classify sequences.

Since TERL is made on Python 3.6 and use some libraries, we recomend the use of virtual environmnets to run it. Before installing virtualenv, make sure you have pip3 installed. To install pip3 run the followoing command:

sudo apt-get install python3-pip

Update pip3 by running the following command:

sudo -H pip3 install --upgrade pip

Also, you need to install setup-tools by running the following command:

sudo apt-get install python3-setuptools

To create a virtual environment you need to have virtualenv installed, in order to do that you can run the following command:

sudo apt-get install virtualenv

To create the virtual environment, you need to execute the following command:

virtualenv -p python .venv

Once the virtual environment is created, you need to active the environment in order to install the dependencies of TERL. To do this, run the following command:

. .venv/bin/activate

If everything worked well, you will notice that (.venv) will appear before the user name in the command line.

Now you must install the dependencies needed to run TERL.

If you are using GPU, you must run the following command:

pip3 install -r requirements-gpu.txt

Otherwise, you must run the following command:

pip3 install -r requirements.txt

If everything went well, you have just installed TERL and are ready to train your on model on your sequences or classify some sequences based on a previously trained model.

TERL

Transposable Elements Representation Learner

The TERL can be used to classify any genomic sequence. This framework provides tools to train and test models. Users can opt to deploy the trained network model. There are vast parameters that can be used to define the network architecture and to set the model's parameters. All the set of parameters are described here with examples of usage.

Dataset Organization

In order to use this framework to train and test CNNs models for genomic data, users need to organize the structure of dataset's files The files should be stored in the following way:

Root
└─── Train
|    └─── Class1.fa
|    └─── Class2.fa
|    └─── Class3.fa
|    └─── Class4.fa
└─── Test
     └─── Class1.fa
     └─── Class2.fa
     └─── Class3.fa
     └─── Class4.fa

The filenames on Train and Test folders must be identicals and reflect the class that each file represents.

Train Usage (terl_train.py)

This is an example how to train a model with TERL. The files are stored over Train and Test folders, which are stored on Dataset folder, sotored in the TERL folder.

Dataset
└─── Train
|    └─── LTR.fa
|    └─── LINE.fa
|    └─── SINE.fa
└─── Test
     └─── LTR.fa
     └─── LINE.fa
     └─── SINE.fa

This example model have the following architecture:

       Architecture: conv    pool    conv    pool    fc      fc      
          Functions: relu    avg     relu    avg     relu    relu    
             Widths: 30      20      30      20      1500    500     
            Strides: 1       20      1       20      
       Feature maps: 64      -       32      -       -       -

Example:

python3 terl_train.py -r Dataset -l 6 -a conv pool conv pool fc fc -f relu avg relu avg relu relu -w 30 20 30 20 1500 500 -s 1 20 1 20 -fm 64 32 -sg -sr -sm

Train Parameters

This section describes the parameters with its possible values and examples of usage.

-r, --root

Required parameter that defines the Root folder, where Train and Test folders containing sample sequences files are located.

It can be the relative or absolute path to Root

Example:

python3 terl_train.py -r ~/TERL/Datasets/DS1

-l, --layers

Parameter that defines the number of layers that will be created. It checks if the number of defined layers in the model is correct. The layers can be defined without this parameter, but it is a good practice to use it to guarantee that the model have the correct number of layers.

Default value is 8, which is the number of layers of the default model.

Example:

python3 terl_train.py -l 6

-a, --architecture

Parameter that defines the architecture of the model. This defines the types of the layers and its order in the model.

Input and classification layer should not be included.

The supported values are:

conv (Convolution layer)
pool (Pooling layer)
fc (Fully connected layer)

Default value is: conv pool conv pool conv pool fc fc

Example:

python3 terl_train.py -a conv pool conv pool fc fc

-f, --functions

Parameter that defines the functions of each layer. The functions should be entered according to the --architecture parameter, i.e. the first option should be the function of the first layer defined in --architecture, the second option the function of the second layer and so on...

The available activation functions for convolution and fully connected layers are:

relu
tanh
sigmoid
leaky_relu
elu

The available funcions for pooling layers are:

Default value is: -f relu avg relu avg relu avg relu relu

Example:

python3 terl_train.py -f relu avg relu avg relu relu

-w, --widths

Parameter that defines the widths of the filters (convolution and pooling) and the number of neurons for fully connected layers. The values should be entered according to the --architecture parameter, i.e. the first value should be the width of the first layer filter, the second value should be the width of the second layer's filter, and so on...

Default parameter is: -w 30 20 30 20 30 10 1500 500

Example:

python3 terl_train.py -w 30 20 30 20 1500 500

-s, --strides

Parameter that defines the strides of the layers of the model. The values should be entered according to the parameter --architecture, i.e. the first value should be the stride of the first layer, the second value the stride of the second layer, and so on...

Default value is: -s 1 20 1 20 1 10

Example:

python3 terl_train.py -s 1 20 1 20

-fm, --feature-maps

Parameter that defines the amount of feature maps of each convolution layer. It should be entered n values for a network with n convolution layers.

Default value is: -fm 64 32 16

Example:

python3 terl_train.py -fm 64 32

-o, --optimizer

Parameter that defines the optimizer that will be used to train the model and optimize the values of the learnable parameters (i.e. weights) of the model.

The available optimizers are:

adam
adadelta
adagrad
ftrl
rmsprop
grad_desc

Default value is: -o adam

Example:

python3 terl_train.py -o adagrad

-lr, --learning-rate

Parameter that defines the learning rate to be used by the optimizer.

Default value is: -lr 0.001

Example:

python3 terl_train.py -lr 0.001

-trb, --train-batch-size

Parameter that defines the train batch size. The train batch is the amount of samples that will be presented to the network for each step during training.

Default value is: 32

Example:

python3 terl_train.py -trb 64

-tsb, --test-batch-size

Parameter that defines the test batch size. The test batch is the amount of samples that will be presented to the network for each step during testing.

Default value is: 32

Example:

python3 terl_train.py -tsb 64

-e, --epochs

Parameter that defines the number of epochs that training will be executed. In each epoch all training samples are presented to the network during training.

Default value is: 30

Example:

python3 terl_train.py -e 100

-d, --dropout

Parameter that defines the dropout rate that is used to drop neurons in each convolution and fully connected layer in the model.

Default value is: 0.5

Example:

python3 terl_train.py -d 0.3

-sg, --save-graphs

Parameter that sets confusion matrix and learning curve graphs to be saved. The title of the graphs are defined in the --confusion-matrix-title and --learning-curve-title parameters.

By default, graphs are not saved, meaning you need to set it if you really want to save them.

Example:

python3 terl_train.py -sg

-cmt, --confusion-matrix-title

Parameter that define

TERL

Install / Use

README

Important

Topics

Instalation

TERL

Dataset Organization

Train Usage (terl_train.py)

Train Parameters

-r, --root

-l, --layers

-a, --architecture

-f, --functions

-w, --widths

-s, --strides

-fm, --feature-maps

-o, --optimizer

-lr, --learning-rate

-trb, --train-batch-size

-tsb, --test-batch-size

-e, --epochs

-d, --dropout

-sg, --save-graphs

-cmt, --confusion-matrix-title