<h2>Evodictor and User Manual</h2>

<h3>Overview of Evodictor</h3>

Evodictor is a software package for learning patterns in, and predicting the future of, evolution by gains and losses of given binary traits (e.g., gene presence/absence). Evodictor takes as input a phylogenetic tree and the presence/absence profiles of every trait for all extant and ancestral species in the tree, then predicts the gain/loss probability of a target trait from the trait repertoire of a given species (e.g., presence/absence of every gene in the genome of the species). To predict trait gain/loss, Evodictor learns from past gain/loss evolution across diverse species which traits tend to be present or absent before the target trait is gained or lost. Evodictor was introduced in Konno and Iwasaki, Science Advances, 2023, where it was demonstrated to predict gene gain/loss evolution of bacterial metabolic systems.

Figure 1. Overview of Evodictor for gene gain/loss prediction.

<h3>Supported Environment</h3>

Evodictor can be executed on Linux and macOS.

<h3>Software Dependency</h3>

<h4>Required</h4>

Python 3 (version 3.7.0 or later) with the biopython, scipy, numpy, imblearn, and scikit-learn modules.

You can install these Python modules using conda:

```
conda install -c conda-forge biopython imbalanced-learn numpy scikit-learn scipy
```
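To confirm that the modules are importable after installation, a quick check along the following lines may help. This is a minimal sketch and not part of Evodictor; the helper name `missing_modules` is hypothetical. Note that some import names differ from the package names: biopython is imported as `Bio`, scikit-learn as `sklearn`, and imbalanced-learn as `imblearn`.

```python
import importlib.util

# Package name -> import name (these differ for some packages).
REQUIRED = {
    "biopython": "Bio",
    "scipy": "scipy",
    "numpy": "numpy",
    "imbalanced-learn": "imblearn",
    "scikit-learn": "sklearn",
}


def missing_modules(required=REQUIRED):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_modules()
if missing:
    print("Missing: " + ", ".join(missing))
else:
    print("All required modules were found.")
```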

<h3>Software installation</h3>

Each installation step takes less than about 1 min.

<h4>Installation of Evodictor</h4>

Download Evodictor:

```
git clone https://github.com/IwasakiLab/Evodictor.git
```

Add the absolute path of the xxx/src directory to $PATH, where xxx is the location of the cloned Evodictor directory.

Make the files in xxx/src executable:
```
chmod u+x xxx/src/*
```

<h3>Sample Codes</h3>

This repository contains an example input file in the examples directory so users can quickly try predicting gene gain/loss evolution using Evodictor step-by-step:

<h4>Step 1: Dataset Generation</h4>

Generate a dataset for machine learning from a phylogenetic tree and presence/absence profiles of every trait for all the extant and the ancestral species in the tree to predict gene gain of a target ortholog group (K00005 in this example)

```
evodictor generate --target K00005 -X OG_node_state.txt -y OG_node_state.txt -t example.tree --predictor feature_OG.txt --gl gain > branch_X_y.txt
```

Or you can type "xygen" instead of "evodictor generate".

Input:

example.tree: A phylogenetic tree in a Newick format.

OG_node_state.txt: The presence/absence profile of every ortholog group (OG) for every tip node (extant species) and every internal node (ancestors) of example.tree. There is one row for every internal/tip node in this file. The first, second, and third columns of every row indicate the OG name, node name, and the presence/absence state, respectively. The presence/absence state is represented as 0 (absent), 1 (present), or 0.5 (uncertain; for ancestors). Rows for which states are 0 can be omitted in this file (in other words, states of nodes not defined in this file are treated as 0).
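The sparse three-column layout described above can be read as in the following sketch. It is an illustration only, not Evodictor's own parser: it assumes the columns are whitespace-delimited (the delimiter is not stated here), and the function names and node names (`speciesA`, `node1`) are hypothetical.

```python
def parse_node_states(lines):
    """Return {(og, node): state} from rows of: OG name, node name, state."""
    states = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        og, node, state = fields[:3]
        states[(og, node)] = float(state)
    return states


def state_of(states, og, node):
    """Look up a state; pairs not listed in the file are treated as 0 (absent)."""
    return states.get((og, node), 0.0)


# Hypothetical rows: present (1), uncertain ancestor (0.5); absent rows omitted.
example_rows = [
    "K00001\tspeciesA\t1",
    "K00001\tnode1\t0.5",
    "K00005\tspeciesB\t1",
]
states = parse_node_states(example_rows)
print(state_of(states, "K00001", "node1"))     # 0.5
print(state_of(states, "K00005", "speciesA"))  # 0.0
```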

feature_OG.txt: Correspondence between OGs (e.g., K00001) and features (defined as groups of OGs; e.g., M00001). The input of the machine learning model in Evodictor is the vector in which every dimension (feature) corresponds to the number of present OGs included in the feature.
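As a rough illustration of how such a feature vector is formed, the sketch below counts the present OGs in each feature. The OG memberships shown are invented for the example and are not real module definitions, the function name is hypothetical, and how Evodictor counts uncertain (0.5) states is not covered here.

```python
def feature_vector(present_ogs, feature_to_ogs, feature_order):
    """Count, for each feature, how many of its member OGs are present."""
    return [len(present_ogs & feature_to_ogs[f]) for f in feature_order]


# Hypothetical feature definitions (feature name -> set of member OGs).
feature_to_ogs = {
    "M00001": {"K00001", "K00002", "K00003"},
    "M00002": {"K00003", "K00004"},
}

# OGs present in one hypothetical species.
present = {"K00001", "K00003"}

print(feature_vector(present, feature_to_ogs, ["M00001", "M00002"]))  # [2, 1]
```

An OG that belongs to several features (K00003 here) contributes to the count of each of them.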

Output:

branch_X_y.txt: The dataset for machine learning, which can be an input file of evodictor predict. The first row is the header, and each of the following rows corresponds to a branch in example.tree. The first, second, and third columns of every row indicate the node name of the parental species of the branch, the number of present traits of every feature in the parental species (separated by ;), and whether the predicted OG (K00005) was gained at the branch (1: the gene was gained at the branch; 0: the gene was not gained at the branch).
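A data row of this file can be split into its three parts as sketched below. The sketch assumes the three columns are tab-separated, which is not stated here, and the function name and the row shown are hypothetical.

```python
def parse_branch_row(row):
    """Split one data row into (parent node name, feature counts, gain label)."""
    node, counts, label = row.rstrip("\n").split("\t")
    x = [float(value) for value in counts.split(";")]
    return node, x, int(label)


# Hypothetical row: parent node, three feature counts, and a gain event (1).
node, x, y = parse_branch_row("node12\t2;0;1\t1")
print(node, x, y)  # node12 [2.0, 0.0, 1.0] 1
```

When reading the whole file, skip the first row, since it is the header.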

<h4>Step 2: Feature Selection</h4>

Select top-20 important input features based on ANOVA F-value to predict gene gain of an OG (K00005).

```
evodictor select -i branch_X_y.txt --skip_header --o1 feature_importance.txt --o2 selection_result.txt --o3 branch_X_y.selected.txt -k 20
```

Or you can type "selevo" instead of "evodictor select".
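For readers unfamiliar with the ANOVA F-value, the sketch below computes it per feature in plain Python and marks the top-k features. It illustrates the general statistic only and is not Evodictor's implementation; the function names and the toy data are hypothetical.

```python
def anova_f(values, labels):
    """One-way ANOVA F-value of one feature across the classes in `labels`."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2:
        raise ValueError("need at least two classes")
    grand_mean = sum(values) / n
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    # Between-class mean square divided by within-class mean square.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k_mask(X, y, k):
    """Return per-feature F-values and a 1/0 mask of the k highest-scoring features."""
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = set(ranked[:k])
    return scores, [1 if j in keep else 0 for j in range(len(scores))]


# Toy data: 4 branches, 2 features; y marks branches with a gain event.
X = [[0, 5], [1, 5], [3, 4], [4, 6]]
y = [0, 0, 1, 1]
scores, mask = top_k_mask(X, y, k=1)
print(scores)  # [18.0, 0.0]
print(mask)    # [1, 0]
```

The first feature separates the two classes well and gets a high F-value; the second has the same mean in both classes and scores 0.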

Input:

branch_X_y.txt: The file generated in Step 1

Output:

feature_importance.txt : Importance (ANOVA F-value) of every feature

selection_result.20.txt : Binary values indicating whether each feature was included in top-20 important features or not (1: selected, 0: not selected)

branch_X_y.selected.20.txt : The dataset for machine learning, containing only the selected top-20 important features, which can be an input file of evodictor predict.

<h4>Step 3: Cross-validation</h4>

Conduct three-fold cross-validation of gene gain prediction by logistic regression for an OG (K00005).

```
evodictor predict -i branch_X_y.selected.20.txt -c -k 3 -m LR --header > cross_validated_AUCs.txt
```

Input:

branch_X_y.selected.20.txt : The file generated in Step 2

Output:

cross_validated_AUCs.txt : List of the three AUCs (AUROCs) measured by three-fold cross validation
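To make the reported metric concrete, the sketch below computes an AUROC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. It is a plain-Python illustration with a hypothetical function name and toy data, not the code Evodictor uses.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy held-out fold: true gain labels and predicted gain probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of gain and non-gain branches.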

<h4>Step 4: Future gene gain prediction</h4>

Train a logistic regression model and predict the future gene gain probability of an OG (K00005) for every species. All the features are used for model training and prediction in this example. You can also conduct prediction with only selected features by changing two of the input files: feature_OG.txt and branch_X_y.txt.

```
evodictor generate --target K00005 -X OG_node_state.txt -y OG_node_state.txt -t example.tree --predictor feature_OG.txt --gl gain --ex > extant_X.txt
evodictor predict -m LR --header -i branch_X_y.txt -t extant_X.txt > species_probability.txt
```

Input:

example.tree: The same input file as Step 1

OG_node_state.txt: The same input file as Step 1

feature_OG.txt: The same input file as Step 1

branch_X_y.txt: The file generated in Step 1

Output:

extant_X.txt : List of input feature vectors of extant species (i.e., tip nodes of example.tree). The first row is a header. The first and second columns in each of the following rows indicate an extant species name and the number of present traits for every feature in the extant species (separated by ;).

species_probability.txt : Predicted gene gain probability of K00005 for every extant species. The first and second columns in each row indicate an extant species name and the predicted gene gain probability.
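Once the predictions are written, the species most likely to gain the target OG can be listed as in the sketch below. It assumes the two columns are tab-separated and skips any row whose second column is not a number; both the delimiter and the function name are assumptions, and the species names are hypothetical.

```python
def rank_by_probability(lines, top=5):
    """Return the `top` (species, probability) pairs with the highest probability."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        try:
            rows.append((fields[0], float(fields[1])))
        except ValueError:
            continue  # skip rows without a numeric probability
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


# Hypothetical rows in the layout of species_probability.txt.
example_rows = ["speciesA\t0.12", "speciesB\t0.87", "speciesC\t0.45"]
print(rank_by_probability(example_rows, top=2))
# [('speciesB', 0.87), ('speciesC', 0.45)]
```

To apply it to the real output, pass an open file handle, for example `rank_by_probability(open("species_probability.txt"))`.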

<h3>Usage</h3>

<h4>evodictor generate / xygen</h4>

```
usage: evodictor generate [-h] [-v] [-p] [--target TARGET] [-X SPARSE_X] [-y SPARSE_Y]
             [-t TREE] [--predictor PREDICTOR] [--gl GL] [-m MODE] [--ex]

evodictor generate

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Print evodictor version (default: False)
  -p, --print           Print all arguments (default: False)
  --target TARGET       [Required] Prediction target (eg. 'R00001')
  -X SPARSE_X, --sparse_X SPARSE_X
                        [Required] Sparse matrix file path for input features
                        X
  -y SPARSE_Y, --sparse_y SPARSE_Y
                        [Required] Sparse matrix file path for output y
  -t TREE, --tree TREE  [Required] Tree file path
  --predictor PREDICTOR
                        [Required] Predictor definition file path
  --gl GL               [Required] Specify 'gain' or 'loss'
  -m MODE, --mode MODE  Mode of dataset generator (default: 'define')
  --ex                  Print only X for extant species (default: False)
```

<h4>evodictor select / selevo</h4>

```
usage: evodictor select [-h] [-v] [-p] [-i INPUT] [-m METHOD] [--scores SCORES]
              [--mask MASK] [--newXygen NEWXYGEN] [-n NORMALIZE] [-k K]
```
