EvoMega (Co-Evolutionary + MegaDNA)

Predicting bacteriophage transcription factor binding sites from evolutionary relationships and a large language model.

Introduction
Installation
Usage
- Running the Motif Analyzer
- Example
Output Explanation
Troubleshooting
License
Contact

Introduction

EvoMega predicts transcription factor binding sites in bacteriophage genomes. It combines co-evolutionary relationships with the MegaDNA language model to identify key motifs relevant to phage biology and phage-host interactions.

Installation

Follow the steps below to install EvoMega. Related tasks are grouped into sections so that each can be completed in one pass.

Prerequisites

Ensure that your system meets the following requirements:

Operating System: Linux-based systems are recommended.
Conda: For managing packages and environments (optional but recommended).
Python: Version 3.6 or higher.

Create EvoMega Environment

Install Conda (if not already installed):

Follow the official Conda installation guide for your platform.
Create a Conda environment named EvoMega with Python version 3.8.10:
```
conda create -n EvoMega python=3.8.10
```
Activate the EvoMega environment:
```
conda activate EvoMega
```

Install Dependencies and Tools

Update Package Lists and Install Essential Build Tools:

sudo apt update
sudo apt install -y build-essential gcc g++ make zlib1g-dev libbz2-dev liblzma-dev

Install Compression Tools:

Install either 7za or unzip; only one of them is needed.
Using 7za:
```
sudo apt install -y p7zip-full
```
Or using unzip:
```
sudo apt install -y unzip
```
Install hmmer Using Conda:
```
conda install -c bioconda hmmer
```
Install Python Dependencies:

Ensure that a requirements.txt file exists in the project root directory. Then run:
```
pip install -r requirements.txt
```

Download and Set Up Databases and Models

Clone EvoMega Repository:

The URL below is a placeholder; replace it with the actual repository link.
```
git clone https://github.com/yourusername/EvoMega.git
cd EvoMega
```

Download EvoMega Databases:

wget --save-cookies /tmp/cookies.txt --no-check-certificate "https://drive.usercontent.google.com/download?id=1569KsNmwhVuVNduQNfQ2_KLWx1v_fqGo&export=download&authuser=0&confirm=t&uuid=f30f18ad-3133-4dc9-bed3-cd95a448f69f&at=APvzH3rXKE6IjUztvq4HwPbot34Y:1734682290410" -O exclude_GPD_find_key_motif.zip
wget --save-cookies /tmp/cookies.txt --no-check-certificate "https://drive.usercontent.google.com/download?id=1FQyELLO9tk6h6uF7mBxORmkJ8FTc5ib5&export=download&authuser=0&confirm=t&uuid=45b4aea8-b4ff-45fa-89a2-bb39b1dd24fe&at=APvzH3pjYOfsjc_1FQRUKXXu0t4I:1735826436323" -O tfbs_model.joblib
rm -f /tmp/cookies.txt
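
Downloads from Google Drive can occasionally save an HTML page under the target file name instead of the file itself. The following is a minimal sketch of a sanity check, not part of EvoMega; the function name `classify_download` is hypothetical, and the check only looks at the first bytes of each file and at whether the archive is a readable zip.

```python
import zipfile
from pathlib import Path


def classify_download(path):
    """Return 'missing', 'empty', 'html', 'zip', or 'binary' for a downloaded file."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    with p.open("rb") as fh:
        head = fh.read(512).lstrip().lower()
    # An HTML page saved under the target name means the real file was not served.
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    if zipfile.is_zipfile(str(p)):
        return "zip"
    return "binary"


# File names as used in the wget commands above.
for name in ("exclude_GPD_find_key_motif.zip", "tfbs_model.joblib"):
    print(f"{name}: {classify_download(name)}")
```

A result of `zip` for the database archive and `binary` for the model file is what one would expect; `html`, `empty`, or `missing` suggests the download should be repeated.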

Extract the Downloaded Database:

Using 7za:

7za x exclude_GPD_find_key_motif.zip -mmt=on

Or using unzip:

unzip exclude_GPD_find_key_motif.zip

Download Additional Models and Databases:
- MegaDNA Model:
  
  Download megaDNA_phage_145M.pt from Hugging Face and place it in scripts/model/.
- Pfam-A.hmm:
```
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/Pfam-A.hmm.gz
gzip -d Pfam-A.hmm.gz
# Create the target directory first in case it does not exist yet
mkdir -p scripts/Pfam
mv Pfam-A.hmm scripts/Pfam/
```
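
HMMER's hmmscan needs a database that has been indexed with hmmpress, which writes four files (.h3f, .h3i, .h3m, .h3p) next to the HMM file. Whether EvoMega runs hmmpress itself is not stated here, so the sketch below only reports whether those index files exist; the function name is hypothetical.

```python
from pathlib import Path

# File suffixes written by `hmmpress` next to the HMM database.
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_hmmpress_indexes(hmm_path):
    """Return the hmmpress index files that do not exist for hmm_path."""
    base = str(hmm_path)
    return [base + s for s in HMMPRESS_SUFFIXES if not Path(base + s).is_file()]


missing = missing_hmmpress_indexes("scripts/Pfam/Pfam-A.hmm")
if missing:
    print("Index files not found; consider running: hmmpress scripts/Pfam/Pfam-A.hmm")
else:
    print("Pfam-A.hmm index files are present.")
```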

Install MEME Suite

The MEME Suite is essential for motif analysis. Follow these steps to install it:

Download MEME Suite:

wget https://meme-suite.org/meme/meme-software/5.5.7/meme-5.5.7.tar.gz

Extract the Archive and Navigate to the Directory:
```
tar -xzvf meme-5.5.7.tar.gz
cd meme-5.5.7
```
Configure, Compile, and Install MEME Suite:
```
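# Set the prefix explicitly so the binaries land in /usr/local/meme/bin,
# which is the directory added to PATH in the next section
./configure --prefix=/usr/local/meme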
make
sudo make install
```

Configure Environment Variables

Add MEME Suite Binaries to Your PATH:

echo 'export PATH=$PATH:/usr/local/meme/bin' >> ~/.bashrc
source ~/.bashrc

Verify MEME Suite Installation:
```
meme -version
```
You should see the installed version of MEME Suite. If not, revisit the Install MEME Suite section.
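
To check all external command-line tools at once, a short script can look each one up on PATH. This is a sketch under assumptions: the tool list (meme, fimo, and tomtom from MEME Suite, hmmscan from HMMER) is inferred from the output files described in this README, not taken from the EvoMega source.

```python
import shutil

# Assumed list of executables, inferred from this README's output description.
REQUIRED_TOOLS = ("meme", "fimo", "tomtom", "hmmscan")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


not_found = missing_tools()
if not_found:
    print("Not on PATH: " + ", ".join(not_found))
else:
    print("All tools found.")
```

Run it inside the activated EvoMega environment, since hmmscan is installed there by Conda.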

Summary of Directory Structure

Ensure that the required databases and models are placed in the appropriate directories within your project:

EvoMega/
├── scripts/
│   ├── analysis_meme_file.py
│   ├── feature.py
│   ├── gene_interaction.py
│   ├── key_function.py
│   ├── model/
│   │   ├── megaDNA_phage_145M.pt
│   │   └── tfbs_model.joblib
│   ├── model_scoring.py
│   ├── motif_analysis.py
│   ├── motif_analyzer.py
│   ├── MPI_DFI.py
│   ├── one_key_analysis.py
│   └── Pfam/
│       └── Pfam-A.hmm
├── exclude_GPD_find_key_motif/  # Extracted database
├── requirements.txt
└── README.md
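
The layout above can be checked with a few lines of Python before the first run. This is an illustrative sketch: the path list is copied from the tree, and `missing_paths` is a hypothetical helper. Note that the download commands earlier save tfbs_model.joblib to the current directory, so it may need to be moved into scripts/model/.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the EvoMega root.
EXPECTED_PATHS = (
    "scripts/motif_analyzer.py",
    "scripts/model/megaDNA_phage_145M.pt",
    "scripts/model/tfbs_model.joblib",
    "scripts/Pfam/Pfam-A.hmm",
    "exclude_GPD_find_key_motif",
    "requirements.txt",
)


def missing_paths(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Run from the EvoMega root directory.
for rel in missing_paths():
    print(f"missing: {rel}")
```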

Usage

Running the Motif Analyzer

To use the motif analyzer, execute the following command:

python scripts/motif_analyzer.py -i INPUT_FILE -o OUTPUT_PATH

Parameters:

-i or --input_file: Path to the input file (e.g., GenBank file).
-o or --output_path: Directory where the output will be saved.

Example

Here is an example of how to run the motif analyzer with a sample input file:

python scripts/motif_analyzer.py -i NC_001416.gbk -o scripts/output

This command analyzes the NC_001416.gbk GenBank file and saves the results in the scripts/output directory.
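
To analyze several genomes, the same command can be wrapped in a small loop. The sketch below assumes that motif_analyzer.py returns a non-zero exit code on failure and that several genomes can share one output directory (the output layout has a per-genome subdirectory); neither is confirmed here. `build_command` and `run_batch` are hypothetical helpers.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_file, output_path):
    """Build the documented motif_analyzer.py invocation as an argument list."""
    return [sys.executable, "scripts/motif_analyzer.py",
            "-i", str(input_file), "-o", str(output_path)]


def run_batch(genbank_dir, output_path):
    """Run the analyzer on every .gbk file in genbank_dir; return files that failed."""
    failed = []
    for gbk in sorted(Path(genbank_dir).glob("*.gbk")):
        result = subprocess.run(build_command(gbk, output_path))
        if result.returncode != 0:  # assumption: failures exit non-zero
            failed.append(gbk.name)
    return failed


# Show the command that would be run for the example above.
print(" ".join(build_command("NC_001416.gbk", "scripts/output")))
```

Calling `run_batch("genomes", "scripts/output")` from the EvoMega root would process every .gbk file in a genomes directory and return the names of those that did not finish cleanly.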

Output Explanation

After running EvoMega, the output directory (scripts/output/ in the example above) has the following structure:

├── interaction
│   └── batch_figure
│       └── NC_001416
│           ├── contacts_round_1_with_hth_interactions.png
│           ├── contacts_round_2_with_hth_interactions.png
│           ├── extracted_sequences.fasta
│           ├── hmmscan_output.txt
│           ├── hth_interactions_round_1.csv
│           ├── hth_interactions_round_2.csv
│           └── updated_sequences
│               ├── updated_extracted_sequences_round_1.fasta
│               └── updated_extracted_sequences_round_2.fasta
└── NC_001416
    ├── extend_meme_10.csv
    ├── motif_meme
    │   ├── fimo_results
    │   │   ├── fimo.txt
    │   │   └── tomtom_results
    │   │       └── combined_motifs.meme
    │   ├── final_motif.meme
    │   └── non_redundant.meme
    ├── motif_trends.png
    ├── NC_001416.csv
    ├── NC_001416_motif_metrix.csv
    ├── rank_list.csv
    └── results_df.csv

Detailed Explanation:

interaction/batch_figure/NC_001416:
- contacts_round_1_with_hth_interactions.png: Visualization of gene interactions after the first round.
- contacts_round_2_with_hth_interactions.png: Visualization of gene interactions after the second round.
- extracted_sequences.fasta: FASTA file containing extracted gene sequences.
- hmmscan_output.txt: Output from the HMMER scan.
- hth_interactions_round_1.csv: CSV file detailing helix-turn-helix (HTH) interactions from round 1.
- hth_interactions_round_2.csv: CSV file detailing HTH interactions from round 2.
- updated_sequences/: Contains updated sequences after the rounds.
  - updated_extracted_sequences_round_1.fasta: Updated sequences after round 1.
  - updated_extracted_sequences_round_2.fasta: Updated sequences after round 2.
NC_001416:
- extend_meme_10.csv: Extended MEME analysis results.
- motif_meme/: Directory containing motif analysis results.
  - fimo_results/fimo.txt: Results from the FIMO tool identifying motif occurrences.
  - final_motif.meme: Final set of motifs after analysis.
  - non_redundant.meme: Non-redundant motifs.
- motif_trends.png: Visualization of motif trends.
- NC_001416.csv: General results for NC_001416.
- NC_001416_motif_metrix.csv: Comprehensive scoring matrix.
- rank_list.csv: Ranked list of motifs based on scoring.
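
The column names of the CSV outputs are not documented in this README, so a first look is best taken without assuming any. The following sketch prints the header and the first few rows of rank_list.csv; `preview_csv` is a hypothetical helper, and the path follows the example run above.

```python
import csv
from pathlib import Path


def preview_csv(path, max_rows=5):
    """Return (header, rows) with at most max_rows data rows from a CSV file."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])
        rows = [row for _, row in zip(range(max_rows), reader)]
    return header, rows


path = Path("scripts/output/NC_001416/rank_list.csv")
if path.is_file():
    header, rows = preview_csv(path)
    print(header)
    for row in rows:
        print(row)
else:
    print(f"{path} not found; run the analyzer first.")
```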

Troubleshooting

MEME Suite Not Found:
Ensure that MEME Suite is correctly installed and that the path is added to your PATH environment variable. You can verify by running:
```
meme -version
```
If the command is not found, revisit the Install MEME Suite section.
Permission Issues:
If you encounter permission errors during installation, ensure you have the necessary rights or use sudo where appropriate.
Missing Dependencies:
Make sure all required dependencies are installed. Refer to the Install Dependencies and Tools section.
Conda Environment Issues:
If you have trouble activating the Conda environment, ensure that Conda is properly installed and initialized. You can initialize Conda with:
```
conda init
```
```
source ~/.bashrc
```
Download Failures:
If downloads from Google Drive fail, ensure that you have a stable internet connection and that the download links are accessible.

License

This project is licensed under the MIT License.

Contact

For any questions or support, please contact 1025387313hzq@gmail.com.
