# OrthoSLC
OrthoSLC: A pipeline to get Orthologous Genes using Reciprocal Best Blast Hit (RBBH) Single Linkage Clustering, independent of relational database management systems
## Readme OrthoSLC (1.0.0)
Apologies and warnings: results generated by versions prior to 0.2.0 had incorrect sequence-to-ID correspondence. Sincere apologies.
OrthoSLC is a pipeline that performs Reciprocal Best Blast Hit (RBBH) Single Linkage Clustering (V1.0.0 onwards also allows mcl) to obtain Orthologous Genes, and generates core and accessory genes as final output. <br>
## Table of contents
- Readme OrthoSLC (1.0.0)
- Easy Run
- You can run each step independently (check commandline_template.sh)
  - Step 1 Annotated genome information preparation
  - Step 2 FASTA dereplication
  - Step 3 Pre-clustering using all dereplicated FASTAs and non-redundant genome generation
  - Step 4 Reciprocal Blast
  - Step 5 Query binning
  - Step 6 Filtering and binning
  - Step 7 Reciprocal Best find
  - Step 8 Clustering
  - Step 9 Write clusters into FASTA
- OrthoSLC ToolKit (OrthoSLC_TK)
OrthoSLC ToolKit (OrthoSLC_TK, at the bottom of this page): from V1.0.0 onwards, we provide OrthoSLC_TK to assist with related downstream tasks like MSA, construction of the single copy core genome, and calculating all pairwise SNP counts.
OrthoSLC is: <br>
- lightweight,
- fast:
- 90 E.coli annotated genomes, 10 threads to final cluster -> <150 s;
- 976 E.coli annotated genomes (unique genes -> ~500M), 30 threads -> ~22 min;
- 1200 E.coli annotated genomes (highly diverse, from six different lineages, unique genes -> ~900M), 30 threads -> ~70 min;
- 3050 E.coli annotated genomes (from different lineages, unique genes -> 1.2G), 30 threads -> ~220 min, peak memory < 15G with a speed trade-off;
- convenient to install;
- and independent of relational database management systems (e.g., MySQL, Redis, SQLite)
Download:<br>
```shell
$ git clone https://github.com/JJChenCharly/OrthoSLC
$ cd OrthoSLC
```
Caveat:<br>
- The pipeline is currently available for Linux-like systems only.
- From V1.0.0 onwards, the "all-in-one" Jupyter notebook interface `OrthoSLC_Python_jupyter_interface.ipynb` no longer guarantees the same output as the executable binaries. Users should employ the executables in `bins/`.
Requirement and Run:<br>
- Python3 (newest stable release suggested, or > V3.7)<br>
- C++17 or higher (required only for compiling; users may instead use the pre-compiled binaries after granting execute access)<br>
- NCBI Blast+ (2.12 or higher suggested)<br>
- mcl (optional but highly suggested; `apt-get install mcl`)
You may use the pre-compiled binaries:
```shell
$ cd OrthoSLC
$ chmod a+x bins/*
```
Or compile with install.sh as follows:
```shell
$ cd OrthoSLC
$ chmod a+x install.sh
$ ./install.sh src/ bins/
```
## Easy Run
The pipeline starts with annotated genomes, and produces clusters of gene IDs as well as FASTA files of each cluster.
Therefore, with:
- `blastn` and `mcl` in `$PATH`;
- one FASTA file for each strain, with many DNA sequences in each FASTA;
you can easily try:
- 90 E.coli genomes and 10 threads (time: <150s).
```shell
bash ./commandline_template.sh ./bins \
./test_output \ # your output directory
./test_inputs \ # your input directory
10 \ # threads
60 \ # larger -> lower mem but a bit slower
fasta # input file format
```
- ~3000 E.coli genomes (from six lineages, unique genes -> 1.2G) on 30 threads:<br>
(Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz, time: ~215 min, peak mem <15G). Peak mem can be reduced to <5G at the cost of ~10 min more total time, by increasing the binning parameter `500` to a larger value.
```shell
bash ./commandline_template.sh ./bins \
./test_output \
./test_inputs \
30 \
500 \
ffn
```
Note that this pipeline is recommended for sub-species level single copy core genome construction, since RBBH may not work well for tasks like finding human-microbe orthologs.
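The RBBH criterion at the heart of the pipeline can be sketched in a few lines. This is an illustrative toy, not OrthoSLC code; the gene IDs and best-hit tables below are made up:

```python
# Minimal sketch of the Reciprocal Best Blast Hit (RBBH) idea.
# best_hit_ab[g] = best blast hit of gene g from genome A in genome B, and
# best_hit_ba the reverse direction (all IDs here are hypothetical).
best_hit_ab = {"A-1": "B-7", "A-2": "B-3"}
best_hit_ba = {"B-7": "A-1", "B-3": "A-9"}

# A pair (a, b) is an RBBH only if each gene is the other's best hit.
rbbh_pairs = [(a, b) for a, b in best_hit_ab.items() if best_hit_ba.get(b) == a]
# "A-2"/"B-3" is not reciprocal (B-3's best hit is A-9), so it is excluded.
```

Single linkage clustering then joins genes that share any RBBH edge into one cluster.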
The programs use "A simple C++ Thread Pool implementation"; sincere thanks to its contributors.
Bug report:
## You can run each step independently (check `commandline_template.sh`):
### Step 1 Annotated genome information preparation
The pipeline starts with annotated genomes in fasta format (DNA is accepted as the default input).<br> Step 1 needs the path to the directory of annotated FASTA files as input; your input folder should follow the Prokka/PGAP output structure:
```shell
$ tree ./test_inputs
./test_inputs
├── strain_A
│   ├── strain_A.err
│   ├── strain_A.faa
│   ├── strain_A.ffn
│   ├── strain_A.fna
...
├── strain_B
│   ├── strain_B.err
│   ├── strain_B.faa
│   ├── strain_B.ffn
...
```
Or a simple layout:
```shell
$ tree ./test_inputs
./test_inputs
├── K12_MG1655.fasta
├── UTI89.fasta
├── strain_A.fasta
├── strain_B.fasta
```
With `-f`, the program looks for all files with the given extension. The program generates a header-less, tab-separated table, in which the:
- first column is a short ID (`strainID-GeneID`);
- second column is the strain name (file name or parent folder name);
- third column is the absolute path.
```shell
$ python3 ./bins/Step1_preWalk.py -h
Usage: python Step1_preWalk.py -i input/ -o output/ [options...]
 -i or --input_path ---> <dir> path/to/input/directory
 -o or --output_path --> <txt> path/to/output_file
 -f or --format -------> <str> file extension to look for (e.g. 'sqn', 'ffn')
 -h or --help ---------> display this information
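The Step 1 table is easy to consume downstream. A minimal parsing sketch, using hypothetical IDs, strain names, and paths (not real pipeline output):

```python
import csv
import io

# Two hypothetical rows of a Step 1 output table: short ID, strain name,
# absolute path, separated by tabs, with no header.
step1_table = (
    "A-0\tstrain_A\t/abs/path/strain_A.ffn\n"
    "B-0\tstrain_B\t/abs/path/strain_B.ffn\n"
)

# Parse the header-less, tab-separated table into {short_id: (strain, abs_path)}.
id_map = {
    short_id: (strain, path)
    for short_id, strain, path in csv.reader(io.StringIO(step1_table), delimiter="\t")
}
```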
### Step 2 FASTA dereplication
Step 2 removes sequence duplication within each genome (e.g., copies of tRNA, some CDS). This dereplication is equivalent to 100% identity clustering, to obtain single copies.<br>
Step 2 requires the tab-separated table output by Step 1 as input, and a directory (-o) for the dereplicated files.<br>
Since V1.0.0, by specifying -c or --copy_info_path, the program will organize the IDs of copies (identical sequences); the output file is a TSV in which each row is one set of identical copies separated by \t.<br>
Note! Dereplication in Step 2 simply records identical sequences; it is not based on read region coverage. Hence, annotation results from a complete circular or near-complete genome assembly are suggested. What I usually do is let TellSeq handle my genome extract; I always obtain N<sub>50</sub> > 4.5 Mbp E.coli assemblies (not complete) at 300 to 350 RMB/sample (< ~50$).
```shell
$ ./bins/Step2_simple_derep -h
Usage: Step2_simple_derep -i input_file -o output/ [options...]
 -i or --input_path ------> <txt> path/to/file/output/by/Step1
 -o or --output_path -----> <dir> path/to/output/directory
 -c or --copy_info_path --> <txt> path/to/output/copy_info.txt
 -u or --thread_number ---> <int> thread number, default: 1
 -h or --help ------------> display this information
```
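The copy-info TSV (one row per set of identical copies) can be loaded like this. A sketch with made-up IDs; it also assumes, purely for illustration, that the first ID of each row is taken as the representative, which the real file format does not promise:

```python
# Hypothetical content of the -c/--copy_info_path output: one row per set of
# identical sequence copies, gene IDs separated by tabs.
copy_info = "A-3\tA-17\tA-42\nB-5\tB-9\n"

# One list per set of identical copies.
copy_sets = [row.split("\t") for row in copy_info.splitlines() if row]

# Map every duplicate ID back to one representative per set (sketched here as
# the first ID of each row).
rep_of = {dup: row[0] for row in copy_sets for dup in row}
```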
<font color="red">Note before next step</font>
After dereplication, users should carefully check the sizes of the dereplicated FASTA files. It is worth noting that if a FASTA file with very few sequences is processed together with all the rest, the final "core" clusters will be heavily affected, which may bias your analysis.
```shell
grep -c ">" $wd/S2_op_dereped/* | sort -t: -k2rn
```
Since core genome construction is similar to set intersection, <font color="red">it is recommended to remove very small dereplicated fasta files BEFORE THE NEXT STEP</font>, e.g., remove all dereplicated <i>E.coli</i> genomes with file size lower than 2.5 MB, as most should be larger than 4 MB.
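Removal by file size can be scripted. The helper below is our own sketch, not part of OrthoSLC; the directory name and the 2.5 MB threshold follow the E.coli example above and should be adapted to your organism:

```python
from pathlib import Path

def prune_small_fastas(derep_dir, min_bytes=2_500_000):
    """Delete dereplicated FASTA files smaller than min_bytes; return removed names."""
    removed = []
    for fasta in Path(derep_dir).iterdir():
        if fasta.is_file() and fasta.stat().st_size < min_bytes:
            fasta.unlink()
            removed.append(fasta.name)
    return removed

# Example (hypothetical output directory of Step 2):
# prune_small_fastas("test_output/S2_op_dereped")
```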
### Step 3 Pre-clustering using all dereplicated FASTAs and non-redundant genome generation
This step performs 100% identity clustering on all dereplicated FASTAs generated in Step 2 to further dereplicate across all sequences.<br>
The program of Step 3 takes the dereplicated fasta files made in Step 2 as input, and produces:<br>
- a FASTA dereplicated across all sequences, for use as the reciprocal blast query;
- the length of each non-redundant sequence;
- pre-clustered results (one row, one cluster, tab-separated);
- each genome with redundancy removed, for `makeblastdb`;
- gene ID information: a tab-delimited table, with the first column as the generated short ID and the second column as the description line of the original fasta.
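The pre-clustered file (one cluster per row, members tab-separated) can be read back in a few lines. A sketch with made-up gene IDs:

```python
# Hypothetical content of the -p/--pre_cluster output: one cluster per row,
# member gene IDs separated by tabs.
pre_cluster_text = "A-1\tB-7\tC-2\nA-9\nB-3\tC-11\n"

# One list of member IDs per cluster; single-member rows are singletons.
clusters = [row.split("\t") for row in pre_cluster_text.splitlines() if row]
singletons = [c[0] for c in clusters if len(c) == 1]
```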
You may run Step3_pre_cluster like the following:
```shell
Usage: Step3_pre_cluster -i concatenated.fasta -d dereped/ -n nr_genome/ -l seq_len.txt -p pre_cluster.txt [options...]
 -i or --input_path -----> <dir> path/to/directory of dereplicated FASTA from Step2
 -d or --derep_fasta ----> <fasta> path/to/output/dereplicated concatenated FASTA
 -n or --nr_genomes -----> <dir> path/to/directory/of/output/non-redundant/genomes
 -l or --seq_len_info ---> <txt> path/to/output/sequence_length_table
 -m or --id_info --------> <txt> path/to/output/id_info_table
 -p or --pre_cluster ----> <txt> path/to/output/pre_clustered_file
 -u or --thread_number --> <int> thread number, default: 1
 -h or --help -----------> display this information
```
