# OrthoSLC
OrthoSLC: A pipeline to get Orthologous Genes using Reciprocal Best Blast Hit (RBBH) Single Linkage Clustering, independent of relational database management systems
## Readme OrthoSLC (1.0.0)
Apologies and warnings: results generated by versions prior to 0.2.0 had incorrect sequence-to-ID correspondence. Sincere apologies.
OrthoSLC is a pipeline that performs Reciprocal Best Blast Hit (RBBH) Single Linkage Clustering (V1.0.0 onwards also allows mcl) to obtain Orthologous Genes, and generates core and accessory genes as final output. <br>
## Table of contents
- Readme OrthoSLC (1.0.0)
- Easy Run
- You can run each step independently (check commandline_template.sh)
  - Step 1 Annotated genome information preparation
  - Step 2 FASTA dereplication
  - Step 3 Pre-clustering using all dereplicated FASTAs and non-redundant genome generation
  - Step 4 Reciprocal Blast
  - Step 5 Query binning
  - Step 6 Filtering and binning
  - Step 7 Reciprocal Best find
  - Step 8 Clustering
  - Step 9 Write clusters into FASTA
- OrthoSLC ToolKit (OrthoSLC_TK)
OrthoSLC ToolKit (OrthoSLC_TK, at the bottom of this page): from V1.0.0 onwards, we provide OrthoSLC_TK to assist with related downstream tasks like MSA, construction of the single copy core genome, and calculating all pairwise SNP counts.
OrthoSLC is: <br>
- lightweight,
- fast:
- 90 E.coli annotated genomes, 10 threads to final cluster -> <150 s;
- 976 E.coli annotated genomes (unique genes -> ~500M), 30 threads -> ~22 min;
- 1200 E.coli annotated genomes (highly diverse, from six different lineages, unique genes -> ~900M), 30 threads -> ~70 min;
- 3050 E.coli annotated genomes (from different lineages, unique genes -> 1.2G), 30 threads -> ~220 min, peak memory < 15G with a speed trade-off;
- convenient to install;
- and independent of relational database management systems (e.g., MySQL, Redis, SQLite)
Download:<br>
```shell
$ git clone https://github.com/JJChenCharly/OrthoSLC
$ cd OrthoSLC
```
Caveat:<br>
- The pipeline is currently available for Linux-like systems only.
- From V1.0.0 onwards, the "all-in-one" Jupyter notebook interface `OrthoSLC_Python_jupyter_interface.ipynb` no longer guarantees the same output as the executable binaries. Users should employ the executables in `bins/`.
Requirement and Run:<br>
- Python3 (newest stable release suggested, or > V3.7)<br>
- C++17 or higher (required only for compiling; users may instead use the pre-compiled binaries after granting execute access)<br>
- NCBI Blast+ (2.12 or higher suggested)<br>
- mcl (optional but highly suggested; `apt-get install mcl`)
You may use the pre-compiled binaries:
```shell
$ cd OrthoSLC
$ chmod a+x bins/*
```
Or compile with install.sh as follows:
```shell
$ cd OrthoSLC
$ chmod a+x install.sh
$ ./install.sh src/ bins/
```
## Easy Run
The pipeline starts with annotated genomes, and produces clusters of gene IDs as well as FASTA files of each cluster.
Therefore, with:
- `blastn` and `mcl` in `$PATH`;
- one FASTA file for each strain, with many DNA sequences in each FASTA;
you can easily try:
- 90 E.coli genomes and 10 threads (time: <150s).
```shell
bash ./commandline_template.sh ./bins \
./test_output \ # your output directory
./test_inputs \ # your input directory
10 \ # threads
60 \ # larger -> lower mem but a bit slower
fasta # input file format
```
- ~3000 E.coli genomes (from six lineages, unique genes -> 1.2G) on 30 threads:<br>
(Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz, time: ~215 min, peak mem <15G). Peak mem can be reduced to <5G at the cost of ~10 min more total time, by increasing the binning parameter `500` to a larger value.
```shell
bash ./commandline_template.sh ./bins \
./test_output \
./test_inputs \
30 \
500 \
ffn
```
Note that this pipeline is recommended for sub-species level single copy core genome construction, since RBBH may not work well for tasks like finding human-microbe orthologs.
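The RBBH criterion at the heart of the pipeline can be sketched in a few lines. This is an illustrative toy, not OrthoSLC code; the gene IDs and best-hit tables below are made up:

```python
# Minimal sketch of the Reciprocal Best Blast Hit (RBBH) idea.
# best_hit_ab[g] = best blast hit of gene g from genome A in genome B, and
# best_hit_ba the reverse direction (all IDs here are hypothetical).
best_hit_ab = {"A-1": "B-7", "A-2": "B-3"}
best_hit_ba = {"B-7": "A-1", "B-3": "A-9"}

# A pair (a, b) is an RBBH only if each gene is the other's best hit.
rbbh_pairs = [(a, b) for a, b in best_hit_ab.items() if best_hit_ba.get(b) == a]
# "A-2"/"B-3" is not reciprocal (B-3's best hit is A-9), so it is excluded.
```

Single linkage clustering then joins genes that share any RBBH edge into one cluster.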
The programs use "A simple C++ Thread Pool implementation"; sincere thanks to its contributors.
Bug report:
## You can run each step independently (check `commandline_template.sh`):
### Step 1 Annotated genome information preparation
The pipeline starts with annotated genomes in fasta format (DNA is accepted as the default input).<br> Step 1 needs the path to the directory of annotated FASTA files as input; your input folder should follow the Prokka/PGAP output structure:
```shell
$ tree ./test_inputs
./test_inputs
├── strain_A
│   ├── strain_A.err
│   ├── strain_A.faa
│   ├── strain_A.ffn
│   ├── strain_A.fna
...
├── strain_B
│   ├── strain_B.err
│   ├── strain_B.faa
│   ├── strain_B.ffn
...
```
Or a simple layout:
```shell
$ tree ./test_inputs
./test_inputs
├── K12_MG1655.fasta
├── UTI89.fasta
├── strain_A.fasta
├── strain_B.fasta
```
With `-f`, the program looks for all files with the given extension. The program generates a header-less, tab-separated table, in which the:
- first column is a short ID (`strainID-GeneID`);
- second column is the strain name (file name or parent folder name);
- third column is the absolute path.
```shell
$ python3 ./bins/Step1_preWalk.py -h
Usage: python Step1_preWalk.py -i input/ -o output/ [options...]
 -i or --input_path ---> <dir> path/to/input/directory
 -o or --output_path --> <txt> path/to/output_file
 -f or --format -------> <str> file extension to look for (e.g. 'sqn', 'ffn')
 -h or --help ---------> display this information
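The Step 1 table is easy to consume downstream. A minimal parsing sketch, using hypothetical IDs, strain names, and paths (not real pipeline output):

```python
import csv
import io

# Two hypothetical rows of a Step 1 output table: short ID, strain name,
# absolute path, separated by tabs, with no header.
step1_table = (
    "A-0\tstrain_A\t/abs/path/strain_A.ffn\n"
    "B-0\tstrain_B\t/abs/path/strain_B.ffn\n"
)

# Parse the header-less, tab-separated table into {short_id: (strain, abs_path)}.
id_map = {
    short_id: (strain, path)
    for short_id, strain, path in csv.reader(io.StringIO(step1_table), delimiter="\t")
}
```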
### Step 2 FASTA dereplication
Step 2 removes sequence duplication within each genome (e.g., copies of tRNA, some CDS). This dereplication is equivalent to 100% identity clustering, to obtain single copies.<br>
Step 2 requires the tab-separated table output by Step 1 as input, and a directory (-o) for the dereplicated files.<br>
Since V1.0.0, by specifying -c or --copy_info_path, the program will organize the IDs of copies (identical sequences); the output file is a TSV in which each row is one set of identical copies separated by \t.<br>
Note! Dereplication in Step 2 simply records identical sequences; it is not based on read region coverage. Hence, annotation results from a complete circular or near-complete genome assembly are suggested. What I usually do is let TellSeq handle my genome extract; I always obtain N<sub>50</sub> > 4.5 Mbp E.coli assemblies (not complete) at 300 to 350 RMB/sample (< ~50$).
```shell
$ ./bins/Step2_simple_derep -h
Usage: Step2_simple_derep -i input_file -o output/ [options...]
 -i or --input_path ------> <txt> path/to/file/output/by/Step1
 -o or --output_path -----> <dir> path/to/output/directory
 -c or --copy_info_path --> <txt> path/to/output/copy_info.txt
 -u or --thread_number ---> <int> thread number, default: 1
 -h or --help ------------> display this information
```
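The copy-info TSV (one row per set of identical copies) can be loaded like this. A sketch with made-up IDs; it also assumes, purely for illustration, that the first ID of each row is taken as the representative, which the real file format does not promise:

```python
# Hypothetical content of the -c/--copy_info_path output: one row per set of
# identical sequence copies, gene IDs separated by tabs.
copy_info = "A-3\tA-17\tA-42\nB-5\tB-9\n"

# One list per set of identical copies.
copy_sets = [row.split("\t") for row in copy_info.splitlines() if row]

# Map every duplicate ID back to one representative per set (sketched here as
# the first ID of each row).
rep_of = {dup: row[0] for row in copy_sets for dup in row}
```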
<font color="red">Note before next step</font>
After dereplication, users should carefully check the sizes of the dereplicated FASTA files. It is worth noting that if a FASTA file with very few sequences is processed together with all the rest, the final "core" clusters will be heavily affected, which may bias your analysis.
```shell
grep -c ">" $wd/S2_op_dereped/* | sort -t: -k2rn
```
Since core genome construction is similar to set intersection, <font color="red">it is recommended to remove very small dereplicated fasta files BEFORE THE NEXT STEP</font>, e.g., remove all dereplicated <i>E.coli</i> genomes with file size lower than 2.5 MB, as most should be larger than 4 MB.
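Removal by file size can be scripted. The helper below is our own sketch, not part of OrthoSLC; the directory name and the 2.5 MB threshold follow the E.coli example above and should be adapted to your organism:

```python
from pathlib import Path

def prune_small_fastas(derep_dir, min_bytes=2_500_000):
    """Delete dereplicated FASTA files smaller than min_bytes; return removed names."""
    removed = []
    for fasta in Path(derep_dir).iterdir():
        if fasta.is_file() and fasta.stat().st_size < min_bytes:
            fasta.unlink()
            removed.append(fasta.name)
    return removed

# Example (hypothetical output directory of Step 2):
# prune_small_fastas("test_output/S2_op_dereped")
```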
### Step 3 Pre-clustering using all dereplicated FASTAs and non-redundant genome generation
This step performs 100% identity clustering on all dereplicated FASTAs generated in Step 2 to further dereplicate across all sequences.<br>
The program of Step 3 takes the dereplicated fasta files made in Step 2 as input, and produces:<br>
- a FASTA dereplicated across all sequences, for use as the reciprocal blast query;
- the length of each non-redundant sequence;
- pre-clustered results (one row, one cluster, tab-separated);
- each genome with redundancy removed, for `makeblastdb`;
- gene ID information: a tab-delimited table, with the first column as the generated short ID and the second column as the description line of the original fasta.
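The pre-clustered file (one cluster per row, members tab-separated) can be read back in a few lines. A sketch with made-up gene IDs:

```python
# Hypothetical content of the -p/--pre_cluster output: one cluster per row,
# member gene IDs separated by tabs.
pre_cluster_text = "A-1\tB-7\tC-2\nA-9\nB-3\tC-11\n"

# One list of member IDs per cluster; single-member rows are singletons.
clusters = [row.split("\t") for row in pre_cluster_text.splitlines() if row]
singletons = [c[0] for c in clusters if len(c) == 1]
```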
You may run Step3_pre_cluster like the following:
```shell
Usage: Step3_pre_cluster -i concatenated.fasta -d dereped/ -n nr_genome/ -l seq_len.txt -p pre_cluster.txt [options...]
 -i or --input_path -----> <dir> path/to/directory of dereplicated FASTA from Step2
 -d or --derep_fasta ----> <fasta> path/to/output/dereplicated concatenated FASTA
 -n or --nr_genomes -----> <dir> path/to/directory/of/output/non-redundant/genomes
 -l or --seq_len_info ---> <txt> path/to/output/sequence_length_table
 -m or --id_info --------> <txt> path/to/output/id_info_table
 -p or --pre_cluster ----> <txt> path/to/output/pre_clustered_file
 -u or --thread_number --> <int> thread number, default: 1
 -h or --help -----------> display this information
```
