
OrthoSLC

OrthoSLC: A pipeline to obtain orthologous genes using Reciprocal Best Blast Hit (RBBH) Single Linkage Clustering, independent of relational database management systems


OrthoSLC README (1.0.0)

Apologies and Warnings: results generated by versions prior to 0.2.0 had incorrect sequence-to-ID correspondence. Sincere apologies.

OrthoSLC is a pipeline that performs Reciprocal Best Blast Hit (RBBH) Single Linkage Clustering (from V1.0.0 onwards, mcl is also supported) to obtain orthologous genes, and generates core and accessory gene sets as final output. <br>

Table of contents

OrthoSLC ToolKit (OthoSLC_TK, at the bottom of this page): from V1.0.0 onwards, we provide OthoSLC_TK to assist related downstream tasks such as MSA, single copy core genome construction, and all-pairwise SNP counting.

OrthoSLC is: <br>

  • lightweight,
  • fast:
    • 90 E.coli annotated genomes, 10 threads to final cluster -> <150 s;
    • 976 E.coli annotated genomes (unique genes -> ~500M), 30 threads -> ~22 min;
    • 1200 E.coli annotated genomes (highly diverse, from six different lineages, unique genes -> ~900M), 30 threads -> ~70 min;
    • 3050 E.coli annotated genomes (from different lineages, unique genes -> 1.2G), 30 threads -> ~220 min, peak memory < 15G with a speed trade-off;
  • convenient to install;
  • and independent of database management systems such as MySQL, Redis, or SQLite

Download:<br>

$ git clone https://github.com/JJChenCharly/OrthoSLC
$ cd OrthoSLC

Caveat:<br>

  • The pipeline is currently available for Linux-like systems only.
  • From V1.0.0 onwards, the "all-in-one" Jupyter notebook interface OrthoSLC_Python_jupyter_interface.ipynb no longer guarantees the same output as the executable binaries. Users should use the executables in bins/

Requirement and Run:<br>

  • Python3 (newest stable release or > V3.7 suggested)<br>
  • C++17 (or higher, required for compiling); users may also directly use the pre-compiled binaries by granting execute access<br>
  • NCBI Blast+ (2.12 or higher suggested) <br>
  • mcl (optional but highly suggested; apt-get install mcl)

You may use pre-compiled:

$ cd OrthoSLC
$ chmod a+x bins/*

Or compile with install.sh as follows:

$ cd OrthoSLC
$ chmod a+x install.sh
$ ./install.sh src/ bins/

Easy Run

The pipeline starts with annotated genomes, and can produce clusters of gene IDs as well as FASTA files for each cluster.

Therefore, with:

  • blastn and mcl in $PATH;
  • one FASTA file per strain, each containing many DNA sequences;

you can easily try:

  • 90 E.coli genomes and 10 threads (time: <150 s):
bash ./commandline_template.sh ./bins \
./test_output \
./test_inputs \
10 \
60 \
fasta
The positional arguments are, in order: the bins directory, your output directory, your input directory, the thread count, the batch size (larger -> lower memory but a bit slower), and the input file format. (Note: in shell scripts, a line-continuation backslash must be the last character on the line, so inline comments cannot follow it.)
  • ~3000 E.coli genomes (from six lineages, unique genes -> 1.2G) on 30 threads:<br> (Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz, time: ~215 min, peak mem <15G). Peak memory can be reduced to <5G, at the cost of ~10 min more total runtime, by increasing the batch-size argument (500 below) further.
bash ./commandline_template.sh ./bins \
./test_output \
./test_inputs \
30 \
500 \
ffn

Note that this pipeline is recommended for sub-species-level single copy core genome construction, since RBBH may not work well for tasks such as finding human-microbe orthologs.
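The RBBH + single-linkage idea can be sketched in a few lines of Python. This is illustrative only: the actual pipeline is implemented in C++ and driven by blastn bit scores, and the function names and inputs below are hypothetical.

```python
from collections import defaultdict

def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep a pair (a, b) only when a's best hit is b AND b's best hit is a."""
    return [(a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a]

class DisjointSet:
    """Union-find: single-linkage clustering of RBBH pairs is just the
    connected components (transitive closure) of the RBBH edge set."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def single_linkage(pairs):
    """Cluster gene IDs connected (directly or transitively) by RBBH pairs."""
    ds = DisjointSet()
    for a, b in pairs:
        ds.union(a, b)
    clusters = defaultdict(list)
    for x in list(ds.parent):
        clusters[ds.find(x)].append(x)
    return sorted(sorted(c) for c in clusters.values())
```

Because single linkage merges any clusters sharing one edge, a single spurious RBBH pair can chain two gene families together, which is one reason the README restricts the pipeline to closely related genomes (and offers mcl as an alternative).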

The programs use A simple C++ Thread Pool implementation; sincere thanks to its contributors.

Bug report: please open an issue on GitHub.

You can run each step independently (see commandline_template.sh):

Step 1 Annotated genome information preparation

The pipeline starts with annotated genomes in FASTA format (DNA sequences are the default input).<br> Step 1 takes the path to the directory of annotated FASTA files as input; your input folder should follow the Prokka/PGAP output structure:

$ tree ./test_inputs
./test_inputs
├── strain_A
│   ├── strain_A.err
│   ├── strain_A.faa
│   ├── strain_A.ffn
│   ├── strain_A.fna
...
├── strain_B
│   ├── strain_B.err
│   ├── strain_B.faa
│   ├── strain_B.ffn
...

Or simple layout:

$ tree ./test_inputs
./test_inputs
├── K12_MG1655.fasta
├── UTI89.fasta
├── strain_A.fasta
├── strain_B.fasta

With -f, the program looks for all files with the given extension and generates a header-less, tab-separated table in which the

  • first column is a short ID (strainID-GeneID);
  • second column is the strain name (file name or parent folder name);
  • third column is the absolute path.
$ python3 ./bins/Step1_preWalk.py -h
Usage: python Step1_preWalk.py -i input/ -o output/ [options...]

  -i or --input_path ---> <dir> path/to/input/directory
  -o or --output_path --> <txt> path/to/output_file
  -f or --format -------> <str> file extension to look for (e.g. 'sqn', 'ffn')
  -h or --help ---------> display this information
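For illustration, here is a minimal Python sketch of such a table-building walk under the simple layout. The real Step1_preWalk.py may assign IDs and strain names differently; `pre_walk` and its exact columns are assumptions.

```python
import os

def pre_walk(input_dir, extension):
    """Hypothetical sketch of Step 1's table: one row per FASTA file,
    as (strain_index, strain_name, absolute_path). The real tool's
    short-ID scheme (strainID-GeneID) is per gene and may differ."""
    found = []
    for root, _dirs, names in os.walk(input_dir):
        for name in names:
            if name.endswith("." + extension):
                found.append(os.path.join(root, name))
    rows = []
    for idx, path in enumerate(sorted(found)):
        # strain name: file name without extension (or parent folder name
        # in the Prokka/PGAP layout)
        strain = os.path.splitext(os.path.basename(path))[0]
        rows.append((str(idx), strain, os.path.abspath(path)))
    return rows
```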

Step 2 FASTA dereplication

Step 2 removes sequence duplication within each genome (e.g., copies of tRNAs, some CDSs). This dereplication is equivalent to 100% clustering, to obtain single copies.<br> Step 2 requires the tab-separated table output by Step 1 as input, plus a directory (-o) for the dereplicated files.<br> Since V1.0.0, by specifying -c or --copy_info_path, the program will record the IDs of copies (identical sequences); the output file is a TSV in which each row is one set of identical copies separated by \t.<br> Note! Dereplication in OrthoSLC Step 2 simply records identical sequences; it does not consider read coverage of regions. Hence, annotation of a complete circular or near-complete genome assembly is suggested. What I usually do is let TellSeq handle my genome extract; I always obtain N<sub>50</sub> > 4.5 Mbp E.coli assemblies (not complete) at 300 to 350 RMB/sample (< ~50 $).

$ ./bins/Step2_simple_derep -h
Usage: Step2_simple_derep -i input_file -o output/ [options...]

  -i or --input_path ------> <txt> path/to/file/output/by/Step1
  -o or --output_path -----> <dir> path/to/output/directory
  -c or --copy_info_path --> <txt> path/to/output/copy_info.txt
  -u or --thread_number ---> <int> thread number, default: 1
  -h or --help ------------> display this information
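Conceptually, the within-genome dereplication amounts to grouping identical sequences and keeping one ID per group. A hedged Python sketch (`dereplicate` is a hypothetical name, not the tool's code):

```python
from collections import defaultdict

def dereplicate(records):
    """100%-identity dereplication within one genome.
    records: iterable of (seq_id, sequence) pairs.
    Returns (kept, copy_sets): kept maps the first ID seen per unique
    sequence to that sequence; copy_sets lists every group of identical
    sequences with more than one member (one row of copy_info)."""
    groups = defaultdict(list)
    for seq_id, seq in records:
        groups[seq].append(seq_id)
    kept = {ids[0]: seq for seq, ids in groups.items()}
    copy_sets = [ids for ids in groups.values() if len(ids) > 1]
    return kept, copy_sets
```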

<font color="red">Note before next step</font>

After dereplication, users should carefully check the sizes of the dereplicated FASTA files. If a FASTA file containing very few sequences is processed together with all the rest, the final "core" clusters will be heavily affected and may bias your analysis.

grep -c ">" $wd/S2_op_dereped/* | sort -t: -k2rn

Core genome construction is similar to an intersection operation, so <font color="red">it is recommended to remove very small dereplicated fasta files BEFORE NEXT STEP</font>, e.g., remove all dereplicated <i>E.coli</i> genomes with file size lower than 2.5 MB, as most should be larger than 4 MB.
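If helpful, this pre-flight check can also be scripted. A small Python sketch (`flag_small_fastas` is illustrative; the byte threshold is dataset-dependent, e.g. ~2.5 MB for E.coli as suggested above):

```python
import os

def flag_small_fastas(derep_dir, min_bytes):
    """List dereplicated FASTA files smaller than min_bytes, so they can be
    reviewed or removed before Step 3. Returns (name, size) pairs."""
    small = []
    for name in sorted(os.listdir(derep_dir)):
        path = os.path.join(derep_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) < min_bytes:
            small.append((name, os.path.getsize(path)))
    return small
```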

Step 3 Pre-clustering using all dereplicated FASTAs and non-redundant genome generation

This step performs 100% clustering over all dereplicated FASTAs generated in Step 2, to further dereplicate across all sequences.<br>

The Step 3 program takes the dereplicated FASTA files made in Step 2 as input, and produces:<br>

  • a FASTA dereplicated across all sequences, for the reciprocal blast query
  • the length of each non-redundant sequence
  • pre-clustered results (one row per cluster, tab-separated)
  • each genome with redundancy removed, for makeblastdb
  • gene ID information: a tab-delimited table, first column the generated short ID, second column the description line of the original FASTA
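The global 100% pre-clustering can be pictured as grouping identical sequences across all genomes and electing one representative per group. A hypothetical Python sketch (not Step3_pre_cluster's actual code):

```python
from collections import defaultdict

def pre_cluster(genomes):
    """Global 100% pre-clustering sketch. genomes maps genome name ->
    {gene_id: sequence}. Returns (representatives, clusters): one
    representative gene ID per unique sequence (the blast query set),
    and each cluster of IDs sharing that sequence (one row of the
    pre_cluster table)."""
    by_seq = defaultdict(list)
    for genome, genes in genomes.items():
        for gene_id, seq in genes.items():
            by_seq[seq].append(gene_id)
    representatives = {ids[0]: seq for seq, ids in by_seq.items()}
    clusters = list(by_seq.values())
    return representatives, clusters
```

Only the representatives are blasted against each other, which is what makes the reciprocal blast stage scale with the number of unique sequences rather than the raw gene count.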

You may run Step3_pre_cluster as follows:

Usage: Step3_pre_cluster -i concatenated.fasta -d dereped/ -n nr_genome/ -l seq_len.txt -p pre_cluster.txt [options...]

  -i or --input_path -----> <dir> path/to/directory of dereplicated FASTA from Step2
  -d or --derep_fasta ----> <fasta> path/to/output/dereplicated concatenated FASTA
  -n or --nr_genomes -----> <dir> path/to/directory/of/output/non-redundant/genomes
  -l or --seq_len_info ---> <txt> path/to/output/sequence_length_table
  -m or --id_info --------> <txt> path/to/output/id_info_table
  -p or --pre_cluster ----> <txt> path/to/output/pre_clustered_file
  -u or --thread_number --> <int> thread number, default: 1
  -h or --help -----------> display this information