StrainScan

High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers


An efficient, accurate, and high-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers. StrainScan takes a reference database and sequencing data as input and outputs a strain-level microbiome composition analysis report.

Contributors: Liao Herui and Ji Yongxin (Ph.D of City University of Hong Kong, EE), Nick Youngblut

E-mail: heruiliao2-c@my.cityu.edu.hk / yxjijms@gmail.com

Version: V1.0.14 (updated on 2023-10-13)

<details> <summary>Click here to check the log of all updates</summary>

[Update - 2022 - 05 - 01] :

V1.0.3: StrainScan can be installed via bioconda now!

[Update - 2022 - 06 - 07] :

V1.0.10: Added multi-threading to reference database construction!

[Update - 2023 - 02 - 07] :

Two new intra-cluster searching modes have been added: plasmid_mode and extraRegion_mode.

[Update - 2023 - 04 - 22] :

StrainScan can now take gzipped and paired-end (PE) FASTQ files as input!

[Update - 2023 - 05 - 04] :

StrainScan can now take a custom clustering file generated by other clustering methods (e.g. PopPUNK)! Additionally, the script StrainScan_subsample.py has been updated and now lets users subsample their strains using hierarchical clustering.

[Update - 2023 - 05 - 15] :

Added a parameter "-b" that enables StrainScan to output the probability of detecting a strain in samples with low sequencing depth (e.g. <1X).

[Update - 2023 - 09 - 23] :

A database containing 1627 Staphylococcus aureus strains is now publicly available!
</details>

[Update - 2023 - 09 - 29] :

V1.0.13: Updated the Bioconda version to the latest GitHub version, which adds new functions!

[Update - 2023 - 10 - 03] :

A database containing 1124 Lactobacillus crispatus strains is now publicly available!

[Update - 2023 - 10 - 13] :

V1.0.14: Fixed a bug in identify.py affecting the identification of low-depth strains!

Overview of StrainScan:

Dependencies:

Python ==3.7.x
R
SibeliaZ ==1.2.2 (https://github.com/medvedevgroup/SibeliaZ)
Required Python packages: numpy==1.17.3, pandas==1.0.1, biopython==1.74, scipy==1.3.1, scikit-learn==0.23.1, bidict==0.21.3, treelib==1.6.1

Make sure these programs have been installed before using StrainScan.
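To check an existing environment against the pinned package versions listed above, a small script along the following lines may help. This is only a sketch: the function names (`installed_version`, `find_mismatches`) are made up for illustration and are not part of StrainScan, and it does not check R or SibeliaZ.

```python
# Pinned versions copied from the dependency list above.
PINNED = {
    "numpy": "1.17.3",
    "pandas": "1.0.1",
    "biopython": "1.74",
    "scipy": "1.3.1",
    "scikit-learn": "0.23.1",
    "bidict": "0.21.3",
    "treelib": "1.6.1",
}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        from importlib import metadata  # available from Python 3.8
    except ImportError:
        import pkg_resources  # fallback for Python 3.7 (part of setuptools)
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (wanted, found)} for each package whose version differs."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

The `lookup` argument is there only so the comparison can be exercised without touching the real environment.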

Install (Linux or Ubuntu only)

Option 1 - The first way to install StrainScan is to use bioconda. Once you have a bioconda environment set up, install the strainscan package:

conda install -c bioconda strainscan

It should be noted that some commands have been replaced if you install StrainScan using bioconda. (See below)

Command (Not bioconda) | Command (bioconda)
------------ | -------------
python StrainScan.py -h | strainscan -h
python StrainScan_build.py -h | strainscan_build -h
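For scripts that need to work with either installation route, the mapping in the table above can be resolved at run time. The sketch below is an assumption about how one might do this, not something StrainScan provides: `resolve_command` is a hypothetical helper, and the fallback assumes the working directory is the cloned StrainScan repository.

```python
import shutil

# Mapping taken from the table above: script name -> bioconda command.
BIOCONDA_NAMES = {
    "StrainScan.py": "strainscan",
    "StrainScan_build.py": "strainscan_build",
}


def resolve_command(script, which=shutil.which):
    """Return the argv prefix for `script`, preferring the bioconda command."""
    entry_point = BIOCONDA_NAMES[script]
    if which(entry_point) is not None:
        return [entry_point]  # bioconda install found on PATH
    return ["python", script]  # fall back to the script in the repository


if __name__ == "__main__":
    print(resolve_command("StrainScan.py") + ["-h"])
```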

Option 2 - Alternatively, you can install StrainScan via Anaconda using the commands below:

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

conda env create -f environment_candidate.yaml
conda activate strainscan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

Option 3 - Or, you can install all dependencies of StrainScan manually and then run the commands below.

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

Pre-built databases download

The table below offers information about the pre-built databases of 6 bacterial species used in the paper and 2 additional bacterial species. Users can download these databases and use them to identify strains directly.

Species | Source | Number of Strains | Number of Clusters | Download link
------------ | ------------- | ------------- | ------------- | -------------
Akkermansia muciniphila | NCBI | 157 | 53 | Google drive
Cutibacterium acnes | NCBI | 275 | 28 | Google drive
Prevotella copri | NCBI | 112 | 51 | Google drive
Escherichia coli | NCBI | 1433 | 823 | Google drive
Mycobacterium tuberculosis | NCBI | 792 | 25 | Google drive
Staphylococcus epidermidis | NCBI | 995 | 378 | Google drive
Staphylococcus aureus | NCBI | 1627 | 202 | Google drive
Lactobacillus crispatus | NCBI | 1124 | 311 | Google drive

You can also use bash scripts in the folder "Download_DB_script" to download the pre-built databases from Google drive. For example,

cd Download_DB_script
sh ecoli_db.sh

If you cannot download the databases from Google Drive, you can try downloading them from Baidu Netdisk.
Baidu Netdisk link: https://pan.baidu.com/s/1YFtHv2weJEBdwTz4LmTKOQ
Extraction code: ASDF

Usage

An example of the database construction and identification commands can be found in "test_run.sh".

Use StrainScan to build your own custom database.

python StrainScan_build.py -i <Input_Genomes> -o <Database_Dir>
e.g.: python StrainScan_build.py -i Test_genomes -o DB_Small

(Note: input FASTA files can be gzipped.)

Use StrainScan to build your own custom database with custom clustering file.

python StrainScan_build.py -i <Input_Genomes> -c <Cluster_file> -o <Database_Dir>

The data format of the input clustering file can be found in the demo file Custom_cluster_demo/custom_cls.txt, where the first column is the cluster ID, the second column is the cluster size, and the last column is the prefix of the reference genomes in the cluster.
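Before passing a clustering file to `-c`, it can be useful to check that the declared cluster sizes agree with the listed genomes. The sketch below follows the three-column description above, but the separators are assumptions: it treats columns as tab-separated and multiple genome prefixes as comma-separated, which should be confirmed against Custom_cluster_demo/custom_cls.txt. The function name and the genome prefixes in the demo are hypothetical.

```python
def parse_cluster_file(lines, column_sep="\t", member_sep=","):
    """Parse cluster records into {cluster_id: [genome_prefix, ...]}.

    Raises ValueError when the declared cluster size (second column)
    disagrees with the number of genome prefixes in the last column.
    """
    clusters = {}
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(column_sep)
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: expected at least 3 columns")
        cluster_id = fields[0]
        declared_size = int(fields[1])
        members = [m for m in fields[-1].split(member_sep) if m]
        if len(members) != declared_size:
            raise ValueError(
                f"line {line_number}: cluster {cluster_id} declares "
                f"{declared_size} genomes but lists {len(members)}"
            )
        clusters[cluster_id] = members
    return clusters


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = ["1\t2\tgenomeA,genomeB", "2\t1\tgenomeC"]
    print(parse_cluster_file(demo))
```

To validate a real file, pass an open file handle, for example `parse_cluster_file(open("custom_cls.txt"))`, adjusting the separators if the demo file uses different ones.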

Use StrainScan_subsample to subsample your large-scale strains with high redundancy.

python StrainScan_subsample.py -i <Input_Genomes> -o <Output_Dir>

If you have a large set of strains with high redundancy and want to remove that redundancy, you can use this script, which uses dashing and hierarchical clustering to subsample strains efficiently. The subsampled strains can be found in <Output_Dir>/Rep_ref and the clustering information can be found in <Output_Dir>/Cls_res. More parameters can be listed with python StrainScan_subsample.py -h.

Use StrainScan to identify bacterial strains in short reads.

python StrainScan.py -i <Input_reads> -d <Database_Dir> -o <Output_Dir>
e.g.: python StrainScan.py -i Sim_Data/GCF_003812785.fq -d DB_Small -o Test_Sim/GCF_003812785
or: python StrainScan.py -i Sim_Data_mul/GCA_000144385_5X_GCF_008868325_5X.fq -d DB_Small -o Test_Sim/GCA_000144385_5X_GCF_008868325_5X

PE reads (can be in gzipped FASTQ format):
python StrainScan.py -i GCF_003812785_1.fq.gz -j GCF_003812785_2.fq.gz -d DB_Small -o Test_Sim/GCF_003812785

Note: if the sequencing depth of the targeted strains is very low (e.g., <1X), you may get "Warning: No clusters can be detected!". In this case, you can use the parameter "-b" to output the probability of detecting a strain (cluster) in low-depth samples. The higher the probability, the more likely the strain (cluster) is to be present.
e.g.: python StrainScan.py -i <Input_reads> -d <Database_Dir> -b 1 -o <Output_Dir>
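When running identification over many samples, the commands above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in this section (`-i`, `-j`, `-d`, `-b`, `-o`); the helper names and keyword arguments are hypothetical, and the default launcher assumes the script is run from the cloned repository (pass `launcher=("strainscan",)` for a bioconda install).

```python
import subprocess


def build_identify_command(reads, database_dir, output_dir,
                           mate_reads=None, low_depth=False,
                           launcher=("python", "StrainScan.py")):
    """Assemble the argv list for a StrainScan identification run."""
    command = list(launcher) + ["-i", reads]
    if mate_reads is not None:
        command += ["-j", mate_reads]  # second file of a paired-end run
    command += ["-d", database_dir]
    if low_depth:
        command += ["-b", "1"]  # output detection probabilities for low-depth samples
    command += ["-o", output_dir]
    return command


def run_identify(**kwargs):
    """Run StrainScan; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_identify_command(**kwargs), check=True)


if __name__ == "__main__":
    # Prints the paired-end example from above without executing it.
    print(" ".join(build_identify_command(
        "GCF_003812785_1.fq.gz", "DB_Small", "Test_Sim/GCF_003812785",
        mate_reads="GCF_003812785_2.fq.gz")))
```

Building the command as a list rather than a shell string avoids quoting problems with unusual file names.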

Use StrainScan to identify plasmids of bacterial strains in short reads.

Option 1: identify possible plasmids using contigs <100000 bp:
python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 1 -r <Ref_genome_Dir> -o <Output_Dir>

Option 2: identify possible plasmids (or possible strains) using the reference genomes provided by "-r":
python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 2 -r <Ref_genome_Dir> -o <Output_Dir>

<Ref_genome_Dir> refers to the directory containing the reference genomes of the identified clusters, or of all strains used to build the database.

Use StrainScan to identify bacterial strains in short reads under extraRegion_mode.

This mode will search possible strains and return strains with extra regions (could be different genes, SNVs or SVs to the possible strains) covered. If there is a novel strain not in the database, then its closest relative can be one specific strain while its partial regions (we call them "extraRegion" ) in the genome can be similar to other strains. In this case, this mode ca
