
StrainScan

High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers


An efficient, accurate, and high-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers. StrainScan takes a reference database and sequencing data as input and outputs a strain-level microbiome composition analysis report.

Contributors: Liao Herui and Ji Yongxin (Ph.D., City University of Hong Kong, EE), Nick Youngblut

E-mail: heruiliao2-c@my.cityu.edu.hk / yxjijms@gmail.com

Version: V1.0.14 (updated on 2023-10-13)

<details> <summary>Click here to check the log of all updates</summary>

[Update - 2022 - 05 - 01] : <BR/>

  • V1.0.3: StrainScan can be installed via bioconda now! <BR/>

[Update - 2022 - 06 - 07] : <BR/>

  • V1.0.10: Add multiple threads to the reference database construction! <BR/>

[Update - 2023 - 02 - 07] : <BR/>

  • Two new intra-cluster searching modes are updated: plasmid_mode and extraRegion_mode.<BR/>

[Update - 2023 - 04 - 22] : <BR/>

  • StrainScan is able to take gzipped and PE FASTQs as input now!<BR/>

[Update - 2023 - 05 - 04] : <BR/>

  • StrainScan is able to take the custom clustering file generated by other clustering methods (e.g. PopPunk)! Additionally, we have made updates to the script (StrainScan_subsample.py) which now enables users to subsample their strains using hierarchical clustering. <BR/>

[Update - 2023 - 05 - 15] : <BR/>

  • Add a parameter "-b" that enables StrainScan to output the probability of detecting a strain in samples with low sequencing depth (e.g. <1X).<BR/>

[Update - 2023 - 09 - 23] : <BR/>

  • One database containing 1627 Staphylococcus aureus strains is publicly available!<BR/>

    </details>

[Update - 2023 - 09 - 29] : <BR/>

  • V1.0.13: Update the Bioconda version to the latest GitHub version, which adds new functions! <BR/>

[Update - 2023 - 10 - 03] : <BR/>

  • One database containing 1124 Lactobacillus crispatus strains is publicly available!<BR/>

[Update - 2023 - 10 - 13] : <BR/>

  • V1.0.14: Fix a bug in identify.py about the identification of low-depth strains! <BR/>

Overview of StrainScan:

<div align=center><img width="500" height="500" src="https://user-images.githubusercontent.com/22760266/152946273-b39c5c10-9a96-4572-b409-e7a8db53d9e4.png" alt="strainscan_overview_new"></div>

Dependencies:

  • Python ==3.7.x
  • R
  • Sibeliaz ==1.2.2 (https://github.com/medvedevgroup/SibeliaZ)
  • Required python package: numpy==1.17.3, pandas==1.0.1, biopython==1.74, scipy==1.3.1, scikit-learn==0.23.1, bidict==0.21.3, treelib==1.6.1

Make sure these programs have been installed before using StrainScan.
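
The pinned versions above can be checked before running anything. The following is a minimal sketch, not part of StrainScan: the function names `installed_version` and `check_pins` are hypothetical, and it assumes the packages were installed under their usual distribution names (`biopython`, `scikit-learn`, and so on). It falls back to `pkg_resources` because `importlib.metadata` is not available on Python 3.7.

```python
# Pins copied from the dependency list above.
PINNED = {
    "numpy": "1.17.3",
    "pandas": "1.0.1",
    "biopython": "1.74",
    "scipy": "1.3.1",
    "scikit-learn": "0.23.1",
    "bidict": "0.21.3",
    "treelib": "1.6.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python >= 3.8
    except ImportError:
        import pkg_resources  # fallback for Python 3.7 (ships with setuptools)
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check_pins(pins=PINNED, lookup=installed_version):
    """Return {package: (expected, found)} for every package that deviates."""
    problems = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The script prints nothing when every pinned package matches. It does not check R or Sibeliaz, which are installed outside Python.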

Install (Linux or Ubuntu only)

Option 1 - The first way to install StrainScan is to use bioconda. Once you have a bioconda environment set up, install the strainscan package:

conda install -c bioconda strainscan

It should be noted that some commands have been replaced if you install StrainScan using bioconda. (See below)

Command (Not bioconda) | Command (bioconda)
------------ | -------------
python StrainScan.py -h | strainscan -h
python StrainScan_build.py -h | strainscan_build -h
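
For scripts that need to work with either installation, the mapping above can be wrapped in a small helper. This is only a sketch: `base_command` and `BIOCONDA_NAMES` are hypothetical names, and the auto-detection assumes the bioconda entry points are on `PATH` when that installation is in use.

```python
import shutil

# Script name -> bioconda entry point, as listed in the table above.
BIOCONDA_NAMES = {
    "StrainScan.py": "strainscan",
    "StrainScan_build.py": "strainscan_build",
}


def base_command(script, bioconda=None):
    """Return the argv prefix for a StrainScan script.

    bioconda=None auto-detects by looking for the entry point on PATH.
    """
    entry_point = BIOCONDA_NAMES[script]
    if bioconda is None:
        bioconda = shutil.which(entry_point) is not None
    return [entry_point] if bioconda else ["python", script]


print(base_command("StrainScan.py", bioconda=False) + ["-h"])
print(base_command("StrainScan_build.py", bioconda=True) + ["-h"])
```

The returned list can be extended with further arguments and passed to `subprocess.run`.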

Option 2 - Alternatively, you can install StrainScan via Anaconda using the commands below:<BR/>

git clone https://github.com/liaoherui/StrainScan.git<BR/> cd StrainScan<BR/>

conda env create -f environment_candidate.yaml<BR/> conda activate strainscan<BR/>

chmod 755 library/jellyfish-linux<BR/> chmod 755 library/dashing_s128<BR/>

<!---Note: if the command `conda env create -f environment_candidate.yaml` outputs an error (which is most likely caused by your machine): `ResolvePackageNotFound...`, then you can try the command `conda env create -f environment_candidate.yaml`-->

Option 3 - Or, you can install all dependencies of StrainScan manually and then run the commands below.

git clone https://github.com/liaoherui/StrainScan.git<BR/> cd StrainScan<BR/>

chmod 755 library/jellyfish-linux<BR/> chmod 755 library/dashing_s128<BR/>

Pre-built databases download

The table below offers information about the pre-built databases of 6 bacterial species used in the paper and 2 additional bacterial species. Users can download these databases and use them to identify strains directly.

Species | Source | Number of Strains | Number of Clusters | Download link
------------ | ------------- | ------------- | ------------- | -------------
Akkermansia muciniphila | NCBI | 157 | 53 | Google drive
Cutibacterium acnes | NCBI | 275 | 28 | Google drive
Prevotella copri | NCBI | 112 | 51 | Google drive
Escherichia coli | NCBI | 1433 | 823 | Google drive
Mycobacterium tuberculosis | NCBI | 792 | 25 | Google drive
Staphylococcus epidermidis | NCBI | 995 | 378 | Google drive
Staphylococcus aureus | NCBI | 1627 | 202 | Google drive
Lactobacillus crispatus | NCBI | 1124 | 311 | Google drive

You can also use bash scripts in the folder "Download_DB_script" to download the pre-built databases from Google drive. For example,

cd Download_DB_script<BR/> sh ecoli_db.sh<BR/>

If you cannot download the databases from Google drive, you can try downloading them from Baidu Netdisk.<BR/> Baidu Netdisk link: https://pan.baidu.com/s/1YFtHv2weJEBdwTz4LmTKOQ <BR/> Extraction code: ASDF<BR/>

Usage

An example of the database construction and identification commands can be found in "<b>test_run.sh</b>".

Use StrainScan to build your own custom database.<BR/>

python StrainScan_build.py -i <Input_Genomes> -o <Database_Dir><BR/> <BR/>eg: python StrainScan_build.py -i Test_genomes -o DB_Small<BR/>

(Note: input FASTA files can be gzipped.)

Use StrainScan to build your own custom database with custom clustering file.

python StrainScan_build.py -i <Input_Genomes> -c <Cluster_file> -o <Database_Dir><BR/> <BR/> The data format of the input clustering file can be found in the demo file Custom_cluster_demo/custom_cls.txt, where the first column is the cluster ID, the second column is the cluster size, and the last column is the prefix of the reference genomes in the cluster.
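
Before passing a clustering file to `-c`, it can help to confirm that each declared cluster size matches the number of listed genome prefixes. The sketch below is not part of StrainScan. It ASSUMES tab-separated columns and comma-separated prefixes in the last column; neither delimiter is stated above, so check both against Custom_cluster_demo/custom_cls.txt and adjust `col_sep` and `member_sep` if they differ. The function name and the example records are hypothetical.

```python
def parse_cluster_lines(lines, col_sep="\t", member_sep=","):
    """Parse cluster records into {cluster_id: [genome_prefix, ...]}.

    col_sep and member_sep are ASSUMPTIONS; compare with the demo file first.
    """
    clusters = {}
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(col_sep)
        if len(fields) < 3:
            raise ValueError(f"line {lineno}: expected 3 columns, got {len(fields)}")
        # First column: cluster ID, second: cluster size, last: genome prefixes.
        cluster_id, size, members = fields[0], int(fields[1]), fields[-1]
        prefixes = [m for m in members.split(member_sep) if m]
        if len(prefixes) != size:
            raise ValueError(
                f"line {lineno}: cluster {cluster_id} declares size {size} "
                f"but lists {len(prefixes)} prefixes"
            )
        if cluster_id in clusters:
            raise ValueError(f"line {lineno}: duplicate cluster ID {cluster_id}")
        clusters[cluster_id] = prefixes
    return clusters


# Hypothetical records, not taken from the demo file.
example = ["C1\t2\tgenomeA,genomeB", "C2\t1\tgenomeC"]
print(parse_cluster_lines(example))
```

To validate a real file, pass an open file handle, for instance `parse_cluster_lines(open("custom_cls.txt"))`.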

Use StrainScan_subsample to subsample your large-scale strains with high redundancy.

python StrainScan_subsample.py -i <Input_Genomes> -o <Output_Dir><BR/> <BR/> If you have a large collection of highly redundant strains and want to remove the redundancy, you can use this script, which uses dashing and hierarchical clustering to subsample strains efficiently. The subsampled strains can be found in <Output_Dir>/Rep_ref and the clustering information in <Output_Dir>/Cls_res. More parameters can be listed with python StrainScan_subsample.py -h.


Use StrainScan to identify bacterial strains in short reads.

python StrainScan.py -i <Input_reads> -d <Database_Dir> -o <Output_Dir><BR/> <BR/>eg: python StrainScan.py -i Sim_Data/GCF_003812785.fq -d DB_Small -o Test_Sim/GCF_003812785<BR/> or python StrainScan.py -i Sim_Data_mul/GCA_000144385_5X_GCF_008868325_5X.fq -d DB_Small -o Test_Sim/GCA_000144385_5X_GCF_008868325_5X <BR/>

PE reads (can be gzipped FASTQ format)<BR/> python StrainScan.py -i GCF_003812785_1.fq.gz -j GCF_003812785_2.fq.gz -d DB_Small -o Test_Sim/GCF_003812785<BR/>

Note: if the sequencing depth of targeted strains is super low (e.g., <1X), then you may get "Warning: No clusters can be detected!". In this case, you can use parameter "-b" to output the probability of detecting a strain (cluster) in low-depth samples. The higher the probability, the more likely the strain (cluster) is to be present.<BR/> eg: python StrainScan.py -i <Input_reads> -d <Database_Dir> -b 1 -o <Output_Dir><BR/>
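
When many samples are processed, the identification commands above can be assembled programmatically. This is a minimal sketch using only the flags shown above (`-i`, `-j`, `-d`, `-b`, `-o`). The function `identify_command` and its parameter names are hypothetical, and it assumes the non-bioconda script name; substitute `["strainscan"]` for the first two elements when using the bioconda installation.

```python
def identify_command(reads, database, output, mate=None, low_depth=False,
                     script="StrainScan.py"):
    """Build the argv list for a StrainScan identification run."""
    cmd = ["python", script, "-i", reads]
    if mate is not None:
        cmd += ["-j", mate]       # second FASTQ of a paired-end sample
    cmd += ["-d", database]
    if low_depth:
        cmd += ["-b", "1"]        # report detection probabilities at low depth
    cmd += ["-o", output]
    return cmd


cmd = identify_command(
    "GCF_003812785_1.fq.gz",
    "DB_Small",
    "Test_Sim/GCF_003812785",
    mate="GCF_003812785_2.fq.gz",
)
print(" ".join(cmd))
# To execute the run: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with unusual file names.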

Use StrainScan to identify plasmids of bacterial strains in short reads.

option-1: identify possible plasmids by using contigs <100000 bp:<BR/> python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 1 -r <Ref_genome_Dir> -o <Output_Dir><BR/>

option-2: identify possible plasmids (or possible strains) using reference genomes provided by "-r".<BR/> python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 2 -r <Ref_genome_Dir> -o <Output_Dir><BR/>

<Ref_genome_Dir> refers to the directory containing the reference genomes of the identified clusters, or of all strains used to build the database.
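
The two plasmid options differ only in the value passed to `-p`, so a thin wrapper can guard against an unsupported value. This is a sketch under the same assumptions as the usage lines above: `plasmid_command` is a hypothetical name, and it always passes `-r` because both documented options do.

```python
def plasmid_command(reads, database, ref_genome_dir, output, mode=1,
                    script="StrainScan.py"):
    """Build the argv list for a plasmid-mode run (-p 1 or -p 2)."""
    if mode not in (1, 2):
        raise ValueError("mode must be 1 (contigs < 100000 bp) or 2 (genomes from -r)")
    return [
        "python", script,
        "-i", reads,
        "-d", database,
        "-p", str(mode),
        "-r", ref_genome_dir,
        "-o", output,
    ]


print(" ".join(plasmid_command("reads.fq", "DB_Small", "Test_genomes", "out_dir", mode=2)))
```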

Use StrainScan to identify bacterial strains in short reads under extraRegion_mode.

This mode will search possible strains and return strains with extra regions (could be different genes, SNVs or SVs to the possible strains) covered. If there is a novel strain not in the database, then its closest relative can be one specific strain while its partial regions (we call them "extraRegion" ) in the genome can be similar to other strains. In this case, this mode ca
