StrainScan
High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers
An efficient, accurate, and high-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers. StrainScan takes a reference database and sequencing data as input and outputs a strain-level microbiome composition analysis report.
Contributors: Liao Herui and Ji Yongxin (Ph.D of City University of Hong Kong, EE), Nick Youngblut
E-mail: heruiliao2-c@my.cityu.edu.hk / yxjijms@gmail.com
Version: V1.0.14 (updated on 2023-10-13)
<details> <summary>Click here to check the log of all updates</summary>
[Update - 2022 - 05 - 01] : <BR/>
- V1.0.3: StrainScan can be installed via bioconda now! <BR/>
[Update - 2022 - 06 - 07] : <BR/>
- V1.0.10: Added multi-threading to the reference database construction! <BR/>
[Update - 2023 - 02 - 07] : <BR/>
- Two new intra-cluster searching modes are updated: plasmid_mode and extraRegion_mode.<BR/>
[Update - 2023 - 04 - 22] : <BR/>
- StrainScan is able to take gzipped and PE FASTQs as input now!<BR/>
[Update - 2023 - 05 - 04] : <BR/>
- StrainScan is able to take the custom clustering file generated by other clustering methods (e.g. PopPunk)! Additionally, we have made updates to the script (StrainScan_subsample.py) which now enables users to subsample their strains using hierarchical clustering. <BR/>
[Update - 2023 - 05 - 15] : <BR/>
- Add a parameter "-b" that enables StrainScan to output the probability of detecting a strain in samples with low sequencing depth (e.g. <1X).<BR/>
[Update - 2023 - 09 - 23] : <BR/>
- One database containing 1627 Staphylococcus aureus strains is publicly available!<BR/>
</details>
[Update - 2023 - 09 - 29] : <BR/>
- V1.0.13: Updated the Bioconda version to the latest GitHub version, which adds new functions! <BR/>
[Update - 2023 - 10 - 03] : <BR/>
- One database containing 1124 Lactobacillus crispatus strains is publicly available!<BR/>
[Update - 2023 - 10 - 13] : <BR/>
- V1.0.14: Fix a bug in identify.py about the identification of low-depth strains! <BR/>
Overview of StrainScan:
<div align=center><img width="500" height="500" src="https://user-images.githubusercontent.com/22760266/152946273-b39c5c10-9a96-4572-b409-e7a8db53d9e4.png" alt="strainscan_overview_new"></div>
Dependencies:
- Python ==3.7.x
- R
- SibeliaZ ==1.2.2 (https://github.com/medvedevgroup/SibeliaZ)
- Required Python packages: numpy==1.17.3, pandas==1.0.1, biopython==1.74, scipy==1.3.1, scikit-learn==0.23.1, bidict==0.21.3, treelib==1.6.1
Make sure these programs have been installed before using StrainScan.
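One way to check the pinned Python packages is to compare them against the active environment. The sketch below is not part of StrainScan: the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes the installed distribution names match the names in the list above. It relies on `importlib.metadata` (Python 3.8+); on the Python 3.7 that StrainScan pins, the `importlib_metadata` backport would need to be installed, which is why the import is wrapped in a fallback.

```python
# Pinned versions copied from the dependency list above.
REQUIRED = {
    "numpy": "1.17.3",
    "pandas": "1.0.1",
    "biopython": "1.74",
    "scipy": "1.3.1",
    "scikit-learn": "0.23.1",
    "bidict": "0.21.3",
    "treelib": "1.6.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        import importlib_metadata as metadata  # backport, assumed installed on 3.7
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a (wanted, found) pair; found is None when the package is missing."""
    mismatches = {}
    for name, wanted in required.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in sorted(find_mismatches(REQUIRED).items()):
        print(f"{name}: need {wanted}, found {found}")
```

An empty result means every pinned package is present at the listed version. R and SibeliaZ are not Python packages and still need to be checked separately.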
Install (Linux or Ubuntu only)
Option 1 - The first way to install StrainScan is to use bioconda. Once you have a bioconda environment set up, install the strainscan package:
conda install -c bioconda strainscan
Note that some commands are renamed when StrainScan is installed through bioconda (see the table below).

Command (Not bioconda) | Command (bioconda)
------------ | -------------
python StrainScan.py -h | strainscan -h
python StrainScan_build.py -h | strainscan_build -h
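For wrapper scripts that must work with either install method, the command mapping above can be resolved at run time. This is a minimal sketch, not something StrainScan ships: `resolve_command` and `BIOCONDA_COMMANDS` are hypothetical names, and only the two commands listed in the table are mapped, since the README does not state bioconda names for the other scripts.

```python
import shutil

# Script name -> bioconda entry point, as listed in the table above.
BIOCONDA_COMMANDS = {
    "StrainScan.py": "strainscan",
    "StrainScan_build.py": "strainscan_build",
}


def resolve_command(script, which=shutil.which):
    """Return the argv prefix to use for a StrainScan script.

    Prefers the bioconda entry point when it is found on PATH; otherwise
    falls back to running the script with python from the cloned repository.
    """
    entry_point = BIOCONDA_COMMANDS.get(script)
    if entry_point and which(entry_point):
        return [entry_point]
    return ["python", script]


if __name__ == "__main__":
    # Append "-h" to print the help text of whichever variant is available.
    print(resolve_command("StrainScan.py") + ["-h"])
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.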
Option 2 - Alternatively, you can install StrainScan via Anaconda using the commands below:<BR/>
git clone https://github.com/liaoherui/StrainScan.git<BR/>
cd StrainScan<BR/>
conda env create -f environment_candidate.yaml<BR/>
conda activate strainscan<BR/>
chmod 755 library/jellyfish-linux<BR/>
chmod 755 library/dashing_s128<BR/>
Option 3 - Or, you can install all dependencies of StrainScan manually and then run the commands below.
git clone https://github.com/liaoherui/StrainScan.git<BR/>
cd StrainScan<BR/>
chmod 755 library/jellyfish-linux<BR/>
chmod 755 library/dashing_s128<BR/>
Pre-built databases download
The table below offers information about the pre-built databases of 6 bacterial species used in the paper and 2 additional bacterial species. Users can download these databases and use them to identify strains directly.
Species | Source | Number of Strains | Number of Clusters | Download link
------------ | -------------| ------------- | ------------- | -------------
Akkermansia muciniphila | NCBI | 157 | 53 | Google drive
Cutibacterium acnes | NCBI | 275 | 28 | Google drive
Prevotella copri | NCBI | 112 | 51 | Google drive
Escherichia coli | NCBI | 1433 | 823 | Google drive
Mycobacterium tuberculosis | NCBI | 792 | 25 | Google drive
Staphylococcus epidermidis | NCBI | 995 | 378 | Google drive
Staphylococcus aureus | NCBI | 1627 | 202 | Google drive
Lactobacillus crispatus | NCBI | 1124 | 311 | Google drive
You can also use bash scripts in the folder "Download_DB_script" to download the pre-built databases from Google drive. For example,
cd Download_DB_script<BR/>
sh ecoli_db.sh<BR/>
If you cannot download the databases from Google Drive, you can try downloading them from Baidu Netdisk instead.<BR/>
Baidu Netdisk link: https://pan.baidu.com/s/1YFtHv2weJEBdwTz4LmTKOQ <BR/>
Extraction code: ASDF<BR/>
Usage
An example of the database construction and identification commands can be found in "<b>test_run.sh</b>".
Use StrainScan to build your own custom database.<BR/>
python StrainScan_build.py -i <Input_Genomes> -o <Database_Dir><BR/>
<BR/>eg:
python StrainScan_build.py -i Test_genomes -o DB_Small<BR/>
(Note: input fasta can be gzipped format)
Use StrainScan to build your own custom database with custom clustering file.
python StrainScan_build.py -i <Input_Genomes> -c <Cluster_file> -o <Database_Dir><BR/>
<BR/> The data format of the input clustering file can be found in the demo file Custom_cluster_demo/custom_cls.txt, where the first column is the cluster ID, the second column is the cluster size, and the last column is the prefix of the reference genomes in the cluster.
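Before building a database from a custom clustering file, it can help to check that each declared cluster size matches the number of genomes listed. The parser below is a hedged sketch, not StrainScan's own reader: `parse_cluster_file` is a hypothetical helper, and it assumes tab-separated columns with comma-separated genome prefixes in the last column. Neither delimiter is confirmed here, so compare against Custom_cluster_demo/custom_cls.txt and adjust if the real file differs.

```python
def parse_cluster_file(lines):
    """Parse lines of a custom clustering file into {cluster_id: [prefixes]}.

    ASSUMPTION: columns are tab-separated and the last column holds
    comma-separated genome prefixes. Adjust the delimiters if the demo
    file Custom_cluster_demo/custom_cls.txt uses a different layout.
    """
    clusters = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected at least 3 columns")
        cluster_id, size = fields[0], int(fields[1])
        prefixes = [p for p in fields[-1].split(",") if p]
        if len(prefixes) != size:
            raise ValueError(
                f"line {line_no}: size {size} but {len(prefixes)} prefixes"
            )
        if cluster_id in clusters:
            raise ValueError(f"line {line_no}: duplicate cluster {cluster_id}")
        clusters[cluster_id] = prefixes
    return clusters


if __name__ == "__main__":
    with open("Custom_cluster_demo/custom_cls.txt") as handle:
        clusters = parse_cluster_file(handle)
    print(f"{len(clusters)} clusters, "
          f"{sum(len(p) for p in clusters.values())} genomes")
```

Raising on the first inconsistent line keeps a malformed file from reaching the much slower database build step.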
Use StrainScan_subsample to subsample your large-scale strains with high redundancy.
python StrainScan_subsample.py -i <Input_Genomes> -o <Output_Dir><BR/>
<BR/> If you have a large number of highly redundant strains and want to remove the redundancy, you can use this script, which uses dashing and hierarchical clustering to subsample strains efficiently. The subsampled strains can be found in <Output_Dir>/Rep_ref and the clustering information can be found in <Output_Dir>/Cls_res. More parameters can be listed with python StrainScan_subsample.py -h.
Use StrainScan to identify bacterial strains in short reads.
python StrainScan.py -i <Input_reads> -d <Database_Dir> -o <Output_Dir><BR/>
<BR/>eg:
python StrainScan.py -i Sim_Data/GCF_003812785.fq -d DB_Small -o Test_Sim/GCF_003812785<BR/>
or
python StrainScan.py -i Sim_Data_mul/GCA_000144385_5X_GCF_008868325_5X.fq -d DB_Small -o Test_Sim/GCA_000144385_5X_GCF_008868325_5X <BR/>
For paired-end (PE) reads (gzipped FASTQ is also accepted):<BR/>
python StrainScan.py -i GCF_003812785_1.fq.gz -j GCF_003812785_2.fq.gz -d DB_Small -o Test_Sim/GCF_003812785<BR/>
Note: if the sequencing depth of the targeted strains is very low (e.g., <1X), you may get "Warning: No clusters can be detected!". In this case, you can use the parameter "-b" to output the probability of detecting a strain (cluster) in low-depth samples. The higher the probability, the more likely the strain (cluster) is to be present.<BR/>
eg:
python StrainScan.py -i <Input_reads> -d <Database_Dir> -b 1 -o <Output_Dir><BR/>
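When running many samples, the identification commands above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_identify_command` is a hypothetical helper, it targets the non-bioconda script name, and it uses only the flags shown in this section (`-i`, `-j`, `-d`, `-b`, `-o`).

```python
import shlex


def build_identify_command(reads, database, output, mate=None, low_depth=False):
    """Build the argv list for StrainScan.py from the options shown above.

    reads/mate map to -i/-j, database to -d, output to -o; low_depth adds
    "-b 1" to report detection probabilities for low-depth samples.
    """
    argv = ["python", "StrainScan.py", "-i", reads]
    if mate is not None:
        argv += ["-j", mate]  # second file of a paired-end run
    argv += ["-d", database]
    if low_depth:
        argv += ["-b", "1"]
    argv += ["-o", output]
    return argv


if __name__ == "__main__":
    cmd = build_identify_command(
        "GCF_003812785_1.fq.gz", "DB_Small", "Test_Sim/GCF_003812785",
        mate="GCF_003812785_2.fq.gz",
    )
    # Print the command for inspection; pass the list to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a single string avoids shell quoting problems when file names contain spaces.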
Use StrainScan to identify plasmids of bacterial strains in short reads.
option-1: identify possible plasmids by using contigs <100000 bp:<BR/>
python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 1 -r <Ref_genome_Dir> -o <Output_Dir><BR/>
option-2: identify possible plasmids (or possible strains) using reference genomes provided by "-r".<BR/>
python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 2 -r <Ref_genome_Dir> -o <Output_Dir><BR/>
<Ref_genome_Dir> refers to the directory containing the reference genomes of the identified clusters, or of all strains used to build the database.
Use StrainScan to identify bacterial strains in short reads under extraRegion_mode.
This mode will search possible strains and return strains with extra regions (could be different genes, SNVs or SVs to the possible strains) covered. If there is a novel strain not in the database, then its closest relative can be one specific strain while its partial regions (we call them "extraRegion" ) in the genome can be similar to other strains. In this case, this mode ca
