PStrain
Pstrain profiles strains in metagenomics data. It infers strain abundance and genotype for each species. Also, it has a single species mode; where given a BAM and VCF, it can phase the variants for any species.
Install / Use
/learn @wshuai294/PStrainREADME
PStrain: An Iterative Microbial Strains Profiling Algorithm for Shotgun Metagenomic Sequencing Data
Quick start
PStrain with Metaphlan4
First, get the source code and construct the env
git clone https://github.com/wshuai294/PStrain.git --depth 1
cd PStrain/
conda env create --name pstrain -f pstrain_metaphlan4_env.yml
conda activate pstrain
Then get the metaphlan database
bash scripts/collect_metaphlan_datbase.sh -x mpa_vOct22_CHOCOPhlAnSGB_202403 -m 4 -d ./ #constructing the reference for Metaphlan4
python3 scripts/PStrain.py --help
test PStrain by:
cd test/
bash test_PStrain_V40.sh # remember edit the --bowtie2db and -x if you have downloaded the database yourself.
Also, you can install it using Bioconda directly by:
conda create --name pstrain -c bioconda -c conda-forge pstrain
conda activate pstrain
or install it in an existing environment:
conda install bioconda::pstrain
PStrain with Metaphlan3
First, get the source code and construct the env
git clone https://github.com/wshuai294/PStrain.git --depth 1
cd PStrain/
conda env create --name pstrain3 -f pstrain_metaphlan3_env.yml
conda activate pstrain3
Then get the metaphlan database
bash scripts/collect_metaphlan_datbase.sh -x mpa_v31_CHOCOPhlAn_201901 -m 3 -d ./ #constructing the reference for Metaphlan3
python3 scripts/PStrain.py --help
test PStrain by:
cd test/
bash test_PStrain_V30.sh # remember edit the --bowtie2db and -x if you have downloaded the database yourself.
PStrain with Metaphlan2
git clone https://github.com/wshuai294/PStrain.git --depth 1
cd PStrain/
conda env create --name pstrain2 -f pstrain_metaphlan2_env.yml
conda activate pstrain2
python3 scripts/PStrain_m2.py -h
Download db/ and packages/ from Google Drive or Zenodo, and put them in the same path with the scripts/ folder. Also, you can appoint the path of the dependencies by --dbdir, --prior and --metaphlan2.
test PStrain (with Metaphlan2) by:
cd test/
bash test_run.sh
Note:
- PStrain supports both single-end and paired-end reads.
Basic Usage
Main functions
| Scripts | Description | | --- | --- | |scripts/PStrain.py| Strain profiling based on the Metaphlan4 or Metaphlan3| |scripts/PStrain_m2.py| Strain profiling based on the Metaphlan2| |scripts/collect_metaphlan_datbase.sh| collect the Metaphlan4/3 database and build the bowtie2 index | |scripts/single_species.py| phase variants for a single genome, based on the BAM and VCF file| |scripts/merge.py| Merge the strain results for downstream analyses |
Run PStrain based on Metaphlan4 or Metaphlan3
usage: python3 PStrain.py [options] -c/--conf <config file> -o/--outdir <output directory> --bowtie2db <metaphlan database> -x <metaphlan db index>
Example: python3 PStrain.py -c config.txt -o out --bowtie2db ../mpa_vOct22_CHOCOPhlAnSGB_202403/ -x mpa_vOct22_CHOCOPhlAnSGB_202403 # Metaphlan 4
python3 PStrain.py --metaphlan_version 3 -c config.txt -o outdir/ --bowtie2db ../mpa_v31_CHOCOPhlAn_201901/ -x mpa_v31_CHOCOPhlAn_201901 # Metaphlan 3
The config file should follow the format:
//
sample: [sample1_ID]
fq1: [forward reads fastq]
fq2: [reverse/mate reads fastq]
//
sample: [sample2_ID]
fq1: [forward reads fastq]
fq2: [reverse/mate reads fastq]
...
Help information can be found by python3 PStrain.py -h/--help, config file format for single end reads , and additional information can be found in README.MD or https://github.com/wshuai294/PStrain.
PStrain: profile strains in shotgun metagenomic sequencing reads.
required arguments:
-c , --conf The configuration file of the input samples. (default: None)
-o , --outdir The output directory. (default: None)
--bowtie2db Path to metaphlan bowtie2db. (default:
/home/wangshuai/softwares/PStrain/scripts/../mpa_vOct22_CHOCOPhlAnSGB_202403)
-x , --metaphlan_index
metaphlan bowtie2db index. (default: mpa_vOct22_CHOCOPhlAnSGB_202403)
options:
-h, --help show this help message and exit
--metaphlan_version Metaphlan version [3 or 4]. Used to download the corresponding database. If you build the
metaphlan database yourself, just ignore this. (default: 4)
--bowtie2 Path to bowtie2 binary. (default: bowtie2)
--bowtie2-build Path to bowtie2 binary. (default: bowtie2-build)
--samtools Path to samtools binary. (default: samtools)
--metaphlan Path to metaphlan. (default: metaphlan)
--metaphlan_output_files
If you have run metaphlan already, give metaphlan result file in each line, the order should be
the same with config file. In particular, '--tax_lev s' should be added while running
metaphlan. (default: )
-p , --proc The number of process to use for parallelizing the whole pipeline, run a sample in each
process. (default: 1)
-n , --nproc The number of CPUs to use for parallelizing the mapping with bowtie2. (default: 1)
-w , --weight The weight of genotype frequencies while computing loss, then the weight of linked read type
frequencies is 1-w. The value is between 0~1. (default: 0.3)
--lambda1 The weight of prior knowledge while rectifying genotype frequencies. The value is between 0~1.
(default: 0.0)
--lambda2 The weight of prior estimation while rectifying second order genotype frequencies. The value is
between 0~1. (default: 0.0)
--species_dp The minimum depth of species to be considered in strain profiling step (default is 5).
(default: 5)
--snp_ratio The SNVs of which the depth are less than the specific ratio of the species's mean depth would
be removed. (default: 0.45)
--qual The minimum quality score of SNVs to be considered in strain profiling step. (default: 20)
--similarity The similarity cutoff of hierachical clustering in merge step. (default: 0.95)
--elbow The cutoff of elbow method while identifying strains number. If the loss reduction ratio is
less than the cutoff, then the strains number is determined. (default: 0.24)
--gatk Path to gatk binary. (default: /home/wangshuai/softwares/PStrain/scripts//GenomeAnalysisTK.jar)
--picard Path to picard binary. (default: /home/wangshuai/softwares/PStrain/scripts//picard.jar)
--prior The file of prior knowledge of genotype frequencies in the population. Not providing this is
also ok. (default: None)
If you have run metaphlan3 already for each file, to save runtime, you can build a file to record
the metaphlan3 resultfile of each sample in each line (In particular, --tax_lev s should be added while running metaphlan3.).
Please afford this file to PStrain.py with the parameter --metaphlan_output_files.
The file should be like
sample1_metaphlan_output.txt
sample2_metaphlan_output.txt
sample3_metaphlan_output.txt
...
Then PStrain will skip running metaphlan, and use the existing metaphlan result files.
python3 PStrain.py --metaphlan_output_files exisiting_metaphlan_results.txt -c test.config -o test_dir
For metaphlan3/4, we have not built prior genotype frequencies database which will not impact the performance very much. So just ignore --prior parameter. Also, you can build a prior genotype frequencies database by yourself. I will show how to do this in the following section. In this way, PStrain will run Metaphlan to find the species profile in the sample and profile the strains in each species.
Run PStrain based on Metaphlan2
Example:
Example: python3 PStrain_m2.py -p 10 -n 8 --similarity 0.95 -o ./outdir -c config.txt \
--dbdir /home/wangshuai/strain/00.simulation/15.needle/db \
--prior /home/wangshuai/strain/00.simulation/15.needle/db/prior_beta.pickle\
--metaphlan2 /path-to/metaphlan2
Command:
PStrain: profile strains in shotgun metagenomic sequencing reads.
required arguments:
-c , --conf The configuration file of the input samples. (default:
None)
-o , --outdir The output directory. (default: None)
--dbdir Path to marker gene database, can be downloaded from
https://zenodo.org/doi/10.5281/zenodo.10457543. (default:
/home/wangshuai/softwares/PStrain/scripts/../db/)
optional arguments:
-h, --help show this help message and exit
-p , --proc The number of process to use for parallelizing the whole
pipeline, run a sample in each process. (default: 1)
-n , --nproc The number of CPUs to use for parallelizing the mapping
with bowtie2. (default: 1)
-w , --weight The weight of genotype frequencies while computing loss,
then the weight of linked read type frequencies is 1-w.
The value is between 0~1. (default: 0.3)
--lambda1 The weight of prior knowledge while rectifying genotype
frequencies. The value is between 0~1. (default: 0.0)
--lambda2 The weight of prior estimation while rectifying second
order genotype frequencies. The value is between 0~1.
(default: 0.0)
--species_dp The minimum depth of species to be considered in strain
profiling step. (default: 5)
--snp_ratio The SNVs of which the depth are less than the specific
ratio o
Related Skills
product-manager-skills
38PM skill for Claude Code, Codex, Cursor, and Windsurf: diagnose SaaS metrics, critique PRDs, plan roadmaps, run discovery, and coach PM career transitions.
devplan-mcp-server
3MCP server for generating development plans, project roadmaps, and task breakdowns for Claude Code. Turn project ideas into paint-by-numbers implementation plans.
