VirStrain
An RNA virus strain-level identification tool for short reads.
<img src="logo.png" width="250" title="VirStrain">
E-mail: heruiliao2-c@my.cityu.edu.hk
Recommended version: V1.17
- Old version V1.14: fixes some bugs but lacks virstrain_contig and virstrain_merge. <BR/>
[Update - 2022 - 02 - 05] : <BR/>
- V1.12: VirStrain is able to take gzipped FASTQs as input now! <BR/>
[Update - 2022 - 03 - 23] : <BR/>
- Fix a bug in the Perl script related to header names.
[Update - 2022 - 11 - 10] : <BR/>
- Add a new parameter '-s' that sorts the most likely strains by their matches to the sites.
[Update - 2022 - 12 - 16] : <BR/>
- The web server extension of VirStrain - StrainDetect (https://strain.ee.cityu.edu.hk) is online now!
[Update - 2022 - 12 - 20] : <BR/>
- V1.13: Fix a database generation bug in V1.12 of bioconda version! <BR/>
[Update - 2023 - 09 - 05] : <BR/>
- A new function that allows comprehensive (including 45619 strains of 28 viral species) viral strain identification for assembled contigs is available! <BR/>
[Update - 2023 - 10 - 12] : <BR/>
- V1.14: Fix a bug (about handling gzipped FASTQs) in V1.13! <BR/>
[Update - 2024 - 02 - 27] : <BR/>
- Tem_Vs files are named randomly (only GitHub version) and links for pre-built databases are provided. <BR/>
[Update - 2024 - 03 - 11] : <BR/>
- V1.17: All the changes made so far have been updated in both GitHub and Conda. <BR/>
[Update - 2024 - 05 - 28] : <BR/>
- V1.17: Add the parameter '-v' to show the version information. (GitHub version available only) <BR/>
Dependencies:
- Python >=3.6 (3.7.3 is recommended; 3.9 is not supported yet)
- Perl
- Required Python packages: networkx==2.4, numpy==1.17.3, pandas==1.0.1, biopython==1.74, Plotly==3.10.0
- Bowtie2 (for virstrain version >= V1.17)
(If you have installed conda, then you can run sh install_package.sh to install all required packages automatically.)
Make sure these programs have been installed before using VirStrain. (However, if you use bioconda/pip to install VirStrain, ignore this.)
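If you install manually, the dependency list above can be checked before running VirStrain. The sketch below is only an illustration: the helper names `python_version_ok` and `missing_tools` are hypothetical, the executable names `perl` and `bowtie2` are assumed to be how these tools appear on your PATH, and reading "Python >=3.6, 3.9 not supported" as "at least 3.6 and below 3.9" is an interpretation, not a documented rule.

```python
import shutil
import sys


def python_version_ok(version=None):
    """Return True if the version is at least 3.6 and below 3.9.

    The upper bound is an assumption based on the Dependencies list.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 6) <= major_minor < (3, 9)


def missing_tools(names=("perl", "bowtie2"), which=shutil.which):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    if not python_version_ok():
        print("Warning: this Python version is outside the range listed in Dependencies.")
    for tool in missing_tools():
        print("Missing executable (assumed name): " + tool)
```

This does not check the pinned Python package versions; `sh install_package.sh` remains the documented way to install them.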
Install (Linux or Ubuntu only)
The first way to install VirStrain is to use bioconda. Once you have a bioconda environment set up, install the virstrain package:
conda install -c bioconda virstrain
The second way to install VirStrain is to use pip:
pip install virstrain==1.17
Note that some commands are renamed if you install VirStrain using bioconda/pip (see below).

Command (Not bioconda/pip) | Command (bioconda/pip)
------------ | -------------
python VirStrain.py -h | virstrain -h
python VirStrain_build.py -h | virstrain_build -h
python VirStrain_contig.py -h | virstrain_contig -h
python VirStrain_contigDB_merge.py -h | virstrain_merge -h

Alternatively, you can install VirStrain manually (make sure all dependencies have been installed before this step).
git clone https://github.com/liaoherui/VirStrain.git<BR/>
cd VirStrain<BR/>
chmod 755 bin/jellyfish-linux<BR/>
rm VirStrain_DB.tar.gz<BR/>
Then, you can download the reference database of 3 RNA viruses used in the paper.
There are three ways to download the reference database.<BR/><BR/>
-> Method-1:<BR/>
Run:<BR/>
cd VirStrain<BR/>
sh download.sh<BR/> <BR/>
[Update - 2022 - 02 - 08] : <BR/>
-> Method-2:<BR/>
Run:<BR/>
cd VirStrain<BR/>
wget -qO- "https://figshare.com/ndownloader/files/34002479" | tar -zx<BR/>
Or download the database from figshare manually, and then extract it using the command tar -zxvf.
If all methods fail, please email the author to get the database.
[Update - 2021 - Nov] : <BR/>
- The databases of two DNA viruses (HBV and HCMV) used in the paper can be downloaded now! <BR/>
sh download_dna.sh<BR/>
- Besides, a larger database with more SARS-CoV-2 strains (see Supplementary Section 1.1 in the paper) can also be downloaded now. <BR/>
sh download_scov2_big.sh<BR/>
You can also build a VirStrain database from your own genomes; the manual is in the Usage section.
Pre-built databases download
In the event that the download scripts fail to retrieve the pre-built database, we also provide Google Drive links to access all pre-built databases. The table below offers information about the public pre-built databases. Users can download these databases and use them to identify viral strains directly.

Name | Description | Download link
------------ | ------------- | -------------
VirStrain_DB.tar.gz | Databases containing SCOV2, H1N1, and HIV viral strains used in the paper | Google drive
SCOV2_newBig.tar.gz | Databases containing more SCOV2 viral strains used in the paper | Google drive
VirStrain_DNA_DB.tar.gz | Databases containing two DNA viral (HBV and HCMV) strains used in the paper | Google drive
VirStrain_contig_DB.tar.gz | Contig-level database | Google drive

Usage
Note that if you install VirStrain using bioconda/pip, you should replace the commands as shown below.

Command (Not bioconda/pip) | Command (bioconda/pip)
------------ | -------------
python VirStrain.py -h | virstrain -h
python VirStrain_build.py -h | virstrain_build -h
python VirStrain_contig.py -h | virstrain_contig -h
python VirStrain_contigDB_merge.py -h | virstrain_merge -h

Use VirStrain to identify RNA virus strains in short reads.
For SE reads:<BR/>
python VirStrain.py -i Test_Data/MT451123_1.fq -d VirStrain_DB/SCOV2 -o MT451123_SE_Test<BR/>
For PE reads:<BR/>
python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test<BR/>
When the virus has a high mutation rate, like HIV, you may need to add the -m parameter.
For HIV:<BR/>
SE reads: python VirStrain.py -i <Read1> -d VirStrain_DB/HIV -o <Output_dir> -m<BR/>
PE reads: python VirStrain.py -i <Read1> -p <Read2> -d VirStrain_DB/HIV -o <Output_dir> -m<BR/>
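When VirStrain is run over many samples, the SE/PE and -m variants above can be assembled programmatically. The following is a minimal sketch, not part of VirStrain: the function name `build_virstrain_cmd` and its keyword arguments are hypothetical, and only the flags shown in this README (-i, -p, -d, -o, -m) are used.

```python
import shlex


def build_virstrain_cmd(reads1, database_dir, output_dir,
                        reads2=None, high_mutation=False, packaged=False):
    """Assemble a VirStrain command line as an argument list.

    packaged=True uses the `virstrain` command provided by bioconda/pip;
    otherwise the script form `python VirStrain.py` is used.
    """
    cmd = ["virstrain"] if packaged else ["python", "VirStrain.py"]
    cmd += ["-i", reads1]
    if reads2 is not None:
        # A second read file switches the run to paired-end mode.
        cmd += ["-p", reads2]
    cmd += ["-d", database_dir, "-o", output_dir]
    if high_mutation:
        # -m is suggested above for viruses with a high mutation rate, like HIV.
        cmd.append("-m")
    return cmd


if __name__ == "__main__":
    cmd = build_virstrain_cmd(
        "Test_Data/MT451123_1.fq", "VirStrain_DB/SCOV2", "MT451123_PE_Test",
        reads2="Test_Data/MT451123_2.fq",
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the VirStrain directory; returning a list instead of a string avoids shell quoting problems with unusual file names.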
[Update - 2023 - Sep] Use VirStrain_contig to identify viral strains for assembled contigs.
python VirStrain_contig.py -i <Input_Contig_fasta> -d VirStrain_contig_DB -o VirStrain_Contig_Res<BR/>
You can use the command below to download the pre-built comprehensive viral strain database for contig identification:
sh download_contig_db.sh
If you want to convert pre-built VirStrain databases for reads (e.g. VirStrain_DB/SCOV2 and VirStrain_DB/H1N1) into a database for contigs, you can try the command below:
python VirStrain_contigDB_merge.py -i VirStrain_DB/SCOV2,VirStrain_DB/H1N1 -o VirStrain_contig_DB_merge
Use VirStrain to build your own custom database.<BR/>
python VirStrain_build.py -i <Input_MSA> -d <Database_Dir><BR/>
<b>Important note</b>: "," and "|" are not allowed in the sequence headers of your <Input_MSA>. For example, ">Strain_A, 2022" or ">Strain_A|2022" is not allowed but ">Strain_A_2022" is allowed.
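If your alignment already contains such headers, they can be rewritten before building the database. The sketch below is one possible way to apply the naming rule above, under stated assumptions: the helper names `sanitize_header` and `sanitize_msa_lines` are hypothetical, and replacing each "," or "|" (with any adjoining whitespace) by "_" is an illustrative choice, not something VirStrain does for you.

```python
import re

# A "," or "|" together with any whitespace directly around it.
FORBIDDEN = re.compile(r"\s*[,|]\s*")


def sanitize_header(header):
    """Replace forbidden characters in one FASTA header with underscores."""
    return FORBIDDEN.sub("_", header.strip())


def sanitize_msa_lines(lines):
    """Sanitize header lines of an MSA in FASTA format; sequence lines pass through."""
    seen = set()
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            line = sanitize_header(line)
            if line in seen:
                # Two different headers may collapse to the same name.
                raise ValueError("duplicate header after sanitizing: " + line)
            seen.add(line)
        cleaned.append(line)
    return cleaned


if __name__ == "__main__":
    print(sanitize_header(">Strain_A, 2022"))
    print(sanitize_header(">Strain_A|2022"))
```

The duplicate check matters because two originally distinct headers (for example ">Strain_A, 2022" and ">Strain_A|2022") become identical after sanitizing.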
For a small number of strains (<1000 input strains) or viruses with large genomes (like HCMV), you can use the manual-covering function to cover more useful sites. For example, in our experiment, we used "-s 0.4" for 328 HCMV strains. Usually, 0.2~0.6 should be a suitable range for the parameter "-s". However, if you only have very few strains, such as 3 strains, you can also use a larger value like "-s 0.8".
python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4<BR/>
Besides, if you only want to use SNV sites from "x" to "y" (e.g. x=500 to y=1000), you can add the parameter -r.
python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4 -r 500-1000<BR/>
Note: The input MSA should be in the same format as the MSA generated by Mafft (https://mafft.cbrc.jp/alignment/software/).<BR/>
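The build options above can also be assembled in a script. This is a rough sketch under assumptions: `build_db_cmd` and `format_site_range` are hypothetical helpers, only the flags documented here (-i, -d, -s, -r) are used, and the range checks (non-negative integers, x smaller than y) are a guess at sensible input, since the README does not state the exact bounds VirStrain_build.py accepts.

```python
def format_site_range(start, end):
    """Format an SNV site range as 'x-y', the form used by -r above."""
    if not (isinstance(start, int) and isinstance(end, int)):
        raise TypeError("site range bounds must be integers")
    if start < 0 or end <= start:
        # Assumed sanity check; VirStrain's own validation may differ.
        raise ValueError("expected 0 <= x < y, got %d-%d" % (start, end))
    return "%d-%d" % (start, end)


def build_db_cmd(input_msa, database_dir, cover=None, site_range=None):
    """Assemble a VirStrain_build.py command line as an argument list."""
    cmd = ["python", "VirStrain_build.py", "-i", input_msa, "-d", database_dir]
    if cover is not None:
        # Manual-covering value, e.g. 0.4 as used for 328 HCMV strains.
        cmd += ["-s", str(cover)]
    if site_range is not None:
        cmd += ["-r", format_site_range(*site_range)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_db_cmd("my_msa.fasta", "My_DB", cover=0.4, site_range=(500, 1000))))
```

The file name `my_msa.fasta` and directory `My_DB` are placeholders for `<Input_MSA>` and `<Database_Dir>`.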
Full command-line options
<!---(Note: The initial idea of development of VirStrain is "Simpler is better". We do not want to burden users due to complicated usage of VirStrain. So the default parameters (some are inside the program) are simple but have good performance in our test, however, more useful parameters will be added for users who need them.)-->
Identification - VirStrain.py (Default k-mer size: 25)
VirStrain - An RNA virus strain-level identification tool for short reads.
Example: python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test
required arguments:
-i, --input_reads Input fastq data.
-d, --database_dir Path of VirStrain database.
optional arguments:
-h, --help Show help message and exit.
-o, --output_dir The output directory. (Default: ./VirStrain_Out)
-p, --input_reads2 Input fastq data for PE reads
-c, --site_filter_cutoff The cutoff of filtering one site when calculate the Vscore. (Default: 0.05)
-s, --rank_by_sites If set to 1, then VirStrain will sort the most possible strain by matches to the sites. (default: 0)
-f, --turn_off_figures If set to 1, then VirStrain will not generate figures. (default:
