VirStrain <img src="logo.png" width="250" title="VirStrain">

An RNA virus strain-level identification tool for short reads.

E-mail: heruiliao2-c@my.cityu.edu.hk

Recommended Version: V1.17

Old Version - V1.14: Fixes some bugs but lacks virstrain_contig and virstrain_merge.

<details> <summary> Click here to check the log of all updates </summary>

[Update - 2022 - 02 - 05] :

V1.12: VirStrain is able to take gzipped FASTQs as input now!

[Update - 2022 - 03 - 23] :

Fixed a bug in the Perl script related to header names.

[Update - 2022 - 11 - 10] :

Added a new parameter '-s' that sorts the most likely strains by their matches to the sites.

[Update - 2022 - 12 - 16] :

The web server extension of VirStrain - StrainDetect (https://strain.ee.cityu.edu.hk) is online now!

[Update - 2022 - 12 - 20] :

V1.13: Fix a database generation bug in V1.12 of bioconda version!

</details>

[Update - 2023 - 09 - 05] :

A new function that allows comprehensive (including 45619 strains of 28 viral species) viral strain identification for assembled contigs is available!

[Update - 2023 - 10 - 12] :

V1.14: Fix a bug (about handling gzipped FASTQs) in V1.13!

[Update - 2024 - 02 - 27] :

Tem_Vs files are named randomly (only GitHub version) and links for pre-built databases are provided.

[Update - 2024 - 03 - 11] :

V1.17: All the changes made so far have been updated in both GitHub and Conda.

[Update - 2024 - 05 - 28] :

V1.17: Add the parameter '-v' to show the version information. (GitHub version available only)

Dependencies:

Python >=3.6 (3.7.3 is recommended; 3.9 is not supported yet)
Perl
Required Python packages: networkx==2.4, numpy==1.17.3, pandas==1.0.1, biopython==1.74, Plotly==3.10.0
Bowtie2 (for virstrain version >= V1.17)

(If you have installed conda, then you can run sh install_package.sh to install all required packages automatically.)

Make sure these programs have been installed before using VirStrain. (However, if you use bioconda/pip to install VirStrain, ignore this.)
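As a quick sanity check before a manual install, the dependency list above can be probed from Python. This is only a sketch: the function names are hypothetical, the executable names `perl` and `bowtie2` are assumed to be the ones on your PATH, pinned package versions are not verified, and treating every Python release from 3.9 onward as unsupported is an assumption (the list above only mentions 3.9).

```python
import shutil
import sys
from importlib.util import find_spec

# Import names for the packages listed above (biopython is imported as "Bio").
REQUIRED_MODULES = ["networkx", "numpy", "pandas", "Bio", "plotly"]
# Assumed executable names for the external programs.
REQUIRED_PROGRAMS = ["perl", "bowtie2"]


def python_version_ok(version=None):
    """Return True for Python >= 3.6 and < 3.9 (upper bound is an assumption)."""
    major, minor = tuple(version or sys.version_info)[:2]
    return (3, 6) <= (major, minor) < (3, 9)


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


def missing_programs(programs=REQUIRED_PROGRAMS):
    """Return the program names that are not found on PATH."""
    return [name for name in programs if shutil.which(name) is None]
```

Calling `missing_modules()` and `missing_programs()` with no arguments reports whatever is absent from the current environment; empty lists mean the names were found, not that the pinned versions match.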

Install (Linux or ubuntu only)

The first way to install VirStrain is to use bioconda. Once you have a bioconda environment set up, install the virstrain package:

conda install -c bioconda virstrain

The second way to install VirStrain is to use pip:

pip install virstrain==1.17

Note that some commands are different if you install VirStrain using bioconda/pip (see below).

Command (Not bioconda/pip) | Command (bioconda/pip)
------------ | -------------
python VirStrain.py -h | virstrain -h
python VirStrain_build.py -h | virstrain_build -h
python VirStrain_contig.py -h | virstrain_contig -h
python VirStrain_contigDB_merge.py -h | virstrain_merge -h
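If you keep pipeline scripts that were written against the GitHub checkout, the mapping in the table above can be applied mechanically. The helper below is a hypothetical illustration (the function name is not part of VirStrain); it only rewrites the four commands listed in the table and leaves everything else untouched.

```python
# Mapping taken from the table above: script name -> bioconda/pip entry point.
ENTRY_POINTS = {
    "VirStrain.py": "virstrain",
    "VirStrain_build.py": "virstrain_build",
    "VirStrain_contig.py": "virstrain_contig",
    "VirStrain_contigDB_merge.py": "virstrain_merge",
}


def to_installed_command(argv):
    """Translate ['python', 'VirStrain.py', ...] into ['virstrain', ...].

    Argument lists that do not match this pattern are returned unchanged.
    """
    if len(argv) >= 2 and argv[0].startswith("python") and argv[1] in ENTRY_POINTS:
        return [ENTRY_POINTS[argv[1]]] + list(argv[2:])
    return list(argv)
```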

Or you can install VirStrain manually (make sure all dependencies have been installed before this step).

    git clone https://github.com/liaoherui/VirStrain.git
    cd VirStrain
    chmod 755 bin/jellyfish-linux
    rm VirStrain_DB.tar.gz

Then, you can download the reference database of 3 RNA viruses used in the paper. There are three ways to download the reference database.

-> Method-1: Run:

    cd VirStrain
    sh download.sh

[Update - 2022 - 02 - 08] :

-> Method-2: Run:

    cd VirStrain
    wget -qO- "https://figshare.com/ndownloader/files/34002479" | tar -zx

Or, download the database from figshare manually, and then extract it using the command tar -zxvf.

If all methods fail, please email the author to get the database.

[Update - 2021 - Nov] :

The databases of two DNA viruses (HBV and HCMV) used in the paper can be downloaded now!

    sh download_dna.sh

Besides, a larger database with more SARS-CoV-2 strains (see Supplementary Section 1.1 in the paper) can also be downloaded now.

    sh download_scov2_big.sh

You can also build the VirStrain database with your own genomes; the manual is in the Usage section.

Pre-built databases download

In the event that the download scripts fail to retrieve the pre-built database, we also provide Google drive links to access all pre-built databases. The table below offers information about the public pre-built databases. Users can download these databases and use them to identify viral strains directly.

Name | Description | Download link
------------ | ------------- | -------------
VirStrain_DB.tar.gz | Databases containing SCOV2, H1N1, and HIV viral strains used in the paper | Google drive
SCOV2_newBig.tar.gz | Databases containing more SCOV2 viral strains used in the paper | Google drive
VirStrain_DNA_DB.tar.gz | Databases containing two DNA viral (HBV and HCMV) strains used in the paper | Google drive
VirStrain_contig_DB.tar.gz | Contig-level database | Google drive

Usage

Note that if you installed VirStrain using bioconda/pip, you should replace the commands below with their bioconda/pip equivalents (see the command table in the Install section).

Use VirStrain to identify RNA virus strains in short reads.

For SE reads: python VirStrain.py -i Test_Data/MT451123_1.fq -d VirStrain_DB/SCOV2 -o MT451123_SE_Test

For PE reads: python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test

When the virus has a high mutation rate, like HIV, you may need to add the -m parameter.

For HIV:

SE reads: python VirStrain.py -i <Read1> -d VirStrain_DB/HIV -o <Output_dir> -m

PE reads: python VirStrain.py -i <Read1> -p <Read2> -d VirStrain_DB/HIV -o <Output_dir> -m
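When many samples need to be processed, the SE/PE/HIV variants above can be assembled programmatically. The following is a minimal sketch, not part of VirStrain: `build_identify_cmd` and `run_identify` are hypothetical helper names, and only the flags shown above (-i, -p, -d, -o, -m) are used. It assumes the command is run from the root of the GitHub checkout.

```python
import subprocess


def build_identify_cmd(reads1, database_dir, output_dir, reads2=None, high_mutation=False):
    """Assemble the argument list for VirStrain.py from the documented flags."""
    cmd = ["python", "VirStrain.py", "-i", reads1]
    if reads2 is not None:
        cmd += ["-p", reads2]  # second mate file for paired-end data
    cmd += ["-d", database_dir, "-o", output_dir]
    if high_mutation:
        cmd.append("-m")  # for viruses with a high mutation rate, such as HIV
    return cmd


def run_identify(**kwargs):
    """Run VirStrain.py; raises subprocess.CalledProcessError on a non-zero exit."""
    return subprocess.run(build_identify_cmd(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.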

[Update - 2023 - Sep] Use VirStrain_contig to identify viral strains for assembled contigs.

python VirStrain_contig.py -i <Input_Contig_fasta> -d VirStrain_contig_DB -o VirStrain_Contig_Res

You can use the command below to download the pre-built comprehensive viral strain database for contig identification:

sh download_contig_db.sh

If you want to convert pre-built VirStrain databases for reads (e.g. VirStrain_DB/SCOV2 and VirStrain_DB/H1N1) to a database for contigs, you can try the command below:

python VirStrain_contigDB_merge.py -i VirStrain_DB/SCOV2,VirStrain_DB/H1N1 -o VirStrain_contig_DB_merge

Use VirStrain to build your own custom database.

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir>

Important note: "," and "|" are not allowed in your <Input_MSA>. For example, ">Strain_A, 2022" or ">Strain_A|2022" is not allowed but ">Strain_A_2022" is allowed.
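The restriction above is easy to check before building a database. The sketch below is a hypothetical pre-flight helper, not part of VirStrain; it assumes the rule applies to FASTA header lines (as in the examples above) and shows one possible way to rewrite an offending header.

```python
import re

FORBIDDEN = (",", "|")


def find_bad_headers(lines):
    """Return (line_number, header) pairs for FASTA headers containing ',' or '|'."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">") and any(ch in line for ch in FORBIDDEN):
            bad.append((number, line.rstrip("\n")))
    return bad


def sanitize_header(header):
    """Replace each ',' or '|' (plus surrounding whitespace) with a single '_'."""
    return re.sub(r"\s*[,|]\s*", "_", header)
```

For a real file, `find_bad_headers(open("my_msa.fasta"))` would list the offending lines (the file name here is a placeholder). After sanitizing, it is worth confirming that the renamed headers are still unique.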

For small-scale strain sets (<1000 input strains) or viruses with large genome sizes (like HCMV), you can use the manual-covering function to cover more useful sites. For example, in our experiment, we used "-s 0.4" for 328 HCMV strains. Usually, 0.2 to 0.6 should be a suitable range for the parameter "-s". However, if you only have very few strains, like 3 strains, you can also use a greater value like "-s 0.8".

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4

Besides, if you only want to use SNV sites from "x" to "y" (e.g. x=500 to y=1000), you can add the parameter -r.

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4 -r 500-1000

Note: The format of the input MSA should be the same as the format of the MSA generated by MAFFT (https://mafft.cbrc.jp/alignment/software/).
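A multiple sequence alignment in FASTA format has one property that is simple to verify: after gaps are inserted, every sequence has the same length. The sketch below checks only that property, using hypothetical helper names; it is not a full validation of MAFFT output and is not part of VirStrain.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_aligned(records):
    """True if there is at least one record and all sequences have equal length."""
    lengths = {len(seq) for _, seq in records}
    return len(lengths) == 1
```

If `is_aligned` returns False, the input is most likely an unaligned FASTA file and should be aligned before it is passed to VirStrain_build.py.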

Full command-line options

Identification - VirStrain.py (Default k-mer size: 25)

VirStrain - An RNA virus strain-level identification tool for short reads.

Example: python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test

required arguments:
    -i, --input_reads             Input fastq data.
    -d, --database_dir            Path of VirStrain database.

optional arguments:
    -h, --help                    Show help message and exit.
    -o, --output_dir              The output directory. (Default: ./VirStrain_Out)
    -p, --input_reads2            Input fastq data for PE reads
    -c, --site_filter_cutoff      The cutoff for filtering out a site when calculating the Vscore. (Default: 0.05)
    -s, --rank_by_sites           If set to 1, then VirStrain will sort the most likely strains by matches to the sites. (default: 0)
    -f, --turn_off_figures        If set to 1, then VirStrain will not generate figures. (default:
