VirStrain

An RNA virus strain-level identification tool for short reads.


VirStrain <img src="logo.png" width="250" title="VirStrain">


E-mail: heruiliao2-c@my.cityu.edu.hk

Recommended version: V1.17

  • Old version V1.14: fixes some bugs but lacks virstrain_contig and virstrain_merge. <BR/>
<details> <summary> Click here to check the log of all updates </summary>

[Update - 2022 - 02 - 05] : <BR/>

  • V1.12: VirStrain is able to take gzipped FASTQs as input now! <BR/>

[Update - 2022 - 03 - 23] : <BR/>

  • Fixed a bug in the Perl script related to header names.

[Update - 2022 - 11 - 10] : <BR/>

  • Added a new parameter '-s' that sorts the most likely strains by their matches to the sites.

[Update - 2022 - 12 - 16] : <BR/>

  • The web server extension of VirStrain - StrainDetect (https://strain.ee.cityu.edu.hk) is online now!

[Update - 2022 - 12 - 20] : <BR/>

  • V1.13: Fix a database generation bug in V1.12 of bioconda version! <BR/>
<!-----> </details>

[Update - 2023 - 09 - 05] : <BR/>

  • A new function that allows comprehensive (including 45619 strains of 28 viral species) viral strain identification for assembled contigs is available! <BR/>

[Update - 2023 - 10 - 12] : <BR/>

  • V1.14: Fix a bug (about handling gzipped FASTQs) in V1.13! <BR/>

[Update - 2024 - 02 - 27] : <BR/>

  • Tem_Vs files are named randomly (only GitHub version) and links for pre-built databases are provided. <BR/>

[Update - 2024 - 03 - 11] : <BR/>

  • V1.17: All the changes made so far have been updated in both GitHub and Conda. <BR/>

[Update - 2024 - 05 - 28] : <BR/>

  • V1.17: Added the parameter '-v' to show version information. (Available in the GitHub version only) <BR/>

Dependencies:

  • Python >=3.6 (3.7.3 is recommended; 3.9 is not supported yet)
  • Perl
  • Required Python packages: networkx==2.4, numpy==1.17.3, pandas==1.0.1, biopython==1.74, Plotly==3.10.0
  • Bowtie2 (for virstrain version >= V1.17)

(If you have installed conda, then you can run sh install_package.sh to install all required packages automatically.)

Make sure these programs have been installed before using VirStrain. (However, if you use bioconda/pip to install VirStrain, ignore this.)
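Before a manual install, the dependency list above can be checked with a short script. This is a minimal sketch, not part of VirStrain: the function names are hypothetical, the accepted Python range (>= 3.6 and < 3.9) is an assumed reading of the note above, and only the presence of each package is tested, not its pinned version.

```python
import shutil
import sys
from importlib import util

# Import names and pinned versions copied from the dependency list above.
# The versions are listed for reference only; this sketch does not compare them.
REQUIRED_PACKAGES = {
    "networkx": "2.4",
    "numpy": "1.17.3",
    "pandas": "1.0.1",
    "Bio": "1.74",       # biopython is imported as "Bio"
    "plotly": "3.10.0",
}
REQUIRED_PROGRAMS = ["perl", "bowtie2"]


def python_version_ok(version_info=sys.version_info):
    """Return True for Python >= 3.6 and < 3.9 (assumed reading of the note above)."""
    return (3, 6) <= tuple(version_info[:2]) < (3, 9)


def missing_programs(programs=REQUIRED_PROGRAMS):
    """Return the programs that cannot be found on PATH."""
    return [name for name in programs if shutil.which(name) is None]


def missing_packages(packages=REQUIRED_PACKAGES):
    """Return the import names that cannot be located in this environment."""
    return [name for name in packages if util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_version_ok())
    print("Missing programs:", missing_programs())
    print("Missing packages:", missing_packages())
```

An empty list from both `missing_programs()` and `missing_packages()` suggests the environment is ready; it does not confirm that the installed versions match the pinned ones.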

Install (Linux or Ubuntu only)

The first way to install VirStrain is through bioconda. Once you have a bioconda environment set up, install the virstrain package:

conda install -c bioconda virstrain

The second way to install VirStrain is through pip:

pip install virstrain==1.17

Note that the command names differ if you install VirStrain with bioconda or pip (see the table below).

Command (Not bioconda/pip) | Command (bioconda/pip)
------------ | -------------
python VirStrain.py -h | virstrain -h
python VirStrain_build.py -h | virstrain_build -h
python VirStrain_contig.py -h | virstrain_contig -h
python VirStrain_contigDB_merge.py -h | virstrain_merge -h

Or you can install VirStrain manually (make sure all dependencies have been installed before this step).

git clone https://github.com/liaoherui/VirStrain.git<BR/>
cd VirStrain<BR/>
chmod 755 bin/jellyfish-linux<BR/>
rm VirStrain_DB.tar.gz<BR/>

Then you can download the reference database of the 3 RNA viruses used in the paper. There are three ways to download the reference database.<BR/><BR/>
-> Method-1:<BR/>
Run:<BR/>
cd VirStrain<BR/>
sh download.sh<BR/>
<BR/>

[Update - 2022 - 02 - 08] : <BR/>

  • -> Method-2:<BR/> Run:<BR/> cd VirStrain<BR/> wget -qO- "https://figshare.com/ndownloader/files/34002479" | tar -zx<BR/> Or, download the database from figshare manually, and then extract it using the command tar -zxvf.

If all methods fail, please email the author to obtain the database.

[Update - 2021 - Nov] : <BR/>

  • The databases of two DNA viruses (HBV and HCMV) used in the paper can be downloaded now! <BR/> sh download_dna.sh<BR/>
  • Besides, a larger database with more SARS-CoV-2 strains (see Supplementary Section 1.1 in the paper) can also be downloaded now. <BR/> sh download_scov2_big.sh<BR/>

You can also build a VirStrain database from your own genomes; the instructions are in the Usage section.

Pre-built databases download

In the event that the download scripts fail to retrieve the pre-built databases, we also provide Google Drive links to all of them. The table below describes the public pre-built databases. Users can download these databases and use them to identify viral strains directly.

Name | Description | Download link
------------ | ------------- | -------------
VirStrain_DB.tar.gz | Databases containing SCOV2, H1N1, and HIV viral strains used in the paper | Google drive
SCOV2_newBig.tar.gz | Databases containing more SCOV2 viral strains used in the paper | Google drive
VirStrain_DNA_DB.tar.gz | Databases containing two DNA viral (HBV and HCMV) strains used in the paper | Google drive
VirStrain_contig_DB.tar.gz | Contig-level database | Google drive

Usage

Note that if you installed VirStrain with bioconda or pip, you should replace the commands in this section with their packaged equivalents (see the table below).

Command (Not bioconda/pip) | Command (bioconda/pip)
------------ | -------------
python VirStrain.py -h | virstrain -h
python VirStrain_build.py -h | virstrain_build -h
python VirStrain_contig.py -h | virstrain_contig -h
python VirStrain_contigDB_merge.py -h | virstrain_merge -h
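The command mapping above is mechanical, so the examples in this section can be translated automatically. The helper below is a hypothetical sketch (neither `ENTRY_POINTS` nor `to_packaged_command` is part of VirStrain) that rewrites a `python <script> ...` command into its bioconda/pip form using only the four pairs listed in the table.

```python
import shlex

# Script name -> packaged entry point, copied from the table above.
ENTRY_POINTS = {
    "VirStrain.py": "virstrain",
    "VirStrain_build.py": "virstrain_build",
    "VirStrain_contig.py": "virstrain_contig",
    "VirStrain_contigDB_merge.py": "virstrain_merge",
}


def to_packaged_command(command):
    """Rewrite 'python <script> args...' into the bioconda/pip form."""
    parts = shlex.split(command)
    is_script_call = (
        len(parts) >= 2
        and parts[0] in ("python", "python3")
        and parts[1] in ENTRY_POINTS
    )
    if not is_script_call:
        return command  # not a recognised VirStrain script call; leave unchanged
    args = [shlex.quote(part) for part in parts[2:]]
    return " ".join([ENTRY_POINTS[parts[1]]] + args)


if __name__ == "__main__":
    print(to_packaged_command("python VirStrain.py -h"))
```

Commands that do not start with one of the four known scripts are returned unchanged.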

Use VirStrain to identify RNA virus strains in short reads.

For SE reads:<BR/> python VirStrain.py -i Test_Data/MT451123_1.fq -d VirStrain_DB/SCOV2 -o MT451123_SE_Test<BR/>

For PE reads:<BR/> python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test<BR/>

When the virus has a high mutation rate, as HIV does, you may need to add the -m parameter.

For HIV:<BR/>
SE reads: python VirStrain.py -i <Read1> -d VirStrain_DB/HIV -o <Output_dir> -m<BR/>
PE reads: python VirStrain.py -i <Read1> -p <Read2> -d VirStrain_DB/HIV -o <Output_dir> -m<BR/>
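When many samples have to be processed, the SE, PE, and high-mutation variants above can be assembled from one function. This is a minimal sketch that assumes it is run from the cloned repository root; `build_identify_cmd` and `run_identify` are hypothetical names, and only the flags shown above (-i, -p, -d, -o, -m) are used.

```python
import subprocess


def build_identify_cmd(reads1, database_dir, output_dir, reads2=None, high_mutation=False):
    """Assemble the VirStrain.py argument list using only the flags shown above."""
    cmd = ["python", "VirStrain.py", "-i", reads1]
    if reads2 is not None:
        cmd += ["-p", reads2]      # second FASTQ file for PE reads
    cmd += ["-d", database_dir, "-o", output_dir]
    if high_mutation:
        cmd.append("-m")           # for viruses with a high mutation rate, like HIV
    return cmd


def run_identify(**kwargs):
    """Run VirStrain; raises subprocess.CalledProcessError if it exits non-zero."""
    return subprocess.run(build_identify_cmd(**kwargs), check=True)


if __name__ == "__main__":
    print(" ".join(build_identify_cmd(
        "Test_Data/MT451123_1.fq", "VirStrain_DB/SCOV2", "MT451123_PE_Test",
        reads2="Test_Data/MT451123_2.fq",
    )))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.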

[Update - 2023 - Sep] Use VirStrain_contig to identify viral strains for assembled contigs.

python VirStrain_contig.py -i <Input_Contig_fasta> -d VirStrain_contig_DB -o VirStrain_Contig_Res<BR/>

You can use the command below to download the pre-built comprehensive viral strain database for contig identification:

sh download_contig_db.sh

If you want to convert pre-built VirStrain databases for reads (e.g. VirStrain_DB/SCOV2 and VirStrain_DB/H1N1) into a database for contigs, you can try the command below:

python VirStrain_contigDB_merge.py -i VirStrain_DB/SCOV2,VirStrain_DB/H1N1 -o VirStrain_contig_DB_merge
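The -i argument of the merge command takes the read-level databases as one comma-separated string. The sketch below builds that string from a list of directories; `build_merge_cmd` is a hypothetical helper, and the rejection of paths containing a comma is an assumption that follows from the separator, not documented behavior.

```python
def build_merge_cmd(read_db_dirs, output_dir):
    """Assemble the VirStrain_contigDB_merge.py call for a list of read-level databases."""
    dirs = list(read_db_dirs)
    if not dirs:
        raise ValueError("at least one database directory is required")
    for path in dirs:
        if "," in path:
            # A comma inside a path would be indistinguishable from the separator.
            raise ValueError(f"database path must not contain ',': {path!r}")
    return ["python", "VirStrain_contigDB_merge.py", "-i", ",".join(dirs), "-o", output_dir]


if __name__ == "__main__":
    print(" ".join(build_merge_cmd(
        ["VirStrain_DB/SCOV2", "VirStrain_DB/H1N1"], "VirStrain_contig_DB_merge",
    )))
```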

Use VirStrain to build your own custom database.<BR/>

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir><BR/>

<b>Important note</b>: "," and "|" are not allowed in your <Input_MSA>. For example, ">Strain_A, 2022" or ">Strain_A|2022" is not allowed but ">Strain_A_2022" is allowed.
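Headers that contain "," or "|" can be cleaned up before building the database. The following is a minimal sketch, not part of VirStrain: `sanitize_header` and `sanitize_msa` are hypothetical helpers, and replacing each forbidden character (plus any surrounding whitespace) with "_" is one possible convention.

```python
import re

# A "," or "|" together with any whitespace around it.
_FORBIDDEN = re.compile(r"\s*[,|]\s*")


def sanitize_header(line):
    """Replace ',' and '|' in a FASTA header with '_', keeping the line ending."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    return _FORBIDDEN.sub("_", body) + ending


def sanitize_msa(lines):
    """Yield MSA lines with header lines sanitized and sequence lines untouched."""
    for line in lines:
        yield sanitize_header(line) if line.startswith(">") else line


if __name__ == "__main__":
    for cleaned in sanitize_msa([">Strain_A, 2022\n", "ACGT-ACGT\n"]):
        print(cleaned, end="")
```

Renaming can make two headers identical, so it is worth checking that the strain names are still unique afterwards.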

For small strain sets (<1000 input strains) or viruses with large genomes (like HCMV), you can use the manual-covering function to cover more useful sites. For example, in our experiment, we used "-s 0.4" for 328 HCMV strains. Usually, 0.2~0.6 should be a suitable range for the parameter "-s". However, if you have only a few strains, such as 3, you can also use a larger value like "-s 0.8".

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4<BR/>

In addition, if you want to use only the SNV sites from "x" to "y" (e.g. x=500 to y=1000), you can add the parameter -r.

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4 -r 500-1000<BR/>

Note: The format of the input MSA should be the same as the format of an MSA generated by Mafft (https://mafft.cbrc.jp/alignment/software/).<BR/>
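The build commands above, with their optional -s and -r parameters, can be generated from one function. This is a hedged sketch: `build_db_cmd` is a hypothetical helper, and the validation rules (a -s value strictly between 0 and 1, and positive site positions with start < end) are assumptions based on the examples above rather than documented limits.

```python
def build_db_cmd(input_msa, database_dir, s=None, site_range=None):
    """Assemble the VirStrain_build.py call with the optional -s and -r parameters."""
    cmd = ["python", "VirStrain_build.py", "-i", input_msa, "-d", database_dir]
    if s is not None:
        # Assumed bound: the README only suggests 0.2~0.6 and mentions 0.8.
        if not 0 < s < 1:
            raise ValueError("-s is assumed to lie strictly between 0 and 1")
        cmd += ["-s", str(s)]
    if site_range is not None:
        start, end = site_range
        # Assumed rule: positions are positive integers and start comes first.
        if not 0 < start < end:
            raise ValueError("-r expects positive positions with start < end")
        cmd += ["-r", f"{start}-{end}"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_db_cmd("my_msa.fasta", "My_DB", s=0.4, site_range=(500, 1000))))
```

Here `my_msa.fasta` and `My_DB` are placeholder names for the `<Input_MSA>` and `<Database_Dir>` arguments.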

Full command-line options

<!---(Note: The initial idea of development of VirStrain is "Simpler is better". We do not want to burden users due to complicated usage of VirStrain. So the default parameters (some are inside the program) are simple but have good performance in our test, however, more useful parameters will be added for users who need them.)-->

Identification - VirStrain.py (Default k-mer size: 25)

VirStrain - An RNA virus strain-level identification tool for short reads.

Example: python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test

required arguments:
    -i, --input_reads             Input fastq data.
    -d, --database_dir            Path of VirStrain database.

optional arguments:
    -h, --help                    Show help message and exit.
    -o, --output_dir              The output directory. (Default: ./VirStrain_Out)
    -p, --input_reads2            Input fastq data for PE reads
    -c, --site_filter_cutoff      The cutoff for filtering a site when calculating the Vscore. (Default: 0.05)
    -s, --rank_by_sites           If set to 1, then VirStrain will sort the most likely strains by matches to the sites. (default: 0)
    -f, --turn_off_figures	  If set to 1, then VirStrain will not generate figures. (default: 
