Purpose of VcfHunter

VcfHunter is a collection of programs whose main aims are to map DNA and RNAseq data onto a reference genome sequence, perform variant calling, manipulate VCF files, perform chromosome painting of accessions based on the contribution of ancestral groups, select markers for genetic map analysis and perform pairwise chromosome linkage of ordered markers.

Installation

All tools described here are written in Python and run on Linux systems. To install the tools:

Open the loca_programs.conf file located in the bin folder
Set the path to each required program. If a program is already available in the environment (in $PATH), the full path is not required and the program name alone can be set in loca_programs.conf. For example, if bwa is already available in the environment, put bwa = bwa. If bam-readcount is not available in the environment, put bamreadcount = /toto/tartenpion/programmes/bam-readcount/bin/bam-readcount
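
Before running any tool, it can help to check that every entry of loca_programs.conf resolves to something that exists. The sketch below is not part of VcfHunter. It assumes the file is made of "name = value" lines like the bwa = bwa example above, and it simply skips blank lines, comment lines and [section] headers (whether the real file contains any is an assumption). The function names are hypothetical.

```python
import shutil
from pathlib import Path


def read_program_paths(conf_path):
    """Parse 'name = value' lines, skipping blanks, comments and [section] headers."""
    programs = {}
    for raw in Path(conf_path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "[")):
            continue
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        programs[name.strip()] = value.strip()
    return programs


def find_missing_programs(programs):
    """Return names whose value is neither found on $PATH nor an existing file."""
    return sorted(
        name
        for name, value in programs.items()
        # shutil.which handles both bare names (searched in $PATH) and full
        # paths to executables; is_file covers non-executable files such as jars.
        if shutil.which(value) is None and not Path(value).is_file()
    )
```

Calling find_missing_programs(read_program_paths("bin/loca_programs.conf")) returns the list of entries that still need a valid path. The check assumes each value is a single program name or path, without extra arguments.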

Dependencies

BWA, https://bio-bwa.sourceforge.net/
STAR, https://github.com/alexdobin/STAR
Picard Tools, https://broadinstitute.github.io/picard/
GATK, https://software.broadinstitute.org/gatk/
Samtools, https://github.com/samtools/samtools
Bamtools, https://github.com/pezmaster31/bamtools
bam-readcount, https://github.com/genome/bam-readcount
gnuplot, http://www.gnuplot.info/
circos-0.67 or greater, http://circos.ca/software/download/circos/
UMI-tools, https://umi-tools.readthedocs.io/en/latest/

Python3 (tested with 3.4.10), Java and Biopython are also required.

How to cite

Depending on the tool you use (see Description section) please cite either:

Martin et al., in prep. G. Martin, B. Istace, F.C. Baurens, C. Belser, C. Hervouet, K. Labadie, C. Cruaud, B. Noel, C. Guiougou, F. Salmon, J. Mahadeo, F. Ahmad, H.A. Volkaert, G. Droc, M. Rouard, J. Sardos, P. Wincker, N. Yahiaoui, J.M. Aury, A. D’Hont. in prep. Chromosome evolution in Musa species: Insights from genome assemblies of wild contributors to banana cultivars.

Martin et al., 2023b. Martin G, Baurens F-C, Labadie K, Hervouet C, Salmon F, Marius F, Paulo-de-la-Reberdiere N, Van den Houwe I, Aury J-M, D'Hont A, Yahiaoui N. 2023. Shared pedigree relationships and transmission of unreduced gametes in cultivated banana. Annals of Botany. XX:1–13 https://doi.org/10.1093/aob/mcad065

Martin et al., 2023a. Martin G, Cottin A, Baurens F-C, Labadie K, Hervouet C, Salmon F, Paulo-de-la-Reberdiere N, Van den Houwe I, Sardos J, Aury J-M, et al. 2023. Interspecific introgression patterns reveal the origins of worldwide cultivated bananas in New Guinea. Plant J. 113:802–818 https://doi.org/10.1111/tpj.16086

Martin et al., 2020b. Martin G, Baurens F-C, Hervouet C, Salmon F, Delos J-M, Labadie K, Perdereau A, Mournet P, Blois L, Dupouy M, et al. 2020. Chromosome reciprocal translocations have accompanied subspecies evolution in bananas. Plant J. 104:1698–1711 https://doi.org/10.1111/tpj.15031

Martin et al., 2020a. Martin G, Cardi C, Sarah G, Ricci S, Jenny C, Fondi E, Perrier X, Glaszmann J-C, D’Hont A, Yahiaoui N. 2020. Genome ancestry mosaics reveal multiple and cryptic contributors to cultivated banana. Plant J. 102:1008–1025. https://doi.org/10.1111/tpj.14683

Baurens et al., 2019. Baurens F-C, Martin G, Hervouet C, Salmon F, Yohomé D, Ricci S, Rouard M, Habas R, Lemainque A, Yahiaoui N, et al. 2019. Recombination and large structural variations shape interspecific edible bananas genomes. Mol. Biol. Evol. https://academic.oup.com/mbe/article/36/1/97/5162481

Garsmeur et al., 2018. Garsmeur O, Droc G, Antonise R, Grimwood J, Potier B, Aitken K, Jenkins J, Martin G, Charron C, Hervouet C, et al. 2018. A mosaic monoploid reference sequence for the highly complex genome of sugarcane. Nat. Commun. 9:2638. https://www.nature.com/articles/s41467-018-05051-5

Referring person of the deposit

Guillaume Martin (CIRAD)

License

Licensed under GPLv3

Description

The package comprises 50 programs, listed here:

Draw_dot_plot.py (Baurens et al., 2019)
RecombCalculatorDDose.py (Baurens et al., 2019)
vcf2allPropAndCov.py (Baurens et al., 2019)
vcf2allPropAndCovByChr.py (Baurens et al., 2019)
vcf2popNew.1.0.py (Baurens et al., 2019)
VcfPreFilter.1.0.py (Garsmeur et al., 2018)
process_reseq_1.0.py (Garsmeur et al., 2018)
vcf2pop.1.0.py (Garsmeur et al., 2018)
vcfFilter.1.0.py (Garsmeur et al., 2018)
haplo2Circos.1.0.py (Martin et al., 2020a)
haplo2kar.1.0.py (Martin et al., 2020a)
haplo2karByChr.1.0.py (Martin et al., 2020a)
process_RNAseq.1.0.py (Martin et al., 2020a)
vcf2struct.1.0.py (Martin et al., 2020a)
vcfIdent.1.0.py (Martin et al., 2020a)
vcfRemove.1.0.py (Martin et al., 2020a)
vcf2linear.1.1.py (Martin et al., 2020a)
CaReRa.py (Martin et al., 2020b)
HaploProp.py (Martin et al., 2020b)
VcfAndCarto2haplo.py (Martin et al., 2020b)
vcf2cov.py (Martin et al., 2020b)
vcf2dis.py (Martin et al., 2020b)
vcfAndConsToRatio.py (Martin et al., 2020b)
vcfFormatForVcftools.py (Martin et al., 2020b)
DrawRatio.py (Martin et al., 2023a)
DrawRatioDetailInterractive.py (Martin et al., 2023a)
GroupBasedOnDistanceToCentroids.py (Martin et al., 2023a)
IdentOtherAncestry.py (Martin et al., 2023a)
IdentPrivateAllele.py (Martin et al., 2023a)
PaintArp.py (Martin et al., 2023a)
PhaseInVcf.py (Martin et al., 2023a)
PhaseInVcfToFasta.2.0.py (Martin et al., 2023a)
ReformatTree.py (Martin et al., 2023a)
allele_ratio_group.py (Martin et al., 2023a)
allele_ratio_per_acc.py (Martin et al., 2023a)
convertForIdeo.py (Martin et al., 2023a)
plot_allele_normalized_mean_ratio_per_acc.py (Martin et al., 2023a)
ACRO.py (Martin et al., 2023b)
APAR.py (Martin et al., 2023b)
DrawCircos.py (Martin et al., 2023b)
DrawStackedDensity.py (Martin et al., 2023b)
FormatHaplo.py (Martin et al., 2023b)
SPRH.py (Martin et al., 2023b)
TotalRecal.1.0.py (Martin et al., 2023b)
ValPar.py (Martin et al., 2023b)
vcfSelect.py (Martin et al., 2023b)
calcul_pileup_count.py (Martin et al., In prep)
calcul_pileup_mean.py (Martin et al., In prep)
PaintAssembly.sh (Martin et al., In prep)
ParseReadsOnHaplo.py (Martin et al., In prep)

49 programs are run using the following command: python program-name <--options-name value>
1 program (PaintAssembly.sh) is run using the following command: bash PaintAssembly.sh <--options-name value>
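
When the tools are chained from a Python script, the two calling conventions above can be wrapped in one small helper. This is a minimal sketch, not part of VcfHunter: build_command is a hypothetical name, and the configuration file name, prefix and option values in the usage example are placeholders.

```python
import shlex


def build_command(program, options):
    """Return the argument list for one program call.

    .sh programs are run with bash, all others with python,
    and each option is passed as '--name value'.
    """
    interpreter = "bash" if program.endswith(".sh") else "python"
    command = [interpreter, program]
    for name, value in options.items():
        command += ["--" + name, str(value)]
    return command


# Hypothetical call: file name, prefix and values are placeholders.
cmd = build_command(
    "process_RNAseq.1.0.py",
    {"conf": "rnaseq.conf", "thread": 4, "prefix": "run1", "steps": "abcd"},
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces.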

Programs

process_RNAseq.1.0.py

This program takes a reference DNA sequence multifasta file and several fastq files and returns a BAM file for each accession and a final VCF file containing allele counts at each variant site where at least one variant allele is supported by at least one read.

Options:

--conf: A configuration file containing the paths to the reference sequence (multifasta file) and the RNAseq reads (fastq files).
--thread: Number of processors to use (i.e. maximum number of accessions processed at the same time). Do not exceed the number of processors available! [default: 1]
--queue: If you are using the SGE scheduler: the name of the queue used to parallelize the job. If not, leave empty.
--prefix: Prefix for vcf file and statistics folders.
--steps: A string containing steps to perform:
    a: STAR indexing reference,
    b: Merging fastq from first mapping step to identify splicing sites,
    c: STAR Indexing reference with identified splicing sites,
    d: Second mapping step,
    e: Merging bam libraries with identical identifier,
    f: Removing duplicates,
    g: Reordering reads,
    h: Splitting and trimming reads,
    i: Indel realignment,
    j: Allele counting,
    k: Genotype calling,
    l: Merging genotype calling,
    m: Gene exon coverage statistics calculation.
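
A --steps string can be checked before launching a long run. The sketch below is an illustration only, not the program's own validation: it maps each letter to the description listed above, rejects unknown letters, and reports letters skipped between the first and last requested step. Treating a skipped letter as worth a warning is an assumption based on each step using the output of the preceding one; a gap is fine if the skipped step was already run earlier. The function names are hypothetical.

```python
ORDER = "abcdefghijklm"

STEPS = {
    "a": "STAR indexing reference",
    "b": "Merging fastq from first mapping step to identify splicing sites",
    "c": "STAR Indexing reference with identified splicing sites",
    "d": "Second mapping step",
    "e": "Merging bam libraries with identical identifier",
    "f": "Removing duplicates",
    "g": "Reordering reads",
    "h": "Splitting and trimming reads",
    "i": "Indel realignment",
    "j": "Allele counting",
    "k": "Genotype calling",
    "l": "Merging genotype calling",
    "m": "Gene exon coverage statistics calculation",
}


def describe_steps(steps):
    """Return (letter, description) pairs, or raise ValueError on unknown letters."""
    unknown = sorted(set(steps) - set(STEPS))
    if unknown:
        raise ValueError("Unknown step letter(s): " + ", ".join(unknown))
    return [(letter, STEPS[letter]) for letter in steps]


def skipped_steps(steps):
    """Return the letters missing between the first and last requested step."""
    requested = sorted(set(steps))
    if not requested:
        return []
    first = ORDER.index(requested[0])
    last = ORDER.index(requested[-1])
    return [letter for letter in ORDER[first:last + 1] if letter not in steps]


for letter, description in describe_steps("abcd"):
    print(letter, description)
```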

Configuration file description:

The configuration file should contain 4 sections and must be formatted as follows:

[Libraries]
lib1 = genome_name path_to_mate1 path_to_mate2 ploidy
lib2 = genome_name path_to_single ploidy
...
[Reference]
genome = path_to_the_reference_sequence
[star]
options = additional_options_to_pass
[General]
max_size = max_read_length
gff3 (optional) = path to a gff3 file used to calculate statistics on genes coverage in step "m"
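
For many libraries, the configuration file can be generated rather than typed by hand. The following sketch uses Python's standard configparser module to produce the four sections described above. It is not part of VcfHunter: the function name, the accession names, the file paths and the STAR option shown are placeholders chosen for illustration, and whether the program accepts every layout configparser can write is an assumption.

```python
import configparser
import sys


def build_rnaseq_config(libraries, reference, star_options, max_size, gff3=None):
    """Build the 4-section configuration.

    libraries: list of tuples, either (genome_name, mate1, mate2, ploidy)
    or (genome_name, single, ploidy), written as lib1, lib2, ...
    """
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # keep key case exactly as written
    config["Libraries"] = {
        "lib%d" % index: " ".join(str(field) for field in fields)
        for index, fields in enumerate(libraries, start=1)
    }
    config["Reference"] = {"genome": reference}
    config["star"] = {"options": star_options}
    general = {"max_size": str(max_size)}
    if gff3 is not None:
        general["gff3"] = gff3  # only needed for step "m"
    config["General"] = general
    return config


# Placeholder values for illustration only.
config = build_rnaseq_config(
    libraries=[
        ("acc1", "acc1_R1.fastq", "acc1_R2.fastq", 2),  # paired-end library
        ("acc2", "acc2.fastq", 3),                      # single-end library
    ],
    reference="reference.fasta",
    star_options="--outSAMstrandField intronMotif",
    max_size=100,
)
config.write(sys.stdout)
```

To save the result, open a file in write mode and pass the handle to config.write instead of sys.stdout.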

Output:

Warning: This program needs to create a .dict and a .fai file in the folder where the reference sequence is stored if they do not already exist. Make sure that you have write permission on this folder!

Warning: If you use the STAR aligner for the read mapping onto the reference sequence, it only works on uncompressed fastq files (gunzip fastq.gz)

Outputs depend on the steps you are running, and each step uses the output of the preceding one.

step a: generates a folder (--prefix + refstar_1) in which the reference sequence is indexed,
step b: generates a file (--prefix + JUNCESTIMATION_SJ.out.tab) containing splicing sites detected by STAR on the complete dataset,
step c: generates a folder (--prefix + refstar_2) in which the reference sequence is indexed with splicing sites detected,
step d: generates a folder for each accession (named after the "genome_name" filled in column 3 of the configuration file), which contains the sam files of aligned reads and a .final.out file of mapping statistics for each library. In addition, a (--prefix) folder containing a mapping statistics file (--prefix + mapping.tab) for all accessions is generated.
step e: generates a merged bam (*_mer
