Verkko

Verkko is a hybrid genome assembly pipeline developed for telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO or Hifiasm corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads. Verkko is Finnish for net, mesh and graph.

Verkko uses Canu to correct remaining errors in the reads, builds a multiplex de Bruijn graph using MBG, aligns the Oxford Nanopore reads to the graph using GraphAligner, progressively resolves loops and tangles first with the HiFi reads then with the aligned Oxford Nanopore reads, and finally creates contig consensus sequences using Canu's consensus module.

Install
Getting Started
Outputs
Test data

Install:

Installing with a 'package manager' is recommended:

conda create -n verkko -c conda-forge -c bioconda -c defaults verkko

Alternatively, you can download and compile the source for a recent release.

<details> <summary><b>Compile from source</b></summary>

Compilation from source requires:
- GCC 9 or newer
- Rust 1.74 or newer.

(Do NOT download the .zip source code. It is missing files and will not compile. This is a known flaw with git itself.)

Running verkko requires:
- Python (v3.5+)
- Snakemake (>= v7.0, < 8.0.1)
- GraphAligner
- MashMap
- Winnowmap

Running verkko with Hi-C/Pore-C data also requires:
- Samtools
- BWA
- Minimap2
- seqtk
- networkx python library (>=2.6, <=3.5)
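Before compiling or running, it can help to confirm which of the tools listed above are already on your path. The following is a minimal sketch, not part of Verkko: `check_tools` is a hypothetical helper, and the executable names are assumptions that may differ on your system (for example `python` versus `python3`). It does not check tool versions or the networkx Python library.

```shell
# Report which tools are available on PATH; returns non-zero if any is missing.
# check_tools is a hypothetical helper; executable names are assumptions.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Core requirements, then the additional Hi-C/Pore-C requirements.
check_tools python3 snakemake GraphAligner mashmap winnowmap || echo "some core tools are missing"
check_tools samtools bwa minimap2 seqtk || echo "some Hi-C/Pore-C tools are missing"
```

Any tool reported as missing can be installed separately or symlinked as described below.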

To install an unreleased version of Verkko from github (for development) run:

git clone https://github.com/marbl/verkko.git
cd verkko/src
git checkout <desired branch>   # optional: only needed to build a branch other than master
make -j32

This will create the folders verkko/bin and verkko/lib/verkko. You can move the contents of these folders to a central installation location, or you can add verkko/bin to your path. If any of the dependencies (e.g. GraphAligner, winnowmap, mashmap) are not available in your path, you may also symlink them under verkko/lib/verkko/bin/.

</details>

Getting started:

Verkko is implemented as a Snakemake workflow, launched by a wrapper script that parses options and creates a verkko.yml file.

verkko -d <work-directory> --hifi <hifi-read-files> [--nano <ont-read-files>]

Running verkko with no options lists all available options with brief descriptions. At a minimum, verkko requires high-accuracy long reads, provided with the --hifi option. You can provide PacBio HiFi reads, Oxford Nanopore duplex reads, or both to the --hifi parameter. However, we strongly recommend including some ultra-long sequence data using the --nano parameter and phasing information (see below). For HERRO-corrected reads, provide the corrected reads with the --hifi option and the uncorrected reads with --nano. The output of verkko will be phased scaffolds. Note that no attempt is made to generate a primary or pseudo-haplotype assembly.

Phasing:

Verkko supports extended phasing with rukki, using either trio or Hi-C information.

To run in trio mode, you must first generate merqury hapmer databases and pass them to verkko.

<details> <summary><b>Build meryl DBs</b></summary> Please use git clone to pull the latest version of merqury (see the merqury documentation for details). Then, if you have a SLURM cluster, you can run:

# assumes you have maternal/paternal folders
# each containing a fofn of sequence inputs named [mp]aternal.fofn
# and a top level folder with a child.fofn specifying F1 sequence inputs
cd maternal
$MERQURY/_submit_build.sh -c 30 maternal.fofn maternal_compress
cd ../paternal
$MERQURY/_submit_build.sh -c 30 paternal.fofn paternal_compress
cd ../
$MERQURY/_submit_build.sh -c 30 child.fofn    child_compress
ln -s maternal/maternal_compress.k30.meryl
ln -s paternal/paternal_compress.k30.meryl

Without a grid, you can run:

meryl count compress k=30 threads=XX memory=YY maternal.*fastq.gz output maternal_compress.k30.meryl
meryl count compress k=30 threads=XX memory=YY paternal.*fastq.gz output paternal_compress.k30.meryl
meryl count compress k=30 threads=XX memory=YY    child.*fastq.gz output    child_compress.k30.meryl

replacing XX and YY with the threads and memory you want meryl to use. Once you have the databases, run:

$MERQURY/trio/hapmers.sh \
  maternal_compress.k30.meryl \
  paternal_compress.k30.meryl \
     child_compress.k30.meryl

Make sure to count k-mers in compressed space. Child data is optional; if you do not have it, exclude child_compress.k30.meryl from the input to hapmers.sh and use its outputs maternal_compress.k30.only.meryl and paternal_compress.k30.only.meryl in the verkko command below.

</details>

verkko -d asm \
  --hifi hifi/*.fastq.gz \
  --nano  ont/*.fastq.gz \
  --hap-kmers paternal_compress.k30.hapmer.meryl \
              maternal_compress.k30.hapmer.meryl \
              trio

To run in Hi-C mode, reads should be provided using the --hic1 and --hic2 options. For example:

verkko -d asm \
  --hifi hifi/*.fastq.gz \
  --nano ont/*.fastq.gz \
  --hic1 hic/*R1*fastq.gz  \
  --hic2 hic/*R2*fastq.gz
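Because --hic1 and --hic2 take the two mates of paired-end reads, it may be worth confirming that every R1 file has an R2 partner before launching. The sketch below is not part of Verkko: `check_hic_pairs` is a hypothetical helper, and it assumes mates differ only by `R1` versus `R2` in the file name, mirroring the globs in the example above.

```shell
# Check that every *R1*fastq.gz file in a directory has a matching *R2* mate.
# Assumes mates differ only by "R1" vs "R2" in the file name (hypothetical
# naming convention); only the first "R1" in each name is substituted.
check_hic_pairs() {
    dir=$1
    status=0
    for r1 in "$dir"/*R1*fastq.gz; do
        # If the glob matched nothing, the literal pattern is left in place.
        [ -e "$r1" ] || { echo "no R1 files found in $dir"; return 1; }
        name=$(basename "$r1")
        r2="$dir/$(printf '%s\n' "$name" | sed 's/R1/R2/')"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1 (expected $r2)"
            status=1
        fi
    done
    return "$status"
}

# Usage: check_hic_pairs hic && echo "all Hi-C pairs present"
```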

To run in PoreC mode, reads should be provided using the --porec option. For example:

verkko -d asm \
  --hifi hifi/*.fastq.gz \
  --nano ont/*.fastq.gz \
  --porec porec/*fastq.gz

Hi-C/PoreC integration was tested mostly on human and primate genomes. Please see --rdna-tangle, --uneven-depth and --haplo-divergence options if you want to assemble something distant from human and/or have uneven coverage. If you encounter issues or have questions about appropriate parameters, please open an issue.

Scaffolding:

Verkko includes a separate scaffolding module, which is used when Hi-C or Pore-C data are provided (this is separate from Rukki's ability to connect some contigs into scaffolds with just trio data). Verkko tries to make a rough estimate of the gap size based on the assembly graph. When the gap size estimate is smaller than 100K, we report the estimated value. For larger gaps, or gaps where a size cannot be estimated, we always report 100K N's.
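The gap-size reporting rule above can be restated as a small function. This is an illustration of the rule as described, not Verkko's own code; `reported_gap_size` is a hypothetical name, and the sketch assumes that "100K" means 100,000 bases and that a missing estimate is passed as an empty string.

```shell
# Illustration of the reporting rule, not Verkko's implementation.
# Assumes 100K = 100000 bases; an empty argument means "no estimate".
reported_gap_size() {
    estimate=$1
    max_gap=100000
    if [ -n "$estimate" ] && [ "$estimate" -lt "$max_gap" ]; then
        echo "$estimate"
    else
        echo "$max_gap"
    fi
}

reported_gap_size 25000    # estimate below 100K: reported as estimated
reported_gap_size 350000   # larger gap: reported as 100K
reported_gap_size ""       # no estimate available: reported as 100K
```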

The scaffolding module uses telomere positions in the assembly (detected with seqtk telo), so if your species has a different telomeric repeat motif than vertebrates (CCCTAA), provide it with the --telomere-motif option.

If available, you can provide the genome of another individual of the same or a closely related species with the --ref option. This is not reference-based assembly; the reference is used only as guidance in scaffolding.

Since the scaffolding module relies on the diploid structure of an assembly, it is not compatible with the --haploid option; we recommend the YaHS standalone scaffolder for such cases.

Polyploid scaffolding and phasing are not yet supported.

Consensus for user-provided paths:

If you already have an assembly but want to customize or change how the nodes in the graph are used, you can do so with the --paths option. Verkko will generate contigs from a user-provided file with paths through the assembly graph. Paths should be provided in GAF path format, one path per line. For example: name >utig4-1<utig4-2 HAPLOTYPE1, where utig4-1 is a node in your existing assembly graph taken in the forward orientation and utig4-2 is taken in reverse complement. This option also requires the original assembly directory (specified as --assembly <path_to_assembly>) and input reads. Output is specified as -d <output_dir> (<output_dir> should not be equal to <path_to_assembly>).
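As a rough sketch of preparing such a file, the snippet below writes two paths and checks that each line has three fields and that the path column contains only oriented node names. It is not part of Verkko: `validate_paths`, the names `contig_a` and `contig_b`, the node `utig4-3`, and the label `HAPLOTYPE2` are hypothetical, and tab separation between the columns is an assumption.

```shell
# Write a small paths file. Tab-separated columns are an assumption;
# contig_a, contig_b, utig4-3 and HAPLOTYPE2 are made-up example values.
paths_file=$(mktemp)
printf 'contig_a\t>utig4-1<utig4-2\tHAPLOTYPE1\n' >  "$paths_file"
printf 'contig_b\t<utig4-3\tHAPLOTYPE2\n'          >> "$paths_file"

# Hypothetical helper: every line needs 3 fields, and the path column must be
# one or more node names, each prefixed by > (forward) or < (reverse complement).
validate_paths() {
    awk -F '\t' '
        NF != 3 { printf "line %d: expected 3 tab-separated fields\n", NR; bad = 1; next }
        $2 !~ /^([<>][^<>]+)+$/ { printf "line %d: malformed path %s\n", NR, $2; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

validate_paths "$paths_file" && echo "paths file looks well-formed"

# The file would then be passed to verkko along these lines:
#   verkko -d <output_dir> --assembly <path_to_assembly> --paths <paths_file> \
#     --hifi <hifi-read-files> --nano <ont-read-files>
```

The check only covers the file's shape; it does not confirm that the named nodes exist in your assembly graph.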

Running on a grid:

By default, verkko will run the snakemake workflow and all compute on the local machine. Support for SGE, Slurm, LSF, and PBS (untested) can be enabled with the --grid option. This will run the snakemake workflow on the local machine but submit all compute to the grid. To launch both the snakemake workflow and compute on the grid, wrap the verkko command in a shell script and submit it using your scheduler. If you're using conda, you may need to make the conda-installed python your default. You can do this with the --python option when calling verkko.
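A wrapper for submitting the whole workflow might look like the sketch below. The file name `run_verkko.sh` and the read locations are placeholders, the commented submission line assumes Slurm's `sbatch`, and passing an interpreter path to --python is an assumption about how that option is used; check the output of verkko with no options for the exact form.

```shell
# Write a wrapper script that can be handed to the scheduler.
# The heredoc is quoted, so globs and $(...) are expanded when the wrapper
# runs on the grid, not when this file is written.
cat > run_verkko.sh <<'EOF'
#!/bin/sh
verkko -d asm \
  --hifi hifi/*.fastq.gz \
  --nano ont/*.fastq.gz \
  --grid \
  --python "$(command -v python)"
EOF
chmod +x run_verkko.sh

# Submit with your scheduler, for example on Slurm:
#   sbatch run_verkko.sh
```

Make sure the conda environment is active in the submitted job so that `command -v python` resolves to the conda-installed interpreter.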

<details> <summary>Customizing grid requests (QOS, partition, etc)</summary> Verkko will submit jobs to the default queue on your grid environment. It is possible to customize how jobs are submitted to specify partitions or other options like accounting or QOS. For example:

--snakeopts '--cluster "./slurm-sge-submit.sh {threads} {resources.mem_gb} {resources.time_h} {rulename} {resources.job_id} --partition=quick --account=verkko_asm --qos=verkko_qos"'

on SLURM will request the 'quick' queue and pass account and qos options.

</details>

Verkko uses default CPU/memory/time options for different parts of the pipeline. Users usually do not need to change them; however, advanced tuning is possible with the --<stage_code>-run options.

<details> <summary>Here we list th
