lavaLampPlot

Instructions, python and R code for generating lava lamp plots of kmer coverage as the kmers themselves, or based on kmer-coverage averages on the raw reads.

This tool was used in the analysis of the Hoilungia hongkongensis genome (Figure S2). It is unlikely I will make a dedicated paper, so if these scripts are used, please cite the paper:

Eitel, M., Francis, W.R., Varoqueaux, F., Daraspe, J., Osigus, H-J., Krebs, S., Vargas, S., Blum, H., Williams, G.A., Schierwater, B., Wörheide, G. (2018) Comparative genomics and the nature of placozoan species. PLoS Biology 16(7):e2005359.

Overview

This includes a few scripts to generate a "lava lamp" plot of kmer coverage separated by GC content of the kmers. This is a heat-map of frequency of reads or kmers with a particular coverage and GC content, giving a sense of the coverage of the target species and any symbionts or contaminants.

Such plots were intended for use on metagenomic data (whole-genome shotgun, not amplicon), and can be generated with raw data, meaning that no assembly or other processing of the reads is needed. This is usually sufficient to reveal heterozygosity (for eukaryotes) or the presence of symbionts with very different GC content than the host, and instruct whether to even use the data or not. For instance, distinct "spots" suggest an good extraction, while a smear would suggest that there were some biases in the extraction or sequencing and that assembly may be highly problematic.

Below is an example plot from Hydra vulgaris SRR1032106 using a kmer of 31. The putative symbiont is visible in the plot with a high GC content. The second plot counts based on GC of the full read and median coverage kmers in that read, rather than the kmers alone.

This is conceptually similar to what is done in the "blob" plots by blobtools, (paper by Kumar et al 2013) though that process makes use of assembled contigs while this one considers the raw reads directly. However, instructions for generating such plots are also below. Some assemblers output contig coverage information with the contigs, and these can be used without mapping the reads.

hydra_dovetail_w_meta_SRR1032106.gc_cov.png

When making such plots for metagenomics, be aware of the implications of contig coverage for many potential analyses. The coverage is calculated by reads per base, but for the blobs, should reflect the relative number of copies of contigs or chromosomes. For something like a bacterial sample, this could be considered to be a relative counts of cells.

Mapping the original reads back to the assembled contigs should make a mostly even coverage within each scaffold. The raw total counts per gene or per contig will vary by the length, and therefore must be normalized in order to be meaningful (such normalization is already expressed by the coverage). For example, if two bacterial species had equal abundance, but one has a genome 2x larger than the other, this will account for 2x the number of reads, even if the coverage is the same.

This means longer genes, contigs, or chromosomes will have more counts, but also means that total counts to homologous operons between species will vary by both coverage and length. If the objective is to make a statement about the abundance of particular genes or pathways in the sample, the contig coverage could be simplistically applied to the operon. However, it may be better to quantify the actual expression (qPCR or metatranscriptomics) or protein activity instead.

Dependencies

Download the jellyfish kmer counter (or another preferred kmer counter). This used to be supplied with Trinity for transcriptome assembly, and was previously in the trinity-plugins/ folder.

Operation

Run jellyfish on the raw genomic data, i.e. paired end reads.

jellyfish count -m 31 -s 2G -C -o fastq.counts -U 1000 -t 8 all_reads.fastq

data can also be piped, in which case you must provide the input /dev/fd/0 (or /dev/stdin):

cat *.fastq | jellyfish count -m 31 -s 2G -C -o fastq.counts -U 1000 -t 8 /dev/fd/0

or unzipped in real time:

gzip -dc *.fastq.gz | jellyfish count -m 31 -s 2G -C -o fastq.counts -U 1000 -t 8 /dev/fd/0
- -s is the memory buffer, 2G (2 billion, or 2000000000) is safe for many analyses, though the array will expand if it is maxed out. This can consume a lot of memory (30-50Gb), so it is not advisable to run on a laptop.
- -U is the max coverage to count. 1000 will catch mostly transposons.
- -t is the thread count / number of CPUs.
- -C refers to canonical counting, meaning both + and - strands.
- -m is the kmer size. If downstream steps involving Trinity will be used (Step 4 and after) then do not select a kmer greater than 32, as this is the upper limit for the fastaToKmerCoverageStats program.
Optionally, a lower limit can be set with -L. For generating kmer plots (step 3 below) this would affect the data and not be advised, but for the read plots (steps 4 and 5), the Trinity program fastaToKmerCoverageStats assumes that any values below 2 are 1. Since the bulk of the kmers have a count of 1 (66% in the Hydra sample), this makes a much smaller file and will run faster at subsequent steps.
Write out kmers with coverage counts. This file can be massive (easily 50Gb+, 83G for Hydra SRR1032106) so it is instead piped through the python script. -k and -u are kmer size and max coverage from jellyfish. This step is a single process, so can take some time for very large files with 1 billion+ kmers (20-30 minutes). This generates a matrix of GC vs coverage across all kmers, where the value at each position is the number of unique kmers with that GC and coverage. This is run through python so that R does not have to deal with the counting in addition to the graphing.

jellyfish dump fastq.counts | fastqdumps2histo.py -k 31 -u 1000 - > fastq.gc_cov.histo.csv

If the Trinity mode steps (4 and 5) will eventually be carried out, then the jellyfish kmer dump is required. A filtered version where kmers with a count of 1 have been removed can be generated using the -j option.

jellyfish dump fastq.counts | fastqdumps2histo.py -k 31 -u 1000 -j fastq.dumps - > fastq.gc_cov.histo.csv
Configure the R script for this .csv file, and pdf output. Add text to annotate the plot as needed with the text(x,y,"important things") command. The X and Y values correspond to the center position of the X and the GC count (not the percentage). Thus a GC of 25% would be something like 7 or 8 for a kmer of 31. This script can be run interactively in an R environment (such as RKWard or RCommander) or in the command line. For the command line, an optional 3rd argument can be given to change the X-axis scale (default is 1000).

Rscript jellyfish_gc_coverage_blob_plot_v2.R fastq.gc_cov.histo.csv counts_k31_gc_cov.pdf
Slice out sections of reads based on kmer coverage. This requires finally generating the jellyfish dump file, so that will be redone first if it was not done at the earlier step.

jellyfish dump fastq.counts > fastq.dumps

Then -T (Trinity mode) is called with kmersorter. This generates a table of median coverage per read, which would be used in Trinity to randomly normalize the data based on coverage. Instead, here it uses the same method to deterministically keep reads with coverage above or below some specified value (-a and -b, respectively). Specify the kmer used with -k. The Trinity base directory is given with -D. Some of the subprocesses can use multithreads, which is specified by -p. Note that the most memory intensive step, fastaToKmerCoverageStats, has no intrinsic memory limit as it reads the entire fastq.dumps file into memory; this is why the reduced dump is generated with the -j option in step 2.

kmersorter.py -T -k 31 -p 8 -a 100 -b 200 -D ~/trinityrnaseq/ -1 reads_1.fq -2 reads_2.fq fastq.dumps
Generate a more precise coverage to GC map using the entire read rather than kmers. This is run similarly as before with some alternate options in fastqdumps2histo. As ab

LavaLampPlot

Install / Use

README

lavaLampPlot

Contents:

Overview

Dependencies

Operation