LavaLampPlot
instructions, python and R code for generating lava lamp plots of kmer coverage
Install / Use
/learn @wrf/LavaLampPlotREADME
lavaLampPlot
Instructions, python and R code for generating lava lamp plots of kmer coverage as the kmers themselves, or based on kmer-coverage averages on the raw reads.
This tool was used in the analysis of the Hoilungia hongkongensis genome (Figure S2). It is unlikely I will make a dedicated paper, so if these scripts are used, please cite the paper:
Eitel, M., Francis, W.R., Varoqueaux, F., Daraspe, J., Osigus, H-J., Krebs, S., Vargas, S., Blum, H., Williams, G.A., Schierwater, B., Wörheide, G. (2018) Comparative genomics and the nature of placozoan species. PLoS Biology 16(7):e2005359.
Contents:
- Dependencies
- Operation to make kmer or read-based lavalamp plots
- Blob plots for metagenomes
- Shiny app for interactive binning
- Usage tips including suggestions on data display
- Troubleshooting
Overview
This includes a few scripts to generate a "lava lamp" plot of kmer coverage separated by GC content of the kmers. This is a heat-map of frequency of reads or kmers with a particular coverage and GC content, giving a sense of the coverage of the target species and any symbionts or contaminants.
Such plots were intended for use on metagenomic data (whole-genome shotgun, not amplicon), and can be generated with raw data, meaning that no assembly or other processing of the reads is needed. This is usually sufficient to reveal heterozygosity (for eukaryotes) or the presence of symbionts with very different GC content than the host, and instruct whether to even use the data or not. For instance, distinct "spots" suggest an good extraction, while a smear would suggest that there were some biases in the extraction or sequencing and that assembly may be highly problematic.
Below is an example plot from Hydra vulgaris SRR1032106 using a kmer of 31. The putative symbiont is visible in the plot with a high GC content. The second plot counts based on GC of the full read and median coverage kmers in that read, rather than the kmers alone.

This is conceptually similar to what is done in the "blob" plots by blobtools, (paper by Kumar et al 2013) though that process makes use of assembled contigs while this one considers the raw reads directly. However, instructions for generating such plots are also below. Some assemblers output contig coverage information with the contigs, and these can be used without mapping the reads.

When making such plots for metagenomics, be aware of the implications of contig coverage for many potential analyses. The coverage is calculated by reads per base, but for the blobs, should reflect the relative number of copies of contigs or chromosomes. For something like a bacterial sample, this could be considered to be a relative counts of cells.

Mapping the original reads back to the assembled contigs should make a mostly even coverage within each scaffold. The raw total counts per gene or per contig will vary by the length, and therefore must be normalized in order to be meaningful (such normalization is already expressed by the coverage). For example, if two bacterial species had equal abundance, but one has a genome 2x larger than the other, this will account for 2x the number of reads, even if the coverage is the same.

This means longer genes, contigs, or chromosomes will have more counts, but also means that total counts to homologous operons between species will vary by both coverage and length. If the objective is to make a statement about the abundance of particular genes or pathways in the sample, the contig coverage could be simplistically applied to the operon. However, it may be better to quantify the actual expression (qPCR or metatranscriptomics) or protein activity instead.

Dependencies
Download the jellyfish kmer counter (or another preferred kmer counter). This used to be supplied with Trinity for transcriptome assembly, and was previously in the trinity-plugins/ folder.
Operation
-
Run jellyfish on the raw genomic data, i.e. paired end reads.
jellyfish count -m 31 -s 2G -C -o fastq.counts -U 1000 -t 8 all_reads.fastqdata can also be piped, in which case you must provide the input
/dev/fd/0(or/dev/stdin):cat *.fastq | jellyfish count -m 31 -s 2G -C -o fastq.counts -U 1000 -t 8 /dev/fd/0or unzipped in real time:
gzip -dc *.fastq.gz | jellyfish count -m 31 -s 2G -C -o fastq.counts -U 1000 -t 8 /dev/fd/0-sis the memory buffer, 2G (2 billion, or 2000000000) is safe for many analyses, though the array will expand if it is maxed out. This can consume a lot of memory (30-50Gb), so it is not advisable to run on a laptop.-Uis the max coverage to count. 1000 will catch mostly transposons.-tis the thread count / number of CPUs.-Crefers to canonical counting, meaning both + and - strands.-mis the kmer size. If downstream steps involving Trinity will be used (Step 4 and after) then do not select a kmer greater than 32, as this is the upper limit for thefastaToKmerCoverageStatsprogram.
Optionally, a lower limit can be set with
-L. For generating kmer plots (step 3 below) this would affect the data and not be advised, but for the read plots (steps 4 and 5), the Trinity programfastaToKmerCoverageStatsassumes that any values below 2 are 1. Since the bulk of the kmers have a count of 1 (66% in the Hydra sample), this makes a much smaller file and will run faster at subsequent steps. -
Write out kmers with coverage counts. This file can be massive (easily 50Gb+, 83G for Hydra SRR1032106) so it is instead piped through the python script.
-kand-uare kmer size and max coverage from jellyfish. This step is a single process, so can take some time for very large files with 1 billion+ kmers (20-30 minutes). This generates a matrix of GC vs coverage across all kmers, where the value at each position is the number of unique kmers with that GC and coverage. This is run through python so that R does not have to deal with the counting in addition to the graphing.jellyfish dump fastq.counts | fastqdumps2histo.py -k 31 -u 1000 - > fastq.gc_cov.histo.csvIf the Trinity mode steps (4 and 5) will eventually be carried out, then the jellyfish kmer dump is required. A filtered version where kmers with a count of 1 have been removed can be generated using the
-joption.jellyfish dump fastq.counts | fastqdumps2histo.py -k 31 -u 1000 -j fastq.dumps - > fastq.gc_cov.histo.csv -
Configure the R script for this .csv file, and pdf output. Add text to annotate the plot as needed with the
text(x,y,"important things")command. The X and Y values correspond to the center position of the X and the GC count (not the percentage). Thus a GC of 25% would be something like 7 or 8 for a kmer of 31. This script can be run interactively in an R environment (such as RKWard or RCommander) or in the command line. For the command line, an optional 3rd argument can be given to change the X-axis scale (default is 1000).Rscript jellyfish_gc_coverage_blob_plot_v2.R fastq.gc_cov.histo.csv counts_k31_gc_cov.pdf -
Slice out sections of reads based on kmer coverage. This requires finally generating the jellyfish dump file, so that will be redone first if it was not done at the earlier step.
jellyfish dump fastq.counts > fastq.dumpsThen
-T(Trinity mode) is called with kmersorter. This generates a table of median coverage per read, which would be used in Trinity to randomly normalize the data based on coverage. Instead, here it uses the same method to deterministically keep reads with coverage above or below some specified value (-aand-b, respectively). Specify the kmer used with-k. The Trinity base directory is given with-D. Some of the subprocesses can use multithreads, which is specified by-p. Note that the most memory intensive step, fastaToKmerCoverageStats, has no intrinsic memory limit as it reads the entire fastq.dumps file into memory; this is why the reduced dump is generated with the-joption in step 2.kmersorter.py -T -k 31 -p 8 -a 100 -b 200 -D ~/trinityrnaseq/ -1 reads_1.fq -2 reads_2.fq fastq.dumps -
Generate a more precise coverage to GC map using the entire read rather than kmers. This is run similarly as before with some alternate options in fastqdumps2histo. As ab
