assemblage [əˈsɛmblɪdʒ]
noun

a number of things or persons assembled together; collection; assembly
(Cookery) a list of dishes served at a meal or the dishes themselves
the act or process of assembling or the state of being assembled
(Fine Arts & Visual Arts / Art Terms) a three-dimensional work of art that combines various objects into an integrated whole

From Collins English Dictionary – Complete and Unabridged © HarperCollins Publishers 1991, 1994, 1998, 2000, 2003

Updates:

2016-09-08: Dominik Laetsch's Blobtools now completely supersedes blobology/TAGC plots. Try it out - it's so much better than anything I've ever written! https://github.com/DRL/blobtools - documentation at https://blobtools.readme.io

2015-06-15: Currently, the best looking blobplots/TAGC plots are made by Dominik Laetsch's scripts at https://github.com/DRL/blobtools-light

2013-10-01: The scripts for making blobplots / Taxon-annotated GC-coverage plots (TAGC plots) are being moved and maintained at github.com/blaxterlab/blobology

The scripts in this repository are being left as they are, as a record of the work accompanying my PhD thesis.

About

This is a set of scripts for working with genome assemblies, making taxon-annotated GC-coverage plots (a.k.a blob plots), extracting reads belonging to the blobs, etc

These blob plots as used here were described in http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3294205/

Only 6 months after this paper was published did our group realise that taxon-annotated GC-coverage plots acronyms to, wait for it, TAGC plots! (thank Mark Blaxter for spotting that).

Figure 3 from the paper:

Caenorhabditis sp 5 Blob Plot

Figure 3: A low-complexity metagenome. Taxon-annotated GC-coverage scatter-plot (as in Fig. 2) for contigs from a preliminary SE assembly of a Caenorhabditis sp 5 DNA sample. 10,000 randomly selected contigs were BLAST-annotated by comparison to the NCBI nt database, and coloured by class-of-origin of the best match identified. Full colour version is available online

There are three other READMEs in this repository corresponding to three of the chapters of my thesis:

Example Data Sets

As an example, we can use the following study from the Short Read Archive:

ERP001495 - De novo whole-genome sequence of the free-living nematode Caenorhabditis sp. 5 strain JU800 DRD-2008

This study has one sample:

ERS147916 - Caenorhabditis sp. 5 JU800 DRD-2008

Two libraries (or "Experiments" in SRA terminology) were run for this sample:

ERX114449 - 300 bp library with Illumina HiSeq2000 101 bp PE sequencing
ERX114450 - 600 bp library with Illumina HiSeq2000 101 bp PE sequencing

And these are the four files that can be accessed from http://www.ebi.ac.uk/ena/data/view/ERP001495

g_ju800_110714HiSeq300_1.txt.gz - 300 bp library forward read
g_ju800_110714HiSeq300_2.txt.gz - 300 bp library reverse read
g_ju800_110714HiSeq600_1.txt.gz - 600 bp library forward read
g_ju800_110714HiSeq600_2.txt.gz - 600 bp library reverse read

How-Tos

This list of How-Tos demonstrates how the scripts in this repository can be used in conjunction with other installed software to assemble nematode (and other small metazoan genomes).

How to make a taxon-annotated GC cov "blob" plot

This section is a meta section that describes the How-Tos you need to make a taxon-annotated GC-cov blob plot like the one at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3294205/figure/Fig3/

adapter- and quality-trim Illumina fastq reads using sickle and scythe in one command with no intermediate files
create a preliminary assembly using ABySS
map reads to an assembly to get insert-size and coverage information
assign high-level taxon IDs to contigs
make blobology plots with R

How to sub-sample sequences at random for test read sets

The example read files are between 5 and 8 GB each, and processing the complete data set could take a long time. If you just want to run all the How-Tos in this README on smaller files to test how the scripts work, use the pick_random.pl script:

zcat g_ju800_110714HiSeq300_1.txt.gz | pick_random.pl 4 0.1 > g_ju800_110714HiSeq300_1.txt.r0.1

This command will gunzip the gzipped fastq file (zcat file.gz is the same as gunzip -c file.gz) and pipe the output through pick_random.pl. The script takes two arguments - the first tells you how many lines at a time should be treated as a block, while the second tells you what proportion of the file to pick (0.1 or 10% in this case)

However, running this command independently on the forward (_1) and reverse (_2) read files will result in different reads being picked. To sample both pairs at the same time, first interleave the files, and then pick 8 lines at a time (4 for the forward read, and 4 for the reverse read), and then use pick_random.pl to get 10% of the reads:

shuffleSequences_fastx.pl 4 <(zcat g_ju800_110714HiSeq300_1.txt.gz) <(zcat g_ju800_110714HiSeq300_2.txt.gz`) |
pick_random.pl 8 0.1 > g_ju800_110714HiSeq300_interleaved.txt.r0.1

Note: The <() syntax is for bash process substitution. Anything inside <(...) will be executed and its output will be treated by the containing command as if it is a file.

shuffleSequences_fastx.pl is based on shuffleSequences_fastq.pl that used to ship with Velvet. The difference is that it can be used to shuffle both fasta (set 1st argument to 2) and fastq (set 1st argument to 4) sequences.

g_ju800_110714HiSeq300_interleaved.txt.r0.1 is an interleaved/shuffled file. If you want the _1 and the _2 reads in separate files, you will have to do one more step:

unshuffleSequences_fastx.pl 4 g_ju800_110714HiSeq300_interleaved.txt.r0.1

which will create g_ju800_110714HiSeq300_interleaved.txt.r0.1.1 and g_ju800_110714HiSeq300_interleaved.txt.r0.1.2

How to use FastQC to check raw read qualities before assembly

Run FastQC (we used version 0.10.0) using this command:

fastqc --nogroup <readfiles...>

--nogroup shows each cycle separately rather than grouping cycle 10 onwards in groups of 5. Very useful for pinpointing failed cycles (i.e. cycles with lots of Ns)

FastQC output is divided into sections as described below:

Basic Statistics: Total number of reads and read length. A rough estimate of the genome coverage is given by multiplying the number of reads by the read length to get the number of bases sequenced per file, summing across all files, and dividing this number by the expected genome size. If there are too few reads (<50X genome coverage), more sequencing will almost certainly be needed for assembling a de novo genome.
Per base sequence quality: Low quality towards the end of each read is expected and can be trimmed. Typically, a Phred quality threshold of 20 is used. Most cycles/positions should have a median quality above 20, otherwise the run may be very low quality and will assemble poorly. If there is a quality dip one or more times in the middle of the sequence (or at the start), then additional correcting might be needed. Most quality trimming algorithms use sliding windows and trim only from the 3' end.
Per sequence quality scores - Mean sequence quality histogram
Per base sequence content and per base GC content: Genomic DNA should have roughly even levels of G and C, and roughly even levels of A and T. If the first few bases are very skewed, then that might indicate that multiplexing barcode adapters have been left on. G > C, or A > T might indicate sequencing bias, a known problem with Illumina sequencing. For RNA-Seq assemblies, the first 10 bases will often be even more skewed or "bumpy", but that is expected as RNA-Seq samples are primed with random hexamers nonamers which are a) not equally distributed in the original hexamer or nonamer mix and b) anneal at different rates, further increasing the skew.
Per sequence GC content - An uncontaminated genome sample will typically have a single GC peak. If there is a bump, that might indicate a contaminant in the sample. Typically, for nematodes that have an average GC between 30-50%, a bump on the right slope might indicate bacterial contamination, as many bacteria have higher GC content.
Per base N content: A large number of Ns at any one position might indicate an Illumina cycle failure. Most assembly algorithms throw away reads with an N in them, but if the proportion of reads thrown away is too large, error correction might be a better solution.
Sequence Length Distribution - Nothing to check here for raw Illumina reads - all reads are the same length. However, this plot could be useful for adapter- and quality- trimmed reads to see how long the remaining good quality reads are.
Sequence Duplication Levels - expected to be high for high coverage, and for pcr duplications in mate-pair data.
Overrepresented sequences. If there are any adapter sequences, they usually show up here. RNA-Seq libraries may also show overrepresented reads (or k-mers, see below) because of highly expressed genes in the sample.
K-mer Content - Hint at adapter sequences present (for example if the library insert is smaller than the read length) or if the multiplexing barcode is still present at the start of the read. We found 8-mers more useful than the default 5-mers, and you can modify the FastQC source code to change the default length.

How to adap

Assemblage

Install / Use

README