FastK: A K-mer counter (for HQ assembly data sets)

Author: Gene Myers First: July 22, 2020 Current: April 18, 2021

Command Line
HPC Operation
Core Applications
- Histex: Display a FastK histogram or convert to 1-code
- Tabex: List, Check, find a k‑mer in a FastK table, or convert to 1-code
- Profex: Display a FastK profile or convert to 1-code
- Logex: Combine kmer,count tables with logical expressions & filter with count cutoffs
- Symmex: Produce a symmetric k-mer table from a canonical one
- KmerMap: Produce a .bed file showing all the regions in a target covered by a set of k-mers
C-Library Interface
File Encodings

Command Line

FastK is a k‑mer counter that is optimized for processing high quality DNA assembly data sets such as those produced with an Illumina instrument or a PacBio run in HiFi mode. For example, it is about 2 times faster than KMC3 when counting 40-mers in a 50X HiFi data set. Its relative speedup decreases with increasing error rate or increasing values of k, but it remains a general program that works for any DNA sequence data set and any choice of k. It is further designed to handle data sets of arbitrarily large size, e.g. a 100X data set of a 32Gbp Axolotl genome (3.2Tbp) can be processed on a machine with just 12GB of memory provided it has ~6.5TB of disk space available.

FastK can produce the following outputs:

1. a histogram of the frequency with which each k‑mer in the data set occurs.
2. a table of k‑mer/count pairs sorted lexicographically on the k‑mer where a < c < g < t.
3. a k‑mer count profile of every sequence in the data set. A profile is the sequence of counts of the n-(k-1) consecutive k‑mers of a sequence of length n.
4. a relative profile of every sequence in the data set against a FastK table produced for another data set.

Note carefully that, in order to accommodate the unknown orientation of a sequencing read, a k‑mer and its Watson-Crick complement are considered to be the same k‑mer by FastK, where the lexicographically smaller of the two alternatives is termed canonical. The histogram is always produced, whereas the production of a k‑mer table (2.) and profiles (3.&4.) is controlled by command line options. The table (2.) is over just the canonical k‑mers present in the data set. Producing profiles (3.&4.) as part of the underlying sort is much more efficient than producing them after the fact using a table or hash of all k‑mers, as is necessary when using other k‑mer counter programs. The profiles are recorded in a space-efficient compressed form, e.g. about 4.7 bits per base for a recent 50X HiFi assembly data set.
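The canonical-k‑mer convention described above can be illustrated with a small Python sketch. It is not FastK's implementation (FastK is a sort-based C program); the function names are hypothetical and a plain in-memory counter stands in for the real algorithm.

```python
from collections import Counter

# Watson-Crick complement of each base (lower case, as in the text).
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]

def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its complement.

    Python string comparison already orders a < c < g < t.
    """
    return min(kmer, reverse_complement(kmer))

def count_canonical(seqs, k):
    """Count canonical k-mers over an iterable of sequences (illustrative only)."""
    counts = Counter()
    for seq in seqs:
        s = seq.lower()
        # A sequence of length n has n-(k-1) consecutive k-mers.
        for i in range(len(s) - k + 1):
            counts[canonical(s[i:i + k])] += 1
    return counts
```

In this sketch the 2-mers "ac" and "gt" are counted together under "ac", since each is the Watson-Crick complement of the other.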

1. FastK [-k<int(40)>] [-t[<int(1)>]] [-p[:<table>[.ktab]]] [-c] [-bc<int>]
         [-v] [-N<path_name>] [-P<dir($TMPDIR)>] [-M<int(12)>] [-T<int(4)>]
            <source>[.cram|.[bs]am|.db|.dam|.f[ast][aq][.gz]] ...

FastK counts the number of k‑mers in a corpus of DNA sequences over the alphabet {a,c,g,t} for a specified k‑mer size, 40 by default. The input data can be in one or more CRAM, BAM, SAM, fasta, or fastq files, where the latter two can be gzip'd. The data can also be in Dazzler databases. The type of the file is determined by its extension (and not its contents). The extension need not be given if the root name suffices to uniquely identify a file. If more than one source file is given, they must all be of the same type in the current implementation.

FastK produces a number of outputs depending on the setting of its options. By default, the outputs will be placed in the same directory as that of the first input and begin with the prefix <source> which is the first path name absent any suffix extensions. For example, if the input is ../BLUE/foo.fastq then <source> is ../BLUE/foo, the outputs will be placed in directory ../BLUE, and all result file names will begin with foo. If the ‑N option is specified then the path name specified is used as <source>.
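As a rough illustration of how the <source> prefix relates to an input path, the following Python sketch strips the extensions listed in the usage line. It assumes that only those extensions are removed. That assumption and the name source_root are not taken from the FastK code.

```python
import re

# Extensions from the usage line: .cram, .bam/.sam, .db, .dam,
# and .fa/.fq/.fasta/.fastq, the last four optionally followed by .gz.
_EXT = re.compile(r"\.(cram|[bs]am|db|dam|f(ast)?[aq](\.gz)?)$")

def source_root(path: str) -> str:
    """Return the path with a recognized suffix extension removed (hypothetical helper)."""
    return _EXT.sub("", path)
```

For the example in the text, source_root("../BLUE/foo.fastq") gives "../BLUE/foo".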

One can select any value of k ≥ 5 with the ‑k option. FastK always outputs a file with path name <source>.hist that contains a histogram of the k‑mer frequency distribution where the highest possible count is 2^15-1 = 32,767 -- FastK clips all higher values to this upper limit. Its exact format is described in the section on Data Encodings.
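The clipping rule can be sketched in a few lines of Python. This assumes the histogram maps each count to the number of distinct k‑mers having that count. The helper name and the in-memory dictionary are illustrative and say nothing about the .hist file format.

```python
from collections import Counter

MAX_COUNT = 2**15 - 1  # 32,767, the upper limit stated in the text

def clipped_histogram(kmer_counts):
    """Build a count -> number-of-k-mers histogram, clipping counts at MAX_COUNT."""
    hist = Counter()
    for c in kmer_counts.values():
        hist[min(c, MAX_COUNT)] += 1
    return hist
```

Every k‑mer occurring 32,767 or more times therefore lands in the same final bucket.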

One can optionally request, by specifying the ‑t option, that FastK produce a sorted table of all canonical k‑mers along with their counts. If an integer follows, then only those k‑mers that occur that many or more times are output; the default threshold is 1. In those applications where low count k‑mers are not needed this can save significant time and space as most such k‑mers are error‑mers. The output is placed in a single stub file with path name <source>.ktab and N roughly equal-sized hidden files with the path names <dir>/.<base>.ktab.# assuming <source> = <dir>/<base> and where # is a thread number between 1 and N where N is the number of threads used by FastK (4 by default). The exact format of the N‑part table is described in the section on Data Encodings.

One can also ask FastK to produce a k‑mer count profile of each sequence in the input data set by specifying the ‑p option. A single stub file with path name <source>.prof is output along with N roughly equal-sized pairs of hidden files (2N files in all) with path names <dir>/.<base>.pidx.# and <dir>/.<base>.prof.# in the order of the sequences in the input assuming <source> = <dir>/<base>. The profiles are individually compressed and the exact format of these files is described in the section on Data Encodings.
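To make the naming scheme for stub and hidden files concrete, here is a Python sketch that lists the file names one would expect for a given <source> and thread count. It follows only the patterns quoted above. The function expected_outputs is hypothetical and does not model the relative-profile form of ‑p.

```python
import posixpath

def expected_outputs(source, nthreads=4, table=False, profile=False):
    """List the output path names implied by the text for <source> = <dir>/<base>."""
    d, base = posixpath.split(source)
    hidden = lambda name: posixpath.join(d, "." + name)
    outs = [source + ".hist"]                      # the histogram is always produced
    if table:                                      # -t: one stub plus N hidden parts
        outs.append(source + ".ktab")
        outs += [hidden(f"{base}.ktab.{i}") for i in range(1, nthreads + 1)]
    if profile:                                    # -p: one stub plus N hidden pairs
        outs.append(source + ".prof")
        for i in range(1, nthreads + 1):
            outs.append(hidden(f"{base}.pidx.{i}"))
            outs.append(hidden(f"{base}.prof.{i}"))
    return outs
```

With the default of 4 threads and both options set, this lists 1 + 5 + 9 = 15 path names, which is why helper commands for managing the hidden files are convenient.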

If the data file contains sequences with letters other than upper or lower case a, c, g, or t, then all k-mers involving these letters are considered invalid and they are not counted. Specifically, they do not occur in the k-mer table and in profiles they are regions of k or more 0's. So for example, if one passes a fasta "assembly" file to FastK wherein gaps between contigs are indicated by runs of N's, then the profile of a scaffold "sequence" will contain a corresponding run of 0's where the contig gaps are.
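The effect of an invalid letter on a profile can be seen with a short Python sketch that marks which k‑mer windows of a sequence are valid. This is a simple model of the rule stated above, not FastK code, and the name valid_mask is hypothetical.

```python
VALID = set("acgtACGT")

def valid_mask(seq, k):
    """Return one boolean per k-mer window: True if every letter is a, c, g, or t."""
    return [all(ch in VALID for ch in seq[i:i + k])
            for i in range(len(seq) - k + 1)]
```

Under this model, a single N in the interior of a sequence invalidates each of the k windows that overlap it, and a run of N's between contigs widens the run of invalid (zero-count) positions accordingly.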

The -p option can contain an optional reference to a k‑mer table such as produced by the -t option. If so, then FastK produces profiles of every read where the k‑mer counts are those found in the referenced table, or zero if a k‑mer in a read is not in the table. Such relative profiles are often useful to see how the k‑mers from one source are reflected in another by tools such as merfin. They can also be used to distinguish haplotypes in a trio-based project, by producing relative profiles with respect to the k‑mers of the father and mother sequencing data sets. If this version of the -p option is specified then only profiles are produced -- the -t option is ignored and the default histogram is not produced.
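A relative profile can be sketched in Python as a lookup of each canonical k‑mer of a read in a table built from another data set. Here a plain dictionary stands in for a .ktab table and the function names are hypothetical. FastK itself computes these during its sort rather than by hashing.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")

def _canonical(kmer):
    """Smaller of a k-mer and its Watson-Crick complement (a < c < g < t)."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])

def relative_profile(read, k, table):
    """Counts from `table` for the n-(k-1) consecutive k-mers of `read`.

    A k-mer absent from the table, or containing a letter other than
    a, c, g, t, contributes 0.
    """
    s = read.lower()
    profile = []
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        if set(kmer) <= set("acgt"):
            profile.append(table.get(_canonical(kmer), 0))
        else:
            profile.append(0)
    return profile
```

For a trio, one would call this once with a table of the father's k‑mers and once with the mother's, then compare the two profiles of each read.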

The ‑c option asks FastK to first homopolymer compress the input sequences before analyzing the k‑mer content. In a homopolymer compressed sequence, every substring of 2 or more a's is replaced with a single a, and similarly for runs of c's, g's, and t's. This is particularly useful for PacBio data where homopolymer errors are five‑fold more frequent than other errors and thus the error rate of such "hoco" k‑mers is five‑fold lower.
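Homopolymer compression itself is easy to sketch in Python. The version below collapses runs of a, c, g, and t as described, and leaves runs of any other letter untouched. That last choice is an assumption made for the sketch, since the text does not say how such letters are treated.

```python
from itertools import groupby

def homopolymer_compress(seq):
    """Replace every run of the same base (a, c, g, or t) with a single copy."""
    out = []
    # Group consecutive letters case-insensitively.
    for base, run in groupby(seq, key=str.lower):
        run = list(run)
        if base in "acgt":
            out.append(run[0])   # keep one letter of the run
        else:
            out.extend(run)      # assumption: other letters are left as-is
    return "".join(out)
```

For example, "aaacggttta" compresses to "acgta".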

The ‑v option asks FastK to output information about its ongoing operation to standard error, including a time and resource summary at completion. The ‑bc option allows you to ignore a prefix of the indicated length of each read, e.g. when the reads have a bar code at the start. The ‑P option specifies where FastK should place the numerous temporary files it creates. By default this is the value of the environment variable $TMPDIR or, should this be undefined, /tmp. The ‑M option specifies the maximum amount of memory, in GB, FastK should use at any given moment. FastK by design uses a modest amount of memory, so the default of 12GB should generally be more than enough. Lastly, the ‑T option allows the user to specify the number of threads to use. Ideally, this is set to the number of physical cores in one's machine.

2a. Fastrm [-if] <source>[.hist|.ktab|.prof] ...
2b. Fastmv [-inf] <source>[.hist|.ktab|.prof] ( <target> | ... <directory> )
2c. Fastcp [-inf] <source>[.hist|.ktab|.prof] ( <target> | ... <directory> )

As described above FastK produces hidden files whose names begin with a . for the ‑t and ‑p options in order to avoid clutter when listing a directory's contents. An issue with this approach is that it is inconvenient for the user to remove, rename, or copy these files and often a user will forget the hidden files are there, potentially wasting disk space. We therefore provide Fastrm, Fastmv, and Fastcp that remove, rename, and copy F
