Discount

Discount is a Spark-based tool for k-mer (genomic sequences of length k) counting and analysis. It can analyse large metagenomic-scale datasets with a small memory footprint, and can be used as a standalone command line tool or as a general Spark library, including in interactive notebooks.

Discount is highly scalable. It has been tested on the Serratus dataset for a total of 5.59 trillion k-mers (5.59 x 10^12) with 1.57 trillion distinct k-mers.

This software includes Fastdoop by U.F. Petrillo et al [1]. We have also included compact universal hitting sets generated by PASHA [2].

For a detailed background and description, please see our paper on evenly distributed k-mer binning.

Contents

  1. Basics
  2. Advanced topics
  3. References

Installation

Discount is available on BioConda, which is the easiest installation method for most users. Please see the BioConda installation instructions.

Advanced users who want to run Discount with their own Spark distribution may instead download a pre-built release from the Releases page.

Running

Discount can run locally on your laptop, on a cluster, or on cloud platforms that support Spark (tested on AWS EMR and Google Cloud Dataproc).

If you installed from BioConda, you can simply run discount.sh.

For a manual installation, download a Spark distribution (3.1.0 or later) from http://spark.apache.org. Scripts to run Discount are provided for macOS and Linux. To run locally, edit the file discount.sh and set the path to your unpacked Spark distribution; this is the script used to run Discount. Other important settings can also be changed in this file. It is very helpful to point LOCAL_DIR to a fast drive, such as an SSD.
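The relevant settings sit near the top of the script. The variable names below are illustrative (check the comments in your copy of discount.sh for the exact names used by your version):

```shell
# Illustrative edits near the top of discount.sh; names and defaults
# may differ in your version of the script.
SPARK=/opt/spark-3.4.1-bin-hadoop3   # path to the unpacked Spark distribution
LOCAL_DIR=/mnt/ssd/spark-tmp         # fast local scratch space, ideally an SSD
```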

To run on AWS EMR (tested on v6.8.0), please use discount-aws.sh. In that case, change the example commands below to use that script instead, and insert your EMR cluster name as an additional first parameter when invoking. To run on Google Cloud Dataproc (tested on v2.1), please use discount-gcloud.sh instead.

K-mer counting

The following command produces a statistical summary of a dataset.

./discount.sh -k 55 /path/to/data.fastq stats

All example commands shown here accept multiple input files. The FASTQ and FASTA formats are supported; input files must be uncompressed.
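For readers new to the term, counting k-mers simply means counting every length-k substring of each input sequence. A minimal sketch of the idea (conceptual only; Discount does this in a distributed fashion over Spark):

```python
from collections import Counter

def count_kmers(seq: str, k: int) -> Counter:
    """Count every substring of length k in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# "ACGT" occurs twice in this sequence (positions 0 and 4)
counts = count_kmers("ACGTACGT", 4)
```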

To submit an equivalent job to AWS EMR, after creating a cluster with id j-ABCDEF1234 and uploading the necessary files (the GCloud script discount-gcloud.sh works in the same way):

./discount-aws.sh j-ABCDEF1234 -k 55 s3://my-data/path/to/data.fastq stats

As of version 2.3, minimizer sets for k >= 19, m = 10 or 11 are bundled with Discount and do not need to be specified explicitly. Advanced users may wish to override this (see the section on minimizers).
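Discount's actual orderings are more sophisticated than plain lexicographic order (it uses universal hitting sets and sampled frequency orderings; see the paper), but the basic notion of a minimizer, the smallest m-mer inside a k-mer under some ordering, can be sketched as:

```python
def minimizer(kmer: str, m: int) -> str:
    """Return the smallest m-mer of a k-mer.

    Uses plain lexicographic order purely for illustration; k-mers that
    share a minimizer can be binned and counted together.
    """
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))
```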

To generate a full counts table with k-mer sequences (in many cases larger than the input data), the count command may be used:

./discount.sh -k 55 /path/to/data.fastq count -o /path/to/output/dir --sequence

A new directory called /path/to/output/dir_counts (based on the location specified with -o) will be created for the output.

Upper and lower bound filtering, histogram generation, normalization of k-mer orientation, and other functions are described in the online help:

./discount.sh --help
./discount.sh count --help
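One of these functions, normalization of k-mer orientation, conventionally means treating a k-mer and its reverse complement as the same k-mer by keeping only the lexicographically smaller ("canonical") form. A minimal sketch of the convention (not Discount's internal code):

```python
# Map each base to its complement, then reverse to get the reverse complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)
```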

Chromosomes and very long sequences

If the input data contains sequences longer than 1,000,000 bp, you must use the --maxlen flag to specify the longest expected single sequence length. If the sequences in a FASTA file are very long (for example, full chromosomes), it is instead essential to generate a FASTA index (.fai). Various tools can be used to do this, for example SeqKit:

seqkit faidx myChromosomes.fasta

Discount will detect the presence of the myChromosomes.fasta.fai file and read the data efficiently. In this case, the parameter --maxlen is not necessary.

Repetitive or very large datasets

As of version 2.3, Discount provides two counting methods: the "simple" method, which was the only method prior to this version, and the "pregrouped" method, which is essential for data containing highly repetitive k-mers. The pregrouped method counts each distinct super-mer separately prior to k-mer counting. Discount tries to pick the best method automatically, but users are advised to run their own experiments. The pregrouped method may also help if Spark crashes with an exception about buffers being too large. It can be forced with a command such as:

./discount.sh --method pregrouped -k 55 /path/to/data.fastq stats

Or, to force the simple method to be used:

./discount.sh --method simple -k 55 /path/to/data.fastq stats

While highly scalable, the pregrouped method may sometimes cause a slowdown overall (by requiring one additional shuffle), so it should not be used for datasets that do not need it. See the section on performance tuning.
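The intuition behind pregrouping can be sketched in a few lines: when the same super-mer occurs many times, it is cheaper to count distinct super-mers first and then expand each one into k-mers just once, weighted by its multiplicity. This is a conceptual model only, not Discount's distributed implementation:

```python
from collections import Counter

def kmers(seq, k):
    """All k-mers of a sequence, in order."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def count_simple(supermers, k):
    """'Simple' style: expand every super-mer occurrence into k-mers, then count."""
    out = Counter()
    for s in supermers:
        out.update(kmers(s, k))
    return out

def count_pregrouped(supermers, k):
    """'Pregrouped' style: count distinct super-mers first, then expand each
    distinct super-mer only once, weighting its k-mers by the multiplicity."""
    out = Counter()
    for s, mult in Counter(supermers).items():
        for km in kmers(s, k):
            out[km] += mult
    return out
```

Both methods produce identical counts; pregrouping only changes how much work repeated super-mers cost.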

Additional examples may be found in the wiki.

K-mer indexes

Discount can store a multiset of counted k-mers as an index (k-mer database). Indexes can be combined by various operations, inspired by the design of kmc_tools in KMC3. They are stored in the Apache Parquet format, allowing for a high degree of compression and efficiency in the cloud.

To create a new index, the store command may be used:

discount.sh -k 35 input.fasta store -o index_path

The directory index_path will be created and index files will be written to it (overwriting any that already exist). Alongside it, the files index_path_minimizers.txt and index_path.properties will record the minimizer ordering and some other parameters of the index. These files should not be manually edited or moved.

By using the -i parameter, an index can be used instead of sequence files as a source of input data. For example, k-mers with minimum count 2 can be obtained from an index and written to a set of fasta files by using the count command from above in the following way:

discount.sh -i index_path --min 2 count -o index_min2

Summary statistics for an index can be obtained with this command:

discount.sh -i index_path stats

Only one index can be used as input at a time, except with the union, subtract, and intersect operations. When a new index is created (like index_min2 above), it should always be written to a new location; the same location cannot simultaneously be both an input and an output.

Indexes may be combined using binary operations such as intersect, union, and subtract. For example, to create the intersection of two indexes using the minimum count from either index:

discount.sh -i index1_path intersect -i index2_path -r min -o i1i2_min_path

Multiple indexes may be combined at once with the same rule. For example, to union three indexes at once with the maximum rule:

discount.sh -i index1_path union -r max -i index2_path index3_path -o union3_path

The various rules have the same meaning as in KMC3:

Intersection

  • max: the larger of the k-mer's counts in index 1 and index 2
  • min: the smaller of the two counts
  • left: the count from index 1
  • right: the count from index 2
  • sum: the sum of the two counts

Union

Union supports the same rules as intersection, except that a k-mer present in only one index is still kept, with the count from that index.

Subtract

  • kmers_subtract: k-mers that were present in index 1, but not in index 2, are kept. Counts remain as they were in index 1.
  • counters_subtract: the count in index 2 is subtracted from the count in index 1. Only k-mers with positive values after subtraction are kept.
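The rules above can be sketched as operations on per-k-mer counts. This is a conceptual model only; Discount applies these rules over distributed Parquet indexes, not in-memory dictionaries:

```python
def combine(c1, c2, rule):
    """Combine two counts of the same k-mer according to a KMC3-style rule."""
    return {"max": max(c1, c2), "min": min(c1, c2),
            "left": c1, "right": c2, "sum": c1 + c2}[rule]

def intersect(idx1, idx2, rule):
    """Keep only k-mers present in both indexes."""
    return {k: combine(idx1[k], idx2[k], rule) for k in idx1.keys() & idx2.keys()}

def union(idx1, idx2, rule):
    """Keep k-mers present in either index; combine counts where both have one."""
    out = {}
    for k in idx1.keys() | idx2.keys():
        if k in idx1 and k in idx2:
            out[k] = combine(idx1[k], idx2[k], rule)
        else:
            out[k] = idx1.get(k, idx2.get(k))
    return out

def kmers_subtract(idx1, idx2):
    """Keep k-mers of index 1 absent from index 2; counts stay as in index 1."""
    return {k: v for k, v in idx1.items() if k not in idx2}

def counters_subtract(idx1, idx2):
    """Subtract index 2's counts from index 1's; keep only positive results."""
    return {k: v - idx2.get(k, 0) for k, v in idx1.items() if v - idx2.get(k, 0) > 0}
```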

For additional guidance, consult the command line help for each command, e.g.:

discount.sh intersect --help

More examples can be found in the wiki.

Partitions

For each index, a number of Parquet files will be created in the corresponding directory. The number of files corresponds to the number of shuffle partitions that Spark uses. To set the number of partitions, the -p argument may be used.
