Mosdepth

fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing

Generate Convert Improve

Install / Use

/learn @brentp/Mosdepth

About this skill

Quality Score

0/100

README

fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing.

logo

Note:
mosdepth must be built with the --mm:refc flag for optimal performance and correct operation.

mosdepth can output:

per-base depth about 2x as fast samtools depth--about 25 minutes of CPU time for a 30X genome.
mean per-window depth given a window size--as would be used for CNV calling.
the mean per-region given a BED file of regions.

the mean or median per-region cumulative coverage histogram given a window size

a distribution of proportion of bases covered at or above a given threshold for each chromosome and genome-wide.
quantized output that merges adjacent bases as long as they fall in the same coverage bins e.g. (10-20)
threshold output to indicate how many bases in each region are covered at the given thresholds.
A summary of mean depths per chromosome and within specified regions per chromosome.
a d4 file (better than bigwig).

when appropriate, the output files are bgzipped and indexed for ease of use.

usage


mosdepth 0.3.11

  Usage: mosdepth [options] <prefix> <BAM-or-CRAM>

Arguments:

  <prefix>       outputs: `{prefix}.mosdepth.global.dist.txt`
                          `{prefix}.mosdepth.summary.txt`
                          `{prefix}.per-base.bed.gz` (unless -n/--no-per-base is specified)
                          `{prefix}.regions.bed.gz` (if --by is specified)
                          `{prefix}.quantized.bed.gz` (if --quantize is specified)
                          `{prefix}.thresholds.bed.gz` (if --thresholds is specified)

  <BAM-or-CRAM>  the alignment file for which to calculate depth.

Common Options:

  -t --threads <threads>     number of BAM decompression threads [default: 0]
  -c --chrom <chrom>         chromosome to restrict depth calculation.
  -b --by <bed|window>       optional BED file or (integer) window-sizes.
  -n --no-per-base           dont output per-base depth. skipping this output will speed execution
                             substantially. prefer quantized or thresholded values if possible.
  -f --fasta <fasta>         fasta file for use with CRAM files [default: ].

Other options:

  -F --flag <FLAG>                  exclude reads with any of the bits in FLAG set [default: 1796]
  -i --include-flag <FLAG>          only include reads with any of the bits in FLAG set. default is unset. [default: 0]
  -x --fast-mode                    dont look at internal cigar operations or correct mate overlaps (recommended for most use-cases).
  -a --fragment-mode                count the coverage of the full fragment including the full insert (proper pairs only).
  -q --quantize <segments>          write quantized output see docs for description.
  -Q --mapq <mapq>                  mapping quality threshold. reads with a quality less than this value are ignored [default: 0]
  -l --min-frag-len <min-frag-len>  minimum insert size. reads with a smaller insert size than this are ignored [default: -1]
  -u --max-frag-len <max-frag-len>  maximum insert size. reads with a larger insert size than this are ignored. [default: -1]
  -T --thresholds <thresholds>      for each interval in --by, write number of bases covered by at
                                    least threshold bases. Specify multiple integer values separated
                                    by ','.
  -m --use-median                   output median of each region (in --by) instead of mean.
  -R --read-groups <string>         only calculate depth for these comma-separated read groups IDs.
  -h --help                         show help

If you use mosdepth please cite the publication in bioinformatics

See the section below for more info on distribution.

If --by is a BED file with 4 or more columns, it is assumed the the 4th column is the name. That name will be propagated to the mosdepth output in the 4th column with the depth in the 5th column. If you don't want this behavior, simply send a bed file with 3 columns.

exome example

To calculate the coverage in each exome capture region:


mosdepth --by capture.bed sample-output sample.exome.bam

For a 5.5GB exome BAM and all 1,195,764 ensembl exons as the regions, this completes in 1 minute 38 seconds with a single CPU.

Per-base output will go to sample-output.per-base.bed.gz, the mean for each region will go to sample-output.regions.bed.gz; each of those will be written along with a CSI index that can be used for tabix queries. The distribution of depths will go to sample-output.mosdepth.dist.txt

WGS example

For 500-base windows


mosdepth -n --fast-mode --by 500 sample.wgs $sample.wgs.cram

-n means don't output per-base data, this will make mosdepth a bit faster as there is some cost to outputting that much text.

--fast-mode avoids the extra calculations of mate pair overlap and cigar operations, and also allows htslib to extract less data from CRAM, providing a substantial speed improvement.

Callable regions example

To create a set of "callable" regions as in GATK's callable loci tool:


# by setting these ENV vars, we can control the output labels (4th column)

export MOSDEPTH_Q0=NO_COVERAGE   # 0 -- defined by the arguments to --quantize
export MOSDEPTH_Q1=LOW_COVERAGE  # 1..4
export MOSDEPTH_Q2=CALLABLE      # 5..149
export MOSDEPTH_Q3=HIGH_COVERAGE # 150 ...
mosdepth -n --quantize 0:1:5:150: $sample.quantized $sample.wgs.bam

For this case. A regions with depth of 0 are labelled as "NO_COVERAGE", those with coverage of 1,2,3,4 are labelled as "LOW_COVERAGE" and so on.

The result is a BED file where adjacent bases with depths that fall into the same bin are merged into a single region with the 4th column indicating the label.

Distribution only with modified precision

To get only the distribution value, without the depth file or the per-base and using 3 threads:


MOSDEPTH_PRECISION=5 mosdepth -n -t 3 $sample $bam

Output will go to $sample.mosdepth.dist.txt

This also forces the output to have 5 decimals of precision rather than the default of 2.

D4

D4 is a format created by Hao Hou in the Quinlan lab. It is incorporated into mosdepth as of version 0.3.0 for per-base output with the --d4 flag. It improves write speed dramatically; for one test-case it takes 24.8s to write a per-base.bed.gz with mosdepth compared to 7.7s to write a d4 file. For the same case, running mosdepth without writing per-base takes 5.9 seconds so D4 greatly mitigates the cost of outputing per-base depth and the output is more useful.

Installation

The simplest option is to download the binary from the releases.

Another quick way is to

It can also be installed with brew as brew install brewsci/bio/mosdepth or used via docker with quay:


docker pull quay.io/biocontainers/mosdepth:0.3.3--h37c5b7d_2
docker run -v /hostpath/:/opt/mount quay.io/biocontainers/mosdepth:0.2.4--he527e40_0 mosdepth -n --fast-mode -t 4 --by 1000 /opt/mount/sample /opt/mount/$bam

The binary from releases is static, with no dependencies. If you build it yourself, mosdepth requires htslib version 1.4 or later. If you get an error about "libhts.so not found", set LD_LIBRARY_PATH to the directory that contains libhts.so. e.g.

LD_LIBRARY_PATH=~/src/htslib/ mosdepth -h

If you get the error could not import: hts_check_EOF you may need to install a more recent version of htslib.

If you do want to install from source, see the install.sh.

If you use archlinux, you can install as a package

distribution output

This is useful for QC.

The $prefix.mosdepth.global.dist.txt file contains, a cumulative distribution indicating the proportion of total bases (or the proportion of the --by for $prefix.mosdepth.region.dist.txt) that were covered for at least a given coverage value. It does this for each chromosome, and for the whole genome.

Each row will indicate:

chromosome (or "total")
coverage level
proportion of bases covered at that level

The last value in each chromosome will be coverage level of 0 aligned with 1.0 bases covered at that level.

A python plotting script is provided in scripts/plot-dist.py that will make plots like below. Use is python scripts/plot-dist.py \*global.dist.txt and the output is dist.html with a plot for the full set along with one for each chromosome.

Using something like that, we can plot the distribution from the entire genome. Below we show this for samples with ~60X coverage:

![WGS Example](https://user-images.githubusercontent.com/1739/29646192-2a2a6126-883f-11e7-91ab-049295eb3531.p

Related Skills

node-connect

342.0k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

84.7k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

342.0k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

commit-push-pr

84.7k

Commit, push, and open a PR