MethylDackel

A (mostly) universal methylation extractor for BS-seq experiments.

Generate Convert Improve

Install / Use

/learn @dpryan79/MethylDackel

About this skill

Quality Score

0/100

README

<img align="left" src="MethylDackelLogo.png" /> MethylDackel (formerly named PileOMeth, which was a temporary name derived due to it using a PILEup to extract METHylation metrics) will process a coordinate-sorted and indexed BAM or CRAM file containing some form of BS-seq alignments and extract per-base methylation metrics from them. MethylDackel requires an indexed fasta file containing the reference genome as well.

Prerequisites

A C compiler, such as gcc, htslib (at least versions 1.11, earlier versions are not compatible) and libBigWig are required. For libBigWig, the static library is used.

Compilation

Compilation and installation can be performed via:

git clone https://github.com/dpryan79/MethylDackel.git
cd MethylDackel
make LIBBIGWIG="/some/path/to/libBigWig.a"
make install prefix=/some/installation/path

If the linker has issues finding the htslib headers and library, then specify them using CFLAGS and LIBS:

make install CFLAGS="-O3 -Wall -I/some/path/include " LIBS="-L/some/path/lib" prefix=/some/installation/path LIBBIGWIG="/some/path/to/libBigWig.a"

License

This software is licensed under the terms of the MIT license (see the LICENSE file). Please feel free to redistribute this according to those terms.

Methylation Context

MethylDackel groups all Cytosines into one of three sequence contexts: CpG, CHG, and CHH. Here, H is the IUPAC ambiguity code for any nucleotide other than G. If an N is encountered in the reference sequence, then the context will be assigned to CHG or CHH, as appropriate (e.g., CNG would be categorized as in a CHG context and CNC as in a CHH context). If a Cytosine is close enough to the end of a chromosome/contig such that its context can't be inferred, then it is categorized as CHH (e.g., a Cytosine as the last base of a chromosome is considered as being in a CHH context).

Usage

The most basic usage of MethylDackel is as follows:

MethylDackel extract reference_genome.fa alignments.bam

This will calculate per-base CpG metrics and output them to alignments_CpG.bedGraph, which is a standard bedGraph file with column 4 being the number of reads/read pairs with evidence for a methylated C at a given position and column 5 the equivalent for an unmethylated C. An alternate output filename prefix can be specified with the -o some_new_prefix option.

By default, MethylDackel will only calculate metrics for Cytosines in a CpG context, but metrics for those in CHG and CHH contexts are supported as well (see the --CHH and --CHG options). If you would like to ignore Cytosines in CpGs, simply specify --noCpG. Each type of metric is output to a different file. For per-CpG and/or per-CHG rather than per-Cytosine metrics, see "Per-CpG/CHG metrics", below.

MethylDackel can filter reads and bases according to MAPQ and Phred score, respectively. The default minimums are MAPQ >= 10 and Phred >= 5, though these can be changed with the -q and -p options. MethylDackel can also account for methylation bias (described below) with the --OT, --OB, --CTOT, and --CTOB options.

Given a bigWig of Bismap mappability data (using the -M option), MethylDackel can also filter out reads with a mappability score which is too low to support methylation calling. With default settings, a read would be rejected if less than 15 bases have mappability ≥ 0.01, but this can be configured with the -t and -b options.

This mappability data can also be read in from a BBM file. A BBM file (short for "Binary BisMap") is a custom compressed version of the data from the bigWig which is used by MethylDackel. It can be read in much faster than a bigWig and takes up less disk space. These files can be created by specifying the -O or -N options when running MethylDackel extract, and can be read in using the -B option (instead of -M).

A note on overlapping reads

A commonly asked and important question is how paired-end reads that overlap are treated. If two mates from a paired-end experiment happen to both overlap the same CpG then one could get skewed results if the methylation status is taken from both reads, rather than only one of them (i.e., double counting paired-end reads is a bad idea). MethylDackel was written with this in mind, so by default paired-end reads that overlap will never be double counted.

It's useful to note how this is implemented internally, since it relates to the -p option. For the sake of an example, suppose you have two mates that overlap a given position. In this example, we'll specify that mate #1 has a base call with a phred score of 20 at that position and that mate #2 has a base all with a phred score of 15.

If both base calls agree, then the one with the lower score (here, mate #2 with a score of 15) is set to 0. Further, the other read's base score (here, mate #1 with a score of 20) is increased by 20%. Since the -p option defaults to 10, it will skip mate #2 and only count read #1.
If the base calls instead disagree, then the process is more complicated. There are a total of four possibilities for how this could work:
Mate #1 has a higher phred score and the base call is not an N. Then its phred score is decremented by that of the other mate (in the example above, 20-15=5, so 5 would be the new phred score). The phred score for the mate is then set to 0.
Mate #2 has a higher phred score and the base call is not an N. In that case the same procedure as that above is followed, with the mate #2 phred score decremented and the mate #1 phred score set to 0.
The mate with the higher phred score has a base call of N. In this case, the phred scores for both mates are set to 0.
The mates have identical phred scores. In this case, the phred scores are both set to 0.

It's important to think about the above possibilities in conjunction with the -p option. If two mates agree but individually have low quality base calls, they may combine to produce a base call of acceptable quality. Similarly, if the two mates have high quality base calls but these disagree, the result could be a combined base call of low quality (due to the disagreement). This is actually the same prodcedure that samtools mpileup uses internally, so if you're familiar with variant calling with samtools then the methodology and reasons are identical.

Single Cytosine methylation metrics extraction

MethylDackel extract produces a variant of bedGraph that's similar to the "coverage" file produced by Bismark and Bison. In short, each line consists of 6 tab separated columns:

The chromosome/contig/scaffold name
The start coordinate
The end coordinate
The methylation percentage rounded to an integer
The number of alignments/pairs reporting methylated bases
The number of alignments/pairs reporting unmethylated bases

All coordinates are 0-based half open, which conforms to the bedGraph definition. When paired-end reads are aligned, it can often occur that their alignments overlap. In such cases, MethylDackel will not count both reads of the pair in its output, as doing so would lead to incorrect downstream statistical results.

An example of the output is below:

track type="bedGraph" description="SRR1182519.sorted CpG methylation levels"
1	25115	25116	100	3	0
1	29336	29337	50	1	1

Note the header line, which starts with "track". The "description" field is used as a label in programs such as IGV. Each of the subsequent lines describe single Cytosines, the 25116th and 29337th base on chromosome 1, respectively. The first position has 3 alignments (or pairs of alignments) indicating methylation and 0 indicating unmethylation (100% methylation) and the second position has 1 alignment each supporting methylation and unmethylation (50% methylation).

Per-CpG/CHG metrics

In many circumstances, it's desireable for metrics from individual Cytosines in a CpG to be merged, producing per-CpG metrics rather than per-Cytosine metrics. This can be accomplished with the --mergeContext option in MethylDackel extract. If this is used, then this output:

track type="bedGraph" description="SRR1182519.sorted CpG methylation levels"
1	25114	25115	66	2	1
1	25115	25116	100	3	0

is changed to this:

track type="bedGraph" description="SRR1182519.sorted merged CpG methylation levels"
1	25114	25116	83	5	1

This also works for CHG-level metrics. If bedGraph files containing per-Cytosine metrics already exist, they can be converted to instead contain per-CpG/CHG metrics with MethylDackel mergeContext.

Excluding low-coverage regions

If your downstream analysis requires an absolute minimum coverage (here, defined as the number of methylation calls kept after filtering for MAPQ, phred score, etc.), you can use the --minDepth option to achieve this. By default, MethylDackel extract will output all methylation metrics as long as the coverage is at least 1. If you use --minDepth 10, then only sites covered at least 10x will be output. This works in conjunction with the --mergeContext option, above. So if you request per-CpG context output (i.e., with --mergeContext) and `-

Related Skills

node-connect

341.8k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

84.6k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

341.8k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

commit-push-pr

84.6k

Commit, push, and open a PR