Metheor
:comet: Ultrafast DNA methylation heterogeneity calculation from bisulfite alignments (Lee et al., PLOS Computational Biology. 2023)
Install / Use
/learn @dohlee/MetheorREADME
Metheor ☄

Compute DNA methylation heterogeneity levels from Bismark-aligned bisulfite sequencing data.
Motivation
Per-CpG methylation level is not the only information we can extract from bisulfite sequencing data. We miss too much valuable clues about epigenetic stability and diversity by averaging out the signals! We here provide an ultrafast, Rust-based bioinformatics tool for the computation of various intra-/inter-cellular methylation heterogeneity measures.
Installation
Install with conda.
conda install -c dohlee metheor
Usage
Supported methylation heterogeneity measures
Metheor supports seven methylation heterogeneity measures in total.
Proportion of discordant reads (PDR)

PDR is defined as a fraction of reads carrying CpGs in discordant methylation states (i.e. containing both methylated and unmethylated CpGs in a single read) with respect to all reads mapped to a CpG.
metheor pdr --input <input.bam> --output <output.tsv>
--min-depth <min_depth> --min-cpgs <min_cpgs>
--min-qual <min_qual> --cpg-set <cpg_set.bed>
Options
-i, --input: Path to input BAM file.-o, --output: Path to output table file summarizing the result of PDR calculation.-d, --min-depth: Minimum depth of CpG stretches to consider. [default: 10]-p, --min-cpgs: Minimum number of consecutive CpGs in a CpG stretch to consider. [default: 10]-q, --min-qual: Minimum quality for a read to be considered [default: 10]-c, --cpg-set: (Optional) Specify a predefined set of CpGs (in BED file) to be analyzed.
Output
Produces a tab-separated table with the following six columns. Note that the coordinate system follows that of BED/BedGraph format (0-based, half-open), without header names.
chrom: Chromosome where the CpG exists.start: 0-based position of the cytosine (C) in CpGend: 0-based position + 1 of the guanine (G) in CpGpdr: Value of PDRn_c: Number of concordant reads (i.e., all the CpGs covered by the read have identical methylation states) supporting the CpGn_d: Number of discordant reads (i.e., CpGs covered by the read have more than one methylation states)
Local pairwise methylation disorder (LPMD)

We introduce LPMD as a new measure to quantify the local concordance of DNA methylation states. The conceptual basis of LPMD is similar to PDR as they both are aware of local homogeneity of DNA methylation states, but they are different in that LPMD explicitly takes the distance between CpGs into consideration. In detail, LPMD is defined as a fraction of CpG pairs within a given range of genomic distance (i.e., CpG pairs more distant than m basepairs and closer than M basepairs) and therefore, LPMD is defined for a pair of CpGs, but not for a single CpG. Importantly, we note that while there is an increased tendency of observing discordant reads solely by change, according to the definition in PDR, as the sequencing read gets longer, LPMD is not dependent on the length of sequencing reads.
metheor lpmd --input <INPUT> --output <OUTPUT>
--pairs --min-distance <min_distance> --max-distance <max_distance>
--min-qual <min_qual> --cpg-set <cpg_set.bed>
Options
-i, --input: Path to input BAM file.-o, --output: Path to output table file summarizing the result of LPMD calculation.-p, --pairs: (Optional) Concordance information for all CpG pairs.-m, --min-distance: Minimum distance between CpG pairs to consider. [default: 2]-M, --max-distance: Maximum distance between CpG pairs to consider. [default: 16]-q, --min-qual: Minimum quality for a read to be considered. [default: 10]-c, --cpg-set: (Optional) Specify a predefined set of CpGs (in BED file) to be analyzed.
Output
By default, it produces a simple tab-delimited table containing genomewide average lpmd values. It has two columns:
name: Filenamelpmd: LPMD values averaged for all CpG pairs participating in the analysis
Optionally, when --pairs flag is turned on, you can have a more detailed table with CpG pair-wise statistics for LPMD computation. Note that the size of the file can be large. There are six columns:
chrom: Chromosome where the CpG pair existscpg1: 0-based position of the cytosine (C) of the first CpGcpg2: 0-based position of the cytosine (C) of the second CpGlpmd: LPMD value. Computed asn_discordant/ (n_concordant+n_discordant)n_concordant: Number of sequencing reads supporting CpG pairs having the same methylation staten_discordant: Number of sequencing reads supporting CpG pairs having different methylation states
Methylation haplotype load (MHL)

The concept of MHL is also based on the local homogeneity of DNA methylation states, or co-methylation, due to the processivity of enzymes responsible for (de-)methylation of cytosines (Guo et al., 2017). While PDR and LPMD focus on how much the tendency of co-methylation is perturbed in the given population of cells, MHL focuses on how well the methylation haplotypes (i.e., stretch of consecutive methylated CpGs) are conserved throughout the cell population for a given genomic region. MHL is first devised to systematically identify the genomic blocks harboring CpGs with tightly coupled methylation states. In detail, MHL is computed as a fraction of observed fully methylated stretches out of the all stretches of every possible lengths.
metheor mhl --input <input.bam> --output <output.tsv>
--min-depth <min-depth> --min-cpgs <min-cpgs> --min-qual <min-qual>
--cpg-set <cpg-set.bed>
Options
-i, --input: Path to input BAM file.-o, --output: Path to output table file summarizing the result of epipolymorphism calculation.-d, --min-depth: Minimum depth of reads covering epialleles to consider. [default: 10]-p, --min-cpgs: Minimum number of consecutive CpGs in a CpG stretch to consider.-q, --min-qual: Minimum quality for a read to be considered. [default: 10]-c, --cpg-set: (Optional) Specify a predefined set of CpGs (in BED file) to be analyzed.
Output
Produces a (BedGraph-compatible) tab-separated table with the following four columns.
chrom: Chromosome where the CpG exists.start: 0-based position of the cytosine (C) in CpGend: 0-based position + 1 of the guanine (G) in CpGmhl: Value of MHL
Epipolymorphism (PM)

Landan et al., proposed another measure named epipolymorphism (PM), which also captures the amount of heterogeneity of DNA methylation for a given genomic region. The relationship between ME and PM is analogous to that between entropy and Gini index used for decision trees, and they can be considered as similar measures of DNA methylation heterogeneity in general. Using metheor, PM can be calculated with the command below:
metheor pm --input <input.bam> --output <output.tsv>
--min-depth <min-depth> --min-qual <min-qual> --cpg-set <cpg-set.bed>
Options
-i, --input: Path to input BAM file.-o, --output: Path to output table file summarizing the result of epipolymorphism calculation.-d, --min-depth: Minimum depth of reads covering epialleles to consider. [default: 10]-q, --min-qual: Minimum quality for a read to be considered. [default: 10]-c, --cpg-set: (Optional) Specify a predefined set of CpGs (in BED file) to be analyzed.
Output
Produces a tab-separated table with the following five columns, where each row corresponds to a CpG quartet of interest (four consecutive CpGs):
chrom: Chromosome where the CpG quartet exists.cpg1: 0-based position of the cytosine (C) in the first CpGcpg2: 0-based position of the cytosine (C) in the second CpGcpg3: 0-based position of the cytosine (C) in the third CpGcpg4: 0-based position of the cytosine (C) in the fourth CpGpm: Value of PM
NOTE: The order of CpG quartets in the output file is not sorted.
Methylation entropy (ME)

Xie et al. proposed an information theoretic measure called methylation entropy (ME), which is calculated as the entropy of epialleles originating from a single genomic locus.
metheor me --input <INPUT> --output <OUTPUT>
--min-depth <min-depth> --min-qual <min-qual> --cpg-set <cpg-set.bed>
Options
-i, --input: Path to input BAM file.-o, --output: Path to output table file summarizing the result of ME calculation.-d, --min-depth: Minimum depth of reads covering epialleles to consider. [default: 10]-q, --min-qual: Minimum quality for a read to be considered. [default: 10]-c, --cpg-set: (Optional) Specify a predefined set of CpGs (in BED file) to be analyzed.
Output
Produces a tab-separated table with the following five columns, where each row corresponds to a CpG quartet of interest (four consecutive CpGs):
chrom: Chromosome where the CpG quartet exists.cpg1: 0-based position of the cytosine (C) in the first CpGcpg2: 0-based position of the cytosine (C) in the second CpGcpg3: 0-based position of the cytosine (C) in the third CpGcpg4: 0-based position of the cytosine (C) in the fourth CpGme: Value of ME
NOTE: The order of CpG quartets in the output file is not sorted.
Fraction of discordant read pairs (FDRP)
FDRP and qFDRP are measures of epigenetic diversity within a cell population that were first proposed in Scherer et al. (2020). While PM and ME quantify the epiallelic diversity in terms of CpG quartets, FDRP and qFDRP
