DMRfinder
Identifying differentially methylated regions from MethylC-seq (bisulfite-sequencing) data
Install / Use
/learn @jsh58/DMRfinderREADME
DMRfinder: efficiently identifying differentially methylated regions from MethylC-seq data
Gaspar JM, Hart RP. BMC Bioinformatics. 2017 Nov 29;18(1):528. [PMC] [BMC] [PDF]
Table of Contents
- Introduction
- Alignment
- Extracting methylation counts
- Clustering CpG sites into regions
- Testing regions for differential methylation
- Miscellaneous
- Appendix A. Illustrative examples with
combine_CpG_sites.py - Appendix B. Illustrative examples with
findDMRs.r<br><br>
Introduction <a name="intro"></a>
DMRfinder efficiently identifies genomic regions with differentially methylated CpG sites from high-throughput MethylC-seq datasets (sequencing preceded by bisulfite treatment). Following alignment with Bismark [1], DMRfinder extracts methylation counts and clusters the CpG sites into genomic regions. It then performs pairwise comparisons to identify differentially methylated regions (DMRs) using the statistical framework of the Bioconductor package DSS [2].
Quick start <a name="quick"></a>
Given:
/path/to/genome: a file path to a reference genomeC1.fastq,C2.fastq: single-end sequence files for two samples in the groupControlE1.fastq,E2.fastq: single-end sequence files for two samples in the groupExptl<br><br>
Looking for:
- Genomic regions that are differentially methylated between the groups
ControlandExptl<br><br>
Step 1: Alignment
$ bismark_genome_preparation /path/to/genome
$ bismark /path/to/genome C1.fastq
$ bismark /path/to/genome C2.fastq
$ bismark /path/to/genome E1.fastq
$ bismark /path/to/genome E2.fastq
<br>
Step 2: Extract methylation counts from alignment files
$ samtools view -h C1_bismark_bt2.bam | python extract_CpG_data.py -i - -o C1.cov -v
$ samtools view -h C2_bismark_bt2.bam | python extract_CpG_data.py -i - -o C2.cov -v
$ samtools view -h E1_bismark_bt2.bam | python extract_CpG_data.py -i - -o E1.cov -v
$ samtools view -h E2_bismark_bt2.bam | python extract_CpG_data.py -i - -o E2.cov -v
<br>
Step 3: Cluster CpG sites into regions
$ python combine_CpG_sites.py -v -o combined.csv C1.cov C2.cov E1.cov E2.cov
<br>
Step 4: Test regions for differential methylation
$ Rscript findDMRs.r -i combined.csv -o results.csv -v -n Control,Exptl C1,C2 E1,E2
<br>
Software prerequisites <a name="prereqs"></a>
Required:
- DMRfinder
- Bismark
- Bowtie2
- DSS. This package requires R and Bioconductor. DMRfinder scripts have been tested with Rscript version 3.3.1 and DSS version 2.14.0.
- Python. DMRfinder scripts have been tested with Python versions 2.7.12 and 3.5.2.
Optional:
- Samtools. If this is installed, the output from Bismark will be converted to BAM; otherwise, the output will be gzip-compressed. Either output can be analyzed by DMRfinder, as described below. <br><br>
Alignment <a name="align"></a>
To identify the methylation status of genomic cytosines, the high-throughput sequence reads from a MethylC-seq run first must be aligned to a reference genome. To accomplish this, Bismark [1] performs an in silico bisulfite conversion of both the reads and the reference genome, and then uses Bowtie2 [3] to align the reads to the reference. Reads with a single optimal alignment are further analyzed for methylation status; reference cytosines that were sequenced as thymines are labeled unmethylated, and those that remained as cytosines are labeled methylated. All methylation calls are made with respect to the forward strand, so for reads that align to the reverse strand of the genome, methylation calls appear on the guanine bases. Bismark classifies each methylation call based on the sequence context (CpG/CHG/CHH).
There are many command-line options in Bismark, which are described in the User Guide. Of note, because Bismark relies on the alignments produced by Bowtie2, most of the Bowtie2 command-line options are available with Bismark as well (a full description of the Bowtie2 options can be found in the manual).
Following alignment, some researchers choose to remove reads that may be PCR duplicates. We do not recommend using Bismark's deduplication script for this purpose; it simply keeps the first read at a given position in the alignment file and eliminates the rest, regardless of the reads' sequences or methylation information.
A bug related to ambiguous read alignments crept into version 0.16.0 of Bismark. Until this is fixed, we recommend version 0.15.0. <br><br>
Extracting methylation counts <a name="extract"></a>
The DMRfinder script extract_CpG_data.py converts the output from Bismark's aligner into a table of methylated/unmethylated counts at each CpG site. The table is sorted by chromosome (based on the order of reference sequence names in the header of the SAM) and position, which is the 1-based coordinate of the cytosine of the CpG. The script also merges methylation counts, in that it sums counts for both the cytosine and the guanine of each CpG site (this occurs when reads align to different strands of the reference; see above).
Usage: python extract_CpG_data.py [options] -i <input> -o <output>
-i <input> SAM alignment file produced by Bismark (must have
a header, 'XM' methylation strings, and 'XG'
genome designations; can use '-' for stdin)
-o <output> Output file listing counts of methylated and
unmethylated CpGs, merged and sorted
Options:
-m <int> Minimum coverage (methylation counts) to report a
CpG site (def. 1)
-s Report strand in third column of output
-n <file> BED file listing regions for which to collect
linked methylation data
-b Memory-saving option (with coordinate-sorted SAM)
-e <file> Output file listing ordered chromosomes
-i <input> SAM alignment file produced by Bismark (must have
a header, 'XM' methylation strings, and 'XG'
genome designations; can use '-' for stdin)
A SAM alignment file produced by Bismark, whether gzip-compressed or not, can be used directly as the <input> file. However, a BAM file must be converted to SAM, or it can be piped in via Samtools, e.g.:
$ samtools view -h <BAM> | python extract_CpG_data.py -i - -o <output> -v
<br>
-o <output> Output file listing counts of methylated and
unmethylated CpGs, merged and sorted
Each line of the output lists the following six values for a single CpG, tab-delimited:
<table> <tr> <td align="center">chrom</td> <td>chromosome name</td> </tr> <tr> <td align="center">chromStart</td> <td>1-based position of the cytosine in the CpG</td> </tr> <tr> <td align="center">chromEnd</td> <td>chromStart + 1</td> </tr> <tr> <td align="center">percent</td> <td>percent methylation at this site</td> </tr> <tr> <td align="center">methylated</td> <td>count of methylated cytosines</td> </tr> <tr> <td align="center">unmethylated</td> <td>count of unmethylated cytosines</td> </tr> </table> <br> -m <int> Minimum coverage (methylation counts) to report a
CpG site (def. 1)
Any sites that do not have at least the specified number of methylation counts will be excluded from the output. By default, all sites with at least one count are reported. <br><br>
-s Report strand in third column of output
With this option, the 3rd column of the output, instead of chromEnd, will contain the strand. Because this is a merged output, the strand will invariably be +.
<br><br>
-n <file> BED file listing regions for which to collect
linked methylation data
On separate lines, the input BED file should list, for each genomic region of interest, chromosome name, 1-based start and end coordinates, and unique region name, tab-delimited. For example:
chrZ 100 200 regionA
The output file, <file>_linked.txt, will list, for each BED region, the methylation data for reads that cover at least one CpG in the region. The methylation data is summarized using 0 for an unmethylated CpG, 1 for methylated, and - for no data. For example:
Region: regionA, chrZ:100-200
Sites: 102, 109, 123, 140, 147, 168
00100- read1
001100 read2
--0000 read3
This shows that, in regionA, there were six CpG sites, corresponding to the six columns of data for the three reads. The first read (read1) had unmethylated CpGs at positions 102, 109, 140, and 147, while the CpG at position 123 was methylated. The last CpG, at position 168, was not covered by this read. The methylation data for the six CpG site
Related Skills
node-connect
342.0kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
84.7kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
342.0kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
84.7kCommit, push, and open a PR
