ccsmeth

Detecting DNA methylation from PacBio CCS reads

Installation
Trained models
Demo data
Quick start
Usage
Acknowledgements
TODO

Installation

ccsmeth is built on Python3 and PyTorch.

Prerequisites:
Python3.* (version>=3.8)
pbccs (version>=6.3.0)
pbmm2 (version>=1.9.0) or minimap2 (version>=2.22-r1101)
samtools (version>=1.12)
CUDA Toolkit (version>=10.2, for GPU only)
Dependencies:
numpy
statsmodels
scikit-learn
PyTorch (version >=1.2.0, <=2.1.0)
tqdm
pysam
pybedtools
pytabix

System Requirements

ccsmeth requires only a standard computer with enough RAM to support the in-memory operations. Using GPU could acceralate the process of methylation calling.

Recommended Hardware: 128 GB RAM, 40 CPU processors, 4 TB disk storage, >=8 GB GPU

Recommended OS: Linux (Ubuntu 16.04, CentOS 7, etc.)

Option 1. One-step installation

Install ccsmeth, its dependencies, and other required packages in one step using conda and environment.yml:

# download deepsignal-plant
git clone https://github.com/PengNi/ccsmeth.git

# install tools in environment.yml
conda env create --name ccsmethenv -f /path/to/ccsmeth/environment.yml

# then the environment can be activated to use
conda activate ccsmethenv

Option 2. Step-by-step installation

(1) install ccsmeth

It is highly recommended installing ccsmeth in a virtual environment.

conda create -n ccsmethenv python=3.8
# activate
conda activate ccsmethenv
# deactivate this environment
conda deactivate

# install ccsmeth after activating ccsmethenv
# install ccsmeth from github (latest version)
git clone https://github.com/PengNi/ccsmeth.git
cd ccsmeth
python setup.py install
# OR, install ccsmeth using pip
pip install ccsmeth
# OR, install ccsmeth using conda
conda install ccsmeth -c bioconda

(2) install necessary packages

Install necessary packages (bedtools, and pbccs, pbmm2 or minimap2, samtools) in the same environment. Installing of those packages using Bioconda is recommended:

conda install bedtools -c bioconda  # required by pybedtools->ccsmeth:call_mods
conda install pbccs pbmm2 samtools -c bioconda

Also install the cuda version of pytoch and cudatoolkit (>=10.2) if you want use GPU to run ccsmeth in your GPU machine. Uninstall the wrong pytorch first if you have installed it before.

conda install pytorch==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia

Trained models

See models:

For the ccsmeth call_mods module:

model_ccsmeth_5mCpG_call_mods_attbigru2s_b21.v3.ckpt: model of ccsmeth call_mods module for 5mCpG detection, trained using NA12898 pcr/MSssI and HG002 native (BS-seq as standard) PacBio Sequel II (kit 2.0) CCS reads. (for version >=0.5.0)
model_ccsmeth_5mCpG_call_mods_attbigru2s_b21.v2.ckpt: model of ccsmeth call_mods module for 5mCpG detection, trained using NA12898 pcr/MSssI and HG002 native (BS-seq as standard) PacBio Sequel II (kit 2.0) CCS reads. (for version <=0.4.1)

For the aggregate mode of ccsmeth call_freqb module:

model_ccsmeth_5mCpG_aggregate_attbigru_b11.v2p.ckpt: model of aggregate mode of ccsmeth call_freqb module for 5mCpG detection, trained using HG002 native (BS-seq as standard) PacBio Sequel II (kit 2.0) CCS reads.

Demo data

Check demo for some demo data to play with:

hg002.chr20_demo.hifi.bam: HG002 demo hifi reads which are aligned to human genome chr20:10000000-10100000.
chr20_demo.fa: reference sequence of human chr20:10000000-10100000.
hg002_bsseq_chr20_demo.bed: HG002 BS-seq results of region chr20:10000000-10100000.

Quick start

Use denovo mode (first call_mods, then align_hifi):

# 1. call hifi reads with kinetics if needed
# should have added pbccs to $PATH or the used environment
ccsmeth call_hifi --subreads /path/to/subreads.bam \
  --threads 10 \
  --output /path/to/output.hifi.bam


# 2. call modifications
# output: [--output].modbam.bam
CUDA_VISIBLE_DEVICES=0 ccsmeth call_mods \
  --input /path/to/output.hifi.bam \
  --model_file /path/to/ccsmeth/models/model_call_mods.ckpt \
  --output /path/to/output.hifi.call_mods \
  --threads 10 --threads_call 2 --model_type attbigru2s \
  --mode denovo


# 3. align hifi reads
# should have added pbmm2 to $PATH or the used environment
ccsmeth align_hifi \
  --hifireads /path/to/output.hifi.call_mods.modbam.bam \
  --ref /path/to/genome.fa \
  --output /path/to/output.hifi.call_mods.modbam.pbmm2.bam \
  --threads 10


# 4. call modification frequency
# outputs: [--output].[--call_mode].all.bed
# if the input bam file contains haplotags, 
# there will be [--output].[--call_mode].[hp1/hp2].bed in outputs.
# use '--call_mode count' (default):
ccsmeth call_freqb \
  --input_bam /path/to/output.hifi.call_mods.modbam.pbmm2.bam \
  --ref /path/to/genome.fa \
  --output /path/to/output.hifi.call_mods.modbam.pbmm2.freq \
  --threads 10 --sort --bed

# OR, use '--call_mode aggregate':
# NOTE: usually is more accurate than 'count' mode
ccsmeth call_freqb \
  --input_bam /path/to/output.hifi.call_mods.modbam.pbmm2.bam \
  --ref /path/to/genome.fa \
  --output /path/to/output.hifi.call_mods.modbam.pbmm2.freq \
  --threads 10 --sort --bed \
  --call_mode aggregate \
  --aggre_model /path/to/ccsmeth/models/model_aggregate.ckpt

OR, use align mode (first align_hifi, then call_mods):

# 1. call hifi reads with kinetics if needed
# should have added pbccs to $PATH or the used environment
ccsmeth call_hifi --subreads /path/to/subreads.bam \
  --threads 10 \
  --output /path/to/output.hifi.bam


# 2. align hifi reads
# should have added pbmm2 to $PATH or the used environment
ccsmeth align_hifi \
  --hifireads /path/to/output.hifi.bam \
  --ref /path/to/genome.fa \
  --output /path/to/output.hifi.pbmm2.bam \
  --threads 10


# 3. call modifications
# output: [--output].modbam.bam
CUDA_VISIBLE_DEVICES=0 ccsmeth call_mods \
  --input /path/to/output.hifi.pbmm2.bam \
  --ref /path/to/genome.fa \
  --model_file /path/to/ccsmeth/models/model_call_mods.ckpt \
  --output /path/to/output.hifi.pbmm2.call_mods \
  --threads 10 --threads_call 2 --model_type attbigru2s \
  --mode align


# 4. call modification frequency
# outputs: [--output].[--call_mode].all.bed
# if the input bam file contains haplotags, 
# there will be [--output].[--call_mode].[hp1/hp2].bed in outputs.
# use '--call_mode count':
ccsmeth call_freqb \
  --input_bam /path/to/output.hifi.pbmm2.call_mods.modbam.bam \
  --ref /path/to/genome.fa \
  --output /path/to/output.hifi.pbmm2.call_mods.modbam.freq \
  --threads 10 --sort --bed

# OR, use '--call_mode aggregate':
# NOTE: usually is more accurate than 'count' mode
ccsmeth call_freqb \
  --input_bam /path/to/output.hifi.pbmm2.call_mods.modbam.bam \
  --ref /path/to/genome.fa \
  --output /path/to/output.hifi.pbmm2.call_mods.modbam.freq \
  --threads 10 --sort --bed \
  --call_mode aggregate \
  --aggre_model /path/to/ccsmeth/models/model_aggregate.ckpt

Usage

Users can use ccsmeth subcommands --help/-h for help.

[the cmds need to be updated]

1. call hifi reads

ccsmeth call_hifi -h
usage: ccsmeth call_hifi [-h] --subreads SUBREADS [--output OUTPUT]
                         [--path_to_ccs PATH_TO_CCS] [--threads THREADS]
                         [--min-passes MIN_PASSES] [--by-strand] [--hd-finder]
                         [--log-level LOG_LEVEL]
                         [--path_to_samtools PATH_TO_SAMTOOLS]

call hifi reads with kinetics from subreads.bam using CCS, save in bam/sam
format. cmd: ccsmeth call_hifi -i input.subreads.bam

optional arguments:
  -h, --help            show this help message and exit
  --path_to_samtools PATH_TO_SAMTOOLS
                        full path to the executable binary samtools file. If
                        not specified, it is assumed that samtools is in the
                        PATH.

INPUT:
  --subreads SUBREADS, -i SUBREADS
                        path to subreads.bam file as input

OUTPUT:
  --output OUTPUT, -o O

Ccsmeth

Install / Use

README