BS-Seeker3

BS-Seeker3 maps bisulfite-treated reads (bs-seq) with high accuracy and ultra-fast speed. While being 1.5 time faster than BSMAP and 10 times faster than Bismark, BS-Seeker3 can map twice as many reads than both aligners. In addition to its high-throughput performance, BS-Seeker3 offers additional downstream analysis to further investigate and visualize methylation pattern post-alignment.

<a name="NewFeatures"></a>New Features

BS-seeker3 now employs an improved index, conducts fast alignment with SNAP, and incorporates a highly optimized pipeline to process SNAP outputs.
BS-seeker3 now executes local alignment through the Unnoken Algorithm, which allows high mappability and accuracy without sacrificing too much runtime.
BS-seeke3 also outputs a preliminary quality control graph, a meta-gene plot, and a bisulfite unconversion rate histogram. Additional downstream methylation analysis is supported by MethGo.

BS-Seeker3 Pipeline

overview

Annoucement

I am aware and fixing a file-directory problem that's preventing the code to be run in a more flexible manner. Also improving the RRBS segment and problem with matching short reads (length < 50). MS1 obgligations are making this go slowly but it's getting there.

New Features
System Requirements
BS-Seeker3 Usage
Example Use Case
Linking with MethGo

<a name="SystemRequirements"></a>System Requirements

Linux or Mac OS Environment
Python2 (version 2.5.2 or above; it should be pre-installed in both Linux and Mac). Type 'Python' to see the installed version. Python2 could be downloaded from http://www.python.org/download/ )
GCC 5.4.0 +
SNAP version 1.0beta.23, which could be downloaded from https://github.com/amplab/snap
Python Modules 'Pysam' and 'Metplotlib'. To install the packages, use the following commands on an UNIX terminal:

pip install pysam

pip install Matplolib

<a name="RunningBS-Seeker3"></a>Running BS-Seeker3

BS-Seeker3 is a 3-steps pipeline: index-building, bs-seq alignment, and methylation rate calculation. Prior to alignment, BS3 first builds a custom-index for the reference genome (the user should adjust specific index-building parameters based on the reference genome size, see below for details). During alignment, BS3 uses SNAP to map bisulfite reads, and then sorts through the non-unique and incorrectly converted mappings. After alignment, methylation rate is then calcualted at the single-base resolution.

<a name="DownloadBS-Seeker3"></a>Download BS-Seeker3

Type the following commands in an Unix Terminal:

To download the Mac verion:

Stay tuned.

To download the Linux version:

git clone https://github.com/khuang28jhu/bs3

<a name="IndexBuilding"></a>Index Buidling

Use the script bs3-build.py to build an index for a reference genome.

Usage:

$ ./bs3-build -h 
Usage: ./bs3-build -h [options]

-f                   Path to the reference genome; the reference genome should be in fasta format

-s                   Seed size (default: 20), a SNAP option; SNAP uses a hashtable strucutre. 
                     It builds the index by breaking the reference genome into multiple seqeunces
                     (seed) of a set length. This option determines the length of each 
                     seqeunce (seed size), and SNAP can process seed sizes to 23. A seed size of
                     20 is recommended for bisulfite reads of 100 bp long; a longer size should be
                     used for raw reads of longer length. 
                     
-L                   (default: 4), a SNAP option specific to the Linux implementation; This option 
                     determines the byte size to store the location of each seed along the 
                     reference genome. It ranges from 4 to 8 bytes. For larger genomes, a larger 
                     location size should be used; for example, to build an index based on the human 
                     genome, a location size of 5 bytes is recommended.

<a name="Alignment"></a>Alignment

Use the script bs3-align.py to map the raw bisulfite reads.

Input:

BS reads file in fastq

@SRR019072.2842 HWI-EAS365_1060:4:1:51:313 length=87
TAATTAGATTTGTGTTATAGATTATTTGTAAAGAAAGTAATTATTAAAGGAAATGTTAGTTTTTATTTGATATATGATAAGAGAACG
+SRR019072.2842 HWI-EAS365_1060:4:1:51:313 length=87
BBBCC@)8ABA/<2>CB:=.:?BBABB1-:@74@B@?=@@ABB@B7@@5/98<;)<>56:?>:;A?A?A@>=AABB@A<3(@@=086

BS reads file in fasta

>read1
TCCATTATACCGTAACCCAATACAAAAATTATTTAT
>read2
TCTGTAGACGGGTCGAATGGGGAGTTCATAGGGGGG

Usage:

$ ./bs3-align -h 
Usage: ./bs3-align -h [options] 

For single end reads:

-i INFILE,           Input read file (FORMAT:  fasta, fastq). Ex: read.fa or read.fa.gz

For pair end reads:

-1 FILE,             Input read file, mate 1 (FORMAT: fasta, fastq)

-2 FILE,             Input read file, mate 2 (FORMAT: fasta, fastq)

Important General options:

-K ALIGNMENT LENGTH, Neglect alignments with length below this value

-g GENOME,           Name of the reference genome (should be the same as "-f" in bs3-build.py ) [ex.
                     chr21_hg18.fa]

-m NO_MISMATCHES,    Set the number(>=1)/percentage([0, 1)) of mismatche in a read. Ex: 8 (allow 8 
                     mismatches) or 0.08 (allow 8% mismatches) [Default: 12]
                     
-l INT,              Split the input file into smaller files based on this number. Each smaller file 
                     is processed in paralell. The result is then merged. [Default: 12800000]
                     
-o OUTFILE           The name of output file 

Relevant Aligner Options:

--snap-h             MaxHits, (default: 250 on the Mac version, 300 on Linux) a SNAP option; There 
                     are often seeds matching to multiple locations in the genomes. Sorting throught all
                     the putative hits is a time-consuming process. This option sets a threshold on the 
                     number of locations that a seed can match to. Seeds matching to locations more than 
                     this number are ignored during the alignment.
                     
Methylation Rate Statistics Display Option:

--qcf=QC_F           Supply the length of the raw bisulfite reads to plot a quality control plot. A 
                     quality control plot tabulates the average rate of mismatche at each position 
                     on a raw read.

Output:

Alignment Summary in .stat file

	BS-seeker3 Result

	Final Alignment Report
================================================

Number of reads in total: 20000
Number of unique-hits reads (before post-filtering): 9990.0
Number of reads mapped after post-filtering 9934.0

Alignment Time: 1.14915108681secs

	Final Cytosine Report
================================================

Total Number of Cytosines: 348392.0
Total Number of Cs in CpG context: 45242.0
Total Number of Cs in CHG context: 49610.0
Total Number of Cs in CHH context: 253540.0

Rate of Methylation
 mCG  0.999%
 mCHG  0.999%
 mCHH  0.999%

List of Aligned Reads in SAM Format (SAM Fields Description)

SRR2058107.412129	0	10_w_c	42386003	1	90M	*	0	0	TGGATTGGAAGGTAATTATTATTGAATGGAATTGAATGGAATTATTGAATGGATTTGAATGGAATAATTATTGAATGGAATTGAATGGAA	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	PG:Z:SNAP	NM:i:3	RG:Z:FASTQ	PL:Z:Illumina	PU:Z:pu	LB:Z:lb	SM:Z:sm

<a name= "MethylationRateCalculation" ></a> Methylation Rate Calculation

Use the script bs3-align.py to map raw bisulfite reads.

Input:

SAM file from the previous step

Usage:

$ ./bs3-call_methylation -h 
Usage: ./bs3-call_methylation -h [options]

Options:

-i INFILE,          Input alinged read file in SAM format; output from bs3-align.py

-d DBPATH,          Path to the reference genome index (generated during index-buidling) (optional)

-o OUTFILE,         The output prefix to create the CGmap, ATCGmap and wiggle files

--sorted,           Specify when the input bam file is already sorted, the sorting step will be 
                    skipped [Default: False]

<a name="Outputaa"></a>Output:

wig file

Sample:

  variableStep chrom=chr1
  3000419	0.000000
  3000423	-0.2
  3000440	0.000000
  3000588	0.5
  3000593	-0.000000


  Format descriptions:
  WIG file format. Negative value for 2nd column indicate a Cytosine on minus strand.

CGmap file

Sample:

  chr1	G	3000851	CHH	CC	0.1	1	10
  chr1	C	3001624	CHG	CA	0.0	0	9
  chr1	C	3001631	CG	CG	1.0	5	5

Format descriptions:

  (1) chromosome
  (2) nucleotide on Watson (+) strand
  (3) position
  (4) context (CG/CHG/CHH)
  (5) dinucleotide-context (CA/CC/CG/CT)
  (6) methylation-level = #_of_C / (#_of_C + #_of_T).
  (7) #_of_C (methylated C, the count of reads showing C here)
  (8) = #_of_C + #_of_T (all Cytosines, the count of reads showing C or T here)

ATCGmap file

Sample:

  chr1	T	3009410	--	--	0	10	0	0	0	0	0	0	0	0	na
  chr1	C	3009411	CHH	CC	0	10	0	0	0	0	0	0	0	0	0.0
  chr1	C	3009412	CHG	CC	0	10	0	0	0	0	0	0	0	0	0.0
  chr1	C	3009413	CG	CG	0	10	50	0	0	0	0	0	0	0	0.83

Format descriptions:

  (1) chromosome
  (2) nucleotide on Watson (+) strand
  (3) position
  (4) context (CG/CHG/CHH)
  (5) dinucleotide-context (CA/CC

Bs3

Install / Use

README

BS-Seeker3

<a name="NewFeatures"></a>New Features

BS-Seeker3 Pipeline

Annoucement

Table of Contents

New Features

System Requirements

BS-Seeker3 Usage

Download BS-Seeker3

Index Buidling

Alignment

Methylation Rate Calculation

Methylation Rate Statistics Display

Example Use Case

Linking with MethGo

<a name="SystemRequirements"></a>System Requirements

<a name="RunningBS-Seeker3"></a>Running BS-Seeker3

<a name="DownloadBS-Seeker3"></a>Download BS-Seeker3

<a name="IndexBuilding"></a>Index Buidling

<a name="Alignment"></a>Alignment

<a name= "MethylationRateCalculation" ></a> Methylation Rate Calculation