Bs3
BS-Seeker3: An Ultra-fast, Versatile Pipeline for Mapping Bisulfite-treated Reads.
Install / Use
/learn @khuang28jhu/Bs3README
BS-Seeker3
BS-Seeker3 maps bisulfite-treated reads (bs-seq) with high accuracy and ultra-fast speed. While being 1.5 time faster than BSMAP and 10 times faster than Bismark, BS-Seeker3 can map twice as many reads than both aligners. In addition to its high-throughput performance, BS-Seeker3 offers additional downstream analysis to further investigate and visualize methylation pattern post-alignment.
<a name="NewFeatures"></a>New Features
- BS-seeker3 now employs an improved index, conducts fast alignment with SNAP, and incorporates a highly optimized pipeline to process SNAP outputs.
- BS-seeker3 now executes local alignment through the Unnoken Algorithm, which allows high mappability and accuracy without sacrificing too much runtime.
- BS-seeke3 also outputs a preliminary quality control graph, a meta-gene plot, and a bisulfite unconversion rate histogram. Additional downstream methylation analysis is supported by MethGo.
BS-Seeker3 Pipeline

Annoucement
I am aware and fixing a file-directory problem that's preventing the code to be run in a more flexible manner. Also improving the RRBS segment and problem with matching short reads (length < 50). MS1 obgligations are making this go slowly but it's getting there.
Table of Contents
<a name="SystemRequirements"></a>System Requirements
- Linux or Mac OS Environment
- Python2 (version 2.5.2 or above; it should be pre-installed in both Linux and Mac). Type 'Python' to see the installed version. Python2 could be downloaded from http://www.python.org/download/ )
- GCC 5.4.0 +
- SNAP version 1.0beta.23, which could be downloaded from https://github.com/amplab/snap
- Python Modules 'Pysam' and 'Metplotlib'. To install the packages, use the following commands on an UNIX terminal:
pip install pysam
pip install Matplolib
<a name="RunningBS-Seeker3"></a>Running BS-Seeker3
BS-Seeker3 is a 3-steps pipeline: index-building, bs-seq alignment, and methylation rate calculation. Prior to alignment, BS3 first builds a custom-index for the reference genome (the user should adjust specific index-building parameters based on the reference genome size, see below for details). During alignment, BS3 uses SNAP to map bisulfite reads, and then sorts through the non-unique and incorrectly converted mappings. After alignment, methylation rate is then calcualted at the single-base resolution.
<a name="DownloadBS-Seeker3"></a>Download BS-Seeker3
Type the following commands in an Unix Terminal:
- To download the Mac verion:
Stay tuned.
- To download the Linux version:
git clone https://github.com/khuang28jhu/bs3
<a name="IndexBuilding"></a>Index Buidling
Use the script bs3-build.py to build an index for a reference genome.
Usage:
$ ./bs3-build -h
Usage: ./bs3-build -h [options]
-f Path to the reference genome; the reference genome should be in fasta format
-s Seed size (default: 20), a SNAP option; SNAP uses a hashtable strucutre.
It builds the index by breaking the reference genome into multiple seqeunces
(seed) of a set length. This option determines the length of each
seqeunce (seed size), and SNAP can process seed sizes to 23. A seed size of
20 is recommended for bisulfite reads of 100 bp long; a longer size should be
used for raw reads of longer length.
-L (default: 4), a SNAP option specific to the Linux implementation; This option
determines the byte size to store the location of each seed along the
reference genome. It ranges from 4 to 8 bytes. For larger genomes, a larger
location size should be used; for example, to build an index based on the human
genome, a location size of 5 bytes is recommended.
<a name="Alignment"></a>Alignment
Use the script bs3-align.py to map the raw bisulfite reads.
Input:
- BS reads file in fastq
@SRR019072.2842 HWI-EAS365_1060:4:1:51:313 length=87
TAATTAGATTTGTGTTATAGATTATTTGTAAAGAAAGTAATTATTAAAGGAAATGTTAGTTTTTATTTGATATATGATAAGAGAACG
+SRR019072.2842 HWI-EAS365_1060:4:1:51:313 length=87
BBBCC@)8ABA/<2>CB:=.:?BBABB1-:@74@B@?=@@ABB@B7@@5/98<;)<>56:?>:;A?A?A@>=AABB@A<3(@@=086
- BS reads file in fasta
>read1
TCCATTATACCGTAACCCAATACAAAAATTATTTAT
>read2
TCTGTAGACGGGTCGAATGGGGAGTTCATAGGGGGG
Usage:
$ ./bs3-align -h
Usage: ./bs3-align -h [options]
For single end reads:
-i INFILE, Input read file (FORMAT: fasta, fastq). Ex: read.fa or read.fa.gz
For pair end reads:
-1 FILE, Input read file, mate 1 (FORMAT: fasta, fastq)
-2 FILE, Input read file, mate 2 (FORMAT: fasta, fastq)
Important General options:
-K ALIGNMENT LENGTH, Neglect alignments with length below this value
-g GENOME, Name of the reference genome (should be the same as "-f" in bs3-build.py ) [ex.
chr21_hg18.fa]
-m NO_MISMATCHES, Set the number(>=1)/percentage([0, 1)) of mismatche in a read. Ex: 8 (allow 8
mismatches) or 0.08 (allow 8% mismatches) [Default: 12]
-l INT, Split the input file into smaller files based on this number. Each smaller file
is processed in paralell. The result is then merged. [Default: 12800000]
-o OUTFILE The name of output file
Relevant Aligner Options:
--snap-h MaxHits, (default: 250 on the Mac version, 300 on Linux) a SNAP option; There
are often seeds matching to multiple locations in the genomes. Sorting throught all
the putative hits is a time-consuming process. This option sets a threshold on the
number of locations that a seed can match to. Seeds matching to locations more than
this number are ignored during the alignment.
Methylation Rate Statistics Display Option:
--qcf=QC_F Supply the length of the raw bisulfite reads to plot a quality control plot. A
quality control plot tabulates the average rate of mismatche at each position
on a raw read.
Output:
- Alignment Summary in .stat file
BS-seeker3 Result
Final Alignment Report
================================================
Number of reads in total: 20000
Number of unique-hits reads (before post-filtering): 9990.0
Number of reads mapped after post-filtering 9934.0
Alignment Time: 1.14915108681secs
Final Cytosine Report
================================================
Total Number of Cytosines: 348392.0
Total Number of Cs in CpG context: 45242.0
Total Number of Cs in CHG context: 49610.0
Total Number of Cs in CHH context: 253540.0
Rate of Methylation
mCG 0.999%
mCHG 0.999%
mCHH 0.999%
- List of Aligned Reads in SAM Format (SAM Fields Description)
SRR2058107.412129 0 10_w_c 42386003 1 90M * 0 0 TGGATTGGAAGGTAATTATTATTGAATGGAATTGAATGGAATTATTGAATGGATTTGAATGGAATAATTATTGAATGGAATTGAATGGAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII PG:Z:SNAP NM:i:3 RG:Z:FASTQ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm
<a name= "MethylationRateCalculation" ></a> Methylation Rate Calculation
Use the script bs3-align.py to map raw bisulfite reads.
Input:
- SAM file from the previous step
Usage:
$ ./bs3-call_methylation -h
Usage: ./bs3-call_methylation -h [options]
Options:
-i INFILE, Input alinged read file in SAM format; output from bs3-align.py
-d DBPATH, Path to the reference genome index (generated during index-buidling) (optional)
-o OUTFILE, The output prefix to create the CGmap, ATCGmap and wiggle files
--sorted, Specify when the input bam file is already sorted, the sorting step will be
skipped [Default: False]
<a name="Outputaa"></a>Output:
-
wig file
Sample:
variableStep chrom=chr1 3000419 0.000000 3000423 -0.2 3000440 0.000000 3000588 0.5 3000593 -0.000000 Format descriptions: WIG file format. Negative value for 2nd column indicate a Cytosine on minus strand. -
CGmap file
Sample:
chr1 G 3000851 CHH CC 0.1 1 10 chr1 C 3001624 CHG CA 0.0 0 9 chr1 C 3001631 CG CG 1.0 5 5Format descriptions:
(1) chromosome (2) nucleotide on Watson (+) strand (3) position (4) context (CG/CHG/CHH) (5) dinucleotide-context (CA/CC/CG/CT) (6) methylation-level = #_of_C / (#_of_C + #_of_T). (7) #_of_C (methylated C, the count of reads showing C here) (8) = #_of_C + #_of_T (all Cytosines, the count of reads showing C or T here) -
ATCGmap file
Sample:
chr1 T 3009410 -- -- 0 10 0 0 0 0 0 0 0 0 na chr1 C 3009411 CHH CC 0 10 0 0 0 0 0 0 0 0 0.0 chr1 C 3009412 CHG CC 0 10 0 0 0 0 0 0 0 0 0.0 chr1 C 3009413 CG CG 0 10 50 0 0 0 0 0 0 0 0.83Format descriptions:
(1) chromosome (2) nucleotide on Watson (+) strand (3) position (4) context (CG/CHG/CHH) (5) dinucleotide-context (CA/CC
