A new version, MECAT2, has been released. Please download and use MECAT2; this version is no longer updated.

Contents

Introduction
Installation
Quick Start
Input Format
Program Descriptions
Citation
Contact
Update Information

<a name="S-introduction"></a>Introduction

MECAT is an ultra-fast Mapping, Error Correction and de novo Assembly Tool for single-molecule real-time (SMRT) sequencing reads. MECAT employs novel alignment and error correction algorithms that are much more efficient than state-of-the-art aligners and error correction tools. MECAT can be used for effective de novo assembly of large genomes. For example, on a 32-thread computer with a 2.0 GHz CPU, MECAT takes 9.5 days to assemble a human genome from 54x SMRT data, which is 40 times faster than the current PBcR-MHAP pipeline. We also used MECAT to assemble a diploid human genome from 102x SMRT data in only 25 days. The latter assembly is of considerably higher quality than the previous genome assembled from the 54x haploid SMRT data.

MECAT's performance was compared with that of the PBcR-MHAP pipeline, FALCON, and Canu (v1.3) on five real datasets. The quality of the assembled contigs produced by MECAT is the same as or better than that of the PBcR-MHAP pipeline and FALCON. Here are some comparisons on the 32-thread computer with a 2.0 GHz CPU and 512 GB of RAM:

<div>
<table border="0">
<tr> <th>Genome</th> <th>Pipeline</th> <th>Thread number</th> <th>Total running time (h)</th> <th>NG50 (bp)</th> </tr>
<tr> <th>E. coli</th> <th>FALCON</th> <th>16</th> <th>1.21</th> <th>4,635,129</th> </tr>
<tr> <th></th> <th>PBcR-MHAP</th> <th>16</th> <th>1.29</th> <th>4,652,272</th> </tr>
<tr> <th></th> <th>Canu</th> <th>16</th> <th>0.71</th> <th>4,648,002</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>16</th> <th>0.24</th> <th>4,649,626</th> </tr>
<tr> <th>Yeast</th> <th>FALCON</th> <th>16</th> <th>2.16</th> <th>587,169</th> </tr>
<tr> <th></th> <th>PBcR-MHAP</th> <th>16</th> <th>4.2</th> <th>818,229</th> </tr>
<tr> <th></th> <th>Canu</th> <th>16</th> <th>5.11</th> <th>739,902</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>16</th> <th>0.91</th> <th>929,350</th> </tr>
<tr> <th>A. thaliana</th> <th>FALCON</th> <th>16</th> <th>223.84</th> <th>7,583,032</th> </tr>
<tr> <th></th> <th>PBcR-MHAP</th> <th>16</th> <th>188.7</th> <th>9,610,192</th> </tr>
<tr> <th></th> <th>Canu</th> <th>16</th> <th>118.57</th> <th>8,315,338</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>16</th> <th>10.73</th> <th>12,600,961</th> </tr>
<tr> <th>D. melanogaster</th> <th>FALCON</th> <th>16</th> <th>140.72</th> <th>15,664,372</th> </tr>
<tr> <th></th> <th>PBcR-MHAP</th> <th>16</th> <th>101.22</th> <th>13,627,256</th> </tr>
<tr> <th></th> <th>Canu</th> <th>16</th> <th>69.34</th> <th>14,179,324</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>16</th> <th>9.58</th> <th>18,111,159</th> </tr>
<tr> <th>Human (54X)</th> <th>PBcR-MHAP(f)</th> <th>32</th> <th>5750</th> <th>1,857,788</th> </tr>
<tr> <th></th> <th>PBcR-MHAP(s)</th> <th>32</th> <th>13000</th> <th>4,320,471</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>32</th> <th>230.54</th> <th>4,878,957</th> </tr>
</table>
</div>

MECAT consists of four modules:

mecat2pw, a fast and accurate pairwise mapping tool for SMRT reads
mecat2ref, a fast and accurate reference mapping tool for SMRT reads
mecat2cns, a tool that corrects noisy reads based on their pairwise overlaps
mecat2canu, a modified and more efficient version of the Canu pipeline. Canu is a customized version of the Celera Assembler that is designed for high-noise single-molecule sequencing

MECAT is written in C, C++, and Perl. It is open source and distributed under the GPLv3 license.

<a name="S-installation"></a>Installation

The commands below assume that the current directory is /public/users/chenying/smrt_asm. Replace this path with your own working directory.

Install MECAT:

git clone https://github.com/xiaochuanle/MECAT.git
cd MECAT
make 
cd ..

After installation, all the executables are found in MECAT/Linux-amd64/bin. The folder name Linux-amd64 varies with the operating system. For example, on macOS the executables are put in MECAT/Darwin-amd64/bin.
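Because the folder name depends on the operating system, it can be convenient to locate the bin directory instead of hard-coding it. The helper below is only a sketch: the function name find_mecat_bin is hypothetical, and it assumes the folder always ends in -amd64, as in the two examples above.

```shell
# Hypothetical helper: print the first <root>/*-amd64/bin directory that exists.
# Assumption: the OS-specific folder always ends in "-amd64".
find_mecat_bin() {
  local root="$1" dir
  for dir in "$root"/*-amd64/bin; do
    if [ -d "$dir" ]; then
      echo "$dir"
      return 0
    fi
  done
  echo "no *-amd64/bin directory found under $root" >&2
  return 1
}
```

For example, `export PATH="$(find_mecat_bin /public/users/chenying/smrt_asm/MECAT):$PATH"` adds the located directory to PATH.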

Install HDF5:

wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8/hdf5-1.8.15-patch1/src/hdf5-1.8.15-patch1.tar.gz
tar xzvf hdf5-1.8.15-patch1.tar.gz
mkdir hdf5
cd hdf5-1.8.15-patch1
./configure --enable-cxx --prefix=/public/users/chenying/smrt_asm/hdf5
make
make install
cd ..

The header files of HDF5 are in hdf5/include. The library files of HDF5 are in hdf5/lib. On some systems they are installed in hdf5/lib64 instead, so check which directory exists before setting the paths below.
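The lib versus lib64 check can be scripted. The following is a minimal sketch, assuming the HDF5 libraries are installed as files named libhdf5*; the function name hdf5_lib_dir is hypothetical.

```shell
# Hypothetical helper: print the directory under <prefix> that holds libhdf5*.
hdf5_lib_dir() {
  local prefix="$1" sub
  for sub in lib lib64; do
    # ls fails when the glob matches nothing, so this tests for library files.
    if ls "$prefix/$sub"/libhdf5* >/dev/null 2>&1; then
      echo "$prefix/$sub"
      return 0
    fi
  done
  echo "no HDF5 libraries found under $prefix" >&2
  return 1
}
```

The result can then be used when setting HDF5_LIB, for example `export HDF5_LIB="$(hdf5_lib_dir /public/users/chenying/smrt_asm/hdf5)"`.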

Install dextract:

git clone https://github.com/PacificBiosciences/DEXTRACTOR.git
cp MECAT/dextract_makefile DEXTRACTOR
cd DEXTRACTOR
export HDF5_INCLUDE=/public/users/chenying/smrt_asm/hdf5/include
export HDF5_LIB=/public/users/chenying/smrt_asm/hdf5/lib
make -f dextract_makefile
cd ..

Add the relevant paths:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/public/users/chenying/smrt_asm/hdf5/lib
export PATH=/public/users/chenying/smrt_asm/MECAT/Linux-amd64/bin:$PATH
export PATH=/public/users/chenying/smrt_asm/DEXTRACTOR:$PATH
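To confirm that the paths took effect, one can check that the executables named in this README are found on PATH. This is a small sketch; the function name missing_tools is hypothetical.

```shell
# Hypothetical helper: print every given tool name that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Prints nothing when all of the tools are available.
missing_tools mecat2pw mecat2ref mecat2cns mecat2canu extract_sequences dextract
```

Any name printed by the last command points to a missing or misspelled PATH entry.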

<a name="S-quick-start"></a>Quick Start

Using MECAT to assemble a genome involves 4 steps. Here we take the assembly of the E. coli genome as an example and go through each step in order. The options of each command are given in the Program Descriptions section.

Assembling PacBio Data

We download the reads ecoli_filtered.fastq.gz from the MHAP website. By decompressing it we obtain ecoli_filtered.fastq.

Step 1, use mecat2pw to detect overlapping candidates


mecat2pw -j 0 -d ecoli_filtered.fastq -o ecoli_filtered.fastq.pm.can -w wrk_dir -t 16

Step 2, correct the noisy reads based on their pairwise overlapping candidates.


mecat2cns -i 0 -t 16 ecoli_filtered.fastq.pm.can ecoli_filtered.fastq corrected_ecoli_filtered.fasta

Step 3, extract the longest 25X corrected reads


extract_sequences corrected_ecoli_filtered.fasta corrected_ecoli_25x.fasta 4800000 25
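The last two arguments are the genome size and the coverage, so "25X" for a 4.8 Mb genome corresponds to about 120 Mb of sequence. The sketch below works through that arithmetic and measures the coverage of a FASTA file; the function name fasta_coverage is hypothetical, and how extract_sequences selects reads internally is not covered here.

```shell
genome_size=4800000   # approximate E. coli genome size in bases
coverage=25           # requested coverage (25X)

# Total number of bases that 25X coverage corresponds to.
target_bases=$((genome_size * coverage))
echo "$target_bases"   # 120000000

# Hypothetical helper: coverage of a FASTA file = total bases / genome size.
fasta_coverage() {
  awk -v g="$2" '!/^>/ { total += length($0) }
                 END   { printf "%.1f\n", total / g }' "$1"
}
```

After step 3, `fasta_coverage corrected_ecoli_25x.fasta 4800000` should then report a value close to 25.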

Step 4, assemble the longest 25X corrected reads using mecat2canu


mecat2canu -trim-assemble -p ecoli -d ecoli genomeSize=4800000 ErrorRate=0.02 maxMemory=40 maxThreads=16 useGrid=0 Overlapper=mecat2asmpw -pacbio-corrected corrected_ecoli_25x.fasta

Assembling Nanopore Data

Download MAP006-PCR-1_2D_pass.fasta.

Step 1, use mecat2pw to detect overlapping candidates


mecat2pw -j 0 -d MAP006-PCR-1_2D_pass.fasta -o candidates.txt -w wrk_dir -t 16 -x 1

Step 2, correct the noisy reads based on their pairwise overlapping candidates.


mecat2cns -i 0 -t 16 -x 1 candidates.txt MAP006-PCR-1_2D_pass.fasta corrected_ecoli.fasta

Step 3, extract the longest 25X corrected reads


extract_sequences corrected_ecoli.fasta corrected_ecoli_25x.fasta 4800000 25

Step 4, assemble the longest 25X corrected reads using mecat2canu


mecat2canu -trim-assemble -p ecoli -d ecoli genomeSize=4800000 ErrorRate=0.06 maxMemory=40 maxThreads=16 useGrid=0 Overlapper=mecat2asmpw -nanopore-corrected corrected_ecoli_25x.fasta
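The PacBio and Nanopore walkthroughs differ only in the -x 1 option, the ErrorRate value, and the -pacbio-corrected or -nanopore-corrected flag. The sketch below chains the four steps in one function and switches these three settings by platform. The function names, the intermediate file names, and the DRY_RUN switch are hypothetical; only the commands and options shown in the walkthroughs are used, and output names are passed with the .fasta extension as in the Nanopore example.

```shell
# Sketch only. DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: mecat_quick_start <pacbio|nanopore> <reads> <prefix> <genome_size> <threads>
mecat_quick_start() {
  local platform="$1" reads="$2" prefix="$3" genome_size="$4" threads="$5"
  local x_opt error_rate reads_flag candidates corrected longest

  case "$platform" in
    pacbio)   x_opt="";     error_rate="0.02"; reads_flag="-pacbio-corrected" ;;
    nanopore) x_opt="-x 1"; error_rate="0.06"; reads_flag="-nanopore-corrected" ;;
    *) echo "unknown platform: $platform" >&2; return 1 ;;
  esac

  candidates="$prefix.candidates.txt"       # hypothetical intermediate names
  corrected="corrected_$prefix.fasta"
  longest="corrected_${prefix}_25x.fasta"

  # $x_opt is left unquoted on purpose so that "-x 1" becomes two arguments.
  run mecat2pw -j 0 -d "$reads" -o "$candidates" -w wrk_dir -t "$threads" $x_opt &&
  run mecat2cns -i 0 -t "$threads" $x_opt "$candidates" "$reads" "$corrected" &&
  run extract_sequences "$corrected" "$longest" "$genome_size" 25 &&
  run mecat2canu -trim-assemble -p "$prefix" -d "$prefix" \
      genomeSize="$genome_size" ErrorRate="$error_rate" maxMemory=40 \
      maxThreads="$threads" useGrid=0 Overlapper=mecat2asmpw \
      "$reads_flag" "$longest"
}
```

Calling `mecat_quick_start nanopore MAP006-PCR-1_2D_pass.fasta ecoli 4800000 16` with DRY_RUN=1 prints the four commands for review; setting DRY_RUN=0 runs them, stopping at the first step that fails.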

After step 4, the assembled genome is given in file ecoli/ecoli.contigs.fasta. Details of the contigs can be found in file ecoli/ecoli.layout.tigInfo.
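A quick way to inspect the result is to compute contig lengths and the NG50 reported in the comparison table. The sketch below uses the standard NG50 definition (the contig length at which the longest contigs together cover at least half of the genome size); the function names are hypothetical and are not part of MECAT.

```shell
# Hypothetical helper: print the length of each FASTA record, one per line.
fasta_lengths() {
  awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
       { len += length($0) }
       END { if (seen) print len }' "$1"
}

# Hypothetical helper: NG50 of a FASTA file for a given genome size.
# Prints nothing if the contigs cover less than half of the genome size.
fasta_ng50() {
  fasta_lengths "$1" | sort -nr | awk -v g="$2" '
    { lens[NR] = $1 }
    END {
      half = g / 2
      for (i = 1; i <= NR; i++) {
        running += lens[i]
        if (running >= half) { print lens[i]; exit }
      }
    }'
}
```

For the E. coli example, `fasta_ng50 ecoli/ecoli.contigs.fasta 4800000` prints the NG50 of the assembled contigs.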

<a name="S-input-format"></a>Input Format

MECAT is capable of processing FASTA, FASTQ, and H5 format files. However, H5 files must first be converted to FASTA format by running DEXTRACTOR/dextract before running MECAT. For example:

find pathto/raw_reads -name "*.bax.h5" -exec readlink -f {} \; > reads.fofn
while read -r line; do dextract -v "$line" >> reads.fasta; done < reads.fofn

The resulting reads.fasta file is then used as the input file for MECAT.
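Before starting a run, it can help to confirm which of the three formats a file is in. The sketch below guesses the format from the .h5 extension or from the first character of the file ('>' for FASTA, '@' for FASTQ). The function name detect_read_format is hypothetical, and compressed files are reported as unknown.

```shell
# Hypothetical helper: guess the read format of a file.
detect_read_format() {
  local file="$1"
  case "$file" in
    *.h5) echo "h5"; return 0 ;;   # must be converted with dextract first
  esac
  case "$(head -c 1 "$file" 2>/dev/null)" in
    ">") echo "fasta" ;;
    "@") echo "fastq" ;;
    *)   echo "unknown" ;;
  esac
}
```

For example, `detect_read_format reads.fasta` should print fasta for the file produced by the commands above.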

<a name="S-program-description"></a>Program Descriptions

We describe in detail each module of MECAT, includin
