MECAT
MECAT: an ultra-fast mapping, error correction and de novo assembly tool for single-molecule sequencing reads
We have released a new version, MECAT2. Please download that version instead; this version will no longer be updated.
<a name="S-introduction"></a>Introduction
MECAT is an ultra-fast mapping, error correction and de novo assembly tool for single-molecule real-time (SMRT) sequencing reads. MECAT employs novel alignment and error correction algorithms that are much more efficient than state-of-the-art aligners and error correction tools. MECAT can be used to assemble large genomes de novo. For example, on a 32-thread computer with a 2.0 GHz CPU, MECAT takes 9.5 days to assemble a human genome from 54x SMRT data, which is 40 times faster than the current PBcR-MHAP pipeline. We also used MECAT to assemble a diploid human genome from 102x SMRT data in only 25 days. The latter assembly is of considerably higher quality than the earlier assembly built from the 54x haploid SMRT data. MECAT's performance was compared with that of the PBcR-MHAP pipeline, FALCON and Canu (v1.3) on five real datasets. The quality of the contigs assembled by MECAT is the same as or better than that of the PBcR-MHAP pipeline and FALCON. Here are some comparisons on the 32-thread computer with a 2.0 GHz CPU and 512 GB of RAM:
<div>
<table border="0">
<tr> <th>Genome</th> <th>Pipeline</th> <th>Thread number</th> <th>Total running time (h)</th> <th>NG50 of genome</th> </tr>
<tr> <th>E.coli</th> <th>FALCON</th> <th>16</th> <th>1.21</th> <th>4,635,129</th> </tr>
<tr> <th></th> <th>PBcR-MHAP</th> <th>16</th> <th>1.29</th> <th>4,652,272</th> </tr>
<tr> <th></th> <th>Canu</th> <th>16</th> <th>0.71</th> <th>4,648,002</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>16</th> <th>0.24</th> <th>4,649,626</th> </tr>
<tr> <th>Yeast</th> <th>FALCON</th> <th>16</th> <th>2.16</th> <th>587,169</th> </tr>
<tr> <th></th> <th>PBcR-MHAP</th> <th>16</th> <th>4.2</th> <th>818,229</th> </tr>
<tr> <th></th> <th>Canu</th> <th>16</th> <th>5.11</th> <th>739,902</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>16</th> <th>0.91</th> <th>929,350</th> </tr>
<tr> <th>A.thaliana</th> <th>FALCON</th> <th>16</th> <th>223.84</th> <th>7,583,032</th> </tr>
<tr> <th></th> <th>PBcR-MHAP</th> <th>16</th> <th>188.7</th> <th>9,610,192</th> </tr>
<tr> <th></th> <th>Canu</th> <th>16</th> <th>118.57</th> <th>8,315,338</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>16</th> <th>10.73</th> <th>12,600,961</th> </tr>
<tr> <th>D.melanogaster</th> <th>FALCON</th> <th>16</th> <th>140.72</th> <th>15,664,372</th> </tr>
<tr> <th></th> <th>PBcR-MHAP</th> <th>16</th> <th>101.22</th> <th>13,627,256</th> </tr>
<tr> <th></th> <th>Canu</th> <th>16</th> <th>69.34</th> <th>14,179,324</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>16</th> <th>9.58</th> <th>18,111,159</th> </tr>
<tr> <th>Human(54X)</th> <th>PBcR-MHAP(f)</th> <th>32</th> <th>5750</th> <th>1,857,788</th> </tr>
<tr> <th></th> <th>PBcR-MHAP(s)</th> <th>32</th> <th>13000</th> <th>4,320,471</th> </tr>
<tr> <th></th> <th>MECAT</th> <th>32</th> <th>230.54</th> <th>4,878,957</th> </tr>
</table>
</div>
MECAT consists of four modules:
- mecat2pw, a fast and accurate pairwise mapping tool for SMRT reads
- mecat2ref, a fast and accurate reference mapping tool for SMRT reads
- mecat2cns, which corrects noisy reads based on their pairwise overlaps
- mecat2canu, a modified and more efficient version of the Canu pipeline. Canu is a customized version of the Celera Assembler that is designed for high-noise single-molecule sequencing
MECAT is written in C, C++, and Perl. It is open source and distributed under the GPLv3 license.
<a name="S-installation"></a>Installation
The commands below assume that the current directory is /public/users/chenying/smrt_asm. Replace this path with your own working directory.
- Install MECAT:
git clone https://github.com/xiaochuanle/MECAT.git
cd MECAT
make
cd ..
After installation, all the executables are found in MECAT/Linux-amd64/bin. The folder name Linux-amd64 varies with the operating system. For example, on macOS the executables are put in MECAT/Darwin-amd64/bin.
- Install HDF5:
wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8/hdf5-1.8.15-patch1/src/hdf5-1.8.15-patch1.tar.gz
tar xzvf hdf5-1.8.15-patch1.tar.gz
mkdir hdf5
cd hdf5-1.8.15-patch1
./configure --enable-cxx --prefix=/public/users/chenying/smrt_asm/hdf5
make
make install
cd ..
The HDF5 header files are in hdf5/include. The HDF5 library files are in hdf5/lib (on some systems they are put in hdf5/lib64 instead, so check which directory exists).
- Install dextract:
git clone https://github.com/PacificBiosciences/DEXTRACTOR.git
cp MECAT/dextract_makefile DEXTRACTOR
cd DEXTRACTOR
export HDF5_INCLUDE=/public/users/chenying/smrt_asm/hdf5/include
export HDF5_LIB=/public/users/chenying/smrt_asm/hdf5/lib
make -f dextract_makefile
cd ..
- Add the relevant paths:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/public/users/chenying/smrt_asm/hdf5/lib
export PATH=/public/users/chenying/smrt_asm/MECAT/Linux-amd64/bin:$PATH
export PATH=/public/users/chenying/smrt_asm/DEXTRACTOR:$PATH
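The installation notes above leave two things to check by hand: the platform-specific name of the MECAT bin directory, and whether HDF5 placed its libraries in lib or lib64. The following is a minimal sketch of helpers for those checks, plus a check that the tools ended up on PATH. The function names are hypothetical, and the `<kernel>-amd64` naming is an assumption generalized from the Linux-amd64 and Darwin-amd64 examples above.

```shell
# Print the expected MECAT bin directory for checkout path $1.
# Assumption: the build directory is named "<kernel>-amd64", as in the
# Linux-amd64 and Darwin-amd64 examples.
mecat_bin_dir() {
    printf '%s/%s-amd64/bin\n' "$1" "$(uname -s)"
}

# Print the HDF5 library directory under install prefix $1:
# lib64 if that directory exists, otherwise lib.
hdf5_lib_dir() {
    if [ -d "$1/lib64" ]; then
        printf '%s/lib64\n' "$1"
    else
        printf '%s/lib\n' "$1"
    fi
}

# Report every tool in the argument list that is not on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `missing_tools mecat2pw mecat2cns mecat2canu extract_sequences dextract` prints nothing and returns zero once the three export lines above have taken effect.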
<a name="S-quick-start"></a>Quick Start
Using MECAT to assemble a genome involves 4 steps. Here we take the assembly of the E. coli genome as an example and go through each step in order. The options of each command are described in the Program Descriptions section.
Assembling PacBio Data
Download the reads ecoli_filtered.fastq.gz from the MHAP website and decompress the file to obtain ecoli_filtered.fastq.
- Step 1, using mecat2pw to detect overlapping candidates
mecat2pw -j 0 -d ecoli_filtered.fastq -o ecoli_filtered.fastq.pm.can -w wrk_dir -t 16
- Step 2, correct the noisy reads based on their pairwise overlapping candidates.
mecat2cns -i 0 -t 16 ecoli_filtered.fastq.pm.can ecoli_filtered.fastq corrected_ecoli_filtered.fasta
- Step 3, extract the longest 25X corrected reads
extract_sequences corrected_ecoli_filtered.fasta corrected_ecoli_25x.fasta 4800000 25
- Step 4, assemble the longest 25X corrected reads using mecat2canu
mecat2canu -trim-assemble -p ecoli -d ecoli genomeSize=4800000 ErrorRate=0.02 maxMemory=40 maxThreads=16 useGrid=0 Overlapper=mecat2asmpw -pacbio-corrected corrected_ecoli_25x.fasta
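The arguments 4800000 and 25 in step 3 are the genome size and the desired coverage, so the longest reads are kept up to roughly 4,800,000 × 25 = 120,000,000 bases. The sketch below only illustrates that arithmetic on a list of read lengths. It is not the implementation of extract_sequences, and both function names are hypothetical.

```shell
# Total number of bases for genome size $1 at coverage $2.
target_bases() {
    echo $(( $1 * $2 ))
}

# Read one read length per line on stdin and print how many of the
# longest reads are needed to reach $1 bases. If the reads fall short
# of the target, print the total number of reads instead.
reads_needed() {
    sort -nr | awk -v target="$1" '
        { total += $1; n++; if (total >= target) { print n; reached = 1; exit } }
        END { if (!reached) print n + 0 }'
}
```

For the E. coli example, `target_bases 4800000 25` prints 120000000. Piping read lengths into `reads_needed 120000000` then gives a rough idea of how many corrected reads a 25X subset contains.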
Assembling Nanopore Data
Download MAP006-PCR-1_2D_pass.fasta.
- Step 1, using mecat2pw to detect overlapping candidates
mecat2pw -j 0 -d MAP006-PCR-1_2D_pass.fasta -o candidates.txt -w wrk_dir -t 16 -x 1
- Step 2, correct the noisy reads based on their pairwise overlapping candidates.
mecat2cns -i 0 -t 16 -x 1 candidates.txt MAP006-PCR-1_2D_pass.fasta corrected_ecoli.fasta
- Step 3, extract the longest 25X corrected reads
extract_sequences corrected_ecoli.fasta corrected_ecoli_25x.fasta 4800000 25
- Step 4, assemble the longest 25X corrected reads using mecat2canu
mecat2canu -trim-assemble -p ecoli -d ecoli genomeSize=4800000 ErrorRate=0.06 maxMemory=40 maxThreads=16 useGrid=0 Overlapper=mecat2asmpw -nanopore-corrected corrected_ecoli_25x.fasta
After step 4, the assembled genome is written to the file ecoli/ecoli.contigs.fasta. Details of the contigs can be found in the file ecoli/ecoli.layout.tigInfo.
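The PacBio and Nanopore walkthroughs differ in only three places: the -x 1 flag in steps 1 and 2, the ErrorRate value, and the -pacbio-corrected or -nanopore-corrected option in step 4. A minimal sketch of a wrapper that makes this explicit is given below. The function names, the intermediate file names, and the DRY_RUN switch are hypothetical. The command-line options are copied from the steps above, not taken from the full option reference.

```shell
# Echo the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: mecat_assemble PLATFORM READS PREFIX GENOME_SIZE THREADS
# PLATFORM is "pacbio" or "nanopore".
mecat_assemble() {
    platform=$1 reads=$2 prefix=$3 gsize=$4 threads=$5
    case "$platform" in
        pacbio)   xflag="";     err=0.02; kind=-pacbio-corrected ;;
        nanopore) xflag="-x 1"; err=0.06; kind=-nanopore-corrected ;;
        *) echo "unknown platform: $platform" >&2; return 2 ;;
    esac
    cand="$prefix.candidates.txt"            # hypothetical file names
    corrected="corrected_$prefix.fasta"
    longest="corrected_${prefix}_25x.fasta"
    # $xflag is left unquoted on purpose so that "-x 1" splits into two words.
    run mecat2pw -j 0 -d "$reads" -o "$cand" -w wrk_dir -t "$threads" $xflag &&
    run mecat2cns -i 0 -t "$threads" $xflag "$cand" "$reads" "$corrected" &&
    run extract_sequences "$corrected" "$longest" "$gsize" 25 &&
    run mecat2canu -trim-assemble -p "$prefix" -d "$prefix" \
        genomeSize="$gsize" ErrorRate="$err" maxMemory=40 \
        maxThreads="$threads" useGrid=0 Overlapper=mecat2asmpw \
        "$kind" "$longest"
}
```

Setting `DRY_RUN=1` before calling `mecat_assemble nanopore MAP006-PCR-1_2D_pass.fasta ecoli 4800000 16` prints the four commands without running them, which is a cheap way to review them before a long run. The steps are chained with `&&`, so a failing step stops the pipeline.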
<a name="S-input-format"></a>Input Format
MECAT is capable of processing FASTA, FASTQ, and H5 format files. However, H5 files must first be converted to FASTA format by running DEXTRACTOR/dextract before running MECAT. For example:
find pathto/raw_reads -name "*.bax.h5" -exec readlink -f {} \; > reads.fofn
while read -r line; do dextract -v "$line" >> reads.fasta ; done < reads.fofn
The resulting reads.fasta file is then used as the input file for MECAT.
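Before starting a long run it can help to confirm that an input file really is FASTA or FASTQ. FASTA records begin with `>` and FASTQ records begin with `@`, so the first character of the file is enough for a quick check. The helper below is a hypothetical sketch and is not part of MECAT.

```shell
# Print "fasta", "fastq", or "unknown" based on the first character
# of the file given as $1.
detect_format() {
    first=$(head -c 1 "$1")
    case "$first" in
        '>') echo fasta ;;
        '@') echo fastq ;;
        *)   echo unknown ;;
    esac
}
```

A raw .bax.h5 file yields `unknown` here, which is a reminder to convert it with dextract first.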
<a name="S-program-description"></a>Program Descriptions
We describe in detail each module of MECAT, includin