Mabs
Mabs, a genome assembly tool that optimizes parameters of Hifiasm and Flye
Install / Use
/learn @shelkmike/MabsREADME
Mabs is a genome assembly tool that optimizes parameters of genome assemblers Hifiasm and Flye.<br><br> The core idea of Mabs is to optimize parameters of a genome assembler to make an assembly where protein-coding genes are assembled more accurately than when the assembler is run with its default parameters.<br><br> Briefly, Mabs works as follows:
- It makes a series of genome assemblies by Hifiasm or Flye, using different values of parameters of these programs. Mabs uses several tricks to accelerate the assembly process.
- For each genome assembly, Mabs evaluates the quality of BUSCO genes' assembly using a special metric that I call "AG". For how AG is calculated, see calculate_AG.
- The genome assembly with the largest AG is considered the best.
Mabs is, on average, 3 times slower than Hifiasm or Flye, but usually produces better or equal assemblies. For details, see this article.
Table of Contents
Installation
Mabs requires Python 3, Perl 5, GCC, Zlib-dev, Make.<br> Mabs should be installed in two steps:<br>
- Install Python libraries Pandas, Plotnine, SciPy. This can be done, for example, by running the following two commands one after the other:<br>
pip3 install --upgrade --user pip<br>pip3 install --upgrade --user Pandas Plotnine scipy<br> - Download the latest version of Mabs from Releases. Extract the archive and run<br>
bash install.sh
<br> Alternatively, you can use a Singularity container with Mabs: https://mikeshelk.site/Diff/Mabs_distribution/Singularity_containers/ . To see information on how to use Mabs from the container, run a command "singularity run-help mabs.sif". <br><br>
How to use
Two main components of Mabs are Mabs-hifiasm and Mabs-flye. Mabs-hifiasm works as a parameter optimizer of Hifiasm, while Mabs-flye works as a parameter optimizer of Flye.
a) Mabs-hifiasm
Mabs-hifiasm is intended for PacBio HiFi (also known as CCS) reads and high-accuracy Nanopore reads (Nanopore reads made on flow cell 10.4.1 or newer). <br> To run Mabs-hifiasm, a user should provide two values:
- A path to reads, via the option "--long_reads".
- A BUSCO dataset. In the process of parameters optimization, Mabs uses a BUSCO dataset. The dataset can be provided using either the option "--download_busco_dataset", or the option "--local_busco_dataset". The option "--download_busco_dataset" requires a filename from http://mikeshelk.site/Data/BUSCO_datasets/BUSCO_datasets_current_as_of_2024-03-22/ (for older versions of Mabs, the link was different, run the older versions with the option "--help" to see the link). It is recommended to use the most taxonomically narrow dataset. For example, if you assemble a drosophila genome, use "--download_busco_dataset diptera_odb10.2024-01-08.tar.gz". Alternatively, you can download a dataset to your computer manually, and use the option "--local_busco_dataset". For example, "--local_busco_dataset /home/username/Work/diptera_odb10.2024-01-08.tar.gz". Take into account that odb12 datasets are currently incompatible with Mabs, so please use odb10.
To see the full list of options, run
mabs-hifiasm.py --help
If you have several files with PacBio HiFi and high-accuracy Nanopore reads, concatenate them in one file before giving to Mabs-hifiasm.
Since Mabs-hifiasm is based on Hifiasm (https://github.com/chhylp123/hifiasm), it can use paired-end Hi-C reads for haplotype phasing (but not for scaffolding). Provide trimmed Hi-C reads with options "--short_hi-c_reads_R1" and "--short_hi-c_reads_R2". If you want to do Hi-C scaffolding, you can use YaHS (https://github.com/c-zhou/yahs) with contigs made by Mabs.
Mabs-hifiasm can also use ultra-long (N50 > 50 kbp) low-accuracy Nanopore reads to make assembly more contiguous. Provide them with the option "--ultralong_nanopore_reads".
Examples of using Mabs-hifiasm.
Example 1:<br>
mabs-hifiasm.py --long_reads hifi_reads.fastq.gz --download_busco_dataset eudicots_odb10.2024-01-08.tar.gz --threads 40
Example 2:<br>
mabs-hifiasm.py --long_reads hifi_reads.fastq.gz --short_hi-c_reads_R1 hi-c_reads_trimmed_R1.fastq.gz --short_hi-c_reads_R2 hi-c_reads_trimmed_R2.fastq.gz --ultralong_nanopore_reads ultralong_reads.fastq.gz --download_busco_dataset diptera_odb10.2024-01-08.tar.gz --threads 40
<br><br>
b) Mabs-flye
Mabs-flye is intended for PacBio CLR reads (also known as "old PacBio reads") and low-accuracy Nanopore reads (i.e. Nanopore reads made on flow cells older than 10.4.1). Similarly to Mabs-hifiasm, Mabs-flye requires two values:
- A path to reads, provided via options "--nanopore_reads", "--pacbio_clr_reads" or "--pacbio_hifi_reads". If you have several read datasets created by different technologies, these options can be used simultaneously. Although Mabs-flye can use HiFi reads and high-accuracy Nanopore reads, Mabs-hifiasm will probably make a better assembly from such reads than Mabs-flye.
- A BUSCO dataset, provided via options "--download_busco_dataset" or "--local_busco_dataset". For details, see "2." in the description of Mabs-hifiasm above.
To see the full list of options, run
mabs-flye.py --help
Examples of using Mabs-flye.
Example 1:<br>
mabs-flye.py --nanopore_reads nanopore_reads.fastq.gz --download_busco_dataset eudicots_odb10.2024-01-08.tar.gz --threads 40
Example 2:<br>
mabs-flye.py --nanopore_reads nanopore_reads.fastq.gz --pacbio_hifi_reads pacbio_hifi_reads.fastq.gz --download_busco_dataset diptera_odb10.2024-01-08.tar.gz --threads 40
<br><br>
c) The output of Mabs
Both Mabs-hifiasm and Mabs-flye have a similar output structure. Both of them create a folder which, by default, is named "Mabs_results". The name can be changed via the "--output_folder" option. The two main files that a user may need are:
- ./Mabs_results/mabs_log.txt<br> This file contains information on how Mabs-hifiasm or Mabs-flye run and whether they encountered any problems.
- ./Mabs_results/The_best_assembly/assembly.fasta<br> These are the contigs you need. <br><br>
d) Testing Mabs-hifiasm and Mabs-flye
If you are not sure whether Mabs-hifiasm and Mabs-flye have been installed properly, you can run
mabs-hifiasm.py --run_test
or
mabs-flye.py --run_test
These two commands assemble the first chromosome of <i>Saccharomyces cerevisiae</i>, which is approximately 200 kbp. If, after the assembly finishes, you see a file ./Mabs_results/The_best_assembly/assembly.fasta that is slightly larger than 200 KB, Mabs works correctly. <br><br>
calculate_AG
Besides Mabs-hifiasm and Mabs-flye, Mabs contains a third tool, named calculate_AG. Its purpose is to assess the genome assembly quality.
<br><br>
calculate_AG is used internally by Mabs-hifiasm and Mabs-flye, but also can be used externally if a user wants to assess the quality of some assembly.<br><br>
The main concept in calculate_AG is "AG", which is the metric of gene assembly quality used by Mabs. "AG" is short for "the number of Accurately assembled Genes". It is calculated as the sum of the following two values:<br>
a) The number of genes in single-copy BUSCO orthogroups.<br>
b) The number of genes in true multicopy orthogroups. "True multicopy" means that there is more than one gene in these orthogroups not because of assembly errors, but because these genes are actual paralogs. In contrast, the number of genes in false multicopy orthogroups (the orthogroups where genes' duplications are because of assembly errors) is not included in AG.<br><br>
AG, in my opinion, may be a better metric of gene assembly quality than BUSCO results, because BUSCO does not differentiate true multicopy orthogroups and false multicopy orthogroups, combining them into a single "D" category.<br>
A frequent cause of false multicopy orthogroups are haplotypic duplications, when two alleles of a gene are erroneously assembled as paralogs.<br>
calculate_AG differentiates true multicopy orthogroups from false multicopy orthogroups based on gene coverage, since if a duplication is an assembly error, the gene coverage should be decreased.<br>
<br>
Basically, the larger AG is, the better the assembly is.
<br><br>
If a user wants to calculate AG for some genome assembly, he can use a command like:<br>
calculate_AG.py --assembly contigs.fasta --nanopore_reads nanopore_reads.fastq.gz --local_busco_dataset /mnt/lustre/username/Datasets/eudicots_odb10 --threads 40
<br><br>
For more options, run<br>
calculate_AG.py --help
<br><br>
The main file produced by calculate_AG is ./AG_calculation_results/AG.txt . It contains a single number, which is the AG.
In addition, calculate_AG produces figures <i>gene_coverage_distribution.svg</i> and <i>gene_coverage_distribution.png</i> which look like this:<br>
