.. image:: https://github.com/jtamames/SqueezeMeta/blob/images/logo.svg
   :width: 20%
   :align: right
   :alt: SqueezeMeta logo
SqueezeMeta: a fully automated metagenomics pipeline, from reads to bins
========================================================================
SqueezeMeta is a fully automatic pipeline for metagenomics/metatranscriptomics, covering all steps of the analysis. SqueezeMeta includes multi-metagenome support, allowing the co-assembly of related metagenomes and the retrieval of individual metagenome-assembled genomes (MAGs) via binning procedures. SqueezeMeta's main features are:
- Several assembly and co-assembly algorithms and strategies for short and long reads
- Several binning algorithms for the recovery of metagenome-assembled genomes (MAGs)
- Taxonomic annotation, functional annotation and quantification of genes, contigs, and bins
- Support for the annotation and quantification of pre-existing assemblies or collections of genomes
- Support for de-novo metatranscriptome assembly and hybrid metagenomics/metatranscriptomics projects
- Support for the annotation of unassembled shotgun metagenomic reads
- An R package to easily explore your results, including bindings for
  `microeco <https://chiliubio.github.io/microeco/>`_ and
  `phyloseq <https://joey711.github.io/phyloseq/>`_
SqueezeMeta uses a combination of custom scripts and external software packages for the different steps of the analysis:
- Assembly
- RNA prediction and classification
- ORF (CDS) prediction
- Homology searching against taxonomic and functional databases
- HMMER searching against the Pfam database
- Taxonomic assignment of genes
- Functional assignment of genes (OPTIONAL)
- Blastx on parts of the contigs with no gene prediction or no hits
- Taxonomic assignment of contigs, and check for taxonomic disparities
- Coverage and abundance estimation for genes and contigs
- Estimation of taxa abundances
- Estimation of function abundances
- Merging of previous results to obtain the ORF table
- Binning with different methods
- Binning integration with DAS tool
- Taxonomic assignment of bins, and check for taxonomic disparities
- Checking of bins with CheckM2 (and optionally classify them with GTDB-Tk)
- Merging of previous results to obtain the bin table
- Merging of previous results to obtain the contig table
- Prediction of KEGG and MetaCyc pathways for each bin
- Final statistics for the run
- Generation of tables with aggregated taxonomic and functional profiles
Detailed information about the different steps of the pipeline can be
found in the `documentation <https://squeezemeta.readthedocs.io>`_.
Documentation
-------------

- The documentation for SqueezeMeta and SQMtools is available in
  `ReadTheDocs <https://squeezemeta.readthedocs.io>`_.
- The `wiki <https://github.com/jtamames/SqueezeMeta/wiki>`_ contains
  extra examples on how to use certain features of SqueezeMeta/SQMtools.
- You can also check the SqueezeMeta paper
  `here <https://www.frontiersin.org/articles/10.3389/fmicb.2018.03349/full>`_
  and a second paper on how to analyse the output of SqueezeMeta
  `here <https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03703-2>`_.
Installation
------------

SqueezeMeta is intended to be run on an x86-64 Linux OS (tested on Ubuntu and CentOS). The easiest way to install it is by using conda. The default conda solver can however be slow at solving the dependencies, so it is better to first set up the libmamba solver with
::

   conda update -n base conda                    # if your conda version is below 22.11
   conda install -n base conda-libmamba-solver
   conda config --set solver libmamba
and then use conda to install SqueezeMeta

::

   conda create -n SqueezeMeta -c conda-forge -c bioconda -c fpusan squeezemeta=1.7 --no-channel-priority --override-channels
If you change ``squeezemeta`` to ``squeezemeta-dev`` you will instead
get the latest development version. This will contain additional bugfixes
and features, but potentially also new bugs, as it will not have been
tested as thoroughly as the stable version.
(If the environment does not solve and you get a message saying that
``__cuda`` is missing in your system, try adding ``CONDA_OVERRIDE_CUDA=12.4``
before the installation command: ``CONDA_OVERRIDE_CUDA=12.4 conda create ...``)
The command above will create a new conda environment named SqueezeMeta, which must then be activated.

::

   conda activate SqueezeMeta
When using conda, all the scripts from the SqueezeMeta distribution will
be available on ``$PATH``.
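A quick way to confirm this is checking whether the main script can be resolved from the shell; this is just a minimal sketch, using ``SqueezeMeta.pl`` as a representative script name:

```shell
# Check that the main SqueezeMeta script is reachable from the shell
if command -v SqueezeMeta.pl >/dev/null 2>&1; then
    SQM_STATUS="found"
    echo "SqueezeMeta.pl is on PATH: $(command -v SqueezeMeta.pl)"
else
    SQM_STATUS="missing"
    echo "SqueezeMeta.pl not on PATH; did you run 'conda activate SqueezeMeta'?"
fi
```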
Alternatively, you can download the latest release from the GitHub
repository and uncompress the tarball in a suitable directory. The
tarball includes the SqueezeMeta scripts as well as the `third-party
software <https://squeezemeta.readthedocs.io/en/stable/installation.html#vendored-tools>`_
redistributed with SqueezeMeta. Note that you may need to provide
additional dependencies, and potentially recompile some binaries from
source, in order for the manual install to work.
The conda method is now the recommended way to install SqueezeMeta,
and we will not prioritize support for issues regarding manual installation.
The ``test_install.pl`` script can be run in order to check whether the
required dependencies are available in your environment.

::

   /path/to/SqueezeMeta/utils/install_utils/test_install.pl
Downloading or building databases
---------------------------------

SqueezeMeta uses several databases: GenBank nr for taxonomic assignment, and eggNOG, KEGG and Pfam for functional assignment. The script ``download_databases.pl`` can be run to download a pre-formatted version of all the databases required by SqueezeMeta.

::

   /path/to/SqueezeMeta/utils/install_utils/download_databases.pl /download/path
where ``/download/path`` is the destination folder. This is the
recommended option, but the files are hosted on our institutional
server, which can at times be unreachable.
Alternatively, the script ``make_databases.pl`` can be run to download
the latest version of the databases from their sources and format them.

::

   /path/to/SqueezeMeta/utils/install_utils/make_databases.pl /download/path
Generally, ``download_databases.pl`` is the safest choice for getting
your databases set up. When running ``make_databases.pl``, the data download
(e.g. from the NCBI server) can be interrupted, leading to a corrupted
database. Always run ``test_install.pl`` to check that the database was
properly created. Otherwise, you can try re-running
``make_databases.pl``, or just run ``download_databases.pl`` instead.
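The recovery logic above can be sketched as follows; the paths are placeholders, and treating a non-zero exit status of ``test_install.pl`` as a failure indicator is an assumption:

```shell
# Placeholder paths; substitute your actual install and download locations
SQM_UTILS="/path/to/SqueezeMeta/utils/install_utils"
DB_PATH="/download/path"

# Assumption: test_install.pl exits with non-zero status when something is broken
if ! "$SQM_UTILS/test_install.pl" >/dev/null 2>&1; then
    # Retry the from-source build; fall back to the pre-formatted download
    "$SQM_UTILS/make_databases.pl" "$DB_PATH" \
        || "$SQM_UTILS/download_databases.pl" "$DB_PATH" \
        || echo "Both attempts failed; check your network and disk space"
fi
```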
The databases occupy 470 GB, but we recommend having at least 700 GB of free disk space during the building process.
Two directories will be generated after running either
``make_databases.pl`` or ``download_databases.pl``:

- ``/download/path/db``, which contains the actual databases.
- ``/download/path/test``, which contains data for a test run of SqueezeMeta.
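A minimal sketch for checking that both directories were actually created (``/download/path`` is a placeholder):

```shell
# Placeholder path; substitute the folder you passed to the database script
DOWNLOAD_PATH="/download/path"

# Check that both expected directories exist
for d in "$DOWNLOAD_PATH/db" "$DOWNLOAD_PATH/test"; do
    if [ -d "$d" ]; then
        echo "OK: $d is present"
    else
        echo "MISSING: $d (was the download interrupted?)"
    fi
done
```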
If the SqueezeMeta databases are already built in another location in the system, a different copy of SqueezeMeta can be configured to use them with

::

   /path/to/SqueezeMeta/utils/install_utils/configure_nodb.pl /path/to/db
where ``/path/to/db`` is the path to the ``db`` folder that was
generated by either ``make_databases.pl`` or ``download_databases.pl``.
After configuring the databases, the ``test_install.pl`` script can be run in
order to check that SqueezeMeta is ready to work (see previous section).
Testing SqueezeMeta
-------------------

The ``download_databases.pl`` and ``make_databases.pl`` scripts also
download two datasets for testing that the program is running correctly.
Assuming either was run with the directory ``/download/path`` as its
target, the test run can be executed with

::

   cd /download/path/test
   SqueezeMeta.pl -m coassembly -p Hadza -s test.mock.samples -f raw
Alternatively, ``-m sequential`` or ``-m merged`` can be used.
In addition to this mock dataset, we also provide two real metagenomes. A test run on those can be executed with

::

   SqueezeMeta.pl -m coassembly -p Hadza -s test.samples -f raw
Updating SqueezeMeta
--------------------

Assuming your databases are not inside the SqueezeMeta directory, just remove it, download the new version, and configure it with

::

   /path/to/SqueezeMeta/utils/install_utils/configure_nodb.pl /path/to/db
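As a sketch, assuming the databases live outside the SqueezeMeta directory and that you have already downloaded the new release tarball (all paths and the tarball name below are placeholders):

```shell
# All paths and the tarball name are placeholders; adjust to your setup
SQM_DIR="$HOME/SqueezeMeta"                    # existing installation
DB_DIR="/path/to/db"                           # databases (outside SQM_DIR!)
NEW_TARBALL="$HOME/SqueezeMeta-latest.tar.gz"  # freshly downloaded release

if [ -f "$NEW_TARBALL" ]; then
    rm -rf "$SQM_DIR"                          # remove the old version
    tar -xvzf "$NEW_TARBALL" -C "$HOME"        # unpack the new release
    "$SQM_DIR/utils/install_utils/configure_nodb.pl" "$DB_DIR"  # reuse the dbs
fi
```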
Usage considerations
--------------------

Choosing an assembly strategy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SqueezeMeta can be run in four different assembly modes, depending on the type of multi-metagenome support. These modes are:
- Sequential mode: All samples are treated individually and analysed
  sequentially.
- Coassembly mode: Reads from all samples are pooled and a single
  assembly is performed. Then reads from individual samples are mapped to
  the coassembly to obtain gene abundances in each sample. Binning
  methods allow the recovery of genome bins.
- Merged mode: If many big samples are available, co-assembly could
  crash because of memory requirements. This mode achieves a comparable
  result with a procedure inspired by
  `the one used by Benjamin Tully for analysing TARA Oceans data <https://dx.doi.org/10.17504/protocols.io.hfqb3mw>`_.
  Briefly, samples are assembled individually and the resulting contigs
  are merged into a single co-assembly. Then the analysis proceeds as in
  the co-assembly mode. This is not the recommended procedure (use
  co-assembly if possible), since the chance of creating chimeric contigs
  is higher. But it is a viable alternative for smaller computers in
  which standard co-assembly is not feasible.
- Seqmerge mode: This is intended to work with more samples than the
  merged mode. Instead of merging all individual assemblies in a single
  step, which can be very comput
