# NovelTree: Highly parallelized phylogenomic inference

NovelTree is a highly parallelized and computationally efficient phylogenomic workflow that infers gene families, gene family trees, species trees, and gene family evolutionary histories.
Arcadia-Science/noveltree is a Nextflow pipeline for phylogenomic inference from whole-proteome amino acid data, automating orthology inference, multiple sequence alignment, gene-family and species tree estimation, and reconciliation-based evolutionary analysis. Input proteomes can be preprocessed using the built-in `--preprocess` flag or filtered externally (see the preprocessing scripts).

NovelTree is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a highly portable manner. It uses Docker containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes software dependencies much easier to maintain and update.
Detailed documentation: for thorough descriptions of samplesheet preparation, all parameters, per-module options, and output files, see `docs/usage.md` and `docs/outputs.md`. This README provides a concise overview to get started quickly.
## Quick Start
NOTE: NovelTree is not currently compatible with Apple silicon/ARM architectures (e.g. M1, M2 chips).
1. Install Nextflow (`>=21.10.3`).
2. Install Docker.
3. Run the pipeline with the minimal test dataset:

```bash
nextflow run . -profile docker,test --outdir results
```

To constrain resource usage (e.g. on a laptop), specify limits:

```bash
nextflow run . -profile docker,test --outdir results --max_cpus 12 --max_memory 16GB
```

Set `--max_memory` roughly 2 GB below your available memory to leave room for Nextflow overhead.
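The headroom rule of thumb above can be sketched as a tiny helper. This is a hypothetical function for illustration only, not part of NovelTree:

```python
def suggest_max_memory(available_gb: float, headroom_gb: float = 2.0) -> str:
    """Suggest a --max_memory value that leaves headroom for Nextflow itself.

    Hypothetical helper illustrating the "available minus ~2 GB" rule of
    thumb from this README; the pipeline does not ship this function.
    """
    usable = max(1, int(available_gb - headroom_gb))
    return f"{usable}GB"

# A 16 GB laptop would pass --max_memory 14GB.
print(suggest_max_memory(16))
```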
Note: Pre-built Docker images are pulled automatically. You only need `make docker-all` if you've modified the pipeline code.
NOTE: The workflow supports both Docker and Singularity profiles.
## Samplesheet
NovelTree takes a CSV samplesheet as input. Only 3 columns are required:

```csv
species,input_data,input_type
Homo-sapiens,UP000005640,proteins
Mus-musculus,GCF_000001635.27,proteins
Drosophila-melanogaster,/path/to/Dmel.fasta,proteins
Saccharomyces-cerevisiae,https://example.com/Scer.fasta.gz,proteins
```
| Column | Description |
|--------|-------------|
| `species` | Species name in `Genus-species` format |
| `input_data` | Local file path, URL, UniProt proteome ID (`UP*`), or NCBI accession (`GCF_*`/`GCA_*`) |
| `input_type` | `proteins` or `transcriptome` |
Optional columns (`has_uniprot_ids`, `transdecoder`, `filter_isoforms`, `reference_proteome`, `include_in_mcl_test`, `busco_shallow`, `busco_broad`) can be added in any order after the required 3. All default to `no` or `NA`. See the full samplesheet documentation for details on all columns, data source types, and preprocessing options.
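To make the four `input_data` source types concrete, here is an illustrative sketch of how an entry could be classified. This is not the pipeline's actual detection logic, just a guess at the patterns implied by the table above:

```python
import re

def classify_input_data(value: str) -> str:
    """Classify a samplesheet input_data entry by source type.

    Illustrative only: mirrors the four source types listed in the README
    (UniProt proteome ID, NCBI accession, URL, local path); NovelTree's
    real detection logic may differ.
    """
    if re.fullmatch(r"UP\d+", value):
        return "uniprot"    # UniProt proteome ID, e.g. UP000005640
    if re.fullmatch(r"GC[FA]_\d+(\.\d+)?", value):
        return "ncbi"       # NCBI assembly accession, GCF_*/GCA_*
    if value.startswith(("http://", "https://", "ftp://", "s3://")):
        return "url"        # remote file to download
    return "local"          # otherwise, treat as a local file path

for v in ["UP000005640", "GCF_000001635.27",
          "/path/to/Dmel.fasta", "https://example.com/Scer.fasta.gz"]:
    print(v, "->", classify_input_data(v))
```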
## Workflow Modes
NovelTree supports three workflow modes to accommodate different use cases and computational constraints:
| Feature | Full | Simplified | Zoogle |
| ---------------------------- | :------: | :--------: | :------: |
| BUSCO quality assessment | ✓ | ✗ | ✗ |
| Default aligner | Adaptive | Adaptive | Adaptive |
| Per-family GeneRax | ✓ | ✗ | ✗ |
| Per-species GeneRax | ✓ | ✓ | ✓ |
| GeneRax strategy | SPR | EVAL | EVAL |
| Phylogenetic profiles | ✓ | ✓ | ✓ |
| Physicochemical properties | ✗ | ✗ | ✓ |
| Time-calibrated species tree | ✗ | ✗ | ✓ |
| Phylo-dist analysis | ✗ | ✗ | ✓ |
Adaptive mode routes families through MAFFT (≤200 seqs), WITCH (≤3000), and FAMSA (>3000).
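The adaptive routing described above amounts to a simple size threshold. A minimal sketch, using the thresholds stated in this README (the pipeline's internal implementation may differ):

```python
def choose_aligner(n_seqs: int) -> str:
    """Route a gene family to an aligner by its sequence count.

    Thresholds follow the README's adaptive tiers:
    MAFFT for <=200 sequences, WITCH for <=3000, FAMSA beyond that.
    """
    if n_seqs <= 200:
        return "mafft"
    if n_seqs <= 3000:
        return "witch"
    return "famsa"

# Small, medium, and large families land in different tiers.
for n in (150, 1500, 5000):
    print(n, "->", choose_aligner(n))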
**Which mode should I use?** Use simplified mode (the default) for most analyses. Use full for smaller datasets (≤30 species) where you want additional analyses (BUSCO, per-family GeneRax). Use zoogle when you need physicochemical distance analysis for organism prioritization.
### Full Mode

The complete pipeline with all optional analyses enabled. Best for comprehensive phylogenomic studies where accuracy is prioritized over speed.

```bash
nextflow run . -profile docker --input samplesheet.csv --outdir results
```
### Simplified Mode (Default)

A streamlined variant optimized for large datasets. Skips BUSCO quality assessment, runs only per-species GeneRax with the faster EVAL strategy, and skips per-family GeneRax analysis.

```bash
nextflow run . -profile docker,simplified --input samplesheet.csv --outdir results
```
### Zoogle Mode

Inherits simplified mode settings and adds analyses for organism prioritization: physicochemical protein properties, time calibration of the species tree, and phylogenetically corrected protein distance analysis. Optionally specify a reference species for pairwise distance analysis, or use `--ref_species none` for centroid-only analysis.
Recommended (auto-build a reference chronogram from TimeTree.org):

```bash
nextflow run . -profile docker,zoogle \
    --input samplesheet.csv \
    --outdir results \
    --ncbi_email user@example.com \
    --ref_species Genus-species
```
The pipeline queries TimeTree.org for pairwise divergence times among species in your samplesheet and builds a UPGMA reference chronogram automatically.
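To illustrate the UPGMA step, here is a naive, self-contained sketch that turns a symmetric matrix of pairwise path lengths (2 × divergence age) into an ultrametric Newick tree. The species names and ages below are illustrative, and NovelTree's actual implementation and Newick formatting may differ:

```python
def upgma(names, dist):
    """Naive UPGMA on a symmetric distance matrix.

    dist[i][j] is the tree path length between taxa i and j, i.e. twice
    the divergence age, so node heights come out as age = dist / 2.
    Returns a Newick string with ultrametric branch lengths.
    """
    # cluster id -> (newick string, size, height above the tips)
    clusters = {i: (names[i], 1, 0.0) for i in range(len(names))}
    d = {frozenset((i, j)): float(dist[i][j])
         for i in range(len(names)) for j in range(i + 1, len(names))}
    nxt = len(names)
    while len(clusters) > 1:
        pair, dmin = min(d.items(), key=lambda kv: kv[1])
        a, b = sorted(pair)
        na, sa, ha = clusters[a]
        nb, sb, hb = clusters[b]
        h = dmin / 2  # height of the new internal node
        # Average-linkage update against every remaining cluster.
        for c in list(clusters):
            if c in (a, b):
                continue
            dc = (d.pop(frozenset((a, c))) * sa
                  + d.pop(frozenset((b, c))) * sb) / (sa + sb)
            d[frozenset((nxt, c))] = dc
        del d[pair]
        del clusters[a], clusters[b]
        clusters[nxt] = (f"({na}:{h - ha:g},{nb}:{h - hb:g})", sa + sb, h)
        nxt += 1
    return next(iter(clusters.values()))[0] + ";"

names = ["Homo-sapiens", "Mus-musculus", "Gallus-gallus"]
# Path lengths = 2 x divergence age in Mya (ages here are illustrative).
dist = [[0, 180, 640],
        [180, 0, 640],
        [640, 640, 0]]
print(upgma(names, dist))
```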
Alternative (provide your own reference tree):

```bash
nextflow run . -profile docker,zoogle \
    --input samplesheet.csv \
    --outdir results \
    --reference_time_tree /path/to/reference_timetree.newick \
    --ref_species Genus-species
```
## Running on AWS Batch
NovelTree includes a dedicated AWS Batch profile optimized for cloud-scale analyses:
```bash
nextflow run . \
    -profile awsbatch \
    --awsqueue <your-batch-queue> \
    --awsregion <your-aws-region> \
    -work-dir s3://<your-bucket>/work \
    --outdir s3://<your-bucket>/results \
    --input s3://<your-bucket>/samplesheet.csv
```
The `awsbatch` profile includes optimized executor settings (a queue size of 1000 jobs) and automatic report overwriting for seamless pipeline resumption.
Requirements:

- An AWS Batch compute environment and job queue configured
- Work directory (`-work-dir`) and output directory (`--outdir`) must be S3 paths
- Input samplesheet and proteome files accessible from S3
- Appropriate IAM permissions for Batch and S3 access
See the Nextflow Tower publication example in usage.md for cloud-scale configuration tips.
## Running with Singularity
NovelTree supports Singularity as an alternative to Docker, which is useful for HPC environments where Docker may not be available:
```bash
nextflow run . -profile singularity --input samplesheet.csv --outdir results
```
Docker images are automatically pulled and converted to Singularity format. Converted images are cached in `${outdir}/singularity_cache` to avoid repeated conversions on subsequent runs.
For detailed Singularity instructions, see the Singularity documentation.
## Building Docker Images
Pre-built Docker images are pulled automatically when running the pipeline. If you've modified the pipeline code or are using a custom fork, rebuild with:
```bash
make docker-all
```
Building R-based images (zoogle) may take 15-20 minutes due to package compilation. Images are built for `linux/amd64`.

The `bin/zoogle/` directory contains code vendored from the 2024-organismal-selection repository. See `bin/zoogle/README.md` for provenance details.
## How it works
- Orthology inference — OrthoFinder normalizes sequence similarity scores and clusters proteins into gene families via MCL. An optional test step selects the best MCL inflation parameter using InterPro domain coherence (COGEQC).
- Alignment & trimming — Adaptive three-tier alignment (MAFFT → WITCH → FAMSA by family size), trimmed with ClipKIT.
- Tree inference — Gene family trees via IQ-TREE (FastTree fallback). Species tree via SpeciesRax (and optionally Asteroid).
- Reconciliation — GeneRax reconciles gene/species trees, estimating duplication and loss rates. Ortholog/paralog relationships and HOGs are parsed from reconciliation output.
- Phylogenetic profiles — Species × gene-family matrices of duplication, loss, and speciation events per species-tree node per gene family.
- Zoogle analyses (zoogle mode) — Physicochemical protein properties, time-calibrated trees, and phylogenetically-corrected protein distances for organism prioritization.
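The phylogenetic-profile step above can be sketched as a simple event count. The event tuples below are made up for illustration (the event codes loosely mirror GeneRax's speciation/duplication/loss types), and NovelTree's actual parsing and matrix layout may differ:

```python
from collections import Counter

# Hypothetical reconciliation events: (gene family, species-tree node, type),
# where type is "S" (speciation), "D" (duplication), or "L" (loss).
events = [
    ("OG0000001", "Homo-sapiens", "D"),
    ("OG0000001", "Homo-sapiens", "S"),
    ("OG0000001", "Mus-musculus", "L"),
    ("OG0000002", "Mus-musculus", "S"),
]

# One profile per event type: (species-tree node, gene family) -> count.
profiles = {etype: Counter() for etype in "SDL"}
for family, node, etype in events:
    profiles[etype][(node, family)] += 1

# OG0000001 shows one duplication on the Homo-sapiens branch.
print(profiles["D"][("Homo-sapiens", "OG0000001")])
```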
The pipeline distributes tasks in a highly parallel manner across available computational resources, supporting local execution, AWS Batch, and SLURM schedulers (see Nextflow executor documentation).
## Pipeline overview

```mermaid
flowchart TD
    INPUT["Samplesheet + Proteomes"] --> PREP["PREPARE_INPUTS<br/>Download · Preprocess · Rename"]
    PREP --> BUSCO_Q{"BUSCO?<br/>(full mode)"}
    BUSCO_Q -.->|yes| BUSCO["BUSCO<br/>Shallow + Broad QC"]
    PREP --> ORTHO
    subgraph ORTHO["INFER_ORTHOGROUPS"]
        direction LR
        MCL_SEL["MCL inflation<br/>selection<br/><i>(optional)</i>"] --> OF_PREP["OrthoFinder Prep<br/>+ DIAMOND"] --> MCL["MCL Clustering<br/>+ Filtering"]
    end
    ORTHO -->|"conservative subset<br/>(high coverage, low copy #)"| GT1["INFER_GENE_TREES<br/>species-tree families"]
    ORTHO -->|"remaining subset<br/>(≥4 species)"| GT2["INFER_GENE_TREES<br/>remaining families"]
```
