Oncodrive3D

Oncodrive3D is a fast and accurate computational method designed to analyze patterns of somatic mutation across tumors, with the goal of identifying three-dimensional (3D) clusters of missense mutations and detecting genes under positive selection.

The method leverages AlphaFold 2-predicted protein structures and Predicted Aligned Error (PAE) to define residue contacts within the protein's 3D space. When available, it integrates mutational profiles to build an accurate background model of neutral mutagenesis. By applying a novel rank-based statistical approach, Oncodrive3D scores potential 3D clusters and computes empirical p-values.

Graphical abstract of Oncodrive3D

Requirements

Before you begin, ensure Python 3.10 or later is installed on your system.
You may also need to install development tools (compilers and build utilities). Depending on your environment, choose one of the following methods:

If you have sudo privileges:
```
sudo apt install build-essential
```

For HPC cluster environments, it is recommended to use Conda (or Mamba):

conda create -n o3d python=3.10.0
conda activate o3d
conda install -c conda-forge gxx gcc libxcrypt clang zlib
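
To confirm that the active interpreter meets the Python 3.10 requirement stated above, a small check along these lines may help. This is only a sketch: the helper name `python_version_ok` is hypothetical and not part of Oncodrive3D, and it assumes the interpreter is reachable as `python3`.

```shell
# Hypothetical helper: succeed when a "major.minor[.patch]" string is >= 3.10.
python_version_ok() {
  major=${1%%.*}      # text before the first dot
  rest=${1#*.}        # text after the first dot
  minor=${rest%%.*}   # text before the next dot, if any
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 10 ]; }
}

if command -v python3 >/dev/null 2>&1; then
  ver=$(python3 -c 'import platform; print(platform.python_version())')
  if python_version_ok "$ver"; then
    echo "Python $ver: OK"
  else
    echo "Python $ver is older than 3.10" >&2
  fi
else
  echo "python3 not found on PATH" >&2
fi
```

Run it inside the environment you intend to install into (for example after `conda activate o3d`), since the version on PATH can differ between environments.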

Installation

Install via PyPI:
```
pip install oncodrive3d
```
Alternatively, you can obtain the latest code from the repository and install it for development with pip:
```
git clone https://github.com/bbglab/oncodrive3d.git
cd oncodrive3d
pip install -e .
oncodrive3d --help
```

Or you can use a modern build tool like uv:

git clone https://github.com/bbglab/oncodrive3d.git
cd oncodrive3d
uv run oncodrive3d --help

Building Datasets

This step builds the datasets necessary for Oncodrive3D to run the 3D clustering analysis. It is required once after installation or whenever you need to generate datasets for a different organism or apply a specific threshold to define amino acid contacts.

[!WARNING] This step is highly time- and resource-intensive, requiring a significant amount of free disk space and computational power. It will download and process a large amount of data. Ensure sufficient resources are available before proceeding, as insufficient capacity may result in extended runtimes or processing failures.

Reliable internet access is required because AlphaFold structures, Ensembl annotations, Pfam files, and other resources are downloaded on demand during the build.

[!WARNING] Human datasets built with the default settings pull canonical transcript metadata from the January 2024 Ensembl archive (release 111 / GENCODE v45). For maximum compatibility, annotate your input variants with the same Ensembl/GENCODE release or supply the unfiltered VEP output together with --o3d_transcripts --use_input_symbols.

[!NOTE] Predicted Aligned Error (PAE) files for older AlphaFold DB versions (e.g., v4) are no longer hosted after 2025. If you need PAE for an older AF version, download and supply them locally via --custom_pae_dir.
MANE structures are available only in AlphaFold DB v4, while non-MANE builds default to v6. Since MANE mode forces v4 structures, you should also supply the corresponding PAE files through --custom_pae_dir.

[!NOTE] The first time you run the Oncodrive3D dataset-building step with a given reference genome, the genome is downloaded from our servers. By default, the downloaded datasets are stored in ~/.bgdata. To store them in a different folder, define the environment variable BGDATA_LOCAL with an export command.
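
Setting the variable could look like the sketch below. The target path is a hypothetical example; replace it with a location that has enough free space.

```shell
# Hypothetical target directory for the downloaded reference data.
BGDATA_LOCAL="$HOME/data/bgdata"
export BGDATA_LOCAL

# Create the directory up front so that permission problems surface early.
mkdir -p "$BGDATA_LOCAL" || echo "Could not create $BGDATA_LOCAL" >&2

echo "BGDATA_LOCAL=$BGDATA_LOCAL"
```

The export only lasts for the current shell session. To make it persistent, add the export line to your shell startup file (for example ~/.bashrc) or to the job script that launches the build.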

Usage: oncodrive3d build-datasets [OPTIONS]

Examples:
  Basic build:
    oncodrive3d build-datasets -o <build_folder>
  
  Build with MANE Select transcripts:
    oncodrive3d build-datasets -o <build_folder> --mane

Options:
  -o, --output_dir PATH           Path to the directory where the output files will be saved. 
                                  Default: ./datasets/
  -s, --organism TEXT             Specifies the organism (`human` or `mouse`; also accepts `Homo sapiens` / `Mus musculus`). 
                                  Default: Homo sapiens
  -m, --mane                      Use structures predicted from MANE Select transcripts 
                                  (applicable to Homo sapiens only).
  -M, --mane_only                 Use only structures predicted from MANE Select transcripts
                                  (applicable to Homo sapiens only).
  -C, --custom_mane_pdb_dir PATH  Path to directory containing custom MANE PDB structures (requires --mane_only).
                                  Default: None
      --custom_pae_dir PATH       Path to directory containing pre-downloaded PAE JSON files.
                                  The directory will be copied into the build as `pae/`.
                                  Default: None
  -f, --custom_mane_metadata_path Path to a dataframe (typically a samplesheet.csv) including 
                                  Ensembl IDs and sequences of the custom pdbs.
                                  Default: None
  -d, --distance_threshold INT    Distance threshold (Å) for defining residue contacts. 
                                  Default: 10
  -c, --cores INT                 Number of CPU cores for computation. 
                                  Default: All available CPU cores
  --af_version INT                AlphaFold DB version for non-MANE builds (MANE uses v4).
                                  Default: 6
  -y, --yes                       Run without interactive prompts.
  -v, --verbose                   Enables verbose output.
  -h, --help                      Show this message and exit.
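
As an illustration of how these options combine, the sketch below assembles a mouse build with a non-default contact threshold and a fixed number of cores. The output folder and the values passed to -d and -c are hypothetical. Because the build is expensive, the `run` wrapper (a hypothetical helper, not part of Oncodrive3D) only prints the command unless DRY_RUN=0 is set.

```shell
# Hypothetical wrapper: print the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

BUILD_DIR="datasets_mouse"   # hypothetical output folder

# Mouse build, 8 Å contact threshold, 4 cores, no interactive prompts.
run oncodrive3d build-datasets -o "$BUILD_DIR" -s mouse -d 8 -c 4 -y
```

Reviewing the printed command first is a cheap way to catch a wrong folder or flag before committing to a long download and processing run.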

For more information on the output of this step, please refer to the Building Datasets Output Documentation.

[!TIP]

Increasing MANE Structural Coverage

To maximize structural coverage of MANE Select transcripts, you can predict missing structures locally and integrate them into Oncodrive3D using:

- tools/preprocessing/prepare_samplesheet.py: a standalone utility that:
  - Retrieves the full MANE entries from NCBI.
  - Identifies proteins missing from the AlphaFold MANE dataset.
  - Generates:
    - A samplesheet.csv with Ensembl protein IDs, FASTA paths, and optional sequences.
    - Individual FASTA files for each missing protein.
- tools/preprocessing/update_samplesheet_and_structures.py: takes the samplesheet folder produced above and:
  - Reuses canonical AlphaFold structures when possible to shrink the nf-core input.
  - Ingests nf-core/proteinfold predictions and keeps the missing/ set up to date.
  - Generates a final_bundle/samplesheet.csv plus final_bundle/pdbs/, ready to be passed to oncodrive3d build-datasets --mane_only via --custom_mane_metadata_path and --custom_mane_pdb_dir.

When invoking oncodrive3d build-datasets, supply:

- --custom_mane_pdb_dir: your own predicted PDB structures (e.g., from nf-core/proteinfold or final_bundle/pdbs/ produced by update_samplesheet_and_structures.py).
- --custom_mane_metadata_path: path to the corresponding samplesheet.csv (e.g., final_bundle/samplesheet.csv).

See the MANE Preprocessing Toolkit documentation for the full workflow to expand coverage of MANE-associated structures.
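
Putting the pieces together, a build that uses the custom structures might be launched as sketched below. The folder names are hypothetical, and `check_bundle` is an illustrative helper rather than part of the toolkit. It only verifies that the two paths named above exist before printing the command.

```shell
BUNDLE_DIR="final_bundle"    # folder produced by update_samplesheet_and_structures.py
BUILD_DIR="datasets_mane"    # hypothetical output folder

# Hypothetical pre-flight check: the bundle must contain samplesheet.csv and pdbs/.
check_bundle() {
  [ -f "$1/samplesheet.csv" ] || { echo "Missing $1/samplesheet.csv" >&2; return 1; }
  [ -d "$1/pdbs" ] || { echo "Missing $1/pdbs/" >&2; return 1; }
}

if check_bundle "$BUNDLE_DIR"; then
  # Remove the leading "echo" to launch the build instead of printing the command.
  echo oncodrive3d build-datasets -o "$BUILD_DIR" --mane_only \
    --custom_mane_pdb_dir "$BUNDLE_DIR/pdbs" \
    --custom_mane_metadata_path "$BUNDLE_DIR/samplesheet.csv"
fi
```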

Running 3D Clustering Analysis

For in-depth information on how to obtain the required input data and for comprehensive information about the output, please refer to the Input and Output Documentation of the 3D clustering analysis.

Input

- Mutations file (required): It can be either:
  - <input_maf>: A Mutation Annotation Format (MAF) file annotated with consequences (e.g., by using Ensembl Variant Effect Predictor (VEP)).
  - <input_vep>: The unfiltered output of VEP including annotations for all possible transcripts.
- <mut_profile> (optional): Dictionary including the normalized frequencies of mutations (values) in every possible trinucleotide context (keys), such as 'ACA>A', 'ACC>A', and so on.

[!NOTE] Examples of the input files are available in the Test Input Folder.
Please refer to these examples to understand the expected format and structure of the input files.

[!NOTE] Oncodrive3D uses the mutational profile of the cohort to build an accurate background model. However, it’s not strictly required. If the mutational profile is not provided, the tool will use a simple uniform distribution as the background model for simulating mutations and scoring potential 3D clusters.
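
To make "normalized frequencies" concrete, the sketch below converts per-context mutation counts into frequencies that sum to 1. Several things here are assumptions rather than documented behavior: the two-column counts file is a hypothetical intermediate format, the counts are toy values, and writing the dictionary as JSON is a guess about serialization. Check the examples in the Test Input Folder for the actual file format, and note that a real profile needs an entry for every context, not just the two shown.

```shell
# Hypothetical helper: read "<context> <count>" lines and print a dictionary
# of normalized frequencies (count / total), keeping first-seen key order.
normalize_profile() {
  awk '
    NF == 2 {
      if (!($1 in count)) key[++n] = $1
      count[$1] += $2
      total += $2
    }
    END {
      if (total == 0) exit 1
      printf "{"
      for (i = 1; i <= n; i++)
        printf "%s\"%s\": %.6f", (i > 1 ? ", " : ""), key[i], count[key[i]] / total
      print "}"
    }
  ' "$1"
}

# Toy counts for two contexts only (illustrative values, not real data).
counts_file=$(mktemp)
cat > "$counts_file" <<'EOF'
ACA>A 3
ACC>A 1
EOF

normalize_profile "$counts_file"   # prints {"ACA>A": 0.750000, "ACC>A": 0.250000}
rm -f "$counts_file"
```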

Main Output

Gene-level output: CSV file (\<cohort>.3d_clustering_genes.csv) containing the results of the analysis at the gene level. Each row represents a gene, sorted from the most significant to the least significant based on the 3D clustering analysis. The table also includes genes that were not analyzed, with the reason for exclusion provided in
