
AUTOENCODIX

Autoencoders are deep-learning networks for dimensionality reduction and embedding. By combining a compressing encoder with a decoder, they enable non-linear, multi-modal data integration, with promising applications to complex biological data from large-scale omics measurements. Ongoing research provides many exciting autoencoder architectures and implementations; however, there is a lack of easy-to-use, unified implementations covering the whole pipeline of autoencoder application. Consequently, we present AUTOENCODIX with the following features:

  • Multi-modal data integration for any numerical or categorical data
  • Different autoencoder architectures:
    • vanilla vanillix
    • variational varix
    • hierarchical/stacked stackix
    • ontology-based ontix
    • cross-modal autoencoder (translation between different data modalities) x-modalix
  • A customizable setup: run with your own data and change model parameters via a yaml configuration file
  • Full pipeline from preprocessing to embedding evaluation:
<img src="https://raw.githubusercontent.com/jan-forest/autoencodix/main/images/pipeline_overview.png" alt="pipeline-overview" width="1200"/>
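To make the encoder/decoder principle concrete, here is a minimal, self-contained sketch of a linear autoencoder in NumPy. It is purely illustrative and not the AUTOENCODIX implementation: 20-dimensional toy "omics" features are compressed to a 2-dimensional embedding and reconstructed, trained by plain gradient descent on the reconstruction error.

```python
import numpy as np

# Illustrative only -- NOT the AUTOENCODIX implementation. A linear
# autoencoder compresses 20 features to a 2-dimensional latent space
# (encoder) and reconstructs the input from it (decoder).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                  # 100 samples x 20 features

n_latent = 2
W_enc = 0.1 * rng.normal(size=(20, n_latent))   # encoder weights
W_dec = 0.1 * rng.normal(size=(n_latent, 20))   # decoder weights

def reconstruction_mse(W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

mse_before = reconstruction_mse(W_enc, W_dec)

lr = 0.01
for _ in range(500):
    Z = X @ W_enc                  # encode: samples -> latent embedding
    err = Z @ W_dec - X            # decode and compare to the input
    # gradients of the reconstruction error (up to a constant factor)
    grad_dec = (Z.T @ err) / len(X)
    grad_enc = (X.T @ (err @ W_dec.T)) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse_after = reconstruction_mse(W_enc, W_dec)
embedding = X @ W_enc              # the low-dimensional representation
print(embedding.shape)             # (100, 2)
```

The real framework adds non-linear layers, multiple modalities, and the architectures listed above; this sketch only shows the compress-then-reconstruct idea behind all of them.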

For a detailed description and benchmark of capabilities, check our publication in Nature Computational Science.

Please use the following to cite our work when using our framework:

@article{joas2025autoencodix,
  title={AUTOENCODIX: a generalized and versatile framework to train and evaluate autoencoders for biological representation learning and beyond},
  author={Joas, Maximilian Josef and Jurenaite, Neringa and Pra{\v{s}}{\v{c}}evi{\'c}, Du{\v{s}}an and Scherf, Nico and Ewald, Jan},
  journal={Nature Computational Science},
  pages={1--13},
  year={2025},
  doi={},
  publisher={Nature Publishing Group US New York}
}

:information_source: Future :information_source:

A Python-package version of AUTOENCODIX is currently under development and will supersede this repository for future development of AUTOENCODIX.

:construction: You can check out the progress in this repo: autoencodix_package :construction:


1 INSTALLATION

Follow the instructions depending on the machine you are working with. For familiarisation with the code, use your local machine on a small dataset as shown in our tutorials.

Requirements:

  • pip
  • Python == 3.10 (support for Python >= 3.10 coming soon)
  • GPU recommended for larger datasets

1.1 Linux and MacOS

  • clone this repo:
git clone https://github.com/jan-forest/autoencodix.git
  • change into the repo:
cd autoencodix
  • create the environment with:
make create_environment
  • activate the environment with:
source venv-gallia/bin/activate
  • install requirements with:
make requirements
  • note: GPU support is currently not available on macOS

1.2 Windows-based

  • to use the Makefile in Windows you need to install make

  • See https://linuxhint.com/run-makefile-windows/

  • Move Makefile_windows to Makefile

  • create environment with: make create_environment

  • activate env with .\venv-gallia\Scripts\activate

  • install requirements with make requirements

  • if you encounter problems, see the troubleshooting section at the end

1.3 HPC Cluster

  • clone this repo into a dedicated workspace

  • load Python/3.10 or above and virtualenv

  • create a Python virtual environment according to your HPC guidelines

  • activate the environment with source [env-name]/bin/activate

  • install requirements with make requirements

2 Getting started

2.1 First steps and tutorials

To work with our framework, only three steps are necessary:

  1. Get the input data ready
  2. Specify model and pipeline parameters via a <run_id>_config.yaml config file
  3. Run the full pipeline via make RUN_ID=<run_id>

First-time users should check our tutorial notebooks for more details on these steps and for showcases of important options of AUTOENCODIX.

2.2 Other pipeline examples

In addition to the tutorial notebooks, we provide example configs for the main features of AUTOENCODIX:

  • Multi-modal VAE training on TCGA pan-cancer data (run_TCGAexample.sh), including hyperparameter tuning with Optuna, via:
> ./bash-runs/run_TCGAexample.sh 
  • Training of an ontology-based VAE ontix on single-cell data via:
> ./bash-runs/run_SingleCellExample.sh 

All scripts will download the data, create the necessary yaml configs, and run the pipeline for you. Results and visualizations can be found under reports/<run_id>/.

2.3 Working with own data and config files

To work with our framework, you first need to make sure your data has the following format, as described in detail in the tutorials:

  • for each data modality, either a text file (csv, tsv, txt) or a .parquet file with samples as rows and features as columns
  • we recommend an ANNOTATION file in the same format containing clinical parameters or other sample metadata for visualization
  • as described in the tutorials, provide ontologies or image-mapping files to work with ontix or x-modalix
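As a quick sanity check of the expected shape, the snippet below builds a hypothetical single-modality TSV in memory, with samples as rows and features as columns. The sample and gene names are invented for illustration; tsv is one of the accepted text formats.

```python
import csv
import io

# Hypothetical single-modality input: samples as rows, features as columns.
# Sample and gene names here are invented for illustration.
rows = [
    ["SAMPLE_ID", "geneA", "geneB", "geneC"],
    ["sample_1", "0.12", "3.40", "1.05"],
    ["sample_2", "0.98", "2.10", "0.44"],
]
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(rows)  # tab-separated output
tsv_text = buf.getvalue()
print(tsv_text)
```

Writing the same rows with a comma delimiter (csv) or via a parquet writer would be equally valid, as long as the samples-as-rows, features-as-columns orientation is kept.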

When your input data is ready, you need to create your own config with the name <run_id>_config.yaml. We provide a sample config in ./_config.yaml. Copy and rename this file:


> cp ./_config.yaml ./<RUN_ID>_config.yaml

Each entry in the _config.yaml has an inline comment that indicates whether you:

  • Have to change this parameter (flagged with TODO)
  • Should think about this parameter (flagged with SHOULDDO)
  • Probably don't need to change this parameter (flagged with OPTIONAL)

All config parameters are explained directly in the _config.yaml file and in the full documentation.
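To illustrate the flagging convention only (the key names below are hypothetical, not the actual AUTOENCODIX parameters; always start from the shipped _config.yaml), an excerpt might look like:

```yaml
# Hypothetical excerpt -- key names are illustrative only; the shipped
# _config.yaml documents the real parameters.
DATA_PATH: data/raw/my_modality.tsv   # TODO: point to your own input file
LATENT_DIM: 8                         # SHOULDDO: tune to your dataset
SEED: 42                              # OPTIONAL: usually fine as-is
```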

3 Run and edit pipeline

There are multiple steps in the pipeline, which can be run all at once or step by step. To run the whole pipeline, it is sufficient to run make ml_task RUN_ID=<run-id>.

If you only want to run all steps up to model training, you can run make model RUN_ID=<run-id>. This will prepare the data structure, process the data, and train your models. It is also possible to run a step without running the steps before it, using the _only suffix, e.g. make model_only RUN_ID=<run-id>.

```mermaid
graph LR;
  A[make ml_task]-->|also executes|B[make visualize];
  B-->|also executes|C[make prediction];
  C-->|also executes|D[make model];
  D-->|also executes|E[make data];
  E-->|also executes|F[make config];
```


After you have run the pipeline including make data, if you want to change model training parameters you can re-run the pipeline with make train_n_visualize RUN_ID=<run-id>. This skips data preprocessing and is useful in some cases when adjusting training parameters.

4 Project Organization

|-- bash-runs
|   |-- run_slurm.sh
|   |-- run_SingleCellExample.sh
|   |-- run_TCGAexample.sh
|-- data
|   |-- interim
|   |-- processed
|   |   |-- <RUN_ID>
|   |   |   |-- <data1.txt>
|   |   |   |-- <data2.txt>
|   |   |   `-- sample_split.txt
|   |-- raw
|   |   |-- <raw_data.txt>
|   |   |-- images
|   |   |   |-- image_mappings.txt
|   |   |   |-- image1.jpg
|-- models
|   |-- <RUN_ID>
|       `-- <model.pt>
|-- reports
|   |-- <RUN_ID>
|   |    |--<latent_space.txt>
|   |    |--figures
|   |       `--<figure1.png>
|-- src
|   |-- data
|   |   |-- format_sc_h5ad.py
|   |   |-- format_tcga.py
|   |   |-- join_h5ad.py
|   |   |-- join_tcga.py
|   |   |-- make_dataset.py
|   |   `-- make_ontology.py
|   |-- features
|   |   |-- build_features.py
|   |   |-- combine_MUT_CNA.py
|   |   `-- get_PIscores.py
|   |-- models
|   |   |-- tuning
|   |   |   |-- models_for_tuning.py
|   |   |   |-- tuning.py
|   |   |-- build_models.py
|   |   |-- main_translate.py
|   |   |-- models.py
|   |   |-- predict.py
|   |   `-- train.py
|   |-- utils
|   |   |-- config.py
|   |   |-- utils.py
|   |   `-- utils_basic.py
|   |-- visualization
|   |   |-- Exp2_visualization.py
|   |   |-- Exp3_visualization.py
|   |   |-- ml_task.py
|   | 