Computational Optimization of DNA Activity (CODA)

Preprint on bioRxiv. For legacy reasons, this repo is currently listed as boda2. The model, model weights, and code defining the model architecture are covered under the MIT license, and the rest is covered under the AGPL-3.0 license.

Overview
System Requirements
Installation Guide
Interactive Docker Environments
Interactive usage
Applications
Extending CODA
Cloud Integrations

Overview

Here, we present a platform to engineer and validate synthetic CREs capable of driving gene expression with programmed cell type specificity. This library contains the resources needed to train DNNs on MPRA data and generate synthetic sequences from these models. Additionally, we include resources to apply these model to to inference tasks on common input data types (e.g., VCF, FASTA, etc.). Finally, we provide examples which deploy the Malinois model.

Tutorials

We have a number of end-to-end tested notebooks available as references for model training and sequence design in the tutorials subdirectory. Updated: Feb 16 2024.

Library Modules

The modules we develop to implement modeling and design are found in boda.

Scripts

We wrote scripts provide command line integration for the modules we implement which are found in src.

System requirements

Hardware requirements

CODA was extensively tested in Google Colab environments and GCP VMs with the following specs:

Type: a2-highgpu-1g
CPU: 12 vCPU
RAM: 85 GB
GPU: 1x Tesla A100
GPU-RAM: 40 GB HBM2

Type: custom-12-65536
CPU: 12 vCPU (Intel Broadwell)
RAM: 64 GB
GPU: 1x Tesla V100
GPU-RAM: 16 GB HBM2

Software Requirements

CODA was designed using the GCP deployment: NVIDIA GPU-Optimized Image for Deep Learning, ML & HPC

OS: Ubuntu 20.04.2 LTS
CUDA: 11.3
Python: 3.7.12
PyTorch: 1.13.1

The most recent tested python environment (Feb 23 2024) is provided in pip_packages.json.

Installation Guide

First install torch according to the installation guide at https://pytorch.org/, then CODA can be installed from the latest version of the GITHUB repo. CODA was developed on torch==1.13.1.

git clone https://github.com/sjgosai/boda2.git
cd boda2/

pip install --upgrade pip==21.3.1
pip install --no-cache-dir -r requirements.txt
pip install -e .

Installation time is from the repository is approximately 66.07 seconds (Feb 23 2024, tested in Colab). Properly installing torch is highly variable system to system so it wasn't feesible to include in the requirements.txt.

Colab

CODA works in Colab notebooks backed with Custom VMs. We were successful using V100s, but not T4s (tested: Feb 16 2024). Deployment specs used to test these notebooks are:

Type: n1-highmem-8
GPU: 1x Tesla V100
Disk: SSD
Disk Size: 500 GB

We partially reproduce two tutorials as examples:

Load Malinois (approx total runtime: 4 minutes 2 seconds)
Train and Design (approx total runtime: 13 minutes 51 seconds)

For some reason, you need to restart the runtime before import boda will work.

Interactive Docker Environments

CODA has been installed in docker containers which can be downloaded in attached for interactive use. This can be quickly deployed using helper scripts in the repo:

cd /home/ubuntu
git clone https://github.com/sjgosai/boda2.git
bash boda2/src/run_docker_for_dev.sh gcr.io/sabeti-encode/boda devenv 0.2.0 8888 6006

Which connects jupyter lab to ports 8888 and 6006.

More containers can be found at gcr.io/sabeti-encode/boda. Use devenv 0.2.0 if using A100s and devenv 0.1.2 if using V100s.

Interactive usage

Modeling

CODA is an extension of pytorch and pytorch-lightning. Classes in CODA used to construct models generally inherit from nn.Module and lightning.LightningModule but need to be combined as described in tutorials/construct_new_model.ipynb. The documentation for this is in progress.

Inference

Example interactive deployment of Malinois for inference can be found here: tutorials/load_malinois_model.ipynb.

The models trained for the paper:

Sequence generation

The tutorials also include a notebook that describes how to mix models with the sequence generation algorithms implemented in CODA: tutorials/run_generators.ipynb

Applications

We have developed python applications to train models and generate sequences using this library.

Model training

Deep learning models can be trained from the command line by invoking the DATA, MODEL, and GRAPH modules. For example (note: change --artifact_path arg at bottom):

python /home/ubuntu/boda2/src/train.py \
  --data_module=MPRA_DataModule \
    --datafile_path=gs://tewhey-public-data/CODA_resources/Table_S2__MPRA_dataset.txt \
    --sep tab --sequence_column sequence \
    --activity_columns K562_log2FC HepG2_log2FC SKNSH_log2FC \
    --stderr_columns K562_lfcSE HepG2_lfcSE SKNSH_lfcSE \
    --stderr_threshold 1.0 --batch_size=1076 \
    --duplication_cutoff=0.5 --std_multiple_cut=6.0 \
    --val_chrs 7 13 --test_chrs 9 21 X \
    --synth_val_pct=0.0 --synth_test_pct=99.98 \
    --padded_seq_len=600 --use_reverse_complements=True --num_workers=8 \
  --model_module=BassetBranched \
    --input_len 600 \
    --conv1_channels=300 --conv1_kernel_size=19 \
    --conv2_channels=200 --conv2_kernel_size=11 \
    --conv3_channels=200 --conv3_kernel_size=7 \
    --linear_activation=ReLU --linear_channels=1000 \
    --linear_dropout_p=0.11625456877954289 \
    --branched_activation=ReLU --branched_channels=140 \
    --branched_dropout_p=0.5757068086404574 \
    --n_outputs=3 --n_linear_layers=1 \
    --n_branched_layers=3 --n_branched_layers=3 \
    --use_batch_norm=True --use_weight_norm=False \
    --loss_criterion=L1KLmixed --beta=5.0 \
    --reduction=mean \
  --graph_module=CNNTransferLearning \
    --parent_weights=gs://tewhey-public-data/CODA_resources/my-model.epoch_5-step_19885.pkl \
    --frozen_epochs=0 \
    --optimizer=Adam --amsgrad=True \
    --lr=0.0032658700881052086 --eps=1e-08 --weight_decay=0.0003438210249762151 \
    --beta1=0.8661062881299633 --beta2=0.879223105336538 \
    --scheduler=CosineAnnealingWarmRestarts --scheduler_interval=step \
    --T_0=4096 --T_mult=1 --eta_min=0.0 --last_epoch=-1 \
    --checkpoint_monitor=entropy_spearman --stopping_mode=max \
    --stopping_patience=30 --accelerator=gpu --devices=1 --min_epochs=60 --max_epochs=200 \
    --precision=16 --default_root_dir=/tmp/output/artifacts \
    --artifact_path=gs://your/bucket/mpra_model/

Update Feb. 16 2024: We found a problem with the training data in supplementary table 2 we deposited on bioRxiv. There's an updated version here.

Sequence design

Trained models can be deployed to generate sequences using implemented algorithms. This command will run Fast SeqProp using the Malinois model to optimize K562 specific expression:

python /home/ubuntu/boda2/src/generate.py \
    --params_module StraightThroughParameters \
        --batch_size 256 --n_channels 4 \
        --length 200 --n_samples 10 \
        --use_norm True --use_affine False \
    --energy_module MinGapEnergy \
        --target_feature 0 --bending_factor 1.0 --a_min -2.0 --a_max 6.0 \
        --model_artifact gs://tewhey-public-data/CODA_resources/malinois_artifacts__20211113_021200__287348.tar.gz \
    --generator_module FastSeqProp \
         --n_steps 200 --learning_rate 0.5 \
    --energy_threshold -0.5 --max_attempts 40 \
    --n_proposals 1000 \
    --proposal_path ./test__k562__fsp

The target cell type can be changed by picking a different index for --target_feature. If you're using Malinois, {'K562': 0, 'HepG2': 1, 'SKNSH': 2}.

This command will run Simulated Annealing with the same objective:

python /home/ubuntu/boda2/src/generate.py \
    --params_module BasicParameters \
        --batch_size 256 --n_channels 4 \
        --length 200 \
    --energy_module MinGapEnergy \
        --target_feature 0 --bending_factor 0.0 --a_min -2.0 --a_max 6.0 \
        --model_artifact gs://tewhey-public-data/CODA_resources/malinois_artifacts__20211113_021200__287348.tar.gz \
    --generator_module SimulatedAnnealing \
         --n_steps 2000 --n_positions 5 \
         --a 1.0 --b 1.0 --gamma 0.501 \
    --energy_threshold -0.5 --max_attempts 40 \
    --n_proposals 1000 \
    --proposal_path ./test__k562__sa

This command will run AdaLead with the same objective:

python /home/ubuntu/boda2/src/generate.py \
    --params_module PassThroughPa

Boda2

Install / Use

README