FLEXS
Fitness landscape exploration sandbox for biological sequence design.

💪 FLEXS is an open-source simulation environment that enables you to develop and compare model-guided biological sequence design algorithms. This project was developed with support from Dyno Therapeutics.
Installation
FLEXS is available on PyPI 🐍 and can be installed with pip install flexs.
There are two optional, but very useful dependencies, ViennaRNA (for RNA binding landscapes) and PyRosetta (for protein design landscapes). These can both be installed with conda:
$ conda install -c bioconda viennarna
$ conda install pyrosetta # Set up RosettaCommons conda channel first (http://www.pyrosetta.org/dow)
Note that PyRosetta requires a commercial license if not being used for academic purposes.
IMPORTANT: ViennaRNA seems to have issues with Python 3.8, so try to run in a Python >=3.5, <=3.7 environment.
If contributing or running paper code/experiments, we recommend that you install the dependencies for the sandbox in a conda virtual environment. You can initialize a new Python 3.7 environment with conda create --name {env_name} python=3.7. Then install the local version of flexs with pip install -e . in the root directory.
Overview
Biological sequence design through machine-guided directed evolution has been of increasing interest. This process often involves two closely connected steps:
- Models f that attempt to learn the ground truth sequence-to-function relationship g(x) = y.
- Algorithms that explore the sequence space with the help of the trained model f.
While in some cases, these two steps are learned simultaneously, it is fairly common to have access to a well-trained model f which is not invertible. Namely, given a sequence x, the model can estimate y' (with variable accuracy), but it cannot generate a sequence x' associated with a specific function y. Therefore it is valuable to develop exploration algorithms E(f) that make use of the model f to propose sequences x'.
We implement a simulation environment that allows you to develop or port landscape exploration algorithms for a variety of challenging tasks. Our environment allows you to abstract away the model f = Noisy_abstract_model(g) or employ empirical models (like Keras/PyTorch or scikit-learn models). You can see how these work in the tutorial.
Our abstraction is comprised of four levels:
1. Fitness Landscapes 🏔️
These oracles g are simulators that are treated as ground truth, i.e. when queried, they return the true value y_i associated with a sequence x_i. Currently we have five classes of ground truth oracles implemented.
- Transcription factor binding data. This is comprised of 158 (experimentally) fully characterized landscapes.
- RNA landscapes. A set of curated and increasingly challenging RNA binding landscapes as simulated with ViennaRNA.
- AAV Additive Tropism. A hypothesized noisy additive protein landscape based on tissue tropism of single mutant AAV2 capsid protein.
- GFP fluorescence. Fluorescence of GFP protein as predicted by the TAPE transformer model.
- Rosetta-based design. Rosetta-based design task for the 3MSI antifreeze protein.
For all landscapes we also provide a fixed set of initial points with different degrees of previous optimization, so that the relative strength of algorithms when starting from locations near or far away from peaks can be evaluated.
2. Noisy oracles
Noisy oracles are (approximate) models f of the original ground truth landscape g. These allow for the exploration algorithm to screen sequences virtually, before committing to making expensive queries to g. We implement two flavors of these:
- Noisy abstract models: Noise-corrupted version of g (this allows for independent study of exploration algorithms).
- Empirical models: f is learned directly from the data that was collected so far.
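As a rough illustration of the first flavor, the sketch below wraps a toy ground truth g with additive Gaussian noise. The names toy_ground_truth and make_noisy_model, and the choice of Gaussian noise, are assumptions made for this example; they are not the FLEXS API and may differ from the noise model FLEXS actually uses.

```python
import random


def toy_ground_truth(seq: str) -> float:
    """Hypothetical landscape g: fraction of positions that are 'A'."""
    return seq.count("A") / len(seq)


def make_noisy_model(g, noise_std: float, seed: int = 0):
    """Return f(x) = g(x) + Gaussian noise (an assumed noise model)."""
    rng = random.Random(seed)
    cache = {}

    def f(seq: str) -> float:
        # Cache estimates so that repeated queries for the same
        # sequence return the same value.
        if seq not in cache:
            cache[seq] = g(seq) + rng.gauss(0.0, noise_std)
        return cache[seq]

    return f
```

Raising noise_std degrades the model while leaving the explorer untouched, which is what makes it possible to study exploration algorithms independently of model training.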
3. Exploration algorithms 🕵️
Exploration algorithms have access to f, with a limit (virtual_screen) on the number of queries to this oracle. Once they have queried that many samples, they commit to measuring batch_size sequences from the ground truth, which incurs a real cost. The class base_explorer implements the housekeeping tasks, and new exploration algorithms can be implemented by inheriting from it.
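The interplay between the model query limit and the ground truth batch can be sketched as a single round of a very simple mutate-and-rank explorer. This is a minimal illustration under stated assumptions: run_round, mutate, and the DNA alphabet are hypothetical names chosen for the example, and the code does not inherit from or reproduce base_explorer.

```python
import random

ALPHABET = "ACGT"  # assumed alphabet for this example


def mutate(seq, rng):
    """Return a copy of seq with exactly one position changed."""
    pos = rng.randrange(len(seq))
    new_base = rng.choice([c for c in ALPHABET if c != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]


def run_round(measured, f, g, virtual_screen, batch_size, rng):
    """One round: screen candidates with f, then measure a batch with g.

    measured maps sequence -> ground truth fitness and is updated in place.
    """
    # Use the best sequences measured so far as parents.
    parents = sorted(measured, key=measured.get, reverse=True)[:batch_size]

    scored = {}
    for _ in range(virtual_screen):  # at most virtual_screen queries to f
        child = mutate(rng.choice(parents), rng)
        if child not in measured and child not in scored:
            scored[child] = f(child)

    # Commit to the top batch_size candidates; each call to g is "expensive".
    batch = sorted(scored, key=scored.get, reverse=True)[:batch_size]
    for seq in batch:
        measured[seq] = g(seq)
    return batch
```

Calling run_round repeatedly on the same measured dictionary gives a multi-round experiment in which only the sequences in each returned batch count against the ground truth budget.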
4. Evaluators 📊
We also implement a suite of evaluation modules that automatically collect data that is necessary for evaluating algorithms on different performance criteria.
- robustness: Produces data for analyzing how explorer performance changes given different quality of models.
- efficiency: Produces data for analyzing how explorer performance changes when more computational evaluations are allowed.
- adaptivity: Produces data for analyzing how the explorer is sensitive to the number of batches it is allowed to sample, given a fixed total budget.
See the tutorial for an example of how these can be run.
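To make the adaptivity criterion concrete, the sketch below enumerates ways of spending one fixed ground truth budget over different numbers of batches. The function name adaptivity_schedules and the rule of keeping only even splits are assumptions for illustration, not a description of how the FLEXS evaluator is implemented.

```python
def adaptivity_schedules(total_budget, rounds_options):
    """Return (num_rounds, batch_size) pairs that spend the same total budget."""
    schedules = []
    for rounds in rounds_options:
        if total_budget % rounds == 0:  # keep only even splits
            schedules.append((rounds, total_budget // rounds))
    return schedules
```

For a budget of 1000 measurements, the options 1, 2, and 10 rounds correspond to batch sizes of 1000, 500, and 100; an explorer that adapts well should benefit from the extra feedback that more, smaller batches provide.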
Contributions and credits 🤩
Your PRs and contributions to this sandbox are most welcome. If you make use of data or algorithms in this sandbox, please ensure that you cite the relevant original articles that made this work possible (we provide links in this readme). To cite the sandbox itself:
@article{sinai2020adalead,
title={AdaLead: A simple and robust adaptive greedy search algorithm for sequence design},
author={Sinai, Sam and Wang, Richard and Whatley, Alexander and Slocum, Stewart and Locane, Elina and Kelsic, Eric},
journal={arXiv preprint},
year={2020}
}
FLEXS 0.2.1 was developed by Sam Sinai, Richard Wang, Alexander Whatley, Elina Locane, and Stewart Slocum.
Components
Ground Truth Landscapes
Transcription Factor Binding
Barrera et al. (2016) surveyed the binding affinity of more than one hundred and fifty transcription factors (TF) to all possible DNA sequences of length 8. Since the ground truth is entirely characterized, and biological, it is a relevant benchmark for our purpose. These generate the full picture for landscapes of size 4^8. We shift the function distribution such that y is within [0,1], and therefore optimal(y)=1. We also provide 15 initiation sequences with different degrees of optimization across landscapes. The sequence TTAATTAA, for instance, is a famous binding site that is a global peak in 20 of these landscapes, and a local peak (above all its single mutant neighbors) in 96 landscapes overall. GCTCGAGC is a local peak in 106 landscapes, whereas AAAGAGAG is not a peak in any of the 158 landscapes. Note that, while complete, these landscapes are generally easy to optimize because of their small size. We therefore recommend testing them in a very low-budget setting, or using additional classes of landscapes for benchmarking.
@article{barrera2016survey,
title={Survey of variation in human transcription factors reveals prevalent DNA binding changes},
author={Barrera, Luis A and Vedenko, Anastasia and Kurland, Jesse V and Rogers, Julia M and Gisselbrecht, Stephen S and Rossin, Elizabeth J and Woodard, Jaie and Mariani, Luca and Kock, Kian Hong and Inukai, Sachi and others},
journal={Science},
volume={351},
number={6280},
pages={1450--1454},
year={2016},
publisher={American Association for the Advancement of Science}
}
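For the transcription factor landscapes described above, the notions of landscape size and local peak can be made concrete with a short sketch. The helper names (all_sequences, single_mutants, is_local_peak) are hypothetical, and the fitness argument stands in for any function that returns a score for an 8-mer; no real binding data is used here.

```python
from itertools import product

DNA = "ACGT"


def all_sequences(length=8):
    """Yield every DNA sequence of the given length (4**length in total)."""
    return ("".join(p) for p in product(DNA, repeat=length))


def single_mutants(seq):
    """Yield all sequences that differ from seq at exactly one position."""
    for i, base in enumerate(seq):
        for alt in DNA:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def is_local_peak(seq, fitness):
    """True if seq scores strictly above all of its single mutant neighbors."""
    score = fitness(seq)
    return all(score > fitness(n) for n in single_mutants(seq))
```

A sequence of length 8 has 8 × 3 = 24 single mutant neighbors, and the full landscape contains 4^8 = 65,536 sequences, which is small enough to enumerate exhaustively.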
RNA Landscapes
Predicting RNA secondary structures is a well-studied problem. There are efficient and accurate dynamic programming approaches to calculate the secondary structure of short RNA sequences. These landscapes give us a good proxy for a consistent oracle over the entire domain of large landscapes. We use the ViennaRNA package to simulate binding landscapes of RNA sequences as a ground truth oracle.
Our sandbox allows for constructing arbitrarily complex landscapes (although we discourage large RNA sequences, as the accuracy of the simulator deteriorates above 200 nucleotides). As a benchmark, we provide a series of 36 increasingly complex RNA binding landscapes. These landscapes each come with at least 5 suggested starting sequences, with various degrees of initial optimization.
The simplest landscapes are binding landscapes with a single hidden target (often larger than the design sequence, resulting in multiple peaks). The designed sequence is meant to be optimized to bind the target with the minimum binding energy (we use duplex energy as our objective). We estimate optimal(y) by computing the binding energy of the perfect complement of the target and normalize the fitnesses using that measure (hence this is only an approximation and often a slight underestimate). RNA landscapes show many local peaks, and often multiple global peaks due to symmetry.
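The normalization step can be sketched as follows. This is an illustration under explicit assumptions: toy_duplex_energy is a stand-in that scores -1 per Watson-Crick pair in an ungapped antiparallel alignment, not a thermodynamic model, and the function names are hypothetical. In practice the energy would come from a simulator such as ViennaRNA.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def perfect_complement(target):
    """Reverse complement: pairs with the target in antiparallel orientation."""
    return "".join(COMPLEMENT[b] for b in reversed(target))


def toy_duplex_energy(seq, target):
    """Stand-in for a real duplex energy: -1 per Watson-Crick pair."""
    pairs = sum(COMPLEMENT[a] == b for a, b in zip(seq, reversed(target)))
    return -float(pairs)


def normalized_fitness(seq, target, energy=toy_duplex_energy):
    """Binding energy of seq relative to that of the perfect complement."""
    best = energy(perfect_complement(target), target)
    return energy(seq, target) / best
```

Because both energies are negative, the ratio is positive, and stronger binders score higher. With a real energy model the perfect complement is not guaranteed to be the strongest binder, so normalized values slightly above 1 are possible, consistent with the estimate being an approximation.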
Additionally, we construct more complex landscapes by increasing the number of hidden targets, enforcing specific conservation patterns, and composing the scores of ea
