Chroma
A generative model for programmable protein design
Get Started | Sampling | Design | Conditioners | License
Chroma is a generative model for designing proteins programmatically.
Protein space is complex and hard to navigate. With Chroma, protein design problems are represented in terms of composable building blocks from which diverse, all-atom protein structures can be automatically generated. As a joint model of structure and sequence, Chroma can also be used for common protein modeling tasks such as generating sequences given backbones, packing side-chains, and scoring designs.
We provide protein conditioners for a variety of constraints, including substructure, symmetry, shape, and neural-network predictions of some protein classes and annotations. We also provide an API for creating your own conditioners in a few lines of code.
Internally, Chroma uses diffusion modeling, equivariant graph neural networks, and conditional random fields to efficiently sample all-atom structures with a complexity that is sub-quadratic in the number of residues. It can generate large complexes in a few minutes on a commodity GPU. You can read more about Chroma, including biophysical and crystallographic validation of some early designs, in our paper, Illuminating protein space with a programmable generative model. Nature 2023.
<div align="center"> <img src="assets/proteins.png" alt="Generated protein examples" width="700px" align="middle"/> </div>
Get Started
Note: An API key is required to download and use the pretrained model weights. It can be obtained here.
Colab Notebooks. The quickest way to get started with Chroma is our Colab notebooks, which provide starting points for a variety of use cases in a preconfigured, in-browser environment.
- Chroma Quickstart: GUI notebook demonstrating unconditional and conditional generation of proteins with Chroma.
- Chroma API Tutorial: Code notebook demonstrating protein I/O, sampling, and design configuration directly in Python.
- Chroma Conditioner API Tutorial: A deeper dive under the hood for implementing new Chroma Conditioners.
PyPI package. You can install the latest release of Chroma with:
pip install generate-chroma
Install the latest Chroma from GitHub:
git clone https://github.com/generatebio/chroma.git
pip install -e chroma # use `-e` for it to be editable locally.
Sampling
Unconditional monomer. We provide a unified entry point to both unconditional and conditional protein design with the Chroma.sample() method. When no conditioners are specified, we can sample a simple 200-amino acid monomeric protein with
from chroma import Chroma
chroma = Chroma()
protein = chroma.sample(chain_lengths=[200])
protein.to("sample.cif")
display(protein)
Generally, Chroma.sample() takes design hyperparameters and Conditioners as input and outputs Protein objects representing the all-atom structures of protein systems, which can be saved to and loaded from disk in PDB or mmCIF formats.
Unconditional complex. To sample a complex instead of a monomer, we can simply do
from chroma import Chroma
chroma = Chroma()
protein = chroma.sample(chain_lengths=[100, 200])
protein.to("sample-complex.cif")
Conditional complex. We can further customize sampling towards design objectives via Conditioners and sampling hyperparameters. For example, to sample a C3-symmetric homo-trimer with 100 residues per monomer, we can do
from chroma import Chroma, conditioners
chroma = Chroma()
conditioner = conditioners.SymmetryConditioner(G="C_3", num_chain_neighbors=2)
protein = chroma.sample(
    chain_lengths=[100],
    conditioner=conditioner,
    langevin_factor=8,
    inverse_temperature=8,
    sde_func="langevin",
    potts_symmetry_order=conditioner.potts_symmetry_order)
protein.to("sample-C3.cif")
Because compositions of conditioners are conditioners, even relatively complex design problems can follow this basic usage pattern. See the demo notebooks and docstrings for more information on hyperparameters, conditioners, and starting points.
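The statement that compositions of conditioners are themselves conditioners can be illustrated with a toy sketch. The code below does not use Chroma's API: `State`, `center`, `compactness`, and `compose` are hypothetical names, and a conditioner is modeled simply as a function that maps a state (coordinates plus an energy) to a new state.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class State:
    """Toy system state: coordinates plus a scalar energy (hypothetical)."""
    coords: List[Point]
    energy: float = 0.0


# In this sketch, a conditioner is any function State -> State.
Conditioner = Callable[[State], State]


def centroid(coords: List[Point]) -> Point:
    """Mean position of a list of points."""
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))


def center(state: State) -> State:
    """Constraint-like step: translate the centroid to the origin."""
    c = centroid(state.coords)
    moved = [tuple(p[i] - c[i] for i in range(3)) for p in state.coords]
    return State(moved, state.energy)


def compactness(weight: float) -> Conditioner:
    """Restraint-like step: add weight * mean squared distance from the centroid."""
    def conditioner(state: State) -> State:
        c = centroid(state.coords)
        squared = [sum((p[i] - c[i]) ** 2 for i in range(3)) for p in state.coords]
        return State(state.coords, state.energy + weight * sum(squared) / len(squared))
    return conditioner


def compose(*conditioners: Conditioner) -> Conditioner:
    """Apply conditioners left to right; the result is itself a conditioner."""
    def composed(state: State) -> State:
        for conditioner in conditioners:
            state = conditioner(state)
        return state
    return composed


pipeline = compose(center, compactness(0.1))
nested = compose(pipeline, compactness(0.2))  # a composition used as a conditioner

result = nested(State([(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]))
print(result.coords, round(result.energy, 6))
```

Because `compose` returns a function with the same signature as its arguments, a composed pipeline can be passed anywhere a single conditioner is expected, which is the property the usage pattern above relies on.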
Design
Robust design. Chroma is a joint model of sequence and structure that uses a common graph neural network base architecture to parameterize both backbone generation and conditional sequence and sidechain generation. These sequence and sidechain decoders are diffusion-aware in the sense that they have been trained to predict sequence and sidechains not just for natural structures at diffusion time $t=0$ but also for noisy structures at all diffusion times $t \in [0,1]$. As a result, the $t$ hyperparameter of the design network provides a kind of tunable robustness via diffusion augmentation, in which we trade off how closely the model attempts to design the backbone exactly as specified (e.g. $t=0.0$) against robust design within a small neighborhood of nearby backbone conformations (e.g. $t=0.5$).
While all results presented in the Chroma publication were obtained with exact design at $t=0.0$, we have found that robust design at times near $t=0.5$ frequently improves one-shot refolding while incurring only minor, often Ångström-scale, relaxation adjustments to target backbones. When we compare the performance of these two design modes on the set of 50,000 unconditional backbones analyzed in the paper, we see very large improvements in refolding across both AlphaFold and ESMFold that stratify well across protein length, percent helicity, or similarity to a known structure (see Chroma Supplementary Figure 14 for further context).
<div align="center"> <img src="./assets/refolding.png" alt="alt text" width="700px" align="middle"/> </div></br>
The value of diffusion time conditioning $t$ can be set via the design_t parameter in Chroma.sample and Chroma.design. We find that for generated structures, $t = 0.5$ produces highly robust refolding results and is, therefore, the default setting. For experimentally-precise structures, $t = 0.0$ may be more appropriate, and values in between may provide a useful tradeoff between these two regimes.
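To give a feel for what a "neighborhood of nearby backbone conformations" looks like at different values of $t$, the sketch below perturbs a toy backbone with Gaussian noise whose scale grows with $t$. This is only an illustration under stated assumptions: the linear noise scale `sigma_max * t`, the helper names `perturb` and `rmsd`, and the straight-line coordinates are all hypothetical, and Chroma's actual diffusion process is not reproduced here.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float, float]


def perturb(coords: List[Point], t: float, sigma_max: float = 2.0, seed: int = 0) -> List[Point]:
    """Add isotropic Gaussian noise with standard deviation sigma_max * t (assumed schedule)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = random.Random(seed)
    sigma = sigma_max * t
    return [tuple(x + sigma * rng.gauss(0.0, 1.0) for x in p) for p in coords]


def rmsd(a: List[Point], b: List[Point]) -> float:
    """Root-mean-square deviation between two equally sized coordinate lists (no superposition)."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))


# Toy "backbone": 50 points spaced 3.8 Å apart along a line.
backbone = [(3.8 * i, 0.0, 0.0) for i in range(50)]

# Print the deviation from the input backbone at a few values of t.
for t in (0.0, 0.25, 0.5):
    print(t, round(rmsd(backbone, perturb(backbone, t)), 2))
```

At $t=0$ the backbone is returned unchanged, which corresponds to exact design; larger $t$ widens the set of conformations treated as equivalent to the input.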
Design a la carte. Chroma's design network can be accessed separately to design, redesign, and pack arbitrary protein systems. Here we load a protein from the PDB and redesign it:
# Redesign a Protein
from chroma import Protein, Chroma
chroma = Chroma()
protein = Protein('1GFP')
protein = chroma.design(protein)
protein.to("1GFP-redesign.cif")
Clamped sub-sequence redesign is also available and compatible with a built-in selection algebra, along with position- and mutation-specific mask constraints:
# Redesign a selected region of a Protein
from chroma import Protein, Chroma
chroma = Chroma()
protein = Protein('my_favorite_protein.cif') # PDB is fine too
protein = chroma.design(protein, design_selection="resid 20-50 around 5.0") # 5 angstrom bubble around indices 20-50
protein.to("my_favorite_protein_redesign.cif")
We provide more examples of design in the demo notebooks.
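To illustrate what a selection such as `resid 20-50 around 5.0` means geometrically, here is a dependency-free sketch that expands a residue range to every residue within a distance cutoff, following the reading suggested by the code comment above. It is not Chroma's selection parser: the function name `expand_selection`, the use of a single representative coordinate per residue, and the 1-based residue numbering are assumptions made for illustration.

```python
import math
from typing import Dict, Set, Tuple

Point = Tuple[float, float, float]


def expand_selection(coords: Dict[int, Point], start: int, end: int, cutoff: float) -> Set[int]:
    """Return residues in [start, end] plus any residue within `cutoff` of one of them."""
    core = {i for i in coords if start <= i <= end}
    selected = set(core)
    for i, p in coords.items():
        if i in core:
            continue
        # Include residue i if it lies inside the bubble around any core residue.
        if any(math.dist(p, coords[j]) <= cutoff for j in core):
            selected.add(i)
    return selected


# Toy chain: residues 1..60 spaced 3.8 Å apart on a straight line.
chain = {i: (3.8 * (i - 1), 0.0, 0.0) for i in range(1, 61)}

designable = expand_selection(chain, start=20, end=50, cutoff=5.0)
print(sorted(designable))
```

On this straight toy chain only the immediate neighbors, residues 19 and 51, fall inside the 5 Å bubble. In a folded protein, residues that are distant in sequence but close in space would be picked up as well.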
Conditioners
Protein design with Chroma is programmable. Our Conditioner framework allows for automatic conditional sampling under arbitrary compositions of protein specifications, which can come in the form of restraints (biasing the distribution of states) or constraints (directly restricting the domain of the underlying sampling process); see Supplementary Appendix M in our paper. We have pre-defined multiple conditioners, including ones for controlling substructure, symmetry, shape, semantics, and natural-language prompts (see chroma.layers.structure.conditioners), which can be used in arbitrary combinations.
| Conditioner | Class(es) in chroma.conditioners | Example applications |
|----------|----------|----------|
| Symmetry constraint | SymmetryConditioner, ScrewConditioner | Large symmetric assemblies |
| Substructure constraint | SubstructureConditioner | Substructure grafting, scaffold enforcement |
| Shape restraint | ShapeConditioner | Molecular shape control |
| Secondary structure | ProClassConditioner | Secondary-structure specification |
| Domain classification | ProClassConditioner | Specification of class, such as Pfam, CATH, or Taxonomy |
| Text caption | ProCapConditioner | Natural language prompting |
| Sequence | SubsequenceConditioner | Subsequence constraints |
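The difference between a constraint and a restraint can be sketched without Chroma. In the toy code below, a cyclic symmetry constraint rebuilds the full system from one asymmetric unit by rotating it about the z-axis, so the output is symmetric by construction, while a restraint leaves the coordinates alone and adds the negative log-probability of a target condition to the energy. All function names are hypothetical, and the probabilities are placeholder values rather than the output of any real classifier.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def rotate_z(p: Point, angle: float) -> Point:
    """Rotate a point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])


def cyclic_constraint(unit: List[Point], n: int) -> List[List[Point]]:
    """Constraint: tessellate one asymmetric unit into n copies with C_n symmetry."""
    return [[rotate_z(p, 2.0 * math.pi * k / n) for p in unit] for k in range(n)]


def restrained_energy(energy: float, prob_target: float) -> float:
    """Restraint: add -log p(target | structure) to the current energy."""
    if not 0.0 < prob_target <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return energy - math.log(prob_target)


unit = [(5.0, 0.0, 0.0), (6.0, 1.0, 2.0)]
trimer = cyclic_constraint(unit, n=3)

# Rotating the whole trimer by 120 degrees maps chain k onto chain k + 1.
rotated = [[rotate_z(p, 2.0 * math.pi / 3) for p in chain] for chain in trimer]

# A structure that the (placeholder) classifier favors is penalized less.
print(restrained_energy(0.0, 0.9), restrained_energy(0.0, 0.1))
```

The constraint changes which states are reachable at all, whereas the restraint only reweights them, which mirrors the distinction between restricting the sampling domain and biasing the distribution.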
How it works. The central idea of Conditioners is composable state transformations, where each Conditioner is a function that modifies the state and/or energy of a protein system in a differentiable way (Supplementary Appendix M). For example, to encode symmetry as a constraint, we can take as input the asymmetric unit and tessellate it according to the desired symmetry group to output a protein system that is symmetric by construction. To encode something like a neural network restraint, we can adjust the total system energy by the negative log probability of the target condition. For both of these, we add on the diffus
