Evolutionary Scale Modeling

Update April 2023: Code for the two simultaneous preprints on protein design is now released! Code for "Language models generalize beyond natural proteins" is under examples/lm-design/. Code for "A high-level programming language for generative protein design" is under examples/protein-programming-language/.

This repository contains code and pre-trained weights for Transformer protein language models from the Meta Fundamental AI Research Protein Team (FAIR), including our state-of-the-art ESM-2 and ESMFold, as well as MSA Transformer, ESM-1v for predicting variant effects and ESM-IF1 for inverse folding. Transformer protein language models were introduced in the 2019 preprint of the paper "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences". ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks. ESMFold harnesses the ESM-2 language model to generate accurate structure predictions end to end directly from the sequence of a protein.

In November 2022, we released v0 of the ESM Metagenomic Atlas, an open atlas of 617 million predicted metagenomic protein structures. The Atlas was updated in March 2023 in collaboration with EBI. The new v2023_02 adds another 150 million predicted structures to the Atlas, as well as pre-computed ESM2 embeddings. Bulk download, the blog post and the resources provided on the Atlas website are documented in this README.

In December 2022, we released two simultaneous preprints on protein design.

"Language models generalize beyond natural proteins" (PAPER, CODE) uses ESM2 to design de novo proteins. The code and data associated with the preprint can be found under examples/lm-design/.
"A high-level programming language for generative protein design" (PAPER, CODE) uses ESMFold to design proteins according to a high-level programming language.

<details><summary>Citation</summary>

For ESM2, ESMFold and ESM Atlas:

```bibtex
@article{lin2023evolutionary,
  title = {Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author = {Zeming Lin and Halil Akin and Roshan Rao and Brian Hie and Zhongkai Zhu and Wenting Lu and Nikita Smetanin and Robert Verkuil and Ori Kabeli and Yaniv Shmueli and Allan dos Santos Costa and Maryam Fazel-Zarandi and Tom Sercu and Salvatore Candido and Alexander Rives},
  journal = {Science},
  volume = {379},
  number = {6637},
  pages = {1123-1130},
  year = {2023},
  doi = {10.1126/science.ade2574},
  URL = {https://www.science.org/doi/abs/10.1126/science.ade2574},
  note = {Earlier versions as preprint: bioRxiv 2022.07.20.500902},
}
```

For transformer protein language models:

@article{rives2021biological,
  title={Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences},
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C Lawrence and Ma, Jerry and others},
  journal={Proceedings of the National Academy of Sciences},
  volume={118},
  number={15},
  pages={e2016239118},
  year={2021},
  publisher={National Acad Sciences},
  note={bioRxiv 10.1101/622803},
  doi={10.1073/pnas.2016239118},
  url={https://www.pnas.org/doi/full/10.1073/pnas.2016239118},
}

</details> <details open><summary>Table of contents</summary>

Main models you should use
Usage
ESM Metagenomic Atlas
Notebooks
Available Models and Datasets
Citations
License

</details> <details><summary>What's New</summary>

April 2023: Code for the two protein design preprints released under examples/lm-design/ and examples/protein-programming-language/.
March 2023: We release an update to the ESM Metagenomic Atlas, v2023_02. See website and bulk download details.
December 2022: The Meta Fundamental AI Research Protein Team (FAIR) released two simultaneous preprints on protein design: "Language models generalize beyond natural proteins" (Verkuil, Kabeli, et al., 2022), and "A high-level programming language for generative protein design" (Hie, Candido, et al., 2022).
November 2022: ESM Metagenomic Atlas, a repository of 600M+ predicted metagenomic structures, released. See website and bulk download details.
November 2022: ESMFold - new end-to-end structure prediction model released (see Lin et al. 2022)
August 2022: ESM-2 - new SOTA Language Models released (see Lin et al. 2022)
April 2022: New inverse folding model ESM-IF1 released, trained on CATH and UniRef50 predicted structures.
August 2021: Added flexibility to tokenizer to allow for spaces and special tokens (like <mask>) in sequence.
July 2021: New pre-trained model ESM-1v released, trained on UniRef90 (see Meier et al. 2021).
July 2021: New MSA Transformer released, with a minor fix in the row positional embeddings (ESM-MSA-1b).
Feb 2021: MSA Transformer added (see Rao et al. 2021). Example usage in notebook.
Dec 2020: Self-Attention Contacts for all pre-trained models (see Rao et al. 2020)
Dec 2020: Added new pre-trained model ESM-1b (see Rives et al. 2019 Appendix B)
Dec 2020: ESM Structural Split Dataset (see Rives et al. 2019 Appendix A.10)

</details>

Main models you should use <a name="main-models"></a>

| Shorthand | esm.pretrained. | Dataset | Description |
|-----------|-----------------|---------|-------------|
| ESM-2 | esm2_t36_3B_UR50D() esm2_t48_15B_UR50D() | UR50 (sample UR90) | SOTA general-purpose protein language model. Can be used to predict structure, function and other protein properties directly from individual sequences. Released with Lin et al. 2022 (Aug 2022 update). |
| ESMFold | esmfold_v1() | PDB + UR50 | End-to-end single sequence 3D structure predictor (Nov 2022 update). |
| ESM-MSA-1b | esm_msa1b_t12_100M_UR50S() | UR50 + MSA | MSA Transformer language model. Can be used to extract embeddings from an MSA. Enables SOTA inference of structure. Released with Rao et al. 2021 (ICML'21 version, June 2021). |
| ESM-1v | esm1v_t33_650M_UR90S_1() ... esm1v_t33_650M_UR90S_5() | UR90 | Language model specialized for prediction of variant effects. Enables SOTA zero-shot prediction of the functional effects of sequence variations. Same architecture as ESM-1b, but trained on UniRef90. Released with Meier et al. 2021. |
| ESM-IF1 | esm_if1_gvp4_t16_142M_UR50() | CATH + UR50 | Inverse folding model. Can be used to design sequences for given structures, or to predict functional effects of sequence variation for given structures. Enables SOTA fixed backbone sequence design. Released with Hsu et al. 2022. |

For a complete list of available models, with details and release notes, see Pre-trained Models.
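As a rough illustration of how the `esm.pretrained.` column in the table above maps to code, the sketch below looks up a loader by its shorthand and calls it. It assumes the `esm` package from this repo is installed. The `LOADERS` dictionary and the `load_model` helper are hypothetical names introduced here, not part of the library. ESMFold is left out because, as far as we can tell, it is loaded and called differently from the language models.

```python
# Shorthand -> loader name in esm.pretrained, copied from the table above.
# Where the table lists several loaders for one shorthand, one is picked.
# LOADERS and load_model are illustrative names, not part of the esm package.
LOADERS = {
    "ESM-2": "esm2_t36_3B_UR50D",
    "ESM-MSA-1b": "esm_msa1b_t12_100M_UR50S",
    "ESM-1v": "esm1v_t33_650M_UR90S_1",
    "ESM-IF1": "esm_if1_gvp4_t16_142M_UR50",
}


def load_model(shorthand):
    """Load a pretrained model and its alphabet by table shorthand.

    Downloads the weights on first use, which can be very large.
    """
    import esm  # imported lazily so this module can be read without esm installed

    loader = getattr(esm.pretrained, LOADERS[shorthand])
    model, alphabet = loader()
    model.eval()  # disable dropout for deterministic inference
    return model, alphabet
```

With a model loaded this way, `alphabet.get_batch_converter()` gives a callable that turns `(label, sequence)` pairs into token tensors for the model. Some models may need extra dependencies beyond the base install.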

Usage <a name="usage"></a>

Quick start <a name="quickstart"></a>

An easy way to get started is to load ESM or ESMFold through the HuggingFace transformers library, which has simplified the ESMFold dependencies and provides a standardized API and tools to work with state-of-the-art pretrained models.
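A minimal sketch of the HuggingFace route, assuming the `transformers` and `torch` packages are installed. The checkpoint name `facebook/esm2_t6_8M_UR50D` is not mentioned in this README; it is an assumption, chosen because it is small and quick to download. The function name `embed_with_transformers` is hypothetical.

```python
def embed_with_transformers(sequence, checkpoint="facebook/esm2_t6_8M_UR50D"):
    """Return per-token embeddings for one protein sequence.

    The checkpoint default is an assumption; substitute the ESM-2 size you need.
    """
    # Imported lazily so the definition itself has no heavy dependencies.
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = EsmModel.from_pretrained(checkpoint)
    model.eval()  # inference mode, no dropout

    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Shape: (number of tokens, hidden size) for the single input sequence.
    return outputs.last_hidden_state[0]
```

Calling `embed_with_transformers("MKTAYIAKQR")` should return one vector per token, including the special start and end tokens that the tokenizer adds around the sequence.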

Alternatively, ColabFold has integrated ESMFold so that you can easily run it directly in the browser on a Google Colab instance.

We also provide an API which you can access through curl or on the ESM Metagenomic Atlas web page.

curl -X POST --data "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL" https://api.esmatlas.com/foldSequence/v1/pdb/
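The same request can be sketched in Python with only the standard library. The endpoint URL is the one from the curl command above. The helper names `build_fold_request` and `fold_sequence`, the whitespace stripping, the uppercasing and the timeout value are assumptions made for this example, and the response is assumed to be PDB-formatted text, as the `/pdb/` path suggests.

```python
import urllib.request

# Endpoint taken from the curl example above.
FOLD_URL = "https://api.esmatlas.com/foldSequence/v1/pdb/"


def build_fold_request(sequence):
    """Build a POST request whose body is the raw amino acid sequence."""
    # Drop whitespace and uppercase (an assumption about what the API expects).
    cleaned = "".join(sequence.split()).upper()
    if not (cleaned.isascii() and cleaned.isalpha()):
        raise ValueError("sequence must contain only ASCII letters")
    return urllib.request.Request(
        FOLD_URL, data=cleaned.encode("ascii"), method="POST"
    )


def fold_sequence(sequence, timeout=60):
    """Send the sequence to the API and return the response body as text."""
    request = build_fold_request(sequence)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Splitting request construction from sending makes the payload easy to inspect without touching the network. `fold_sequence` raises `urllib.error.URLError` or `urllib.error.HTTPError` if the service is unreachable or rejects the request.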

For ESM-MSA-1b, ESM-IF1, or any of the other models you can use the original implementation from our repo directly via the instructions below.
