Cosmis
COSMIS is a framework for quantifying the mutational constraint on amino acid sites in 3D spatial neighborhoods. The framework currently maps the landscape of 3D mutational constraint on 6.1 million amino acid sites covering >80% (16,533) of human proteins.
COSMIS
COSMIS is a novel framework for quantifying the 3D mutational constraint on amino acid sites in the human proteome. If you find COSMIS useful in your work, please consider citing the following paper:
- Li, B., Roden, D.M., and Capra, J.A. The 3D mutational constraint on amino acid sites in the human proteome. Nat. Commun. 13, 3273 (2022).
What's in each folder?
cosmis
The cosmis folder contains utility code that the top-level COSMIS scripts (cosmis.py, cosmis_batch.py, and cosmis_sp.py) depend on.
figure-code-data
The figure-code-data folder contains standalone R code that can be run to reproduce the figures in the main text and supplementary document of the manuscript.
cosmis-scores
The cosmis-scores folder contains precomputed scores for all 16,533 proteins of the human reference proteome currently covered by the framework.
helpers
The helpers folder contains utility scripts written in Python that were called to obtain processed datasets in the mapping-files folder.
structures
The structures folder contains all protein structures, in PDB format, on which the COSMIS scores were computed.
supplementary-data
The supplementary-data folder contains all supplementary tables referred to in the published COSMIS paper.
scripts
The scripts folder contains the main application scripts that can be run to compute COSMIS scores depending on use cases.
Using the COSMIS framework
We recommend that users of the COSMIS framework download the precomputed COSMIS scores from this repository. However, should you need to run COSMIS using custom-built protein structural models, or to compute COSMIS scores based on protein-protein complexes, please follow the steps below. Also note that we are in the process of repackaging COSMIS into an installable Python package. Please check back later for updates.
Clone COSMIS
Clone COSMIS to a local directory.
git clone https://github.com/CapraLab/cosmis.git
Note that cloning might fail as this repository is tracked with Git Large File Storage and is over the data quota currently allowed by Git LFS. Please check later as we are sorting out this quota issue.
Install from source
To install cosmis from source, go to the directory where cosmis was cloned and run the following command:
python3 setup.py install
export PYTHONPATH="$PYTHONPATH:</path/to/cosmis>"
Replace </path/to/cosmis> with the actual path to cosmis on your system. You should then be able to locate the main application scripts in ./build/scripts-3.8 or ./build/scripts-3.9, depending on the Python version on your system. You can copy these scripts to a convenient location of your choice to run them.
Run within a virtual environment
One can also run cosmis application scripts within a virtual environment. It is easiest to set up a separate conda environment to install all required packages and to run these scripts. All required packages can be installed when creating the conda environment, using the following commands:
# run this command to create a conda environment
conda create --name cosmis --file requirements.txt
# activate the environment
conda activate cosmis
# then also use pip to install wget under the environment
pip install wget
Miniconda or Anaconda must be installed before running the commands above.
Download required datasets
Some of the required datasets are already available within this repository (in the database_files folder). However, due to limits on file size, we had to make larger files available through other means. Follow the steps below to get all required input datasets.
- Get all transcript coding sequences from Ensembl (this dataset is already available in database_files).
wget http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
- Get the amino acid sequences of the human reference proteome as annotated by UniProt (this dataset is already available in database_files).
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
- Download counts of unique variants at each amino acid position for all gnomAD annotated transcripts.
We preprocessed the gnomAD database and created a JSON formatted file that maps Ensembl stable transcript IDs to unique variant counts and variant types (missense or synonymous) for all positions in the human proteome where a SNP variant was annotated by gnomAD. We have made this dataset available through FigShare.
- Get the mapping table from UniProt protein IDs to Ensembl stable transcript IDs (this dataset is already available in database_files).
A mapping from UniProt protein IDs to Ensembl stable transcript IDs is also required to run COSMIS. We have created such a mapping table and made it available through FigShare.
- Get transcript-level mutation probabilities (this dataset is already available in database_files).
Transcript-level mutation probabilities are required to run COSMIS. You can get them from FigShare.
Run COSMIS
Depending on whether you want to run COSMIS on a single monomeric or homo-multimeric protein, or on a list of monomeric proteins, the script and setup are slightly different.
- Run COSMIS on a single monomeric or homo-multimeric protein.
1.1 Use the following JSON formatted template to supply paths to database files.
{
"ensembl_cds": "/path/to/Homo_sapiens.GRCh38.cds.all.fa.gz",
"uniprot_pep": "/path/to/UP000005640_9606.fasta.gz",
"gnomad_variants": "/path/to/gnomad_filtered/gnomad_variant_counts_hg38.json",
"uniprot_to_enst": "/path/to/uniprot_to_enst.json",
"enst_mp_counts": "/path/to/mutation_probs.tsv"
}
Then, save the file as data_paths.json, for example.
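Before launching a run, it can help to confirm that the config file parses as JSON and that every path in it points to an existing file. The following is a minimal sketch, not part of COSMIS itself: the function name check_config is hypothetical, and the assumption that all five keys from the template are mandatory is ours.

```python
import json
from pathlib import Path

# Keys copied from the template above. That cosmis_sp.py requires every one
# of them is an assumption of this sketch.
REQUIRED_KEYS = [
    "ensembl_cds",
    "uniprot_pep",
    "gnomad_variants",
    "uniprot_to_enst",
    "enst_mp_counts",
]


def check_config(config_path):
    """Return a list of problems found in a COSMIS-style config file."""
    with open(config_path) as handle:
        # json.load raises json.JSONDecodeError if the file is not valid JSON
        config = json.load(handle)

    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not Path(config[key]).is_file():
            problems.append(f"file not found for {key}: {config[key]}")
    return problems
```

Calling check_config("data_paths.json") returns an empty list when every key is present and every path resolves to a file.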
1.2 If you'd like to compute COSMIS scores WITHOUT accounting for contacts from neighboring subunits, run this command:
python cosmis_sp.py -c data_paths.json -u <UniProt ID> -p <PDB file> --chain <chain ID of subunit> -o monomeric_cosmis.tsv
1.3 If you'd like to compute COSMIS scores accounting for contacts from neighboring subunits, add --multimer to the command above, i.e.
python cosmis_sp.py -c data_paths.json -u <UniProt ID> -p <PDB file> --chain <chain ID of subunit> -o multimeric_cosmis.tsv --multimer
1.4 One can also run the following command to compute COSMIS scores for a subunit which is part of a hetero-oligomeric protein complex. However, cosmis_complex.py has not been thoroughly tested or benchmarked. Interpret the results with caution and let us know if you find anything buggy.
python cosmis_complex.py -c data_paths.json -i <chain_to_uniprot_mapping file> -p <PDB file> --chain <chain ID of subunit> -o <output file>
# example
cd examples/
python ../cosmis_complex.py -c cosmis_config.json -i KCNQ1_chain_to_uniprot.txt -p KCNQ1.pdb -o KCNQ1_cosmis.tsv --chain A
- Run COSMIS on a list of monomeric proteins whose structures were obtained from the AlphaFold database or the SWISS-MODEL repository.
2.1 Use the following JSON formatted template to supply paths to database files.
{
"ensembl_cds": "/path/to/Homo_sapiens.GRCh38.cds.all.fa.gz",
"uniprot_pep": "/path/to/UP000005640_9606.fasta.gz",
"gnomad_variants": "/path/to/gnomad_filtered/gnomad_variant_counts_hg38.json",
"uniprot_to_enst": "/path/to/uniprot_to_enst.json",
"enst_mp_counts": "/path/to/mutation_probs.tsv",
"pdb_dir": "/path/to/pdb_files/",
"output_dir": "./"
}
Then, save the file as data_paths.json, for example.
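One way to avoid syntax slips when hand-editing the batch config is to generate it with Python's json module. The sketch below is illustrative only: write_batch_config is a hypothetical helper, and it assumes all database files sit directly in a single directory (db_dir) under the file names shown in the template. Adjust the paths if your layout differs.

```python
import json
from pathlib import Path


def write_batch_config(db_dir, pdb_dir, output_dir, config_path="data_paths.json"):
    """Write a batch-mode config using the keys from the template above."""
    db_dir = Path(db_dir)
    # Assumption: every database file lives directly inside db_dir.
    config = {
        "ensembl_cds": str(db_dir / "Homo_sapiens.GRCh38.cds.all.fa.gz"),
        "uniprot_pep": str(db_dir / "UP000005640_9606.fasta.gz"),
        "gnomad_variants": str(db_dir / "gnomad_variant_counts_hg38.json"),
        "uniprot_to_enst": str(db_dir / "uniprot_to_enst.json"),
        "enst_mp_counts": str(db_dir / "mutation_probs.tsv"),
        "pdb_dir": str(pdb_dir),
        "output_dir": str(output_dir),
    }
    with open(config_path, "w") as handle:
        # json.dump always emits syntactically valid JSON
        json.dump(config, handle, indent=2)
    return config
```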
2.2 If the structures were obtained from the AlphaFold database, run the following command:
python cosmis_batch.py -c data_paths.json -i <input.txt> -d AlphaFold -l af_cosmis.log
The input file input.txt contains on each line a pair of UniProt ID and PDB filename. For example
A0A024R1R8 A0/A0/AF-A0A024R1R8-F1-model_v1.pdb
A0A024RBG1 A0/A0/AF-A0A024RBG1-F1-model_v1.pdb
A0A024RCN7 A0/A0/AF-A0A024RCN7-F1-model_v1.pdb
A0A075B6H5 A0/A0/AF-A0A075B6H5-F1-model_v1.pdb
A0A075B6H7 A0/A0/AF-A0A075B6H7-F1-model_v1.pdb
A0A075B6H8 A0/A0/AF-A0A075B6H8-F1-model_v1.pdb
A0A075B6H9 A0/A0/AF-A0A075B6H9-F1-model_v1.pdb
A0A075B6I0 A0/A0/AF-A0A075B6I0-F1-model_v1.pdb
A0A075B6I1 A0/A0/AF-A0A075B6I1-F1-model_v1.pdb
A0A075B6I3 A0/A0/AF-A0A075B6I3-F1-model_v1.pdb
assuming that the base directory where the PDB files are stored is /path/to/pdb_files/.
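For a large set of AlphaFold models, the input file can be generated rather than typed by hand. The sketch below walks the base directory and emits one line per model. It assumes file names follow the AF-<UniProt ID>-F<n>-model_v<k>.pdb pattern seen in the example and that the two fields are separated by a single space as displayed above. The function name alphafold_input_lines is hypothetical, and whether cosmis_batch.py accepts other separators is not stated here.

```python
import re
from pathlib import Path

# Pattern inferred from the example file names above (an assumption).
AF_NAME = re.compile(r"^AF-(?P<uniprot>[A-Z0-9]+)-F\d+-model_v\d+\.pdb$")


def alphafold_input_lines(pdb_dir):
    """Yield 'UniProtID relative/path.pdb' lines for models under pdb_dir."""
    pdb_dir = Path(pdb_dir)
    for path in sorted(pdb_dir.rglob("AF-*.pdb")):
        match = AF_NAME.match(path.name)
        if match:
            # Paths are written relative to the base directory, as in the example
            relative = path.relative_to(pdb_dir).as_posix()
            yield f"{match.group('uniprot')} {relative}"
```

Writing the result with open("input.txt", "w").write("\n".join(alphafold_input_lines("/path/to/pdb_files/"))) would produce a file in the format shown above.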
2.3 If the structures were obtained from the SWISS-MODEL repository, run the following command:
python cosmis_batch.py -c data_paths.json -i <input.txt> -d SWISS-MODEL -l swiss_model_cosmis.log
The input file input.txt contains on each line a pair of UniProt ID and PDB filename. For example
A8MWA4 A8/MW/A4/swissmodel/109_299_5v3m.1.C.pdb
A8MWD9 A8/MW/D9/swissmodel/3_76_4wzj.3.G.pdb
A8MWL7 A8/MW/L7/swissmodel/11_110_2loo.1.A.pdb
A8MX76 A8/MX/76/swissmodel/21_683_3bow.1.A.pdb
A8MXE2 A8/MX/E2/swissmodel/68_331_7jhi.1.A.pdb
A8MXQ7 A8/MX/Q7/swissmodel/136_377_4qfv.1.A.pdb
A8MXT2 A8/MX/T2/swissmodel/95_311_2wa0.1.A.pdb
A8MXU0 A8/MX/U0/swissmodel/23_60_2lwl.1.A.pdb
A8MXY4 A8/MX/Y4/swissmodel/200_532_5v3m.1.C.pdb
A8MYX2 A8/MY/X2/swissmodel/110_172_5cwg.1.A.pdb
again, assuming that the base directory where the PDB files are stored is /path/to/pdb_files/.
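In the SWISS-MODEL example above, each relative path appears to be built by splitting the six-character UniProt ID into three two-character directories, followed by a swissmodel folder. The sketch below reproduces that layout as inferred from the examples. Both function names are hypothetical, and the behavior for longer (10-character) accessions is not shown in the examples, so the sketch rejects them rather than guess.

```python
def swissmodel_relative_path(uniprot_id, model_filename):
    """Build a relative path mirroring the SWISS-MODEL examples above."""
    if len(uniprot_id) != 6:
        # The examples only illustrate 6-character accessions.
        raise ValueError("layout is only illustrated for 6-character accessions")
    parts = [uniprot_id[0:2], uniprot_id[2:4], uniprot_id[4:6]]
    return "/".join(parts + ["swissmodel", model_filename])


def swissmodel_input_line(uniprot_id, model_filename):
    """Format one line of input.txt: UniProt ID, a space, the relative path."""
    return f"{uniprot_id} {swissmodel_relative_path(uniprot_id, model_filename)}"
```

For instance, swissmodel_input_line("A8MWA4", "109_299_5v3m.1.C.pdb") yields the first line of the example listing.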
