MutDTA
Improving the precision oncology pipeline by providing binding affinity purtubations predictions on a pirori identified cancer driver genes.
Install / Use
/learn @kumarlab-compomics/MutDTAREADME
H4H Directory structure and where to find things:
H4H:
Everything important (e.g.: model weights) should be accessible through the shared project directory.
I primarily used my home directory to store the code since I could sync it up with GitHub whereas on the shared project directory we have no internet access to perform “git pull/push” operations.
Thus for files/folders that were too large to store in HOME, I used symbolic links to folders located in the shared project directory. However, I forget exactly how I laid everything out and I no longer have access to the VPN to connect and check.
Nonetheless, all the important stuff we would need, like model checkpoints, should be stored in the shared project directory.
GitHub - https://github.com/jyaacoub/MutDTA/tree/main
Training splits can be found on the GitHub page as well as all my most recent code.
Training on new datasets or new models
- For training on existing data you would use the train_folds.sh script, depending on your comfort with editing the existing python scripts it might be a bit difficult to set up. But you just need to define a new "model_opt" in src/utils/loader.py, and add that model key to the list of options in src/utils/config.py.
- If you make any changes to the input model features this would make things a lot harder since this is essentially building a new dataset with those features and would need to add instructions on how to set that up for protein features, protein edges, and ligand features.
- For entirely new datasets this is more challenging since you basically need to build a new Dataset subclass (inherited from the BaseDataset class) - see PlatinumDataset for a good example on this (it is the cleanest of the 3 dataset classes I have).
GitHub issues
All the issues we encountered with this project are tracked via GitHub. I list some of the more relevant issues below:
Summary of model checkpoints/issues (found in MutDTA/results/):
Basically the only ones that matter are results/model_checkpoints and v103. The rest are just some tests I did to resolve/debug issues.
results/model_checkpoints- These are the models trained on random splitsv103- pocket-only representation checkpointsv113- new training split where we excluded highly targeted (OncoKB) proteins from training.- This leads to consistently worse performance across the board.
v115- since "aflow" (alphaflow edge weights) models had a smaller dataset (due to memory issues when running Alphaflow on AA sequences 1200+) we artificially reduced the sizes of the training sets for the other models so that we could have a fair comparison- This didn't change much.
v128- Test to see if new splits were the issue with weirdly low performance with oncoKB split (they were)
OncoKB distribution drift issue with splits - Issue #131
When we originally started looking into OncoKB I selected highly targeted proteins from OncoKB to be excluded from training sets.
- This caused a big distribution drift issue and resulted in much worse performance, particularly with PDBbind.
Stats on the distribution differences between the manually curated oncokb dataset split vs a random split can be found on the issue page.
- click the details button to see figures.
Missing Amino Acids in PDBs for PDBbind - Issue#102
This means for the pocket versions of our models we can’t readily use existing scripts to get the pocket sequence graph based on the PDBs provided.
- It is possible to fix this, but it needs a LOT of effort since we would also need to retrain the PDBbind models that used graphs with the missing residues.
Pocket representation version of our models - Issue#103
This tracks how the pocket representation of Davis and Kiba models was built. The pull request 135 resolves this with the results in the CSV files.
