Moleculenet

MSc Dissertation: Estimating Uncertainty in Machine Learning Models for Drug Discovery

Estimating Uncertainty in Machine Learning Models for Drug Discovery

Project Details

Title: Estimating Uncertainty in Machine Learning Models for Drug Discovery
Type: MSc dissertation
Author: George Batchkala, https://www.linkedin.com/in/george-batchkala/
Supervisor: Professor Garrett M. Morris, garrett.morris@dtc.ox.ac.uk
Institution: University of Oxford
Department: Department of Statistics, 24-29 St Giles', Oxford, OX1 3LB
Project dates: June 1st, 2020 - September 14th, 2020
Data: MoleculeNet, Physical Chemistry Datasets (http://moleculenet.ai/datasets-1)
GitHub repository: https://github.com/GeorgeBatch/moleculenet

This repository contains all the code, results, and plots I produced while completing my MSc dissertation. The PDF file with the full dissertation will be uploaded once it has been marked and I have officially completed my degree.

Abstract

"My model says that I had just found an ultimate drug. Can I trust it?"

In this work, I explore ways of quantifying the confidence of machine learning models used in drug discovery. To do this, I start by exploring methods to predict physicochemical properties of drugs and drug-like molecules that are crucial to drug discovery. I first attempt to reproduce and improve upon a subset of results concerning a drug's solubility in water, taken from a popular benchmark set called "MoleculeNet". Using XGBoost, which in the era of Deep Neural Networks is already classified as a "conventional" machine learning method, I show that I am able to achieve state-of-the-art results. After that, I explore Gaussian Processes and the Infinitesimal Jackknife for Random Forests, together with their associated uncertainty estimates. Finally, I attempt to understand whether the confidence of a model's prediction can be used to answer a similar but more general question: "How do we know when to trust our models?" The answer depends on the model. We can trust Gaussian Processes when they are confident, but the confidence estimates from Random Forests do not give us any assurance.

Related work

This work is mostly based on four papers:

"MoleculeNet: A Benchmark for Molecular Machine Learning" by Wu et al.;
"Learning From the Ligand: Using Ligand-Based Features to Improve Binding Affinity Prediction" by Boyles et al.;
"The Photoswitch Dataset: A Molecular Machine Learning Benchmark for the Advancement of Synthetic Chemistry" by Thawani et al.; and
"Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife" by Wager et al.

Aims

In this dissertation I aim to achieve three primary goals:

Reproduce a subset of solubility-related prediction results from the MoleculeNet benchmarking paper;
Improve upon the reproduced results; and
Apply uncertainty estimation methods to the best-performing models to obtain per-prediction uncertainty estimates, then evaluate and compare these methods.

Data

I used the MoleculeNet dataset collection, which accompanies the MoleculeNet benchmarking paper. In particular, I focused on the Physical Chemistry datasets: ESOL, FreeSolv, and Lipophilicity. The MoleculeNet datasets are widely used to validate machine learning models that estimate a particular property directly from small molecules, including drug-like compounds.

The Physical Chemistry datasets can be downloaded from the MoleculeNet benchmark dataset collection (http://moleculenet.ai/datasets-1).

Models

I use the following four models for the regression task of physicochemical property prediction:

Obtaining Confidence Intervals

I obtained per-prediction confidence intervals with:

Gaussian Processes (notes, chapter 7, section 7.2)
Bias-corrected Infinitesimal Jackknife estimate for Random Forests (paper)
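
To make the second method more concrete, here is a minimal pure-Python sketch of the infinitesimal jackknife variance for a bagged predictor at a single test point, following my reading of the formula in Wager et al. It is not the forestci code used in this project, whose implementation may differ in details. The function names, the toy in-bag counts, and the normal-approximation interval are all assumptions made for illustration.

```python
from statistics import NormalDist, fmean


def ij_variance(inbag, tree_preds, bias_correct=True):
    """Infinitesimal jackknife variance of a bagged prediction at one test point.

    inbag[b][i]   -- how many times training sample i was drawn for tree b
    tree_preds[b] -- prediction of tree b at the test point
    """
    n_trees = len(tree_preds)
    n_samples = len(inbag[0])
    mean_pred = fmean(tree_preds)
    centred = [t - mean_pred for t in tree_preds]

    variance = 0.0
    for i in range(n_samples):
        # Covariance over trees between the in-bag count of sample i and the
        # tree prediction; the expected in-bag count under bootstrapping is 1.
        cov_i = sum((inbag[b][i] - 1) * centred[b] for b in range(n_trees)) / n_trees
        variance += cov_i ** 2

    if bias_correct:
        # Monte Carlo bias correction: (n / B^2) * sum_b (t_b - mean)^2
        variance -= n_samples / n_trees ** 2 * sum(c * c for c in centred)
    return variance


def normal_interval(mean, variance, level=0.95):
    """Symmetric normal-approximation interval around a predictive mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # The bias-corrected estimate can be negative for few trees; clip at zero.
    half_width = z * max(variance, 0.0) ** 0.5
    return mean - half_width, mean + half_width


# Toy example (made-up numbers): 3 trees, 2 training samples.
toy_inbag = [[2, 0], [0, 2], [1, 1]]
toy_preds = [1.0, 3.0, 2.0]
raw_var = ij_variance(toy_inbag, toy_preds, bias_correct=False)
corrected_var = ij_variance(toy_inbag, toy_preds)
lower, upper = normal_interval(fmean(toy_preds), corrected_var)
```

The same normal_interval helper can be applied to a Gaussian Process prediction, since a GP returns a predictive mean and variance for each test point directly.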

Implementation

All the data preparation, experiments, and visualisations were done in Python.

To convert molecules from their SMILES string representations to either Molecular Descriptors or Extended-Connectivity Fingerprints, I used the open-source cheminformatics software, RDKit (GitHub).

Wu et al. suggest using their Python library, DeepChem (GitHub), to reproduce the results. I decided not to use it, since its user API gives only high-level access, while I wanted more control over the implementation. To keep the results comparable, I used the tools on which the DeepChem library is built.

I used Scikit-Learn (GitHub) for most of the machine learning pipeline: preprocessing, splitting, modelling, prediction, and validation. To obtain the confidence intervals for Random Forests, I used the forestci (GitHub) extension for Scikit-Learn. I implemented a custom Tanimoto (Jaccard) kernel for Gaussian Process Regression and ran all subsequent GP experiments with GPflow (GitHub).
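
As an illustration of what a Tanimoto (Jaccard) kernel computes, the sketch below evaluates it on small fingerprint vectors in plain Python. It is not the GPflow kernel from this repository; the function names, the `variance` scaling parameter, the convention for two all-zero fingerprints, and the toy 4-bit fingerprints are assumptions made for this example.

```python
def tanimoto_similarity(x, y):
    """Tanimoto similarity <x, y> / (|x|^2 + |y|^2 - <x, y>) for non-negative vectors.

    For binary fingerprints this equals the Jaccard index of the sets of "on" bits.
    """
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Assumed convention: two all-zero fingerprints are treated as identical.
    return 1.0 if denom == 0 else dot / denom


def tanimoto_kernel_matrix(X, Y=None, variance=1.0):
    """Kernel matrix K[i][j] = variance * tanimoto_similarity(X[i], Y[j])."""
    Y = X if Y is None else Y
    return [[variance * tanimoto_similarity(x, y) for y in Y] for x in X]


# Toy 4-bit fingerprints (made up; real ECFPs are much longer).
fp_a = [1, 1, 0, 1]
fp_b = [1, 0, 0, 1]
fp_c = [0, 0, 1, 0]
K = tanimoto_kernel_matrix([fp_a, fp_b, fp_c])
```

Here fp_a and fp_b share two of the three bits set in either fingerprint, so their similarity is 2/3, while fp_c shares no bits with the others and gets a similarity of 0.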

Set-up

In this section I outline the set-up steps required to start reproducing my results. It covers the following stages:

Directory set-up;
Creating an environment with conda;
Data preparation; and
Creation of features.

Environment

In the root (moleculenet) directory, create a project environment from the environment.yml file using:

>>> conda env create -f environment.yml

The environment is named batch-msc. Activate it using:

>>> conda activate batch-msc

Conda environments make managing Python library dependencies and reproducing research much easier. Another reason for using conda is that some packages, e.g. RDKit: Open-Source Cheminformatics Software, are not available via pip install.

Data preparation

This section covers two data preparation stages: standardising input files and producing the features.

Standardise Names

To automate the process of working with three different datasets (ESOL, FreeSolv, and Lipophilicity), we standardise the column names from the original CSV files and store the results in new CSV files.

We need the ID/name, SMILES string representation, and measured label value of each compound in all three datasets. To do this, run the following commands in the ~/scripts/ directory:

>>> python get_original_id_smiles_labels_lipophilicity.py
>>> python get_original_id_smiles_labels_esol.py
>>> python get_original_id_smiles_labels_freesolv.py

The resulting files are saved in the ~/data/ directory:

esol_original_IdSmilesLabels.csv, esol_original_extra_features.csv
freesolv_original_IdSmilesLabels.csv
lipophilicity_original_IdSmilesLabels.csv

Note: the original file for the ESOL dataset also contains extra features, which we save separately in esol_original_extra_features.csv.
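
The standardisation step can be sketched with the standard-library csv module as below. This is not the code of the get_original_id_smiles_labels_*.py scripts. The raw column headers are my assumption about the ESOL file, the output column names (id, smiles, label) and the function name are hypothetical, and the two toy rows are invented for illustration.

```python
import csv
import io


def standardise_columns(in_file, out_file, id_col, smiles_col, label_col):
    """Copy the ID, SMILES, and label columns to a CSV with fixed column names.

    Returns the number of data rows written.
    """
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(
        out_file, fieldnames=["id", "smiles", "label"], lineterminator="\n"
    )
    writer.writeheader()
    n_rows = 0
    for row in reader:
        writer.writerow(
            {
                "id": row[id_col],
                "smiles": row[smiles_col],
                # Parse the label so that malformed values fail early.
                "label": float(row[label_col]),
            }
        )
        n_rows += 1
    return n_rows


# In-memory stand-in for a raw dataset file (toy rows, assumed headers).
raw = io.StringIO(
    "Compound ID,measured log solubility in mols per litre,smiles\n"
    "mol_a,1.10,CCO\n"
    "mol_b,-1.64,c1ccccc1\n"
)
out = io.StringIO()
n_written = standardise_columns(
    raw,
    out,
    id_col="Compound ID",
    smiles_col="smiles",
    label_col="measured log solubility in mols per litre",
)
```

Giving every dataset the same three column names means the downstream feature and modelling code does not need to know which dataset it is reading.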

Compute and Store Features

We show how to produce the features and store them in CSV files.

From the SMILES string representations of the molecules for all three datasets compute Extended-Connectivity Fingerprints and RDKit Molecular D
