These files are meant to accompany "What are our models really telling us? A practical tutorial on avoiding common mistakes when building predictive models" by W. Patrick Walters

Please address any questions or corrections to pat_walters@vrtx.com

compare_regression.txt - experimental and predicted LogS for a set of compounds using two different predictive models
huuskonen.rdkit - RDKit descriptors for the Huuskonen dataset
huuskonen.sol - experimental solubilities for the Huuskonen dataset
huuskonen_test.rdkit - RDKit descriptors for a random test set created from the Huuskonen dataset
huuskonen_test.smi - SMILES for a random test set built from the Huuskonen dataset
huuskonen_test.sol - experimental solubilities for a random test set created from the Huuskonen dataset
huuskonen_test.txt - similarities between the Huuskonen test set and training set
huuskonen_train.rdkit - RDKit descriptors for a random training set created from the Huuskonen dataset
huuskonen_train.smi - SMILES for a random training set built from the Huuskonen dataset
huuskonen_train.sol - experimental solubilities for a random training set created from the Huuskonen dataset
install_libraries.R - a small R script to install the libraries required by the chapter
jcim.rdkit - RDKit descriptors for the JCIM dataset
jcim.smi - SMILES for the JCIM dataset
jcim.sol - experimental solubilities for the JCIM dataset
listing_1.R - load data and display box plots to compare distributions
listing_2.py - calculate molecular descriptors using the RDKit library
listing_3.R - train and test a random forest model based on the Huuskonen dataset
listing_4.R - train and test a random forest model based on a subset of the Huuskonen dataset
listing_5.R - add simulated error to experimental data to examine the impact of error on correlations
listing_6.py - calculate and report similarity between pairs of SMILES files
listing_7.R - the most similar training set molecule for each test set molecule
listing_8.R - predict activities of 3 test sets for a training set
listing_9.R - calculate errors for regression and plot Pearson r with associated error bars
pubchem.rdkit - RDKit descriptors for the PubChem dataset
pubchem.smi - SMILES for the PubChem dataset
pubchem.sol - experimental solubilities for the PubChem dataset
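
As a rough illustration of the idea behind listing_5.R (adding simulated error to experimental data and watching the correlation drop), here is a minimal Python sketch. It is not the code from the chapter. The function names, the Gaussian error model, the number of trials, and the made-up LogS-like values are all assumptions chosen for illustration, and nothing here reads the data files listed above.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def add_simulated_error(values, sd, rng):
    """Return a copy of values with Gaussian noise (mean 0, std dev sd) added."""
    return [v + rng.gauss(0.0, sd) for v in values]


def mean_r_with_error(values, sd, trials=100, seed=0):
    """Average Pearson r between values and noisy copies of themselves.

    Assumption: error is modeled as independent Gaussian noise; the
    chapter's listing may use a different error model.
    """
    rng = random.Random(seed)
    rs = [
        pearson_r(values, add_simulated_error(values, sd, rng))
        for _ in range(trials)
    ]
    return sum(rs) / trials


if __name__ == "__main__":
    # Hypothetical LogS-like values, NOT taken from the data files.
    rng = random.Random(42)
    logs = [rng.uniform(-8.0, 0.0) for _ in range(200)]
    for sd in (0.0, 0.3, 0.6, 1.0):
        print(f"error sd = {sd:.1f}  mean r = {mean_r_with_error(logs, sd):.3f}")
```

Running the script prints the mean correlation for each assumed error level. With this setup, larger simulated error should give a lower mean r, which is the effect listing_5.R is described as examining.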
