HiTEA

The High Throughput Experimentation Analyzer (HiTEA) modules from Probing the Chemical "Reactome" with High Throughput Experimentation Data, as described in King-Smith et al. Please note: analysis for the manuscript was performed with the R modules (.R files in the R_modules directory); however, Python modules for the ANOVA, Tukey, and Random Forest tests are also provided. Because the R and Python implementations of two-way ANOVA and Tukey HSD differ, some minor differences in results may be observed.

Installing Jupyter Notebook

Some modules (.ipynb files) are Jupyter notebooks and require an installation of JupyterLab. See the official website for details on installing and running a notebook on your specific system.

Modules of HiTEA

HiTEA is made up of several modules that can be used on their own or in conjunction with one another. Before any analysis can be performed, the dataset(s) must be cleaned. An example of how to do this is laid out in Example_Dataset_Cleanup.ipynb.

Variable Importances

Each of the four datasets described in the manuscript has its own module, named Random_Forest_XX.py, where XX is the type of reaction being analyzed. All random forest modules can be run with default parameters, but specific arguments can also be passed. For example:

python Random_Forest_Buchwald.py -d {PATH_TO_YOUR_DATASET} -s {PATH_TO_DESIRED_SAVE_DIRECTORY} -t {SPECIFIC_YEAR_TO_ANALYZE} -n {REMOVE 0% YIELDING REACTIONS TRUE / FALSE}

Specifying a -t argument yields a temporal analysis of that specific year, and specifying -n as True analyzes the dataset without its 0% yielding reactions.
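To make the meaning of the flags concrete, here is a minimal sketch of how -t and -n could be parsed and applied. It is not taken from the HiTEA scripts: the function names, the record keys "year" and "yield", and the toy records are all hypothetical, and the real modules may parse and filter differently.

```python
import argparse


def str_to_bool(text):
    """Turn 'True'/'False' style text into a bool.

    argparse's type=bool would treat any non-empty string, including
    'False', as True, so an explicit converter is used in this sketch.
    """
    lowered = text.strip().lower()
    if lowered in {"true", "t", "yes", "1"}:
        return True
    if lowered in {"false", "f", "no", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {text!r}")


def parse_flags(argv):
    """Parse the four flags shown in the README (destination names are assumptions)."""
    parser = argparse.ArgumentParser(description="Hypothetical mirror of the HiTEA flags")
    parser.add_argument("-d", dest="dataset", default=None, help="path to the dataset")
    parser.add_argument("-s", dest="save_dir", default=".", help="directory for the output csv")
    parser.add_argument("-t", dest="year", type=int, default=None, help="single year to analyze")
    parser.add_argument("-n", dest="remove_zero_yield", type=str_to_bool, default=False,
                        help="drop 0% yielding reactions (True / False)")
    return parser.parse_args(argv)


def select_reactions(rows, year=None, remove_zero_yield=False):
    """Apply the -t and -n filters to a list of reaction records (dicts)."""
    selected = rows
    if year is not None:
        selected = [row for row in selected if row["year"] == year]
    if remove_zero_yield:
        selected = [row for row in selected if row["yield"] > 0]
    return selected


# Toy records with hypothetical keys; real datasets have many more columns.
rows = [
    {"year": 2018, "yield": 0.0},
    {"year": 2019, "yield": 0.0},
    {"year": 2019, "yield": 64.0},
]
args = parse_flags(["-t", "2019", "-n", "True"])
subset = select_reactions(rows, args.year, args.remove_zero_yield)
```

With no flags given, both filters are skipped and every record is kept, which corresponds to running a module with default parameters.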

The output will be saved as a csv file in the specified directory. Note that temporal analyses must use the corresponding dataset XX_with_time.csv, where XX is the type of reaction.
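For readers unfamiliar with the idea, a variable importance ranking can be sketched with scikit-learn as follows. This is not the code in Random_Forest_XX.py: the column names (ligand, base, yield_pct), the toy data, the one-hot encoding, and the use of impurity-based feature_importances_ are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def variable_importances(df, target, random_state=0):
    """Rank the categorical reaction variables by random forest importance."""
    # One-hot encode the categorical variables; "=" separates variable and level.
    features = pd.get_dummies(df.drop(columns=[target]), prefix_sep="=")
    model = RandomForestRegressor(n_estimators=200, random_state=random_state)
    model.fit(features, df[target])
    per_column = pd.Series(model.feature_importances_, index=features.columns)
    # Sum the one-hot columns back into one score per original variable.
    per_variable = per_column.groupby(lambda name: name.split("=")[0]).sum()
    return per_variable.sort_values(ascending=False)


# Toy data in which the ligand, not the base, drives the yield.
reactions = pd.DataFrame({
    "ligand": ["L1", "L1", "L2", "L2", "L3", "L3", "L1", "L2"],
    "base": ["K2CO3", "NaOtBu"] * 4,
    "yield_pct": [80, 75, 20, 25, 50, 55, 78, 22],
})
importances = variable_importances(reactions, "yield_pct")
print(importances)
```

The resulting Series could be written out with importances.to_csv(...), which mirrors the csv output described above.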

ANOVA-Tukey

Similar to the variable importances, four modules for ANOVA-Tukey analysis are provided, named ANOVA_Tukey_XX.py, where XX is the type of reaction being analyzed. Again, all modules can be run with default parameters; temporal analysis or analysis without 0% yielding reactions is selected with the relevant flags. For example:

python ANOVA_Tukey_Buchwald.py -d {PATH_TO_YOUR_DATASET} -s {PATH_TO_DESIRED_SAVE_DIRECTORY} -t {SPECIFIC_YEAR_TO_ANALYZE} -n {REMOVE 0% YIELDING REACTIONS TRUE / FALSE}

Visualizing the Chemical Space

Two modules are provided for visualizing the chemical space: Example_Ligand_PCA.ipynb, which visualizes the ligand space, and Tanimoto_Heatmap.ipynb, which visualizes the reactant space.
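The quantity behind a Tanimoto heatmap is a pairwise similarity matrix. The sketch below computes one in plain Python from sets of "on" fingerprint bits. It is only an illustration: the bit sets and reactant names are made-up placeholders, and the notebook presumably derives real fingerprints with RDKit rather than using hand-written sets.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Two empty fingerprints: returning 0.0 is a choice made for this sketch.
        return 0.0
    return len(bits_a & bits_b) / union


def similarity_matrix(fingerprints):
    """Pairwise Tanimoto similarities as a list of rows (symmetric, 1.0 on the diagonal)."""
    return [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]


# Placeholder fingerprints: each set holds the indices of the bits that are on.
fingerprints = {
    "reactant_A": {1, 2, 3},
    "reactant_B": {2, 3, 4},
    "reactant_C": {7, 8},
}
matrix = similarity_matrix(list(fingerprints.values()))
for name, row in zip(fingerprints, matrix):
    print(name, [round(value, 2) for value in row])
```

A matrix like this could then be passed to matplotlib's imshow to draw the heatmap itself.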

Dataset Release

The full dataset, 8_SEPT_APPROVED_full_dataset.csv, has been released. You can access it in the data directory. Cleaned Buchwald, Ullmann, homogeneous and heterogeneous hydrogenation datasets can be found in data/cleaned_datasets.

Dependencies

All code has been tested on Python 3.7 and R 3.4.4.

Python Dependencies:

Install the following dependencies with: pip install X where X is the name as written below.

scikit-learn
numpy
pandas
matplotlib
mols2grid
adjustText (see its GitHub repo)
ipython
statsmodels

Install RDKit with conda:

conda install -c rdkit rdkit

R Dependencies (if using R modules):

data.table==1.12.2
mltools==0.3.5
stringi==1.4.3

License:

MIT License
