
MulticompSol

Prediction of multicomponent solubility for organic molecules (ML-based)

Install / Use

/learn @BioE-KimLab/MulticompSol

README

MulticompSol Data Repository

This is a data repository associated with the manuscript titled "Enhancing Predictive Models for Solubility in Multi-Solvent Systems using Semi-Supervised Graph Neural Networks" by Hojin Jung‡, Christopher D. Stubbs‡, Sabari Kumar, Raúl Pérez-Soto, Su-min Song, Yeonjoon Kim, and Seonah Kim (‡Equal contribution).

This repository consists of:

  • A novel small molecule solubility database (for solutes in 1-3 solvents)
  • Code to train multicomponent solubility models (for solutes in 1-3 solvents)
  • Code to perform semi-supervised distillation (up to the fifth student model, with thresholds of 0.3 and 1). Here we provide a subset of the COSMO-RS-generated database; contact us for the full data.

Using this Repository

To use the database and models in this repository, you will need a working installation of Python (v3.8-3.10) alongside the required packages (see "Packages Required"). All code was tested on 64-bit Windows 10 and CentOS Stream 8, so it should work on most modern operating systems. Please report any issues with this code on GitHub.

Training Models

  • All model training requires a working Python environment; GPU access with a working CUDA setup is ideal but not necessary (see "Packages Required" and "Using this Repository"). Getting CUDA and TensorFlow to work together on a GPU can be challenging, so the GNN model code falls back to the CPU if a GPU cannot be found.
  • For all GNN models, descriptor generation is included as part of model training. The descriptors used can be changed in gnn_multisol.py (the atom_features, bond_features, and global_features functions); a sketch of what such a featurizer looks like follows this list. Note that changing the number of features will generally require changing the shapes specified in any preprocessor used.
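
For illustration, here is a minimal sketch of what a custom atom featurizer could look like, assuming the nfp-style convention in which atom_features maps an RDKit atom to a hashable token; the signature and feature set in gnn_multisol.py may differ, so treat this as an assumption rather than the repository's implementation.

# Hypothetical atom featurizer in the nfp preprocessor style; consult
# atom_features in gnn_multisol.py for the actual feature set.
from rdkit import Chem

def atom_features(atom: Chem.Atom) -> str:
    # Return a hashable token the preprocessor can index into its vocabulary.
    return str((
        atom.GetSymbol(),
        atom.GetDegree(),
        atom.GetFormalCharge(),
        atom.GetTotalNumHs(),
        atom.GetIsAromatic(),
    ))

Adding or removing features changes the vocabulary and tensor shapes the preprocessor records, which is why the shapes in any saved preprocessor must be updated to match.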

Training GNN Models

  • To train GNN models, first check whether your machine has CUDA and TensorFlow GPU support set up (a quick check snippet follows this list). This is often a machine-specific process that depends on your graphics card, its supported CUDA versions, the CUDA versions installed, and the TensorFlow version installed (among other factors).
  • GPU use is not required for GNN model training, but training may be significantly slower without a GPU.
  • To train GNN models, use the following code snippets as examples (other options are available via the --help flag or the source code).
    • Subgraph Binary: nohup python train_subgraph_binary.py -n "Example_BinarySubgraph" > Log_ExampleBinarySubgraph.txt &
    • Subgraph Ternary: nohup python train_subgraph_ternary.py -n "Example_TernarySubgraph" > Log_Example_TernarySubgraph.txt &
    • Concat Binary: nohup python train_concat_binary.py -n "Example_BinaryConcat" > Log_ExampleBinaryConcat.txt &
    • Concat Ternary: nohup python train_concat_ternary.py -n "Example_TernaryConcat" > Log_Example_TernaryConcat.txt &
  • Trained GNN models will be saved in models/.../model_files. Each folder contains the preprocessor used, the best model (best_model.h5), and the prediction results (kfold_#.csv); a sketch for aggregating the per-fold results follows below.
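
To confirm whether TensorFlow can see a GPU (the check referenced in the first bullet above), the standard TensorFlow device query is enough; this snippet is plain TensorFlow API, not code from this repository.

import tensorflow as tf

# Lists the GPUs visible to TensorFlow; an empty list means training
# will fall back to the CPU.
gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow {tf.__version__}, visible GPUs: {gpus}")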
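
The per-fold kfold_#.csv files can then be aggregated with pandas to summarize cross-validation performance. The column names below ("actual", "predicted") and the model directory are assumptions for illustration; inspect one of the CSVs for the headers the training scripts actually write.

from pathlib import Path

import numpy as np
import pandas as pd

# Compute RMSE per fold and average; column names are hypothetical.
model_dir = Path("models/Example_BinarySubgraph/model_files")
rmses = []
for csv_path in sorted(model_dir.glob("kfold_*.csv")):
    fold = pd.read_csv(csv_path)
    rmses.append(np.sqrt(np.mean((fold["actual"] - fold["predicted"]) ** 2)))
print(f"Mean RMSE over {len(rmses)} folds: {np.mean(rmses):.3f}")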

Loading Models

  • GNN Models
    • Trained GNN models can be loaded from the .h5 file found in /model_files/.../best_model.h5. To load one, import the nfp package and pass nfp.custom_objects as the custom_objects argument to the model load call. Rough example code can be found below.
    • Model results can be found in the same directory as the h5 file, in the csv file named kfold_?.csv, where ? is the fold number for that run (0-4, e.g. kfold_0.csv).
from pathlib import Path

import nfp
import tensorflow as tf

# CustomPreprocessor_NFPx2, atom_features, bond_features, and
# create_tf_dataset_NFPx2 are defined in this repository's training scripts.

def predict_df(df, model_name, csv_file_dir):
    model_dir = Path.cwd() / f'model_files/{model_name}'
    csv_name = Path(csv_file_dir).stem

    # nfp.custom_objects supplies the custom layers Keras needs to
    # deserialize the saved model.
    model = tf.keras.models.load_model(model_dir / 'best_model.h5',
                                       custom_objects=nfp.custom_objects)

    #! Will need to change the preprocessor depending on the model - consult the
    # respective training script (e.g. train_subgraph_binary.py for binary
    # subgraph models).
    preprocessor = CustomPreprocessor_NFPx2(
        explicit_hs=False,
        atom_features=atom_features,
        bond_features=bond_features)
    preprocessor.from_json(model_dir / 'preprocessor.json')

    output_signature = (preprocessor.output_signature,
                        tf.TensorSpec(shape=(), dtype=tf.float32),
                        tf.TensorSpec(shape=(), dtype=tf.float32))

    #! Will need to change the dataset generation function depending on the
    # model - consult the respective training script.
    df_data = (tf.data.Dataset.from_generator(
                   lambda: create_tf_dataset_NFPx2(df, preprocessor, 1.0, False),
                   output_signature=output_signature)
               .cache()
               .padded_batch(batch_size=len(df))
               .prefetch(tf.data.experimental.AUTOTUNE))

    pred_results = model.predict(df_data).squeeze()
    df['predicted'] = pred_results
    return df
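
A hypothetical call might look like the following; the CSV layout is an assumption for illustration and must match whatever create_tf_dataset_NFPx2 expects.

import pandas as pd

# Column layout of the input CSV is assumed here - match the format
# used by the training scripts.
df = pd.read_csv('example_binary_data.csv')
df = predict_df(df, model_name='Example_BinarySubgraph',
                csv_file_dir='example_binary_data.csv')
df.to_csv('example_binary_predictions.csv', index=False)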

Packages Required

All of the following were retrieved from PyPI, but they should also be available on conda-forge. Most model development was done in Python 3.8.13, but the code should work for Python 3.8-3.10 (3.7 may also work, but has not been tested). Note that a few packages require specific versions (nfp, TensorFlow, pandas, RDKit). Other packages have their versions listed for reproducibility, and it is recommended to use those versions when possible.

Utility

  • matplotlib (v3.5.3)
  • seaborn (v0.12.0)
  • JupyterLab (v3.4.5)

Descriptor Generation

  • mordred (v1.2.0)
  • RDKit (v2022.3.5)

ML/Vector Math

  • numpy (v1.23.2)
  • scipy (v1.9.0)
  • pandas (v1.4.3)
  • scikit-learn (v1.1.2; <1.3)
  • tensorflow (v2.9.1)
  • tensorflow-addons (v0.18.0)
  • Keras (v2.9.0)
  • nfp (v0.3.0 exactly)
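
Installed versions can be sanity-checked against these pins with the standard library; a minimal sketch follows (note that, depending on how RDKit was installed, its distribution name may be rdkit or rdkit-pypi).

from importlib.metadata import version

# Spot-check the strictly pinned packages called out above.
for pkg, pinned in [('nfp', '0.3.0'), ('tensorflow', '2.9.1'),
                    ('pandas', '1.4.3'), ('rdkit', '2022.3.5')]:
    print(f'{pkg}: installed {version(pkg)}, pinned {pinned}')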

Filing Issues

Please report any issues or errors with the code on GitHub wherever possible.
