
MulticompSol

Prediction of multicomponent solubility for organic molecules (ML-based)

Install / Use

/learn @BioE-KimLab/MulticompSol

README

MulticompSol Data Repository

This is a data repository associated with the manuscript titled "Enhancing Predictive Models for Solubility in Multi-Solvent Systems using Semi-Supervised Graph Neural Networks" by Hojin Jung‡, Christopher D. Stubbs‡, Sabari Kumar, Raúl Pérez-Soto, Su-min Song, Yeonjoon Kim, and Seonah Kim (‡Equal contribution).

This repository consists of:

  • A novel small molecule solubility database (for solutes in 1-3 solvents)
  • Code to train multicomponent solubility models (for solutes in 1-3 solvents)
  • Code to perform semi-supervised distillation (up to the fifth student model, with thresholds of 0.3 and 1). Here we provide a subset of the COSMO-RS-generated database; contact us for the full data.

Using this Repository

To use the database and models in this repository, you will need a working installation of Python (v3.8-3.10) alongside the required packages (see "Packages Required"). All code was tested on 64-bit Windows 10 and CentOS Stream 8, so it should work on most modern operating systems. Please report any issues with this code on GitHub.

Training Models

  • All model training requires a working Python environment; GPU access with a working CUDA setup is ideal but not necessary (see "Packages Required" and "Using this Repository"). Getting CUDA and TensorFlow to work together on a GPU can be challenging, so the GNN model code falls back to the CPU if a GPU cannot be found.
  • For all GNN models, descriptor generation is included as part of model training. The descriptors used can be changed in gnn_multisol.py (the atom_features, bond_features, and global_features functions); a sketch of what such a featurizer looks like follows this list. Note that changing the number of features will generally require changing the shapes specified in any preprocessor used.
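
For illustration, here is a minimal sketch of what a custom atom featurizer could look like, assuming the nfp-style convention in which atom_features maps an RDKit atom to a hashable token; the signature and feature set in gnn_multisol.py may differ, so treat this as an assumption rather than the repository's implementation.

# Hypothetical atom featurizer in the nfp preprocessor style; consult
# atom_features in gnn_multisol.py for the actual feature set.
from rdkit import Chem

def atom_features(atom: Chem.Atom) -> str:
    # Return a hashable token the preprocessor can index into its vocabulary.
    return str((
        atom.GetSymbol(),
        atom.GetDegree(),
        atom.GetFormalCharge(),
        atom.GetTotalNumHs(),
        atom.GetIsAromatic(),
    ))

Adding or removing features changes the vocabulary and tensor shapes the preprocessor records, which is why the shapes in any saved preprocessor must be updated to match.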

Training GNN Models

  • To train GNN models, first check whether your machine has CUDA and TensorFlow GPU support set up (a quick check snippet follows this list). This is often a machine-specific process that depends on your graphics card, its supported CUDA versions, the CUDA versions installed, and the TensorFlow version installed (among other factors).
  • GPU use is not required for GNN model training, but training may be significantly slower without a GPU.
  • To train GNN models, use the following code snippets as examples (other options are available via the --help flag or the source code).
    • Subgraph Binary: nohup python train_subgraph_binary.py -n "Example_BinarySubgraph" > Log_ExampleBinarySubgraph.txt &
    • Subgraph Ternary: nohup python train_subgraph_ternary.py -n "Example_TernarySubgraph" > Log_Example_TernarySubgraph.txt &
    • Concat Binary: nohup python train_concat_binary.py -n "Example_BinaryConcat" > Log_ExampleBinaryConcat.txt &
    • Concat Ternary: nohup python train_concat_ternary.py -n "Example_TernaryConcat" > Log_Example_TernaryConcat.txt &
  • Trained GNN models will be saved in models/.../model_files. Each folder contains the preprocessor used, the best model (best_model.h5), and the prediction results (kfold_#.csv); a sketch for aggregating the per-fold results follows below.
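
To confirm whether TensorFlow can see a GPU (the check referenced in the first bullet above), the standard TensorFlow device query is enough; this snippet is plain TensorFlow API, not code from this repository.

import tensorflow as tf

# Lists the GPUs visible to TensorFlow; an empty list means training
# will fall back to the CPU.
gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow {tf.__version__}, visible GPUs: {gpus}")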
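
The per-fold kfold_#.csv files can then be aggregated with pandas to summarize cross-validation performance. The column names below ("actual", "predicted") and the model directory are assumptions for illustration; inspect one of the CSVs for the headers the training scripts actually write.

from pathlib import Path

import numpy as np
import pandas as pd

# Compute RMSE per fold and average; column names are hypothetical.
model_dir = Path("models/Example_BinarySubgraph/model_files")
rmses = []
for csv_path in sorted(model_dir.glob("kfold_*.csv")):
    fold = pd.read_csv(csv_path)
    rmses.append(np.sqrt(np.mean((fold["actual"] - fold["predicted"]) ** 2)))
print(f"Mean RMSE over {len(rmses)} folds: {np.mean(rmses):.3f}")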

Loading Models

  • GNN Models
    • Trained GNN models can be loaded from the .h5 file found in /model_files/.../best_model.h5. To load one, import the nfp package and pass nfp.custom_objects as the custom_objects argument to the model load call. Rough example code can be found below.
    • Model results can be found in the same directory as the h5 file, in the csv file named kfold_?.csv, where ? is the fold number for that run (0-4, e.g. kfold_0.csv).
from pathlib import Path

import nfp
import tensorflow as tf

# CustomPreprocessor_NFPx2, atom_features, bond_features, and
# create_tf_dataset_NFPx2 are defined in this repository's training scripts.

def predict_df(df, model_name, csv_file_dir):
    model_dir = Path.cwd() / f'model_files/{model_name}'
    csv_name = Path(csv_file_dir).stem

    # nfp.custom_objects supplies the custom layers Keras needs to
    # deserialize the saved model.
    model = tf.keras.models.load_model(model_dir / 'best_model.h5',
                                       custom_objects=nfp.custom_objects)

    #! Will need to change the preprocessor depending on the model - consult the
    # respective training script (e.g. train_subgraph_binary.py for binary
    # subgraph models).
    preprocessor = CustomPreprocessor_NFPx2(
        explicit_hs=False,
        atom_features=atom_features,
        bond_features=bond_features)
    preprocessor.from_json(model_dir / 'preprocessor.json')

    output_signature = (preprocessor.output_signature,
                        tf.TensorSpec(shape=(), dtype=tf.float32),
                        tf.TensorSpec(shape=(), dtype=tf.float32))

    #! Will need to change the dataset generation function depending on the
    # model - consult the respective training script.
    df_data = (tf.data.Dataset.from_generator(
                   lambda: create_tf_dataset_NFPx2(df, preprocessor, 1.0, False),
                   output_signature=output_signature)
               .cache()
               .padded_batch(batch_size=len(df))
               .prefetch(tf.data.experimental.AUTOTUNE))

    pred_results = model.predict(df_data).squeeze()
    df['predicted'] = pred_results
    return df
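
A hypothetical call might look like the following; the CSV layout is an assumption for illustration and must match whatever create_tf_dataset_NFPx2 expects.

import pandas as pd

# Column layout of the input CSV is assumed here - match the format
# used by the training scripts.
df = pd.read_csv('example_binary_data.csv')
df = predict_df(df, model_name='Example_BinarySubgraph',
                csv_file_dir='example_binary_data.csv')
df.to_csv('example_binary_predictions.csv', index=False)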

Packages Required

All of the following were retrieved from PyPI, but they should also be available on conda-forge. Most model development was done in Python 3.8.13, but the code should work for Python 3.8-3.10 (3.7 may also work, but has not been tested). Note that a few packages require specific versions (nfp, TensorFlow, pandas, RDKit). Other packages have their versions listed for reproducibility, and it is recommended to use those versions when possible.

Utility

  • matplotlib (v3.5.3)
  • seaborn (v0.12.0)
  • JupyterLab (v3.4.5)

Descriptor Generation

  • mordred (v1.2.0)
  • RDKit (v2022.3.5)

ML/Vector Math

  • numpy (v1.23.2)
  • scipy (v1.9.0)
  • pandas (v1.4.3)
  • scikit-learn (v1.1.2; <1.3)
  • tensorflow (v2.9.1)
  • tensorflow-addons (v0.18.0)
  • Keras (v2.9.0)
  • nfp (v0.3.0 exactly)
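
Installed versions can be sanity-checked against these pins with the standard library; a minimal sketch follows (note that, depending on how RDKit was installed, its distribution name may be rdkit or rdkit-pypi).

from importlib.metadata import version

# Spot-check the strictly pinned packages called out above.
for pkg, pinned in [('nfp', '0.3.0'), ('tensorflow', '2.9.1'),
                    ('pandas', '1.4.3'), ('rdkit', '2022.3.5')]:
    print(f'{pkg}: installed {version(pkg)}, pinned {pinned}')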

Filing Issues

Please report any issues or errors with the code on GitHub wherever possible.
