ChemBench
MoleculeNet benchmark dataset & MolMapNet dataset
In case you would like to cite this:
1. MolMapNet Dataset
- The following datasets are reported in the paper <code> <i>"Out-of-the-Box Deep Learning Prediction of Pharmaceutical Properties by Broadly Learned Knowledge-Based Molecular Representations"</i> </code>; see that paper for details of these datasets.
2. Benchmark Datasets in MolNet and Chemprop
These benchmark datasets and their split indices were generated in this repo. The following table summarizes them.
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>task_name</th> <th>task_type</th> <th>n_samples</th> <th>n_task</th> <th>split_method</th> <th>n_cross_split</th> <th>task_metrics</th> </tr> <tr> <th>task_id</th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> </tr> </thead> <tbody> <tr> <th>01</th> <td>ESOL</td> <td>regression</td> <td>1128</td> <td>1</td> <td>random</td> <td>3</td> <td>RMSE</td> </tr> <tr> <th>02</th> <td>FreeSolv</td> <td>regression</td> <td>642</td> <td>1</td> <td>random</td> <td>3</td> <td>RMSE</td> </tr> <tr> <th>03</th> <td>Lipop</td> <td>regression</td> <td>4200</td> <td>1</td> <td>random</td> <td>3</td> <td>RMSE</td> </tr> <tr> <th>04</th> <td>PDBbind-full</td> <td>regression</td> <td>9880</td> <td>1</td> <td>time</td> <td>1</td> <td>RMSE</td> </tr> <tr> <th>05</th> <td>PDBbind-core</td> <td>regression</td> <td>168</td> <td>1</td> <td>time</td> <td>1</td> <td>RMSE</td> </tr> <tr> <th>06</th> <td>PDBbind-refined</td> <td>regression</td> <td>3040</td> <td>1</td> <td>time</td> <td>1</td> <td>RMSE</td> </tr> <tr> <th>07</th> <td>PCBA</td> <td>classification</td> <td>437929</td> <td>128</td> <td>random</td> <td>3</td> <td>PRC_AUC</td> </tr> <tr> <th>08</th> <td>MUV</td> <td>classification</td> <td>93087</td> <td>17</td> <td>random</td> <td>3</td> <td>PRC_AUC</td> </tr> <tr> <th>09</th> <td>HIV</td> <td>classification</td> <td>41127</td> <td>1</td> <td>scaffold</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>10</th> <td>BACE</td> <td>classification</td> <td>1513</td> <td>1</td> <td>scaffold</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>11</th> <td>BBBP</td> <td>classification</td> <td>2039</td> <td>1</td> <td>scaffold</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>12</th> <td>Tox21</td> <td>classification</td> <td>7831</td> <td>12</td> <td>random</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>13</th> <td>ToxCast</td> <td>classification</td> <td>8576</td> <td>617</td> 
<td>random</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>14</th> <td>SIDER</td> <td>classification</td> <td>1427</td> <td>27</td> <td>random</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>15</th> <td>ClinTox</td> <td>classification</td> <td>1478</td> <td>2</td> <td>random</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>16</th> <td>ChEMBL</td> <td>classification</td> <td>456331</td> <td>1310</td> <td>scaffold</td> <td>3</td> <td>ROC_AUC</td> </tr> </tbody></table>
Installation
Direct installation:
pip install git+https://github.com/shenwanxiang/ChemBench.git
Developer installation:
git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
pip install -e .
Usage-1: Load the Dataset and MoleculeNet's Split Indices
from chembench import load_data
df, induces = load_data('ESOL')
# each entry of induces holds the (train, valid, test) indices of one of the 3 random splits
train_idx, valid_idx, test_idx = induces[0]
train_idx, valid_idx, test_idx = induces[1]
train_idx, valid_idx, test_idx = induces[2]
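Before training on a split, it can help to confirm that its three index sets do not overlap. The following is a minimal sketch, not part of ChemBench: the helper name split_fractions is hypothetical, and it assumes each index collection is an iterable of integer row positions. It is demonstrated on toy indices so that it runs without the package.

```python
def split_fractions(train_idx, valid_idx, test_idx):
    """Check that the three index sets are disjoint and return their relative sizes."""
    parts = {"train": train_idx, "valid": valid_idx, "test": test_idx}
    # Convert to sets of plain ints so lists, ranges, and arrays all work.
    sets = {name: {int(i) for i in idx} for name, idx in parts.items()}

    names = list(sets)
    for pos, first in enumerate(names):
        for second in names[pos + 1:]:
            overlap = sets[first] & sets[second]
            if overlap:
                raise ValueError(f"{first} and {second} share {len(overlap)} indices")

    total = sum(len(s) for s in sets.values())
    if total == 0:
        raise ValueError("all index sets are empty")
    return {name: len(s) / total for name, s in sets.items()}


# Toy indices standing in for one entry of `induces`.
print(split_fractions(range(0, 8), [8], [9]))
```

With real data, the same call could be applied to every split, for example `for train_idx, valid_idx, test_idx in induces: split_fractions(train_idx, valid_idx, test_idx)`, assuming each entry unpacks into three index sequences as in the snippet above.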
Usage-2: Load Dataset As Data Object
from chembench import dataset
data = dataset.load_ESOL()
data.x
data.y
data.description
## regression
dataset.load_Lipop()
dataset.load_ESOL()
dataset.load_FreeSolv()
dataset.load_Malaria()
dataset.load_LMC()
dataset.load_PDBF()
dataset.load_PDBC()
dataset.load_PDBR()
## classification
dataset.load_BBBP()
dataset.load_BACE()
dataset.load_HIV()
dataset.load_MUV()
dataset.load_Tox21()
dataset.load_SIDER()
dataset.load_CYP450()
dataset.load_ToxCast()
dataset.load_ClinTox()
dataset.load_ChEMBL()
dataset.load_PCBA()
Usage-3: Load Cluster Splits
Cluster split results are also provided. For example, load the cluster splits and random splits for the ESOL dataset:
from chembench import get_cluster_induces
induces1 = get_cluster_induces("ESOL", induces = "random_5fcv_5rpts")
induces2 = get_cluster_induces("ESOL", induces = "scaffold_5fcv_1rpts")
print(len(induces1))
print(len(induces2))
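The split names passed above look like they encode a split method, a fold count, and a repeat count. That reading is inferred from the names alone and is not confirmed by the package, so the sketch below is only a convenience for keeping track of which configuration was requested. The helper parse_split_name and the assumed pattern are hypothetical.

```python
import re

# Assumed pattern: "<method>_<k>fcv_<r>rpts", e.g. "random_5fcv_5rpts".
_SPLIT_NAME = re.compile(r"^(?P<method>[a-z]+)_(?P<folds>\d+)fcv_(?P<repeats>\d+)rpts$")


def parse_split_name(name):
    """Split a name such as "random_5fcv_5rpts" into its three assumed parts."""
    match = _SPLIT_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized split name: {name!r}")
    return {
        "method": match["method"],
        "folds": int(match["folds"]),
        "repeats": int(match["repeats"]),
    }


for name in ("random_5fcv_5rpts", "scaffold_5fcv_1rpts"):
    print(name, parse_split_name(name))
```

Comparing the parsed values with the printed lengths of induces1 and induces2 is one way to check whether this interpretation holds for a given dataset.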
Figure: chemical space of the ESOL dataset under a 5-fold cluster split.

Figure: Kolmogorov-Smirnov statistic on the distributions of each pair of groups (clusters).
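For readers who want to reproduce a comparison of this kind, the two-sample Kolmogorov-Smirnov statistic is the largest absolute gap between two empirical cumulative distribution functions. The sketch below is a plain-Python illustration rather than the code used to produce the figure. It assumes each cluster is represented by a list of numeric values (which quantity was compared is not stated here), and the function names are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The supremum of the gap is always reached at one of the observed values.
    for x in a + b:
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


def pairwise_ks(groups):
    """Compute the KS statistic for every pair of named groups."""
    return {
        (first, second): ks_statistic(groups[first], groups[second])
        for first, second in combinations(sorted(groups), 2)
    }


# Toy values standing in for per-cluster distributions.
toy_clusters = {"c0": [1, 2, 3], "c1": [4, 5, 6], "c2": [1, 2, 3]}
print(pairwise_ks(toy_clusters))
```

A value near 0 means two clusters have similar distributions, and a value near 1 means they barely overlap.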

Making a Release
After installing the package in development mode and installing
tox with pip install tox, the commands for making a new release are contained within the finish environment
in tox.ini. Run the following from the shell:
$ tox -e finish
This script does the following:
- Uses BumpVersion to switch the version number in setup.cfg and src/chembench/version.py to not have the -dev suffix
- Packages the code in both a tar archive and a wheel
- Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
- Pushes to GitHub. You'll need to make a release going with the commit where the version was bumped.
- Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion minor after.
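The version changes described in these steps can be pictured with a small Python sketch. It only illustrates the string transitions (dropping the -dev suffix, then moving to the next patch or minor version) and says nothing about how BumpVersion is implemented or configured in tox.ini. The helper names strip_dev and bump, and the X.Y.Z-dev version format, are assumptions made for the example.

```python
def strip_dev(version):
    """Drop a trailing "-dev" suffix if present."""
    suffix = "-dev"
    return version[: -len(suffix)] if version.endswith(suffix) else version


def bump(version, part="patch"):
    """Return the next patch or minor release number for an X.Y.Z version."""
    major, minor, patch = (int(piece) for piece in strip_dev(version).split("."))
    if part == "patch":
        patch += 1
    elif part == "minor":
        # A minor bump resets the patch number.
        minor, patch = minor + 1, 0
    else:
        raise ValueError(f"unsupported part: {part!r}")
    return f"{major}.{minor}.{patch}"


print(strip_dev("0.1.2-dev"))
print(bump("0.1.2"))
print(bump("0.1.2", "minor"))
```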
