BBOMol
Surrogate-based black-box optimization method for molecular properties
Installation
BBOMol depends on EvoMol for the evolutionary optimization of the surrogate function. First follow the installation steps described in the <a href='https://github.com/jules-leguy/evomol'>EvoMol repository</a>, making sure to complete both the Installation and DFT optimization sections.
Then run the following commands to install BBOMol.
$ git clone https://github.com/jules-leguy/BBOMol.git # Cloning repository
$ cd BBOMol # Moving into BBOMol directory
$ conda activate evomolenv # Activating anaconda environment
$ python -m pip install . # Installing BBOMol
Finally, type the following commands to install ChemDesc, a dependency that is required to compute the molecular descriptors.
$ cd .. # Go back to the previous directory if you are still in the BBOMol installation directory
$ git clone https://github.com/jules-leguy/ChemDesc.git # Clone ChemDesc
$ cd ChemDesc # Move into ChemDesc directory
$ conda activate evomolenv # Activate environment
$ conda install -c conda-forge dscribe=1.2.1 # Installing DScribe dependency
$ python -m pip install . # Install ChemDesc
To use BBOMol, make sure to activate the evomolenv conda environment.
Quickstart
The following example runs a black-box optimization of the HOMO energy using an RBF-based kernel and the MBTR descriptor. The merit function optimized by the evolutionary algorithm is the expected improvement of the surrogate function.
from bbomol import run_optimization
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

run_optimization({
    "obj_function": "homo",
    "merit_optim_parameters": {
        "merit_type": "EI",
    },
    "surrogate_parameters": {
        "GPR_instance": GaussianProcessRegressor(1.0 * RBF(1.0) + WhiteKernel(1.0), normalize_y=True),
        "descriptor": {
            "type": "MBTR"
        }
    }
})
Settings
A dictionary can be given to bbomol.run_optimization to describe the experiment to be performed. This dictionary can contain up to six entries, which are described in this section.
Default values are represented in bold.
Objective function
The "obj_function" attribute describes the (costly) objective function to be optimized by the algorithm. It follows
the formalism of EvoMol: any value accepted for this attribute by EvoMol is also accepted here. This includes
implemented functions (e.g. "homo"), custom Python functions evaluating a SMILES, and functions combining several
properties. See the relevant section in the EvoMol documentation.
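As an illustration of the custom-function case, the sketch below defines a toy objective that scores a SMILES string and passes it as "obj_function". The function name `carbon_fraction` and its scoring rule are hypothetical. The snippet also assumes that the function is called with a single SMILES string and returns a number, which should be checked against the EvoMol documentation. The call to run_optimization is commented out so that the snippet runs without BBOMol installed.

```python
def carbon_fraction(smiles):
    """Toy objective (hypothetical): share of letters in the SMILES that are 'C' or 'c'.

    This is a crude string-based score, not a chemical property: for instance
    it would also count the 'C' of 'Cl' as a carbon.
    """
    letters = [char for char in smiles if char.isalpha()]
    if not letters:
        return 0.0
    carbons = sum(1 for char in letters if char in ("C", "c"))
    return carbons / len(letters)


# Assumption: a Python callable can be given directly as the objective.
settings = {"obj_function": carbon_fraction}

# from bbomol import run_optimization
# run_optimization(settings)
```

A real objective would typically be far more expensive than this one, which is the situation the surrogate model is meant to address.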
Surrogate parameters
The "surrogate_parameters" attribute describes the parameters of the Gaussian process regression model (Kriging)
that is used as a surrogate of the objective function. This includes the setting of the molecular descriptor. It can
be set with a dictionary containing the following entries.
- "GPR_instance": instance of sklearn.gaussian_process.GaussianProcessRegressor (default: GaussianProcessRegressor(1.0*RBF(1.0)+WhiteKernel(1.0), normalize_y=True)).
- "max_train_size": maximum possible size of the surrogate training dataset. If more samples are available in the dataset of solutions (including solutions of the design of experiments), then max_train_size samples are sampled uniformly before each training (at the beginning of each optimization step).
- "descriptor": a dictionary that defines the descriptor to be used to represent the solutions. The "type" attribute is used to select the descriptor, which can be configured using the following set of attributes.
  - "type": name of the descriptor to be used.
    - "MBTR": many-body tensor representation, using the DScribe implementation.
    - "shingles": boolean or integer vector of shingles.
    - "SOAP": smooth overlap of atomic positions, using the DScribe implementation.
  - Parameter common to MBTR and SOAP
    - "species": list of atomic symbols that can be represented (["H", "C", "O", "N", "F"]).
  - Parameters common to shingles and random vectors
    - "vect_size": size of the descriptor (2000).
  - Parameters specific to MBTR (see DScribe documentation)
    - "atomic_numbers_n", "inverse_distances_n", "cosine_angles_n": number of bins used to encode, respectively, the atomic numbers (10), the interatomic distances (25) and the interatomic angles (25).
  - Parameters specific to the vector of shingles
    - "lvl": radius of the shingles (1).
    - "count": if False, the descriptor is a boolean vector that represents whether the i<sup>th</sup> shingle is present in the molecule. If True, the descriptor is an integer vector that counts the number of occurrences of the i<sup>th</sup> shingle in the molecule (True).
    - "external_dict": dictionary or path to a JSON dictionary that maps the SMILES of shingles to a unique identifier that will be used as index in the output representation.
  - Parameters specific to SOAP (see DScribe documentation)
    - "rcut": cutoff for local environments (6.0 Å).
    - "nmax", "lmax": respectively the number of radial basis functions (8) and the maximum degree of spherical harmonics (6).
    - "average": whether to average all local environments ("inner", "outer") or to consider the environments independently ("off").
  - Parameters specific to the Gaussian random vector
    - "mu": mean of the Gaussian distribution (0).
    - "sigma": standard deviation of the Gaussian distribution (1).
Merit optimization parameters
The "merit_optim_parameters" attribute is used to describe the merit function and the parameters of its
evolutionary optimization. It can be set with a dictionary containing the following entries.
- "merit_type": merit function. It can be either the expected improvement of the surrogate function ("EI"), the probability of improvement ("POI") or the surrogate function directly ("surrogate").
- "merit_xi": value of the ξ parameter of the expected improvement or probability of improvement (0.01). This parameter is only interpreted if "merit_type" is set to "EI" or "POI".
- "evomol_parameters": dictionary describing the parameters for the evolutionary optimization of the merit function, using the EvoMol algorithm. See the relevant section in the EvoMol documentation. The "optimization_parameters" and "action_space_parameters" attributes can be set here. They respectively define the number of optimization steps at each merit optimization phase, and the chemical space of the solutions that will be generated. The other attributes are set automatically by BBOMol. It is also possible to set the "io_parameters" attribute for specific purposes, but some attributes may be overwritten. Default value:
      {
          "optimization_parameters": {
              "max_steps": 10,
          },
          "action_space_parameters": {
              "max_heavy_atoms": 9,
              "atoms": "C,N,O,F"
          }
      }
- "init_pop_size": number of solutions that are drawn from the dataset of known solutions to be inserted in the initial population of the evolutionary algorithm, at each optimization phase and for each evolutionary optimization instance (10).
- "init_pop_strategy": strategy to select the solutions from the dataset of known solutions to be inserted in the initial population of the evolutionary optimization instances. Available strategies:
  - "methane": always starting the evolutionary optimization from the methane molecule.
  - "best": selecting the "init_pop_size" best solutions according to the objective function.
  - "random": selecting randomly "init_pop_size" solutions with uniform probability.
  - "random_weighted": selecting randomly "init_pop_size" solutions with a probability that is proportional to their objective function value.
- "n_merit_optim_restarts": number of merit function evolutionary optimization instances at each merit optimization phase (10).
- "n_best_retrieved": number of (best) solutions that are retrieved from each evolutionary optimization restart to be evaluated using the objective function and inserted in the dataset of known solutions (1).
Black-box optimization parameters
The "bbo_optim_parameters" attribute is used to define the parameters of the black-box
optimization. It can be set with a dictionary containing the following entries.
- "max_obj_calls": number of calls to the objective function before stopping the algorithm (1000).
- "score_assigned_to_failed_solutions": the default behaviour is to ignore the solutions that fail either the descriptors computation or the evaluation
