SequentialLearningApp

Sequential Learning App for Materials Discovery (SLAMD)

SLAMD - Benchmarking App

  1. General Information
  2. Sequential Learning
  3. Installation
  4. Starting the App
  5. Known Issues
  6. Quick Guide
  7. Case-Studies

General Information

This repository supplements our app "SLAMD - Sequential Learning App for Materials Discovery". If you are interested in checking it out, visit https://github.com/BAMresearch/WEBSLAMD and https://slamd-demo.herokuapp.com/.

Here we focus on evaluating Sequential Learning on various datasets from the literature, using the notebook SequentialLearningApp.ipynb located in the benchmarking folder. The main purpose is to run experiments against known datasets and compare results for various choices of AI algorithms and Sequential Learning parameters. The results are compared to baseline random draws. In addition, to check the baseline performance of the various ML models, we evaluate R² scores on all datasets in a dedicated notebook, baseline_performance.ipynb.
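As a reminder of what the baseline notebook reports, the R² score (coefficient of determination) can be computed as follows. This is a minimal standard-library sketch of the textbook formula, not the code used in baseline_performance.ipynb; the function name and the sample values are illustrative assumptions.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted values for four materials.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
score = r2_score(measured, predicted)
print(round(score, 3))
```

A score of 1 means perfect predictions, while a model that always predicts the mean of the measured values scores 0.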

This app is based on Jupyter Notebooks in combination with the UI framework Voilà. The app runs as a webapp directly in the browser (details concerning the setup can be found below).

Sequential Learning

Sequential Learning (SL) is frequently recognized as having great potential to accelerate materials research with a small number of highly complex data points. SL ranks the experiments based on their utility. This is done by coupling the predictions of a Machine Learning model with a decision rule that guides the experimental procedure. The underlying idea is that not all experiments are equally useful. Some experiments provide more information than others. In contrast to classical design of experiments (DOE), where (only) the experimental parameters are optimized, the potential outcomes of the experiments themselves are the decisive factor. The most promising experiments are preferred over dead-end experiments and experiments whose outcome is already known. Each new experiment is selected to maximize the amount of useful information, using previous experiments as a guide for the next experiment.
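The loop described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the surrogate is a toy inverse-distance-weighted predictor whose "uncertainty" is the distance to the nearest measured point (a stand-in for the app's Gaussian Process and Random Forest models), the decision rule simply adds prediction and uncertainty, and all names (`predict`, `sequential_learning`, `random_draws`) are hypothetical.

```python
import math
import random


def predict(x, labeled):
    """Toy surrogate (assumption): weighted mean plus a distance-based spread."""
    dists = [math.dist(x, xl) for xl, _ in labeled]
    weights = [1.0 / (d + 1e-9) for d in dists]
    mean = sum(w * y for w, (_, y) in zip(weights, labeled)) / sum(weights)
    return mean, min(dists)


def sequential_learning(features, targets, threshold, n_initial=4, seed=0):
    """Number of experiments needed until a target >= threshold is measured."""
    rng = random.Random(seed)
    known = set(rng.sample(range(len(features)), n_initial))
    while max(targets[i] for i in known) < threshold and len(known) < len(features):
        labeled = [(features[i], targets[i]) for i in known]
        pool = [i for i in range(len(features)) if i not in known]

        def utility(i):
            # Decision rule (assumption): prediction plus uncertainty,
            # i.e. a mix of exploitation and exploration.
            mean, spread = predict(features[i], labeled)
            return mean + spread

        # "Run" the most promising experiment by revealing its known target.
        known.add(max(pool, key=utility))
    return len(known)


def random_draws(targets, threshold, seed=0):
    """Baseline: draw candidates in random order until the threshold is met."""
    rng = random.Random(seed)
    order = rng.sample(range(len(targets)), len(targets))
    for n, i in enumerate(order, start=1):
        if targets[i] >= threshold:
            return n
    return len(targets)


# Hypothetical 1-D design space with a single optimum at x = 1.5.
features = [(x / 10,) for x in range(21)]
targets = [1.0 - (f[0] - 1.5) ** 2 for f in features]
goal = max(targets)
n_sl = sequential_learning(features, targets, goal)
n_rand = random_draws(targets, goal)
print("SL experiments:", n_sl, "random draws:", n_rand)
```

Because the targets of all candidates are known in advance, the loop only simulates experiments. This is the benchmarking setting: the number of experiments needed by SL is compared with the number needed by random draws.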

Installation

Use the package manager pip to install the dependencies listed in requirements.txt (admin rights may be required for pip install). It is recommended but not mandatory to create a virtual environment with venv or conda.

Below we show how to install the dependencies using a virtual environment. If you decide not to use a virtual environment (or conda) simply skip the first two bullet points.

Navigate to the app directory and execute the following commands in your terminal:

  • Create a new venv by running python -m venv <name_of_virtualenv> (in case of errors try to use python3 instead of python in the command)
  • Enter the venv by executing .\<name_of_virtualenv>\Scripts\activate (Windows) or source <name_of_virtualenv>/bin/activate (Unix and Mac)
  • Install all dependencies via pip install -r requirements.txt (in case of errors try to use pip3 instead of pip in the command)

This app includes the lolopy Random Forest algorithm with uncertainties from https://github.com/CitrineInformatics/lolo. To use this method, Java SE must be installed, e.g. from https://www.oracle.com/de/java/technologies/downloads/#java17.
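Before starting the app, it can be useful to check whether a Java runtime is reachable at all. The snippet below is a small sketch that only looks for a `java` executable on the PATH. Whether lolopy locates Java in exactly this way is an assumption, and the helper name `java_available` is hypothetical.

```python
import shutil


def java_available():
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


if not java_available():
    # Assumption: without Java, the lolopy-based models cannot be used.
    print("Java not found on PATH; the lolopy Random Forest will likely be unavailable.")
```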

Note: We tested this setup for the following environments:

  • Windows host machine with Python version 3.10.5 and pip version 21.3.1.
  • Windows host machine with Python version 3.11.1 and pip version 22.3.1.

Troubleshooting

Depending on your Python version, operating system and/or choice of using virtual environment, conda or working globally, there might be errors upon installation. In this case, check the error messages and try to either update the library versions defined in requirements.txt or try another version of Python and/or pip.

Enabling UI elements

After installing the requirements you need to enable the UI elements:

jupyter nbextension enable --py widgetsnbextension

Starting the App

Once all packages have been installed successfully, start the app by running

voila SequentialLearningApp.ipynb

from within the benchmarking directory. A window should now open in your default browser, and the app is ready to use.

Known Issues

Comma errors may occur when uploading Excel data. It is recommended to use the CSV file format.

If the number of targets is changed during benchmarking, the result plot may not appear. The results will still be saved to the results table, so there will be no loss of data.

A known bug remains when running Sequential Learning with targets together with a-priori information whose weight differs from 1. Our benchmarking experiments only covered the case where all weights are set to 1, so the corresponding results are still reliable. Note, however, that the issue is already fixed in the "Materials Discovery" part of our main app. If you want to extend this code and fix the bug, you might find our implementation in WEBSLAMD useful.

Quick Guide

In this quick guide the functions of the app are described in detail. The app is divided into the four main windows Upload, Data Info, Design Space Explorer, and Benchmarking, which are explained below.

Upload

In the upload window, the material data can be imported in CSV or Excel format via a dialog. Benchmarking data must be complete, i.e. for each material composition there must also be (at least) one experimental result.

In the upload dialog it is possible to set the CSV separator and the decimal separator and to delete non-numeric data. In addition, lines at the beginning of the file can be skipped (e.g. header data, etc.). At the end of this process, the data is displayed to the user for plausibility checking. Here it can be quickly and easily checked whether the decimal places are correctly specified and all data is numerical.
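To illustrate what these import options do, the following sketch parses CSV text with a configurable field separator and decimal separator, skips leading lines, and drops non-numeric columns. It uses only the standard library and is not the app's own import code; the function name `read_numeric_csv` and the sample data are assumptions.

```python
import csv


def read_numeric_csv(text, sep=";", decimal=",", skip_rows=0):
    """Parse CSV text into (header, rows), keeping only fully numeric columns."""
    lines = text.splitlines()[skip_rows:]  # skip leading non-data lines
    rows = list(csv.reader(lines, delimiter=sep))
    header, body = rows[0], rows[1:]

    def to_float(cell):
        # Convert e.g. "32,5" to 32.5; return None for non-numeric cells.
        try:
            return float(cell.replace(decimal, "."))
        except ValueError:
            return None

    parsed = [[to_float(cell) for cell in row] for row in body]
    # Keep only columns in which every cell could be converted to a number.
    keep = [j for j in range(len(header)) if all(r[j] is not None for r in parsed)]
    return [header[j] for j in keep], [[r[j] for j in keep] for r in parsed]


# Hypothetical export with one comment line, ";" as separator and "," as decimal mark.
raw = "exported 2024\nname;strength;cost\nA;32,5;1,2\nB;41,0;0,9\n"
header, data = read_numeric_csv(raw, skip_rows=1)
print(header)
print(data)
```

Printing the parsed result mirrors the plausibility check in the app: a wrong decimal separator shows up immediately as missing columns or implausible values.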

Data Info

This window gives a detailed overview of the uploaded data. Besides the data preview, there is a detailed list of all variables ('Info' button) and some basic statistical characteristics of the variables ('Stats' button).

Design Space Explorer

The Design Space Explorer allows the visualization of complex relationships in the data. Here, specific dependencies between selected variables can be displayed as a scatter plot, the interrelationships and distributions of the variables can be mapped as a scatter matrix, and correlations can be visualized as a correlation heatmap. These tools allow a quick visual overview, e.g. of collinearities of the characteristics for feature selection or trade-offs between different material properties, which are to be optimized.
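The numbers behind a correlation heatmap are pairwise Pearson correlation coefficients. The sketch below computes them with the standard library only. The column names and values are made-up illustration data, and the app itself may compute and plot the matrix differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical material data: cement and water content are collinear here.
columns = {
    "cement": [300, 350, 400, 450],
    "water": [200, 190, 180, 170],
    "strength": [30, 36, 41, 47],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Entries close to 1 or -1 between two input features indicate collinearity, which suggests that one of the two can be dropped during feature selection.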

Benchmarking

This window provides the core SL framework of our app. It allows you to assess the potential benefits of SLAMD for your application. It is divided into two tabs: "Settings", where the optimization scenario is defined, and "Sequential Learning parameters", where the algorithms are selected and configured and virtual experiments are run.

Configure Optimization

This window lets the user interactively set up the boundary conditions of the SL problem. The Materials Data (input feature) and target properties can be selected simply by mouse click. It is possible to select multiple target properties (Multi-Objective Optimization). The optimization is then based on the sum (or difference - depending on whether maximization or minimization is desired) of the normalized properties.
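A signed sum of normalized properties can be sketched as follows. The normalization used here (subtract the mean, divide by the population standard deviation) is an assumption, since the text does not specify the scheme, and the function name `combined_objective` and the data are illustrative.

```python
from statistics import mean, pstdev


def combined_objective(targets, maximize):
    """Sum normalized targets; minimized targets enter with a negative sign.

    targets:  dict mapping a property name to its list of values
    maximize: dict mapping a property name to True (maximize) or False (minimize)
    """
    n = len(next(iter(targets.values())))
    total = [0.0] * n
    for name, values in targets.items():
        # Assumed normalization: z-score (requires non-constant values).
        mu, sigma = mean(values), pstdev(values)
        sign = 1.0 if maximize[name] else -1.0
        for i, value in enumerate(values):
            total[i] += sign * (value - mu) / sigma
    return total


# Hypothetical candidates: maximize strength, minimize CO2 footprint.
targets = {"strength": [30, 40, 50], "co2": [300, 200, 220]}
maximize = {"strength": True, "co2": False}
scores = combined_objective(targets, maximize)
best = max(range(len(scores)), key=scores.__getitem__)
print([round(s, 2) for s in scores], "best candidate:", best)
```

Normalizing first keeps a property with large numeric values (here CO2) from dominating one with small values (here strength).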

SLAMD can also consider a-priori information. Not all targets need verification in the lab. Costs and CO2 footprint, for instance, can be collected from databases upfront, yet they can play an important role in optimization, especially in the case of green materials. The feature "A-priori Information" allows such data to be included in the optimization. Their uncertainty is considered to be zero in the MLI utility.

The target can be specified as a quantile of the given properties (or their combinations in case of Multi-Objective Optimization). A lower target threshold (e.g. 90%) accelerates the SL optimization. However, this makes it increasingly difficult for SL to outperform a random process. The target threshold also offers the possibility to define a default value as the optimization limit (to activate it, the check box must be checked).
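The effect of the quantile threshold can be made concrete with a small calculation. The sketch below derives the threshold from a quantile, counts how many candidates meet it, and computes the expected number of random draws (without replacement) until the first hit, which is (N + 1) / (m + 1) for m hits among N candidates. The quantile convention and all names and values are assumptions, and the app may define the quantile differently.

```python
def quantile_threshold(values, percent):
    """Smallest observed value with at least `percent` % of the data at or below it.

    `percent` is an integer between 1 and 100 (assumed convention).
    """
    ordered = sorted(values)
    # Integer arithmetic for ceil(percent * n / 100) - 1 avoids float rounding.
    k = max((percent * len(ordered) + 99) // 100 - 1, 0)
    return ordered[k]


def expected_random_draws(n_candidates, n_hits):
    """Expected draws without replacement until the first hit."""
    return (n_candidates + 1) / (n_hits + 1)


# Hypothetical strength values of ten candidate materials.
strengths = [21, 25, 28, 30, 33, 35, 38, 41, 44, 52]
summary = {}
for percent in (90, 70):
    threshold = quantile_threshold(strengths, percent)
    hits = sum(value >= threshold for value in strengths)
    summary[percent] = (threshold, hits, expected_random_draws(len(strengths), hits))
    print(percent, summary[percent])
```

With a lower quantile, more candidates count as a success, so random drawing also reaches the target after fewer experiments. This is why a lower threshold leaves SL less room to outperform the random baseline.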

The "Show Input Data" button generates a target data table showing the data selected as the target for optimization. This makes it easy to check the plausibility of the above configuration.

Sequential Learning Parameters

The initial sample size and the batch size can be chosen here. Some SL algorithms require at least 3 samples. It is recommended to choose no fewer than 4 initial samples.

This tab lets the user select from various fast and powerful algorithms and two utility functions. The algorithms are based on

  • Lolopy Random Forest Regression - requires a Java installation (see Installation)
  • Gaussian Process Regression

For each, three variations are provided with the first basically using standard parameters without any further tuning, the second one applying a dimensional reduction based on PCA before performing the supervised task, and the third combining forward feature selection and a rather simple variant of grid search bef
