DEAL
Data efficient active learning for machine learning potentials
Install / Use
/learn @luigibonati/DEALREADME
DEAL
Data Efficient Active Learning for Machine Learning Potentials
Reference: Perego S. & Bonati L. Data efficient machine learning potentials for modeling catalytic reactivity via active learning and enhanced sampling, npj Computational Materials 10, 291 (2024) doi: 10.1038/s41524-024-01481-6
DEAL selects non-redundant structures from atomistic trajectories via Sparse Gaussian Processes (SGP), to be used to train machine-learning interatomic potentials.
It consists of two steps:
- preselection using the MLP uncertainty (e.g. max uncertainty obtained with query-by-committee)
- select a dataset of non-redundant configurations using the local predictive variance of a Gaussian Process
In addition, step 2. can be also used to perform subsampling of a trajectory even without the uncertainty pre-selection (see examples)
A short practical introduction describes the main ingredients (SGP, local descriptors, uncertainty).
Code highlights
-
Select structures based on SGP predictive variance.
-
Analyze selected structures (e.g. along the trajectory or as a function of a CV)
<img src="imgs/analysis.png" alt="drawing" width="824"/> -
Interactive visualization using chemiscope
<a href="https://chemiscope.org/?load=https://raw.githubusercontent.com/luigibonati/DEAL/refs/heads/main/examples/2_subsampling_formate/selection/deal_0.1_chemiscope.json.gz"> <img src="imgs/chemiscope-viewer.png" alt="drawing" width="412"></a>
Table of contents
📚 Contents
deal/– The core Python package.examples/– Two realistic workflows demonstrating how to use DEAL in practice.npj_supporting_data/– Jupyter notebooks reproducing the full workflow described in the publication, including the use of Gaussian Process Regression for reaction-pathway exploration.tests/– A test suite to verify that the installation is correct and all components work as expected.
🔧 Dependencies
DEAL requires:
python<=3.13flare==1.3.3basepandasnumpychemiscope(for visualization)
🚀 Installation
Below is a complete installation sequence.
Create environment
conda create -n deal python=3.12
conda activate deal
Install FLARE 1.3.3b
(See: https://mir-group.github.io/flare/installation/install.html)
conda install -y gcc gxx cmake openmp liblapacke openblas -c conda-forge
git clone https://github.com/mir-group/flare.git -b 1.3.3b
cd flare
pip install .
cd ..
- Install DEAL
git clone https://github.com/luigibonati/DEAL.git
cd DEAL
pip install .
Usage
DEAL can be run either with a command-line tool (deal) or using the python class (DEAL).
Minimal example
deal --file traj.xyz --threshold 0.1
DEAL will automatically:
- detect atomic species from the first frame
- use default GP/kernel/descriptor parameters
- use default output names
📄 With a YAML config file
For more customization you can create an input.yaml file (below are the default):
deal -c input.yaml
data:
files: ["traj.xyz"] # can be a single file or a list of files
format: "extxyz" # file format (e.g. extxyz, xyz, ...)
index: ":" # frame selection (e.g. ":", "0:100", [0,10,20]) [see ASE notation]
shuffle: false # whether to shuffle the frames before processing (suggested true for MD data)
seed: 42
deal:
threshold: 0.1 #can be a single value or a list of values
update_threshold: 0.08 # if not set it is chosen as 0.8 * threshold
max_atoms_added: 0.2 # limit the number of selected environments added per configuration (can be int (number of atoms) float (0,1) (fraction of total atoms), or -1 (no limit)
initial_atoms: none # specify which atoms to use for GP initialization (list, fraction or number. Default: none, 1 atom per species)
output_prefix: deal # prefix for output files (threshold will be appended as suffix)
force_only: true
train_hyps: false # whether to re-train hyperparameters at each iteration (slower)
verbose: true # allowed values: true/false/"debug" (default: true)
save_gp: false
save_full_trajectory: false # if true, writes <prefix>_trajectory_uncertainty.xyz with per-atom array "atomic_uncertainty"
flare_calc:
gp: SGP_Wrapper # (see flare's documentation)
kernels:
- name: NormalizedDotProduct
sigma: 2
power: 2
descriptors:
- name: B2
nmax: 8
lmax: 3
cutoff_function: cosine
radial_basis: chebyshev
cutoff: 4.5
One can also use a base config file and override via CLI the options:
deal -c input.yaml -t 0.15
Python Usage
# Import
from deal import DataConfig, DEALConfig, FlareConfig, DEAL
# Define Config (uses defaults where not provided)
data_cfg = DataConfig(files="traj.xyz")
deal_cfg = DEALConfig(
threshold=0.1,
output_prefix="deal",
)
flare_cfg = FlareConfig()
# Instantiate DEAL class
deal = DEAL(data_cfg, deal_cfg, flare_cfg)
# Run
deal.run()
Output files
In both cases the following file is generated (with the default output_prefix=deal):
deal_selected.xyz– selected frames
Contains the atomic configurations where the GP uncertainty exceeded the threshold. Includes atoms.info["frame"] indicating the original trajectory index.
If save_gp: true, DEAL also writes:
2. deal_flare.json – final GP model
If save_full_trajectory: true, DEAL also writes:
3. deal_trajectory_uncertainty.xyz – full trajectory with uncertainties
Contains all processed frames with:
- per-atom array
atomic_uncertainty(saved inatoms.arrays) - frame scalar
max_atomic_uncertainty(saved inatoms.info)
Create a chemiscope file
After running deal, one can create a chemiscope visualization file with:
deal-chemiscope --prefix deal
or explicitly with the trajectory file and optionally also a COLVAR file:
deal-chemiscope --trajectory deal_selected.xyz --colvar COLVAR
deal-chemiscope creates a chemiscope visualization file from the selected structures.
The file includes all numeric properties available in atoms.info for each frame, and,
when --colvar is provided, it also includes the properties loaded from the COLVAR file.
This writes deal_chemiscope.json.gz (by default with --prefix deal).
It can be viewed online at https://chemiscope.org/ or inside Python:
import chemiscope
chemiscope.show_input('deal_chemiscope.json.gz')
See also the Chemiscope documentation.
Multiple thresholds
If the CLI receives a list of thresholds, DEAL will run once per threshold.
deal:
threshold:
- 0.10
- 0.15
- 0.20
Equivalent behaviour in Python:
for thr in [0.10, 0.15, 0.20]:
deal_cfg.threshold = thr
deal_cfg.output_prefix = f"run_thr{thr}"
DEAL(data_cfg, deal_cfg, flare_cfg).run()
🎛️ Choice of the parameters
Below a quick guide, see the introduction to DEAL for a more in-depth explanation.
Descriptors
Local environments are characterized via the Atomic Cluster Expansion formalism as implemented in flare. Key hyperparameters: body order (B1/B2), radial degree nmax, angular degree lmax, and cutoff (in Å).
descriptors:
- name: B2
nmax: 8
lmax: 3
cutoff_function: cosine
radial_basis: chebyshev
cutoff: 4.5
Threshold
threshold: 0.1
update_threshold: 0.08 # if not set it is chosen as 0.8 * threshold
max_atoms_added: -1 # no limit on the number of selected environment of a given configuration to the GP.
initial_atoms: 0.15 # use up to 15% of the atoms (of each species) for GP initialization
Thethreshold parameter in the DEAL configuration controls when a local environment is flagged by the SGP’s predictive variance (normalized by the noise hyperparameter). If any environment exceeds the threshold, the GP is updated and that environment (plus any others above update_threshold, up to max_atoms_added) is added.
Some tips:
- The uncertainty values are unitless, and range between 0 and 1. Lower threshold means more selected structures; higher threshold, fewer selections.
- A good starting point is around 0.1. As a rule of thumb, homogeneous, condensed and/or crystalline systems tend to have fewer different local environments and require smaller thresholds (<<0.1), whereas heterogeneous systems may require larger ones (>0.1). This is also connected with the number of species (more species -> higher treshold required).
- Try a few values and compare how many structures are selected; distributions often are very similar across thresholds, what changes is the number of structures. One can decide based on the computational budget (for DFT calculations).
- A practical strategy is to perform incremental selection: start with a high threshold, then decrease it progressively until a target number of structures is reached (see example #4).
- Use chemiscope to inspect selected structures and identify which environments triggered selection.
Note: the training time scales unfavorably with the number of samples. For a large dataset,
Related Skills
YC-Killer
2.7kA library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.
best-practices-researcher
The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app
groundhog
400Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).
last30days-skill
19.5kAI agent skill that researches any topic across Reddit, X, YouTube, HN, Polymarket, and the web - then synthesizes a grounded summary
