GAIA
GAIA automates the generation of reactive MLIP datasets for atomistic simulations.
GAIA is a framework to generate datasets with an automated pipeline for machine learning interatomic potentials.
Prerequisites
Quantum mechanics (QM) package
A QM package is required to run GAIA. <br>
GAIA currently supports VASP; support for additional packages is planned.
Distributed environment with shared storage
GAIA assumes a distributed environment. <br>
Shared storage is also required so that every node can access the same directory at an identical path.
Job scheduler
GAIA is currently designed to use SLURM as the job scheduler, <br>
but with minor code modifications, one can easily adapt it to other schedulers or execute it on a single node.
Dependencies
We provide requirements.txt that allows users to fully reproduce the environment used for the GAIA implementation. <br>
GAIA also requires the following binaries:
CREST, nebmake.pl, Open Babel, xTB, xTB-IFF
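Before launching a run, it can help to confirm that these binaries are reachable from the shell. The following is a minimal sketch, not part of GAIA itself; the executable names (`crest`, `obabel`, `xtb`, `xtbiff`) are assumptions about typical installations and may differ on your system.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_BINARIES = {
    "CREST": "crest",
    "nebmake.pl": "nebmake.pl",
    "Open Babel": "obabel",
    "xTB": "xtb",
    "xTB-IFF": "xtbiff",
}


def find_missing_binaries(required=REQUIRED_BINARIES):
    """Return the tool names whose executables are not found on PATH."""
    return [name for name, exe in required.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_binaries()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required binaries were found on PATH.")
```

Because compute nodes may have a different PATH than the login node, running the check inside a scheduled job is likely more informative than running it interactively.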
Usage
Config file
- user_config provides an example YAML file with user-defined settings for the data-generator, data-improver, and GAIA-Bench.
- base_config serves as a skeleton configuration. It includes default values for advanced parameters, while user-defined parameters override those in the base config.
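The override behavior described above can be pictured as a recursive dictionary merge. The sketch below illustrates that idea only; it is not GAIA's actual loader, and the keys shown (`scheduler`, `nodes`, `seed`) are hypothetical rather than taken from GAIA's configuration schema.

```python
import copy


def merge_config(base, user):
    """Return a new dict in which values from `user` override those in `base`."""
    merged = copy.deepcopy(base)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, and new keys: the user value wins outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical keys for illustration only; they are not GAIA's actual schema.
base = {"scheduler": {"name": "slurm", "nodes": 1}, "seed": 0}
user = {"scheduler": {"nodes": 4}}

merged = merge_config(base, user)
print(merged)
```

In this sketch, nested sections are merged key by key, so a user file only needs to list the parameters it changes; whether GAIA merges nested sections the same way is an assumption.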
Data-generator
Input preparation
- Chemical components <br> GAIA supports both periodic (e.g., metals) and non-periodic (e.g., molecules with organic species) components. <br> Periodic components should be provided in POSCAR format (.POSCAR) and non-periodic components in XYZ format (.xyz).
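For non-periodic components, a plain XYZ file lists the atom count, a comment line, and then one line per atom with the element symbol and Cartesian coordinates. The sketch below formats a small molecule that way; the helper name `to_xyz` and the water geometry are illustrative assumptions, and any further constraints GAIA places on its inputs are not covered here.

```python
def to_xyz(atoms, comment=""):
    """Format (element, x, y, z) tuples as text in the plain XYZ layout."""
    lines = [str(len(atoms)), comment]
    for element, x, y, z in atoms:
        lines.append(f"{element} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


# Approximate water geometry in angstroms, for illustration only.
water = [
    ("O", 0.000000, 0.000000, 0.117300),
    ("H", 0.000000, 0.757200, -0.469200),
    ("H", 0.000000, -0.757200, -0.469200),
]

print(to_xyz(water, comment="water molecule"), end="")
```

Saving the returned text to a file with the .xyz extension gives a candidate non-periodic input.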
Run
$ cd GAIA
$ python main.py -a data_generator -c {user_config (.yaml)} -o {out_dir} -p {prefix}
- If out_dir is /home/GAIA_out and prefix is first, artifacts and the log are saved in /home/GAIA_out/first/
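When several runs are scripted, the command and its expected output directory can be assembled programmatically. This is a minimal sketch that uses only the flags shown above (-a, -c, -o, -p); the helper names and the config path `user_config.yaml` are hypothetical.

```python
import os
import shlex


def build_command(action, config_path, out_dir, prefix):
    """Assemble the main.py invocation as an argument list."""
    return [
        "python", "main.py",
        "-a", action,
        "-c", config_path,
        "-o", out_dir,
        "-p", prefix,
    ]


def run_directory(out_dir, prefix):
    """Directory where artifacts and the log are expected: out_dir/prefix."""
    return os.path.join(out_dir, prefix)


# "user_config.yaml" is a hypothetical path to your own config file.
command = build_command("data_generator", "user_config.yaml", "/home/GAIA_out", "first")
print(shlex.join(command))
print(run_directory("/home/GAIA_out", "first"))
```

The argument list can be passed to `subprocess.run(command, cwd="GAIA", check=True)` to launch the run from the repository directory, which mirrors the `cd GAIA` step.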
Data-improver
Input preparation
- Trainset, validset, and model checkpoint <br>
The data-improver provides recommendations based on error metrics on the validation set (validset) as well as on the training set (trainset) itself. <br>
It therefore requires a validation dataset (.extxyz) and a trained model checkpoint (e.g., .pt or .pth) in addition to the training dataset. <br>
The MLIP framework with a calculator for the checkpoint should also be set up.
Run
$ cd GAIA
$ python main.py -a data_improver -c {user_config (.yaml)} -o {out_dir} -p {prefix}
GAIA-Bench
Input preparation
- GAIA-Bench datasets and model checkpoint <br>
GAIA-Bench includes four benchmark tasks, whose datasets are available at GAIA-Bench. <br>
A model checkpoint to test is required; the MLIP framework with a calculator for the checkpoint should also be set up.
Run
$ cd GAIA
$ python main.py -a benchmark -c {user_config (.yaml)} -o {out_dir} -p {prefix}
Dataset and model checkpoint
Titan25 is an MLIP dataset constructed with GAIA, comprising 1.8M data points across 11 elements. SNet-T25 is an MLIP trained on this dataset. See the GAIA paper for details.
Citation
If using this code, please cite our work as follows:
@article{gaia2025,
title={Scalable Reactive Atomistic Dynamics with GAIA},
author={Song, Suhwan and Kim, Heejae and Jang, Jaehee and Cho, Hyuntae and Kim, Gunhee and Kim, Geonu},
journal={arXiv preprint arXiv:2509.25798},
year={2025}
}
