GAIA

GAIA is a framework that uses an automated pipeline to generate datasets for machine learning interatomic potentials (MLIPs).

Prerequisites

Quantum mechanics (QM) package

A QM package is required to run GAIA. It is currently designed to use VASP; support for more packages is planned.

Distributed environment with shared storage

GAIA assumes a distributed environment. Shared storage is also required so that every node can access the same directory under an identical path.

Job scheduler

GAIA is currently designed to use SLURM as the job scheduler, but with minor code modifications, one can easily adapt it to other schedulers or execute it on a single node.

Dependencies

We provide requirements.txt, which allows users to fully reproduce the Python environment used for the GAIA implementation. GAIA also requires the following binaries: CREST, nebmake.pl, Open Babel, xTB, and xTB-IFF.
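Before launching a run, it can help to confirm that the external binaries are on PATH. The sketch below is not part of GAIA. The executable names (crest, nebmake.pl, obabel, xtb, xtbiff) are assumptions based on how these tools are commonly installed and may differ on your system.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_BINARIES = {
    "CREST": "crest",
    "nebmake.pl": "nebmake.pl",
    "Open Babel": "obabel",
    "xTB": "xtb",
    "xTB-IFF": "xtbiff",
}


def find_missing(binaries=REQUIRED_BINARIES, which=shutil.which):
    """Return the tool names whose executable cannot be found on PATH."""
    return [tool for tool, exe in binaries.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing binaries:", ", ".join(missing))
    else:
        print("All required binaries were found.")
```

Because the nodes share storage but not necessarily the same PATH, running this check inside a scheduled job as well as on the login node may be worthwhile.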

Usage

Config file

user_config provides an example YAML file with user-defined settings for the data-generator, data-improver, and GAIA-Bench.
base_config serves as a skeleton configuration. It includes default values for advanced parameters, while user-defined parameters override those in the base config.
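The override behavior can be pictured as a recursive dictionary merge. The sketch below is an assumption about how such a merge might work, not GAIA's actual implementation, and the keys shown (qm, package, encut, scheduler) are hypothetical placeholders rather than real GAIA parameters.

```python
import copy


def merge_config(base, user):
    """Return a new dict where values from `user` override `base`, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, and new keys: the user value wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical keys, for illustration only.
base_config = {"qm": {"package": "vasp", "encut": 520}, "scheduler": "slurm"}
user_config = {"qm": {"encut": 400}}

config = merge_config(base_config, user_config)
print(config)
```

In this sketch, a user file only needs to list the parameters that differ from the defaults; everything else is inherited from the base config.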

Data-generator

Input preparation

Chemical components: GAIA supports both periodic (e.g., metals) and non-periodic (e.g., molecules with organic species) components. Periodic components should be provided in .POSCAR format and non-periodic components in .xyz format.
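A quick pre-flight check can catch input files with an unexpected extension before a run starts. The helper below is a hypothetical sketch, not a GAIA function; matching the extension case-insensitively is an assumption.

```python
from pathlib import Path

# Mapping taken from the text above: .POSCAR for periodic, .xyz for non-periodic.
KIND_BY_SUFFIX = {".poscar": "periodic", ".xyz": "non-periodic"}


def classify_component(path):
    """Return 'periodic' or 'non-periodic' based on the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return KIND_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported component file: {path}") from None
```

For example, classify_component("Cu.POSCAR") returns "periodic", while a file with any other extension raises a ValueError.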

Run

$ cd GAIA
$ python main.py -a data_generator -c {user_config (.yaml)} -o {out_dir} -p {prefix}

If out_dir is /home/GAIA_out and prefix is first, the artifacts and the log are saved in /home/GAIA_out/first/.
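When scripting several runs, the command line and the resulting output directory can be assembled programmatically. The sketch below only uses the flags shown above; the helper names and the config filename user_config.yaml are hypothetical.

```python
from pathlib import PurePosixPath


def build_command(action, config, out_dir, prefix):
    """Assemble the argument list for main.py using the -a, -c, -o, and -p flags."""
    return ["python", "main.py", "-a", action, "-c", str(config), "-o", str(out_dir), "-p", prefix]


def run_directory(out_dir, prefix):
    """Return the directory that receives the artifacts and the log: {out_dir}/{prefix}."""
    return PurePosixPath(out_dir) / prefix


cmd = build_command("data_generator", "user_config.yaml", "/home/GAIA_out", "first")
print(" ".join(cmd))
print(run_directory("/home/GAIA_out", "first"))
```

The returned list can be passed to subprocess.run with cwd set to the GAIA directory, which mirrors the cd GAIA step.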

Data-improver

Input preparation

Trainset, validset, and model checkpoint: The data-improver provides recommendations based on error metrics on the validation set as well as on the training set itself. It therefore requires a validation dataset (.extxyz) and a trained model checkpoint (e.g., .pt or .pth) in addition to the training dataset. The MLIP framework that provides the calculator for the checkpoint should also be set up.

Run

$ cd GAIA
$ python main.py -a data_improver -c {user_config (.yaml)} -o {out_dir} -p {prefix}

GAIA-Bench

Input preparation

GAIA-Bench datasets and model checkpoint: GAIA-Bench includes four benchmark tasks, whose datasets are available at GAIA-Bench. A model checkpoint to test is required, and the MLIP framework that provides the calculator for the checkpoint should also be set up.

Run

$ cd GAIA
$ python main.py -a benchmark -c {user_config (.yaml)} -o {out_dir} -p {prefix}

Dataset and model checkpoint

Titan25 is an MLIP dataset constructed with GAIA, comprising 1.8M data points across 11 elements. SNet-T25 is an MLIP trained on this dataset. See the GAIA paper for details.

Citation

If you use this code, please cite our work as follows:

@article{gaia2025,
  title={Scalable Reactive Atomistic Dynamics with GAIA},
  author={Song, Suhwan and Kim, Heejae and Jang, Jaehee and Cho, Hyuntae and Kim, Gunhee and Kim, Geonu},
  journal={arXiv preprint arXiv:2509.25798},
  year={2025}
}
