NablaDFT
nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset
$\nabla^2$ DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
<p align="left"> <a href="https://developer.nvidia.com/cuda-downloads"><img alt="CUDA versions" src="https://img.shields.io/badge/cuda-11.8~12.1-green"></a> <a href="https://github.com/AIRI-Institute/nablaDFT/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/license-MIT-blue"></a> <a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff" style="max-width:100%;"></a> </p>
This is the repository for the nablaDFT Dataset and Benchmark. The current version is 2.0. The code and data from the initial publication are accessible here: 1.0 branch. <br/>
Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, their high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$ DFT that builds on nablaDFT. It contains twice as many molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level (ωB97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$ DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.<br/>
More details can be found in the version 1 paper and version 2 paper.
If you use nablaDFT in your research, please cite us as:
@article{khrabrov2024nabla2dftuniversalquantumchemistry,
title={$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials},
author={Kuzma Khrabrov and Anton Ber and Artem Tsypin and Konstantin Ushenin and Egor Rumiantsev and Alexander Telepov and Dmitry Protasov and Ilya Shenbin and Anton Alekseev and Mikhail Shirokikh and Sergey Nikolenko and Elena Tutubalina and Artur Kadurin},
year={2024},
eprint={2406.14347},
archivePrefix={arXiv},
primaryClass={physics.chem-ph},
url={https://arxiv.org/abs/2406.14347},
}
@article{10.1039/D2CP03966D,
author ="Khrabrov, Kuzma and Shenbin, Ilya and Ryabov, Alexander and Tsypin, Artem and Telepov, Alexander and Alekseev, Anton and Grishin, Alexander and Strashnov, Pavel and Zhilyaev, Petr and Nikolenko, Sergey and Kadurin, Artur",
title ="nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset",
journal ="Phys. Chem. Chem. Phys.",
year ="2022",
volume ="24",
issue ="42",
pages ="25853-25863",
publisher ="The Royal Society of Chemistry",
doi ="10.1039/D2CP03966D",
url ="http://dx.doi.org/10.1039/D2CP03966D"}

Installation
git clone https://github.com/AIRI-Institute/nablaDFT && cd nablaDFT/
pip install .
Dataset
We propose a benchmarking dataset based on a subset of the Molecular Sets (MOSES) dataset. The resulting dataset contains 1 936 931 molecules with atoms C, N, S, O, F, Cl, Br, H. It contains 226 424 unique Bemis-Murcko scaffolds and 34 572 unique BRICS fragments.<br/>
For each molecule in the dataset we provide from 1 to 62 unique conformations, with 12 676 264 conformations in total. For each conformation, we have calculated its electronic properties, including the energy (E), the DFT Hamiltonian matrix (H), and the DFT overlap matrix (S). All properties were calculated using the Kohn-Sham method at the ωB97X-D/def2-SVP level of theory with the quantum-chemical software package Psi4, version 1.5. <br/>
We provide several splits of the dataset that can serve as the basis for comparison across different models.<br/>
As part of the benchmark, we provide separate databases for each subset and task, together with a complete archive of the wave function files produced by Psi4. Each wave function file contains the quantum chemical properties of the corresponding molecule and can be used in further computations.
Downloading dataset
Hamiltonian databases
Links to the Hamiltonian databases, including the different train and test subsets, are in the file Hamiltonian databases<br/>
Energy databases
Links to the energy databases, including the different train and test subsets, are in the file Energy databases
Raw psi4 wave functions
A CSV file with links to the 7z archives: wfns.csv.gz.
Each archive contains npy files in the npys/ directory, named as follows: {moses_id}_{conformation_id}.npy
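The naming scheme above can be turned back into its two identifiers with a few lines of Python. This is a minimal sketch, not part of the nablaDFT package: the helper `parse_wfn_filename` and the `WfnFileId` tuple are hypothetical names, and the code assumes both ids are integers, as in the `50000_0.npy` example used later in this README.

```python
from pathlib import Path
from typing import NamedTuple


class WfnFileId(NamedTuple):
    """Identifiers encoded in a wavefunction file name (hypothetical helper type)."""
    moses_id: int
    conformation_id: int


def parse_wfn_filename(path: str) -> WfnFileId:
    """Split '{moses_id}_{conformation_id}.npy' into its two ids.

    Assumption: both ids are integers and the stem contains exactly one underscore.
    """
    stem = Path(path).stem  # 'npys/50000_0.npy' -> '50000_0'
    moses_id, conformation_id = stem.split("_")
    return WfnFileId(int(moses_id), int(conformation_id))


print(parse_wfn_filename("npys/50000_0.npy"))
```

A file name that does not follow the pattern raises a `ValueError`, which is usually preferable to silently mislabeling a conformation.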
Summary file
The CSV file with the conformation index, SMILES, and atomic DFT properties: summary.csv.gz
The CSV file with the conformation index, energies, and forces for optimization trajectories: trajectories_summary.csv
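Because the summary file is gzip-compressed and may be large, it can help to inspect only its first rows before loading everything. The sketch below uses only the Python standard library. The helper name `head_csv_gz` is hypothetical, the column names are whatever the file header defines (none are assumed here), and the code assumes the file is UTF-8 encoded with a header row.

```python
import csv
import gzip
from itertools import islice
from pathlib import Path


def head_csv_gz(path, n=5):
    """Return the first n data rows of a gzip-compressed CSV as a list of dicts.

    The file is streamed, so only the requested rows are decompressed and parsed.
    """
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(islice(csv.DictReader(handle), n))


# Runs only if the file has already been downloaded to the working directory.
if Path("summary.csv.gz").exists():
    for row in head_csv_gz("summary.csv.gz", n=3):
        print(row)
```

The uncompressed trajectories_summary.csv can be read the same way with the built-in `open` in place of `gzip.open`.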
Conformations files
Tar archive with xyz files: archive
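Once extracted, the xyz files can be read without extra dependencies. The following is a minimal sketch that assumes the conventional XYZ layout (atom count on the first line, a comment on the second, then one `symbol x y z` line per atom). The files in the archive may carry extra columns or a specific comment line, so treat `parse_xyz` as a hypothetical helper to adapt rather than the project's own reader.

```python
from typing import List, Tuple

Coords = Tuple[float, float, float]


def parse_xyz(text: str) -> Tuple[List[str], List[Coords]]:
    """Parse the contents of a standard XYZ file into symbols and coordinates."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])  # first line: number of atoms
    symbols: List[str] = []
    coords: List[Coords] = []
    # Line index 1 is the comment line; atom records start at index 2.
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]  # ignore any trailing columns
        symbols.append(symbol)
        coords.append((float(x), float(y), float(z)))
    if len(symbols) != n_atoms:
        raise ValueError("atom count does not match the header line")
    return symbols, coords


example = """3
water (illustrative geometry, not taken from the dataset)
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""
symbols, coords = parse_xyz(example)
print(symbols, coords[0])
```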
Accessing elements of the dataset
Hamiltonian database
Download the smallest file (train-tiny data split, 14 GB):
wget https://a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru/data/nablaDFTv2/hamiltonian_databases/train_2k.db
Minimal usage example:
from nablaDFT.dataset import HamiltonianDatabase
train = HamiltonianDatabase("train_2k.db")
# atoms numbers, atoms positions, energy, forces, core hamiltonian, overlap matrix, coefficients matrix,
# moses_id, conformation_id
Z, R, E, F, H, S, C, moses_id, conformation_id = train[0]
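A Hamiltonian matrix in a non-orthogonal atomic-orbital basis is normally paired with the overlap matrix through the generalized eigenvalue problem $HC = SC\varepsilon$. The sketch below solves it with NumPy via symmetric (Löwdin) orthogonalization. It assumes `H` and `S` are real, square, symmetric arrays of the same shape and that `S` is positive definite. Whether the eigenvalues are Kohn-Sham orbital energies depends on which matrix is stored as `H`, which is not verified here. The function name and the toy 2×2 matrices are illustrative and not part of the nablaDFT API.

```python
import numpy as np


def solve_generalized_eigenproblem(H, S):
    """Solve H C = S C diag(eps) using Löwdin orthogonalization.

    Returns eigenvalues in ascending order and coefficients C with C.T @ S @ C = I.
    """
    s_vals, s_vecs = np.linalg.eigh(S)
    # X = S^{-1/2}; requires all overlap eigenvalues to be positive.
    X = s_vecs @ np.diag(s_vals ** -0.5) @ s_vecs.T
    # Ordinary symmetric eigenproblem in the orthogonalized basis.
    eps, C_ortho = np.linalg.eigh(X.T @ H @ X)
    # Transform the eigenvectors back to the original (non-orthogonal) basis.
    return eps, X @ C_ortho


# Toy matrices for illustration only; with the dataset, pass the H and S
# returned by HamiltonianDatabase instead.
H_toy = np.array([[-1.0, -0.5], [-0.5, -0.5]])
S_toy = np.array([[1.0, 0.2], [0.2, 1.0]])
eps, C = solve_generalized_eigenproblem(H_toy, S_toy)
print(np.allclose(H_toy @ C, S_toy @ C @ np.diag(eps)))
```

If SciPy is available, `scipy.linalg.eigh(H, S)` solves the same problem directly. The explicit version is shown because it makes the role of the overlap matrix visible.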
Energies database
Download the smallest file (train-tiny data split, 51 MB):
wget https://a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru/data/nablaDFTv2/energy_databases/train_2k_v2_formation_energy_w_forces.db
Minimal usage example:
from ase.db import connect
train = connect("train_2k_v2_formation_energy_w_forces.db")
atoms_data = train.get(1)
Working with raw psi4 wavefunctions
wget https://a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru/data/moses_wfns_compressed/wfns_moses_conformers_archive_0.7z
7z x wfns_moses_conformers_archive_0.7z
cd npys/
A variety of properties can be loaded directly from the wavefunction files. See the main paper for more details. Properties include the DFT matrices:
import numpy as np
wfn = np.load('50000_0.npy', allow_pickle=True).tolist()
orbital_matrix_a = wfn["matrix"]["Ca"] # alpha orbital coefficients
orbital_matrix_b = wfn["matrix"]["Cb"] # beta orbital coefficients
density_matrix_a = wfn["matrix"]["Da"] # alpha electronic density
density_matrix_b = wfn["matrix"]["Db"] # beta electronic density
aotoso_matrix = wfn["matrix"]["aotoso"] # atomic orbital to symmetry orbital transformation matrix
core_hamiltonian_matrix = wfn["matrix"]["H"] # core Hamiltonian matrix
fock_matrix_a = wfn["matrix"]["Fa"] # DFT alpha Fock matrix
fock_matrix_b = wfn["matrix"]["Fb"] # DFT beta Fock matrix
and bond orders for covalent and non-covalent interactions and atomic charges:
import psi4
wfn = psi4.core.Wavefunction.from_file('50000_0.npy')
psi4.oeprop(wfn, "MAYER_INDICES")
psi4.oeprop(wfn, "WIBERG_LOWDIN_INDICES")
psi4.oeprop(wfn, "MULLIKEN_CHARGES")
psi4.oeprop(wfn, "LOWDIN_CHARGES")
mayer_bos = wfn.array_variables()["MAYER INDICES"] # Mayer bond indices
wiberg_lowdin_bos = wfn.array_variables()["WIBERG LOWDIN INDICES"] # Wiberg bond indices (Löwdin basis)
mulliken_charges = wfn.array_variables()["MULLIKEN CHARGES"] # Mulliken atomic charges
lowdin_charges = wfn.array_variables()["LOWDIN CHARGES"] # Löwdin atomic charges
Models
- Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions (SchNOrb)
- SE(3)-equivariant prediction of molecular wavefunctions and electronic densities (PhiSNet)
- A continuous-filter convolutional neural network for modeling quantum interactions (SchNet)
- Equivariant message passing for the prediction of tensorial properties and molecular spectra (PaiNN)
- Fast and Uncertainty-Aware Directional Message Passing for Non-Equilibrium Molecules (DimeNet++)
- [EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations (EquiformerV2)](./nablaDFT/equiformer_v2/REA
