# Deep Learning for Symbolic Mathematics
PyTorch original implementation of [Deep Learning for Symbolic Mathematics](https://arxiv.org/abs/1912.01412) (ICLR 2020).
This repository contains code for:

- **Data generation**
    - Functions F with their derivatives f
    - Functions f with their primitives F
        - Forward (FWD)
        - Backward (BWD)
        - Integration by parts (IBP)
    - Ordinary differential equations with their solutions
        - First order (ODE1)
        - Second order (ODE2)
- **Training**
    - Half-precision (float16)
    - Multi-GPU
    - Multi-node
- **Evaluation**
    - Greedy decoding
    - Beam search evaluation

We also provide:

- **Datasets**
    - Train / valid / test sets for all tasks considered in the paper
- **Trained models**
    - Models trained with different configurations of training data
- **Notebook**
    - An ipython notebook with an interactive demo of the model on function integration
## Dependencies

- Python 3
- NumPy
- SymPy
- PyTorch
- Apex (for float16 training)
## Datasets and Trained Models
We provide datasets for each task considered in the paper:
| Dataset                        | #train | Link |
| ------------------------------ |:------:|:----:|
| Integration (FWD)              |  45M   | Link |
| Integration (BWD)              |  88M   | Link |
| Integration (IBP)              |  23M   | Link |
| Differential equations (ODE1)  |  65M   | Link |
| Differential equations (ODE2)  |  32M   | Link |
We also provide models trained on the above datasets, for integration:
| Model training data | Accuracy (FWD) | Accuracy (BWD) | Accuracy (IBP) | Link |
| ------------------- |:--------------:|:--------------:|:--------------:|:----:|
| FWD                 | 97.2%          | 16.1%          | 89.2%          | Link |
| BWD                 | 31.6%          | 99.6%          | 60.0%          | Link |
| IBP                 | 55.3%          | 85.5%          | 99.3%          | Link |
| FWD + BWD           | 96.8%          | 99.6%          | 86.1%          | Link |
| BWD + IBP           | 56.7%          | 99.5%          | 98.7%          | Link |
| FWD + BWD + IBP     | 95.6%          | 99.5%          | 99.6%          | Link |
and for differential equations:
| Model training data | Accuracy (ODE1) | Accuracy (ODE2) | Link |
| ------------------- |:---------------:|:---------------:|:----:|
| ODE1                | 97.2%           | -               | Link |
| ODE2                | -               | 88.2%           | Link |
All accuracies above are given using a beam search of size 10. Note that these datasets and models slightly differ from the ones used in the paper.
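To reproduce this kind of beam-search evaluation with a downloaded model, a command along the following lines should work (a sketch: the exact flag names, in particular `--eval_only`, `--reload_model`, `--beam_eval` and `--beam_size`, should be verified against the argument parser in `main.py`):

```bash
python main.py
--eval_only true                 # evaluation only, no training
--reload_model "fwd_bwd.pth"     # hypothetical path to a downloaded model
--tasks "prim_fwd"               # task to evaluate
--reload_data "prim_fwd,prim_fwd.train,prim_fwd.valid,prim_fwd.test"
--beam_eval true                 # decode with beam search instead of greedy
--beam_size 10                   # beam size used for the accuracies above
```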
## Data generation
If you want to use your own dataset / generator, it is possible to train a model by generating data on the fly.
However, the generation process can take a while, so we recommend first generating data and exporting it into a dataset that can be used for training. This can easily be done by setting `--export_data true`:
```bash
python main.py --export_data true

## main parameters
--batch_size 32
--cpu true
--exp_name prim_bwd_data
--num_workers 20              # number of processes
--tasks prim_bwd              # task (prim_fwd, prim_bwd, prim_ibp, ode1, ode2)
--env_base_seed -1            # generator seed (-1 for random seed)

## generator configuration
--n_variables 1               # number of variables (x, y, z)
--n_coefficients 0            # number of coefficients (a_0, a_1, a_2, ...)
--leaf_probs "0.75,0,0.25,0"  # leaf sampling probabilities
--max_ops 15                  # maximum number of operators (at generation; expressions can grow much longer after derivation)
--max_int 5                   # max value of sampled integers
--positive true               # sign of sampled integers
--max_len 512                 # maximum length of generated equations

## considered operators, with (unnormalized) sampling probabilities
--operators "add:10,sub:3,mul:10,div:5,sqrt:4,pow2:4,pow3:2,pow4:1,pow5:1,ln:4,exp:4,sin:4,cos:4,tan:4,asin:1,acos:1,atan:1,sinh:1,cosh:1,tanh:1,asinh:1,acosh:1,atanh:1"

## other generation parameters can be found in `main.py` and `src/envs/char_sp.py`
```
Data will be exported in the prefix and infix formats to:

```
./dumped/prim_bwd_data/EXP_ID/data.prefix
./dumped/prim_bwd_data/EXP_ID/data.infix
```
`data.prefix` and `data.infix` are two parallel files containing the same number of lines, with the same equations written in prefix and infix representations respectively. In these files, each line contains an input (e.g. the function to integrate) and the associated output (e.g. its integral), separated by a tab. In practice, the model only operates on prefix data; the infix data is optional, but more human-readable, and can be used for debugging.
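Prefix here refers to Polish notation, where each operator precedes its operands (e.g. `x^2 + 3` would serialize as something like `add pow x 2 3`; see the paper for the exact tokenization). A minimal way to inspect the exported pairs side by side (a sketch; the paths assume the export locations above):

```python
from itertools import islice

# print the first 3 exported (input, output) pairs from the parallel files
with open("data.prefix") as f_pre, open("data.infix") as f_in:
    for pre, inf in islice(zip(f_pre, f_in), 3):
        x_pre, y_pre = pre.rstrip("\n").split("\t")  # input / output, prefix notation
        x_inf, y_inf = inf.rstrip("\n").split("\t")  # same equation, infix notation
        print("input  (prefix):", x_pre)
        print("input  (infix) :", x_inf)
        print("output (prefix):", y_pre)
        print("output (infix) :", y_inf)
```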
Note that some generators are very fast, such as `prim_bwd`, which only requires generating a random function and differentiating it. The other generators are significantly slower: for instance, the validity of differential equations is checked (symbolically and numerically) after generation, which can be expensive. In our case, we generated the data across a large number of CPUs to create a large training set. For reproducibility, we provide our training / validation / test datasets in the links above. Generators can be made faster by decreasing the generation timeout in `char_sp.py`, but this may slightly reduce the set of equations the generator can produce.
If you generate your own dataset, you will notice that the generator produces a lot of duplicates (which is inevitable if you parallelize the generation). In practice, we remove duplicates using:
```bash
cat ./dumped/prim_bwd_data/*/data.prefix \
| awk 'BEGIN{PROCINFO["sorted_in"]="@val_num_desc"}{c[$0]++}END{for (i in c) printf("%i|%s\n",c[i],i)}' \
> data.prefix.counts
```
The resulting format is the following:

```
count1|input1_prefix	output1_prefix
count2|input2_prefix	output2_prefix
...
```
The input and output are separated by a tab, and equations are sorted by count, most frequent first. This is the format in which data has to be given to the model. The count itself is not used by the model, but was kept in case of potential curriculum learning.
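The `awk` one-liner above simply counts identical lines and sorts them by frequency; the following Python is an equivalent formulation (a sketch, assuming the exported `data.prefix` shards described above):

```python
from collections import Counter
from glob import glob

# count identical "input<TAB>output" lines across all generated shards,
# then write "count|line" sorted by decreasing count (same output as the awk above)
counts = Counter()
for path in glob("./dumped/prim_bwd_data/*/data.prefix"):
    with open(path) as f:
        counts.update(line.rstrip("\n") for line in f)

with open("data.prefix.counts", "w") as out:
    for line, c in counts.most_common():
        out.write(f"{c}|{line}\n")
```

The last step simply consists in splitting the dataset into training / validation / test sets. This can be done with the `split_data.py` script: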
```bash
# create a valid and a test set of 10k equations
python split_data.py data.prefix.counts 10000

# remove valid inputs that are in the train set
mv data.prefix.counts.valid data.prefix.counts.valid.old
awk -F"[|\t]" 'NR==FNR { lines[$2]=1; next } !($2 in lines)' <(cat data.prefix.counts.train) data.prefix.counts.valid.old \
> data.prefix.counts.valid

# remove test inputs that are in the train set
mv data.prefix.counts.test data.prefix.counts.test.old
awk -F"[|\t]" 'NR==FNR { lines[$2]=1; next } !($2 in lines)' <(cat data.prefix.counts.train) data.prefix.counts.test.old \
> data.prefix.counts.test
```
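As a quick sanity check (a minimal sketch, assuming the `count|input<TAB>output` format produced above), you can verify that no validation or test input remains in the training set:

```python
# verify that the filtered valid/test inputs no longer appear in the train set
def inputs(path):
    with open(path) as f:
        # each line is "count|input<TAB>output"; the input is the deduplication key
        return {line.split("|", 1)[1].split("\t")[0] for line in f}

train_inputs = inputs("data.prefix.counts.train")
for split in ("valid", "test"):
    overlap = inputs(f"data.prefix.counts.{split}") & train_inputs
    print(split, "overlap with train:", len(overlap))  # expect 0 after filtering
```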
## Training
To train a model, you first need data. You can either generate it using the scripts above, or download the data provided in this repository. For instance:
```bash
wget https://dl.fbaipublicfiles.com/SymbolicMathematics/data/prim_fwd.tar.gz
tar -xvf prim_fwd.tar.gz
```
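After extraction, you should have the three split files referenced by `--reload_data` below (exact paths may differ depending on where the archive unpacks):

```bash
ls prim_fwd.train prim_fwd.valid prim_fwd.test  # splits passed to --reload_data below
```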
Once you have a training / validation / test set, you can train using the following command:
```bash
python main.py

## main parameters
--exp_name first_train  # experiment name
--fp16 true --amp 2     # float16 training

## dataset location
--tasks "prim_fwd"                                                    # task
--reload_data "prim_fwd,prim_fwd.train,prim_fwd.valid,prim_fwd.test"  # data location
--reload_size 40000000                                                # training set size

## model parameters
--emb_dim 1024    # model dimension
--n_enc_layers 6  # encoder layers
--n_dec_layers 6  # decoder layers
--n_heads 8       # number of heads

## training parameters
--optimizer "adam,lr=0.0001"             # model optimizer
--batch_size 32                          # batch size
--epoch_size 300000                      # epoch size (number of equations per epoch)
--validation_metrics valid_prim_fwd_acc  # validation metric (when to save the model)
```
Additional training parameters can be found in `main.py`.
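The feature list above mentions multi-GPU and multi-node training. A launch along the following lines should work (a sketch assuming the standard `torch.distributed.launch` workflow used by similar Facebook Research codebases; verify how `main.py` consumes local ranks before relying on it):

```bash
# hypothetical multi-GPU launch via PyTorch's distributed launcher
export NGPU=8
python -m torch.distributed.launch --nproc_per_node=$NGPU main.py
--exp_name first_train_multi_gpu
--fp16 true --amp 2
# ... same task / data / model parameters as above
```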