EFCP
This repository contains the code to reproduce the experiments from our paper, available below:
Error Feedback Can Accurately Compress Preconditioners
The official project repository for the EFCP paper from DAS Lab @ Institute of Science and Technology Austria.
Installing custom CUDA kernels for M-FAC
Assuming the project is located in the home directory at ~/EFCP, the CUDA kernels can be installed with the following commands:
$ cd ~/EFCP/cuda/mfac_kernel
$ python setup_cuda.py install
We used M-FAC on RTX-3090 and A6000 GPUs.
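To verify that the extension was built correctly, a quick import test along the following lines can be used. The module name below is an assumption; check setup_cuda.py for the exact name of the built extension.

```python
# Hypothetical sanity check after installation; the extension name is an assumption,
# see setup_cuda.py for the actual module name used in the build.
import torch

assert torch.cuda.is_available(), "the custom M-FAC kernels require a CUDA-capable GPU"

import hinv_cuda  # assumed module name; replace with the one defined in setup_cuda.py
print("M-FAC CUDA extension imported successfully")
```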
Reproducing experiments
We provide a shell script to reproduce all our experiments and we recommend using WandB to track the results.
ImageNet
For this experiment we build on top of the FFCV repository and add a few extra parameters (see the custom section below).
**Dataset generation.** The FFCV pipeline pre-processes the original ImageNet dataset into the FFCV format. Make sure you set the correct paths in ~/EFCP/ffcv-imagenet/write_imagenet.sh before running this script.
**Image scaling.** Comment out the resolution section in the YAML config.
**Running the experiment.** Run the following commands after replacing the parameter values prefixed with @ with your own values.
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/ffcv-imagenet
$ bash write_imagenet.sh
$ CUDA_VISIBLE_DEVICES=0 python train_imagenet.py \
--data.train_dataset @TRAIN_PATH \
--data.val_dataset @VALIDATION_PATH \
--logging.folder @LOGGING_FOLDER \
--wandb.project @WANDB_PROJECT \
--wandb.group @WANDB_GROUP \
--wandb.job_type @WANDB_JOB_TYPE \
--wandb.name @WANDB_NAME \
--data.num_workers 12 \
--data.in_memory 1 \
--config-file rn18_configs/rn18_88_epochs.yaml \
--training.optimizer kgmfac \
--training.batch_size 1024 \
--training.momentum 0 \
--training.weight_decay 1e-05 \
--lr.lr 0.001 \
--lr.lr_schedule_type linear \
--custom.damp 1e-07 \
--custom.k 0.01 \
--custom.seed @SEED \
--custom.wd_type wd
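The kgmfac optimizer used above feeds Top-K sparsified gradients, with error feedback, into the M-FAC gradient window. The snippet below is a minimal, self-contained sketch of that compression step, not the repository's implementation; see the Sparse M-FAC code for the actual details.

```python
# Minimal sketch of Top-K compression with error feedback (not the repository's code):
# the dense gradient plus the accumulated error is sparsified, and whatever was dropped
# is remembered and re-added at the next step.
import torch

def topk_with_error_feedback(grad, error, k_ratio=0.01):
    acc = (grad + error).flatten()              # add back previously dropped mass
    k = max(1, int(k_ratio * acc.numel()))
    _, idx = torch.topk(acc.abs(), k)           # largest-magnitude coordinates
    compressed = torch.zeros_like(acc)
    compressed[idx] = acc[idx]
    new_error = acc - compressed                # error feedback for the next step
    return compressed.view_as(grad), new_error.view_as(grad)
```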
ASDL
For this experiment we build on top of the ASDL repository. We integrate our M-FAC implementations in the following files:
- ~/EFCP/asdl/asdl/precondition/mfac.py for Dense M-FAC
- ~/EFCP/asdl/asdl/precondition/sparse_mfac.py for Sparse M-FAC
**Features added.** We added the following new parameters to the existing repository:
- clip_type - specifies whether clipping should be performed by value or by norm (val, norm); see the sketch below
- clip_bound - the value used in clipping. Set it to 0 to disable clipping, regardless of the value of clip_type
- ignore_bn_ln_type - used to perform the BN/LN ablation. Possible values are none, all, modules
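The clipping options could map to logic roughly like the following; this is only an illustration of the two modes with the semantics described above, not the code from the ASDL integration.

```python
# Illustrative sketch of the two clipping modes (assumed semantics, not the ASDL code).
import torch

def clip_gradient(grad, clip_type, clip_bound):
    if clip_bound == 0:                  # clipping disabled regardless of clip_type
        return grad
    if clip_type == "val":               # clip each coordinate to [-clip_bound, clip_bound]
        return grad.clamp(-clip_bound, clip_bound)
    if clip_type == "norm":              # rescale so that the gradient norm is at most clip_bound
        norm = grad.norm()
        return grad * (clip_bound / norm) if norm > clip_bound else grad
    raise ValueError(f"unknown clip_type: {clip_type}")
```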
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/asdl/examples/arxiv_results
$ CUDA_VISIBLE_DEVICES=0 python train.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--folder @LOGGING_FOLDER \
--ngrads 1024 \
--momentum 0 \
--dataset cifar10 \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--epochs 20 \
--batch_size 32 \
--model rn18 \
--weight_decay 0.0005 \
--ignore_bn_ln_type all \
--lr 0.03 \
--clip_type norm \
--clip_bound 10 \
--damp 1e-05 \
--seed 1
BERT training
We use the HuggingFace repository referenced in the original M-FAC paper and integrate Sparse M-FAC to experiment with Question Answering and Text Classification. The following commands reproduce our QA and GLUE experiments using the parameters from **Appendix D** of our paper.
**Instructions for GLUE/MNLI.** Run Sparse M-FAC on BERT-Base:
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/huggingface/examples/MFAC_optim
python run_glue.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--output_dir @OUTPUT_DIR \
--seed @SEED \
--logging_strategy steps \
--logging_steps 10 \
--model_name_or_path bert-base \
--task_name mnli \
--num_train_epochs 3 \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--lr 2e-5 \
--damp 5e-5 \
--ngrads 1024
All available arguments are defined in the following classes:
- ModelArguments
- DataTrainingArguments
- TrainingArguments
- CustomArgs: stores our arguments for the M-FAC optimizers; note that we use lr from this class instead of learning_rate from TrainingArguments (see the illustrative sketch below)
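For reference, a custom argument class like this is typically parsed together with the standard HuggingFace classes via HfArgumentParser. The field names and defaults below are assumptions mirroring the command-line flags above, not a copy of the repository's CustomArgs.

```python
# Illustrative sketch of a CustomArgs-style dataclass parsed with HfArgumentParser.
# Field names and defaults are assumptions based on the flags shown above.
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class CustomArgs:
    optim: str = field(default="kgmfac")  # kgmfac (Top-K) or lrmfac (low-rank)
    ngrads: int = field(default=1024)     # size of the M-FAC gradient window
    k: float = field(default=0.01)        # Top-K density
    rank: int = field(default=4)          # rank for low-rank compression
    damp: float = field(default=5e-5)     # damping
    lr: float = field(default=2e-5)       # learning rate used by the M-FAC optimizers

parser = HfArgumentParser((TrainingArguments, CustomArgs))
training_args, custom_args = parser.parse_args_into_dataclasses()
```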
Other useful parameters for the run_glue.py script:
--do_train
--do_eval
--do_predict
--max_seq_length 128
--per_device_train_batch_size 32
--overwrite_output_dir
--save_strategy epoch # instead of logging_strategy and logging_steps that we used
--save_total_limit 1
**Instructions for QA/SQuADv2.** Run Sparse M-FAC on BERT-Base:
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/huggingface/examples/MFAC_optim
python run_qa.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--output_dir @OUTPUT_DIR \
--seed @SEED \
--logging_strategy steps \
--logging_steps 10 \
--model_name_or_path bert-base \
--num_train_epochs 2 \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--ngrads 1024 \
--lr 3e-5 \
--damp 5e-5
Own training pipeline
We use our own training pipeline to train a small ResNet-20 on CIFAR-10 and to run our linear probing experiment, which uses Logistic Regression on a synthetic dataset. The notation for the hyper-parameters is introduced in the first paragraph of the **Appendix**.
**CIFAR-10 / ResNet-20 (272k params).** For these experiments, check the parameters in **Appendix C** of the paper and match them with the ones in ~/EFCP/args/args_mfac.py.
**Follow these short instructions to run the Top-K or Low-Rank strategies:**
- **S-MFAC** (Top-K compression): use --optim kgmfac & --k 0.01 (the parameter --rank will be ignored)
- **LR-MFAC** (Low-Rank compression): use --optim lrmfac & --rank 1 (the parameter --k will be ignored); a sketch of this compression is given below
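As a counterpart to the Top-K sketch shown in the ImageNet section, the following is a simplified illustration of rank-r compression with error feedback, the idea behind lrmfac. The actual implementation in the repository may compute the low-rank projection differently.

```python
# Simplified sketch of low-rank compression with error feedback (not the repository's code):
# the error-corrected gradient, viewed as a matrix, is projected onto its top-r singular
# directions and the residual is carried over to the next step.
import torch

def lowrank_with_error_feedback(grad_matrix, error, rank=4):
    acc = grad_matrix + error                              # error-corrected gradient (2D view)
    U, S, Vh = torch.linalg.svd(acc, full_matrices=False)
    compressed = (U[:, :rank] * S[:rank]) @ Vh[:rank, :]   # best rank-r approximation
    new_error = acc - compressed                           # error feedback for the next step
    return compressed, new_error
```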
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP
python main.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--seed @SEED \
--root_folder @EXPERIMENT_FOLDER \
--dataset_path @PATH_TO_DATASET \
--dataset_name cifar10 \
--model rn20 \
--epochs 164 \
--batch_size 128 \
--lr_sched step \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--ngrads 1024 \
--lr 1e-3 \
--damp 1e-4 \
--weight_decay 1e-4 \
--momentum 0 \
--wd_type wd
**Logistic Regression / Synthetic Data.** For this experiment we use the same main.py script with the hyper-parameters from **Appendix A** of our paper. The dataset we used is publicly available here. Below we only show the command to run Sparse GGT. To run other optimizers, please have a look at the get_optimizer method in the helpers/training.py file and at the get_arg_parse method in args/args_mfac.py, which defines the command-line arguments.
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ CUDA_VISIBLE_DEVICES=0 python main.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--seed @SEED \
--root_folder @EXPERIMENT_FOLDER \
--dataset_path @PATH_TO_RN50x16-openai-imagenet1k \
--dataset_name rn50x16openai \
--model logreg \
--epochs 10 \
--batch_size 128 \
--lr_sched cos \
--optim ksggt \
--k 0.01 \
--ngrads 100 \
--lr 1 \
--weight_decay 0 \
--ggt_beta1 0 \
--ggt_beta2 1 \
--ggt_eps 1e-05
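For orientation, the snippet below roughly sketches a GGT-style preconditioning step (full-matrix adaptive regularization over a window of recent gradients), with ggt_eps playing the role of the damping term. It is a simplified illustration, not the repository's implementation, and it omits the beta1/beta2 momentum and decay factors as well as the Top-K error feedback applied to the stored gradients.

```python
# Rough sketch of a GGT-style preconditioned step (simplified; not the repository's code).
# G holds the last m (possibly compressed) gradients as columns; the update direction is
# approximately (G G^T)^(-1/2) g, computed through the small m x m Gram matrix.
import torch

def ggt_precondition(G, g, eps=1e-5, tol=1e-8):
    # G: (d, m) gradient window, g: (d,) current gradient
    sig2, V = torch.linalg.eigh(G.T @ G)        # eigen-decomposition of the Gram matrix
    keep = sig2 > tol                           # drop numerically-zero directions
    sig = sig2[keep].sqrt()                     # singular values of G
    U = G @ (V[:, keep] / sig)                  # corresponding directions in parameter space
    coeff = U.T @ g
    in_span = U @ (coeff / (sig + eps))         # scale by 1/(sigma + eps) inside span(G)
    out_span = (g - U @ coeff) / eps            # plain 1/eps scaling outside the span
    return in_span + out_span
```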
Quantify Preconditioning
We describe the preconditioning quantification in **Section 6** of our paper. We use the quantify_preconditioning method to compute the scaling and rotation metrics; it requires the raw gradient g and the preconditioned gradient u. Note that calling this method at every time step for large models (such as BERT-Base) slows down training considerably because the operations are performed on large tensors. Moreover, the quantiles are computed in numpy because pytorch raises an error when calling the quantile function on large tensors.
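The kind of statistics described above can be sketched as follows; this is only an illustration of scaling and rotation metrics for a preconditioner, not the repository's quantify_preconditioning.

```python
# Illustrative sketch of scaling / rotation statistics for a preconditioner
# (not the repository's quantify_preconditioning).
import numpy as np
import torch

def preconditioning_stats(g, u, quantiles=(0.25, 0.5, 0.75)):
    """g: raw gradient, u: preconditioned gradient (both flattened 1-D tensors)."""
    cosine = torch.dot(g, u) / (g.norm() * u.norm())     # rotation: angle between g and u
    norm_ratio = u.norm() / g.norm()                     # global scaling
    scale = (u.abs() / g.abs().clamp_min(1e-12)).cpu().numpy()
    qs = np.quantile(scale, quantiles)                   # quantiles in numpy, since
                                                         # torch.quantile can fail on large tensors
    return {"cosine": cosine.item(),
            "norm_ratio": norm_ratio.item(),
            "scale_quantiles": dict(zip(quantiles, qs.tolist()))}
```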
