EFCP
This repository contains the code to reproduce the experiments from our paper, available below:
Error Feedback Can Accurately Compress Preconditioners
The official project repository for the EFCP paper from DAS Lab @ Institute of Science and Technology Austria.
Installing custom CUDA kernels for M-FAC
Assuming the project is located in the home directory at ~/EFCP, the CUDA kernels can be installed with the following commands:
$ cd ~/EFCP/cuda/mfac_kernel
$ python setup_cuda.py install
We used M-FAC on RTX-3090 and A6000 GPUs.
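To verify that the extension was built correctly, a quick import test along the following lines can be used. The module name below is an assumption; check setup_cuda.py for the exact name of the built extension.

```python
# Hypothetical sanity check after installation; the extension name is an assumption,
# see setup_cuda.py for the actual module name used in the build.
import torch

assert torch.cuda.is_available(), "the custom M-FAC kernels require a CUDA-capable GPU"

import hinv_cuda  # assumed module name; replace with the one defined in setup_cuda.py
print("M-FAC CUDA extension imported successfully")
```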
Reproducing experiments
We provide a shell script to reproduce all our experiments and we recommend using WandB to track the results.
ImageNet
For this experiment we build on top of the FFCV repository and add a few extra parameters (see the custom section below).
**Dataset generation.** The FFCV pipeline pre-processes the original ImageNet dataset into the FFCV format. Make sure you set the correct paths in ~/EFCP/ffcv-imagenet/write_imagenet.sh before running this script.
**Image scaling.** Comment out the resolution section in the YAML config.
**Running the experiment.** Run the following commands after replacing the parameter values prefixed with @ with your own values.
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/ffcv-imagenet
$ bash write_imagenet.sh
$ CUDA_VISIBLE_DEVICES=0 python train_imagenet.py \
--data.train_dataset @TRAIN_PATH \
--data.val_dataset @VALIDATION_PATH \
--logging.folder @LOGGING_FOLDER \
--wandb.project @WANDB_PROJECT \
--wandb.group @WANDB_GROUP \
--wandb.job_type @WANDB_JOB_TYPE \
--wandb.name @WANDB_NAME \
--data.num_workers 12 \
--data.in_memory 1 \
--config-file rn18_configs/rn18_88_epochs.yaml \
--training.optimizer kgmfac \
--training.batch_size 1024 \
--training.momentum 0 \
--training.weight_decay 1e-05 \
--lr.lr 0.001 \
--lr.lr_schedule_type linear \
--custom.damp 1e-07 \
--custom.k 0.01 \
--custom.seed @SEED \
--custom.wd_type wd
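The kgmfac optimizer used above feeds Top-K sparsified gradients, with error feedback, into the M-FAC gradient window. The snippet below is a minimal, self-contained sketch of that compression step, not the repository's implementation; see the Sparse M-FAC code for the actual details.

```python
# Minimal sketch of Top-K compression with error feedback (not the repository's code):
# the dense gradient plus the accumulated error is sparsified, and whatever was dropped
# is remembered and re-added at the next step.
import torch

def topk_with_error_feedback(grad, error, k_ratio=0.01):
    acc = (grad + error).flatten()              # add back previously dropped mass
    k = max(1, int(k_ratio * acc.numel()))
    _, idx = torch.topk(acc.abs(), k)           # largest-magnitude coordinates
    compressed = torch.zeros_like(acc)
    compressed[idx] = acc[idx]
    new_error = acc - compressed                # error feedback for the next step
    return compressed.view_as(grad), new_error.view_as(grad)
```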
ASDL
For this experiment we build on top of the ASDL repository. We integrate our M-FAC implementations in the following files:
- ~/EFCP/asdl/asdl/precondition/mfac.py for Dense M-FAC
- ~/EFCP/asdl/asdl/precondition/sparse_mfac.py for Sparse M-FAC
**Features added.** We added the following new parameters to the existing repository:
- clip_type - specifies whether clipping should be performed by value or by norm (val, norm); see the sketch below
- clip_bound - the value used in clipping. Set it to 0 to disable clipping, regardless of the value of clip_type
- ignore_bn_ln_type - used to perform the BN/LN ablation. Possible values are none, all, modules
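The clipping options could map to logic roughly like the following; this is only an illustration of the two modes with the semantics described above, not the code from the ASDL integration.

```python
# Illustrative sketch of the two clipping modes (assumed semantics, not the ASDL code).
import torch

def clip_gradient(grad, clip_type, clip_bound):
    if clip_bound == 0:                  # clipping disabled regardless of clip_type
        return grad
    if clip_type == "val":               # clip each coordinate to [-clip_bound, clip_bound]
        return grad.clamp(-clip_bound, clip_bound)
    if clip_type == "norm":              # rescale so that the gradient norm is at most clip_bound
        norm = grad.norm()
        return grad * (clip_bound / norm) if norm > clip_bound else grad
    raise ValueError(f"unknown clip_type: {clip_type}")
```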
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/asdl/examples/arxiv_results
$ CUDA_VISIBLE_DEVICES=0 python train.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--folder @LOGGING_FOLDER \
--ngrads 1024 \
--momentum 0 \
--dataset cifar10 \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--epochs 20 \
--batch_size 32 \
--model rn18 \
--weight_decay 0.0005 \
--ignore_bn_ln_type all \
--lr 0.03 \
--clip_type norm \
--clip_bound 10 \
--damp 1e-05 \
--seed 1
BERT training
We use the HuggingFace repository referenced in the original M-FAC paper and integrate Sparse M-FAC to experiment with Question Answering and Text Classification. The following commands reproduce our QA and GLUE experiments using the parameters from **Appendix D** of our paper.
**Instructions for GLUE/MNLI.** Run Sparse M-FAC on BERT-Base:
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/huggingface/examples/MFAC_optim
python run_glue.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--output_dir @OUTPUT_DIR \
--seed @SEED \
--logging_strategy steps \
--logging_steps 10 \
--model_name_or_path bert-base \
--task_name mnli \
--num_train_epochs 3 \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--lr 2e-5 \
--damp 5e-5 \
--ngrads 1024
All available arguments are defined in the following classes:
- ModelArguments
- DataTrainingArguments
- TrainingArguments
- CustomArgs: stores our arguments for the M-FAC optimizers; note that we use lr from this class instead of learning_rate from TrainingArguments (see the illustrative sketch below)
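For reference, a custom argument class like this is typically parsed together with the standard HuggingFace classes via HfArgumentParser. The field names and defaults below are assumptions mirroring the command-line flags above, not a copy of the repository's CustomArgs.

```python
# Illustrative sketch of a CustomArgs-style dataclass parsed with HfArgumentParser.
# Field names and defaults are assumptions based on the flags shown above.
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class CustomArgs:
    optim: str = field(default="kgmfac")  # kgmfac (Top-K) or lrmfac (low-rank)
    ngrads: int = field(default=1024)     # size of the M-FAC gradient window
    k: float = field(default=0.01)        # Top-K density
    rank: int = field(default=4)          # rank for low-rank compression
    damp: float = field(default=5e-5)     # damping
    lr: float = field(default=2e-5)       # learning rate used by the M-FAC optimizers

parser = HfArgumentParser((TrainingArguments, CustomArgs))
training_args, custom_args = parser.parse_args_into_dataclasses()
```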
Other useful parameters for the run_glue.py script:
--do_train
--do_eval
--do_predict
--max_seq_length 128
--per_device_train_batch_size 32
--overwrite_output_dir
--save_strategy epoch # instead of logging_strategy and logging_steps that we used
--save_total_limit 1
**Instructions for QA/SQuADv2.** Run Sparse M-FAC on BERT-Base:
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/huggingface/examples/MFAC_optim
python run_qa.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--output_dir @OUTPUT_DIR \
--seed @SEED \
--logging_strategy steps \
--logging_steps 10 \
--model_name_or_path bert-base \
--num_train_epochs 2 \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--ngrads 1024 \
--lr 3e-5 \
--damp 5e-5
Own training pipeline
We use our own training pipeline to train a small ResNet-20 on CIFAR-10 and to run our linear probing experiment, which uses Logistic Regression on a synthetic dataset. The notation for the hyper-parameters is introduced in the first paragraph of the **Appendix**.
**CIFAR-10 / ResNet-20 (272k params).** For these experiments, check the parameters in **Appendix C** of the paper and match them with the ones in ~/EFCP/args/args_mfac.py.
**Follow these short instructions to run the Top-K or Low-Rank strategies:**
- **S-MFAC** (Top-K compression): use --optim kgmfac & --k 0.01 (the parameter --rank will be ignored)
- **LR-MFAC** (Low-Rank compression): use --optim lrmfac & --rank 1 (the parameter --k will be ignored); a sketch of this compression is given below
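As a counterpart to the Top-K sketch shown in the ImageNet section, the following is a simplified illustration of rank-r compression with error feedback, the idea behind lrmfac. The actual implementation in the repository may compute the low-rank projection differently.

```python
# Simplified sketch of low-rank compression with error feedback (not the repository's code):
# the error-corrected gradient, viewed as a matrix, is projected onto its top-r singular
# directions and the residual is carried over to the next step.
import torch

def lowrank_with_error_feedback(grad_matrix, error, rank=4):
    acc = grad_matrix + error                              # error-corrected gradient (2D view)
    U, S, Vh = torch.linalg.svd(acc, full_matrices=False)
    compressed = (U[:, :rank] * S[:rank]) @ Vh[:rank, :]   # best rank-r approximation
    new_error = acc - compressed                           # error feedback for the next step
    return compressed, new_error
```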
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP
python main.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--seed @SEED \
--root_folder @EXPERIMENT_FOLDER \
--dataset_path @PATH_TO_DATASET \
--dataset_name cifar10 \
--model rn20 \
--epochs 164 \
--batch_size 128 \
--lr_sched step \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--ngrads 1024 \
--lr 1e-3 \
--damp 1e-4 \
--weight_decay 1e-4 \
--momentum 0 \
--wd_type wd
**Logistic Regression / Synthetic Data.** For this experiment we use the same main.py script with the hyper-parameters from **Appendix A** of our paper. The dataset we used is publicly available here. Below we only show the command to run Sparse GGT. To run other optimizers, please have a look at the get_optimizer method in the helpers/training.py file and at the get_arg_parse method in args/args_mfac.py, which defines the command-line arguments.
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ CUDA_VISIBLE_DEVICES=0 python main.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--seed @SEED \
--root_folder @EXPERIMENT_FOLDER \
--dataset_path @PATH_TO_RN50x16-openai-imagenet1k \
--dataset_name rn50x16openai \
--model logreg \
--epochs 10 \
--batch_size 128 \
--lr_sched cos \
--optim ksggt \
--k 0.01 \
--ngrads 100 \
--lr 1 \
--weight_decay 0 \
--ggt_beta1 0 \
--ggt_beta2 1 \
--ggt_eps 1e-05
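For orientation, the snippet below roughly sketches a GGT-style preconditioning step (full-matrix adaptive regularization over a window of recent gradients), with ggt_eps playing the role of the damping term. It is a simplified illustration, not the repository's implementation, and it omits the beta1/beta2 momentum and decay factors as well as the Top-K error feedback applied to the stored gradients.

```python
# Rough sketch of a GGT-style preconditioned step (simplified; not the repository's code).
# G holds the last m (possibly compressed) gradients as columns; the update direction is
# approximately (G G^T)^(-1/2) g, computed through the small m x m Gram matrix.
import torch

def ggt_precondition(G, g, eps=1e-5, tol=1e-8):
    # G: (d, m) gradient window, g: (d,) current gradient
    sig2, V = torch.linalg.eigh(G.T @ G)        # eigen-decomposition of the Gram matrix
    keep = sig2 > tol                           # drop numerically-zero directions
    sig = sig2[keep].sqrt()                     # singular values of G
    U = G @ (V[:, keep] / sig)                  # corresponding directions in parameter space
    coeff = U.T @ g
    in_span = U @ (coeff / (sig + eps))         # scale by 1/(sigma + eps) inside span(G)
    out_span = (g - U @ coeff) / eps            # plain 1/eps scaling outside the span
    return in_span + out_span
```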
Quantify Preconditioning
We describe the preconditioning quantification in **Section 6** of our paper. We use the quantify_preconditioning method to compute the scaling and rotation metrics; it requires the raw gradient g and the preconditioned gradient u. Note that calling this method at every time step for large models (such as BERT-Base) slows down training considerably because the operations are performed on large tensors. Moreover, the quantiles are computed in numpy because pytorch raises an error when calling the quantile function on large tensors.
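The kind of statistics described above can be sketched as follows; this is only an illustration of scaling and rotation metrics for a preconditioner, not the repository's quantify_preconditioning.

```python
# Illustrative sketch of scaling / rotation statistics for a preconditioner
# (not the repository's quantify_preconditioning).
import numpy as np
import torch

def preconditioning_stats(g, u, quantiles=(0.25, 0.5, 0.75)):
    """g: raw gradient, u: preconditioned gradient (both flattened 1-D tensors)."""
    cosine = torch.dot(g, u) / (g.norm() * u.norm())     # rotation: angle between g and u
    norm_ratio = u.norm() / g.norm()                     # global scaling
    scale = (u.abs() / g.abs().clamp_min(1e-12)).cpu().numpy()
    qs = np.quantile(scale, quantiles)                   # quantiles in numpy, since
                                                         # torch.quantile can fail on large tensors
    return {"cosine": cosine.item(),
            "norm_ratio": norm_ratio.item(),
            "scale_quantiles": dict(zip(quantiles, qs.tolist()))}
```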
