SkyComputing
Sky Computing: Accelerating Geo-distributed Computing in Federated Learning
Install / Use
/learn @hpcaitech/SkyComputingREADME
Sky Computing
Introduction
Sky Computing is a load-balanced framework for federated learning model parallelism. It adaptively allocate model layers to devices based on the their hardware sepcification. Sky Computing outperforms the baseline method by 55% in training time when training 160-layer BERT in a 64-node cluster. Our paper can be found at https://arxiv.org/abs/2202.11836
The concept sky computing was first introduced by Dr. Katarzyna Keahey et al. They used this word to describe a cross-cloud compute pattern. And later Prof. Stoica and Prof. Shenker generalized this word to geo-distributed computing. Our project is based on their definition. [1] [2]
Installation
git clone git@github.com:hpcaitech/SkyComputing.git
python -m pip install -r requirements.txt
cd ./scaelum
python -m pip install -v -e .
Experiment (using BERT)
To benchmark the Sky Computing, we prepared a single demo which you can run on your cluster to train BERT.
Prepare BERT model
Bidirectional Encoder Representations from Transformers (aka BERT) is one of the state-of-the-art deep learning models for Natural Language Processing. In the experiment part, we use BERT to run a simple benchmark.
cd $PROJECT
mkdir -p BERT/model && cd BERT/model
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip
Prepare GLUE MNLI dataset
The General Language Understanding Evaluation (aka GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. And the Multi-Genre Natural Language Inference (aka MNLI) is one of the tasks in GLUE, it is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.
cd $PROJECT
mkdir -p BERT/data && cd BERT/data
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/1502038877f6a88c225a34450793fbc3ea87eaba/download_glue_data.py
python download_glue_data.py --data_dir ./glue_data --tasks MNLI
Configuration
To run dllb in your cluster, you need to write a config file which contains the necessary information about training, e.g. model layers, useful environment variables. We have provided a well-commentted example, and here are some most important option:
# your project path
PROJECT = os.getenv("PROJECT")
# allocation type, valid values are even, optimal and dynamic
ALLOCATE_TYPE = "even"
# num of node (including the central server)
CORE_NUM = 4
Run scripts
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We used slurm script to run our experiment.
#!/bin/sh
#SBATCH --job-name=gpu16 # Job name
#SBATCH -o gpu16.o%j # Name of stdout output file
#SBATCH -e gpu16.e%j # Name of stderr error file
#SBATCH -N 16 # Node numbers
#SBATCH -n 16 # GPU numbers
#SBATCH --time=02:00:00 # Run time (hh:mm:ss)
# run
python ./ip_addr.py > "./HOST"
srun python ./launch.py -c "./experiment/config.py"
Citation
@misc{zhu2022sky,
title={Sky Computing: Accelerating Geo-distributed Computing in Federated Learning},
author={Jie Zhu and Shenggui Li and Yang You},
year={2022},
eprint={2202.11836},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Reference
@article{keahey2009sky,
title={Sky computing},
author={Keahey, Katarzyna and Tsugawa, Mauricio and Matsunaga, Andrea and Fortes, Jose},
journal={IEEE Internet Computing},
volume={13},
number={5},
pages={43--51},
year={2009},
publisher={IEEE}
}
@inproceedings{stoica2021cloud,
title={From cloud computing to sky computing},
author={Stoica, Ion and Shenker, Scott},
booktitle={Proceedings of the Workshop on Hot Topics in Operating Systems},
pages={26--32},
year={2021}
}
Related Skills
YC-Killer
2.7kA library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.
flutter-tutor
Flutter Learning Tutor Guide You are a friendly computer science tutor specializing in Flutter development. Your role is to guide the student through learning Flutter step by step, not to provide d
groundhog
398Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).
last30days-skill
16.9kAI agent skill that researches any topic across Reddit, X, YouTube, HN, Polymarket, and the web - then synthesizes a grounded summary
