MOROCCO
MOdel ResOurCe COnsumption. Evaluate Russian SuperGLUE models performance: inference speed, RAM usage. Reproducible scores using Docker
MOdel ResOurCe COnsumption. A repository for evaluating the performance of Russian SuperGLUE models: inference speed and GPU RAM usage. It moves the leaderboard from static text submissions with predictions to reproducible Docker containers.
Each disc corresponds to a <a href="jiant">Jiant baseline model</a>; disc size is proportional to GPU RAM usage. The X axis shows model inference speed in records per second, the Y axis shows model score averaged over the 9 Russian SuperGLUE tasks.
- Smaller models have higher inference speed. rugpt3-small processes ~200 records per second while rugpt3-large processes ~60 records per second.
- bert-multilingual is a bit slower than rubert* due to a worse Russian tokenizer. bert-multilingual splits text into more tokens, so it has to process larger batches.
- It is common that larger models show higher scores, but in our case rugpt3-medium and rugpt3-large perform worse than the smaller rubert* models.
- rugpt3-large has more parameters than rugpt3-medium but is currently trained for less time and has a lower score.
Papers
- <a href="https://arxiv.org/abs/2104.14314">MOROCCO: Model Resource Comparison Framework</a>
- <a href="https://arxiv.org/abs/2010.15925">RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark</a>
How to measure model performance using MOROCCO and submit it to Russian SuperGLUE leaderboard?
Build Docker containers for each Russian SuperGLUE task
To benchmark model performance with MOROCCO, package the model in a Docker container, store the model weights inside the container, and provide the following interface:
- Read test data from stdin;
- Write predictions to stdout;
- Handle the --batch-size argument. MOROCCO runs the container with --batch-size=1 to estimate model size in GPU RAM.
docker pull russiannlp/rubert-terra
docker run --gpus all --interactive --rm \
  russiannlp/rubert-terra --batch-size 8 \
  < TERRa/test.jsonl \
  > preds.jsonl
# TERRa/test.jsonl
{"premise": "Гвардейцы подошли к грузовику, ...", "hypothesis": "Гвардейцы подошли к сломанному грузовику.", "idx": 0}
{"premise": "\"К настоящему моменту число ...", "hypothesis": "Березовский открывает аккаунты во всех соцсетях.", "idx": 1}
...
# preds.jsonl
{"idx": 0, "label": "entailment"}
{"idx": 1, "label": "entailment"}
...
Refer to <a href="tfidf/">tfidf/</a> for a minimal example and instructions on how to build a Docker container. The minimal TF-IDF example runs on CPU and ignores the --batch-size argument. Refer to <a href="jiant/">jiant/</a> for an example of how to build a GPU container.
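The interface above (JSONL on stdin, JSONL on stdout, a --batch-size argument) can be sketched in a few lines of Python. This is only an illustration and not the code from tfidf/ or jiant/: predict_batch is a hypothetical placeholder that always answers "entailment", and the function names run, batches and main are assumptions made for this sketch.

```python
import argparse
import json
import sys


def predict_batch(records):
    # Hypothetical placeholder model: always answers "entailment".
    # A real container would run the model stored inside the image here.
    return ["entailment" for _ in records]


def batches(items, size):
    # Yield consecutive chunks of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run(lines, batch_size):
    # Parse JSONL input, predict batch by batch, yield JSONL output lines.
    records = [json.loads(line) for line in lines if line.strip()]
    for batch in batches(records, batch_size):
        for record, label in zip(batch, predict_batch(batch)):
            yield json.dumps({"idx": record["idx"], "label": label})


def main(argv=None):
    # Read test data from stdin, write predictions to stdout.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=1)
    args = parser.parse_args(argv)
    for line in run(sys.stdin, args.batch_size):
        print(line)

# The script used as the container entry point would end with a call to main().
```

Keeping the batching in run, separate from argument parsing, makes it easy to check the output format locally before building the image.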
Build containers for each Russian SuperGLUE task:
docker image ls
russiannlp/rubert-danetqa
russiannlp/rubert-lidirus
russiannlp/rubert-muserc
russiannlp/rubert-parus
russiannlp/rubert-rcb
russiannlp/rubert-rucos
russiannlp/rubert-russe
russiannlp/rubert-rwsd
russiannlp/rubert-terra
russiannlp/rugpt3-large-danetqa
russiannlp/rugpt3-large-lidirus
...
Rent instance at Yandex Cloud
MOROCCO runs all benchmarks on the same hardware. We use <a href="https://cloud.yandex.ru/docs/compute/concepts/gpus">Yandex Cloud gpu-standard-v1 instance</a>:
- NVIDIA® Tesla® V100 GPU with 32 GB GPU RAM
- 8 Intel Broadwell CPUs
- 96 GB RAM
We ask MOROCCO benchmark participants to rent the same instance at Yandex Cloud at their own expense. The current rent price is ~75 rubles/hour.
Create GPU instance using Yandex Cloud CLI:
- By default <a href="https://cloud.yandex.ru/docs/overview/concepts/quotas-limits">quota for number of GPU instances is zero</a>. <a href="https://console.cloud.yandex.ru/support/create-ticket">Create a ticket</a>, ask support to increase your quota to 1.
- Default HDD size is 50 GB, tweak --create-boot-disk to increase the size.
- --preemptible means that the instance is force stopped after 24 hours. Data stored on HDD is saved, all data in RAM is lost. A preemptible instance is cheaper, it costs ~75 rubles/hour.
yc resource-manager folder create --name russian-superglue
yc vpc network create --name default --folder-name russian-superglue
yc vpc subnet create \
  --name default \
  --network-name default \
  --range 192.168.0.0/24 \
  --zone ru-central1-a \
  --folder-name russian-superglue
yc compute instance create \
  --name default \
  --zone ru-central1-a \
  --network-interface subnet-name=default,nat-ip-version=ipv4 \
  --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-2004-lts-gpu,type=network-hdd,size=50 \
  --cores=8 \
  --memory=96 \
  --gpus=1 \
  --ssh-key ~/.ssh/id_rsa.pub \
  --folder-name russian-superglue \
  --platform-id gpu-standard-v1 \
  --preemptible
Stop the GPU instance to pay only for HDD storage. Start it again to continue experiments.
yc compute instance stop --name default --folder-name russian-superglue
yc compute instance start --name default --folder-name russian-superglue
Delete the GPU instance, subnet, network and folder.
yc compute instance delete --name default --folder-name russian-superglue
yc vpc subnet delete --name default --folder-name russian-superglue
yc vpc network delete --name default --folder-name russian-superglue
yc resource-manager folder delete --name russian-superglue
Produce benchmark logs
Use <a href="bench/main.py">bench/main.py</a> to collect CPU and GPU usage during container inference:
- Download <a href="https://russiansuperglue.com/tasks/">tasks data from Russian SuperGLUE site</a>, extract the archive to data/public/;
- Increase/decrease --input-size=2000 for optimal runtime. RuBERT processes 2000 PARus records in ~5 seconds, long enough to estimate inference speed;
- Increase/decrease --batch-size=32 to max out GPU RAM usage. RuBERT uses 100% GPU RAM on PARus with batch size 32;
- main.py calls ps and nvidia-smi, parses the output, writes CPU and GPU usage to stdout, and repeats 3 times per second.
python main.py bench russiannlp/rubert-parus data/public parus --input-size=2000 --batch-size=32 > 2000_32_01.jsonl
# data/public
data/public/LiDiRus
data/public/LiDiRus/LiDiRus.jsonl
data/public/MuSeRC
data/public/MuSeRC/test.jsonl
data/public/MuSeRC/val.jsonl
data/public/MuSeRC/train.jsonl
...
# 2000_32_01.jsonl
{"timestamp": 1655476624.532146, "cpu_usage": 0.953, "ram": 292663296, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476624.8558557, "cpu_usage": 0.767, "ram": 299151360, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476625.1793833, "cpu_usage": 0.767, "ram": 299151360, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476625.5032206, "cpu_usage": 0.83, "ram": 342458368, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476625.8275468, "cpu_usage": 0.728, "ram": 349483008, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476626.1513274, "cpu_usage": 0.762, "ram": 341012480, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476626.4759278, "cpu_usage": 0.762, "ram": 341012480, "gpu_usage": null, "gpu_ram": null}
...
{"timestamp": 1655476632.3156314, "cpu_usage": 0.775, "ram": 1693970432, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476632.6450512, "cpu_usage": 0.78, "ram": 1728303104, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476632.975281, "cpu_usage": 0.728, "ram": 1758257152, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476633.3079898, "cpu_usage": 0.8, "ram": 1758818304, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476633.6325083, "cpu_usage": 0.808, "ram": 1787203584, "gpu_usage": null, "gpu_ram": null}
{"timestamp": 1655476633.9611752, "cpu_usage": 0.774, "ram": 1199480832, "gpu_usage": 0.0, "gpu_ram": 12582912}
{"timestamp": 1655476634.413833, "cpu_usage": 0.78, "ram": 1324830720, "gpu_usage": 0.0, "gpu_ram": 326107136}
{"timestamp": 1655476634.7563012, "cpu_usage": 0.727, "ram": 1331073024, "gpu_usage": 0.0, "gpu_ram": 393216000}
{"timestamp": 1655476635.0970583, "cpu_usage": 0.73, "ram": 1334509568, "gpu_usage": 0.0, "gpu_ram": 405798912}
{"timestamp": 1655476635.4380798, "cpu_usage": 0.74, "ram": 1387737088, "gpu_usage": 0.02, "gpu_ram": 433061888}
{"timestamp": 1655476635.7793305, "cpu_usage": 0.696, "ram": 1425448960, "gpu_usage": 0.0, "gpu_ram": 445644800}
{"timestamp": 1655476636.1234272, "cpu_usage": 0.698, "ram": 1447387136, "gpu_usage": 0.0, "gpu_ram": 451936256}
{"timestamp": 1655476636.4652247, "cpu_usage": 0.704, "ram": 1506942976, "gpu_usage": 0.0, "gpu_ram": 462422016}
{"timestamp": 1655476636.8055842, "cpu_usage": 0.668, "ram": 1542393856, "gpu_usage": 0.02, "gpu_ram": 485490688}
{"timestamp": 1655476637.146097, "cpu_usage": 0.673, "ram": 1587482624, "gpu_usage": 0.0, "gpu_ram": 495976448}
{"timestamp": 1655476637.4880967, "cpu_usage": 0.678, "ram": 1635229696, "gpu_usage": 0.01, "gpu_ram": 512753664}
{"timestamp": 1655476637.8288727, "cpu_usage": 0.641, "ram": 1664548864, "gpu_usage": 0.01, "gpu_ram": 523239424}
...
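A log like the one above can be reduced to a few numbers with a short script. The sketch below is an illustration, not the logic of main.py: the summarize function is a hypothetical helper, run time is approximated as the span between the first and the last sample, and ram and gpu_ram are assumed to be in bytes, judging by their magnitudes.

```python
import json


def summarize(lines):
    # Keep only JSON samples; elided "..." lines are skipped.
    samples = [json.loads(line) for line in lines if line.strip().startswith("{")]
    timestamps = [sample["timestamp"] for sample in samples]
    # gpu_ram is null in the early samples of the log, so drop those.
    gpu_ram = [sample["gpu_ram"] for sample in samples if sample["gpu_ram"] is not None]
    return {
        # Seconds between the first and the last sample (approximate run time).
        "duration": max(timestamps) - min(timestamps),
        # Peak values, assumed to be bytes.
        "max_ram": max(sample["ram"] for sample in samples),
        "max_gpu_ram": max(gpu_ram, default=0),
    }
```

For example, summarize(open("2000_32_01.jsonl")) would return the approximate run time and the peak RAM and GPU RAM usage of that run.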
Produce benchmark logs for each task:
- Benchmark with --input-size=1, --batch-size=1. This way MOROCCO estimates model init time and model size in GPU RAM. We assume that 1 record takes almost no time to process and almost no space in GPU RAM, so all run time is init time and max GPU RAM usage is model size;
- Benchmark with --input-size=X, --batch-size=Y where X > 1. Choose X so that the model takes at least several seconds to process the input, otherwise the inference speed estimate is not robust. Choose Y so that the model still fits in GPU RAM while maximizing GPU utilization and inference speed;
- Repeat every measurement 5 times for better median estimates;
- Save logs to logs/$task/${input_size}_${batch_size}_${index}.jsonl files. Do not change the path pattern: main.py plot|stats parses the file path to get the task, input size and batch size.
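The path pattern and the two kinds of runs described above can be tied together in a short sketch. It is an assumption about how the estimates might be combined, not a copy of what main.py stats does: parse_log_path and estimate_speed are hypothetical helpers, and subtracting the median init time from the median full run time is one plausible reading of the procedure.

```python
import re
from statistics import median

# Pattern from the text: logs/$task/${input_size}_${batch_size}_${index}.jsonl
LOG_PATH = re.compile(
    r"logs/(?P<task>[^/]+)/(?P<input_size>\d+)_(?P<batch_size>\d+)_(?P<index>\d+)\.jsonl$"
)


def parse_log_path(path):
    # Recover task, input size, batch size and run index from a log path.
    match = LOG_PATH.search(path)
    if not match:
        raise ValueError(f"unexpected log path: {path}")
    return {
        "task": match["task"],
        "input_size": int(match["input_size"]),
        "batch_size": int(match["batch_size"]),
        "index": match["index"],
    }


def estimate_speed(init_durations, full_durations, input_size):
    # init_durations: run times in seconds of the --input-size=1 runs.
    # full_durations: run times in seconds of the --input-size=X runs.
    # Assumption: inference time = median full run time - median init time.
    init_time = median(init_durations)
    inference_time = median(full_durations) - init_time
    return input_size / inference_time  # records per second
```

With five init runs of about 10 seconds and five 2000-record runs of about 15 seconds, this gives 2000 / (15 - 10) = 400 records per second. If X is too small, the difference approaches zero and the estimate becomes unstable, which is why the runs need to take at least several seconds.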
input_size=2000
batch_size=32
model=russiannlp/rubert
for task in rwsd parus rcb danetqa muserc russe rucos terra lidirus
do
mkdir -p logs/$task
for index in 01 02 03 04 05
do
python main.py bench $model-$tas
