Etalon
LLM Serving Performance Evaluation Harness
Install / Use
/learn @project-etalon/EtalonREADME
Setup
Clone repository
git clone https://github.com/project-etalon/etalon.git
Create conda environment
conda create -n etalon python=3.10
conda activate etalon
Install etalon
cd etalon
pip install -e .
Installing with vLLM (optional)
pip install -e ".[vllm]"
Setup Wandb [Optional]
First create and setup your account at https://<your-org>.wandb.io/ or public Wandb and obtain API key. Then run the following command and enter API key linked to your wandb account:
wandb login --host https://<your-org>.wandb.io
To opt out of wandb, do any of the following:
- Don't pass any wandb related args like
--wandb-project,--wandb-groupandwandb-run-namewhen running python scripts. Alternatively, pass in--no-should-write-metricsinstead of--should-write-metricsboolean flag. - Run
export WANDB_MODE=disabledin your shell or add this to~/.zshrcor~/.bashrc. Remember to reload your shell usingsource ~/.zshrcorsource ~/.bashrc.
Running Code
Running with Public APIs
Export API Key and URL
export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE=https://api.endpoints.anyscale.com/v1
Running Benchmark
python -m etalon.run_benchmark \
--client_config_model "meta-llama/Meta-Llama-3-8B-Instruct" \
--max_completed_requests 150 \
--timeout 600 \
--client_config_num_clients 2 \
--client_config_num_concurrent_requests_per_client 5 \
--metrics_config_output_dir "result_outputs" \
--request_interval_generator_config_type "poisson" \
--poisson_request_interval_generator_config_qps 0.5 \
--request_length_generator_config_type "trace" \
--trace_request_length_generator_config_trace_file "./data/processed_traces/arxiv_summarization_filtered_stats_llama2_tokenizer.csv" \
--trace_request_length_generator_config_max_tokens 8192 \
--deadline_config_ttft_deadline 0.3 \
--deadline_config_tbt_deadline 0.03 \
--metrics_config_should_write_metrics \
--metrics_config_wandb_project Project \
--metrics_config_wandb_group Group \
--metrics_config_wandb_run_name Run
There are many more arguments for running benchmark, run the following to know more:
python -m etalon.run_benchmark -h
Running with Open Source Systems
etalon can be run with any open source LLM inference system. If open source system does not provide OpenAI Compatible APIs, then kindly implement new LLM clients to support new open source system as explained in here.
Here we give an example with vLLM.
Launch vLLM Server
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123 -tp 1 --rope-scaling '{"type":"dynamic","factor":2.0}'
If we need higher context length than supported by the model with certain scale factor, then we can add rope-scaling as --rope-scaling '{"type":"dynamic","factor":2.0}'. Adjust type and factor as per the use case.
Export API Key and URL
export OPENAI_API_KEY=token-abc123
export OPENAI_API_BASE=http://localhost:8000/v1
And then we can run the benchmark as shown here. Be sure to update --model flag to same model used to launch vLLM.
Saving Results
The results of the benchmark are saved in the results directory specified by the --output-dir argument.
Running Prefill Profiler
To profile prefill times of open source systems and create a prefill time predictor for a given model and open source system, based on input prompt length, we can run etalon.prefill_profiler.
Launch any open source system and setup API keys and URL as shown for vLLM.
python -m etalon.prefill_profiler \
--client_config_model "meta-llama/Meta-Llama-3-8B-Instruct" \
--timeout 600 \
--metrics_config_output_dir "prefill_experiments/prefill_profiler_vllm_llama-3-8b" \
--metrics_config_should_use_given_dir true
To modify range of prompt tokens for which prefill times get profiled, use the flag --prefill-lengths as follows:
python -m etalon.prefill_profiler \
--client_config_model "meta-llama/Meta-Llama-3-8B-Instruct" \
--timeout 600 \
--metrics_config_output_dir "prefill_experiments/prefill_profiler_vllm_llama-3-8b" \
--metrics_config_should_use_given_dir true \
--prefill_profiler_config_prefill_lengths 256 512 1024 2048 4096 8192 16384 32768 65536
Running Capacity Search
Important: Run prefill profiler for a given model and open source system before running capacity search of deadline-based SLO type.
Refer to readme file of etalon/capacity_search folder to know more about how to run capacity search.
Implementing New LLM Clients
To implement a new LLM client, you need to implement the base class etalon.llm_client.BaseLLMClient and decorate it as a ray actor.
from etalon.llm_client import BaseLLMClient
import ray
@ray.remote
class CustomLLMClient(BaseLLMClient):
def send_llm_request(self, request_config: RequestConfig) -> Tuple[Metrics, str, RequestConfig]:
"""Make a single completion request to a LLM API
Returns:
Metrics about the performance characteristics of the request.
The text generated by the request to the LLM API.
The request_config used to make the request. This is mainly for logging purposes.
"""
...
Citation
If you use our work, please consider citing our paper:
@misc{agrawal2024etalonholisticperformanceevaluation,
title={etalon: Holistic Performance Evaluation Framework for LLM Inference Systems},
author={Amey Agrawal and Anmol Agarwal and Nitin Kedia and Jayashree Mohan and Souvik Kundu and Nipun Kwatra and Ramachandran Ramjee and Alexey Tumanov},
year={2024},
eprint={2407.07000},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.07000},
}
Acknowledgement
This repository was originally created as fork from LLMPerf project.
