Deplodock
Benchmark and deploy optimized LLMs on GPU servers with vLLM or SGLang. Choose from a list of optimized recipes for popular models or create your own with custom configurations. Run benchmarks across different GPU types and configurations, track results, and share experiments with the community.
Project Structure
- deplodock/ — Python package
  - deplodock.py — CLI entrypoint
  - logging_setup.py — CLI logging configuration
  - hardware.py — GPU specs and instance type mapping
  - detect.py — GPU detection via PCI sysfs (local and remote)
  - commands/ — CLI layer (thin argparse handlers, see ARCHITECTURE.md)
    - deploy/ — deploy local, deploy ssh, deploy cloud commands
    - bench/ — bench command
    - teardown.py — teardown command
    - vm/ — vm create/delete commands (GCP, CloudRift)
  - recipe/ — Recipe loading, dataclass types, engine flag mapping (see ARCHITECTURE.md)
  - deploy/ — Compose generation, deploy orchestration
  - provisioning/ — Cloud provisioning, SSH transport, VM lifecycle
  - benchmark/ — Benchmark tracking, config, task enumeration, execution
  - planner/ — Groups benchmark tasks into execution groups for VM allocation
- recipes/ — Model deploy recipes (YAML configs per model)
- experiments/ — Experiment parameter sweeps (self-contained recipe + results)
- docker/ — Custom Docker images (e.g., vLLM ROCm for MI350X)
- docs/ — Technical notes and engine-specific guides
  - sglang-awq-moe.md — SGLang quantization for AWQ MoE models
- tests/ — pytest tests (see ARCHITECTURE.md)
- scripts/ — Analysis and visualization scripts
- utils/ — Standalone utility scripts
- config.yaml — Benchmark configuration
- Makefile — Build automation
- pyproject.toml — Package metadata and tool config
Quick Start
Install
git clone https://github.com/cloudrift-ai/deplodock.git
cd deplodock
make setup
Deploy a Model
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --ssh user@host
Deploy Locally
deplodock deploy local \
  --recipe recipes/Qwen3-Coder-30B-A3B-Instruct-AWQ
Teardown
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --ssh user@host \
  --teardown
Dry Run
Preview commands without executing:
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --ssh user@host \
  --dry-run
Recipes
Recipes are declarative YAML configs in recipes/<model>/recipe.yaml. Each recipe defines a model, engine settings, and a matrices section for benchmark configurations.
Format
model:
  huggingface: "org/model-name"

engine:
  llm:
    tensor_parallel_size: 8
    pipeline_parallel_size: 1
    gpu_memory_utilization: 0.9
    context_length: 16384
    max_concurrent_requests: 512
    vllm:
      image: "vllm/vllm-openai:v0.17.0"
      extra_args: "--kv-cache-dtype fp8"  # Flags not covered by named fields

benchmark:
  max_concurrency: 128
  num_prompts: 256
  random_input_len: 8000
  random_output_len: 8000

# Simple single-point entry (implicit zip)
matrices:
  deploy.gpu: "NVIDIA H200 141GB"
  deploy.gpu_count: 8
Cross-Product and Zip Combinators
The matrices section supports two combinators for generating benchmark variants:
- cross: Cartesian product of all list-valued axes. Scalars are broadcast.
- zip: Element-wise pairing of equal-length lists. Scalars are broadcast.
A plain matrices dict (no cross/zip key) is an implicit zip.
# Cross-product: 3 GPUs × 2 configs = 6 variants
matrices:
  cross:
    deploy.gpu_count: 1   # scalar → broadcast
    deploy.gpu:           # list → cross-product axis
      - "NVIDIA GeForce RTX 5090"
      - "NVIDIA H100 80GB"
      - "NVIDIA H200 141GB"
    zip:                  # zip sub-dict → one compound axis
      engine.llm.max_concurrent_requests: [128, 512]
      benchmark.max_concurrency: [128, 512]

# Concurrency sweep (zip: 8 runs from one entry)
matrices:
  deploy.gpu: "NVIDIA GeForce RTX 5090"
  engine.llm.max_concurrent_requests: [1, 2, 4, 8, 16, 32, 64, 128]
  benchmark.max_concurrency: [1, 2, 4, 8, 16, 32, 64, 128]
Within a cross node: scalars broadcast, lists are independent axes (cartesian product), nested zip dicts bundle their lists into one compound axis. Within a zip node: scalars broadcast, lists are zipped element-wise (must all be the same length). cross/zip keys are only treated as combinators when their value is a dict.
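The expansion rules above can be sketched in Python. This is a simplified illustration of the documented semantics only; expand_matrices is a hypothetical helper name, not the planner's actual code, and it handles just the cases shown in the examples.

```python
from itertools import product

def expand_matrices(node, mode="zip"):
    """Expand a matrices dict into a list of variant dicts (sketch)."""
    # An explicit top-level cross:/zip: combinator wraps the real axes.
    if set(node) == {"cross"} and isinstance(node["cross"], dict):
        return expand_matrices(node["cross"], mode="cross")
    if set(node) == {"zip"} and isinstance(node["zip"], dict):
        return expand_matrices(node["zip"], mode="zip")

    axes = []  # each axis is a list of partial-variant dicts
    for key, value in node.items():
        if key == "zip" and isinstance(value, dict) and mode == "cross":
            # Nested zip inside cross: bundle its lists into one compound axis.
            axes.append(expand_matrices(value, mode="zip"))
        elif isinstance(value, list):
            axes.append([{key: v} for v in value])
        else:
            axes.append([{key: value}])  # scalar → broadcast

    if mode == "cross":
        combos = product(*axes)
    else:  # zip: all lists must share one length; scalars repeat
        length = max(len(a) for a in axes)
        assert all(len(a) in (1, length) for a in axes), "zip length mismatch"
        combos = zip(*(a * length if len(a) == 1 else a for a in axes))

    return [{k: v for part in combo for k, v in part.items()} for combo in combos]
```

Running this on the cross-product example above yields the 6 variants the comment promises; the plain concurrency-sweep dict expands to 8.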
Matrix entries use dot-notation for all parameter paths. deploy.gpu is required.
Variant Filtering
Use --filter to run a subset of variants:
deplodock bench recipes/my-recipe --filter "deploy.gpu=*5090*"
deplodock bench recipes/my-recipe --filter "deploy.gpu=*5090*" --filter "batches=1"
Multiple --filter flags use AND logic. Values are matched with fnmatch glob patterns against the expanded parameter values.
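A minimal Python sketch of this matching logic, assuming variants are flat dicts keyed by dot-notation paths (variant_matches is an illustrative name, not the real matcher):

```python
from fnmatch import fnmatch

def variant_matches(variant: dict, filters: list[str]) -> bool:
    """AND together --filter expressions, glob-matching each value."""
    for expr in filters:
        key, _, pattern = expr.partition("=")
        if not fnmatch(str(variant.get(key, "")), pattern):
            return False
    return True

variant = {"deploy.gpu": "NVIDIA GeForce RTX 5090", "deploy.gpu_count": 1}
variant_matches(variant, ["deploy.gpu=*5090*"])                           # matches
variant_matches(variant, ["deploy.gpu=*5090*", "deploy.gpu_count=2"])     # rejected
```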
deploy.driver_version and deploy.cuda_version (optional) request a specific NVIDIA driver / CUDA toolkit on the target host. If the installed version already matches (prefix-match — "550" matches 550.127.05), provisioning is a no-op. On a mismatch, a remote (ssh/cloud) deploy installs the requested version, reboots the host, and waits for SSH to come back. Local deploys refuse to run privileged commands and will error out instead — these fields are intended for remote machines only.
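The prefix-match rule can be made precise with a small sketch. Comparing dot-separated components (rather than a raw string prefix) is an assumption about the implementation, but it captures the documented example while keeping "55" from matching 550.x:

```python
def driver_matches(installed: str, requested: str) -> bool:
    """Prefix-match on dot-separated version components (illustrative)."""
    have = installed.split(".")
    want = requested.split(".")
    return have[: len(want)] == want

driver_matches("550.127.05", "550")   # match → provisioning is a no-op
driver_matches("550.127.05", "551")   # mismatch → install, reboot, wait for SSH
```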
Engine-agnostic fields (tensor_parallel_size, context_length, etc.) live at engine.llm. Engine-specific fields (image, extra_args) nest under engine.llm.vllm or engine.llm.sglang.
Docker Options
Arbitrary docker-compose service keys can be injected via engine.llm.docker_options. This is useful for GPU-specific container settings like ROCm's security options:
engine:
  llm:
    docker_options:
      security_opt:
        - seccomp=unconfined
      cap_add:
        - SYS_PTRACE
Keys already managed by the compose template (image, volumes, ports, healthcheck, etc.) are rejected at load time.
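The rejection check amounts to a set intersection. A sketch, with an illustrative list of managed keys (the real set lives next to the compose template and may differ):

```python
# Compose service keys the template owns; docker_options may not override them.
MANAGED_KEYS = {"image", "volumes", "ports", "healthcheck", "command", "environment"}

def validate_docker_options(options: dict) -> dict:
    clashes = MANAGED_KEYS & options.keys()
    if clashes:
        raise ValueError(f"docker_options may not set managed keys: {sorted(clashes)}")
    return options

validate_docker_options({"security_opt": ["seccomp=unconfined"]})  # accepted
# validate_docker_options({"image": "x"})  # rejected at load time
```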
SGLang Matrix Entry Example
To benchmark with SGLang alongside vLLM, use a cross-product with the engine image:
matrices:
  cross:
    deploy.gpu: "NVIDIA GeForce RTX 5090"
    deploy.gpu_count: 1
    engine.llm.sglang.image: ["", "lmsysorg/sglang:v0.5.9"]
Named Fields → CLI Flags
| Recipe YAML key | vLLM CLI flag | SGLang CLI flag |
|---------------------------|----------------------------|--------------------------|
| tensor_parallel_size | --tensor-parallel-size | --tp |
| pipeline_parallel_size | --pipeline-parallel-size | --pp |
| data_parallel_size | --data-parallel-size | --dp |
| gpu_memory_utilization | --gpu-memory-utilization | --mem-fraction-static |
| context_length | --max-model-len | --context-length |
| max_concurrent_requests | --max-num-seqs | --max-running-requests |
These flags must not appear in extra_args — load_recipe() validates this and raises an error on duplicates.
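The table and the duplicate check can be sketched together. This is not the actual load_recipe() code; render_flags is a hypothetical helper, and the duplicate test here is a simple substring check:

```python
# Named-field → CLI-flag mapping from the table above.
FLAG_MAP = {
    "tensor_parallel_size":    {"vllm": "--tensor-parallel-size", "sglang": "--tp"},
    "pipeline_parallel_size":  {"vllm": "--pipeline-parallel-size", "sglang": "--pp"},
    "data_parallel_size":      {"vllm": "--data-parallel-size", "sglang": "--dp"},
    "gpu_memory_utilization":  {"vllm": "--gpu-memory-utilization", "sglang": "--mem-fraction-static"},
    "context_length":          {"vllm": "--max-model-len", "sglang": "--context-length"},
    "max_concurrent_requests": {"vllm": "--max-num-seqs", "sglang": "--max-running-requests"},
}

def render_flags(llm: dict, engine: str) -> list[str]:
    """Render named fields to engine flags, rejecting duplicates in extra_args."""
    extra = llm.get("extra_args", "")
    flags = []
    for field, per_engine in FLAG_MAP.items():
        if field not in llm:
            continue
        flag = per_engine[engine]
        if flag in extra:
            raise ValueError(f"{flag} duplicated in extra_args; set {field} instead")
        flags += [flag, str(llm[field])]
    return flags + extra.split()

render_flags({"tensor_parallel_size": 8, "context_length": 16384}, "vllm")
# → ['--tensor-parallel-size', '8', '--max-model-len', '16384']
```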
Command Recipes (Generic Workload)
A recipe may declare a command block instead of engine.llm to run an arbitrary tool on the provisioned VM (e.g. a microbenchmark, a profiling sweep, or nvidia-smi). The harness expands the matrix, renders the command template per variant, runs it on the VM, and pulls back result files.
command:
  stage: ["scripts"]   # repo paths to ship to the VM; empty = no staging
  run: |
    nvidia-smi --query-gpu=name,memory.used --format=csv > $task_dir/result.csv
    echo "marker,$marker" >> $task_dir/result.csv
  result_files:        # filenames or shell globs (expanded on the remote)
    - result.csv
    - "*.log"
  timeout: 60

matrices:
  deploy.gpu: "NVIDIA GeForce RTX 5090"
  deploy.gpu_count: 1
  marker: [a, b, c]
The run template uses string.Template $var syntax. Substitution variables are the variant params (flattened to leaf names — deploy.gpu → gpu, marker → marker) plus harness-injected $task_dir, $gpu_device_ids, and $repo_dir (when staging is configured). command and engine.llm are mutually exclusive.
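A sketch of the rendering step, assuming the harness flattens dotted keys to their leaf names before substitution (render_command is an illustrative name, not the harness API):

```python
from string import Template

def render_command(run_template, variant, task_dir, gpu_device_ids, repo_dir=None):
    """Flatten dotted params to leaf names and substitute into the template."""
    subs = {key.rsplit(".", 1)[-1]: str(val) for key, val in variant.items()}
    subs.update(task_dir=task_dir, gpu_device_ids=gpu_device_ids)
    if repo_dir is not None:  # only injected when staging is configured
        subs["repo_dir"] = repo_dir
    return Template(run_template).substitute(subs)

render_command("echo marker,$marker >> $task_dir/result.csv",
               {"deploy.gpu": "RTX 5090", "marker": "a"},
               task_dir="/tmp/task0", gpu_device_ids="0")
# → 'echo marker,a >> /tmp/task0/result.csv'
```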
Staging uses git ls-files --cached --others --exclude-standard <paths> so unversioned edits ride along without a commit, while gitignored files are excluded. Each pulled result file lands in the run directory as {variant}_{basename}.
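The destination naming is a simple join; a sketch (result_dest is an illustrative name):

```python
from pathlib import PurePosixPath

def result_dest(variant_name: str, remote_path: str) -> str:
    """Name a pulled result file as {variant}_{basename} in the run directory."""
    return f"{variant_name}_{PurePosixPath(remote_path).name}"

result_dest("rtx5090_marker-a", "/work/task0/result.csv")
# → 'rtx5090_marker-a_result.csv'
```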
Aggregate Post-Processing
A recipe may optionally declare an aggregate block that runs locally after all variants complete. Useful for combining per-variant results into comparison tables.
aggregate:
  run: |