Deplodock
Benchmark and deploy optimized LLMs on GPU servers with vLLM or SGLang. Choose from a list of optimized recipes for popular models or create your own with custom configurations. Run benchmarks across different GPU types and configurations, track results, and share experiments with the community.
Project Structure
- deplodock/ — Python package
  - deplodock.py — CLI entrypoint
  - logging_setup.py — CLI logging configuration
  - hardware.py — GPU specs and instance type mapping
  - detect.py — GPU detection via PCI sysfs (local and remote)
  - commands/ — CLI layer (thin argparse handlers, see ARCHITECTURE.md)
    - deploy/ — deploy local, deploy ssh, deploy cloud commands
    - bench/ — bench command
    - teardown.py — teardown command
    - vm/ — vm create/delete commands (GCP, CloudRift)
  - recipe/ — Recipe loading, dataclass types, engine flag mapping (see ARCHITECTURE.md)
  - deploy/ — Compose generation, deploy orchestration
  - provisioning/ — Cloud provisioning, SSH transport, VM lifecycle
  - benchmark/ — Benchmark tracking, config, task enumeration, execution
  - planner/ — Groups benchmark tasks into execution groups for VM allocation
- recipes/ — Model deploy recipes (YAML configs per model)
- experiments/ — Experiment parameter sweeps (self-contained recipe + results)
- docker/ — Custom Docker images (e.g., vLLM ROCm for MI350X)
- docs/ — Technical notes and engine-specific guides
  - sglang-awq-moe.md — SGLang quantization for AWQ MoE models
- tests/ — pytest tests (see ARCHITECTURE.md)
- scripts/ — Analysis and visualization scripts
- utils/ — Standalone utility scripts
- config.yaml — Benchmark configuration
- Makefile — Build automation
- pyproject.toml — Package metadata and tool config
Quick Start
Install
git clone https://github.com/cloudrift-ai/deplodock.git
cd deplodock
make setup
Deploy a Model
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --ssh user@host
Deploy Locally
deplodock deploy local \
  --recipe recipes/Qwen3-Coder-30B-A3B-Instruct-AWQ
Teardown
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --ssh user@host \
  --teardown
Dry Run
Preview commands without executing:
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --ssh user@host \
  --dry-run
Recipes
Recipes are declarative YAML configs in recipes/<model>/recipe.yaml. Each recipe defines a model, engine settings, and a matrices section for benchmark configurations.
Format
model:
  huggingface: "org/model-name"

engine:
  llm:
    tensor_parallel_size: 8
    pipeline_parallel_size: 1
    gpu_memory_utilization: 0.9
    context_length: 16384
    max_concurrent_requests: 512
    vllm:
      image: "vllm/vllm-openai:v0.17.0"
      extra_args: "--kv-cache-dtype fp8"  # Flags not covered by named fields

benchmark:
  max_concurrency: 128
  num_prompts: 256
  random_input_len: 8000
  random_output_len: 8000

# Simple single-point entry (implicit zip)
matrices:
  deploy.gpu: "NVIDIA H200 141GB"
  deploy.gpu_count: 8
Cross-Product and Zip Combinators
The matrices section supports two combinators for generating benchmark variants:
- cross: Cartesian product of all list-valued axes. Scalars are broadcast.
- zip: Element-wise pairing of equal-length lists. Scalars are broadcast.
A plain matrices dict (no cross/zip key) is an implicit zip.
# Cross-product: 3 GPUs × 2 configs = 6 variants
matrices:
  cross:
    deploy.gpu_count: 1   # scalar → broadcast
    deploy.gpu:           # list → cross-product axis
      - "NVIDIA GeForce RTX 5090"
      - "NVIDIA H100 80GB"
      - "NVIDIA H200 141GB"
    zip:                  # zip sub-dict → one compound axis
      engine.llm.max_concurrent_requests: [128, 512]
      benchmark.max_concurrency: [128, 512]

# Concurrency sweep (zip: 8 runs from one entry)
matrices:
  deploy.gpu: "NVIDIA GeForce RTX 5090"
  engine.llm.max_concurrent_requests: [1, 2, 4, 8, 16, 32, 64, 128]
  benchmark.max_concurrency: [1, 2, 4, 8, 16, 32, 64, 128]
Within a cross node: scalars broadcast, lists are independent axes (cartesian product), nested zip dicts bundle their lists into one compound axis. Within a zip node: scalars broadcast, lists are zipped element-wise (must all be the same length). cross/zip keys are only treated as combinators when their value is a dict.
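The expansion rules above can be sketched in Python. This is a simplified illustration of the documented semantics only; expand_matrices is a hypothetical helper name, not the planner's actual code, and it handles just the cases shown in the examples.

```python
from itertools import product

def expand_matrices(node, mode="zip"):
    """Expand a matrices dict into a list of variant dicts (sketch)."""
    # An explicit top-level cross:/zip: combinator wraps the real axes.
    if set(node) == {"cross"} and isinstance(node["cross"], dict):
        return expand_matrices(node["cross"], mode="cross")
    if set(node) == {"zip"} and isinstance(node["zip"], dict):
        return expand_matrices(node["zip"], mode="zip")

    axes = []  # each axis is a list of partial-variant dicts
    for key, value in node.items():
        if key == "zip" and isinstance(value, dict) and mode == "cross":
            # Nested zip inside cross: bundle its lists into one compound axis.
            axes.append(expand_matrices(value, mode="zip"))
        elif isinstance(value, list):
            axes.append([{key: v} for v in value])
        else:
            axes.append([{key: value}])  # scalar → broadcast

    if mode == "cross":
        combos = product(*axes)
    else:  # zip: all lists must share one length; scalars repeat
        length = max(len(a) for a in axes)
        assert all(len(a) in (1, length) for a in axes), "zip length mismatch"
        combos = zip(*(a * length if len(a) == 1 else a for a in axes))

    return [{k: v for part in combo for k, v in part.items()} for combo in combos]
```

Running this on the cross-product example above yields the 6 variants the comment promises; the plain concurrency-sweep dict expands to 8.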
Matrix entries use dot-notation for all parameter paths. deploy.gpu is required.
Variant Filtering
Use --filter to run a subset of variants:
deplodock bench recipes/my-recipe --filter "deploy.gpu=*5090*"
deplodock bench recipes/my-recipe --filter "deploy.gpu=*5090*" --filter "batches=1"
Multiple --filter flags use AND logic. Values are matched with fnmatch glob patterns against the expanded parameter values.
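A minimal Python sketch of this matching logic, assuming variants are flat dicts keyed by dot-notation paths (variant_matches is an illustrative name, not the real matcher):

```python
from fnmatch import fnmatch

def variant_matches(variant: dict, filters: list[str]) -> bool:
    """AND together --filter expressions, glob-matching each value."""
    for expr in filters:
        key, _, pattern = expr.partition("=")
        if not fnmatch(str(variant.get(key, "")), pattern):
            return False
    return True

variant = {"deploy.gpu": "NVIDIA GeForce RTX 5090", "deploy.gpu_count": 1}
variant_matches(variant, ["deploy.gpu=*5090*"])                           # matches
variant_matches(variant, ["deploy.gpu=*5090*", "deploy.gpu_count=2"])     # rejected
```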
deploy.driver_version and deploy.cuda_version (optional) request a specific NVIDIA driver / CUDA toolkit on the target host. If the installed version already matches (prefix-match — "550" matches 550.127.05), provisioning is a no-op. On a mismatch, a remote (ssh/cloud) deploy installs the requested version, reboots the host, and waits for SSH to come back. Local deploys refuse to run privileged commands and will error out instead — these fields are intended for remote machines only.
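The prefix-match rule can be made precise with a small sketch. Comparing dot-separated components (rather than a raw string prefix) is an assumption about the implementation, but it captures the documented example while keeping "55" from matching 550.x:

```python
def driver_matches(installed: str, requested: str) -> bool:
    """Prefix-match on dot-separated version components (illustrative)."""
    have = installed.split(".")
    want = requested.split(".")
    return have[: len(want)] == want

driver_matches("550.127.05", "550")   # match → provisioning is a no-op
driver_matches("550.127.05", "551")   # mismatch → install, reboot, wait for SSH
```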
Engine-agnostic fields (tensor_parallel_size, context_length, etc.) live at engine.llm. Engine-specific fields (image, extra_args) nest under engine.llm.vllm or engine.llm.sglang.
Docker Options
Arbitrary docker-compose service keys can be injected via engine.llm.docker_options. This is useful for GPU-specific container settings like ROCm's security options:
engine:
  llm:
    docker_options:
      security_opt:
        - seccomp=unconfined
      cap_add:
        - SYS_PTRACE
Keys already managed by the compose template (image, volumes, ports, healthcheck, etc.) are rejected at load time.
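The rejection check amounts to a set intersection. A sketch, with an illustrative list of managed keys (the real set lives next to the compose template and may differ):

```python
# Compose service keys the template owns; docker_options may not override them.
MANAGED_KEYS = {"image", "volumes", "ports", "healthcheck", "command", "environment"}

def validate_docker_options(options: dict) -> dict:
    clashes = MANAGED_KEYS & options.keys()
    if clashes:
        raise ValueError(f"docker_options may not set managed keys: {sorted(clashes)}")
    return options

validate_docker_options({"security_opt": ["seccomp=unconfined"]})  # accepted
# validate_docker_options({"image": "x"})  # rejected at load time
```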
SGLang Matrix Entry Example
To benchmark with SGLang alongside vLLM, use a cross-product with the engine image:
matrices:
  cross:
    deploy.gpu: "NVIDIA GeForce RTX 5090"
    deploy.gpu_count: 1
    engine.llm.sglang.image: ["", "lmsysorg/sglang:v0.5.9"]
Named Fields → CLI Flags
| Recipe YAML key | vLLM CLI flag | SGLang CLI flag |
|---------------------------|----------------------------|--------------------------|
| tensor_parallel_size | --tensor-parallel-size | --tp |
| pipeline_parallel_size | --pipeline-parallel-size | --pp |
| data_parallel_size | --data-parallel-size | --dp |
| gpu_memory_utilization | --gpu-memory-utilization | --mem-fraction-static |
| context_length | --max-model-len | --context-length |
| max_concurrent_requests | --max-num-seqs | --max-running-requests |
These flags must not appear in extra_args — load_recipe() validates this and raises an error on duplicates.
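The table and the duplicate check can be sketched together. This is not the actual load_recipe() code; render_flags is a hypothetical helper, and the duplicate test here is a simple substring check:

```python
# Named-field → CLI-flag mapping from the table above.
FLAG_MAP = {
    "tensor_parallel_size":    {"vllm": "--tensor-parallel-size", "sglang": "--tp"},
    "pipeline_parallel_size":  {"vllm": "--pipeline-parallel-size", "sglang": "--pp"},
    "data_parallel_size":      {"vllm": "--data-parallel-size", "sglang": "--dp"},
    "gpu_memory_utilization":  {"vllm": "--gpu-memory-utilization", "sglang": "--mem-fraction-static"},
    "context_length":          {"vllm": "--max-model-len", "sglang": "--context-length"},
    "max_concurrent_requests": {"vllm": "--max-num-seqs", "sglang": "--max-running-requests"},
}

def render_flags(llm: dict, engine: str) -> list[str]:
    """Render named fields to engine flags, rejecting duplicates in extra_args."""
    extra = llm.get("extra_args", "")
    flags = []
    for field, per_engine in FLAG_MAP.items():
        if field not in llm:
            continue
        flag = per_engine[engine]
        if flag in extra:
            raise ValueError(f"{flag} duplicated in extra_args; set {field} instead")
        flags += [flag, str(llm[field])]
    return flags + extra.split()

render_flags({"tensor_parallel_size": 8, "context_length": 16384}, "vllm")
# → ['--tensor-parallel-size', '8', '--max-model-len', '16384']
```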
Command Recipes (Generic Workload)
A recipe may declare a command block instead of engine.llm to run an arbitrary tool on the provisioned VM (e.g. a microbenchmark, a profiling sweep, or nvidia-smi). The harness expands the matrix, renders the command template per variant, runs it on the VM, and pulls back result files.
command:
  stage: ["scripts"]   # repo paths to ship to the VM; empty = no staging
  run: |
    nvidia-smi --query-gpu=name,memory.used --format=csv > $task_dir/result.csv
    echo "marker,$marker" >> $task_dir/result.csv
  result_files:        # filenames or shell globs (expanded on the remote)
    - result.csv
    - "*.log"
  timeout: 60

matrices:
  deploy.gpu: "NVIDIA GeForce RTX 5090"
  deploy.gpu_count: 1
  marker: [a, b, c]
The run template uses string.Template $var syntax. Substitution variables are the variant params (flattened to leaf names — deploy.gpu → gpu, marker → marker) plus harness-injected $task_dir, $gpu_device_ids, and $repo_dir (when staging is configured). command and engine.llm are mutually exclusive.
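A sketch of the rendering step, assuming the harness flattens dotted keys to their leaf names before substitution (render_command is an illustrative name, not the harness API):

```python
from string import Template

def render_command(run_template, variant, task_dir, gpu_device_ids, repo_dir=None):
    """Flatten dotted params to leaf names and substitute into the template."""
    subs = {key.rsplit(".", 1)[-1]: str(val) for key, val in variant.items()}
    subs.update(task_dir=task_dir, gpu_device_ids=gpu_device_ids)
    if repo_dir is not None:  # only injected when staging is configured
        subs["repo_dir"] = repo_dir
    return Template(run_template).substitute(subs)

render_command("echo marker,$marker >> $task_dir/result.csv",
               {"deploy.gpu": "RTX 5090", "marker": "a"},
               task_dir="/tmp/task0", gpu_device_ids="0")
# → 'echo marker,a >> /tmp/task0/result.csv'
```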
Staging uses git ls-files --cached --others --exclude-standard <paths> so unversioned edits ride along without a commit, while gitignored files are excluded. Each pulled result file lands in the run directory as {variant}_{basename}.
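The destination naming is a simple join; a sketch (result_dest is an illustrative name):

```python
from pathlib import PurePosixPath

def result_dest(variant_name: str, remote_path: str) -> str:
    """Name a pulled result file as {variant}_{basename} in the run directory."""
    return f"{variant_name}_{PurePosixPath(remote_path).name}"

result_dest("rtx5090_marker-a", "/work/task0/result.csv")
# → 'rtx5090_marker-a_result.csv'
```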
Aggregate Post-Processing
A recipe may optionally declare an aggregate block that runs locally after all variants complete. Useful for combining per-variant results into comparison tables.
aggregate:
  run: |