
Deplodock

Benchmark and deploy optimized LLMs on GPU servers with vLLM or SGLang. Choose from a list of optimized recipes for popular models or create your own with custom configurations. Run benchmarks across different GPU types and configurations, track results, and share experiments with the community.

Install / Use

/learn @cloudrift-ai/Deplodock
About this skill

Quality Score: 0/100
Category: Operations
Supported Platforms: Zed
README

<p align="center"> <img src="logo.png" alt="DeploDock" width="300"> </p> <p align="center"> <a href="https://pypi.org/project/deplodock/"><img src="https://img.shields.io/pypi/v/deplodock" alt="PyPI"></a> <a href="https://github.com/cloudrift-ai/deplodock/actions/workflows/tests.yml"><img src="https://github.com/cloudrift-ai/deplodock/actions/workflows/tests.yml/badge.svg" alt="Tests"></a> <a href="https://discord.gg/cloudrift"><img src="https://img.shields.io/discord/1150997934113030174?label=Discord" alt="Discord"></a> </p>


Project Structure

Quick Start

Install

git clone https://github.com/cloudrift-ai/deplodock.git
cd deplodock
make setup

Deploy a Model

deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --ssh user@host

Deploy Locally

deplodock deploy local \
  --recipe recipes/Qwen3-Coder-30B-A3B-Instruct-AWQ

Teardown

deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --ssh user@host \
  --teardown

Dry Run

Preview commands without executing:

deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --ssh user@host \
  --dry-run

Recipes

Recipes are declarative YAML configs in recipes/<model>/recipe.yaml. Each recipe defines a model, engine settings, and a matrices section for benchmark configurations.

Format

model:
  huggingface: "org/model-name"

engine:
  llm:
    tensor_parallel_size: 8
    pipeline_parallel_size: 1
    gpu_memory_utilization: 0.9
    context_length: 16384
    max_concurrent_requests: 512
    vllm:
      image: "vllm/vllm-openai:v0.17.0"
      extra_args: "--kv-cache-dtype fp8"    # Flags not covered by named fields

benchmark:
  max_concurrency: 128
  num_prompts: 256
  random_input_len: 8000
  random_output_len: 8000

# Simple single-point entry (implicit zip)
matrices:
  deploy.gpu: "NVIDIA H200 141GB"
  deploy.gpu_count: 8

Cross-Product and Zip Combinators

The matrices section supports two combinators for generating benchmark variants:

  • cross: Cartesian product of all list-valued axes. Scalars are broadcast.
  • zip: Element-wise pairing of equal-length lists. Scalars are broadcast.

A plain matrices dict (no cross/zip key) is an implicit zip.

# Cross-product: 3 GPUs × 2 configs = 6 variants
matrices:
  cross:
    deploy.gpu_count: 1                    # scalar → broadcast
    deploy.gpu:                            # list → cross-product axis
      - "NVIDIA GeForce RTX 5090"
      - "NVIDIA H100 80GB"
      - "NVIDIA H200 141GB"
    zip:                                   # zip sub-dict → one compound axis
      engine.llm.max_concurrent_requests: [128, 512]
      benchmark.max_concurrency: [128, 512]
# Concurrency sweep (zip: 8 runs from one entry)
matrices:
  deploy.gpu: "NVIDIA GeForce RTX 5090"
  engine.llm.max_concurrent_requests: [1, 2, 4, 8, 16, 32, 64, 128]
  benchmark.max_concurrency: [1, 2, 4, 8, 16, 32, 64, 128]

Within a cross node:

  • scalars broadcast
  • lists are independent axes (Cartesian product)
  • nested zip dicts bundle their lists into one compound axis

Within a zip node:

  • scalars broadcast
  • lists are zipped element-wise (all lists must be the same length)

cross/zip keys are treated as combinators only when their value is a dict.
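The expansion rules above can be sketched in Python. This is an illustrative model of the documented semantics, not the actual implementation; function names are assumptions.

```python
from itertools import product

def expand_zip(axes):
    """Zip node: scalars broadcast, lists pair element-wise (equal lengths)."""
    lists = {k: v for k, v in axes.items() if isinstance(v, list)}
    scalars = {k: v for k, v in axes.items() if not isinstance(v, list)}
    if not lists:
        return [dict(scalars)]
    lengths = {len(v) for v in lists.values()}
    assert len(lengths) == 1, "zipped lists must share a length"
    n = lengths.pop()
    return [{**scalars, **{k: v[i] for k, v in lists.items()}} for i in range(n)]

def expand_cross(axes):
    """Cross node: scalars broadcast, lists are independent axes,
    a nested zip dict becomes one compound axis."""
    compound = []          # each entry: list of partial variant dicts (one axis)
    scalars = {}
    for k, v in axes.items():
        if k == "zip" and isinstance(v, dict):
            compound.append(expand_zip(v))          # compound axis
        elif isinstance(v, list):
            compound.append([{k: x} for x in v])    # independent axis
        else:
            scalars[k] = v                          # broadcast scalar
    variants = []
    for combo in (product(*compound) if compound else [()]):
        merged = dict(scalars)
        for part in combo:
            merged.update(part)
        variants.append(merged)
    return variants
```

With the cross-product example above (3 GPUs × one zipped pair of 2-lists), `expand_cross` yields 6 variants, each carrying the broadcast `deploy.gpu_count`.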

Matrix entries use dot-notation for all parameter paths. deploy.gpu is required.
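As a sketch of how a dot-notation entry maps onto the nested recipe structure (the helper name here is illustrative, not from the codebase):

```python
def set_dotted(config, path, value):
    """Write a dot-notation path like 'engine.llm.context_length'
    into a nested dict, creating intermediate levels as needed."""
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value

cfg = {}
set_dotted(cfg, "deploy.gpu", "NVIDIA H200 141GB")
set_dotted(cfg, "engine.llm.context_length", 16384)
```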

Variant Filtering

Use --filter to run a subset of variants:

deplodock bench recipes/my-recipe --filter "deploy.gpu=*5090*"
deplodock bench recipes/my-recipe --filter "deploy.gpu=*5090*" --filter "batches=1"

Multiple --filter flags use AND logic. Values are matched with fnmatch glob patterns against the expanded parameter values.
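A minimal sketch of this AND-of-globs matching, assuming each filter has the form `key=glob` and values are compared as strings with fnmatch (helper name is illustrative):

```python
from fnmatch import fnmatch

def variant_matches(variant, filters):
    """Every --filter expression must match: AND logic across flags."""
    for expr in filters:
        key, _, pattern = expr.partition("=")
        if not fnmatch(str(variant.get(key, "")), pattern):
            return False
    return True

variants = [
    {"deploy.gpu": "NVIDIA GeForce RTX 5090", "batches": 1},
    {"deploy.gpu": "NVIDIA H100 80GB", "batches": 1},
]
kept = [v for v in variants if variant_matches(v, ["deploy.gpu=*5090*"])]
```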

deploy.driver_version and deploy.cuda_version (optional) request a specific NVIDIA driver / CUDA toolkit on the target host. If the installed version already matches (prefix-match — "550" matches 550.127.05), provisioning is a no-op. On a mismatch, a remote (ssh/cloud) deploy installs the requested version, reboots the host, and waits for SSH to come back. Local deploys refuse to run privileged commands and will error out instead — these fields are intended for remote machines only.
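The prefix-match rule might look like this; treating the prefix as ending at a version-component boundary is an assumption on top of the "550 matches 550.127.05" behavior stated above:

```python
def driver_matches(requested, installed):
    """Prefix match on version components: "550" matches "550.127.05"
    and "550" itself, but not "5501.0" (assumed boundary behavior)."""
    return installed == requested or installed.startswith(requested + ".")
```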

Engine-agnostic fields (tensor_parallel_size, context_length, etc.) live at engine.llm. Engine-specific fields (image, extra_args) nest under engine.llm.vllm or engine.llm.sglang.

Docker Options

Arbitrary docker-compose service keys can be injected via engine.llm.docker_options. This is useful for GPU-specific container settings like ROCm's security options:

engine:
  llm:
    docker_options:
      security_opt:
        - seccomp=unconfined
      cap_add:
        - SYS_PTRACE

Keys already managed by the compose template (image, volumes, ports, healthcheck, etc.) are rejected at load time.
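A sketch of that load-time check; the reserved set below is assumed from the examples in the text (the real template may manage more keys):

```python
# Compose keys the template owns; docker_options may not override them.
RESERVED_COMPOSE_KEYS = {"image", "volumes", "ports", "healthcheck"}

def validate_docker_options(docker_options):
    """Reject docker_options keys that clash with template-managed keys."""
    clashes = RESERVED_COMPOSE_KEYS & set(docker_options)
    if clashes:
        raise ValueError(
            f"docker_options may not override managed keys: {sorted(clashes)}")

# The ROCm example above passes; overriding `image` would be rejected.
validate_docker_options({"security_opt": ["seccomp=unconfined"],
                         "cap_add": ["SYS_PTRACE"]})
```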

SGLang Matrix Entry Example

To benchmark with SGLang alongside vLLM, use a cross-product with the engine image:

matrices:
  cross:
    deploy.gpu: "NVIDIA GeForce RTX 5090"
    deploy.gpu_count: 1
    engine.llm.sglang.image: ["", "lmsysorg/sglang:v0.5.9"]

Named Fields → CLI Flags

| Recipe YAML key         | vLLM CLI flag              | SGLang CLI flag          |
|-------------------------|----------------------------|--------------------------|
| tensor_parallel_size    | --tensor-parallel-size     | --tp                     |
| pipeline_parallel_size  | --pipeline-parallel-size   | --pp                     |
| data_parallel_size      | --data-parallel-size       | --dp                     |
| gpu_memory_utilization  | --gpu-memory-utilization   | --mem-fraction-static    |
| context_length          | --max-model-len            | --context-length         |
| max_concurrent_requests | --max-num-seqs             | --max-running-requests   |

These flags must not appear in extra_args: load_recipe() validates this and raises an error on duplicates.
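The duplicate check could be sketched as follows; the mapping follows the table, while the function name and token-level matching are assumptions:

```python
# Named recipe fields and the CLI flags they own, per engine (from the table).
NAMED_FLAGS = {
    "vllm": {
        "tensor_parallel_size": "--tensor-parallel-size",
        "pipeline_parallel_size": "--pipeline-parallel-size",
        "data_parallel_size": "--data-parallel-size",
        "gpu_memory_utilization": "--gpu-memory-utilization",
        "context_length": "--max-model-len",
        "max_concurrent_requests": "--max-num-seqs",
    },
    "sglang": {
        "tensor_parallel_size": "--tp",
        "pipeline_parallel_size": "--pp",
        "data_parallel_size": "--dp",
        "gpu_memory_utilization": "--mem-fraction-static",
        "context_length": "--context-length",
        "max_concurrent_requests": "--max-running-requests",
    },
}

def check_extra_args(engine, extra_args):
    """Raise if extra_args repeats a flag that a named field already owns."""
    tokens = extra_args.split()
    dupes = [flag for flag in NAMED_FLAGS[engine].values() if flag in tokens]
    if dupes:
        raise ValueError(
            f"flags managed by named fields found in extra_args: {dupes}")
```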

Command Recipes (Generic Workload)

A recipe may declare a command block instead of engine.llm to run an arbitrary tool on the provisioned VM (e.g. a microbenchmark, a profiling sweep, or nvidia-smi). The harness expands the matrix, renders the command template per variant, runs it on the VM, and pulls back result files.

command:
  stage: ["scripts"]              # repo paths to ship to the VM; empty = no staging
  run: |
    nvidia-smi --query-gpu=name,memory.used --format=csv > $task_dir/result.csv
    echo "marker,$marker" >> $task_dir/result.csv
  result_files:                    # filenames or shell globs (expanded on the remote)
    - result.csv
    - "*.log"
  timeout: 60

matrices:
  deploy.gpu: "NVIDIA GeForce RTX 5090"
  deploy.gpu_count: 1
  marker: [a, b, c]

The run template uses string.Template $var syntax. Substitution variables are the variant params (flattened to leaf names — deploy.gpu → gpu, marker → marker) plus harness-injected $task_dir, $gpu_device_ids, and $repo_dir (when staging is configured). command and engine.llm are mutually exclusive.
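The rendering step can be sketched with the stdlib string.Template, as named above; the flattening rule is taken from the text, while the sample values and injected paths are illustrative:

```python
from string import Template

# Variant params as expanded from the matrix (dot-notation keys).
params = {"deploy.gpu": "NVIDIA GeForce RTX 5090", "marker": "a"}

# Flatten to leaf names: deploy.gpu -> gpu, marker -> marker.
flat = {key.rsplit(".", 1)[-1]: value for key, value in params.items()}

# Harness-injected variables (values here are placeholders).
flat.update({"task_dir": "/tmp/task", "gpu_device_ids": "0",
             "repo_dir": "/tmp/repo"})

run_template = "echo $marker > $task_dir/result.csv  # runs on $gpu"
rendered = Template(run_template).substitute(flat)
```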

Staging uses git ls-files --cached --others --exclude-standard <paths> so unversioned edits ride along without a commit, while gitignored files are excluded. Each pulled result file lands in the run directory as {variant}_{basename}.
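The local naming of pulled results might look like this; the exact variant-label format is an assumption:

```python
import os

def result_filename(variant_label, remote_path):
    """Name a pulled result file as {variant}_{basename} in the run directory."""
    return f"{variant_label}_{os.path.basename(remote_path)}"
```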

Aggregate Post-Processing

A recipe may optionally declare an aggregate block that runs locally after all variants complete. Useful for combining per-variant results into comparison tables.

aggregate:
  run: |
 
View on GitHub
GitHub Stars: 28
Forks: 4
Updated: 3h ago
Languages: Python

Security Score: 90/100
Audited on Apr 10, 2026. No findings.