Backend.AI
Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, Rebellions, FuriosaAI, HyperAccel, Google TPU, Graphcore IPU and other NPUs.
It allocates and isolates the underlying computing resources for multi-tenant computation sessions on-demand or in batches with customizable job schedulers with its own orchestrator named "Sokovan".
All its functions are exposed as REST and GraphQL APIs.
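As a sketch of how an API client authenticates against those endpoints, here is a minimal HMAC-based request-signing helper in the spirit of keypair authentication. The canonical-string layout and the date format below are illustrative assumptions, not the documented wire format; consult the API reference for the real signing scheme.

```python
import hashlib
import hmac


def sign_request(secret_key: str, method: str, path: str, date: str) -> str:
    """Compute a hex HMAC-SHA256 signature over a canonical request string.

    The canonical layout (method\\npath\\ndate) is an illustrative assumption;
    the actual Backend.AI scheme is defined in the API documentation.
    """
    canonical = f"{method.upper()}\n{path}\n{date}"
    return hmac.new(
        secret_key.encode(), canonical.encode(), hashlib.sha256
    ).hexdigest()


sig = sign_request("my-secret", "GET", "/v4/sessions", "20250101")
print(len(sig))  # 64 (hex-encoded SHA-256 digest)
```

The key point is that the secret key never travels over the wire; only the signature does, so the server can recompute and compare it.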
Requirements
Python & Build Tools
- Python: 3.13.x (main branch requires CPython 3.13.7)
- Pantsbuild: 2.27.x
- See full version compatibility table
Infrastructure
Required:
- Docker 20.10+ (with Compose v2)
- PostgreSQL 16+ (tested with 16.3)
- Redis 7.2+ (tested with 7.2.11)
- etcd 3.5+ (tested with 3.5.14)
- Prometheus 3.x (tested with 3.1.0)
Recommended (for observability):
- Grafana 11.x (tested with 11.4.0)
- Loki 3.x (tested with 3.5.0)
- Tempo 2.x (tested with 2.7.2)
- OpenTelemetry Collector
→ Detailed infrastructure setup: Infrastructure Documentation
System
- OS: Linux (Debian/RHEL-based) or macOS
- Permissions: sudo access for installation
- Resources: 4+ CPU cores, 8GB+ RAM recommended for development
Getting Started
Quick Start (Development)
1. Clone and Install
git clone https://github.com/lablup/backend.ai.git
cd backend.ai
./scripts/install-dev.sh
This script will:
- Check required dependencies (Docker, Python, etc.)
- Set up Python virtual environment with Pantsbuild
- Start halfstack infrastructure (PostgreSQL, Redis, etcd, Grafana, etc.)
- Initialize database schemas
- Create default API keypairs and user accounts
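Before running the script, you can sanity-check that the tools it depends on are on your PATH. A minimal sketch (the tool list is an illustrative subset of what the dependency check above looks for):

```python
import shutil


def missing_tools(tools: list[str]) -> list[str]:
    """Return the subset of the given tools not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


# Illustrative subset of what install-dev.sh expects to find.
required = ["docker", "git", "python3"]
for tool in missing_tools(required):
    print(f"missing required tool: {tool}")
```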
2. Start Backend.AI Services
Start each component in separate terminals:
Manager (Terminal 1):
./backend.ai mgr start-server --debug
Agent (Terminal 2):
./backend.ai ag start-server --debug
Storage Proxy (Terminal 3):
./py -m ai.backend.storage.server
Web Server (Terminal 4):
./py -m ai.backend.web.server
App Proxy (Terminals 5 and 6, optional, for in-container service access):
./backend.ai app-proxy-coordinator start-server --debug
./backend.ai app-proxy-worker start-server --debug
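Since each service listens on its own TCP port, a small readiness probe can tell you when the stack is up. Only the web server's port 8090 is stated in this README; any other port you add to the table below is an assumption to be checked against your configs:

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Port taken from this README's dev setup; extend with your configs' values.
services = {"webserver": 8090}
for name, port in services.items():
    state = "up" if is_listening("127.0.0.1", port) else "down"
    print(f"{name}: {state}")
```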
3. Run Your First Session
Set up client environment:
source env-local-user-session.sh
# This script prints your default User ID and Password;
./backend.ai login
# When prompted, enter the User ID and Password shown above.
Run a simple Python session:
./backend.ai run python -c "print('Hello Backend.AI!')"
Or access the Web UI at http://localhost:8090 with the credentials from the env-local-*.sh files.
Accessing Compute Sessions (aka Kernels)
Backend.AI provides websocket tunneling into individual computation sessions (containers), so that users can use their browsers and client CLI to access in-container applications directly in a secure way.
- Jupyter: data scientists' favorite tool
- Most container images have intrinsic Jupyter and JupyterLab support.
- Web-based terminal
- All container sessions have intrinsic ttyd support.
- SSH
- All container sessions have intrinsic SSH/SFTP/SCP support with an auto-generated per-user SSH keypair. PyCharm and other IDEs can attach to on-demand sessions via SSH remote interpreters.
- VSCode
- Most container sessions have intrinsic web-based VSCode support.
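Conceptually, the tunnel behaves like a local relay: the client listens on a local port and forwards bytes to the in-container app. The real path goes through the App Proxy over authenticated websockets; this plain-TCP sketch only illustrates the forwarding idea:

```python
import asyncio


async def relay(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes one way until EOF, then half-close the destination."""
    while data := await reader.read(4096):
        writer.write(data)
        await writer.drain()
    if writer.can_write_eof():
        writer.write_eof()


async def handle_client(
    local_reader: asyncio.StreamReader,
    local_writer: asyncio.StreamWriter,
    target_host: str,
    target_port: int,
) -> None:
    """Bridge one local connection to the target (in-container) service."""
    remote_reader, remote_writer = await asyncio.open_connection(target_host, target_port)
    try:
        await asyncio.gather(
            relay(local_reader, remote_writer),   # local client -> app
            relay(remote_reader, local_writer),   # app -> local client
        )
    finally:
        remote_writer.close()
        local_writer.close()


async def serve_tunnel(
    local_port: int, target_host: str, target_port: int
) -> asyncio.AbstractServer:
    """Listen on a local port and forward every connection to the target."""
    return await asyncio.start_server(
        lambda r, w: handle_client(r, w, target_host, target_port),
        "127.0.0.1",
        local_port,
    )
```

In the actual system the "remote" leg is a websocket authenticated with your API credentials, which is what makes the access secure across untrusted networks.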
Working with Storage
Backend.AI provides an abstraction layer on top of existing network-based storage (e.g., NFS/SMB), called vfolders (virtual folders). Each vfolder works like cloud storage: it can be mounted into any computation session and shared between users and user groups with differentiated privileges.
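The sharing model can be pictured as a folder carrying per-user permission grants. A toy sketch of the idea follows; the class and field names are illustrative, not the actual vfolder schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    READ_ONLY = "ro"
    READ_WRITE = "rw"


@dataclass
class VFolder:
    """Toy model of a virtual folder shared with differentiated privileges."""
    name: str
    owner: str
    grants: dict[str, Permission] = field(default_factory=dict)

    def share(self, user: str, perm: Permission) -> None:
        """Grant (or update) a user's permission on this folder."""
        self.grants[user] = perm

    def can_write(self, user: str) -> bool:
        """Owners always write; invitees need an explicit rw grant."""
        if user == self.owner:
            return True
        return self.grants.get(user) is Permission.READ_WRITE


vf = VFolder(name="datasets", owner="alice")
vf.share("bob", Permission.READ_ONLY)
print(vf.can_write("alice"), vf.can_write("bob"))  # True False
```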
Installation for Multi-node Tests & Production
Please consult our documentation for community-supported materials. Contact the sales team (contact@lablup.com) for professional paid support and deployment options.
Architecture
For comprehensive system architecture, component interactions, and infrastructure details, see:
Component Architecture Documentation
This document covers:
- System architecture diagrams and component flow
- Port numbers and infrastructure setup
- Component dependencies and communication protocols
- Development and production environment configuration
Contents in This Repository
This repository contains all open-source server-side components and the client SDK for Python as a reference implementation of API clients.
Directory Structure
- src/ai/backend/: Source codes
  - manager/: Manager as the cluster control-plane
    - manager/api: Manager API handlers
  - account_manager/: Unified user profile and SSO management
  - agent/: Agent as per-node controller
    - agent/docker/: Agent's Docker backend
    - agent/k8s/: Agent's Kubernetes backend
    - agent/dummy/: Agent's dummy backend
  - kernel/: Agent's kernel runner counterpart
  - runner/: Agent's in-kernel prebuilt binaries
  - helpers/: Agent's in-kernel helper package
  - common/: Shared utilities
  - client/: Client SDK
  - cli/: Unified CLI for all components
  - install/: SCIE-based TUI installer
  - storage/: Storage proxy for offloading storage operations
    - storage/api: Storage proxy's manager-facing and client-facing APIs
  - appproxy/: App proxy for accessing container apps from outside
    - appproxy/coordinator: App proxy coordinator who provisions routing circuits
    - appproxy/worker: App proxy worker who forwards the traffic
  - web/: Web UI server
    - static/: Backend.AI WebUI release artifacts
  - logging/: Logging subsystem
  - plugin/: Plugin subsystem
  - test/: Integration test suite
  - testutils/: Shared utilities used by unit tests
  - meta/: Legacy meta package
  - accelerator/: Intrinsic accelerator plugins
- docs/: Unified documentation
- tests/manager/, agent/, ...: Per-component unit tests
- configs/manager/, agent/, ...: Per-component sample configurations
- docker/: Dockerfiles for auxiliary containers
- fixtures/manager/, ...: Per-component fixtures for development setup and tests
- plugins/: A directory to place plugins such as accelerators, monitors, etc.
- scripts/: Scripts to assist development workflows
  - install-dev.sh: The single-node development setup script from the working copy
- stubs/: Type annotation stub packages written by us
- tools/: A directory to host Pants-related tooling
- dist/: A directory to put build artifacts (.whl files) and Pants-exported virtualenvs
- changes/: News fragments for towncrier
- pants.toml: The Pants configuration
- pyproject.toml: Tooling configuration (towncrier, pytest, mypy)
- BUILD: The root build config file
- **/BUILD: Per-directory build config files
- BUILD_ROOT: An indicator to mark the build root directory for Pants
- CLAUDE.md: The steering guide for agent-assisted development
- requirements.txt: The unified requirements file
- *.lock, tools/*.lock: The dependency lock files
- docker-compose.*.yml: Per-version recommended halfstack container configs
- README.md: This file
- MIGRATION.md: The migration guide for updating between major releases
- VERSION: The unified version declaration
License
Server-side components are licensed under LGPLv3 to promote non-proprietary open innovation in the open-source community, while other shared libraries and client SDKs are distributed under the MIT license.
There is no obligation to open the source code of your service or system if you just run the server-side components as-is (e.g., run them as daemons or import the components without modification in your code). Please contact us (contact@lablup.com) for commercial consulting and more licensing details/options for individual use-cases.
Major Components
Backend.AI consists of the following core components:
Server-Side Components
Manager - Central API gateway and orchestrator
- Routes REST/GraphQL requests and orchestrates cluster operations
- Session scheduling via Sokovan orchestrator
- User authentication and RBAC authorization
- Plugin interfaces: backendai_scheduler_v10, backendai_agentselector_v10, backendai_hook_v20, backendai_webapp_v20, backendai_monitor_stats_v10, backendai_monitor_error_v10
- Legacy repo: https://github.com/lablup/backend.ai-manager
Agent - Kernel lifecycle management on compute nodes
- Manages Docker containers (kernels) on individual nodes
- Self-registers to cluster via heartbeats
- Plugin interfaces: backendai_accelerator_v21, backendai_monitor_stats_v10, backendai_monitor_error_v10
- Legacy repo: https://github.com/lablup/backend.ai-agent
Storage Proxy - Virtual folder and storage backend abstraction
- Unified interface for multiple storage backends
- Real-time performance metrics and acceleration APIs
- Legacy repo: https://github.com/lablup/backend.ai-storage-proxy
Webserver - Web UI hosting and session management
- Hosts Backend.AI WebUI (SPA)
- Session management and API request signing
- Legacy repo: https://github.co