LXM3: XManager launch backend for HPC clusters
<img src="https://raw.githubusercontent.com/ethanluoyc/lxm3/main/docs/logo.png" alt="logo created by GPT-4" style="width:400px;"/>

lxm3 provides an implementation of DeepMind's XManager launch API that aims to offer a similar experience for running experiments on traditional HPC clusters. It provides a local execution backend and supports the SGE and Slurm schedulers.
Installation
To run on a cluster, you should install Singularity and rsync before using lxm3. It may be possible to run on a cluster without Singularity, but that path has not been thoroughly tested.
You can install lxm3 from PyPI by running:

```shell
pip install lxm3
```
You can also install from GitHub for the latest features:

```shell
# Consider pinning to a specific commit/tag.
pip install git+https://github.com/ethanluoyc/lxm3
```
Prerequisites
Set up configuration file (required)
You should create a configuration file to set up credentials and storage locations for your cluster. The configuration file is also required to specify the storage location for the local executor.
Create a configuration file at $XDG_CONFIG_HOME/lxm3/config.toml (defaults to ~/.config/lxm3/config.toml) with the following content:
```toml
# Configuration for running in local mode.
[local]
[local.storage]
# Directory where lxm3 stages local artifacts.
staging = "~/.cache/lxm3"

# Configuration for running on clusters. Omit if you are only using local mode.
[[clusters]]
# Set a name for this cluster, e.g., "cs".
name = "<TODO>"
# Replace with the server you normally ssh into for this cluster, e.g., "beaker.cs.ucl.ac.uk".
server = "<TODO>"
# Fill in the username you use for this cluster.
user = "<TODO>"

# Uncomment and update the line below if you would like to use a private key file for ssh.
# ssh_private_key = "~/.ssh/<private key name>"

# Uncomment and update the line below if you would like to use a password for ssh.
# password = "<password>"

# Uncomment and update the line below if you need to connect to the cluster
# via a jump server. This corresponds to the ProxyCommand option in ssh_config.
# proxycommand = ""

[clusters.storage]
# Replace with the path to a staging directory on the cluster. lxm3 uses this
# directory for storing all files required to run your job.
# This should be an absolute path and should not be a symlink.
staging = "<absolute path to your home directory>/lxm3-staging"
```
Install Singularity/Apptainer (optional)
If you use the SingularityContainer executable, you should install Singularity/Apptainer on your machine; your HPC cluster should have it installed as well. Follow the instructions on the Singularity or Apptainer website to install it. Currently, lxm3 supports Apptainer via the singularity symlink.
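Because lxm3 relies on the singularity symlink, a quick way to verify your setup is to check that the command is on your PATH. A trivial sketch using only the standard library (the function name is ours, not lxm3's):

```python
import shutil


def singularity_available() -> bool:
    """Return True if a `singularity` executable (possibly the Apptainer symlink) is on PATH."""
    return shutil.which("singularity") is not None
```

The shell equivalent is simply `which singularity`.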
Install Docker (optional)
We recommend installing Docker even if you are using Singularity/Apptainer.
This allows you to use Docker's build cache to speed up the build process. An experimental
DockerContainer executable is also provided for running jobs with Docker.
Writing lxm3 launch scripts
At a high level, you can launch an experiment by creating a launch script
called launcher.py that looks like:
```python
from lxm3 import xm
from lxm3 import xm_cluster

# Create an experiment and acquire its context
with xm_cluster.create_experiment(experiment_title="hello world") as experiment:
    # Define a specification for the executable you want to run
    spec = xm_cluster.PythonPackage(
        path=".",
        entrypoint=xm_cluster.ModuleName("my_package.main"),
    )

    # Define an executor for the executable.
    # To launch locally:
    executor = xm_cluster.Local()
    # or, if you want to use SGE:
    # executor = xm_cluster.GridEngine()
    # or, if you want to run on a Slurm cluster:
    # executor = xm_cluster.Slurm()

    # Package your code
    [executable] = experiment.package(
        [xm.Packageable(spec, executor_spec=executor.Spec())]
    )

    # Add jobs to your experiment
    experiment.add(
        xm.Job(executable=executable, executor=executor)
    )
```
and launch the experiment from the command line with:

```shell
lxm3 launch launcher.py
```
Many things happen under the hood. Since lxm3 implements the XManager API, you should get familiar with the concepts in XManager. Once you are, check out the examples/ directory for a quick-start guide.
Components
lxm3 provides the following executable specifications and executors.
Executable specifications
| Name | Description |
| ----------- | ----------- |
| lxm3.xm_cluster.PythonPackage | A Python application packageable with pip |
| lxm3.xm_cluster.UniversalPackage | A universal package |
| lxm3.xm_cluster.SingularityContainer | An executable running in a Singularity container |
Executors
| Name | Description |
| ----------- | ----------- |
| lxm3.xm_cluster.Local | Runs an executable locally, mainly used for testing |
| lxm3.xm_cluster.GridEngine | Runs an executable on an SGE cluster |
| lxm3.xm_cluster.Slurm | Runs an executable on a Slurm cluster |
Jobs
- Currently, only `xm.Job` and `xm.JobGenerator` that generates `xm.Job` are supported.
- We support HPC array jobs via `xm_cluster.ArrayJob`. See below.
Implementation Details
Managing Dependencies with Containers
lxm3 uses Singularity containers for running jobs on HPC clusters.
lxm3 aims to provide an easy workflow for launching jobs on traditional HPC clusters, which deviates from typical workflows for launching experiments on cloud platforms.
lxm3 is designed for working with containerized applications using Singularity as the runtime. Singularity is a popular choice for HPC clusters because it allows users to run containers without requiring root privileges, and is supported by many HPC clusters worldwide.
There are many benefits to using containers for running jobs on HPCs compared to traditional isolation via venv or conda:

- `venv` and `conda` environments are laid out as a directory of files. On many HPCs these are installed on a networked filesystem such as NFS, so operations on them are slow and inefficient. For example, on our cluster, removing a `conda` environment with many dependencies can take an hour when the environment lives on NFS. Quotas are usually in place not only for file sizes but also for the number of files; for ML projects that depend on many (large) packages such as TensorFlow and PyTorch, it is very easy to hit the quota limit. A Singularity container is a single file, which is easy to deploy and avoids the file-count quota.
- Containers provide a consistent environment for running jobs on different clusters and make it easy to use system dependencies not installed in the HPC's host environment.
Automated Deployment.
HPC deployments normally use filesystems that are detached from the user's workstation. Many tutorials for running jobs on HPCs ask users either to clone their repository on the login node or to manually copy files to the cluster. Doing this repeatedly is tedious. lxm3 automates deployment from your workstation to the HPC cluster so that you can do most of your work locally without having to log in to the cluster directly.
Unlike Docker or other OCI images, which are composed of multiple layers,
the Singularity Image Format (SIF) used by Singularity is a single file that contains the entire filesystem of the container. While this is convenient, since deployment to a remote cluster can be performed with a single scp/rsync command, the lack of layer caching/sharing makes repeated deployments slow and inefficient.
For this reason, unlike typical cloud deployments where the application and dependencies are packaged into a single image, lxm3 uses a two-stage packaging process to separate the application and dependencies.
This allows applications with heavy dependencies to be packaged once and reused across multiple experiments by reusing the same singularity container.
For Python applications, we rely on the user to first build a runtime image containing all of the dependencies, and then use standard Python packaging tools to create a distribution that is deployed separately to the cluster.
Concretely, the user is expected to create a simple pyproject.toml which describes how to create a distribution for their applications.
This is convenient, as lxm3 does not have to invent a custom packaging format for Python applications. For example, a simple pyproject.toml that uses hatchling as the build backend looks like:
```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "py_package"
authors = [{ name = "Joe" }]
version = "0.1.0"
```
lxm3 uses `pip install --no-deps` to create a zip archive that contains all of your application code.
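Conceptually, that archive is just a zip of the installed package tree. A rough sketch of the zipping step, assuming the package was first installed into a staging directory with `pip install --no-deps --target` (the function name is illustrative, not lxm3's API):

```python
import zipfile
from pathlib import Path


def archive_package(site_dir: Path, output: Path) -> Path:
    """Zip everything under site_dir (e.g. a pip --target directory) into output."""
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as zf:
        # Sort for a deterministic archive; store paths relative to site_dir
        # so the zip unpacks directly onto sys.path.
        for path in sorted(site_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(site_dir))
    return output
```

Because the archive contains only your (small) application code, it can be rebuilt and redeployed on every launch while the heavy dependency image stays untouched.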
In addition to the packaging, lxm3 also allows you to carry extra files for