# Compute Energy & Emissions Monitoring Stack (CEEMS)
Compute Energy & Emissions Monitoring Stack (CEEMS) (pronounced kiːms) contains a Prometheus exporter that exports metrics of compute units and a REST API server that serves the metadata and aggregated metrics of each compute unit. Optionally, it includes a TSDB load balancer that enforces basic access control on the TSDB so that one user cannot access the metrics of another user.

"Compute unit" has a wide scope in this context: it can be a batch job in HPC, a VM in a cloud, a pod in k8s, _etc_. The main objective of the repository is to quantify the energy consumed, and estimate the emissions produced, by each compute unit. The repository does not provide any frontend apps to show dashboards itself; it is meant to be used along with Grafana and Prometheus to present statistics to users.
Although CEEMS was born out of a need to monitor the energy and carbon footprint of compute workloads, it supports monitoring performance metrics as well. In addition, it leverages the eBPF framework to monitor IO and network metrics in a resource-manager-agnostic way, and it supports eBPF-based, zero-instrumentation continuous profiling of compute units.
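Once the exporter runs on a compute node, Prometheus scrapes it like any other exporter. A minimal scrape configuration might look like the following sketch; the job name, target hostname and port are illustrative assumptions, so use whatever your deployment actually exposes:

```yaml
# Hypothetical Prometheus scrape config for the CEEMS exporter.
# The target host and port below are assumptions, not CEEMS defaults.
scrape_configs:
  - job_name: "ceems"
    static_configs:
      - targets: ["compute-node-1:9010"]
```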
## 🎯 Features
- Monitors energy, performance, IO and network metrics for different types of resource managers (SLURM, Openstack, k8s)
- Supports different energy sources like RAPL, HWMON, Cray's PM counters and BMCs via IPMI or Redfish
- Supports NVIDIA GPUs (MIG, time sharing, MPS and vGPU) and AMD GPUs (partition modes like CPX, QPX, TPX, DPX)
- Supports zero-instrumentation, eBPF-based continuous profiling using Grafana Pyroscope as the backend
- Real-time access to metrics via Grafana dashboards or a simple CLI tool
- Multi-tenancy and access control for Prometheus and Pyroscope datasources in Grafana
- Stores aggregated metrics in a separate DB that can be retained for a long time
- CEEMS apps are capability-aware
## ⚙️ Install CEEMS

> [!WARNING]
> DO NOT USE pre-release versions, as the API has changed considerably between the pre-release and stable versions.

Installation instructions for the CEEMS components can be found in the documentation.
## 📽️ Demo
<p><a href="https://ceems-demo.myaddr.tools" target="_blank"> <img src="https://raw.githubusercontent.com/ceems-dev/ceems/main/website/static/img/dashboards/demo_screenshot.png" alt="Access Demo"> </a></p>

Openstack and SLURM have been deployed on a small cloud instance and are monitored using CEEMS. As neither RAPL nor IPMI readings are available on cloud instances, energy consumption is estimated from an assumed Thermal Design Power (TDP) value and the current usage of the instance. Several dashboards have been created in Grafana for visualizing the metrics:
- Overall usage of the cluster
- Usage by different projects/accounts in SLURM and Openstack
- Usage of Openstack resources by a given user and project
- Usage of SLURM resources by a given user and project
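The TDP-based estimation used in the demo can be sketched as follows. This illustrates the idea only, not CEEMS's exact formula; the function name and the 150 W TDP figure are made up for the example:

```python
# Sketch of TDP-based energy estimation, used when RAPL/IPMI readings
# are unavailable. Assumption: power scales linearly with utilisation.

def estimated_energy_kwh(tdp_watts: float, utilisation: float, seconds: float) -> float:
    """Estimate energy as TDP scaled by utilisation over an interval."""
    power_watts = tdp_watts * utilisation        # e.g. 150 W TDP at 50% usage -> 75 W
    return power_watts * seconds / 3_600_000     # convert W*s to kWh

# A compute unit at 50% utilisation on a 150 W TDP CPU for one hour:
print(estimated_energy_kwh(150, 0.5, 3600))  # 0.075 kWh
```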
> [!WARNING]
> The dashboards in the demo instance are meant for demonstration purposes only. They should not be used in production without properly protecting the datasources.
## Visualizing metrics with Grafana

Grafana can be used to visualize the metrics; below are some screenshots of the dashboards.

### Time series compute unit CPU metrics
<p align="center"> <img src="https://raw.githubusercontent.com/ceems-dev/ceems/main/website/static/img/dashboards/cpu_ts_stats.png" width="1200"> </p>

### Time series compute unit GPU metrics

<p align="center"> <img src="https://raw.githubusercontent.com/ceems-dev/ceems/main/website/static/img/dashboards/gpu_ts_stats.png" width="1200"> </p>

### List of compute units of a user with aggregate metrics

<p align="center"> <img src="https://raw.githubusercontent.com/ceems-dev/ceems/main/website/static/img/dashboards/job_list_user.png" width="1200"> </p>

### Aggregate usage metrics of a user

<p align="center"> <img src="https://raw.githubusercontent.com/ceems-dev/ceems/main/website/static/img/dashboards/agg.png" width="1200"> </p>

### Aggregate usage metrics of a project

<p align="center"> <img src="https://raw.githubusercontent.com/ceems-dev/ceems/main/website/static/img/dashboards/agg_proj.png" width="1200"> </p>

### Energy usage breakdown between project members

<p align="center"> <img src="https://raw.githubusercontent.com/ceems-dev/ceems/main/website/static/img/dashboards/breakdown.png" width="1200"> </p>

## Usage metrics via CLI tool
CEEMS ships a CLI tool, `cacct`, that presents usage metrics to end users in deployments where using Grafana is not possible or is prohibitive.
```
$ cacct --starttime="2025-01-01" --endtime="2025-03-22"
┌────────┬─────────┬──────────┬──────────────┬───────────────────┬──────────────────┬──────────────────────────────────────┬──────────────┬───────────────────┬─────────────────┬──────────────────────────────────────┐
│ JOB ID │ ACCOUNT │ ELAPSED  │ CPU USAGE(%) │ CPU MEM. USAGE(%) │ HOST ENERGY(KWH) │ HOST EMISSIONS(GMS)                  │ GPU USAGE(%) │ GPU MEM. USAGE(%) │ GPU ENERGY(KWH) │ GPU EMISSIONS(GMS)                   │
│        │         │          │              │                   │                  │ EMAPS_TOTAL │ OWID_TOTAL │ RTE_TOTAL │              │                   │                 │ EMAPS_TOTAL │ OWID_TOTAL │ RTE_TOTAL │
├────────┼─────────┼──────────┼──────────────┼───────────────────┼──────────────────┼─────────────┼────────────┼───────────┼──────────────┼───────────────────┼─────────────────┼─────────────┼────────────┼───────────┤
│ 106    │ bedrock │ 00:10:05 │ 99.32        │ 3.39              │ 0.053818         │ 4.725182    │ 5.648855   │ 3.860008  │              │                   │                 │             │            │           │
│ 108    │ …
```
