EKG (Essential Kubernetes Gauges)

Monitor the health of a Kubernetes cluster with ease, using Service Level Objectives (SLOs).

Essential Kubernetes Gauges (EKG) provides a set of standardized, prefabricated SLOs that measure the reliability of a Kubernetes cluster. You can think of these SLOs as a check engine light that tells you when your EKS cluster is misbehaving, with a historical record of when the cluster was behaving as desired and when it was not. SLOs allow you to set adjustable goals for the reliability aspects of your clusters. EKG includes SLOs that measure several aspects of a cluster:

  • Control Plane Health
    • Is the Kubernetes API responding normally?
    • Is it performant?
  • Cluster Health
    • Are the nodes healthy?
    • Is there some minimum of resource headroom?
    • Can we start new workloads?
  • Workload Health
    • Is anything in a bad state?
    • Is there at least some kind of workload running?
  • Resource Efficiency
    • Are resources underutilized in this cluster?
    • Is the cluster scaling in such a way that it is making good use of resources, without endangering workloads?
  • Cost Efficiency measurements have been proposed as a future enhancement and are under consideration.

These aspects of cluster behavior provide a gauge on how well Kubernetes (as an application) is running, as well as how the overall cluster is faring, along with some insight into the health of the workloads the cluster is supporting. By running EKG's SLOs, you can measure how well your clusters are doing over time, share numbers, charts, and reports on cluster reliability across your teams, managers, and stakeholders, and use all the features of Nobl9, including alert policies and alert integrations with a wide variety of popular tools.

For more information on the specific SLOs included in EKG and how to make use of them, please see the SLO docs.

Supported Products

While EKG's SLOs and techniques are usable with any flavor of Kubernetes, and while the underlying instrumentation frameworks are compatible with many distributions, we currently provide end-to-end automation for the following products:

  • Amazon EKS (Elastic Kubernetes Service)

Support for additional Kubernetes distributions has been proposed and is under consideration.

Architecture

diagram

Deployment of EKG with default settings on an existing EKS cluster

An Amazon Managed Service for Prometheus workspace is created (optionally), and metrics are collected from electrodes:

  • kube-state-metrics - metrics for the health of the various objects inside the cluster, such as deployments, nodes, and pods
  • node-problem-detector - metrics for the health of each node, such as infrastructure daemon issues (NTP service down), hardware issues (bad CPU, memory, or disk), kernel issues (kernel deadlock, corrupted file system), and container runtime issues (unresponsive runtime daemon)
  • Kuberhealthy - metrics for the health of the cluster; it runs synthetic tests that check that daemonsets and deployments can be deployed, that DNS resolves names, and so on

Metrics are plumbed into Amazon Managed Service for Prometheus using AWS Distro for OpenTelemetry (ADOT). ADOT also collects metrics exposed by other services, so it can be the only collector you need for all of the platform and application metrics from a cluster.

The repository gives fine-grained control over what can be deployed or reused. For instance, you can install only chosen electrodes or all of them, reuse an existing Amazon Managed Service for Prometheus workspace, or install only ADOT. For details, check the documentation of the specific modules. Electrodes can be used with any Kubernetes cluster - EKS, GKE, on-premises, etc.

To learn how to contribute, please read the contribution guidelines.

End-to-end installation of EKG on a fresh EKS cluster

  1. Prerequisites. You will need:

    • A Nobl9 Organization. If you don't already have a Nobl9 org, you can sign up for Nobl9 free edition at https://app.nobl9.com/signup/
    • Terraform. If you need to install it: https://developer.hashicorp.com/terraform/downloads
    • An AWS account, with configuration and credentials connecting it to Terraform, for example by installing AWS CLI
    • An EKS cluster. We assume you have a bunch of these, but if you want to spin up a fresh test cluster to try out EKG in isolation, how about following the steps in this tutorial? The tutorial defaults to Terraform Cloud (which is quite nice) but for this exercise we recommend you click on the Terraform OSS tabs as you proceed.
    • An IAM OIDC provider configured for the EKS cluster. The tutorial linked above does this for you, but if you are using an existing cluster, you may need to configure it manually.
  2. Create a terraform.tfvars file. A starting point can be found in terraform.tfvars.example.

    cp terraform.tfvars.example terraform.tfvars
    # edit that file with an editor of your choice
    # provide values for your AWS region, cluster name, and Nobl9 organization
    
  3. Provide required secrets to the Nobl9 Terraform Provider. In the Nobl9 web UI, go to Settings > Access Keys, create an access key (save it somewhere) and then set the values as env vars:

    export TF_VAR_nobl9_client_id="<your Nobl9 Client ID>"
    export TF_VAR_nobl9_client_secret="<your Nobl9 Client Secret>"
    
  4. Use Terraform to install the EKG components. In the root of this repository, run:

    terraform init
    terraform apply

The output of the above should look similar to:

Apply complete! Resources: 36 added, 0 changed, 0 destroyed.

Outputs:

amp_ws_endpoint = "https://aps-workspaces.us-east-2.amazonaws.com/workspaces/ws-abcdef12-3456-7890-abcd-ef1234567890/"

Congratulations! EKG is up and running. Go back to the Nobl9 web UI and explore your newly created SLOs. For more information about these SLOs, how to use them, and how to tune them to your clusters' conditions and workloads, see the SLO docs.
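Steps 3 and 4 depend on both Nobl9 variables being available in the shell session that runs Terraform. The following is a minimal preflight sketch for catching a missing one early. The function name check_nobl9_env is a hypothetical helper and is not part of EKG; it only checks that the variables are set and non-empty, and cannot tell whether they were exported.

```shell
#!/bin/sh
# Preflight sketch: verify that the Nobl9 credentials from step 3 are set
# and non-empty in the current shell before Terraform runs.
# check_nobl9_env is a hypothetical helper name, not something EKG ships.
check_nobl9_env() {
  missing=0
  if [ -z "${TF_VAR_nobl9_client_id:-}" ]; then
    echo "TF_VAR_nobl9_client_id is not set" >&2
    missing=1
  fi
  if [ -z "${TF_VAR_nobl9_client_secret:-}" ]; then
    echo "TF_VAR_nobl9_client_secret is not set" >&2
    missing=1
  fi
  # Returns 0 when both variables are present, 1 otherwise.
  return "$missing"
}

# Possible usage from the repository root (left commented out so that
# sourcing this file has no side effects):
#   check_nobl9_env && terraform init && terraform apply
```

Chaining with && means Terraform is only invoked when both values are present.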

What all does this install?

Inside your EKS cluster

  • kube-state-metrics
  • Kuberhealthy
  • node-problem-detector
  • AWS Distro for OpenTelemetry (ADOT)
  • A Nobl9 Agent compatible with and connecting to Amazon Managed Service for Prometheus

Inside your Nobl9 Organization

  • A Nobl9 Project and a Service to hold the SLOs

  • The EKG SLOs (several prefab SLOs for Kubernetes)

  • An agent-based Data Source configured to receive data from the Nobl9 Agent running in the EKS cluster

Elsewhere inside your AWS Account

  • An IAM User with access keys (configured in the Nobl9 Agent) and an inline policy allowing it to access the Amazon Managed Service for Prometheus workspace

  • An Amazon Managed Service for Prometheus workspace, available at the URL output as amp_ws_endpoint
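The amp_ws_endpoint value can be read back at any time with terraform output, and other tools often want its region and workspace ID separately. The following is a small sketch, assuming the endpoint always has the shape shown in the sample output earlier. The names amp_region and amp_workspace_id are hypothetical helpers, not part of EKG.

```shell
#!/bin/sh
# Sketch: split an Amazon Managed Service for Prometheus endpoint of the form
#   https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/
# into its region and workspace ID. Both function names are hypothetical.

amp_region() {
  # Drop the fixed scheme and host prefix, then everything from
  # ".amazonaws.com" onward, leaving only the region.
  host=${1#https://aps-workspaces.}
  printf '%s\n' "${host%%.amazonaws.com*}"
}

amp_workspace_id() {
  # Drop everything up to and including "/workspaces/", then everything
  # from the next slash onward, leaving only the workspace ID.
  path=${1#*/workspaces/}
  printf '%s\n' "${path%%/*}"
}

# Possible usage from the repository root after terraform apply
# (commented out so that sourcing this file has no side effects):
#   endpoint=$(terraform output -raw amp_ws_endpoint)
#   amp_region "$endpoint"
#   amp_workspace_id "$endpoint"
```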

The amp_ws_endpoint is a URL for Amazon Managed Service for Prometheus that can be used directly, for instance in Grafana or Nobl9. Deploying Grafana or other visualization tools is not in the scope of this project, but if you are looking for a quick and clean Grafana to play with as you explore the metrics in your newly created Prometheus, how about running it locally from a Docker image, including the additional env vars required to let it connect to Amazon Managed Service for Prometheus?

Or in short:

docker run -d -p 3000:3000 --name="grafana" -e "AWS_SDK_LOAD_CONFIG=true" -e "GF_AUTH_SIGV4_AUTH_ENABLED=true" grafana/grafana-oss

Then, when you configure a Prometheus data source, Grafana will offer AWS-specific settings for connecting it to that amp_ws_endpoint.
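The workspace can also be queried without Grafana, since Amazon Managed Service for Prometheus is Prometheus-compatible. The sketch below only builds the instant-query URL. It assumes the standard api/v1/query path under the workspace endpoint, and amp_query_url is a hypothetical helper name. Requests must be SigV4-signed, so the commented usage assumes a signing client such as awscurl is installed and that your AWS credentials are allowed to query the workspace.

```shell
#!/bin/sh
# Sketch: derive the Prometheus instant-query URL from amp_ws_endpoint.
# amp_query_url is a hypothetical helper name, not part of EKG.
amp_query_url() {
  # Remove one trailing slash (if present) so the result never contains
  # a double slash, then append the Prometheus HTTP API query path.
  printf '%s/api/v1/query\n' "${1%/}"
}

# Possible usage (commented out; needs awscurl, AWS credentials, and a
# region matching the workspace):
#   endpoint=$(terraform output -raw amp_ws_endpoint)
#   awscurl --service aps --region us-east-2 "$(amp_query_url "$endpoint")?query=up"
```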

Helpful Links

<!-- BEGIN_TF_DOCS -->

Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> terraform | >= 1.1.0 |
| <a name="requirement_aws"></a> aws | >= 3.72 |
| <a name="requirement_helm"></a> helm | >= 2.4.1 |
| <a name="requirement_kubernetes"></a> kubernetes | >= 2.10 |
| <a name="requirement_nobl9"></a> nobl9 | 0.26.0 |

Providers

| Name | Version |
|------|---------|
| <a name="provider_aws"></a> aws | 5.53.0 |
| <a name="provider_kubernetes"></a> kubernetes | 2.30.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_adot_amp"></a> adot_amp | ./modules/adot-amp | n/a |
| <a name="module_electrodes"></a> electrodes | ./modules/electrodes | n/a |
| <a name="module_nobl9"></a> nobl9 | ./modules/nobl9 | n/a |

Resources

| Name | Type |
|------|------|
| [kubernetes_namespace.this](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/res
