Holmesgpt
SRE Agent - CNCF Sandbox Project
Install / Use
/learn @HolmesGPT/HolmesgptREADME
Open-source AI agent for investigating production incidents and finding root causes. Works with any stack — Kubernetes, VMs, cloud providers, databases, and SaaS platforms. We are a Cloud Native Computing Foundation sandbox project. Originally created by Robusta.Dev, with major contributions from Microsoft.
- Petabyte-scale data: Server-side filtering, JSON tree traversal, and tool output transformers keep large payloads out of context windows
- Memory-safe execution: Per-tool memory limits, streaming large results to disk, and automatic output budgeting prevent OOM kills when querying large observability datasets
- Deep integrations: Prometheus, Grafana, Datadog, Kubernetes, and many more—plus any REST API
- Bidirectional alert integrations: Fetch alerts from AlertManager, PagerDuty, OpsGenie, or Jira—and write findings back
- Any LLM provider: OpenAI, Anthropic, Azure, Bedrock, Gemini, and more
- No Kubernetes required: Works with any infrastructure — VMs, bare metal, cloud services, or containers
- Operator mode: Optionally run as a Kubernetes operator for automated investigations
How it Works
HolmesGPT uses an agentic loop to query live observability data from multiple sources and identify root causes.
<img width="3114" alt="holmesgpt-architecture-diagram" src="https://github.com/user-attachments/assets/f659707e-1958-4add-9238-8565a5e3713a" />
🔗 Data Sources
HolmesGPT integrates with popular observability and cloud platforms. The following data sources ("toolsets") are built-in. Add your own.
| Data Source | Notes | |-------------|-------| | <img src="images/integration_logos/aks-icon.png" alt="AKS" width="20" style="vertical-align: middle;"> AKS | Azure Kubernetes Service cluster and node health diagnostics | | <img src="images/integration_logos/argocd-icon.png" alt="ArgoCD" width="20" style="vertical-align: middle;"> ArgoCD | Get status, history and manifests and more of apps, projects and clusters | | <img src="images/integration_logos/aws_logo.png" alt="AWS" width="20" style="vertical-align: middle;"> AWS | RDS events, instances, slow query logs, and more (MCP) | | <img src="images/integration_logos/azure.png" alt="Azure" width="20" style="vertical-align: middle;"> Azure | Azure resources and diagnostics (MCP) | | <img src="images/integration_logos/azure.png" alt="Azure SQL" width="20" style="vertical-align: middle;"> Azure SQL | Database health, performance, connections, and slow queries | | <img src="images/integration_logos/confluence_logo.png" alt="Confluence" width="20" style="vertical-align: middle;"> Confluence | Private runbooks and documentation | | <img src="images/integration_logos/confluence_logo.png" alt="Confluence MCP" width="20" style="vertical-align: middle;"> Confluence (MCP) | Private runbooks and documentation (MCP) | | <img src="images/integration_logos/coralogix-icon.png" alt="Coralogix" width="20" style="vertical-align: middle;"> Coralogix | Retrieve logs for any resource | | <img src="images/integration_logos/datadog_logo.png" alt="Datadog" width="20" style="vertical-align: middle;"> Datadog | Query logs, metrics, and traces | | <img src="images/integration_logos/docker_logo.png" alt="Docker" width="20" style="vertical-align: middle;"> Docker | Get images, logs, events, history and more | | <img src="images/integration_logos/opensearchserverless-icon.png" alt="Elasticsearch" width="20" style="vertical-align: middle;"> Elasticsearch / OpenSearch | Query logs, cluster health, shard and index diagnostics | | <img src="images/integration_logos/gcpmonitoring-icon.png" alt="GCP" width="20" style="vertical-align: middle;"> GCP | Google Cloud Platform resources (MCP) | | <img src="images/integration_logos/github_logo.png" alt="GitHub" width="20" style="vertical-align: middle;"> GitHub | Repositories, issues, and pull requests (MCP) | | <img src="images/integration_logos/grafana-icon.png" alt="Grafana" width="20" style="vertical-align: middle;"> Grafana | Query and analyze dashboard configurations and panels | | <img src="images/integration_logos/helm_logo.png" alt="Helm" width="20" style="vertical-align: middle;"> Helm | Release status, chart metadata, and values | | <img src="images/integration_logos/http-icon.png" alt="Internet" width="20" style="vertical-align: middle;"> Internet | Public runbooks, community docs etc | | <img src="images/integration_logos/kafka_logo.png" alt="Kafka" width="20" style="vertical-align: middle;"> Kafka | Fetch metadata, list consumers and topics or find lagging consumer groups | | <img src="images/integration_logos/kubernetes-icon.png" alt="Kubernetes" width="20" style="vertical-align: middle;"> Kubernetes | Pod logs, K8s events, and resource status (kubectl describe) | | <img src="images/integration_logos/kubernetes-icon.png" alt="Kubernetes Remediation" width="20" style="vertical-align: middle;"> Kubernetes Remediation (MCP) | Apply fixes like scaling, rollbacks, and resource edits (MCP) | | <img src="images/integration_logos/grafana_loki-icon.png" alt="Loki" width="20" style="vertical-align: middle;"> Loki | Query logs for Kubernetes resources or any query | | <img src="images/integration_logos/postgres-icon.png" alt="MariaDB" width="20" style="vertical-align: middle;"> MariaDB | MariaDB database queries and diagnostics (MCP) | | <img src="images/integration_logos/postgres-icon.png" alt="MongoDB" width="20" style="vertical-align: middle;"> MongoDB | Query data, diagnose performance, inspect schemas, find slow operations | | <img src="images/integration_logos/postgres-icon.png" alt="MongoDB Atlas" width="20" style="vertical-align: middle;"> MongoDB Atlas | Cluster health, slow queries, and performance diagnostics | | <img src="images/integration_logos/newrelic_logo.png" alt="NewRelic" width="20" style="vertical-align: middle;"> NewRelic | Investigate alerts, query tracing data | | <img src="images/integration_logos/openshift-icon.png" alt="OpenShift" width="20" style="vertical-align: middle;"> OpenShift | Projects, routes, builds, security context constraints, and deployment configs | | <img src="images/integration_logos/prefect-icon.png" alt="Prefect" width="20" style="vertical-align: middle;"> Prefect (MCP) | Workflow orchestration monitoring, flow runs, and worker health (MCP) | | <img src="images/integration_logos/prometheus-icon.png" alt="Prometheus" width="20" style="vertical-align: middle;"> Prometheus | Investigate alerts, query metrics and generate PromQL queries | | <img src="images/integration_logos/rabbit_mq_logo.png" alt="RabbitMQ" width="20" style="vertical-align: middle;"> RabbitMQ | Partitions, memory/disk alerts, troubleshoot split-brain scenarios and more | | <img src="images/integration_logos/robusta_logo.png" alt="Robusta" width="20" style="vertical-align: middle;"> Robusta | Multi-cluster monitoring, historical change data, runbooks, PromQL graphs and more | | <img src="images/integration_logos/servicenow-icon.png" alt="ServiceNow" width="20" style="vertical-align: middle;"> ServiceNow | Query tables and incident records | | [<img src="images/integr
