
Chaosmeta

A chaos engineering platform for supporting the complete fault drill lifecycle.


Chinese README

Official Document

Introduction

ChaosMeta is a cloud-native chaos engineering platform open sourced by Ant Group. It embodies the methodologies, technologies, and products that Ant Group has accumulated over many years of running large-scale, company-wide red-team and blue-team attack-and-defense drills. Guided in theory by the "Risk Catalog" (an internal manual of general risk scenarios for technical components in various fields) and grounded in technical practice, it has safeguarded Ant Group's major promotional events for many years.

ChaosMeta is dedicated to supporting every stage of a fault drill. Its platform capabilities cover access detection, traffic injection, fault injection, fault measurement, fault recovery, and recovery measurement. While freeing users from manual work, it also pursues the future form of chaos engineering: one-click automated drills, and eventually intelligent drills.

Core advantages

Simple and easy to use, with a user interface and a low barrier to entry

Supports a visual user interface, the Kubernetes API, the command line, an HTTP API, and other access methods.

<img src="docs/static/componentlink.png">

Proven through extensive practical experience, highly reliable

The Blue Army team of Ant Group has been deeply involved in chaos engineering for many years. Every year it holds large-scale, company-wide red-team and blue-team attack-and-defense drills that cover all of the company's businesses, and many businesses also run 24/7 drills and regular monthly drills.

Internal drill targets include cloud products, Kubernetes, Operator applications, databases (OceanBase, etcd, etc.), middleware (message queues, distributed scheduling, configuration centers, etc.), and business applications (Java, C++, and Golang applications).

Highly flexible, supporting a variety of user needs

Whether users want a complete chaos engineering platform, only the underlying platform capabilities such as remote injection, orchestration, and scheduling, only the single-machine fault injection capability, or need to manage and inject faults into targets on or off the cloud, there is a corresponding deployment option.

Rich fault injection capabilities, cloud-native chaos engineering

Ant Group attaches great importance to attack-and-defense drills, so drills are run at large scale and high frequency, which in turn has driven the development of a wide range of fault injection capabilities. Because Ant's internal infrastructure is very large and finance has a low tolerance for faults, the stability requirements for infrastructure such as Kubernetes and middleware are very high. As a result, Ant's chaos engineering practice has accumulated rich fault capabilities and drill experience in the cloud-native field.

Powerful platform capabilities that support the complete "chaos engineering life cycle" and are oriented towards automation

ChaosMeta provides platform capabilities for access detection, traffic injection, fault injection, fault measurement, fault recovery, recovery measurement, and other stages, which serve as the technical basis of "automated chaos engineering".

Besides platform support for the drill process, the other major obstacle to automated drills is experiment design. At present it is difficult to rely entirely on machines to design experiments automatically. However, reusable experience can be systematically abstracted and organized into a manual, which can then be quickly reused when running chaos engineering drills on components of the same type. This is the original intention behind the Risk Catalog.

<img src="docs/static/riskdir_en.png" width="50%" >

ChaosMeta will build a one-click "health check" automated drill capability on the technical foundation of the "Chaos Engineering Life Cycle" and the theoretical basis of the "Risk Catalog", directly generating a stability score for the target and greatly reducing users' manual effort in chaos engineering.

Architecture overview

User layer (Client)

The Client layer consists mainly of the chaosmeta-platform component. Its main task is to lower the barrier to entry by providing a visual interface for drill planning, orchestration, experiment configuration, experiment record details, Agent management (pods/nodes of k8s clusters, cross-cluster objects, non-k8s physical machines/containers, etc.), and other platform capabilities.

Engine layer (Engine)

The Engine layer includes the core platform capabilities of ChaosMeta and the implementation of some cloud-native fault capabilities, including the following components:

  • chaosmeta-CRD: ChaosMeta's platform capabilities are built on the Operator framework, so each type of capability has a corresponding CRD, and a corresponding Operator watches its status and performs the required operations. For example, the CRD for the fault injection capability is experiments.inject.chaosmeta.io, and the Operator that watches it is chaosmeta-inject-operator. Users can therefore create CR instances through kubectl or a Kubernetes client to invoke the corresponding capability.

  • chaosmeta-inject-operator: Watches the fault injection CR instances created by users. In its control loop it compares the actual state of each CR in the cluster with the expected state, executes the relevant fault injection logic and state transitions, and drives the actual state toward the desired state. The operation performed depends on the fault type defined in the CR instance. For example, a system resource fault requires remote injection through chaosmeta-daemonset, HTTP, or a command channel; a cloud-native fault is injected through the Kubernetes APIServer; and a fault involving dynamic admission additionally requires a request to chaosmeta-webhook to update its tampering and interception rules.

  • chaosmeta-webhook: Every API request handled by the APIServer passes through authentication, authorization, and admission, and the admission stage in turn passes through the Mutating Admission Webhook (tampering) and Validating Admission Webhook (verification) stages. chaosmeta-webhook updates its resource matching rules according to the fault definition and intercepts, tampers with, delays, or injects exceptions into users' Kubernetes resource creation requests. This is valuable for fault drills that target Operator applications and the robustness of the Kubernetes cluster itself.

  • chaosmeta-measure-operator: The component that performs measurement. It is mainly used in two phases: fault measurement and recovery measurement. Fault measurement checks whether the fault injection had the intended effect, while recovery measurement checks how effectively the defense platform restores the system. Measurement is the key capability for achieving automation and intelligence in chaos engineering.

For example, suppose a drill expects the number of successful requests for a certain service to drop by 50%, and expects the corresponding defense platform to detect the fault within 5 minutes and recover within 10 minutes. The fault is injected by saturating the CPU. The fault measurement phase must then find the point in time at which the number of successful requests has dropped by 50% compared with before the injection (the fault effective point). The recovery measurement phase must find the point in time at which the corresponding alarm is raised (the fault discovery point), and the point in time after the fault discovery point at which the number of successful requests returns to its pre-drill level (the fault recovery point). Finally, an analysis report of the drill is generated, identifying areas where the defense platform can improve.

  • chaosmeta-workflow-operator: Provides fault orchestration. In practice, beyond single-fault scenarios there is also demand for a large number of complex fault scenarios, which have to be simulated by combining different fault injection capabilities in series and in parallel. Orchestration is not limited to fault injection: a workflow can also include nodes of other capability types such as traffic injection, fault admission detection, fault measurement, and recovery measurement. This is also a key capability for automating drills.

  • chaosmeta-flow-operator: The component that performs traffic injection, mainly used to simulate traffic to the target services. A fault drill often needs a certain level of traffic before the fault has any visible effect. For example, to trigger a latency alarm for a service, injecting delay into the service's container network is not enough: if no requests are flowing, the corresponding monitoring alarm will not fire.
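The measurement example given for chaosmeta-measure-operator above (a 50% drop in successful requests, detection within 5 minutes, recovery within 10 minutes) can be illustrated with a minimal Python sketch. This is not ChaosMeta's implementation or API: the `measure_drill` function, the `DrillReport` type, the per-minute sampling, the 5% recovery tolerance, and the choice to count both time limits from the fault effective point are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DrillReport:
    """Hypothetical report type; index values are minutes since injection."""
    effective_at: Optional[int]   # fault effective point
    discovered_at: Optional[int]  # fault discovery point (alarm time)
    recovered_at: Optional[int]   # fault recovery point
    detected_in_time: bool
    recovered_in_time: bool


def measure_drill(
    success_counts: Sequence[float],  # one sample per minute, index 0 = injection
    baseline: float,                  # successful requests per minute before the drill
    alarm_minute: Optional[int],      # minute the alarm fired, None if it never did
    drop_ratio: float = 0.5,          # expected drop (50% in the example)
    detect_limit: int = 5,            # minutes allowed for detection
    recover_limit: int = 10,          # minutes allowed for recovery
    tolerance: float = 0.05,          # assumed: within 5% of baseline counts as recovered
) -> DrillReport:
    # Fault measurement: first minute where traffic has dropped by drop_ratio.
    drop_threshold = baseline * (1 - drop_ratio)
    effective_at = next(
        (t for t, c in enumerate(success_counts) if c <= drop_threshold), None
    )

    # Recovery measurement: the alarm time is the discovery point, and the
    # recovery point is the first later minute back at (almost) baseline.
    discovered_at = alarm_minute
    recovered_at = None
    if discovered_at is not None:
        recover_threshold = baseline * (1 - tolerance)
        recovered_at = next(
            (t for t, c in enumerate(success_counts)
             if t > discovered_at and c >= recover_threshold),
            None,
        )

    # Assumption: both limits are counted from the fault effective point.
    detected_in_time = (
        effective_at is not None
        and discovered_at is not None
        and discovered_at - effective_at <= detect_limit
    )
    recovered_in_time = (
        effective_at is not None
        and recovered_at is not None
        and recovered_at - effective_at <= recover_limit
    )
    return DrillReport(
        effective_at, discovered_at, recovered_at,
        detected_in_time, recovered_in_time,
    )
```

With per-minute samples `[100, 98, 60, 45, 40, 42, 44, 70, 96, 100]`, a baseline of 100, and an alarm at minute 6, this sketch places the fault effective point at minute 3, the discovery point at minute 6, and the recovery point at minute 8, so both limits are met. A real measurement component would read these series from a monitoring system rather than from an in-memory list.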

Kernel layer (Kernel)

The Kernel layer implements the single-machine fault injection capability. It consists mainly of the chaosmetad component, which can run either as a resident HTTP service or through command-line execution, and is also packaged as a daemonset component (chaosmeta-daemonset). It can therefore be flexibly combined with drill platforms to suit different needs.

Capabilities of the current version

The current version includes: the user interface, the fault injection scheduling engine, the measurement engine, and traffic injection.
