Nezha
The implementation of multimodal observability data root cause analysis approach Nezha in FSE 2023
Install / Use
/learn @IntelligentDDS/NezhaREADME
Nezha
This repository is the basic implementation of our publication in FSE'23 conference paper Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data
Description
Nezha is an interpretable and fine-grained RCA approach that pinpoints root causes at the code region and resource type level by incorporative analysis of multimodal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way.
Quick Start
Requirements
- Python3.6 is recommended to run the anomaly detection. Otherwise, any python3 version should be fine.
- Git is also needed.
Setup
Download Nezha first via git clone git@github.com:IntelligentDDS/Nezha.git
Enter Nezha content by cd Nezha
python3.6 -m pip install -r requirements.txt to install the dependency for Nezha
Running Nezha
OnlineBoutique at service level
python3.6 ./main.py --ns hipster --level service
pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 96.428571 %
OnlineBoutique at inner service level
python3.6 ./main.py --ns hipster --level inner
pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 96.428571 %
Trainticket at service level
python3.6 ./main.py --ns ts --level service
pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 97.777778 %
Trainticket at inner service level
python3.6 ./main.py --ns ts --level inner
pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 97.777778 %
The details of service level results and inner-service level results will be printed and recorded in ./log
Dataset
2022-08-22 and 2022-08-23 is the fault-suffering dataset of OnlineBoutique
2023-01-29 and 2023-01-30 is the fault-suffering dataset of Trainticket
Fault-free data
construct_data is the data of fault-free phase
root_cause_hipster.json is the inner-servie level label of root causes in OnlineBoutique
root_cause_ts.json is the inner-servie level label of root causes in Trainticket
As an example,
"checkoutservice": {
"return": "Start charge card_Charge successfully",
"exception": "Start charge card_Charge successfully",
"network_delay": "NetworkP90(ms)",
"cpu_contention": "CpuUsageRate(%)",
"cpu_consumed": "CpuUsageRate(%)"
},
The label of checkoutservice means that the label return fault of checkoutservice is core regions between log statement contains Start charge card and Charge successfully.
Fault-suffering Data
rca_data is the data of fault-suffering phase
2022-08-22-fault_list and 2022-08-23-fault_list is the servie level label of root causes in OnlineBoutique
2023-01-29-fault_list and 2022-01-30-fault_list is the servie level label of root causes in TrainTicket
Project Structure
.
├── LICENSE
├── README.md
├── construct_data
│ ├── 2022-08-22
│ │ ├── log
│ │ ├── metric
│ │ ├── trace
│ │ └── traceid
│ ├── 2022-08-23
│ ├── 2023-01-29
│ ├── 2023-01-30
│ ├── root_cause_hipster.json: label at inner-service level for OnlineBoutique
│ └── root_cause_ts.json: label at inner-service level for ts
├── rca_data
│ ├── 2022-08-22
│ │ ├── log
│ │ ├── metric
│ │ ├── trace
│ │ ├── traceid
│ │ └── 2022-08-22-fault_list.json: label at service level
│ ├── 2022-08-23
│ ├── 2023-01-29
│ └── 2023-01-30
├── log: RCA result
├── log_template: drain3 config
├── alarm.py: generate alarm
├── data_integrate.py: transform metric, log, and trace to event graph
├── log_parsing.py: parsing logs
├── log.py: record logs
├── pattern_miner.py: mine patterns from event graph
├── pattern_ranker.py: rank suspicious patterns
├── main.py: running nezha
└── requirements.txt
Reference
Please cite our FSE'23 paper if you find this work is helpful.
@inproceedings{nezha,
title={Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data},
author={Yu, Guangba and Chen, Pengfei and Li, Yufeng and Chen, Hongyang and Li, Xiaoyun and Zheng, Zibin},
booktitle={ESEC/FSE 2023},
pages={},
year={2023},
organization={ACM}
}
Related Skills
node-connect
337.1kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
83.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
337.1kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
83.1kCommit, push, and open a PR
