Sesame
[SIGMOD'23] Data Stream Clustering: An In-depth Empirical Study
[ICDM'24] MOStream: A Modular and Self-Optimizing Data Stream Clustering Algorithm
About
Sesame is a scalable stream mining library for modern hardware, written in C++.
It currently contains several representative real-world stream clustering algorithms as well as synthetic algorithms.
Quick Start
Installation
pip3 install pysame
Python Example
#!python3
from pysame import Benne, Birch, BenneObj
X = [[0, 1], [0.3, 1], [-0.3, 1], [0, -1], [0.3, -1], [-0.3, -1]]
# Run the BIRCH algorithm
brc = Birch(
    n_clusters=2,
    dim=2,
    distance_threshold=0.5,
)
print(brc.partial_fit(X).predict(X))
# Run the Benne algorithm
bne = Benne(
    n_clusters=2,
    dim=2,
    distance_threshold=0.5,
    obj=BenneObj.accuracy,
)
print(bne.partial_fit(X).predict(X))
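Because the example calls `partial_fit`, a stream can presumably be fed to a model in chunks rather than all at once. The sketch below shows only the chunking side and runs without pysame. `iter_batches` is a hypothetical helper, not part of pysame, and whether repeated `partial_fit` calls accumulate state (as scikit-learn estimators do) is an assumption to check against the library.

```python
from typing import Iterator, List, Sequence

Point = Sequence[float]


def iter_batches(points: Sequence[Point], batch_size: int) -> Iterator[List[Point]]:
    """Yield consecutive chunks of at most batch_size points, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(points), batch_size):
        yield list(points[start:start + batch_size])


X = [[0, 1], [0.3, 1], [-0.3, 1], [0, -1], [0.3, -1], [-0.3, -1]]
batches = list(iter_batches(X, batch_size=4))
print([len(b) for b in batches])

# Assumed usage with a model built as in the example above (not verified here):
# for batch in iter_batches(X, batch_size=4):
#     brc.partial_fit(batch)
```

The last chunk may be shorter than `batch_size`; no padding is applied, so every point is passed on exactly once.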
Build Sesame
Prerequisites
Checkout Source Code
git clone https://github.com/intellistream/Sesame --recursive --depth=1
cd Sesame
Build
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
Run Tests
Download the datasets from Zenodo and put them in the datasets directory:
cd Sesame/datasets
pip3 install zenodo_get
zenodo_get 8210331
Run the tests:
cd Sesame/build/test
./google_test
Real-world algorithms
| Algorithm  | Window Model | Outlier Detection | Summarizing Data Structure | Offline Refinement |
| ---------- | ------------ | ----------------- | -------------------------- | ------------------ |
| BIRCH      | LandmarkWM   | OutlierD          | CFT                        | ❌                 |
| CluStream  | LandmarkWM   | OutlierD-T        | MCs                        | ✅                 |
| DenStream  | DampedWM     | OutlierD-BT       | MCs                        | ✅                 |
| DStream    | DampedWM     | OutlierD-T        | Grids                      | ❌                 |
| StreamKM++ | LandmarkWM   | NoOutlierD        | CoreT                      | ✅                 |
| DBStream   | DampedWM     | OutlierD-T        | MCs                        | ✅                 |
| EDMStream  | DampedWM     | OutlierD-BT       | DPT                        | ❌                 |
| SL-KMeans  | SlidingWM    | NoOutlierD        | AMS                        | ❌                 |
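The CFT and MCs entries in the Summarizing Data Structure column both build on an additive point summary. As a rough illustration, the sketch below implements a clustering feature in the style of the original BIRCH paper (point count, linear sum, sum of squared norms). The class and method names are hypothetical and do not reflect Sesame's C++ implementation.

```python
import math
from typing import List, Sequence


class ClusteringFeature:
    """Additive summary (n, linear sum, squared sum) of a set of points."""

    def __init__(self, dim: int) -> None:
        self.n = 0
        self.linear_sum: List[float] = [0.0] * dim
        self.squared_sum = 0.0

    def insert(self, point: Sequence[float]) -> None:
        """Absorb one point by updating the three running sums."""
        self.n += 1
        for i, value in enumerate(point):
            self.linear_sum[i] += value
        self.squared_sum += sum(value * value for value in point)

    def merge(self, other: "ClusteringFeature") -> None:
        """Absorb another summary; the sums simply add component-wise."""
        self.n += other.n
        for i, value in enumerate(other.linear_sum):
            self.linear_sum[i] += value
        self.squared_sum += other.squared_sum

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        """Root-mean-square distance of the summarized points to the centroid."""
        variance = self.squared_sum / self.n - sum(c * c for c in self.centroid())
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error


upper = ClusteringFeature(dim=2)
for p in [[0, 1], [0.3, 1], [-0.3, 1]]:
    upper.insert(p)

lower = ClusteringFeature(dim=2)
for p in [[0, -1], [0.3, -1], [-0.3, -1]]:
    lower.insert(p)

print(upper.centroid(), upper.radius())
upper.merge(lower)
print(upper.n, upper.centroid())
```

Since only three sums are stored, inserting a point or merging two summaries takes time proportional to the dimension, regardless of how many points have already been absorbed.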
Synthetic algorithms
| Algorithm | Window Model | Outlier Detection | Summarizing Data Structure | Offline Refinement |
| --------- | ------------ | ----------------- | -------------------------- | ------------------ |
| G1        | LandmarkWM   | OutlierD          | MCs                        | ✅                 |
| G2        | LandmarkWM   | OutlierD          | MCs                        | ✅                 |
| G3        | LandmarkWM   | OutlierD          | CFT                        | ❌                 |
| G4        | SlidingWM    | OutlierD          | MCs                        | ❌                 |
| G5        | DampedWM     | OutlierD-B        | MCs                        | ❌                 |
| G6        | LandmarkWM   | NoOutlierD        | MCs                        | ❌                 |
| G8        | LandmarkWM   | OutlierD          | MCs                        | ❌                 |
| G9        | LandmarkWM   | OutlierD          | Grids                      | ❌                 |
| G10       | LandmarkWM   | OutlierD          | DPT                        | ❌                 |
| G11       | LandmarkWM   | OutlierD-T        | MCs                        | ❌                 |
| G12       | LandmarkWM   | OutlierD-B        | MCs                        | ❌                 |
| G13       | LandmarkWM   | OutlierD-BT       | MCs                        | ❌                 |
| G14       | LandmarkWM   | OutlierD          | AMS                        | ❌                 |
| G15       | LandmarkWM   | OutlierD          | CoreT                      | ❌                 |
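The three Window Model values differ in how much weight a past point still carries. The sketch below is a minimal illustration of that idea, not Sesame's implementation: the function names are hypothetical, and the base-2 exponential decay for the damped window is an assumption borrowed from the form commonly used in the DenStream literature.

```python
from typing import List, Sequence


def landmark_weights(timestamps: Sequence[float], now: float, landmark: float) -> List[float]:
    """Landmark window: every point seen since the landmark counts fully."""
    return [1.0 if landmark <= t <= now else 0.0 for t in timestamps]


def sliding_weights(timestamps: Sequence[float], now: float, window: float) -> List[float]:
    """Sliding window: only points younger than the window length count."""
    return [1.0 if 0 <= now - t < window else 0.0 for t in timestamps]


def damped_weights(timestamps: Sequence[float], now: float, decay: float) -> List[float]:
    """Damped window: weight fades as 2 ** (-decay * age), never reaching zero."""
    return [2.0 ** (-decay * (now - t)) for t in timestamps]


timestamps = [0, 1, 2, 3, 4]
now = 4

print(landmark_weights(timestamps, now, landmark=2))
print(sliding_weights(timestamps, now, window=2))
print(damped_weights(timestamps, now, decay=1.0))
```

With these inputs the landmark window keeps the three points from time 2 onward, the sliding window keeps only the two newest points, and the damped window halves the weight for each step of age.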
Datasets
| Dataset   | Length                                | Dimension | Cluster Number         |
| --------- | ------------------------------------- | --------- | ---------------------- |
| CoverType | 581012                                | 54        | 7                      |
| KDD-99    | 4898431                               | 41        | 23                     |
| Insects   | 905145                                | 33        | 24                     |
| Sensor    | 2219803                               | 5         | 55                     |
| EDS       | 45690, 100270, 150645, 200060, 245270 | 2         | 75, 145, 218, 289, 363 |
| ODS       | 94720, 97360, 100000                  | 2         | 90, 90, 90             |

The datasets can be downloaded from Zenodo: https://zenodo.org/records/8210331
How to Cite Sesame
- [SIGMOD 2023] Xin Wang, Zhengru Wang, Zhenyu Wu, Shuhao Zhang, Xuanhua Shi, and Li Lu. Data Stream Clustering: An In-depth Empirical Study. SIGMOD, 2023.
@inproceedings{wang2023sesame,
  title = {Data Stream Clustering: An In-depth Empirical Study},
  author = {Xin Wang and Zhengru Wang and Zhenyu Wu and Shuhao Zhang and Xuanhua Shi and Li Lu},
  year = 2023,
  booktitle = {Proceedings of the 2023 International Conference on Management of Data (SIGMOD)},
  location = {Seattle, WA, USA},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  series = {SIGMOD '23},
  abbr = {SIGMOD},
  bibtex_show = {true},
  selected = {true},
  pdf = {papers/Sesame.pdf},
  code = {https://github.com/intellistream/Sesame},
  doi = {10.1145/3589307},
  url = {https://doi.org/10.1145/3589307}
}
