DataMining

🔎 Data Understanding, Visualization, Preparation & Cleaning - Clustering algorithms (unsupervised learning) - Classification algorithms (supervised learning) - Sequential Pattern Mining


Install / Use

/learn @dilettagoglia/DataMining

Data Mining Project

Final project for the Data Mining course, A.Y. 2020/2021, at the University of Pisa. The project consists of a data analysis carried out with data mining tools.

Learning Goals

Fundamental concepts of data knowledge and discovery.
Data understanding
Data preparation
Clustering
Classification & Regression
Pattern Mining and Association Rules
Outlier Detection
Time Series Analysis
Sequential Pattern Mining
Ethical Issues

Further info:

Final grade: 30/30

Project Description

Task 1: Data Understanding and Preparation

Explore the dataset with analytical tools and describe the data semantics, assessing data quality, the distribution of the variables and their correlations. Improve the quality of the data and prepare it by extracting new, interesting features that describe the customer profile and purchasing behavior. Define additional indicators for the construction of a customer profile that can lead to an interesting customer segmentation analysis.

Explore the new set of features through statistical analysis (distributions, outliers, visualizations, correlations).
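As an illustration of the feature extraction step, the sketch below aggregates raw purchase rows into one profile per customer. The row layout `(customer_id, basket_id, product, quantity, unit_price)` and the chosen indicators (including the Shannon entropy of the purchased products) are assumptions made for this example, not necessarily the features used in the project. Quantities are assumed to be positive.

```python
from collections import Counter, defaultdict
from math import log2

def customer_profiles(transactions):
    """Build one feature dictionary per customer.

    transactions: iterable of (customer_id, basket_id, product, quantity, unit_price).
    This layout is hypothetical; adapt it to the real dataset columns.
    """
    rows_by_customer = defaultdict(list)
    for row in transactions:
        rows_by_customer[row[0]].append(row)

    profiles = {}
    for customer, rows in rows_by_customer.items():
        product_counts = Counter()
        baskets = set()
        total_spent = 0.0
        for _, basket, product, quantity, unit_price in rows:
            product_counts[product] += quantity
            baskets.add(basket)
            total_spent += quantity * unit_price
        total_items = sum(product_counts.values())
        # Shannon entropy of the product distribution: 0 when the customer
        # always buys the same product, higher for more varied purchases.
        shares = [count / total_items for count in product_counts.values()]
        entropy = sum(p * log2(1 / p) for p in shares)
        profiles[customer] = {
            "total_items": total_items,
            "distinct_products": len(product_counts),
            "n_baskets": len(baskets),
            "total_spent": total_spent,
            "entropy": entropy,
        }
    return profiles
```

Each profile can then be examined like any other variable (distribution, outliers, correlations).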

Subtasks

Data semantics
Distribution of the variables and statistics
Assessing data quality (duplicates, missing values, outliers)
Variables transformations & generation
Pairwise correlations and possible elimination of redundant variables
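The data quality subtask can be sketched in plain Python as follows. The column names in the usage example are hypothetical, and Tukey's IQR rule is only one of several possible outlier criteria; the project notebooks may use a different one.

```python
from statistics import median

def quality_report(rows):
    """Count exact duplicate rows and missing (None) values per column.

    rows: list of dicts mapping column name to value.
    """
    seen = set()
    duplicates = 0
    missing = {}
    for row in rows:
        key = tuple(sorted(row.items()))  # column names are unique, so values are never compared
        if key in seen:
            duplicates += 1
        seen.add(key)
        for column, value in row.items():
            if value is None:
                missing[column] = missing.get(column, 0) + 1
    return {"duplicates": duplicates, "missing": missing}

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    if len(values) < 4:
        return []  # too few points for meaningful quartiles
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])   # median of the lower half
    q3 = median(data[-half:])  # median of the upper half
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

For example, `quality_report([{"customer_id": 1, "quantity": 5}, {"customer_id": 1, "quantity": 5}])` reports one duplicate row.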

Task 2: Clustering analysis

Based on the customer’s profile, explore the dataset using various clustering techniques. Carefully describe the decisions taken for each algorithm and the advantages offered by the different approaches.

Preprocessing:

Elimination of highly correlated features and normalization
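A minimal sketch of this preprocessing step, assuming Pearson correlation with a threshold of 0.9 and min-max scaling. Both choices, and the feature names in the usage example, are assumptions for illustration; the threshold and the scaler should be chosen by looking at the data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if its |r| with every already kept feature is below the threshold.

    columns: dict mapping feature name to its list of values.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def min_max(values):
    """Rescale values linearly to the [0, 1] range."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - low) / (high - low) for v in values]
```

Usage with hypothetical features:

```python
features = {
    "total_spent": [10, 20, 30, 40],
    "total_items": [1, 2, 3, 4.5],   # almost proportional to total_spent
    "n_orders": [3, 1, 4, 1],
}
kept = drop_correlated(features)      # "total_items" is dropped as redundant
scaled = {name: min_max(features[name]) for name in kept}
```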

Clustering Analysis by K-means

Identification of the best value of k
Characterization of the obtained clusters using both the analysis of the k centroids and a comparison of the distribution of the variables within each cluster against that of the whole dataset
Evaluation of the clustering results
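One common way to look for a good value of k is to run K-means for several values and inspect how the sum of squared errors (SSE) decreases, the so-called elbow method. The sketch below is a from-scratch illustration under that assumption; it is not the project's code, and in practice a library implementation would normally be used. Points are assumed to be tuples of numbers.

```python
import random
from math import dist  # available from Python 3.8

def assign(points, centroids):
    """Group each point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, clusters, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iterations):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    clusters = assign(points, centroids)
    sse = sum(dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster)
    return centroids, clusters, sse

def sse_by_k(points, ks):
    """SSE for each candidate k; a sharp bend in the curve suggests a reasonable k."""
    return {k: kmeans(points, k)[2] for k in ks}
```

The centroids returned by `kmeans` can be read as the "typical customer" of each cluster, which is the starting point for the characterization described above.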

Analysis by density-based clustering

Parameter tuning
Characterization and interpretation of the obtained clusters
Evaluation
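The README does not name the density-based algorithm; the sketch below assumes DBSCAN, the most common choice. It also includes a k-distance helper, a frequently used aid for choosing `eps`: the sorted distances to the k-th nearest neighbour are plotted and `eps` is picked near the bend of the curve. This is a simple quadratic-time illustration, not an optimized implementation.

```python
from math import dist  # available from Python 3.8

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour."""
    return sorted(sorted(dist(p, q) for q in points)[k] for p in points)
```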

Analysis by hierarchical clustering

Compare the clustering results obtained with different versions of the algorithm
Find the optimal cut
Show and discuss different dendrograms using different algorithms
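To make the comparison between versions of the algorithm concrete, the sketch below implements naive agglomerative clustering with three linkage criteria (single, complete, average), which is assumed here to be what "different versions" refers to. It returns the merge distances as well, since these are the heights at which a dendrogram would be drawn. It is a brute-force illustration suitable only for small inputs.

```python
from math import dist  # available from Python 3.8

LINKAGES = {
    "single": min,                                # distance of the closest pair
    "complete": max,                              # distance of the farthest pair
    "average": lambda ds: sum(ds) / len(ds),      # mean pairwise distance
}

def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, merge_heights).
    """
    combine = LINKAGES[linkage]
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine([dist(p, q) for p in clusters[a] for q in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        heights.append(d)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, heights
```

A simple heuristic for the cut is to look for the largest jump between consecutive merge heights and cut just below it; the linkage criterion changes these heights and therefore the shape of the dendrogram.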

Conclusions

Final evaluation of the best clustering approach and comparison of the clusterings obtained

Optional task for clustering:

Explore the opportunity to use alternative clustering techniques in the library: https://github.com/annoviko/pyclustering

Task 3: Predictive Analysis

Consider the problem of predicting for each customer a label that defines if (s)he is a high-spending customer, medium-spending customer or low-spending customer.

Define a customer profile that enables the above customer classification, also reasoning on the suitability of the customer profile defined for the clustering analysis.
Compute the label for every customer. Note that the class to be predicted must be nominal.
Perform the predictive analysis by comparing the performance of different models. Discuss the results and any preprocessing applied to the data to handle problems that could make the prediction hard. The evaluation should be performed on both the training set and the test set.
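The labelling and evaluation steps can be sketched as follows. Splitting customers into terciles of total spending is an assumption made for this example (the README does not say how the three classes are defined), and the 1-nearest-neighbour classifier and the feature tuples are hypothetical stand-ins for the models and profile actually compared in the project.

```python
from math import dist  # available from Python 3.8

def tercile_cuts(totals):
    """Two thresholds that split the sorted totals into three groups of similar size."""
    data = sorted(totals)
    n = len(data)
    return data[n // 3], data[(2 * n) // 3]

def spending_label(total, cuts):
    """Map a numeric total to a nominal class."""
    low_cut, high_cut = cuts
    if total < low_cut:
        return "low"
    if total < high_cut:
        return "medium"
    return "high"

def predict_1nn(train_x, train_y, x):
    """Label of the training point closest to x."""
    nearest = min(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return train_y[nearest]

def accuracy(y_true, y_pred):
    """Fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Evaluating on both sets means calling `accuracy` twice, once with predictions for the training customers and once for held-out customers. A model such as 1-NN scores perfectly on its own training data by construction, which is why the test score is the one that indicates how well the model generalizes.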

Task 4: Sequential Pattern Mining

Consider the problem of mining frequent sequential patterns. To address the task:

Model the customer as a sequence of baskets
Apply the sequential pattern mining algorithm
Discuss the resulting patterns
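The modelling step above can be illustrated with a small sketch: transactions are grouped into one time-ordered sequence of baskets per customer, and the support of a candidate pattern is the fraction of customers whose sequence contains it. The transaction layout `(customer_id, date, product)` with ISO-formatted dates is an assumption for this example. The sketch only measures support for given patterns; a full mining algorithm (for example GSP or PrefixSpan) would also generate the candidates.

```python
from collections import defaultdict

def build_sequences(transactions):
    """Group (customer_id, date, product) rows into a sequence of baskets per customer.

    Dates are assumed to be ISO strings, so sorting them sorts chronologically.
    """
    baskets = defaultdict(lambda: defaultdict(set))
    for customer, date, product in transactions:
        baskets[customer][date].add(product)
    return {
        customer: [frozenset(by_date[d]) for d in sorted(by_date)]
        for customer, by_date in baskets.items()
    }

def contains(sequence, pattern):
    """True if every itemset of the pattern is included, in order, in distinct baskets."""
    position = 0
    for basket in sequence:
        if position < len(pattern) and pattern[position] <= basket:
            position += 1
    return position == len(pattern)

def support(sequences, pattern):
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)
```

For instance, `support(list(build_sequences(rows).values()), [{"bread"}, {"butter"}])` gives the share of customers who bought bread and, in a later basket, butter.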

Optional Task:

Extend the algorithm and the analysis by considering one or more temporal constraints.
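As one example of a temporal constraint, the sketch below checks pattern containment under a maximum gap: consecutive matched baskets may be at most `max_gap` days apart. The choice of constraint and the representation of a sequence as `(day, basket)` pairs with integer days are assumptions for illustration. Unlike the unconstrained case, a greedy left-to-right match is not enough here, so the function backtracks over alternative matches.

```python
def contains_with_max_gap(sequence, pattern, max_gap):
    """Check containment with at most max_gap days between consecutive matched baskets.

    sequence: list of (day, basket) pairs sorted by day, with baskets as sets.
    pattern: list of itemsets (sets).
    """
    def match(position, start, last_day):
        if position == len(pattern):
            return True
        for i in range(start, len(sequence)):
            day, basket = sequence[i]
            if last_day is not None and day - last_day > max_gap:
                break  # sorted by day: every later basket is even further away
            if pattern[position] <= basket and match(position + 1, i + 1, day):
                return True
        return False

    return match(0, 0, None)

def support_with_max_gap(sequences, pattern, max_gap):
    """Fraction of sequences containing the pattern under the max-gap constraint."""
    hits = sum(contains_with_max_gap(s, pattern, max_gap) for s in sequences)
    return hits / len(sequences)
```

With the sequence `[(1, {"bread"}), (3, {"bread"}), (5, {"butter"})]` and `max_gap=2`, the pattern bread then butter is found by matching the second bread basket, which a greedy match on the first one would miss.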

