AutoStats

A library for automatically cleaning, imputing, and analyzing datasets with minimal coding

AutoStats is a Python library designed to simplify the process of cleaning, imputing, and analyzing datasets with minimal coding effort. It provides tools for generating exploratory reports, handling missing data, and optimizing imputation methods, making it ideal for data scientists and analysts.

🚀 Features

Report Module

Auto Report: Automatically generates an initial exploratory report from your dataset, categorizing columns and visualizing data distributions.
Manual Report: Allows users to specify categorical, continuous, and discrete columns for a more customized report.

Impute Module

Data Preprocessing: Automatically preprocesses datasets by handling missing values, encoding categorical variables, and identifying column types (categorical, continuous, discrete).
Imputation Methods:
- KNN Imputation: Uses K-Nearest Neighbors to fill missing values.
- MICE Imputation: Implements Multiple Imputation by Chained Equations.
- MissForest Imputation: Uses Random Forests to impute missing values.
- MIDAS Imputation: Leverages deep learning for advanced imputation.
Hyperparameter Optimization: Automatically tunes imputation methods using Optuna for the best performance.
Best Method Selection: Evaluates multiple imputation methods and selects the best-performing one for each column.
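
To make the KNN idea above concrete, here is a minimal NumPy sketch. It is not AutoStats' implementation: the function name `knn_impute`, the parameter `k`, and the distance rule (root mean squared difference over the features both rows observe) are assumptions chosen for illustration.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        # Compare row i to every row using only features observed in both.
        both = ~missing[i] & ~missing
        diff = np.where(both, X - X[i], 0.0)
        n_shared = both.sum(axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        dist[n_shared == 0] = np.inf  # no shared features: not comparable
        dist[i] = np.inf              # a row is not its own neighbour
        for j in np.where(missing[i])[0]:
            # Only rows that actually observe feature j can donate a value.
            donors = np.where(~missing[:, j] & np.isfinite(dist))[0]
            if donors.size == 0:
                filled[i, j] = np.nanmean(X[:, j])  # fall back to column mean
                continue
            nearest = donors[np.argsort(dist[donors])[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled

# Hypothetical toy data: the first row is missing its third feature.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [8.0, 9.0, 20.0],
])
print(knn_impute(X, k=2))
```

The two rows closest to the incomplete row supply the fill value, so the distant fourth row has no influence. A tunable such as `k` is the kind of setting a hyperparameter search would explore.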

📦 Installation

To install AutoStats, ensure you have Python 3.8 or higher and run the following command:

pip install AutoStats

You can also view the project on PyPI.

Usage

Auto Report

To generate an automated exploratory data analysis report:

from AutoStats.report import auto_report
import pandas as pd

# Load your dataset
df = pd.read_csv("your_dataset.csv")

# Generate the report
auto_report(df, tresh=10, output_file="auto_report.pdf", df_name="Your Dataset")
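
For intuition about what categorizing columns can involve, the sketch below sorts columns into categorical, continuous, and discrete groups with a simple heuristic. This is an illustration only and is not taken from AutoStats: the function `categorize_columns` and its `max_levels` cutoff are hypothetical, and the source does not say what `tresh` controls, so no correspondence between the two is claimed.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def categorize_columns(df, max_levels=10):
    """Group column names by a rough guess at their statistical type."""
    groups = {"categorical": [], "continuous": [], "discrete": []}
    for name in df.columns:
        values = df[name].dropna()
        if is_bool_dtype(values) or not is_numeric_dtype(values):
            kind = "categorical"   # text, booleans, dates, ...
        elif values.nunique() <= max_levels:
            kind = "categorical"   # numeric codes with few distinct levels
        elif (values % 1 == 0).all():
            kind = "discrete"      # many distinct whole numbers (counts)
        else:
            kind = "continuous"    # fractional measurements
        groups[kind].append(name)
    return groups

# Hypothetical example frame with one column of each flavour.
df = pd.DataFrame({
    "city": ["a", "b", "c"] * 10,
    "grade": [1, 2, 3] * 10,
    "visits": list(range(30)),
    "weight": [i + 0.5 for i in range(30)],
})
print(categorize_columns(df, max_levels=10))
```

The resulting lists have the same shape as the three column lists that the manual report takes, so a heuristic like this can serve as a starting point that is then corrected by hand.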

Manual Report

To create a report with manually specified column types:

from AutoStats.report import manual_report
import pandas as pd

# Load your dataset
df = pd.read_csv("your_dataset.csv")

# Specify column types
categorical_cols = ['col1', 'col2']
continuous_cols = ['col3', 'col4']
discrete_cols = ['col5']

# Generate the report
manual_report(
    df,
    categorical_cols,
    continuous_cols,
    discrete_cols,
    output_file="manual_report.pdf",
    df_name="Your Dataset",
)

Data Imputation

To run the complete missing data imputation pipeline:

from AutoStats.impute import run_full_pipeline
import pandas as pd

# Load your dataset (with missing values)
df = pd.read_csv("your_dataset.csv")

# Run the imputation pipeline
best_imputed_df, summary_table = run_full_pipeline(df, simulate=True, build=True)

# The pipeline returns the imputed dataframe and a summary of the best methods used.
print(best_imputed_df.head())
print(summary_table)

How the imputation pipeline works:

simulate=True: This mode is for datasets that already have missing values. The pipeline will identify the missing data patterns and find the best imputation method for each column. This is the most common use case.
simulate=False: This mode is for evaluating the imputation methods on a complete dataset. You must specify a missingness_value (e.g., missingness_value=0.10 for 10%). The pipeline will artificially introduce missing values into your complete dataset and then impute them, allowing you to assess the performance of the different methods.
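
The idea of hiding known values, imputing them, and scoring the result can be sketched in a few lines of NumPy. This is a simplified illustration, not the pipeline's code: `mask_at_random`, `score_fillers`, and `best_per_column` are hypothetical names, `rate` plays the role that `missingness_value` is described as playing, and column mean and median stand in for the real imputation methods.

```python
import numpy as np

def mask_at_random(values, rate, rng):
    """Hide a fraction `rate` of the rows in each column, chosen at random."""
    values = np.asarray(values, dtype=float)
    n_rows, n_cols = values.shape
    n_hide = int(round(rate * n_rows))
    if n_hide == 0:
        raise ValueError("rate is too small to hide any value")
    hidden = np.zeros(values.shape, dtype=bool)
    for j in range(n_cols):
        hidden[rng.choice(n_rows, size=n_hide, replace=False), j] = True
    masked = values.copy()
    masked[hidden] = np.nan
    return masked, hidden

def score_fillers(complete, rate=0.10, seed=0):
    """Per-column RMSE of each candidate filler, measured on the hidden cells."""
    complete = np.asarray(complete, dtype=float)
    masked, hidden = mask_at_random(complete, rate, np.random.default_rng(seed))
    fillers = {"mean": np.nanmean, "median": np.nanmedian}  # stand-in methods
    scores = {}
    for name, stat in fillers.items():
        fill_values = stat(masked, axis=0)               # one value per column
        imputed = np.where(hidden, fill_values, masked)
        err = np.where(hidden, imputed - complete, 0.0)  # known truth vs. fill
        scores[name] = np.sqrt((err ** 2).sum(axis=0) / hidden.sum(axis=0))
    return scores

def best_per_column(scores):
    """Name of the lowest-RMSE filler for each column."""
    names = list(scores)
    stacked = np.vstack([scores[n] for n in names])
    return [names[i] for i in stacked.argmin(axis=0)]

# Hypothetical complete dataset: one symmetric and one skewed column.
rng = np.random.default_rng(42)
complete = np.column_stack([rng.normal(50, 5, 200), rng.lognormal(0, 1, 200)])
scores = score_fillers(complete, rate=0.10, seed=0)
print(scores)
print(best_per_column(scores))
```

Because the hidden values are known, the error of each method can be measured directly, which is what makes a per-column comparison of methods possible.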

For a detailed technical explanation of the imputation module, please refer to the Imputation Technical Report.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
