CategoricalEncodingBenchmark

Benchmarking different approaches for categorical encoding for tabular data

Reproducibility of results

Requirements

pip install -r requirements.txt

Benchmark the dataset

To benchmark encoders on your own dataset:

1. Install the libraries listed in requirements.txt.
2. Process the dataset as shown in notebooks/1-prepare-datasets.ipynb.
3. Add the name of the dataset to dataset_list in src/run_experiment.py.
4. Run python run_experiment.py.
5. Run notebooks/2-show-results.ipynb.

Used datasets and raw scores

All datasets except poverty_A, poverty_B, and poverty_C come from different domains; they differ in the number of observations and in the number of categorical and numerical features. The objective for every dataset is binary classification. Preprocessing was simple: I removed all time-based columns, so the remaining columns were either categorical or numerical. Details of the experiments can be found in my blog post: Benchmarking Categorical Encoders.
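The preprocessing described above can be illustrated with a minimal sketch. It is not the code from notebooks/1-prepare-datasets.ipynb: the helper `split_columns`, the toy rows, and the column names are all hypothetical, and it uses only the standard library so it runs without the project's requirements.

```python
from numbers import Number


def split_columns(rows, time_columns, target):
    """Drop time-based columns and sort the rest into categorical/numerical.

    `rows` is a list of dicts (one per observation). This helper is a
    hypothetical illustration, not part of the benchmark's code base.
    """
    drop = set(time_columns) | {target}
    columns = [c for c in rows[0] if c not in drop]
    categorical, numerical = [], []
    for col in columns:
        values = [r[col] for r in rows if r[col] is not None]
        # bool is a Number subclass in Python; treat it as categorical here.
        is_numeric = bool(values) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        (numerical if is_numeric else categorical).append(col)
    return categorical, numerical


# Toy data with made-up column names.
rows = [
    {"signup_date": "2019-01-03", "plan": "basic", "monthly_fee": 20.0, "churn": 0},
    {"signup_date": "2019-02-11", "plan": "premium", "monthly_fee": 55.5, "churn": 1},
    {"signup_date": "2019-02-27", "plan": "basic", "monthly_fee": None, "churn": 0},
]
categorical, numerical = split_columns(rows, time_columns=["signup_date"], target="churn")
print(categorical, numerical)
```

In this sketch the time-based columns are named explicitly by the caller rather than detected automatically, which keeps the decision about what counts as "time-based" visible.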

Table 1.1 Used datasets

| Name | Total points | Train points | Test points | Number of features | Number of categorical features | Short description |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Telecom | 7.0k | 4.2k | 2.8k | 20 | 16 | Churn prediction for telecom data |
| Adult | 48.8k | 29.3k | 19.5k | 15 | 8 | Predict whether a person's income exceeds 50k |
| Employee | 32.7k | 19.6k | 13.1k | 10 | 9 | Predict an employee's access needs, given his/her job role |
| Credit | 307.5k | 184.5k | 123k | 121 | 18 | Loan repayment |
| Mortgages | 45.6k | 27.4k | 18.2k | 20 | 9 | Predict whether a house mortgage is funded |
| Promotion | 54.8k | 32.8k | 21.9k | 13 | 5 | Predict whether an employee will get a promotion |
| Kick | 72.9k | 43.7k | 29.1k | 32 | 19 | Predict whether a car purchased at auction is a good or bad buy |
| Kdd_upselling | 50k | 30k | 20k | 230 | 40 | Predict up-selling for a customer |
| Taxi | 892.5k | 535.5k | 357k | 8 | 5 | Predict the probability of an offer being accepted by a certain driver |
| Poverty_A | 37.6k | 22.5k | 15.0k | 41 | 38 | Predict whether a given household in a given country is poor |
| Poverty_B | 20.2k | 12.1k | 8.1k | 224 | 191 | Predict whether a given household in a given country is poor |
| Poverty_C | 29.9k | 17.9k | 11.9k | 41 | 35 | Predict whether a given household in a given country is poor |

The ROC AUC scores for each dataset are presented in the tables below. Note: some experiments required too much memory to run, so the corresponding cells are empty.
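For readers unfamiliar with the metric, here is a minimal sketch of how ROC AUC can be computed for a binary target. It uses the pairwise (Mann-Whitney) definition and is not the scoring code used in the benchmark; the function name `roc_auc` and the toy labels and scores are assumptions made for illustration.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Quadratic in the number of samples, so only suitable for small inputs.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
print(roc_auc(y_true, [0.1, 0.4, 0.35, 0.8]))  # 0.75: three of four pairs ordered correctly
print(roc_auc(y_true, [0.5, 0.5, 0.5, 0.5]))   # 0.5: constant scores carry no ranking information
```

Under this definition, a score of 0.5 is what a model that cannot rank positives above negatives would obtain, and values below 0.5 mean the ranking is worse than chance on that test set.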

Table 1.2 ROC AUC scores for None Validation

| Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|:--------------------------|:----------:|:--------:|:-----------:|:---------:|:------------:|:------------:|:-------:|:----------------:|:-------:|:------------:|:------------:|:------------:|
| BackwardDifferenceEncoder | 0.6454 | 0.8555 | 0.5006 | 0.7442 | 0.5997 | 0.6482 | | | | 0.5149 | 0.5484 | 0.4945 |
| CatBoostEncoder | 0.7666 | 0.868 | 0.5004 | 0.7478 | 0.6279 | 0.7811 | 0.6583 | 0.8549 | 0.5477 | 0.5179 | 0.5638 | 0.5427 |
| FrequencyEncoder | 0.8405 | 0.9291 | 0.807 | 0.7593 | 0.6949 | 0.9052 | 0.7907 | 0.8643 | 0.5656 | 0.7276 | 0.6164 | 0.7177 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.83 | 0.7601 | 0.7001 | 0.9079 | | | | 0.7325 | 0.6343 | 0.7168 |
| JamesSteinEncoder | 0.7195 | 0.8688 | 0.5003 | 0.7485 | 0.6049 | 0.7984 | 0.6592 | 0.8516 | 0.5432 | 0.4918 | 0.5304 | 0.4836 |
| LeaveOneOutEncoder | 0.5 | 0.5214 | 0.6233 | 0.4957 | 0.5 | 0.5457 | 0.5027 | 0.5 | 0.5 | 0.5006 | 0.5002 | 0.4527 |
| MEstimateEncoder | 0.6944 | 0.8617 | 0.4998 | 0.7368 | 0.6086 | 0.8156 | 0.653 | 0.8448 | 0.5091 | 0.5254 | 0.434 | 0.4528 |
| OrdinalEncoder | 0.7409 | 0.8616 | 0.501 | 0.7445 | 0.6008 | 0.7124 | 0.6531 | 0.8448 | 0.5498 | 0.473 | 0.4683 | 0.5611 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | | | | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.7195 | 0.8696 | 0.5003 | 0.7483 | 0.6064 | 0.7971 | 0.6594 | 0.8483 | 0.5428 | 0.4955 | 0.5401 | 0.4751 |
| WOEEncoder | 0.7056 | 0.8645 | 0.5012 | 0.7439 | 0.615 | 0.7345 | 0.6398 | 0.844 | 0.5485 | 0.478 | 0.5356 | 0.4671 |

Table 1.3 ROC AUC scores for Single Validation

| Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|:--------------------------|:----------:|:--------:|:-----------:|:---------:|:------------:|:------------:|:-------:|:----------------:|:-------:|:------------:|:------------:|:------------:|
| BackwardDifferenceEncoder | 0.8382 | 0.9293 | 0.7569 | 0.7595 | 0.6894 | 0.9064 | | | | 0.7323 | 0.6151 | 0.7108 |
| CatBoostEncoder | 0.8392 | 0.9292 | 0.8498 | 0.7594 | 0.6951 | 0.8918 | 0.7901 | 0.8654 | 0.5844 | 0.7429 | 0.6902 | 0.7333 |
| FrequencyEncoder | 0.8392 | 0.9293 | 0.8138 | 0.7592 | 0.6937 | 0.9055 | 0.7902 | 0.8634 | 0.582 | 0.7302 | 0.6128 | 0.7195 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.8344 | 0.7597 | 0.7027 | 0.9083 | | | | 0.7297 | 0.6374 | 0.7196 |
| JamesSteinEncoder | 0.8388 | 0.9292 | 0.7817 | 0.7597 | 0.667 | 0.9053 | 0.5835 | 0.726 | 0.5898 | 0.7303 | 0.6764 | 0.7217 |
| LeaveOneOutEncoder | 0.5 | 0.5182 | 0.6121 | 0.4997 | 0.5 | 0.5403 | 0.4682 | 0.5 | 0.5 | 0.5103 | 0.5 | 0.4959 |
| MEstimateEncoder | 0.8394 | 0.929 | 0.7353 | 0.7593 | 0.6957 | 0.9054 | 0.5877 | 0.5953 | 0.5946 | 0.7302 | 0.6493 | 0.7076 |
| OrdinalEncoder | 0.8404 | 0.9299 | 0.8274 | 0.7585 | 0.6917 | 0.9078 | 0.7809 | 0.8465 | 0.6034 | 0.7337 | 0.6635 | 0.742 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | | | | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.8388 | 0.9293 | 0.815 | 0.7599 | 0.6702 | 0.9057 | 0.7042 | 0.713 | 0.5894 | 0.7292 | 0.6742 | 0.7207 |
| WOEEncoder | 0.8393 | 0.9294 | 0.8325 | 0.7599 | 0.6801 | 0.9056 | 0.7172 | 0.8391 | 0.5903 | 0.7279 | 0.6737 | 0.7224 |

Table 1.4 ROC AUC scores for Double Validation

| Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|:-------------------|:----------:|:--------:|:-----------:|:---------:|:------------:|:------------:|:-------:|:----------------:|:-------:|:------------:|:------------:|:------------:|
| CatBoostEncoder | 0.8394 | 0.9293 | 0.8529 | 0.7592 | 0.6967 | 0.9056 | 0.7899 | 0.8633 | 0.6031 | 0.7418 | 0.6902 | 0.7343 |
| FrequencyEncoder | 0.8371 | 0.9221 | 0.5563 | 0.755 | 0.6582 | 0.8749 | 0.7655 | 0.8551 | 0.5657 | 0.6873 | 0.6037 | 0.6961 |
| JamesSteinEncoder | 0.8398 | 0.9296 | 0.8489 | 0.7598 | 0.6981 | 0.905 | 0.7901 | 0.8628 | 0.6033 | 0.7412 | 0.6895 | 0.7366 |
| LeaveOneOutEncoder | 0.8393 | 0.9295 | 0.8496 | 0.7595 | 0.6963 | 0.9055 | 0.7902 | 0.8635 | 0.602 | 0.7416 | 0.6931 | 0.7345 |
| MEstimateEncoder | 0.8405 | 0.9292 | 0.8125 | 0.7597 | 0.6939 | 0.9063 | 0.7881 | 0.863 | 0.5984 | 0.7375 | 0.6801 | 0.7204 |
| TargetEncoder | 0.8393 | 0.9294 | 0.8537 | 0.7596 | 0.6954 | 0.9057 | 0.7909 | 0.8643 | 0.6025 | 0.7415 | 0.6903 | 0.7352 |
| WOEEncoder
