CategoricalEncodingBenchmark

Benchmarking different approaches for categorical encoding for tabular data

Reproducibility of results

Requirements

pip install -r requirements.txt

Benchmark the dataset

To benchmark encoders for your dataset:

  1. Install the libraries listed in requirements.txt

  2. Process the dataset as shown in notebooks/1-prepare-datasets.ipynb

  3. Add the name of the dataset to dataset_list in src/run_experiment.py

  4. Run python run_experiment.py

  5. Run notebooks/2-show-results.ipynb

Used datasets and raw scores

All datasets except poverty_A, poverty_B, and poverty_C come from different domains; they differ in the number of observations and in the number of categorical and numerical features. The objective for every dataset is binary classification. Preprocessing was simple: I removed all time-based columns, so the remaining columns are either categorical or numerical. Details of the experiments can be found in my blog post: Benchmarking Categorical Encoders.
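The preprocessing rule described above (drop time-based columns, then treat every remaining column as either categorical or numerical) can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `split_columns`, the `target` argument, and the sample rows are hypothetical and are not taken from the repository's notebook.

```python
from datetime import date, datetime


def split_columns(rows, target):
    """Drop time-based columns and split the rest into categorical / numerical.

    `rows` is a list of dicts (one dict per observation); `target` is the
    name of the label column, which is left out of all three lists.
    """
    categorical, numerical, dropped = [], [], []
    for name in rows[0]:
        if name == target:
            continue
        # Ignore missing values when inferring the column type.
        values = [row.get(name) for row in rows if row.get(name) is not None]
        if any(isinstance(v, (date, datetime)) for v in values):
            dropped.append(name)  # time-based column: removed
        elif all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            numerical.append(name)
        else:
            categorical.append(name)
    return categorical, numerical, dropped


# Hypothetical sample data, loosely shaped like a churn dataset.
rows = [
    {"signup_date": date(2019, 1, 5), "plan": "basic", "monthly_charge": 29.9, "churn": 0},
    {"signup_date": date(2019, 3, 9), "plan": "premium", "monthly_charge": 79.0, "churn": 1},
]
categorical, numerical, dropped = split_columns(rows, target="churn")
```

Here `plan` ends up in the categorical list, `monthly_charge` in the numerical list, and `signup_date` is dropped.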

Table 1.1 Used datasets

| Name | Total points | Train points | Test points | Number of features | Number of categorical features | Short description |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Telecom | 7.0k | 4.2k | 2.8k | 20 | 16 | Churn prediction for telecom data |
| Adult | 48.8k | 29.3k | 19.5k | 15 | 8 | Predict if a person's income is greater than 50k |
| Employee | 32.7k | 19.6k | 13.1k | 10 | 9 | Predict an employee's access needs, given his/her job role |
| Credit | 307.5k | 184.5k | 123k | 121 | 18 | Loan repayment |
| Mortgages | 45.6k | 27.4k | 18.2k | 20 | 9 | Predict if a house mortgage is funded |
| Promotion | 54.8k | 32.8k | 21.9k | 13 | 5 | Predict if an employee will get a promotion |
| Kick | 72.9k | 43.7k | 29.1k | 32 | 19 | Predict if a car purchased at auction is a good or bad buy |
| Kdd_upselling | 50k | 30k | 20k | 230 | 40 | Predict up-selling for a customer |
| Taxi | 892.5k | 535.5k | 357k | 8 | 5 | Predict the probability of an offer being accepted by a certain driver |
| Poverty_A | 37.6k | 22.5k | 15.0k | 41 | 38 | Predict whether a given household in a given country is poor |
| Poverty_B | 20.2k | 12.1k | 8.1k | 224 | 191 | Predict whether a given household in a given country is poor |
| Poverty_C | 29.9k | 17.9k | 11.9k | 41 | 35 | Predict whether a given household in a given country is poor |

The ROC AUC scores for each dataset are presented in the tables below. Note: some experiments required too much memory to run, so some values are missing.
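For readers less familiar with the metric, here is a minimal plain-Python sketch of ROC AUC computed from ranks (the Mann-Whitney U formulation). It is not the benchmark's own scoring code; the function name `roc_auc` and the toy labels and scores are assumptions made for illustration. A score of 0.5 corresponds to a ranking no better than chance, and 1.0 to a perfect ranking.

```python
def roc_auc(y_true, y_score):
    """ROC AUC via the rank-sum (Mann-Whitney U) formula.

    `y_true` holds 0/1 labels, `y_score` the predicted scores.
    Tied scores receive the average rank of their tie group.
    """
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied scores.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC is undefined when only one class is present")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: three of the four positive/negative pairs are ordered correctly.
auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75

# A model that outputs the same score for every row cannot rank anything.
auc_constant = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])  # 0.5
```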

Table 1.2 ROC AUC scores for None Validation

| | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|:--------------------------|:----------:|:--------:|:-----------:|:---------:|:------------:|:------------:|:-------:|:----------------:|:-------:|:------------:|:------------:|:------------:|
| BackwardDifferenceEncoder | 0.6454 | 0.8555 | 0.5006 | 0.7442 | 0.5997 | 0.6482 | | | | 0.5149 | 0.5484 | 0.4945 |
| CatBoostEncoder | 0.7666 | 0.868 | 0.5004 | 0.7478 | 0.6279 | 0.7811 | 0.6583 | 0.8549 | 0.5477 | 0.5179 | 0.5638 | 0.5427 |
| FrequencyEncoder | 0.8405 | 0.9291 | 0.807 | 0.7593 | 0.6949 | 0.9052 | 0.7907 | 0.8643 | 0.5656 | 0.7276 | 0.6164 | 0.7177 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.83 | 0.7601 | 0.7001 | 0.9079 | | | | 0.7325 | 0.6343 | 0.7168 |
| JamesSteinEncoder | 0.7195 | 0.8688 | 0.5003 | 0.7485 | 0.6049 | 0.7984 | 0.6592 | 0.8516 | 0.5432 | 0.4918 | 0.5304 | 0.4836 |
| LeaveOneOutEncoder | 0.5 | 0.5214 | 0.6233 | 0.4957 | 0.5 | 0.5457 | 0.5027 | 0.5 | 0.5 | 0.5006 | 0.5002 | 0.4527 |
| MEstimateEncoder | 0.6944 | 0.8617 | 0.4998 | 0.7368 | 0.6086 | 0.8156 | 0.653 | 0.8448 | 0.5091 | 0.5254 | 0.434 | 0.4528 |
| OrdinalEncoder | 0.7409 | 0.8616 | 0.501 | 0.7445 | 0.6008 | 0.7124 | 0.6531 | 0.8448 | 0.5498 | 0.473 | 0.4683 | 0.5611 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | | | | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.7195 | 0.8696 | 0.5003 | 0.7483 | 0.6064 | 0.7971 | 0.6594 | 0.8483 | 0.5428 | 0.4955 | 0.5401 | 0.4751 |
| WOEEncoder | 0.7056 | 0.8645 | 0.5012 | 0.7439 | 0.615 | 0.7345 | 0.6398 | 0.844 | 0.5485 | 0.478 | 0.5356 | 0.4671 |

Table 1.3 ROC AUC scores for Single Validation

| | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|:--------------------------|:----------:|:--------:|:-----------:|:---------:|:------------:|:------------:|:-------:|:----------------:|:-------:|:------------:|:------------:|:------------:|
| BackwardDifferenceEncoder | 0.8382 | 0.9293 | 0.7569 | 0.7595 | 0.6894 | 0.9064 | | | | 0.7323 | 0.6151 | 0.7108 |
| CatBoostEncoder | 0.8392 | 0.9292 | 0.8498 | 0.7594 | 0.6951 | 0.8918 | 0.7901 | 0.8654 | 0.5844 | 0.7429 | 0.6902 | 0.7333 |
| FrequencyEncoder | 0.8392 | 0.9293 | 0.8138 | 0.7592 | 0.6937 | 0.9055 | 0.7902 | 0.8634 | 0.582 | 0.7302 | 0.6128 | 0.7195 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.8344 | 0.7597 | 0.7027 | 0.9083 | | | | 0.7297 | 0.6374 | 0.7196 |
| JamesSteinEncoder | 0.8388 | 0.9292 | 0.7817 | 0.7597 | 0.667 | 0.9053 | 0.5835 | 0.726 | 0.5898 | 0.7303 | 0.6764 | 0.7217 |
| LeaveOneOutEncoder | 0.5 | 0.5182 | 0.6121 | 0.4997 | 0.5 | 0.5403 | 0.4682 | 0.5 | 0.5 | 0.5103 | 0.5 | 0.4959 |
| MEstimateEncoder | 0.8394 | 0.929 | 0.7353 | 0.7593 | 0.6957 | 0.9054 | 0.5877 | 0.5953 | 0.5946 | 0.7302 | 0.6493 | 0.7076 |
| OrdinalEncoder | 0.8404 | 0.9299 | 0.8274 | 0.7585 | 0.6917 | 0.9078 | 0.7809 | 0.8465 | 0.6034 | 0.7337 | 0.6635 | 0.742 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | | | | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.8388 | 0.9293 | 0.815 | 0.7599 | 0.6702 | 0.9057 | 0.7042 | 0.713 | 0.5894 | 0.7292 | 0.6742 | 0.7207 |
| WOEEncoder | 0.8393 | 0.9294 | 0.8325 | 0.7599 | 0.6801 | 0.9056 | 0.7172 | 0.8391 | 0.5903 | 0.7279 | 0.6737 | 0.7224 |

Table 1.4 ROC AUC scores for Double Validation

| | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|:-------------------|:----------:|:--------:|:-----------:|:---------:|:------------:|:------------:|:-------:|:----------------:|:-------:|:------------:|:------------:|:------------:|
| CatBoostEncoder | 0.8394 | 0.9293 | 0.8529 | 0.7592 | 0.6967 | 0.9056 | 0.7899 | 0.8633 | 0.6031 | 0.7418 | 0.6902 | 0.7343 |
| FrequencyEncoder | 0.8371 | 0.9221 | 0.5563 | 0.755 | 0.6582 | 0.8749 | 0.7655 | 0.8551 | 0.5657 | 0.6873 | 0.6037 | 0.6961 |
| JamesSteinEncoder | 0.8398 | 0.9296 | 0.8489 | 0.7598 | 0.6981 | 0.905 | 0.7901 | 0.8628 | 0.6033 | 0.7412 | 0.6895 | 0.7366 |
| LeaveOneOutEncoder | 0.8393 | 0.9295 | 0.8496 | 0.7595 | 0.6963 | 0.9055 | 0.7902 | 0.8635 | 0.602 | 0.7416 | 0.6931 | 0.7345 |
| MEstimateEncoder | 0.8405 | 0.9292 | 0.8125 | 0.7597 | 0.6939 | 0.9063 | 0.7881 | 0.863 | 0.5984 | 0.7375 | 0.6801 | 0.7204 |
| TargetEncoder | 0.8393 | 0.9294 | 0.8537 | 0.7596 | 0.6954 | 0.9057 | 0.7909 | 0.8643 | 0.6025 | 0.7415 | 0.6903 | 0.7352 |
| WOEEncoder
