Autoembedder

PyTorch autoencoder with additional embeddings layer for categorical data 🚘

The Autoembedder

Introduction

The Autoembedder is an autoencoder with additional embedding layers for the categorical columns. Its usage is flexible, and hyperparameters like the number of layers can be easily adjusted and tuned. The data provided for training can be either a path to a Dask or Pandas DataFrame stored in the Parquet format or the DataFrame object directly.
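To make the idea of "embedding layers for the categorical columns" concrete, here is a minimal, library-free sketch of how one input row could be assembled before it reaches the encoder. It is not the Autoembedder implementation: `make_embedding_table` and `build_input_vector` are hypothetical helpers, and the random tables stand in for weights that a real model would learn during training.

```python
import random


def make_embedding_table(num_categories, dim, seed=0):
    """Create one vector of length `dim` per category (random stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_categories)]


def build_input_vector(cont_values, cat_codes, tables):
    """Concatenate continuous values with the embedding vector of each categorical code."""
    vector = list(cont_values)
    for code, table in zip(cat_codes, tables):
        vector.extend(table[code])  # look up the row that belongs to this category
    return vector


# Two categorical columns: 3 categories -> 2 dims, 5 categories -> 3 dims.
tables = [make_embedding_table(3, 2, seed=1), make_embedding_table(5, 3, seed=2)]
row = build_input_vector([0.5, 1.2], [2, 4], tables)
print(len(row))  # 2 continuous values + 2 + 3 embedding dims
```

The autoencoder then compresses and reconstructs this combined vector, so categorical and continuous columns share one latent representation.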

Installation

If you are using Poetry, you can install the package with the following command:

poetry add autoembedder

If you are using pip, you can install the package with the following command:

pip install autoembedder

Installing dependencies

With Poetry:

poetry install

With pip:

pip install -r requirements.txt

Usage

0. Some imports

from autoembedder import Autoembedder, dataloader, fit

1. Create dataloaders

First, we create two dataloaders: one for the training data and one for the validation data. As a source, each accepts a path to a Parquet file, a path to a folder of Parquet files, or a Pandas/Dask DataFrame.

train_dl = dataloader(train_df)
valid_dl = dataloader(valid_df)
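If you pass a path instead of a DataFrame, it can help to check beforehand which Parquet files that path actually covers. The helper below is a hypothetical pre-flight check built only on `pathlib`; it is not part of the autoembedder package and does not claim to mirror how `dataloader` resolves paths internally.

```python
from pathlib import Path


def list_parquet_files(source):
    """Return the Parquet files behind `source`, which may be a file or a folder."""
    path = Path(source)
    if path.is_dir():
        # A folder: collect every *.parquet file directly inside it, in a stable order.
        return sorted(path.glob("*.parquet"))
    if path.is_file():
        # A single file: return it as a one-element list.
        return [path]
    raise FileNotFoundError(f"No Parquet file or folder found at {path}")
```

Calling `list_parquet_files("path/to/your/train_data")` before creating the dataloader makes a wrong or empty path fail early with a clear message.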

2. Set parameters

Now, we need to set the parameters. They are used for handling the data and training the model. In this example, only training parameters are set; the Parameters section lists all available parameters. The following is enough for this example:

parameters = {
    "hidden_layers": [[25, 20], [20, 10]],
    "epochs": 10,
    "lr": 0.0001,
    "verbose": 1,
}
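The `hidden_layers` value above reads naturally as a list of `[in_features, out_features]` pairs, where each layer's output size matches the next layer's input size. Assuming that interpretation (it is not confirmed by this README), the following sketch checks that the pairs chain correctly and shows what a mirrored decoder would look like. Both helpers are hypothetical and not part of the package.

```python
def check_hidden_layers(hidden_layers):
    """Return True if every layer's output size equals the next layer's input size."""
    for (_, out_size), (in_size, _) in zip(hidden_layers, hidden_layers[1:]):
        if out_size != in_size:
            return False
    return True


def mirrored_decoder(hidden_layers):
    """Reverse the encoder layers and swap in/out sizes (assumed symmetric decoder)."""
    return [[out_size, in_size] for in_size, out_size in reversed(hidden_layers)]


encoder = [[25, 20], [20, 10]]
print(check_hidden_layers(encoder))
print(mirrored_decoder(encoder))
```

Running a check like this before training catches shape mismatches such as `[[25, 20], [18, 10]]` without starting a training run.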

3. Initialize the autoembedder

Then, we need to initialize the autoembedder. In this example, we are not using any categorical features, so we can skip the embedding_sizes argument.

model = Autoembedder(parameters, num_cont_features=train_df.shape[1])
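When categorical features are present, an embedding size has to be chosen for each of them. The sketch below uses the rule of thumb known from fastai's tabular module, `min(600, round(1.6 * n ** 0.56))`, to suggest a dimension per column. This is a swapped-in heuristic, not the method this package prescribes, and the `(number_of_categories, embedding_dimension)` pair format is an assumption about what `embedding_sizes` expects.

```python
def suggest_embedding_sizes(cardinalities, max_dim=600):
    """Suggest (num_categories, embedding_dim) pairs from the category counts.

    Heuristic borrowed from fastai's tabular rule of thumb; treat the result
    as a starting point for tuning, not as a fixed requirement.
    """
    return [(n, min(max_dim, round(1.6 * n ** 0.56))) for n in cardinalities]


# Hypothetical columns with 2, 10 and 1000 distinct categories.
print(suggest_embedding_sizes([2, 10, 1000]))
```

The cap at `max_dim` keeps very high-cardinality columns from dominating the input width.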

4. Train the model

Everything is set up. Now we can fit the model.

fit(parameters, model, train_dl, valid_dl)

Example

Check out this Jupyter notebook for an applied example using the Credit Card Fraud Detection dataset from Kaggle.

Parameters

This is a list of all parameters that can be passed to the Autoembedder for training. When using the training script, replace the _ in each parameter name with - and pass the parameters as command-line arguments. For boolean values, see the Comment column for how to pass them.
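The naming rule above can be sketched as a small converter from a parameter dictionary to command-line arguments. `to_cli_args` is a hypothetical helper, not part of the package. It assumes booleans are passed as a bare `--flag` or `--no-flag`, as the Comment column indicates, and that lists are passed as their string representation.

```python
def to_cli_args(parameters):
    """Convert a parameter dict into a flat list of command-line arguments."""
    args = []
    for name, value in parameters.items():
        dashed = name.replace("_", "-")  # e.g. hidden_layers -> hidden-layers
        if isinstance(value, bool):
            # Booleans become a bare flag: --drop-last or --no-drop-last.
            args.append(f"--{dashed}" if value else f"--no-{dashed}")
        else:
            # Everything else is passed as "--name value"; lists become strings.
            args.extend([f"--{dashed}", str(value)])
    return args


parameters = {"epochs": 20, "hidden_layers": [[12, 6], [6, 3]], "drop_last": False}
print(to_cli_args(parameters))
```

The resulting list can be appended to `["python3", "training.py"]` and handed to `subprocess.run`, which avoids manual shell quoting of the nested list.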

Run the training script

You can also use the training script directly:

python3 training.py \
--epochs 20 \
--train-input-path "path/to/your/train_data" \
--test-input-path "path/to/your/test_data" \
--hidden-layers "[[12, 6], [6, 3]]"

For help, run:

python3 training.py --help

| Argument | Type | Required | Default value | Comment |
| -------------------- | ----- | -------- | ----------------------------- | --------------------------------------------------------------- |
| batch_size | int | False | 32 | |
| drop_last | bool | False | True | --drop-last / --no-drop-last |
| pin_memory | bool | False | True | --pin-memory / --no-pin-memory |
| num_workers | int | False | 0 | 0 means that the data will be loaded in the main process |
| use_mps | bool | False | False | --use-mps / --no-use-mps |
| model_title | str | False | autoembedder_{datetime}.bin | |
| model_save_path | str | False | | |
| n_save_checkpoints | int | False | | |
| lr | float | False | 0.001 | |
| amsgrad | bool | False | False | --amsgrad / --no-amsgrad |
| epochs | int | True | | |
| dropout_rate | float | False | 0 | Dropout rate for the dropout layers in the encoder and decoder. |
| layer_bias | bool | False | True | --layer-bias / --no-layer-bias |
| weight_decay | float | False | | |
| l1_lambda | float | False | 0 | |
| xavier_init | bool | False | False | --xavier-init / --no-xavier-init |
| activation | str | False | tanh | Activation function; either tanh, relu, leaky_relu or elu |
| tensorboard_log_path | str | False | | |
| trim_eval_errors | bool | False | False | --trim-eval-errors / --no-t |
