Reginx

Reginx is short for recommendation engine X. I plan to build most parts of a modern recommendation engine from scratch.
The initial plan includes:

Popular machine learning models like CF, FM, XGBoost, TwoTower, W&D, DeepFM, DCN, MaskNet, SASRec, Bert4Rec, Transformer, etc.
Online inference service written in Go, including candidate generator, ranking, and re-ranking layers
Feature engineering and preprocessing, including both online and offline parts
Diversity approaches, like MMR, DPP
Deduplication approaches, like LSH or BloomFilter
Training data pipeline
Model registry, monitoring and versioning

Supported models

TensorFlow 2 and Google Cloud are used for model training and performance tracking. The conda environment config is here.
I have a personal blog on Substack explaining the models, and the corresponding links are in the table below.

| Model | Paper | Code | Blog |
| ------------- | ------------- | ------------- | ------------- |
| Factorization Machines | Factorization Machines | Code | Post |
| DeepFM | DeepFM: A Factorization-Machine based Neural Network for CTR Prediction | Code | Post |
| XDeepFM | xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems | Code | Post |
| AutoInt | AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks | Code | Post |
| DCN | Deep & Cross Network for Ad Click Predictions | Code | Post |
| DCN V2 | DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems | Code | Post |
| DLRM | Deep Learning Recommendation Model for Personalization and Recommendation Systems | Code | Post |
| FinalMLP | FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction | DualMLP, FinalMLP | Post |
| MaskNet | MaskNet: Introducing Feature-Wise Multiplication to CTR Ranking Models by Instance-Guided Mask | Code | Post |
| TwoTower | Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations; Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations | Code | Post1, Post2, Post3 |
| Wide and Deep | Wide & Deep Learning for Recommender Systems | Code | Post |
| Transformer | Attention Is All You Need | Code | Post |
| BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Code | Post |
| SASRec | Self-Attentive Sequential Recommendation | Code | Post |
| BERT4REC | BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer | Code | Post |
| ESMM | Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate | Code | Post |
| MMoE | Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts | Code | Post |

Local Training

Here is an example of training a two-tower model on a local machine.

Setup Conda

Set up your conda environment using the conda config here.

conda env create -f environment.yml
conda activate tf

Set your PYTHONPATH to the root folder of this project, either in the current shell or in your bashrc:

export PYTHONPATH=/your_project_folder/reginx

Prepare Movielens Training Data

You can run this script to generate the meta and training data in your local directory. By default, it uses movielens-1m from TensorFlow Datasets.
Then copy your dataset files to your local /tmp/train, /tmp/test, and /tmp/item folders. Note that the TwoTower model implementation requires three kinds of files: train files for training, test files for evaluation, and item files for mixing in global negative samples.
If you want to use a dataset other than MovieLens, prepare your own dataset and save it to your local directory.
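For a custom dataset, the three kinds of files described above could be derived roughly as follows. This is a minimal, dependency-free sketch and not the project's own script: the helper name `split_interactions`, the `timestamp` field, and the chronological split are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Interaction = Dict[str, int]


def split_interactions(
    interactions: List[Interaction],
    test_fraction: float = 0.2,
    item_id_key: str = "movie_id",
) -> Tuple[List[Interaction], List[Interaction], List[int]]:
    """Split interactions chronologically and collect the unique item ids.

    Hypothetical helper: returns (train, test, items), mirroring the
    train / test / item file kinds the TwoTower implementation expects.
    """
    # Oldest interactions go to train, the most recent ones to test.
    ordered = sorted(interactions, key=lambda row: row["timestamp"])
    n_test = int(len(ordered) * test_fraction)
    split_at = len(ordered) - n_test
    train, test = ordered[:split_at], ordered[split_at:]
    # The item set covers the whole corpus so that global negatives can be
    # drawn from items that do not appear in the current batch.
    items = sorted({row[item_id_key] for row in ordered})
    return train, test, items
```

Each returned list would then be serialized in whatever format the trainer reads and copied to the matching /tmp/train, /tmp/test, or /tmp/item folder.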

Check Config File

There is an example config file for candidate-retriever training.
If you want to use a dataset other than MovieLens, prepare your own query and candidate embedding classes.

model:
  temperature: 0.05
  # specify training model under models folder
  base_model: TwoTower
  # specify query embedding model under models/features folder
  query_emb: MovieLensQueryEmb
  # specify candidate embedding model under models/features folder
  candidate_emb: MovieLensCandidateEmb
  # specify the unique key for candidates
  item_id_key: movie_id

train:
  # specify task under tasks folder
  task_name: CandidateRetrieverTrain
  epochs: 1
  batch_size: 256
  mixed_negative_batch_size: 128
  learning_rate: 0.05
  train_data: movielens/data/ratings_train
  test_data: movielens/data/ratings_test
  candidate_data: movielens/data/movies
  meta_data: trainer/meta/movie_lens.json
  model_dir: trainer/saved_models/movielens_cr
  log_dir: logs
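To make the roles of `temperature`, `batch_size`, and `mixed_negative_batch_size` more concrete, here is a small pure-Python sketch of a softmax loss over in-batch plus global negatives. It is an illustration under stated assumptions, not the project's TwoTower code: the function names are hypothetical, embeddings are passed in as plain lists, and the sampling-bias (logQ) correction from the referenced papers is left out.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mixed_negative_softmax_loss(
    queries: List[Vector],
    positives: List[Vector],
    global_negatives: List[Vector],
    temperature: float = 0.05,
) -> float:
    """Mean softmax cross-entropy over in-batch plus global negatives.

    Row i of `positives` is the true candidate for row i of `queries`.
    Every other row of `positives` (in-batch negatives) and every row of
    `global_negatives` (sampled from the item corpus) acts as a negative.
    """
    candidates = list(positives) + list(global_negatives)
    total = 0.0
    for i, query in enumerate(queries):
        # A lower temperature sharpens the softmax over candidates.
        logits = [dot(query, cand) / temperature for cand in candidates]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[i]
    return total / len(queries)
```

With the example config, each query would be scored against 256 in-batch candidates plus 128 extra candidates drawn from the item files, assuming those two settings map onto `positives` and `global_negatives` in this way.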

Training

Run the script below in your activated conda environment, specifying the config file.

python trainer/local_train.py -c movielens_candidate_retriever

By default, the training metrics are shown once per 1000 training steps for faster training. You can change this by tuning the steps_per_execution hyperparameter when compiling the model.
After training, evaluation runs on the test dataset. You should see metrics like:

391/391 [==============================] - 50s 129ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0036 - factorized_top_k/top_5_categorical_accuracy: 0.0181 - factorized_top_k/top_10_categorical_accuracy: 0.0349 - factorized_top_k/top_50_categorical_accura
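The `top_k_categorical_accuracy` values in this log can be read as the fraction of test queries whose true item lands among the k highest-scoring candidates. The sketch below shows that computation in plain Python. It is a simplified illustration rather than the metric implementation used in training, and the function name and input layout are assumptions.

```python
from typing import List, Sequence


def top_k_accuracy(
    scores: List[Sequence[float]],
    true_indices: List[int],
    k: int,
) -> float:
    """Fraction of queries whose true candidate is among the k best scores.

    scores[i][j] is the score of candidate j for query i, and
    true_indices[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for row, true_idx in zip(scores, true_indices):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(scores)
```

Under this reading, a top-10 value of 0.0349 means that about 3.5% of test interactions had the watched movie ranked in the first ten retrieved candidates.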
