Omikuji
An efficient implementation of Partitioned Label Trees (Prabhu et al., 2018) and its variations for extreme multi-label classification, written in Rust🦀 with love💖.
Features & Performance
Omikuji has been tested on datasets from the Extreme Classification Repository. All tests below were run on a quad-core Intel® Core™ i7-6700 CPU, allowing all cores to be utilized. We measured training time and calculated precisions at 1, 3, and 5. (Note that, due to randomness, results may vary from run to run, especially on smaller datasets.)
Parabel, better parallelized
Omikuji provides a more parallelized implementation of Parabel (Prabhu et al., 2018) that trains faster when more CPU cores are available. Compared to the original implementation written in C++, which can only utilize the same number of CPU cores as the number of trees (3 by default), Omikuji maintains the same level of precision but trains 1.3x to 1.7x faster on our quad-core machine. Further speed-up is possible if more CPU cores are available.
| Dataset         | Metric     | Parabel | Omikuji<br/>(balanced,<br/>cluster.k=2) |
|-----------------|------------|---------|------------------------------------------|
| EURLex-4K       | P@1        | 82.2    | 82.1                                     |
|                 | P@3        | 68.8    | 68.8                                     |
|                 | P@5        | 57.6    | 57.7                                     |
|                 | Train Time | 18s     | 14s                                      |
| Amazon-670K     | P@1        | 44.9    | 44.8                                     |
|                 | P@3        | 39.8    | 39.8                                     |
|                 | P@5        | 36.0    | 36.0                                     |
|                 | Train Time | 404s    | 234s                                     |
| WikiLSHTC-325K  | P@1        | 65.0    | 64.8                                     |
|                 | P@3        | 43.2    | 43.1                                     |
|                 | P@5        | 32.0    | 32.1                                     |
|                 | Train Time | 959s    | 659s                                     |
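To reproduce a setting like this from the command line, something along the following lines should work — a sketch that assumes the hyper-parameter in the table header is exposed as a CLI flag (`--cluster.k`), that balanced clustering is the default, and that a `--n_threads` flag controls parallelism (0 meaning auto-detect):

```shell
# Parabel-style trees: balanced 2-means clustering.
# --cluster.k and --n_threads are assumed flag names; --model_path appears in the section below.
omikuji train eurlex_train.txt --cluster.k 2 --n_threads 0 --model_path ./model
```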
Regular k-means for shallow trees
Following Bonsai (Khandagale et al., 2019), Omikuji supports using regular k-means instead of balanced 2-means clustering for tree construction, which results in wider, shallower, and unbalanced trees that train slower but have better precision. Compared to the original Bonsai implementation, Omikuji achieves the same precisions while training 2.6x to 4.6x faster on our quad-core machine. (Similarly, further speed-up is possible if more CPU cores are available.)
| Dataset         | Metric     | Bonsai  | Omikuji<br/>(unbalanced,<br/>cluster.k=100,<br/>max_depth=3) |
|-----------------|------------|---------|--------------------------------------------------------------|
| EURLex-4K       | P@1        | 82.8    | 83.0                                                         |
|                 | P@3        | 69.4    | 69.5                                                         |
|                 | P@5        | 58.1    | 58.3                                                         |
|                 | Train Time | 87s     | 19s                                                          |
| Amazon-670K     | P@1        | 45.5*   | 45.6                                                         |
|                 | P@3        | 40.3*   | 40.4                                                         |
|                 | P@5        | 36.5*   | 36.6                                                         |
|                 | Train Time | 5,759s  | 1,753s                                                       |
| WikiLSHTC-325K  | P@1        | 66.6*   | 66.6                                                         |
|                 | P@3        | 44.5*   | 44.4                                                         |
|                 | P@5        | 33.0*   | 33.0                                                         |
|                 | Train Time | 11,156s | 4,259s                                                       |
*Precision numbers as reported in the paper; our machine doesn't have enough memory to run the full prediction with their implementation.
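A corresponding CLI invocation might look like the following — a sketch assuming the table header's hyper-parameters map to flags (`--cluster.k`, `--max_depth`) and that regular k-means is enabled by a flag such as `--cluster.unbalanced`:

```shell
# Bonsai-style trees: regular (unbalanced) k-means with k=100 and depth capped at 3.
# Flag names are assumptions based on the hyper-parameter names in the table above.
omikuji train eurlex_train.txt --cluster.unbalanced --cluster.k 100 --max_depth 3 --model_path ./model
```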
Balanced k-means for balanced shallow trees
Sometimes it's desirable to have shallow and wide trees that are also balanced, in which case Omikuji supports the balanced k-means algorithm used by HOMER (Tsoumakas et al., 2008) for clustering as well.
| Dataset         | Metric     | Omikuji<br/>(balanced,<br/>cluster.k=100) |
|-----------------|------------|-------------------------------------------|
| EURLex-4K       | P@1        | 82.1                                      |
|                 | P@3        | 69.4                                      |
|                 | P@5        | 58.1                                      |
|                 | Train Time | 19s                                       |
| Amazon-670K     | P@1        | 45.4                                      |
|                 | P@3        | 40.3                                      |
|                 | P@5        | 36.5                                      |
|                 | Train Time | 1,153s                                    |
| WikiLSHTC-325K  | P@1        | 65.6                                      |
|                 | P@3        | 43.6                                      |
|                 | P@5        | 32.5                                      |
|                 | Train Time | 3,028s                                    |
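The same hedged pattern applies here, with balanced clustering (assumed to be the default) and a larger cluster count:

```shell
# HOMER-style trees: balanced k-means with k=100.
# --cluster.k is an assumed flag name.
omikuji train eurlex_train.txt --cluster.k 100 --model_path ./model
```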
Layer collapsing for balanced shallow trees
An alternative way to build balanced, shallow, and wide trees is to collapse adjacent layers, similar to the tree compression step used in AttentionXML (You et al., 2019): intermediate layers are removed, and their children replace them as the children of their parents. For example, with balanced 2-means clustering, if we collapse 5 layers after each layer, we can increase the tree arity from 2 to 2⁵⁺¹ = 64.
| Dataset         | Metric     | Omikuji<br/>(balanced,<br/>cluster.k=2,<br/>collapse 5 layers) |
|-----------------|------------|----------------------------------------------------------------|
| EURLex-4K       | P@1        | 82.4                                                           |
|                 | P@3        | 69.3                                                           |
|                 | P@5        | 58.0                                                           |
|                 | Train Time | 16s                                                            |
| Amazon-670K     | P@1        | 45.3                                                           |
|                 | P@3        | 40.2                                                           |
|                 | P@5        | 36.4                                                           |
|                 | Train Time | 460s                                                           |
| WikiLSHTC-325K  | P@1        | 64.9                                                           |
|                 | P@3        | 43.3                                                           |
|                 | P@5        | 32.3                                                           |
|                 | Train Time | 1,649s                                                         |
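As a sketch, assuming layer collapsing is exposed through a flag along the lines of `--collapse_every_n_layers`:

```shell
# Balanced 2-means trees with every 5 adjacent layers collapsed, giving arity 2^(5+1) = 64.
# --collapse_every_n_layers is an assumed flag name.
omikuji train eurlex_train.txt --cluster.k 2 --collapse_every_n_layers 5 --model_path ./model
```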
Build & Install
Omikuji can be easily built & installed with Cargo as a CLI app:
```shell
cargo install omikuji --features cli --locked
```
Or install from the latest source:
```shell
cargo install --git https://github.com/tomtung/omikuji.git --features cli --locked
```
The CLI app will be available as `omikuji`. For example, to reproduce the results on the EURLex-4K dataset:

```shell
omikuji train eurlex_train.txt --model_path ./model
omikuji test ./model eurlex_test.txt --out_path predictions.txt
```
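Both files are expected to follow the data format used by the Extreme Classification Repository: a header line giving the number of examples, features, and labels, followed by one line per example with its comma-separated labels and sparse `feature:value` pairs. A minimal sketch (the numbers are made up for illustration):

```
2 5 3
0,2 0:0.1 3:0.8
1 1:0.5 2:0.3 4:0.9
```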
Python Binding
A simple Python binding is also available for training and prediction. It can be installed via pip:
```shell
pip install omikuji
```
Note that you might still need to install Cargo should compilation become necessary.
You can also install from the latest source:
```shell
pip install git+https://github.com/tomtung/omikuji.git -v
```
The following script demonstrates how to use the Python binding to train a model and make predictions:
```python
import omikuji

# Train
hyper_param = omikuji.Model.default_hyper_param()
# Adjust hyper-parameters as needed
hyper_param.n_trees = 5
model = omikuji.Model.train_on_data("./eurlex_train.txt", hyper_param)

# Serialize & de-serialize
model.save("./model")
model = omikuji.Model.load("./model")
# Optionally densify model weights to trade off between prediction speed and memory usage
model.densify_weights(0.05)

# Predict
feature_value_pairs = [
    (0, 0.101468),
    (1, 0.554374),
    (2, 0.235760),
    (3, 0.065255),
    (8, 0.152305),
    (10, 0.155051),
    # ...
]
label_score_pairs = model.predict(feature_value_pairs)
```
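`label_score_pairs` is a list of (label, score) pairs. A minimal sketch of inspecting the predictions (the formatting is illustrative, not part of the API):

```python
# Print each predicted label with its score.
for label, score in label_score_pairs:
    print(f"Label {label}: score {score:.4f}")
```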
Usage
```
$ omikuji train --help
Train a new omikuji model
```