LimiX
Technical report: *LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence* (https://arxiv.org/abs/2509.03505)
💥 News
- 2025-11-10: LimiX-2M is officially released! Compared to LimiX-16M, this smaller variant offers significantly lower GPU memory usage and faster inference speed. The retrieval mechanism has also been enhanced, further improving model performance while reducing both inference time and memory consumption.
- 2025-08-29: LimiX V1.0 Released.
⚡ Latest Results Compared with SOTA Models
<div align="center"> <img src="./doc/BCCO-CLS.png" width="30%"> <img src="./doc/TabArena-CLS.png" width="30%"> <img src="./doc/TabZilla-CLS.png" width="30%"> </div> <div align="center"> <img src="./doc/BCCO-REG.png" width="30%"> <img src="./doc/TabArena-REG.png" width="30%"> <img src="./doc/CTR23-REG.png" width="30%"> </div>

➤ Overview
<div align="center"> <img src="./doc/LimiX_Summary.png" alt="LimiX summary" width="89%"> </div>

We introduce LimiX, the first installment of our LDM series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.

LimiX adopts a transformer architecture optimized for structured-data modeling and task generalization. The model first embeds the features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention is applied across both the sample and feature dimensions to identify salient patterns in key samples and features. The resulting high-dimensional representations are then passed to regression and classification heads, enabling the model to support diverse predictive tasks.
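As a rough illustration only (an assumed layer structure, not the released LimiX code), attention applied across both the sample and feature axes can be sketched in PyTorch:

```python
import torch
import torch.nn as nn

class DualAxisBlock(nn.Module):
    """Sketch: attend across samples, then across features (illustrative only)."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.sample_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.feature_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_samples, n_features, d_model) token grid
        # Attend across samples for each feature: treat features as the batch axis.
        xs = x.transpose(0, 1)                   # (n_features, n_samples, d_model)
        xs, _ = self.sample_attn(xs, xs, xs)
        x = xs.transpose(0, 1)                   # back to (n_samples, n_features, d_model)
        # Attend across features for each sample.
        xf, _ = self.feature_attn(x, x, x)
        return x + xf

tokens = torch.randn(32, 8, 64)   # 32 samples, 8 feature tokens, width 64
out = DualAxisBlock(64)(tokens)
print(out.shape)  # torch.Size([32, 8, 64])
```

The shape is preserved, so such blocks can be stacked before the task-specific regression and classification heads.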
For details, please refer to the technical report, *LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence* (https://arxiv.org/abs/2509.03505), or LimiX_Technical_Report.pdf.
➤ Superior Performance
The LimiX model achieves SOTA performance across multiple tasks.
➩ Classification
<div align="center"> <img src="doc/classification_tabarena_lite.png" width="60%"> </div> <div align="center"> <img src="doc/Classifier.png" width="45%" style="margin-right:2%;"> <img src="doc/TabArena_lite_CLS.png" width="42.5%"> </div>

➩ Regression
<div align="center"> <img src="doc/regression_tabarena_lite.png" width="60%"> </div> <div align="center"> <img src="doc/Regression.png" width="45%" style="margin-right:2%;"> <img src="doc/TabArena_REG.png" width="40.3%"> </div>

➩ Missing Values Imputation
<div align="center"> <img src="doc/MissingValueImputation.png" alt="Missing value imputation" width="60%"> </div>

➤ Tutorials
➩ Installation
Option 1 (recommended): Use the Dockerfile
Download the Dockerfile, then build the image:

```shell
docker build --network=host -t limix/infe:v1 --build-arg FROM_IMAGES=nvidia/cuda:12.2.0-base-ubuntu22.04 -f Dockerfile .
```
Option 2: Build manually
Download the prebuilt flash_attn wheel:

```shell
wget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
```

Install Python dependencies. Python 3.12.7 is assumed; note that the interpreter itself cannot be installed with pip, so set up Python 3.12 first (e.g. via conda or pyenv):

```shell
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
```
Download the source code:

```shell
git clone https://github.com/limix-ldm/LimiX.git
cd LimiX
```
➤ Inference
LimiX supports tasks such as classification, regression, and missing value imputation.
➩ Model download
| Model size | Download link | Tasks supported |
| --- | --- | --- |
| LimiX-16M | LimiX-16M.ckpt | ✅ classification ✅ regression ✅ missing value imputation |
| LimiX-2M | LimiX-2M.ckpt | ✅ classification ✅ regression |
➩ Interface description
Model Creation
```python
class LimiXPredictor:
    def __init__(self,
                 device: torch.device,
                 model_path: str,
                 inference_config: list | str,
                 mix_precision: bool = True,
                 categorical_features_indices: List[int] | None = None,
                 outlier_remove_std: float = 12,
                 softmax_temperature: float = 0.9,
                 task_type: Literal['Classification', 'Regression'] = 'Classification',
                 mask_prediction: bool = False,
                 inference_with_DDP: bool = False,
                 seed: int = 0)
```

(`inference_config` has no default value, so it is listed before the parameters that do.)
| Parameter | Data Type | Description |
| --- | --- | --- |
| device | torch.device | The device on which the model is loaded |
| model_path | str | The path of the model checkpoint to load |
| mix_precision | bool | Whether to enable mixed-precision inference |
| inference_config | list/str | Configuration used for inference |
| categorical_features_indices | list | Indices of the categorical columns in the tabular data |
| outlier_remove_std | float | Outlier-removal threshold, expressed as a multiple of the standard deviation |
| softmax_temperature | float | Temperature controlling the behavior of the softmax operator |
| task_type | str | Task type, either "Classification" or "Regression" |
| mask_prediction | bool | Whether to enable missing value imputation |
| inference_with_DDP | bool | Whether to enable DDP during inference |
| seed | int | Seed controlling random states |
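For illustration, here is a plausible argument set assembled from this table. The checkpoint name and config path are taken from elsewhere in this README; the actual import path of `LimiXPredictor` depends on the repository layout, so the constructor call is left as a comment:

```python
import torch

# Values drawn from the parameter table above; adjust paths to your setup.
predictor_kwargs = dict(
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    model_path="LimiX-16M.ckpt",                           # downloaded checkpoint
    inference_config="config/cls_default_retrieval.json",  # retrieval config
    mix_precision=True,
    categorical_features_indices=None,                     # or e.g. [0, 3]
    task_type="Classification",
    seed=0,
)
# model = LimiXPredictor(**predictor_kwargs)  # import path depends on repo layout
```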
Predict
```python
def predict(self, x_train: np.ndarray, y_train: np.ndarray, x_test: np.ndarray) -> np.ndarray:
```
| Parameter | Data Type | Description |
| --- | --- | --- |
| x_train | np.ndarray | Input features of the training set |
| y_train | np.ndarray | Target variable of the training set |
| x_test | np.ndarray | Input features of the test set |
Inference Configuration File Description
| Configuration File Name | Description | Difference |
| --- | --- | --- |
| cls_default_retrieval.json | Default classification inference configuration, with retrieval | Better classification performance |
| cls_default_noretrieval.json | Default classification inference configuration, without retrieval | Faster, lower memory requirements |
| reg_default_retrieval.json | Default regression inference configuration, with retrieval | Better regression performance |
| reg_default_noretrieval.json | Default regression inference configuration, without retrieval | Faster, lower memory requirements |
| reg_default_noretrieval_MVI.json | Default configuration for missing value imputation | |
➩ Ensemble Inference Based on Sample Retrieval
For a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the technical report.
Given its inference speed and memory requirements, ensemble inference based on sample retrieval currently requires hardware at least as capable as an NVIDIA RTX 4090 GPU.
Classification Task
```shell
python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```
Regression Task
```shell
python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```
Customizing Data Preprocessing for Inference Tasks
First, generate the inference configuration file by calling `generate_inference_config()`.
Classification Task
Single GPU or CPU
```shell
python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```
Multi-GPU Distributed Inference
```shell
torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
```
Regression Task
Single GPU or CPU
```shell
python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```
Multi-GPU Distributed Inference
```shell
torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
```
Retrieval Optimization Project
This project implements an optimized retrieval system. To achieve the best performance, we utilize Optuna for hyperparameter tuning of retrieval parameters.
Installation
Ensure you have the required dependencies installed:
```shell
pip install optuna
```
Usage
To search for optimized retrieval parameters on your dataset, use code like the following:

```python
searchInference = RetrievalSearchHyperparameters(
    dict(device_id=0, model_path=model_path), X_train, y_train, X_test, y_test,
)
config, result = searchInference.search(
    n_trials=10,
    metric="AUC",
    inference_config='config/cls_default_retrieval.json',
    task_type="cls",
)
```
This will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.
➩ Classification
```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
```
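Building on these imports, a minimal end-to-end classification sketch might look as follows. The data preparation runs as-is; the `LimiXPredictor` import path and checkpoint location are assumptions about your local setup, so those calls are shown as comments:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Load a small binary classification dataset and split it.
X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical predictor usage (import path and checkpoint path are assumptions):
# import torch
# model = LimiXPredictor(device=torch.device("cuda"),
#                        model_path="LimiX-16M.ckpt",
#                        inference_config="config/cls_default_retrieval.json")
# proba = model.predict(x_train, y_train, x_test)   # (n_test, n_classes) probabilities
# print("ACC:", accuracy_score(y_test, proba.argmax(axis=1)))
# print("AUC:", roc_auc_score(y_test, proba[:, 1]))
```

Note that `predict` takes the training set alongside the test set, since LimiX conditions on the training samples at inference time rather than being fit beforehand.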