SplitLight
An Exploratory Toolkit for Recommender Systems Datasets and Splits

🌟 SplitLight: Explore Your RecSys Dataset and Split
<a href="https://arxiv.org/abs/2602.19339"><img src="https://img.shields.io/badge/arXiv-2602.19339-b31b1b.svg" height=22.5></a>
SplitLight is a lightweight framework for auditing recommender-system datasets and evaluating splitting results. Its main goal is to help you produce trustworthy data preprocessing and splits and justify split choices via transparent, data-driven diagnostics. SplitLight can be used in Jupyter/Python scripts for comprehensive analysis and offers an easy-to-use Streamlit UI for interactive exploration, health checks, and side-by-side comparisons.
Why SplitLight?
- Trustworthy evaluation — Poor or inconsistent train/validation/test splits lead to overoptimistic metrics and non-reproducible research. SplitLight helps you detect leakage, cold-start issues, and distribution shifts before training.
- Transparent diagnostics — Instead of treating the split as a black box, you get concrete stats: shared interactions, temporal overlap, leaked targets, cold user/item shares, and temporal deltas between input and target.
- Flexible workflow — Use the Streamlit app for ad-hoc audits, or call
`src/stats` and `src/splits` from your own pipelines and notebooks (see the demo notebook).
[!NOTE] See the short video walkthrough of SplitLight's motivation and usage.
Quick Start
pip install -r requirements.txt
export PYTHONPATH="$(pwd):$PYTHONPATH"
export SEQ_SPLITS_DATA_PATH=$(pwd)/data
- Requirements file: `requirements.txt`
- Your datasets live under `data/` (see layout below).
Install the requirements and set the environment variables. Then run the Streamlit app as described here to get a data overview, or start a Jupyter notebook to explore the data and splits in depth (see the demo notebook).
Data Layout
SplitLight expects each dataset under data/<DatasetName>/ with either a raw.csv (original schema) or preprocessed.csv (standard schema).
- `raw.csv` (optional): original column names are defined in `runs/configs/dataset/<DatasetName>.yaml`
- `preprocessed.csv`: standardized columns: `user_id`, `item_id`, `timestamp` (seconds)
- After splitting, a per-split subfolder contains: `train.csv`, `validation_input.csv`, `validation_target.csv`, `test_input.csv`, `test_target.csv`
Example:
data/
├── Beauty/
│   ├── raw.csv                  # optional
│   ├── preprocessed.csv
│   └── leave-one-out/           # example split folder
│       ├── train.csv
│       ├── validation_input.csv
│       ├── validation_target.csv
│       ├── test_input.csv
│       └── test_target.csv
└── Diginetica/
    ├── preprocessed.csv
    └── GTS-q09-val_by_time-target_last/
        ├── train.csv
        ├── validation_input.csv
        ├── validation_target.csv
        ├── test_input.csv
        └── test_target.csv
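Before opening the UI, it can help to confirm that a dataset folder matches the layout above. The following is a minimal sketch using only the standard library; the helper names `missing_split_files` and `missing_columns` are hypothetical and not part of SplitLight, and the demo builds a throwaway directory so it runs without real data.

```python
import csv
import tempfile
from pathlib import Path

# File and column names are taken from the layout described above.
SPLIT_FILES = [
    "train.csv",
    "validation_input.csv",
    "validation_target.csv",
    "test_input.csv",
    "test_target.csv",
]
STANDARD_COLUMNS = ["user_id", "item_id", "timestamp"]


def missing_split_files(split_dir: Path) -> list:
    """Return the expected split files that are absent from split_dir (hypothetical helper)."""
    return [name for name in SPLIT_FILES if not (split_dir / name).is_file()]


def missing_columns(csv_path: Path) -> list:
    """Return standard-schema columns absent from the CSV header (hypothetical helper)."""
    with csv_path.open(newline="") as handle:
        header = next(csv.reader(handle), [])
    return [col for col in STANDARD_COLUMNS if col not in header]


# Demo on a temporary directory that mimics data/Beauty/leave-one-out/.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "Beauty"
    split_dir = dataset_dir / "leave-one-out"
    split_dir.mkdir(parents=True)
    (dataset_dir / "preprocessed.csv").write_text(
        "user_id,item_id,timestamp\n1,10,1700000000\n"
    )
    for name in SPLIT_FILES[:-1]:  # deliberately leave out test_target.csv
        (split_dir / name).write_text("user_id,item_id,timestamp\n")

    absent_files = missing_split_files(split_dir)
    absent_columns = missing_columns(dataset_dir / "preprocessed.csv")

print("Missing split files:", absent_files)
print("Missing columns:", absent_columns)
```

For real data, the same helpers could be pointed at a folder under the directory named by `SEQ_SPLITS_DATA_PATH`.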
Streamlit UI
Launch the app for interactive dataset and split audits.
export PYTHONPATH="$(pwd):$PYTHONPATH"
export SEQ_SPLITS_DATA_PATH=$(pwd)/data
streamlit run SplitLight.py
For a better experience, zoom the page out until it fits your screen.
What you can explore:
- Core and temporal statistics per subset and vs. reference
- Interactions distribution over time
- Repeated consumption patterns (non-unique and consecutive repeats)
- Temporal leakage: shared interactions, overlap, and “leakage from future”
- Cold-start exposure of users and items
- Compare splits side-by-side and analyze time-gap deltas between input and target
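To make the temporal leakage items in this list more concrete, here is a minimal pandas sketch. It is not SplitLight's implementation: the function names are hypothetical, and the definition of "leakage from future" used here (training interactions that occur after the earliest target timestamp) is an assumption chosen for illustration.

```python
import pandas as pd

KEYS = ["user_id", "item_id", "timestamp"]  # standard schema columns

train = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "item_id": [10, 11, 10, 12],
        "timestamp": [100, 200, 150, 400],
    }
)
test_target = pd.DataFrame(
    {
        "user_id": [1, 2],
        "item_id": [12, 13],
        "timestamp": [300, 350],
    }
)


def share_after_cutoff(train_df: pd.DataFrame, target_df: pd.DataFrame) -> float:
    """Share of training interactions later than the earliest target timestamp.

    Assumed definition of "leakage from future" for this sketch only.
    """
    cutoff = target_df["timestamp"].min()
    return float((train_df["timestamp"] > cutoff).mean())


def shared_interactions(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Interactions (same user, item, and timestamp) present in both subsets."""
    return a[KEYS].merge(b[KEYS], on=KEYS, how="inner").drop_duplicates()


future_share = share_after_cutoff(train, test_target)
shared = shared_interactions(train, test_target)

print(f"Share of train interactions after the target cutoff: {future_share:.2f}")
print(f"Interactions shared between train and target: {len(shared)}")
```

In this toy data, one of the four training interactions (timestamp 400) falls after the earliest target timestamp (300), and no interaction appears in both subsets.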
What SplitLight Checks
| Category | Description |
| --- | --- |
| Dataset and Subsets | Analyze raw and preprocessed data in terms of core and temporal statistics and compare. Identify repeated consumption patterns. Visualize interactions distribution over time. |
| Subsets and Splits | Analyze split data in terms of core and temporal statistics and compare subsets with full data. Identify and visualize presence of data leakage. Quantify and visualize user and item cold start. |
| Compare splits | Compare different splits in terms of core and temporal statistics. Identify distribution shifts for target subset. |
You can also run these checks manually using functions from the `src/stats` module for custom analyses or integration into your own pipelines (see demo notebook).
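For intuition about what a cold-start check measures, the sketch below computes cold user and item shares directly with pandas. It deliberately does not call `src/stats`, whose signatures are not shown here; `cold_share` is a hypothetical helper, and measuring the share over target rows (rather than over unique users or items) is an assumption.

```python
import pandas as pd

train = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "item_id": [10, 11, 12],
        "timestamp": [100, 200, 150],
    }
)
test_target = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "item_id": [10, 13, 11, 12],
        "timestamp": [300, 310, 320, 330],
    }
)


def cold_share(train_df: pd.DataFrame, target_df: pd.DataFrame, column: str) -> float:
    """Share of target rows whose `column` value never occurs in train (hypothetical helper)."""
    known = train_df[column].unique()
    is_cold = ~target_df[column].isin(known)
    return float(is_cold.mean())


cold_user_share = cold_share(train, test_target, "user_id")
cold_item_share = cold_share(train, test_target, "item_id")

print(f"Cold users in target: {cold_user_share:.2f}")
print(f"Cold items in target: {cold_item_share:.2f}")
```

Here users 3 and 4 and item 13 are unseen in training, so half of the target rows involve a cold user and a quarter involve a cold item.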
Streamlit Summary Page
The Summary page in the Streamlit UI provides a high-level overview of dataset and split health. It aggregates key diagnostics into a single dashboard, helping you quickly identify quality issues and distribution imbalances.
What It Provides
- Instant snapshot of dataset quality and split integrity
- Compact visualization of core, temporal, and leakage statistics
- Color-coded signals to highlight potential issues at a glance
Each metric is assigned a health status based on configurable thresholds:
- 🟢 OK — within expected bounds
- 🟡 Need Attention — mild irregularity detected
- 🔴 Warning — potential data issue or leakage risk
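The mapping from a metric value to one of these three statuses can be pictured as a simple two-threshold rule. This is only a sketch: the `Thresholds` class, the `health_status` function, and the numeric cut-offs are hypothetical, and the actual rules and values come from the Summary configuration, whose schema is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical cut-offs for a metric where larger values are worse."""

    attention: float  # at or above this value: "Need Attention"
    warning: float  # at or above this value: "Warning"


def health_status(value: float, thresholds: Thresholds) -> str:
    """Map a metric value to one of the three status labels listed above."""
    if value >= thresholds.warning:
        return "Warning"
    if value >= thresholds.attention:
        return "Need Attention"
    return "OK"


# Illustrative values only, e.g. for a cold-item share metric.
cold_item_thresholds = Thresholds(attention=0.05, warning=0.20)
statuses = [health_status(v, cold_item_thresholds) for v in (0.01, 0.10, 0.35)]

print(statuses)
```

A metric where smaller values are worse would need the comparisons reversed; the sketch covers only the "larger is worse" case.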
Configuration
Thresholds and color rules for the Summary view can be customized in
streamlit_ui/config/summary.yml.
Project Structure (Key Parts)
- `src/stats/`: Core diagnostics: `base` (core/temporal stats), `leaks`, `cold`, `duplicates`, `temporal`, `plots`. Use these in scripts or notebooks for custom analyses.
- `streamlit_ui/pages/`: Streamlit pages for load, Summary, core/temporal stats, repeated consumption, leakage, cold start, and split comparison.
- `runs/`: CLI entrypoints and Hydra configs: `preprocess.py`, `split.py`, `train_rs.py`; configs under `runs/configs/` (dataset, split, preprocess, train_rs, model).
FAQ
- Q: Can I use Parquet files?
  A: Yes. Both `.csv` and `.parquet` are supported. On the UI home page, choose the file format (e.g. `.parquet` or both).
- Q: Do I need `raw.csv`?
  A: No. You can provide only `preprocessed.csv` in the standard schema (`user_id`, `item_id`, `timestamp`). `raw.csv` is optional and only needed when you want to run the preprocessing pipeline from raw logs.
- Q: What time unit is `timestamp`?
  A: Seconds since epoch (Unix time). The preprocess step and all stats assume this; convert your timestamps before use if needed.
- Q: I only have raw interaction logs. How do I start?
  A: (1) Add a dataset config under `runs/configs/dataset/<Name>.yaml` mapping your columns to `user_id`, `item_id`, `timestamp`. (2) Put `raw.csv` (or raw data) under `data/<DatasetName>/`. (3) Run your own preprocessing script or use the example `python runs/preprocess.py +dataset=<DatasetName>` to get `preprocessed.csv`. (4) Run your own split script or use the example `python runs/split.py` to create a split, then open the Streamlit app or a Jupyter notebook (see demo notebook) to audit the dataset and split.
- Q: How do I use SplitLight in my own Python code?
  A: Use the stats API: import functions from `src.stats` (e.g. `leaks.get_leaks`, `cold.share_of_cold`, `base.base_stats`) and call them on your DataFrames. See the demo notebook for examples.
- Q: Why should I care about split quality?
  A: The split defines what you are actually evaluating. Leaky or inconsistent splits lead to overestimated metrics and results that don’t transfer to real deployment. SplitLight helps you document and justify your split choice and catch issues early.
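Since the FAQ notes that `timestamp` must be in Unix seconds, a conversion step is often needed before building `preprocessed.csv`. The sketch below shows two common cases with pandas; the raw column names `event_time` and `event_ms` are hypothetical, and treating the datetime strings as UTC is an assumption you should adjust to your data.

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "user_id": [1, 2],
        "item_id": [10, 11],
        "event_time": ["2024-01-01 00:00:00", "2024-01-01 00:01:00"],  # datetime strings
        "event_ms": [1704067200000, 1704067260000],  # milliseconds since epoch
    }
)

# Case 1: datetime strings -> whole seconds since the Unix epoch (assumed UTC).
parsed = pd.to_datetime(raw["event_time"], utc=True)
epoch = pd.Timestamp("1970-01-01", tz="UTC")
seconds_from_strings = (parsed - epoch) // pd.Timedelta(seconds=1)

# Case 2: millisecond timestamps -> seconds (integer division drops the remainder).
seconds_from_ms = raw["event_ms"] // 1000

# Assemble a frame in the standard schema.
preprocessed = raw[["user_id", "item_id"]].assign(timestamp=seconds_from_strings)

print(preprocessed)
print(seconds_from_ms.tolist())
```

Both cases yield the same values for this toy data, which is a cheap sanity check that the unit conversion is right.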
CLI Utilities For Experimenting
These CLI tools are provided to illustrate a complete pipeline for preprocessing and splitting datasets. The results of the preprocessing and splitting can be audited using SplitLight. To train a sequential model on the split data and evaluate, how different data p
