KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos


KuaiRand is an unbiased sequential recommendation dataset collected from the recommendation logs of the video-sharing mobile app, Kuaishou (快手). It is the first recommendation dataset with millions of intervened interactions of randomly exposed items inserted in the standard recommendation feeds!

Other related open-sourced datasets are KuaiRec and KuaiSAR.

Overview:

The following figure gives an example of the dataset. It illustrates a user interaction sequence along with the user's rich feedback signals.

[Figure: kuaidata, an example user interaction sequence with feedback signals]

These feedback signals are collected from the two main user interfaces (UI) of the Kuaishou app, shown below:

<img src="figs/kuaishou-app.png" alt="kuaishou-app" style="display: block;margin: auto;zoom:35%;" />

In the random exposure stage, each recommended video in the dataset has an equal probability of being replaced by a random video sampled from an item pool. About 0.37% of the interactions are replaced in the final results.
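The random exposure step described above can be illustrated with a small simulation. This is only a sketch, not the production logic used at Kuaishou: the function name `insert_random_exposures`, the replacement probability of 0.0037 (taken from the reported share of replaced interactions), and the synthetic feed are all assumptions made for illustration. The pool size of 7,583 matches the candidate pool size reported for the dataset.

```python
import random


def insert_random_exposures(feed, item_pool, replace_prob, rng):
    """Replace each recommended video, independently and with the same
    probability, by a video drawn uniformly from the item pool.

    Returns a list of (video_id, was_replaced) pairs.
    """
    exposed = []
    for video_id in feed:
        if rng.random() < replace_prob:
            exposed.append((rng.choice(item_pool), True))
        else:
            exposed.append((video_id, False))
    return exposed


rng = random.Random(0)                    # fixed seed for reproducibility
feed = list(range(100_000, 200_000))      # synthetic recommended video IDs (assumption)
pool = list(range(7583))                  # synthetic candidate pool of 7,583 videos
exposed = insert_random_exposures(feed, pool, 0.0037, rng)

# Share of interactions that ended up being random exposures.
share = sum(flag for _, flag in exposed) / len(exposed)
print(f"replaced share: {share:.2%}")
```

Because every position is replaced with the same probability, the randomly exposed interactions are spread across the whole feed rather than concentrated in particular sessions or positions.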

Advantages:

Compared with other datasets with random exposure, KuaiRand has the following advantages:

  • ✅ It is the first sequential recommendation dataset with millions of intervened interactions of randomly exposed items inserted in the standard recommendation feeds.
  • ✅ It has the most comprehensive side information including explicit user IDs, interaction timestamps, and rich features for users and items.
  • ✅ It has 15 policies, each catering to a specific recommendation scenario in the Kuaishou App.
  • ✅ We introduce 12 feedback signals (e.g., click, like, and view time) for each interaction to describe the user's comprehensive feedback.
  • ✅ Each user has thousands of historical interactions on average.
  • ✅ It has three versions to support various research directions in recommendation.

If you find it helpful, please cite our paper:

@inproceedings{gao2022kuairand,
  title = {KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos},
  author = {Gao, Chongming and Li, Shijun and Zhang, Yuan and Chen, Jiawei and Li, Biao and Lei, Wenqiang and Jiang, Peng and He, Xiangnan},
  url = {https://doi.org/10.1145/3511808.3557624},
  doi = {10.1145/3511808.3557624},
  booktitle = {Proceedings of the 31st ACM International Conference on Information and Knowledge Management},
  series = {CIKM '22},
  location = {Atlanta, GA, USA},
  numpages = {5},
  year = {2022},
  pages = {3953--3957}
}

Download the data:

KuaiRand has been shared at https://zenodo.org/records/10439422.

OPTION 1: Download via your browser:

You can download the dataset from the Zenodo site or from this Chinese site.

OPTION 2: Download via the `wget` command-line tool:

For the KuaiRand-Pure dataset:

wget https://zenodo.org/records/10439422/files/KuaiRand-Pure.tar.gz # (md5:0820331067a3784d9691136f772b35a7)
tar -xzvf KuaiRand-Pure.tar.gz

For the KuaiRand-1K dataset:

wget https://zenodo.org/records/10439422/files/KuaiRand-1K.tar.gz # (md5:6b0b9c8222d67fcd4c676218edca3f1f)
tar -xzvf KuaiRand-1K.tar.gz

For the KuaiRand-27K dataset:

wget https://zenodo.org/records/10439422/files/KuaiRand-27K.tar.gz # (md5:3e3c799a24e2d23a4d2c757fbf9adf59)
tar -xzvf KuaiRand-27K.tar.gz
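The MD5 digests given in the comments above can be used to check a download before extracting it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` are hypothetical and not part of the KuaiRand release.

```python
import hashlib
import os

# MD5 digests copied from the comments in the wget commands above.
EXPECTED_MD5 = {
    "KuaiRand-Pure.tar.gz": "0820331067a3784d9691136f772b35a7",
    "KuaiRand-1K.tar.gz": "6b0b9c8222d67fcd4c676218edca3f1f",
    "KuaiRand-27K.tar.gz": "3e3c799a24e2d23a4d2c757fbf9adf59",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that the multi-gigabyte archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the archive at `path` matches its published digest.

    The archive is identified by its file name; an unknown name raises KeyError.
    """
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected
```

For example, `verify_download("KuaiRand-Pure.tar.gz")` should return `True` for an intact download in the current directory; a `False` result suggests the file is incomplete or corrupted and should be downloaded again.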

Attention: We further provide two supplementary files that enrich the KuaiRand dataset with video-level content-side semantic information, namely kuairand_video_categories.csv and kuairand_video_captions.csv. Detailed descriptions and download links for these two files are provided here.


Three Versions and Suggestions:

We release three versions of KuaiRand for different uses:

  • KuaiRand-27K (23GB logs + 23GB features): the complete KuaiRand dataset, with over 27K users and 32 million videos.
  • KuaiRand-1K (829MB logs + 3.5GB features): 1,000 users randomly sampled from KuaiRand-27K, with all videos irrelevant to these users removed. About 4 million videos remain.
  • KuaiRand-Pure (184MB logs + 10MB features): keeps only the logs for the 7,583 videos in the candidate pool.

The user_id and video_id are re-indexed. A visualization of their ID spaces is shown as follows.

[Figure: kuaidata, visualization of the user_id and video_id spaces]

The basic statistics of the three versions are summarized as follows:

| Dataset | Collection Policy | #Users | #Items | #Interactions | #User features | #Item features | Feedback | Timestamp |
| ---------------- | :----------------: | :----: | :--------: | :-----------: | :--------------: | :--------------: | :-------------: | :---------: |
| KuaiRand-27K | 15 policies | 27,285 | 32,038,725 | 322,278,385 | 30 | 62 | 12 signals | ✔️ |
| | Random policy | 27,285 | 7,583 | 1,186,059 | 30 | 62 | 12 signals | ✔️ |
| KuaiRand-1K | 15 policies | 1,000 | 4,369,953 | 11,713,045 | 30 | 62 | 12 signals | ✔️ |
| | Random policy | 1,000 | 7,388 | 43,028 | 30 | 62 | 12 signals | ✔️ |
| KuaiRand-Pure | 15 policies | 27,285 | 7,551 | 1,436,609 | 30 | 62 | 12 signals | ✔️ |
| | Random policy | 27,285 | 7,583 | 1,186,059 | 30 | 62 | 12 signals | ✔️ |
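As a quick cross-check of the figures in the table, the ratio of random-policy interactions to 15-policy interactions can be computed per version. This is a small arithmetic sketch; the table does not state whether the 15-policy counts already include the random interactions, so the plain ratio is used here as an assumption.

```python
# (#interactions under the 15 policies, #interactions under the random policy),
# copied from the statistics table above.
stats = {
    "KuaiRand-27K": (322_278_385, 1_186_059),
    "KuaiRand-1K": (11_713_045, 43_028),
    "KuaiRand-Pure": (1_436_609, 1_186_059),
}

# Ratio of random-policy interactions to 15-policy interactions per version.
shares = {name: rand / standard for name, (standard, rand) in stats.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

For KuaiRand-27K and KuaiRand-1K the ratio is roughly 0.37%, in line with the share of replaced interactions reported in the overview. For KuaiRand-Pure it is far higher, because that version keeps only the logs of the videos in the candidate pool.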

Which version should I use?

  • Reasons to use KuaiRand-27K or KuaiRand-1K:
    • Your research needs rigorous sequential logs, e.g., for off-policy evaluation (OPE), reinforcement learning (RL), or long sequential recommendation.
  • Reasons to use KuaiRand-Pure:
    • Sequential information is not necessary for your research, or you are OK with incomplete sequential logs, for example if you are studying debiasing in collaborative filtering models or multi-task modeling in recommendation.
    • Your model can only run on small-scale data.

Data Descriptions

The file structure of the three datasets is listed as follows:

KuaiRand-27K (46GB)
KuaiRand-27K
├── data (46GB)
│   ├── log_random_4_22_to_5_08_27k.csv (83MB)
│   ├── log_standard_4_08_to_4_21_27k_part1.csv (4.8GB)
│   ├── log_standard_4_08_to_4_21_27k_part2.csv (4.8GB)
│   ├── log_standard_4_22_to_5_08_27k_part1.csv (6.6GB)
│   ├── log_standard_4_22_to_5_08_27k_part2.csv (6.6GB)
│   ├── user_features_27k.csv (3.4MB)
│   ├── video_features_basic_27k.csv (2.6GB)
│   ├── video_features_statistic_27k_part1.csv (6.7GB)
│   ├── video_features_statistic_27k_part2.csv (6.7GB)
│   └── video_features_statistic_27k_part3.csv (6.7GB)
└── load_data_27k.py
KuaiRand-1K (4.3GB)
KuaiRand-1K
├── data (4.3GB)
│   ├── log_random_4_22_to_5_08_1k.csv (2.9MB)
│   ├── log_standard_4_08_to_4_21_1k.csv (368MB)
│   ├── log_standard_4_22_to_5_08_1k.csv (481MB)
│   ├── user_features_1k.csv (132KB)
│   ├── video_features_basic_1k.csv (368MB)
│   └── video_features_statistic_1k.csv (3.1GB)
└── load_data_1k.py
KuaiRand-Pure (194MB)
KuaiRand-Pure
├── data (194MB)
│   ├── log_random_4_22_to_5_08_pure.csv (83MB)
│   ├── log_standard_4_08_to_4_21_pure.csv (80MB)
│   ├── log_standard_4_22_to_5_08_pure.csv (21MB)
│   ├── user_features_pure.csv (3.4MB)
│   ├── video_features_basic_pure.csv (612KB)
│   └── video_features_statistic_pure.csv (6.3MB)
└── load_data_pure.py

1️⃣ Description of the fields in log_xxx.csv

There are three log files:

  • log_random_4_22_to_5_08.csv contains all interactions resulting from random intervention.
  • log_standard_4_22_to_5_08.csv contains all interactions of standard recommendation.
  • log_standard_4_08_to_4_21.csv contains all interactions of standard recommendation for the same users in the previous two weeks (2022.04.08 ~ 2022.04.21).
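To work with the three log files together while keeping track of which interactions come from the random intervention, the rows can be streamed and tagged by source file. This is a minimal sketch using the standard `csv` module. It assumes the KuaiRand-Pure file names from the file structure above (the other versions use a different suffix), and the `from_random_log` key is a hypothetical name added for illustration, not a field of the dataset. The official `load_data_*.py` scripts shipped with each version may load the data differently.

```python
import csv
from pathlib import Path

# File name -> whether the file holds random-intervention interactions.
# Names follow the KuaiRand-Pure tree; adjust the suffix for 1K or 27K.
LOG_FILES = {
    "log_random_4_22_to_5_08_pure.csv": True,     # random intervention
    "log_standard_4_22_to_5_08_pure.csv": False,  # standard feed, same period
    "log_standard_4_08_to_4_21_pure.csv": False,  # standard feed, previous two weeks
}


def iter_logs(data_dir):
    """Yield every interaction as a dict of strings keyed by column name,
    plus a boolean `from_random_log` marking its source file (hypothetical key)."""
    for name, is_random in LOG_FILES.items():
        path = Path(data_dir) / name
        with path.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                row["from_random_log"] = is_random
                yield row
```

A call such as `sum(row["from_random_log"] for row in iter_logs("KuaiRand-Pure/data"))` would then count the randomly exposed interactions without loading all files into memory at once. Note that `csv.DictReader` returns every value as a string, so numeric fields need an explicit conversion.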

| Field Name | Description | Type | Example |
| ----------------- | ------------------------------------------------------------ | ------- | --------- |
| user_id | The ID of the user. | int64 | 17387 |
| video_id | The ID of the video. | int64 | 1123453 |
| date | The date of this interaction. | int64 | 20220421 |
| hourmin | The time of this interaction (format: HHMM). | int64 | 400 |
| time_ms | The timestamp of this interaction in milliseconds. | int64 | 1650485801301 |
| is_click | A binary feedback signal.
