CoWPiRec

The official implementation for Collaborative Word-based Pre-trained Item Representation for Transferable Recommendation.

Requirements

python==3.9.7
recbole==1.0.1
torch==1.10.0
cudatoolkit==11.3
transformers==4.21.2

Quick Start

We provide a preprocessed version of the Scientific dataset in dataset/Scientific. You can download the pretrained model from Google Drive and place it in saved/model. You can then reproduce our experimental results by following the steps below.

Clone this repository.

git clone https://github.com/ysh-1998/CoWPiRec.git

Get the text-based item embedding with CoWPiRec.

python get_emb.py --gpu_id=0 --dataset Scientific

Train and evaluate on the downstream Scientific dataset.

python downstream/finetune.py --gpu_id=0 -d Scientific

Pipeline

The full pipeline to reproduce CoWPiRec is described in the sections below.

Datasets

To preprocess a dataset, prepare two raw data files:

user interaction data: data_name.csv, format: user_id,item_id,time
item metadata: meta_data_name.csv, format: item_id,item_text

Place these two files in ./dataset/raw_data/interaction and ./dataset/raw_data/metadata, respectively, then run the scripts below:

python ./dataset/preprocessing/process_dataset.py --dataset dataset_name
python ./dataset/preprocessing/tokenizer.py --dataset dataset_name
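
The raw inputs described above can be produced from in-memory data with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name write_raw_data and the with_header flag are hypothetical, and whether the preprocessing script expects a header row is not stated here, so check process_dataset.py before relying on either setting.

```python
import csv
from pathlib import Path


def write_raw_data(root, data_name, interactions, item_texts, with_header=False):
    """Write data_name.csv and meta_data_name.csv under root.

    interactions: iterable of (user_id, item_id, time) tuples.
    item_texts:   dict mapping item_id -> item_text.
    with_header:  hypothetical flag; the README does not say whether
                  the preprocessing script expects a header row.
    """
    root = Path(root)
    inter_dir = root / "interaction"
    meta_dir = root / "metadata"
    inter_dir.mkdir(parents=True, exist_ok=True)
    meta_dir.mkdir(parents=True, exist_ok=True)

    inter_path = inter_dir / f"{data_name}.csv"
    meta_path = meta_dir / f"meta_{data_name}.csv"

    # Interaction file, column order: user_id,item_id,time
    with inter_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if with_header:
            writer.writerow(["user_id", "item_id", "time"])
        writer.writerows(interactions)

    # Metadata file, column order: item_id,item_text
    with meta_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if with_header:
            writer.writerow(["item_id", "item_text"])
        writer.writerows(item_texts.items())

    return inter_path, meta_path
```

Calling write_raw_data("./dataset/raw_data", "toy", rows, texts) would place toy.csv and meta_toy.csv in the two directories named above. The csv module quotes item text that contains commas.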

The generated dataset directory contains:

dataset/
  dataset_name/
    data_name.train.inter
    data_name.dev.inter
    data_name.test.inter
    data_name.text
    data_name.tokenize.json
    data_name.item2index
    data_name.user2index
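
Before moving on, it can help to confirm that preprocessing produced every file in the listing above. The following is a small sketch for that check; the function name missing_files is hypothetical, and the suffix list is copied from the listing rather than from the preprocessing code.

```python
from pathlib import Path

# Suffixes taken from the directory listing in this README.
EXPECTED_SUFFIXES = [
    "train.inter",
    "dev.inter",
    "test.inter",
    "text",
    "tokenize.json",
    "item2index",
    "user2index",
]


def missing_files(dataset_dir, data_name):
    """Return the expected file names that are absent from dataset_dir."""
    dataset_dir = Path(dataset_dir)
    return [
        f"{data_name}.{suffix}"
        for suffix in EXPECTED_SUFFIXES
        if not (dataset_dir / f"{data_name}.{suffix}").is_file()
    ]
```

An empty list from missing_files("dataset/dataset_name", "data_name") means all seven files are present.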

Word Graph

The word graph is a key component of CoWPiRec pre-training and is constructed from co-click items. First, obtain the co-click item pairs.

python ./WordGraph/get_coclick_item_pairs.py --dataset dataset_name
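
To make the idea of co-click item pairs concrete, here is a minimal sketch. It assumes that two items form a co-click pair when at least one user interacted with both; the exact definition used by get_coclick_item_pairs.py (for example, any time window or frequency threshold) is not stated here, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coclick_item_pairs(interactions):
    """Return unordered item pairs that share at least one user.

    interactions: iterable of (user_id, item_id, time) tuples.
    Assumption: co-click means "interacted with by the same user".
    """
    items_by_user = defaultdict(set)
    for user_id, item_id, _time in interactions:
        items_by_user[user_id].add(item_id)

    pairs = set()
    for items in items_by_user.values():
        # Sorting makes each pair canonical, so (a, b) and (b, a) collapse.
        for a, b in combinations(sorted(items), 2):
            pairs.add((a, b))
    return pairs
```

Each pair is stored once in sorted order, which keeps the set free of mirrored duplicates.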

Process the co-click item pairs to get the co-click word graph.

python ./WordGraph/get_coclick_word_graph.py \
  --dataset dataset_name --num_workers 40 --max_len 64

You can enable multiprocessing by modifying the argument --num_workers. The maximum length of the item text is controlled by the argument --max_len.
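
As an illustration of how co-click item pairs might turn into a word graph, the sketch below links every word of one item to every word of its co-click partner and counts how often each link occurs. This is an assumption about the construction, not the repository's code: it uses whitespace tokenization as a stand-in for the real tokenizer, runs in a single process, and the function name is hypothetical. Only max_len mirrors an argument mentioned above.

```python
from collections import Counter, defaultdict


def build_coclick_word_graph(pairs, item_text, max_len=64):
    """Count word-to-word links induced by co-click item pairs.

    pairs:     iterable of (item_a, item_b) tuples.
    item_text: dict mapping item_id -> text.
    max_len:   keep at most this many tokens per item text.
    """
    # Stand-in tokenizer: lowercase whitespace split, truncated to max_len.
    tokens = {
        item: set(text.lower().split()[:max_len])
        for item, text in item_text.items()
    }

    graph = defaultdict(Counter)
    for a, b in pairs:
        # One undirected edge per distinct word pair across the two items.
        edges = {
            frozenset((w, v))
            for w in tokens[a]
            for v in tokens[b]
            if w != v
        }
        for edge in edges:
            w, v = tuple(edge)
            graph[w][v] += 1
            graph[v][w] += 1
    return graph
```

The result maps each word to a Counter of neighbor words, and the counts are symmetric by construction.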

Process the co-click word graph and filter neighbors based on TF-IDF.

python ./WordGraph/get_word_graph.py --dataset dataset_name --topN 30

The argument --topN sets the number of neighbors kept for each word after filtering.
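
The filtering step can be pictured as scoring each neighbor with a TF-IDF style weight and keeping the best top N. The sketch below is one plausible reading, not the formula used by get_word_graph.py: it treats the co-click count as the term frequency and uses log(number of words / number of neighbor lists containing the neighbor) as the inverse document frequency. The function name is hypothetical.

```python
import math
from collections import Counter


def filter_neighbors(graph, top_n=30):
    """Keep the top_n neighbors of each word, ranked by a TF-IDF style score.

    graph: dict mapping word -> {neighbor: co-click count}.
    Assumed score: count * log(num_words / df[neighbor]), where df is the
    number of neighbor lists in which the neighbor appears.
    """
    num_words = len(graph)

    df = Counter()
    for neighbors in graph.values():
        df.update(neighbors.keys())

    filtered = {}
    for word, neighbors in graph.items():
        scores = {
            v: count * math.log(num_words / df[v])
            for v, count in neighbors.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda v: (-scores[v], v))
        filtered[word] = ranked[:top_n]
    return filtered
```

Under this weighting, a word that neighbors almost everything (such as a stop word) gets a low score and tends to be dropped first.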

Pretraining

Pre-train on a single GPU.

python pretrain.py --gpu_id=0 -d pretrain_dataset_name

Pre-train with multiple GPUs.

CUDA_VISIBLE_DEVICES=0,1,2,3 python ddp_pretrain.py

Downstream

Get item embeddings using CoWPiRec.

python get_emb.py --gpu_id=0 --dataset dataset_name \
    --load_pretrain_model --pretrain_model_path saved/xxx.pth

Downstream recommendation.

python downstream/finetune.py --gpu_id=0 -d dataset_name
