CoWPiRec
The official implementation for Collaborative Word-based Pre-trained Item Representation for Transferable Recommendation.
Install / Use
/learn @ysh-1998/CoWPiRecREADME
CoWPiRec
This is the official implementation for Collaborative Word-based Pre-trained Item Representation for Transferable Recommendation.
<div align=center> <img src="CoWPiRec.png" alt="CoWPiRec.png" width="100%" /> </div>Requirements
python==3.9.7
recbole==1.0.1
torch==1.10.0
cudatoolkit==11.3
transformers==4.21.2
Quick Start
We provide a preprocessed dataset of Scientific dataset in dataset/Scientific. You can download the pretrained model from Google Drive and place it in saved/model. Then You can repreduce our experiment results following the steps below.
Git clone this repository.
git clone https://github.com/ysh-1998/CoWPiRec.git
Get the text-based item embedding with CoWPiRec.
python get_emb.py --gpu_id=0 --dataset Scientific
Train and evaluate on downstream datasets Scientific.
python downstream/finetune.py --gpu_id=0 -d Scientific
Pipeline
The pipeline to repreduce CoWPiRec is shown below.
Datasets
To preprocessing datasets, you should prepare tow raw_data files:
- user interaction data
data_name.csv, format:user_id,item_id,time - item metadata
meta_data_name.csv, format:item_id,item_text
Place these two files in ./dataset/raw_data/interaction and ./dataset/raw_data/metadata, respectively, then run script below:
python ./dataset/preprocessing/process_dataset.py --dataset dataset_name
python ./dataset/preprocessing/tokenizer.py --dataset dataset_name
The dataset path generated contains:
dataset/
dataset_name/
data_name.train.inter
data_name.dev.inter
data_name.test.inter
data_name.text
data_name.tokenize.json
data_name.item2index
data_name.user2index
Word Graph
The word graph is a key component in pre-training of CoWPiRec and is constructed based on co-click items. Obtain the co-click item pairs first.
python ./WordGraph/get_coclick_item_pairs.py --dataset dataset_name
Process co-click item pairs to get co-click word graph.
python ./WordGraph/get_coclick_word_graph.py \
--dataset dataset_name --num_workers 40 --max_len 64
You can perform multiprocessing by modifing the argument --number_workers. The max length of item text is controlled by the argument --max_len.
Process co-click word graph, filter neighbors based on tf-idf.
python ./WordGraph/get_word_graph.py --dataset dataset_name --topN 30
argument --topN denotes the number of neighbors after filtering.
Pretraining
Pre-train on single GPU.
python pretrain.py --gpu_id=0 -d pretrain_dataset_name
Pre-train with multi GPUs.
CUDA_VISIBLE_DEVICES=0,1,2,3 python ddp_pretrain.py
Downstream
Get item embedding using CoWPiRec.
python get_emb.py --gpu_id=0 --dataset dataset_name \
--load_pretrain_model --pretrain_model_path saved/xxx.pth
Downstream recommendation.
python downstream/finetune.py --gpu_id=0 -d dataset_name
Related Skills
node-connect
348.5kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
109.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
348.5kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
348.5kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
