Ftrl
follow-the-regularized-leader implemented by java, with an example using criteo dataset.
Install / Use
/learn @chenhuang-learn/FtrlREADME
A Java Implementation of FTRL and An Example with Criteo Dataset
Get The Dataset
refer to: https://github.com/guestwalk/kaggle-2014-criteo/blob/master/README, 'Get The Dataset' part.
Run main_script.sh
- transform data code in main_script--------- ./feature_engineer/count.py train.csv > fc.trva.t10.txt ./feature_engineer/parallel_td.py -s 15 ./feature_engineer/transform_data.py train.csv train_trans.csv
features(I1-I13) greater than 2 are transformed by: int(log(v)^2)->v features(C1-C26) appear less than 10 times are transformed to a special value
- generate samples code in main_script--------- python feature_engineer/stat_field_info.py train_trans.csv st_info_file > /dev/null ./feature_engineer/parallel_gs.py -s 15 -m 1 ./feature_engineer/get_samples.py train_trans.csv train_samples_m1 ./feature_engineer/parallel_gs.py -s 15 -m 2 ./feature_engineer/get_samples.py train_trans.csv train_samples_m2
when -m 1, using a hash trick, features are mapped to [1, 1000000] when -m 2, no hash trick, features are mapped to [1, 1086921]
- train ftrl model and progress validation main-class: criteo.FTRLLocalTrain.java usage: java -jar ftrl_train.jar <input file> <L1> <L2> <alpha> <data_max_index> code in main_script--------- java -jar ftrl_train.jar train_samples_m1 1.0 1.0 0.1 1000000 > m1_result java -jar ftrl_train.jar train_samples_m2 1.0 1.0 0.1 1086921 > m2_result
use ftrl to train the model, use progressive validation progressive validation: get a sample -> predict -> loss -> update model -> get a new sample -> ... every 50k samples are processed, the program will print used time and average logloss of latest 250k samples global average logloss: about 0.457
- plot the average logloss python plot_progress_validation.py
Use feature combination
code in main_script--------- ./feature_engineer/parallel_gs.py -s 15 -m 3 ./feature_engineer/get_samples.py train_trans.csv train_samples_m3 java -jar ftrl_train.jar train_samples_m3 1.0 1.0 0.1 6086921 > m3_result
need large disk space, take care! because my feature combination method is too simple, feature space is very large, overfitting will happen, there are also many collisions when using hash trick, so the performance is bad, here is some results in my experiment: L2 change, alpha->0.1, L1->1.0 0.5 0.49062 1.5 0.48979 3.0 0.48862 10.0 0.48419 L1 change, alpha->0.1, L2->1.0 0.5 0.49628 1.5 0.48534 3.0 0.47539 10.0 0.45985 15.0 0.45689 20.0 0.45558 30.0 0.45473 50.0 0.45498
Related Skills
node-connect
340.5kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
84.2kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
340.5kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
84.2kCommit, push, and open a PR
