# daisyRec

**Repository Path**: fupan/daisyRec

## Basic Information

- **Project Name**: daisyRec
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-06-07
- **Last Updated**: 2021-06-07

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

![DaisyRec](pics/logo.png)

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/scikit-daisy)
[![Version](https://img.shields.io/badge/version-v1.1.2-orange)](https://github.com/AmazingDD/daisyRec)
![GitHub repo size](https://img.shields.io/github/repo-size/amazingdd/daisyrec)
![GitHub](https://img.shields.io/github/license/amazingdd/daisyrec)

## Overview

DaisyRec is a Python toolkit for rating prediction and item ranking tasks. The name DAISY (roughly :) ) stands for Multi-**D**imension f**AI**rly comp**A**r**I**son for recommender **SY**stem. The whole framework of DaisyRec is shown below.

Make sure you have a **CUDA** environment for acceleration, since the deep-learning models can run on it. We will update this repo continuously.

## Datasets

You can download the experiment data and put it into the `data` folder. All datasets are available at the links below:

- [MovieLens 100K](https://grouplens.org/datasets/movielens/100k/)
- [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/)
- [MovieLens 10M](https://grouplens.org/datasets/movielens/10m/)
- [MovieLens 20M](https://grouplens.org/datasets/movielens/20m/)
- [Netflix Prize Data](https://archive.org/download/nf_prize_dataset.tar)
- [Last.fm](https://grouplens.org/datasets/hetrec-2011/)
- [Book Crossing](https://grouplens.org/datasets/book-crossing/)
- [Epinions](http://www.cse.msu.edu/~tangjili/trust.html)
- [CiteULike](https://github.com/js05212/citeulike-a)
- [Amazon-Book](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Books.csv)
- [Amazon-Electronic](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Electronics.csv)
- [Amazon-Cloth](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Clothing_Shoes_and_Jewelry.csv)
- [Amazon-Music](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Digital_Music.csv)
- [Yelp Challenge](https://kaggle.com/yelp-dataset/yelp-dataset)

## How to run

1. Run `python setup.py build_ext --inplace` to compile the dependent extensions before running any other code. Afterwards you will find the generated `*.so` or `*.pyd` files under `daisy/model/`.
2. To reproduce the results, run `python data_generator.py` to create the `experiment_data` folder with the public datasets listed in our paper. If you only want to study one particular dataset, modify `data_generator.py` so that it yields just the train and test sets you need. By default, `data_generator.py` generates every kind of dataset (raw data, 5-core data and 10-core data) with the different data splitting methods, namely `tloo`, `loo`, `tfo` and `fo`. The meaning of these split methods is explained in the `Important Commands` section of this `README`; see the sketch right after this list for an illustration of how such splits can be built.
3. The codes for validation and test are separate, stored in the `nested_tune_kit` and `test_kit` folders, respectively. Each script in these folders should be moved into the root path (the same directory as `data_generator.py`) in order to run successfully. Alternatively, if you use an IDE, you can simply set the working path and run from any folder.
4. The validation dataset is used for parameter tuning, so we provide a *split_validation* interface inside the scripts in the `nested_tune_kit` folder. More detailed information about the validation split method is given in `daisy/utils/loader.py`. After validation finishes, the results are stored in the automatically generated folder `tune_log/`.
5. Using the best parameters determined by validation, run the test script that you moved into the root path; the results are stored in the automatically generated folder `res/`.
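As an illustration only (the repository's actual splitting logic lives in `data_generator.py` and `daisy/utils/loader.py`), a minimal sketch of the `tfo` and `tloo` splits could look like the snippet below. The function names `split_tfo`/`split_tloo` and the column names `user`, `item`, `rating`, `timestamp` are assumptions for this example, not names taken from the codebase.

```python
import pandas as pd

def split_tfo(df: pd.DataFrame, test_ratio: float = 0.2):
    """Time-aware split-by-ratio (`tfo`): the newest interactions go to the test set."""
    df = df.sort_values('timestamp')               # assumes a `timestamp` column
    cut = int(len(df) * (1 - test_ratio))
    return df.iloc[:cut], df.iloc[cut:]

def split_tloo(df: pd.DataFrame):
    """Time-aware leave-one-out (`tloo`): each user's latest interaction is held out."""
    df = df.sort_values('timestamp')
    test = df.groupby('user').tail(1)              # last interaction per user
    train = df.drop(test.index)
    return train, test

# Example usage with an assumed MovieLens-style ratings frame:
# ratings = pd.read_csv('data/ml-1m/ratings.dat', sep='::', engine='python',
#                       names=['user', 'item', 'rating', 'timestamp'])
# train, test = split_tfo(ratings)
```

The non-time-aware variants (`fo`, `loo`) follow the same idea with a random ordering instead of a chronological one.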
## Examples to run

Take the following case as an example: reproducing the top-20 results for *BPR-MF* on the ML-1M-10core dataset.

1. Assume we have already run `data_generator.py` and obtained the training and test sets via `tfo` (i.e., the time-aware split-by-ratio method). We should then have the files `train_ml-1m_10core_tfo.dat` and `test_ml-1m_10core_tfo.dat` in `./experiment_data/`.
2. The whole procedure consists of validation and test. We therefore first run `hp_tune_pair_mf.py` to find the best parameter settings; the parameter search space can also be changed inside `hp_tune_pair_mf.py`. Here `--loss_type=BPR` selects the pairwise BPR objective, sketched right after this example. Command to run:

   ```
   python hp_tune_pair_mf.py --dataset=ml-1m --prepro=10core --val_method=tfo --test_method=tfo --topk=20 --loss_type=BPR --sample_method=uniform --gpu=0
   ```

3. After step 2 finishes, we get the best parameter settings from `tune_log/`. We can then run the test code with the following command:

   ```
   python run_pair_mf.py --dataset=ml-1m --prepro=10core --test_method=tfo --topk=20 --loss_type=BPR --num_ng=2 --factors=34 --epochs=50 --lr=0.0005 --lamda=0.0016 --sample_method=uniform --gpu=0
   ```

   More details on the arguments are available in the help message, try:

   ```
   python run_pair_mf.py --help
   ```

4. Once step 3 terminates, the top-20 results can be found in the dynamically generated result file `./res/ml-1m/10core_tfo_pairmf_BPR_uniform.csv`.
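For reference, the snippet below is a minimal, illustrative sketch of the pairwise Bayesian Personalized Ranking (BPR) loss selected by `--loss_type=BPR`, written in PyTorch. It is not the exact implementation used in `daisy/model/`; the function name `bpr_loss` and the commented score computation are assumptions for this example.

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """BPR pairwise loss: push each positive item's score above its sampled
    negative's score; equivalent to -log(sigmoid(pos - neg)), averaged over pairs."""
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# e.g. with scores produced by a matrix-factorization model:
# pos = (user_emb * pos_item_emb).sum(dim=-1)
# neg = (user_emb * neg_item_emb).sum(dim=-1)
# loss = bpr_loss(pos, neg)
```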
## More Ranking Results

More ranking results for different methods on different datasets across various settings of top-N (N = 1, 5, 10, 20, 30) are available in `ranking_results.md`.

## Important Commands

The descriptions of all common parameter settings used by the code inside `examples` are listed below:

| Commands | Description | Choices | Description of Choices |
| ------------- | ------------- | ------------- | ------------- |
| dataset | the selected dataset | ml-100k; ml-1m; ml-10m; ml-20m; lastfm; bx; amazon-cloth; amazon-electronic; amazon-book; amazon-music; epinions; yelp; citeulike; netflix | all choices are names of datasets |
| prepro | the data pre-processing method | origin; Ncore | `origin`: use the raw data; `Ncore`: keep only users and items with more than **N** interactions, where **N** can be any integer (see the sketch after this table) |
| val_method / test_method | train-validation splitting; train-test splitting | ufo; fo; tfo; loo; tloo; cv | `ufo`: split-by-ratio at the user level; `fo`: split-by-ratio; `tfo`: time-aware split-by-ratio; `loo`: leave-one-out; `tloo`: time-aware leave-one-out; `cv`: cross validation (only applies to val_method) |
| topk | the length of the recommendation list | | |
| test_size | the ratio of the test set | | |
| fold_num | the number of folds used for validation (only applies to `cv` and `fo`) | | |
| cand_num | the number of candidate items used for ranking | | |
| sample_method | the negative sampling method | uniform; item-ascd; item-desc | `uniform`: uniform sampling; `item-ascd`: sampling popular items with low rank; `item-desc`: sampling popular items with high rank |
| num_ng | the number of negative samples | | |
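As an illustration of the `Ncore` pre-processing option, the sketch below shows one common way to filter a ratings frame so that every remaining user and item has more than N interactions. It is a hedged example rather than the repository's own implementation; the function name `filter_n_core` and the column names `user` and `item` are assumptions.

```python
import pandas as pd

def filter_n_core(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Iteratively drop users/items with n or fewer interactions until both sides
    satisfy the N-core condition (removing items can invalidate users, and vice
    versa, hence the loop)."""
    while True:
        user_counts = df['user'].value_counts()
        item_counts = df['item'].value_counts()
        keep_users = user_counts[user_counts > n].index
        keep_items = item_counts[item_counts > n].index
        filtered = df[df['user'].isin(keep_users) & df['item'].isin(keep_items)]
        if len(filtered) == len(df):   # nothing removed -> converged
            return filtered
        df = filtered

# e.g. ten_core = filter_n_core(ratings, n=10)
```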