To understand recommender systems, and to have a significant, creative experience exploring a large dataset in a competition-style setting.
In this lab, you will construct a movie recommendation engine using a simple, publicly available dataset. You will turn in three things:
We will run a small “competition” on your predictions: the three students with the best predictions will get 10% extra credit on this lab.
You may use any strategy you want to construct your predictions, except for attempting to determine the values of the missing entries by analyzing the original dataset.
Your entry will be graded on the following elements:
This lab is designed to help you be creative in finding your own way to solve a significant data analysis problem. You may use any of the techniques we have discussed in class, techniques from other classes, or you may invent your own new techniques.
The training set you will use can be downloaded here:
You will need to make predictions for a set of (user, movie) pairs. These can be downloaded here:
A complete description of the data can be found in the readme.txt file. This dataset is richer than the Netflix competition dataset: for each movie, you also have director and genre information, a corresponding IMDB ID, some RottenTomatoes information, and a set of tags that users may have applied when rating each movie.
You should start by looking at the user_ratedmovies_train.dat file. It is a tab-delimited file containing (user, movie, rating, timestamp) tuples that form the core training data. Everything else is auxiliary data that may or may not be useful.
Turning in your submissions
As part of this lab, you must submit a set of predictions. You must provide predictions as a simple CSV file with two columns and 85,000 rows (plus a header line). Each row has the form
testID,predicted_rating
The testID field uniquely identifies each (user, movie) prediction pair in the predictions set.
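For example, a valid submission file might begin as follows (the IDs and ratings here are made up for illustration):

```
"testID","predicted_rating"
1,3.50
2,4.00
3,2.75
```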
Evaluating your submissions
Performance of your prediction engine will be based on RMSE:
$$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\mathrm{prediction}_i - \mathrm{truth}_i)^2 } $$
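The formula above translates directly into a few lines of NumPy (the arrays below are illustrative, not from the actual dataset):

```python
import numpy as np

def rmse(predictions, truth):
    """Root mean squared error between two equal-length arrays."""
    predictions = np.asarray(predictions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.sqrt(np.mean((predictions - truth) ** 2))

# Illustrative values only.
print(rmse([3.0, 4.0, 2.0], [3.5, 4.0, 1.0]))  # ~0.6455
```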
Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!
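As one possible (hypothetical) baseline to validate against: predict each movie's mean training rating, falling back to the global mean for movies not seen in training, and measure RMSE on the held-out split. The column names `movieID` and `rating` below are assumptions based on the style of the data files; check readme.txt for the actual names. The example runs on a tiny synthetic frame, not the real dataset:

```python
import numpy as np
import pandas as pd

def movie_mean_baseline(train, test, movie_col='movieID', rating_col='rating'):
    """Predict each movie's mean training rating; fall back to the
    global training mean for movies absent from the training split.
    Returns the predictions and their RMSE against the test ratings."""
    global_mean = train[rating_col].mean()
    movie_means = train.groupby(movie_col)[rating_col].mean()
    preds = test[movie_col].map(movie_means).fillna(global_mean)
    err = np.sqrt(np.mean((preds - test[rating_col]) ** 2))
    return preds, err

# Tiny synthetic example (not the real dataset).
train = pd.DataFrame({'movieID': [1, 1, 2], 'rating': [4.0, 2.0, 5.0]})
test = pd.DataFrame({'movieID': [1, 3], 'rating': [3.0, 4.0]})
preds, err = movie_mean_baseline(train, test)
```

Any per-movie or per-user statistic can be swapped into this pattern; the point is simply to have a number to beat before trying anything fancier.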
Project writeup
Because you are being given full freedom in choosing your implementation strategy, you will not be graded on it (except to ensure that your implementation matches what you describe in your writeup!). Instead, you will be graded solely on a writeup describing your implementation.
This writeup must include three main sections:
Here is some starter code that loads the ratings data and creates a train/validation split:

```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load the core training data (tab-delimited).
ur = pd.read_csv('user_ratedmovies_train.dat', sep='\t')

# Look at the distribution of ratings.
plt.hist(ur['rating'])

# Create a test/train split: hold out 85,000 ratings,
# the same size as the prediction set.
all_inds = np.random.permutation(len(ur))
test_inds = all_inds[:85000]
train_inds = all_inds[85000:]
ur_test = ur.iloc[test_inds]
ur_train = ur.iloc[train_inds]
```
And here is some code that writes out the prediction file you will submit:
```python
import numpy as np
import pandas as pd

# Load the (user, movie) pairs that need predictions (tab-delimited).
pred_array = pd.read_table('predictions.dat')
pred_array.head()
test_ids = pred_array['testID'].values
N = pred_array.shape[0]

my_preds = np.zeros(N)
for i in range(N):  # prediction loop
    predicted_rating = 3  # this predicts everything as 3
    my_preds[i] = predicted_rating

# Write out the submission file.
sfile = open('predictions.csv', 'w')
sfile.write('"testID","predicted_rating"\n')
for i in range(N):
    sfile.write('%d,%.2f\n' % (test_ids[i], my_preds[i]))
sfile.close()
```