This shows you the differences between two versions of the page.
cs401r_w2016:lab12 [2016/03/28 16:07] admin |
cs401r_w2016:lab12 [2021/06/30 23:42] |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====Objective:==== | ||
- | To understand recommender systems, and to have a significant, creative experience exploring a large dataset in a competition-style setting. | ||
- | |||
- | ---- | ||
- | ====Deliverable:==== | ||
- | |||
- | For this lab, you will construct a movie recommendation engine using a simple, publicly available dataset. You will turn in three things: | ||
- | |||
- | - A notebook containing your code (we will not run it). | ||
- | - A set of predictions for a specific list of <user,movie> pairs. | ||
- | - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data. | ||
- | |||
- | We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab. | ||
- | |||
- | You may use any strategy you want to construct your predictions, except attempting to recover the values of the missing entries from the original dataset. | ||
- | |||
- | ---- | ||
- | ====Grading standards:==== | ||
- | |||
- | Your entry will be graded on the following elements: | ||
- | |||
- | * 100% Project writeup | ||
- | * 10% extra credit for the three top predictions | ||
- | |||
- | ---- | ||
- | ====Description:==== | ||
- | |||
- | This lab is designed to help you be creative in finding your own way to solve a significant data analysis problem. You may use techniques we have discussed in class, techniques from other classes, or new techniques of your own invention. | ||
- | |||
- | The training set you will use can be downloaded here: | ||
- | |||
- | [[http://hatch.cs.byu.edu/courses/stat_ml/movie_training_data.tar.gz|Movie ratings training data]] | ||
- | |||
- | A complete description of the data can be found in the ''readme.txt'' file. This dataset is richer than the Netflix competition dataset: for each movie, you also have a corresponding IMDb ID, some Rotten Tomatoes information, and a set of tags that users may have applied when rating the movie. | ||
- | |||
- | You should start by looking at the ''user_ratedmovies_train.dat'' file. It is a tab-separated file containing user, movie, rating, and timestamp tuples that form the core training data. Everything else is auxiliary data that may or may not be useful. | ||
- | |||
- | **Turning in your submissions** | ||
- | |||
- | As part of this lab, you must submit a set of predictions. Provide them as a simple CSV file with two columns and 85,000 rows. Each row has the form: | ||
- | |||
- | ''test_id,predicted rating'' | ||
- | |||
- | The ''test_id'' field | ||
- | |||
- | **Evaluating your submissions** | ||
- | |||
- | Performance of your prediction engine will be based on RMSE, computed over the $N$ test pairs: | ||
- | |||
- | $$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\mathrm{prediction}_i - \mathrm{truth}_i)^2 } $$ | ||
- | |||
- | **Note: you are strongly encouraged to first partition your dataset into training and validation sets, so that you can assess the generalization performance of your rating algorithm!** | ||
- | |||
- | ---- | ||
- | ====Hints:==== | ||
- | |||
- | <code python> | ||
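- | import matplotlib.pyplot as plt | ||
- | import seaborn | ||
- | import pandas | ||
- | |||
- | # Load the tab-separated ratings file (sep must be passed by keyword in recent pandas) | ||
- | ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t') | ||
- | |||
- | # Plot the distribution of ratings | ||
- | plt.hist(ur['rating']) | ||
- | plt.show() | ||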
- | |||
- | </code> |
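
The note above about holding out a validation set can be sketched as follows. This is only a minimal illustration, not the course's reference approach: the column names ``userID`` and ``movieID``, the helper names ``rmse`` and ``train_validation_split``, the 20% hold-out fraction, and the synthetic ratings table are all assumptions made so the example runs without the lab data. In practice you would replace the synthetic table with the dataframe loaded from ``user_ratedmovies_train.dat``.

```python
import numpy as np
import pandas as pd


def rmse(predictions, truth):
    """Root mean squared error between two equal-length sequences."""
    predictions = np.asarray(predictions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((predictions - truth) ** 2)))


def train_validation_split(ratings, validation_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the rows for validation."""
    validation = ratings.sample(frac=validation_fraction, random_state=seed)
    train = ratings.drop(validation.index)
    return train, validation


# Synthetic stand-in for the real ratings table (hypothetical column names).
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "userID": rng.integers(0, 50, size=1000),
    "movieID": rng.integers(0, 100, size=1000),
    "rating": rng.integers(1, 11, size=1000) / 2.0,  # values 0.5 .. 5.0
})

train, validation = train_validation_split(ratings)

# Baseline: predict the global mean training rating for every held-out pair.
global_mean = train["rating"].mean()
baseline_predictions = np.full(len(validation), global_mean)
baseline_rmse = rmse(baseline_predictions, validation["rating"])
print(f"baseline validation RMSE: {baseline_rmse:.3f}")
```

A global-mean predictor like this is a useful floor: any real recommender should beat its validation RMSE, and the same ``rmse`` helper can score each new model on the same held-out rows.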