cs401r_w2016:lab12

Differences

This shows you the differences between two versions of the page.

cs401r_w2016:lab12 [2016/03/28 16:07]
admin
cs401r_w2016:lab12 [2021/06/30 23:42]
Line 1: Line 1:
-====Objective:====
  
-To understand recommender systems and to gain significant, creative experience exploring a large dataset in a competition-style setting.
- 
----- 
-====Deliverable:====
-
-For this lab, you will construct a movie recommendation engine using a simple, publicly available dataset.  You will turn in three things:
-
-  - A notebook containing your code (we will not run it).
-  - A set of predictions for a specific list of <user,movie> pairs.
-  - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data.
-
-We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab.
-
-You may use any strategy you want to construct your predictions, except attempting to determine the values of the missing entries by analyzing the original, publicly available dataset.
- 
----- 
-====Grading standards:====
- 
-Your entry will be graded on the following elements: 
- 
-  * 100% Project writeup 
-  * 10% extra credit for the three top predictions 
- 
----- 
-====Description:====
-
-This lab is designed to help you be creative in finding your own way to solve a significant data analysis problem.  You may use any of the techniques we have discussed in class or in other classes, or you may invent your own.
-
-The training set you will use can be downloaded here:
-
-[[http://hatch.cs.byu.edu/courses/stat_ml/movie_training_data.tar.gz|Movie ratings training data]]
-
-A complete description of the data can be found in the ''readme.txt'' file.  This dataset is richer than the Netflix competition dataset; for each movie, you also have a corresponding IMDB ID, some RottenTomatoes information, and a set of tags that users may have used when rating each movie.
-
-You should start by looking at the ''user_ratedmovies_train.dat'' file.  It is a tab-delimited file containing user, movie, rating, and timestamp fields that form the core training data.  Everything else is auxiliary data that may or may not be useful.
- 
-**Turning in your submissions** 
- 
-As part of this lab, you must submit a set of predictions as a simple CSV file with two columns and 85,000 rows.  Each row has the form
-
-''test_id,predicted rating''
-
-The ''test_id'' field 
- 
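A minimal sketch of writing predictions as rows of the form ''test_id,predicted rating'' is given below. The helper name ''write_submission'' is hypothetical, and the choice to omit a header row is an assumption, since the required header convention is not stated here.

```python
import io

import pandas as pd


def write_submission(test_ids, predicted_ratings, path_or_buffer, include_header=False):
    """Write one `test_id,predicted rating` row per prediction.

    `include_header=False` is an assumption; flip it if a header row is expected.
    """
    test_ids = list(test_ids)
    predicted_ratings = list(predicted_ratings)
    if len(test_ids) != len(predicted_ratings):
        raise ValueError("need exactly one predicted rating per test_id")

    submission = pd.DataFrame(
        {"test_id": test_ids, "predicted rating": predicted_ratings}
    )
    # index=False keeps pandas' row index out of the file.
    submission.to_csv(path_or_buffer, index=False, header=include_header)
    return submission


# Usage with made-up ids and ratings, written to an in-memory buffer.
buffer = io.StringIO()
write_submission([0, 1, 2], [3.5, 4.0, 2.5], buffer)
print(buffer.getvalue())
```

Passing a filename such as ''predictions.csv'' instead of the buffer writes the same rows to disk. Checking the row count of the result against the expected 85,000 before submitting is a cheap sanity test.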
-**Evaluating your submissions** 
- 
-Performance of your prediction engine will be based on RMSE: 
- 
-$$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\mathrm{prediction}_i - \mathrm{truth}_i)^2 } $$
-
-where $N$ is the number of predicted ratings.
- 
-**Note: you are strongly encouraged to first partition your dataset into a training set and a validation set, so that you can assess the generalization performance of your rating algorithm!**
- 
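One possible way to hold out a validation set and score predictions with RMSE (taken as the square root of the mean squared error) is sketched below. The function names, the toy data, and the ''userID'' and ''movieID'' column names are assumptions for illustration; check ''readme.txt'' for the real column names. The global-mean predictor is only a placeholder baseline, not a recommended method.

```python
import numpy as np
import pandas as pd


def rmse(predictions, truth):
    """Root mean squared error between two equal-length sequences."""
    predictions = np.asarray(predictions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((predictions - truth) ** 2)))


def train_validation_split(ratings, validation_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the rows as a validation set."""
    validation = ratings.sample(frac=validation_fraction, random_state=seed)
    training = ratings.drop(validation.index)
    return training, validation


# Toy ratings table; the real data would come from user_ratedmovies_train.dat.
toy = pd.DataFrame(
    {
        "userID": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "movieID": [10, 20, 10, 30, 20, 30, 10, 20, 30, 10],
        "rating": [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 1.5, 4.0, 3.5, 2.5],
    }
)

train, valid = train_validation_split(toy, validation_fraction=0.2, seed=0)

# Placeholder baseline: predict the training-set mean rating for every pair.
baseline = train["rating"].mean()
score = rmse(np.full(len(valid), baseline), valid["rating"])
print(f"validation RMSE of the global-mean baseline: {score:.3f}")
```

Any model you build can be scored the same way: fit on ''train'' only, predict the held-out pairs in ''valid'', and compare against their true ratings.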
----- 
-====Hints:====
- 
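-<code python>
-import matplotlib.pyplot as plt
-import seaborn
-import pandas
-
-# The ratings file is tab-delimited, so pass the separator by keyword.
-ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t')
-
-# Histogram of the rating values.
-plt.hist(ur['rating'])
-
-</code>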
cs401r_w2016/lab12.txt · Last modified: 2021/06/30 23:42 (external edit)