Differences

This shows you the differences between two versions of the page.

--- cs401r_w2016:lab12 [2016/03/28 16:07]
admin
+++ cs401r_w2016:lab12 [2021/06/30 23:42]
@@ Line 1: / Line 1: @@
-====Objective:====
-To understand recommender systems, and to have a significant, creative experience exploring a large dataset in a competition-style setting.
-----
-====Deliverable:====
-For this lab, you will construct a movie recommendation engine using a simple, publicly available dataset.  You will turn in three things:
-  - A notebook containing your code (we will not run it).
-  - A set of predictions for a specific list of <user,movie> pairs.
-  - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data.
-We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab.
-You may use any strategy you want to construct your predictions, except for recovering the values of the missing entries by analyzing the original dataset from which the training data was drawn.
-----
-====Grading standards:====
-Your entry will be graded on the following elements:
-  * 100% Project writeup
-  * 10% extra credit for the three top predictions
-----
-====Description:====
-This lab is designed to help you be creative in finding your own way to solve a significant data analysis problem.  You may use any of the techniques we have discussed in class, techniques from other classes, or you may invent your own new techniques.
-The training set you will use can be downloaded here:
-[[http://hatch.cs.byu.edu/courses/stat_ml/movie_training_data.tar.gz|Movie ratings training data]]
-A complete description of the data can be found in the ''readme.txt'' file.  This dataset is richer than the Netflix competition dataset; for each movie, you also have a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie.
-You should start by looking at the ''user_ratedmovies_train.dat'' file.  It is a tab-separated file of user, movie, rating, and timestamp records that form the core training data.  Everything else is auxiliary data that may or may not be useful.
-**Turning in your submissions**
-As part of this lab, you must submit a set of predictions.  You must provide predictions as a simple CSV file with two columns and 85,000 rows.  Each row has the form
-''test_id,predicted rating''
-The ''test_id'' field
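A minimal sketch of writing predictions in the ''test_id,predicted rating'' row form shown above. It assumes no header row and three decimal places per rating; neither is specified by the lab text. The function name ''write_submission'' and the example ids are hypothetical.

```python
import csv
import io

def write_submission(predictions, out_file):
    """Write (test_id, predicted_rating) pairs as CSV rows, one pair per line."""
    writer = csv.writer(out_file, lineterminator="\n")
    for test_id, rating in predictions:
        # Assumption: three decimal places is enough precision for a rating.
        writer.writerow([test_id, f"{rating:.3f}"])

# Hypothetical predictions keyed by made-up test ids.
example_predictions = [(0, 3.5), (1, 4.25), (2, 2.0)]

# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.StringIO()
write_submission(example_predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write a real file, pass a handle opened with ''open(path, "w", newline="")'' in place of the in-memory buffer.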
-**Evaluating your submissions**
-Performance of your prediction engine will be based on RMSE:
-$$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\mathrm{prediction}_i - \mathrm{truth}_i)^2 } $$
-where $N$ is the number of predictions being evaluated.
-**Note: you are strongly encouraged to first partition your dataset into a training set and a validation set, so that you can assess the generalization performance of your rating algorithm!**
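The evaluation and the suggested train/validation partition can be sketched as follows. This is an illustration under stated assumptions, not the grading code: RMSE is computed in its usual form (the mean of the squared errors over the N evaluated pairs, then the square root), the ratings are made-up toy tuples, and the global-mean predictor is only a baseline to compare real models against. The helper names are hypothetical.

```python
import math
import random

def rmse(predictions, truths):
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, truths)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def train_validation_split(rows, validation_fraction=0.1, seed=0):
    """Shuffle a copy of rows and hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Toy (user, movie, rating) tuples, made up for illustration only.
ratings = [(u, m, float((u * m) % 5 + 1))
           for u in range(1, 11) for m in range(1, 11)]
train, validation = train_validation_split(ratings, validation_fraction=0.2)

# Baseline: predict the mean training rating for every held-out pair.
global_mean = sum(r for _, _, r in train) / len(train)
baseline_rmse = rmse([global_mean] * len(validation),
                     [r for _, _, r in validation])
print(len(train), len(validation), round(baseline_rmse, 3))
```

Any model worth submitting should score a lower validation RMSE than this constant baseline.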
-----
-====Hints:====
-<code python>
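-import matplotlib.pyplot as plt
-import seaborn  # not called directly; older versions restyle matplotlib on import
-import pandas
-# The ratings file is tab-separated; recent pandas requires sep as a keyword
-ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t')
-# Histogram of the observed ratings
-plt.hist(ur['rating'])
-plt.show()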
-</code>
