cs401r_w2016:lab12


Objective:

To understand recommender systems, and to have a significant, creative experience exploring a large dataset in a competition-style setting.


Deliverable:

For this lab, you will construct a movie recommendation engine using a simple, publicly available dataset. You will turn in two things:

  1. A notebook containing your code. We will read it, but we will not run it.
  2. A set of predictions for a specific list of <user,movie> pairs.

We will run a small “competition” on your predictions: the three students with the best predictions will get 10% extra credit on this lab.

You may use any strategy you want to construct your predictions, with one exception: you may not try to recover the values of the missing entries by analyzing the original dataset.


Grading standards:

Your entry will be graded on the following elements:

  • 25% Correct implementation of Metropolis Hastings inference
  • 5% Correct calculation of gradients
  • 45% Correct implementation of Hamiltonian MCMC
  • 15% A small write-up comparing and contrasting MH, HMC, and the different proposal distributions
  • 10% Final plot is tidy and legible
  • 10% extra credit for the three top predictions

Description:

This lab is designed to help you be creative in finding your own way to solve a significant data analysis problem. You may use any of the techniques we have discussed in class, techniques from other classes, or you may invent your own new techniques.

The training set you will use can be downloaded here:

Movie ratings training data

A complete description of the data can be found in the readme.txt file. This dataset is richer than the Netflix competition dataset: for each movie, you also have a corresponding IMDB ID, some Rotten Tomatoes information, and the set of tags that users applied when rating that movie.

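You should start by looking at the user_ratedmovies_train.dat file. It is a delimited text file that pandas can read directly (the fields are tab-separated, not comma-separated). Each row pairs a user and a movie with a rating and a timestamp, and these rows form the core training data. Everything else is auxiliary data that may or may not be useful.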
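As a starting point for the <user,movie> predictions, here is a minimal sketch of a bias baseline: each prediction is the global mean rating plus an offset for the user and an offset for the movie. It is only one simple option, not a required approach. The function names, the damping constant, and the 0.5 to 5.0 clipping range are all assumptions made for illustration; check readme.txt for the real rating scale.

```python
from collections import defaultdict


def fit_baseline(ratings, damping=5.0):
    """Fit a global mean plus damped per-user and per-movie offsets.

    ratings: iterable of (user, movie, rating) tuples.
    damping: hypothetical shrinkage constant (an assumption); larger values
             pull the offsets of rarely seen users/movies toward zero.
    """
    ratings = list(ratings)
    global_mean = sum(r for _, _, r in ratings) / len(ratings)

    # Movie offsets: damped average deviation from the global mean.
    movie_sum, movie_cnt = defaultdict(float), defaultdict(int)
    for _, m, r in ratings:
        movie_sum[m] += r - global_mean
        movie_cnt[m] += 1
    movie_bias = {m: movie_sum[m] / (movie_cnt[m] + damping) for m in movie_sum}

    # User offsets: damped average of what is left after the movie offset.
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for u, m, r in ratings:
        user_sum[u] += r - global_mean - movie_bias[m]
        user_cnt[u] += 1
    user_bias = {u: user_sum[u] / (user_cnt[u] + damping) for u in user_sum}

    return global_mean, user_bias, movie_bias


def predict(model, user, movie, lo=0.5, hi=5.0):
    """Predict one rating; unseen users or movies contribute a zero offset.

    lo and hi are an assumed rating range used only to clip the output.
    """
    global_mean, user_bias, movie_bias = model
    raw = global_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
    return min(hi, max(lo, raw))
```

To use it with the training file, pass the user, movie, and rating columns of the loaded table as tuples (for example via `itertuples(index=False)` on a three-column selection). The user and movie column names are given in the file header and in readme.txt. Because unseen users and movies fall back to the global mean, every requested pair gets a prediction.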


Hints:

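import matplotlib.pyplot as plt
import seaborn
import pandas

# The file is tab-separated; pass the separator by keyword.
ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t')

# Histogram of all ratings in the training set.
plt.hist(ur['rating'])
plt.show()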
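Since the true ratings for the competition pairs are not available to you, one way to compare strategies is to hold out part of the training data and score your predictions on it. The sketch below assumes root-mean-squared error (RMSE) as the score. This page does not state the competition metric, so treat that choice, the helper names, and the 10% hold-out fraction as assumptions.

```python
import math
import random


def train_validation_split(rows, holdout_fraction=0.1, seed=0):
    """Shuffle the rows and set aside a fraction for validation.

    rows: any sequence of training examples, e.g. (user, movie, rating) tuples.
    Returns (train_rows, validation_rows).
    """
    rows = list(rows)
    rng = random.Random(seed)  # fixed seed so that the split is repeatable
    rng.shuffle(rows)
    n_val = max(1, int(len(rows) * holdout_fraction))
    return rows[n_val:], rows[:n_val]


def rmse(predicted, actual):
    """Root-mean-squared error between two equal-length sequences."""
    predicted, actual = list(predicted), list(actual)
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))
```

A useful reference point is the RMSE you get by predicting the mean training rating for every held-out row; a method that cannot beat that number on the validation rows is unlikely to do well in the competition.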
cs401r_w2016/lab12.1459180777.txt.gz · Last modified: 2021/06/30 23:40 (external edit)