  - A notebook containing your code, but we will not run it.
  - A set of predictions for a specific list of <user,movie> pairs, in a CSV file.
  - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data.  Markdown format, please!

We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab.
Your entry will be graded on the following elements:
  
  * 85% Project writeup
    * 30% Exploratory data analysis
    * 30% Description of technical approach
    * 25% Analysis of performance of method
  * 15% Submission of predictions CSV file
  * 10% extra credit for the three top predictions
  
The training set you will use can be downloaded here:
  
[[https://www.dropbox.com/s/7r4rebn1ytugi4g/movie_training_data.tar.gz?dl=0|Movie ratings training data]]
  
You will need to make predictions for a set of ''user,movie'' pairs.  These can be downloaded here:

[[https://www.dropbox.com/s/2xvsrju9w3fqfon/predictions.dat?dl=0|Movie predictions data]]

A complete description of the data can be found in the ''readme.txt'' file.  This dataset is richer than the Netflix competition dataset; for each movie, you also have director and genre information, a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie.
  
You should start by looking at the ''user_ratedmovies_train.dat'' file.  It is a tab-delimited file containing user, movie, rating, timestamp tuples that form the core training data.  Everything else is auxiliary data that may or may not be useful.
**Turning in your submissions**
  
As part of this lab, you must submit a set of predictions.  You must provide predictions as a simple CSV file with two columns and 85,000 rows.  Each row has the form
  
''testID,predicted_rating''
  
The ''testID'' field uniquely identifies each ''user,movie'' prediction pair in the predictions set.
  
**Evaluating your submissions**
Performance of your prediction engine will be based on RMSE:
  
$$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i} (\mathrm{prediction}_i - \mathrm{truth}_i)^2 } $$
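
For reference, here is a minimal numpy sketch of this computation; the arrays ''predictions'' and ''truth'' are placeholders for your model's held-out predictions and the corresponding true ratings:

<code python>
import numpy as np

def rmse(predictions, truth):
    # root mean squared error over all prediction/truth pairs
    predictions = np.asarray(predictions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.sqrt(np.mean((predictions - truth) ** 2))

# toy usage with made-up numbers
print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))
</code>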
  
**Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!**

**Project writeup**

Because you are being given full freedom in choosing your implementation strategy, you will not be graded on it (except to ensure that your implementation matches what you describe in your writeup!).  Instead, you will be graded solely on a writeup describing your implementation.

This writeup must include three main sections:

  - **A discussion of your exploration of the dataset.**
    - Before you start coding, you should look at the data.  What does it include?  What patterns do you see?
    - Any visualizations of the data that you deem relevant
  - **A clear, technical description of your approach.**  This section should include:
    - Background on the approach
    - Description of the model you use
    - Description of the inference / training algorithm you use
    - Description of how you partitioned your data into a test/training split
  - **An analysis of how your approach worked on the dataset.**
    - What was your final RMSE on your private test/training split?
    - Did you overfit?  How do you know?
    - Was your first algorithm the one you ultimately used for your submission?  Why did you (or didn't you) iterate your design?

  
----
Here is some starter code to get you going:

<code python>
import matplotlib.pyplot as plt
import seaborn
import pandas
import numpy as np

# load the core training data (tab-delimited)
ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t')

# look at the overall distribution of ratings
plt.hist(ur['rating'])

# create a test/train split
all_inds = np.random.permutation(range(0, len(ur)))
test_inds = all_inds[0:85000]
train_inds = all_inds[85000:len(ur)]

ur_test = ur.iloc[test_inds]
ur_train = ur.iloc[train_inds]

</code>
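
If you want a couple more exploratory views for the writeup, simple aggregations along these lines can be a useful starting point.  This is only a sketch: the ''userID'' and ''movieID'' column names are assumptions, so check ''readme.txt'' for the actual names.

<code python>
import pandas
import matplotlib.pyplot as plt

ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t')

# how many ratings does each user contribute?  ('userID' is an assumed column name)
ratings_per_user = ur.groupby('userID').size()
plt.figure()
plt.hist(ratings_per_user, bins=50)
plt.xlabel('ratings per user')
plt.ylabel('number of users')

# how does the average rating vary across movies?  ('movieID' is an assumed column name)
avg_rating_per_movie = ur.groupby('movieID')['rating'].mean()
plt.figure()
plt.hist(avg_rating_per_movie, bins=50)
plt.xlabel('mean rating per movie')
plt.ylabel('number of movies')

plt.show()
</code>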

And here is some code that writes out the prediction file that you will submit:

<code python>
import numpy as np
import pandas as pd

# load the list of <user,movie> pairs you must predict
pred_array = pd.read_table('predictions.dat')
test_ids = pred_array[["testID"]]
pred_array.head()

N = pred_array.shape[0]
my_preds = np.zeros((N, 1))

# prediction loop: this baseline simply predicts a rating of 3 for everything
for i in range(N):
    predicted_rating = 3
    my_preds[i, 0] = predicted_rating

# write out the submission file
sfile = open('predictions.csv', 'w')
sfile.write('"testID","predicted_rating"\n')
for i in range(N):
    sfile.write('%d,%.2f\n' % (test_ids.iloc[i, 0], my_preds[i, 0]))
sfile.close()

</code>
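
As a possible next step beyond the constant baseline above, you might predict each user's mean training rating, falling back to the global mean for users with no training ratings.  The sketch below is hypothetical: it assumes the training file has ''userID'' and ''rating'' columns and that ''predictions.dat'' contains a ''userID'' column alongside ''testID'' (check ''readme.txt'' for the real column names).

<code python>
import numpy as np
import pandas as pd

ur_train = pd.read_csv('user_ratedmovies_train.dat', sep='\t')
pred_array = pd.read_table('predictions.dat')

# global mean rating, used as a fallback for unseen users
global_mean = ur_train['rating'].mean()

# mean rating of each user in the training data ('userID' is an assumed column name)
user_means = ur_train.groupby('userID')['rating'].mean()

# look up each prediction pair's user mean; unseen users get the global mean
my_preds = pred_array['userID'].map(user_means).fillna(global_mean).values

sfile = open('predictions.csv', 'w')
sfile.write('"testID","predicted_rating"\n')
for i in range(len(pred_array)):
    sfile.write('%d,%.2f\n' % (pred_array['testID'].iloc[i], my_preds[i]))
sfile.close()
</code>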