cs401r_w2016:lab12

  - A notebook containing your code, but we will not run it.
  - A set of predictions for a specific list of ''user,movie'' pairs, in a CSV file.
  - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data.  Markdown format, please!
  
We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab.

Your entry will be graded on the following elements:
  
  * 85% Project writeup
    * 30% Exploratory data analysis
    * 30% Description of technical approach
    * 25% Analysis of performance of method
  * 15% Submission of predictions CSV file
  * 10% extra credit for the three top predictions
  
The training set you will use can be downloaded here:
  
[[https://www.dropbox.com/s/7r4rebn1ytugi4g/movie_training_data.tar.gz?dl=0|Movie ratings training data]]
  
You will need to make predictions for a set of ''user,movie'' pairs.  These can be downloaded here:
  
[[https://www.dropbox.com/s/2xvsrju9w3fqfon/predictions.dat?dl=0|Movie predictions data]]
  
A complete description of the data can be found in the ''readme.txt'' file.  This dataset is richer than the Netflix competition dataset; for each movie, you also have director and genre information, a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie.
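
If you want a quick look at how the extra movie information lines up with the ratings, a minimal sketch along these lines may help.  Note that the file name ''movie_genres.dat'' and the column names ''movieID'' and ''genre'' below are only assumptions for illustration; check ''readme.txt'' for the actual names in the archive.

<code python>
import pandas as pd

# NOTE: 'movie_genres.dat', 'movieID', and 'genre' are assumed names used for
# illustration only -- consult readme.txt for the actual file/column names.
ratings = pd.read_table('user_ratedmovies_train.dat')
genres = pd.read_table('movie_genres.dat')

# attach genre information to each rating, keyed on the movie id
ratings_with_genres = ratings.merge(genres, on='movieID', how='left')

# one simple pattern to look for: the average rating per genre
print(ratings_with_genres.groupby('genre')['rating'].mean())
</code>
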
**Turning in your submissions**
  
As part of this lab, you must submit a set of predictions.  You must provide predictions as a simple CSV file with two columns and 85,000 rows.  Each row has the form
  
''testID,predicted_rating''
  
The ''testID'' field uniquely identifies each ''user,movie'' prediction pair in the predictions set.
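
For example, the first few rows of a submission file might look like this (the ''testID'' values and ratings below are made up for illustration):

<code>
"testID","predicted_rating"
1,3.50
2,4.00
3,2.75
</code>
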
Performance of your prediction engine will be based on RMSE:
  
$$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i} (\mathrm{prediction}_i - \mathrm{truth}_i)^2 } $$

where ''N'' is the number of predictions.
  
**Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!**
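
As a sanity check, a minimal sketch of computing this RMSE on your own held-out validation set (assuming you have two equal-length arrays of predicted and true ratings) might look like:

<code python>
import numpy as np

def rmse(predictions, truth):
    # root mean squared error, matching the formula above (including the 1/N)
    predictions = np.asarray(predictions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.sqrt(np.mean((predictions - truth) ** 2))

# example usage, with hypothetical arrays of validation predictions and ratings:
# print(rmse(validation_preds, validation_truth))
</code>
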
This writeup must include three main sections:
  
  - **A discussion of your exploration of the dataset.**
    - Before you start coding, you should look at the data.  What does it include?  What patterns do you see?
    - Any visualizations about the data you deem relevant
  - **A clear, technical description of your approach.**  This section should include:
    - Background on the approach
    - Description of the model you use
    - Description of the inference / training algorithm you use
    - Description of how you partitioned your data into a test/training split
  - **An analysis of how your approach worked on the dataset.**
    - What was your final RMSE on your private test/training split?
    - Did you overfit?  How do you know?

Here is some code to get you started exploring the data and creating a test/train split:

<code python>
import seaborn
import pandas
import numpy as np
import matplotlib.pyplot as plt

# load the training ratings (tab-separated)
ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t')

# quick look at the distribution of ratings
plt.hist( ur['rating'] )

# create a test/train split: hold out 85,000 ratings for validation
all_inds = np.random.permutation( len(ur) )
test_inds = all_inds[0:85000]
train_inds = all_inds[85000:len(ur)]

ur_test = ur.iloc[ test_inds ]
ur_train = ur.iloc[ train_inds ]
</code>

And here is some code that writes out the prediction file that you will submit:

<code python>
import numpy as np
import pandas as pd

# load the list of ''user,movie'' pairs that need predictions
pred_array = pd.read_table('predictions.dat')
test_ids = pred_array['testID']
pred_array.head()

N = pred_array.shape[0]
my_preds = np.zeros((N,1))

for i in range(N):  # prediction loop
    predicted_rating = 3  # this predicts everything as 3
    my_preds[ i, 0 ] = predicted_rating

# write the submission file in the required testID,predicted_rating format
sfile = open( 'predictions.csv', 'w' )
sfile.write( '"testID","predicted_rating"\n' )
for i in range( 0, N ):
    sfile.write( '%d,%.2f\n' % (test_ids.iloc[i], my_preds[i,0]) )
sfile.close()
</code>
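
If you want a slightly stronger starting point than predicting a constant 3, one simple baseline (not required for the lab) is to predict each movie's mean training rating, falling back to the global mean for movies with no training ratings.  A minimal sketch, assuming the ratings and predictions files both contain a ''movieID'' column (check ''readme.txt'' for the actual column name):

<code python>
import pandas as pd

# Baseline sketch: per-movie mean rating with a global-mean fallback.
# 'movieID' is an assumed column name -- check readme.txt for the real one.
ur = pd.read_table('user_ratedmovies_train.dat')
pred_array = pd.read_table('predictions.dat')

global_mean = ur['rating'].mean()
movie_means = ur.groupby('movieID')['rating'].mean()

sfile = open('predictions.csv', 'w')
sfile.write('"testID","predicted_rating"\n')
for i in range(len(pred_array)):
    movie = pred_array['movieID'].iloc[i]
    rating = movie_means.get(movie, global_mean)
    sfile.write('%d,%.2f\n' % (pred_array['testID'].iloc[i], rating))
sfile.close()
</code>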