Differences

This shows you the differences between two versions of the page.

--- cs401r_w2016:lab12 [2016/03/28 16:21] admin
+++ cs401r_w2016:lab12 [2021/06/30 23:42] (current)
@@ Line 9: / Line 9: @@
   - A notebook containing your code, but we will not run it.
-  - A set of predictions for a specific list of <user,movie> pairs.
+  - A set of predictions for a specific list of <user,movie> pairs, in a CSV file.
-  - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data.
+  - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data.  Markdown format, please!!
 We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab.
@@ Line 21: / Line 21: @@
 Your entry will be graded on the following elements:
-  * 100% Project writeup
+  * 85% Project writeup
-    * 50% Description of method
+    * 30% Exploratory data analysis
-    * 25% Correct implementation of proposed method
+    * 30% Description of technical approach
     * 25% Analysis of performance of method
+  * 15% Submission of predictions csv file
   * 10% extra credit for the three top predictions
@@ Line 34: / Line 35: @@
 The training set you will use can be downloaded here:
-[[http://hatch.cs.byu.edu/courses/stat_ml/movie_training_data.tar.gz|Movie ratings training data]]
+[[https://www.dropbox.com/s/7r4rebn1ytugi4g/movie_training_data.tar.gz?dl=0|Movie ratings training data]]
 You will need to make predictions for a set of ''user,movie'' pairs.  These can be downloaded here:
-[[http://hatch.cs.byu.edu/courses/stat_ml/predictions.dat|Movie predictions data]]
+[[https://www.dropbox.com/s/2xvsrju9w3fqfon/predictions.dat?dl=0|Movie predictions data]]
-A complete description of the data can be found in the ''readme.txt'' file.  This dataset is richer than the Netflix competition dataset; for each movie, you also have a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie.
+A complete description of the data can be found in the ''readme.txt'' file.  This dataset is richer than the Netflix competition dataset; for each movie, you also have a director and genre information, a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie.
 You should start by looking at the ''user_ratedmovies_train.dat'' file.  It is a tab-separated file containing user, movie, rating, and timestamp fields that form the core training data.  Everything else is auxiliary data that may or may not be useful.
@@ Line 46: / Line 47: @@
 **Turning in your submissions**
-As part of this lab, you must submit a set of predictions.  You must provide predictions as a simple CSV file with three columns and 85,000 rows.  Each row has the form
+As part of this lab, you must submit a set of predictions.  You must provide predictions as a simple CSV file with two columns and 85,000 rows.  Each row has the form
-''testID,predicted rating''
+''testID,predicted_rating''
 The ''testID'' field uniquely identifies each ''user,movie'' prediction pair in the predictions set.
@@ Line 56: / Line 57: @@
 Performance of your prediction engine will be based on RMSE:
-$$ \mathrm{RMSE} = \sqrt{ \sum_{i} (\mathrm{prediction_i} - \mathrm{truth_i})^2 } $$
+$$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i} (\mathrm{prediction_i} - \mathrm{truth_i})^2 } $$
 **Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!**
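
A minimal sketch of the RMSE formula above, assuming the predictions and the true ratings are available as two equal-length arrays. The function name ''rmse'' and the toy values are illustrative and are not part of the lab materials.

```python
import numpy as np

def rmse(predictions, truth):
    """Root-mean-square error between two equal-length sequences of ratings."""
    predictions = np.asarray(predictions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    if predictions.shape != truth.shape:
        raise ValueError("predictions and truth must have the same shape")
    # sqrt( (1/N) * sum_i (prediction_i - truth_i)^2 )
    return float(np.sqrt(np.mean((predictions - truth) ** 2)))

# Toy values, not taken from the lab data
example_truth = [4.0, 3.0, 5.0, 2.0]
example_preds = [3.0, 3.0, 3.0, 3.0]
example_rmse = rmse(example_preds, example_truth)
print('RMSE: %.4f' % example_rmse)
```

Computing this on a held-out validation set, rather than on the data used for training, is what gives an estimate of generalization performance.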
@@ Line 66: / Line 67: @@
 This writeup must include three main sections:
-* A discussion of your exploration of the dataset
+  - **A discussion of your exploration of the dataset**.
-  * Before you start coding, you should look at the data.  What does it include?  What patterns do you see?
+    - Before you start coding, you should look at the data.  What does it include?  What patterns do you see?
-  * Any visualizations about the data you deem relevant
+    - Any visualizations about the data you deem relevant
-* A clear, technical description of your approach.  This section should include:
+  - **A clear, technical description of your approach.**  This section should include:
-  * Background on the approach
+    - Background on the approach
-  * Description of the model you use
+    - Description of the model you use
-  * Description of the inference / training algorithm you use
+    - Description of the inference / training algorithm you use
-  * Description of how you partitioned your data into a test/training split
+    - Description of how you partitioned your data into a test/training split
-* An analysis of how your approach worked on the dataset
+  - **An analysis of how your approach worked on the dataset**
-  * What was your final RMSE on your private test/training split?
+    - What was your final RMSE on your private test/training split?
-  * Did you overfit?  How do you know?
+    - Did you overfit?  How do you know?
-  * Was your first algorithm the one you ultimately used for your submission?  Why did you (or didn't you) iterate your design?
+    - Was your first algorithm the one you ultimately used for your submission?  Why did you (or didn't you) iterate your design?
@@ Line 88: / Line 89: @@
 import seaborn
 import pandas
+import numpy as np
 ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t')
 plt.hist( ur['rating'] )
+# create a test/train split
+all_inds = np.random.permutation( range(0,len(ur)) )
+test_inds = all_inds[0:85000]
+train_inds = all_inds[85000:len(ur)]
+ur_test = ur.iloc[ test_inds ]
+ur_train = ur.iloc[ train_inds ]
 </code>
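
Once a split like the one above exists, a simple baseline gives a reference RMSE to beat. The following is only a sketch on synthetic data: the column names ''userID'' and ''movieID'' are assumptions (only ''rating'' appears in the lab's own code), and the movie-mean predictor is an illustrative baseline, not the method the lab prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ratings table; 'userID' and 'movieID' are
# assumed column names, only 'rating' appears in the lab's own code.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'userID': rng.integers(0, 50, size=2000),
    'movieID': rng.integers(0, 100, size=2000),
    'rating': rng.integers(1, 11, size=2000) / 2.0,  # values 0.5 .. 5.0
})

# Same style of random split as above, with a smaller held-out set
all_inds = rng.permutation(len(toy))
toy_test = toy.iloc[all_inds[:400]]
toy_train = toy.iloc[all_inds[400:]]

# Baseline: each movie's mean training rating, falling back to the
# global training mean for movies that never appear in the training part
global_mean = toy_train['rating'].mean()
movie_means = toy_train.groupby('movieID')['rating'].mean()
baseline_preds = toy_test['movieID'].map(movie_means).fillna(global_mean)

# RMSE of the baseline on the held-out rows
errors = baseline_preds.to_numpy() - toy_test['rating'].to_numpy()
baseline_rmse = float(np.sqrt(np.mean(errors ** 2)))
print('held-out RMSE of movie-mean baseline: %.3f' % baseline_rmse)
```

Because the statistics are computed only from the training rows, the held-out RMSE is not contaminated by the ratings being predicted.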
+And here is some code that writes out the prediction file that you will submit:
+<code python>
+import numpy as np
+import pandas as pd
+pred_array = pd.read_table('predictions.dat')
+test_ids = pred_array[["testID"]]
+pred_array.head()
+N = pred_array.shape[0]
+my_preds = np.zeros((N,1))
+for id in range(N): ### Prediction loop
+    predicted_rating = 3
+    my_preds[ id, 0 ] = predicted_rating ### This Predicts everything as 3
+with open( 'predictions.csv', 'w' ) as sfile:
+    sfile.write( '"testID","predicted_rating"\n' )
+    for i in range( N ):
+        # index scalars explicitly so %d and %.2f receive plain numbers
+        sfile.write( '%d,%.2f\n' % (test_ids.iloc[i, 0], my_preds[i, 0]) )
+</code>
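
As an alternative to the explicit write loop, the same two-column layout can be produced with ''DataFrame.to_csv'' and then read back as a sanity check. This is a sketch under assumptions: the test IDs here are made up (in the lab they come from ''predictions.dat''), the file name ''predictions_example.csv'' is hypothetical, and ''to_csv'' writes the header without quotes, which may or may not matter to the grader.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: in the lab, the test IDs come from predictions.dat
test_ids = np.arange(5)
my_preds = np.full(len(test_ids), 3.0)  # constant prediction, as in the loop above

# Two columns, one row per test ID, ratings written with two decimals
submission = pd.DataFrame({'testID': test_ids, 'predicted_rating': my_preds})
submission.to_csv('predictions_example.csv', index=False, float_format='%.2f')

# Read the file back and confirm its shape and header
check = pd.read_csv('predictions_example.csv')
n_rows, n_cols = check.shape
print(list(check.columns), n_rows, n_cols)
```

For the real submission, the same check would confirm that the row count matches the number of entries in the predictions set.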
