This shows you the differences between two versions of the page.
|
cs401r_w2016:lab12 [2016/03/28 16:21] admin |
cs401r_w2016:lab12 [2021/06/30 23:42] (current) |
||
|---|---|---|---|
| Line 9: | Line 9: | ||
| - A notebook containing your code, but we will not run it. | - A notebook containing your code, but we will not run it. | ||
| - | - A set of predictions for a specific list of <user,movie> pairs. | + | - A set of predictions for a specific list of <user,movie> pairs, in a CSV file. |
| - | - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data. | + | - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data. Markdown format, please!! |
| We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab. | We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab. | ||
| Line 21: | Line 21: | ||
| Your entry will be graded on the following elements: | Your entry will be graded on the following elements: | ||
| - | * 100% Project writeup | + | * 85% Project writeup |
| - | * 50% Description of method | + | * 30% Exploratory data analysis |
| - | * 25% Correct implementation of proposed method | + | * 30% Description of technical approach |
| * 25% Analysis of performance of method | * 25% Analysis of performance of method | ||
| + | * 15% Submission of predictions csv file | ||
| * 10% extra credit for the three top predictions | * 10% extra credit for the three top predictions | ||
| Line 34: | Line 35: | ||
| The training set you will use can be downloaded here: | The training set you will use can be downloaded here: | ||
| - | [[http://hatch.cs.byu.edu/courses/stat_ml/movie_training_data.tar.gz|Movie ratings training data]] | + | [[https://www.dropbox.com/s/7r4rebn1ytugi4g/movie_training_data.tar.gz?dl=0|Movie ratings training data]] |
| You will need to make predictions for a set of ''user,movie'' pairs. These can be downloaded here: | You will need to make predictions for a set of ''user,movie'' pairs. These can be downloaded here: | ||
| - | [[http://hatch.cs.byu.edu/courses/stat_ml/predictions.dat|Movie predictions data]] | + | [[https://www.dropbox.com/s/2xvsrju9w3fqfon/predictions.dat?dl=0|Movie predictions data]] |
| - | A complete description of the data can be found in the ''readme.txt'' file. This dataset is richer than the Netflix competition dataset; for each movie, you also have a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie. | + | A complete description of the data can be found in the ''readme.txt'' file. This dataset is richer than the Netflix competition dataset; for each movie, you also have director and genre information, a corresponding IMDB ID, some RottenTomatoes information, and a set of tags that users may have used when rating each movie. |
| You should start by looking at the ''user_ratedmovies_train.dat'' file. It is a tab-separated file containing user, movie, rating, and timestamp fields that form the core training data. Everything else is auxiliary data that may or may not be useful. | You should start by looking at the ''user_ratedmovies_train.dat'' file. It is a tab-separated file containing user, movie, rating, and timestamp fields that form the core training data. Everything else is auxiliary data that may or may not be useful. | ||
| Line 46: | Line 47: | ||
| **Turning in your submissions** | **Turning in your submissions** | ||
| - | As part of this lab, you must submit a set of predictions. You must provide predictions as a simple CSV file with three columns and 85,000 rows. Each row has the form | + | As part of this lab, you must submit a set of predictions. You must provide predictions as a simple CSV file with two columns and 85,000 rows. Each row has the form |
| - | ''testID,predicted rating'' | + | ''testID,predicted_rating'' |
| The ''testID'' field uniquely identifies each ''user,movie'' prediction pair in the predictions set. | The ''testID'' field uniquely identifies each ''user,movie'' prediction pair in the predictions set. | ||
| Line 56: | Line 57: | ||
| Performance of your prediction engine will be based on RMSE: | Performance of your prediction engine will be based on RMSE: | ||
| - | $$ \mathrm{RMSE} = \sqrt{ \sum_{i} (\mathrm{prediction_i} - \mathrm{truth_i})^2 } $$ | + | $$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i} (\mathrm{prediction_i} - \mathrm{truth_i})^2 } $$ |
| **Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!** | **Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!** | ||
| Line 66: | Line 67: | ||
| This writeup must include three main sections: | This writeup must include three main sections: | ||
| - | * A discussion of your exploration of the dataset | + | - **A discussion of your exploration of the dataset**. |
| - | * Before you start coding, you should look at the data. What does it include? What patterns do you see? | + | - Before you start coding, you should look at the data. What does it include? What patterns do you see? |
| - | * Any visualizations about the data you deem relevant | + | - Any visualizations about the data you deem relevant |
| - | * A clear, technical description of your approach. This section should include: | + | - **A clear, technical description of your approach.** This section should include: |
| - | * Background on the approach | + | - Background on the approach |
| - | * Description of the model you use | + | - Description of the model you use |
| - | * Description of the inference / training algorithm you use | + | - Description of the inference / training algorithm you use |
| - | * Description of how you partitioned your data into a test/training split | + | - Description of how you partitioned your data into a test/training split |
| - | * An analysis of how your approach worked on the dataset | + | - **An analysis of how your approach worked on the dataset** |
| - | * What was your final RMSE on your private test/training split? | + | - What was your final RMSE on your private test/training split? |
| - | * Did you overfit? How do you know? | + | - Did you overfit? How do you know? |
| - | * Was your first algorithm the one you ultimately used for your submission? Why did you (or didn't you) iterate your design? | + | - Was your first algorithm the one you ultimately used for your submission? Why did you (or didn't you) iterate your design? |
| Line 88: | Line 89: | ||
| import seaborn | import seaborn | ||
| import pandas | import pandas | ||
| + | import numpy as np | ||
| ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t') | ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t') | ||
| plt.hist( ur['rating'] ) | plt.hist( ur['rating'] ) | ||
| + | |||
| + | # create a test/train split | ||
| + | |||
| + | all_inds = np.random.permutation( range(0,len(ur)) ) | ||
| + | test_inds = all_inds[0:85000] | ||
| + | train_inds = all_inds[85000:len(ur)] | ||
| + | |||
| + | ur_test = ur.iloc[ test_inds ] | ||
| + | ur_train = ur.iloc[ train_inds ] | ||
| + | |||
| </code> | </code> | ||
| + | |||
| + | And here is some code that writes out the prediction file you will submit: | ||
| + | |||
| + | <code python> | ||
| + | |||
| + | import numpy as np | ||
| + | import pandas as pd | ||
| + | |||
| + | pred_array = pd.read_table('predictions.dat') | ||
| + | test_ids = pred_array[["testID"]] | ||
| + | pred_array.head() | ||
| + | |||
| + | N = pred_array.shape[0] | ||
| + | my_preds = np.zeros((N,1)) | ||
| + | |||
| + | for id in range(N): ### Prediction loop | ||
| + |     predicted_rating = 3 | ||
| + |     my_preds[ id, 0 ] = predicted_rating ### This predicts everything as 3 | ||
| + | |||
| + | sfile = open( 'predictions.csv', 'w' ) | ||
| + | sfile.write( '"testID","predicted_rating"\n' ) | ||
| + | for id in range( 0, N ): | ||
| + |     sfile.write( '%d,%.2f\n' % (test_ids.iloc[id, 0], my_preds[id, 0]) ) | ||
| + | sfile.close() | ||
| + | |||
| + | </code> | ||
| + | |||
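
The note above about holding out a validation set, together with the RMSE formula, can be sketched as follows. This is only a minimal illustration, not the grading script: the helper names (`rmse`, `split_frame`, `movie_mean_baseline`) are made up for this example, the `userID` and `movieID` column names are assumptions (only `rating` appears in the starter code), and the data frame is a small synthetic stand-in for ''user_ratedmovies_train.dat''.

```python
import numpy as np
import pandas as pd


def rmse(predictions, truth):
    """Root-mean-squared error: sqrt of the mean squared difference."""
    predictions = np.asarray(predictions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((predictions - truth) ** 2)))


def split_frame(ratings, n_test, seed=0):
    """Randomly hold out n_test rows; returns (train, test)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(ratings))
    return ratings.iloc[order[n_test:]], ratings.iloc[order[:n_test]]


def movie_mean_baseline(train, test):
    """Predict each held-out rating as that movie's mean training rating.

    Movies that never appear in the training part fall back to the
    global training mean. Assumes a 'movieID' column (hypothetical name).
    """
    global_mean = train["rating"].mean()
    movie_means = train.groupby("movieID")["rating"].mean()
    return test["movieID"].map(movie_means).fillna(global_mean).to_numpy()


if __name__ == "__main__":
    # Synthetic stand-in for the real ratings file (values are made up).
    rng = np.random.default_rng(1)
    toy = pd.DataFrame({
        "userID": rng.integers(0, 50, size=1000),
        "movieID": rng.integers(0, 20, size=1000),
        "rating": rng.integers(1, 11, size=1000) / 2.0,
    })
    train, test = split_frame(toy, n_test=200)
    constant_preds = np.full(len(test), 3.0)
    print("constant-3 RMSE:", rmse(constant_preds, test["rating"]))
    print("movie-mean RMSE:", rmse(movie_mean_baseline(train, test), test["rating"]))
```

Comparing any model against simple baselines like these on the held-out rows gives a quick sanity check before you write out ''predictions.csv''; if a model cannot beat the per-movie mean on your own validation split, it is unlikely to do well on the real prediction set.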