This shows you the differences between two versions of the page.
| Both sides previous revision Previous revision Next revision | Previous revision | ||
|
cs401r_w2016:lab12 [2016/03/28 16:29] admin |
cs401r_w2016:lab12 [2021/06/30 23:42] (current) |
||
|---|---|---|---|
| Line 10: | Line 10: | ||
| - A notebook containing your code, but we will not run it. | - A notebook containing your code, but we will not run it. | ||
| - A set of predictions for a specific list of <user,movie> pairs, in a CSV file. | - A set of predictions for a specific list of <user,movie> pairs, in a CSV file. | ||
| - | - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data. PDF format, please! | + | - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data. Markdown format, please!! |
| We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab. | We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab. | ||
| Line 21: | Line 21: | ||
| Your entry will be graded on the following elements: | Your entry will be graded on the following elements: | ||
| - | * 100% Project writeup | + | * 85% Project writeup |
| - | * 35% Exploratory data analysis | + | * 30% Exploratory data analysis |
| - | * 35% Description of technical approach | + | * 30% Description of technical approach |
| - | * 30% Analysis of performance of method | + | * 25% Analysis of performance of method |
| + | * 15% Submission of predictions csv file | ||
| * 10% extra credit for the three top predictions | * 10% extra credit for the three top predictions | ||
| Line 34: | Line 35: | ||
| The training set you will use can be downloaded here: | The training set you will use can be downloaded here: | ||
| - | [[http://hatch.cs.byu.edu/courses/stat_ml/movie_training_data.tar.gz|Movie ratings training data]] | + | [[https://www.dropbox.com/s/7r4rebn1ytugi4g/movie_training_data.tar.gz?dl=0|Movie ratings training data]] |
| You will need to make predictions for a set of ''user,movie'' pairs. These can be downloaded here: | You will need to make predictions for a set of ''user,movie'' pairs. These can be downloaded here: | ||
| - | [[http://hatch.cs.byu.edu/courses/stat_ml/predictions.dat|Movie predictions data]] | + | [[https://www.dropbox.com/s/2xvsrju9w3fqfon/predictions.dat?dl=0|Movie predictions data]] |
| A complete description of the data can be found in the ''readme.txt'' file. This dataset is richer than the Netflix competition dataset; for each movie, you also have director and genre information, a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie. | A complete description of the data can be found in the ''readme.txt'' file. This dataset is richer than the Netflix competition dataset; for each movie, you also have director and genre information, a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie. | ||
| Line 46: | Line 47: | ||
| **Turning in your submissions** | **Turning in your submissions** | ||
| - | As part of this lab, you must submit a set of predictions. You must provide predictions as a simple CSV file with three columns and 85,000 rows. Each row has the form | + | As part of this lab, you must submit a set of predictions. You must provide predictions as a simple CSV file with two columns and 85,000 rows. Each row has the form |
| - | ''testID,predicted rating'' | + | ''testID,predicted_rating'' |
| The ''testID'' field uniquely identifies each ''user,movie'' prediction pair in the predictions set. | The ''testID'' field uniquely identifies each ''user,movie'' prediction pair in the predictions set. | ||
| Line 56: | Line 57: | ||
| Performance of your prediction engine will be based on RMSE: | Performance of your prediction engine will be based on RMSE: | ||
| - | $$ \mathrm{RMSE} = \sqrt{ \sum_{i} (\mathrm{prediction_i} - \mathrm{truth_i})^2 } $$ | + | $$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i} (\mathrm{prediction_i} - \mathrm{truth_i})^2 } $$ |
| **Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!** | **Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!** | ||
| Line 88: | Line 89: | ||
| import seaborn | import seaborn | ||
| import pandas | import pandas | ||
| + | import numpy as np | ||
| ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t') | ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t') | ||
| plt.hist( ur['rating'] ) | plt.hist( ur['rating'] ) | ||
| + | |||
| + | # create a test/train split | ||
| + | |||
| + | all_inds = np.random.permutation( len(ur) ) | ||
| + | test_inds = all_inds[0:85000] | ||
| + | train_inds = all_inds[85000:len(ur)] | ||
| + | |||
| + | ur_test = ur.iloc[ test_inds ] | ||
| + | ur_train = ur.iloc[ train_inds ] | ||
| + | |||
| </code> | </code> | ||
| + | |||
| + | And here is some code that writes out the prediction file that you will submit: | ||
| + | |||
| + | <code python> | ||
| + | |||
| + | import numpy as np | ||
| + | import pandas as pd | ||
| + | |||
| + | pred_array = pd.read_table('predictions.dat') | ||
| + | test_ids = pred_array[["testID"]] | ||
| + | pred_array.head() | ||
| + | |||
| + | N = pred_array.shape[0] | ||
| + | my_preds = np.zeros((N,1)) | ||
| + | |||
| + | for i in range(N): ### Prediction loop | ||
| + |     predicted_rating = 3 | ||
| + |     my_preds[ i, 0 ] = predicted_rating ### This predicts everything as 3 | ||
| + | |||
| + | with open( 'predictions.csv', 'w' ) as sfile: | ||
| + |     sfile.write( '"testID","predicted_rating"\n' ) | ||
| + |     for i in range( N ): | ||
| + |         sfile.write( '%d,%.2f\n' % (test_ids.iloc[i, 0], my_preds[i, 0]) ) | ||
| + | |||
| + | </code> | ||
| + | |||
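
The RMSE formula and the train/validation split in the current revision can be combined into a quick self-check before submitting. The following is only a minimal sketch: the `rmse` helper, the toy `ur` table (standing in for `user_ratedmovies_train.dat`), the validation size of 2, and the mean-rating baseline are all assumptions made for illustration, not part of the lab materials.

```python
import numpy as np
import pandas as pd


def rmse(predictions, truth):
    """Root mean squared error: sqrt( (1/N) * sum_i (prediction_i - truth_i)^2 )."""
    predictions = np.asarray(predictions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((predictions - truth) ** 2)))


# Hypothetical toy ratings; in the lab these would come from the training file.
ur = pd.DataFrame({
    "userID":  [1, 1, 2, 2, 3, 3, 4, 4],
    "movieID": [10, 20, 10, 30, 20, 30, 10, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 1.5, 4.0],
})

# Shuffle the row indices and hold out a small validation set.
rng = np.random.default_rng(0)
all_inds = rng.permutation(len(ur))
n_val = 2  # assumed size for the toy data; the lab example holds out 85000 rows
ur_val = ur.iloc[all_inds[:n_val]]
ur_train = ur.iloc[all_inds[n_val:]]

# Baseline (an assumption, not the lab's method): predict the training mean everywhere.
baseline = ur_train["rating"].mean()
val_preds = np.full(len(ur_val), baseline)

val_rmse = rmse(val_preds, ur_val["rating"])
print("validation RMSE of the mean-rating baseline: %.3f" % val_rmse)
```

Any real model can be swapped in for the baseline; comparing its validation RMSE against the constant-prediction number gives a rough sense of whether it generalizes.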