This shows you the differences between two versions of the page.
cs401r_w2016:lab12 [2016/03/28 16:25] admin |
cs401r_w2016:lab12 [2021/06/30 23:42] (current) |
||
---|---|---|---|
Line 9: | Line 9: | ||
- A notebook containing your code, but we will not run it. | - A notebook containing your code, but we will not run it. | ||
- | - A set of predictions for a specific list of <user,movie> pairs. | + | - A set of predictions for a specific list of <user,movie> pairs, in a CSV file. |
- | - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data. | + | - A report discussing your approach, how well it worked (in terms of RMSE), and any visualizations or patterns you found in the data. Markdown format, please!! |
We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab. | We will run a small "competition" on your predictions: the three students with the best predictions will get 10% extra credit on this lab. | ||
Line 21: | Line 21: | ||
Your entry will be graded on the following elements: | Your entry will be graded on the following elements: | ||
- | * 100% Project writeup | + | * 85% Project writeup |
- | * 35% Exploratory data analysis | + | * 30% Exploratory data analysis |
- | * 35% Description of technical approach | + | * 30% Description of technical approach |
- | * 30% Analysis of performance of method | + | * 25% Analysis of performance of method |
+ | * 15% Submission of predictions csv file | ||
* 10% extra credit for the three top predictions | * 10% extra credit for the three top predictions | ||
Line 34: | Line 35: | ||
The training set you will use can be downloaded here: | The training set you will use can be downloaded here: | ||
- | [[http://hatch.cs.byu.edu/courses/stat_ml/movie_training_data.tar.gz|Movie ratings training data]] | + | [[https://www.dropbox.com/s/7r4rebn1ytugi4g/movie_training_data.tar.gz?dl=0|Movie ratings training data]] |
You will need to make predictions for a set of ''user,movie'' pairs. These can be downloaded here: | You will need to make predictions for a set of ''user,movie'' pairs. These can be downloaded here: | ||
- | [[http://hatch.cs.byu.edu/courses/stat_ml/predictions.dat|Movie predictions data]] | + | [[https://www.dropbox.com/s/2xvsrju9w3fqfon/predictions.dat?dl=0|Movie predictions data]] |
A complete description of the data can be found in the ''readme.txt'' file. This dataset is richer than the Netflix competition dataset; for each movie, you also have director and genre information, a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie. | A complete description of the data can be found in the ''readme.txt'' file. This dataset is richer than the Netflix competition dataset; for each movie, you also have director and genre information, a corresponding IMDB ID, some RottenTomatoes information, as well as a set of tags that users may have used when rating each movie. | ||
Line 46: | Line 47: | ||
**Turning in your submissions** | **Turning in your submissions** | ||
- | As part of this lab, you must submit a set of predictions. You must provide predictions as a simple CSV file with three columns and 85,000 rows. Each row has the form | + | As part of this lab, you must submit a set of predictions. You must provide predictions as a simple CSV file with two columns and 85,000 rows. Each row has the form |
- | ''testID,predicted rating'' | + | ''testID,predicted_rating'' |
The ''testID'' field uniquely identifies each ''user,movie'' prediction pair in the predictions set. | The ''testID'' field uniquely identifies each ''user,movie'' prediction pair in the predictions set. | ||
Line 56: | Line 57: | ||
Performance of your prediction engine will be based on RMSE: | Performance of your prediction engine will be based on RMSE: | ||
- | $$ \mathrm{RMSE} = \sqrt{ \sum_{i} (\mathrm{prediction_i} - \mathrm{truth_i})^2 } $$ | + | $$ \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i} (\mathrm{prediction_i} - \mathrm{truth_i})^2 } $$ |
**Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!** | **Note: it is strongly encouraged that you first partition your dataset into a training and a validation set, to assess the generalization performance of your rating algorithm!** | ||
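The RMSE formula above can be checked on a held-out validation set with a few lines of NumPy. This is a minimal sketch: the function name ''rmse'' and the toy ratings are illustrative assumptions, not part of the lab materials.

```python
import numpy as np

def rmse(predictions, truth):
    """Root-mean-squared error: sqrt( (1/N) * sum_i (prediction_i - truth_i)^2 )."""
    predictions = np.asarray(predictions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    if predictions.shape != truth.shape:
        raise ValueError("predictions and truth must have the same shape")
    return float(np.sqrt(np.mean((predictions - truth) ** 2)))

# Made-up ratings for illustration only (not taken from the lab data).
toy_truth = [4.0, 3.0, 5.0, 2.0]
toy_preds = [3.0, 3.0, 3.0, 3.0]   # a constant "predict 3" baseline
print(rmse(toy_preds, toy_truth))
```

Because of the 1/N factor, the score does not grow with the number of test pairs, so a validation RMSE computed on a small held-out set is comparable to the RMSE on the full prediction set.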
Line 88: | Line 89: | ||
import seaborn | import seaborn | ||
import pandas | import pandas | ||
+ | import numpy as np | ||
ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t') | ur = pandas.read_csv('user_ratedmovies_train.dat', sep='\t') | ||
plt.hist( ur['rating'] ) | plt.hist( ur['rating'] ) | ||
+ | |||
+ | # create a test/train split | ||
+ | |||
+ | all_inds = np.random.permutation( range(0,len(ur)) ) | ||
+ | test_inds = all_inds[0:85000] | ||
+ | train_inds = all_inds[85000:len(ur)] | ||
+ | |||
+ | ur_test = ur.iloc[ test_inds ] | ||
+ | ur_train = ur.iloc[ train_inds ] | ||
+ | |||
</code> | </code> | ||
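A split like the one above can be used to score a simple baseline before trying anything more elaborate. The sketch below predicts each movie's mean training rating and falls back to the global mean for movies not seen in training. It runs on a small made-up table standing in for ''ur''; the column names ''userID'' and ''movieID'' are assumptions, so check ''readme.txt'' for the real ones.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `ur`. Only 'rating' is confirmed by the lab text;
# 'userID' and 'movieID' are assumed column names.
toy = pd.DataFrame({
    "userID":  [1, 1, 2, 2, 3, 3, 4, 4],
    "movieID": [10, 20, 10, 30, 20, 30, 10, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 3.5, 2.5, 4.5, 3.0],
})

# Random split into a validation part and a training part.
rng = np.random.default_rng(0)
perm = rng.permutation(len(toy))
val, train = toy.iloc[perm[:2]], toy.iloc[perm[2:]]

# Fit the baseline on the training part only.
global_mean = train["rating"].mean()
movie_means = train.groupby("movieID")["rating"].mean()

# Movies unseen in training fall back to the global mean.
val_preds = val["movieID"].map(movie_means).fillna(global_mean)

val_rmse = float(np.sqrt(np.mean((val_preds - val["rating"]) ** 2)))
print("validation RMSE: %.3f" % val_rmse)
```

Any model you build should beat this kind of baseline on the validation set; if it does not, that is usually a sign of a bug or of overfitting.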
+ | |||
+ | And here is some code that writes out the prediction file you will submit: | ||
+ | |||
+ | <code python> | ||
+ | |||
+ | import numpy as np | ||
+ | import pandas as pd | ||
+ | |||
+ | pred_array = pd.read_table('predictions.dat') | ||
+ | test_ids = pred_array[["testID"]] | ||
+ | pred_array.head() | ||
+ | |||
+ | N = pred_array.shape[0] | ||
+ | my_preds = np.zeros((N,1)) | ||
+ | |||
+ | for i in range(N): ### Prediction loop | ||
+ |     predicted_rating = 3 | ||
+ |     my_preds[ i, 0 ] = predicted_rating ### This predicts everything as 3 | ||
+ | | ||
+ | # write the header, then one "testID,predicted_rating" row per prediction | ||
+ | with open( 'predictions.csv', 'w' ) as sfile: | ||
+ |     sfile.write( '"testID","predicted_rating"\n' ) | ||
+ |     for i in range( 0, N ): | ||
+ |         sfile.write( '%d,%.2f\n' % (test_ids.iloc[i, 0], my_preds[i, 0] ) ) | ||
+ | |||
+ | </code> | ||
+ |
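Before submitting, it may be worth checking that the file you wrote has the shape described under "Turning in your submissions". The helper below is a hypothetical sketch, not part of the lab's grading code: the name ''check_submission'' is made up, and it assumes the 85,000 rows count data rows, not the header line.

```python
import os
import tempfile
import pandas as pd

def check_submission(path, expected_rows=85000):
    """Return a list of problems found in a predictions CSV (empty if none)."""
    problems = []
    sub = pd.read_csv(path)
    if list(sub.columns) != ["testID", "predicted_rating"]:
        problems.append("unexpected columns: %s" % list(sub.columns))
        return problems
    if len(sub) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(sub)))
    if sub["testID"].duplicated().any():
        problems.append("duplicate testID values")
    if sub["predicted_rating"].isna().any():
        problems.append("missing predictions")
    return problems

# Demo on a tiny made-up file written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
with open(demo_path, "w") as f:
    f.write('"testID","predicted_rating"\n0,3.00\n1,3.50\n2,4.00\n')

print(check_submission(demo_path, expected_rows=3))  # no problems for the demo file
print(check_submission(demo_path))                   # flags the row count
```

For the real file, call ''check_submission('predictions.csv')'' and fix anything it reports before uploading.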