====Deliverable:====
  
You will turn in an ipython notebook that uses large-scale Gaussian process regression to make predictions for the [[https://www.kaggle.com/c/rossmann-store-sales|Rossmann Store Sales Kaggle competition]].  Your notebook must do the following:
  
  - Construct a custom kernel that measures similarity between two different training data points
Finally, you should construct an actual prediction for the Kaggle competition, submit it, and have your notebook print the overall prediction MSE and ranking you would have received.
  
**Note:** because this lab is computationally intensive, you might not be able to use the full dataset.  Feel free to subsample the training data to whatever size is manageable (you still need to submit predictions to Kaggle for the entire test set) - but I encourage you to, at some point, run your code on the largest dataset you can!
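If you do subsample, a uniform random subsample is a simple place to start.  A minimal sketch with numpy, using made-up array sizes standing in for the real data:

<code python>
import numpy as np

# Toy stand-ins for the real training data (the real set has ~1M rows).
n_total = 10000
X = np.random.randn(n_total, 5)
y = np.random.randn(n_total)

# Draw a manageable number of training points, without replacement.
n_sub = 2000
idx = np.random.choice(n_total, size=n_sub, replace=False)
X_sub, y_sub = X[idx], y[idx]

print(X_sub.shape, y_sub.shape)
</code>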
  
----
====Description:====
  
The data you will use for this lab comes from the Rossmann Store Sales dataset from Kaggle.  It is posted on the class Dropbox, and direct links are posted here for convenience:
  
[[https://www.dropbox.com/s/wnew5x52zb13oqe/store_train.csv?dl=0|Rossmann store sales training data]]
  
[[https://www.dropbox.com/s/fzqnndl4dkxah0q/store_test.csv?dl=0|Rossmann store sales test data]]

[[https://www.dropbox.com/s/be5fakw1etwdjja/store_info.csv?dl=0|Rossmann store sales auxiliary data]]
  
Since there are 1,017,209 training points, we cannot use naive Gaussian process regression; we cannot construct or invert a matrix that large!
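A quick back-of-the-envelope calculation shows why: a dense $n\times n$ matrix of 64-bit floats, for $n$ = 1,017,209, needs on the order of 8 terabytes just to store:

<code python>
# Memory required for a dense n x n covariance matrix of float64s.
n = 1017209
bytes_needed = n * n * 8  # 8 bytes per 64-bit float
print("%.1f TB" % (bytes_needed / 1e12))  # about 8.3 TB
</code>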
===Part 3: Display of kernel matrix===
  
You should display an $m\times m$ image of your $K$ matrix, where $K_{ij} = kernel(x_i,x_j)$ and $i$ and $j$ range over all of your landmarks.  You are welcome to reuse code from the MNIST KDE lab in order to do the displaying.  **Note:** make sure that you include a colorbar so that we can see the scale of the entries in the matrix.
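A minimal sketch of the display itself, using a made-up RBF kernel and random landmarks standing in for your custom kernel and real landmarks:

<code python>
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen; use plt.show() interactively
import matplotlib.pyplot as plt

# Hypothetical stand-ins: m random landmarks and a simple RBF kernel.
m = 50
landmarks = np.random.randn(m, 3)

def kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

K = np.array([[kernel(xi, xj) for xj in landmarks] for xi in landmarks])

plt.imshow(K, interpolation='nearest')
plt.colorbar()  # the colorbar shows the scale of the entries
plt.title('$K$ over the landmarks')
plt.savefig('kernel_matrix.png')
</code>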
  
===Part 4: Implementation of Subset of Regressors===
  
Please follow this description of the subset of regressors approach.  In particular, on Monday we discussed how you should partition your dataset into $m$ landmarks, and the $n$ rest of your data points.  Don't do that.  Instead, think of the $m$ landmarks as reusing points in your dataset -- so $m+n>n$.  In your dataset, you have $n$ training points, with $n$ x-values and $n$ y-values.  Depending on your landmark selection algorithm, the $m$ landmarks could be the same as some of the training points.  So, for example: if you have $n=1000$ training points, and you randomly pick $m=5$ landmark points, you will effectively have $n+m=1005$ points, but $5$ of those are re-used.
  
So: in all of the math below, the number $n$ refers to **all** of your training data.
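For example, random landmark selection can be as simple as drawing $m$ indices without replacement; a sketch with made-up data (the variable names here are hypothetical):

<code python>
import numpy as np

np.random.seed(0)

# Toy training set: n points; the landmarks are drawn from these.
n = 1000
X = np.random.randn(n, 4)
y = np.random.randn(n)

# Pick m landmark indices.  The landmarks are re-used training points,
# so all n training points still participate in the math below.
m = 5
landmark_idx = np.random.choice(n, size=m, replace=False)
landmarks = X[landmark_idx]

print(landmarks.shape)  # (5, 4)
</code>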
  
Given your set of $m$ landmarks, and for each test point $x_t$, you will need to compute the expected prediction $\mu_t'$:
$$\mu_t' = K_{tm}\left( K_{mn} K_{nm} + \sigma^2 K_{mm}\right)^{-1} K_{mn} y$$
where
  * $K_{tm}$ is the $1\times m$ kernel matrix between $x_t$ and every landmark $x_m$
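One way to organize the prediction above in numpy, with a toy RBF kernel and random data standing in for the real features.  Note that everything except $K_{tm}$ depends only on the training data, so the $m\times m$ system can be solved once and reused for every test point:

<code python>
import numpy as np

np.random.seed(1)

# Toy data; a real solution uses the Rossmann features and your custom kernel.
n, m, sigma2 = 200, 10, 0.1
X = np.random.randn(n, 3)
y = np.random.randn(n)
landmarks = X[np.random.choice(n, m, replace=False)]

def rbf(A, B, gamma=0.5):
    # Pairwise RBF kernel between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K_mn = rbf(landmarks, X)          # m x n
K_nm = K_mn.T                     # n x m
K_mm = rbf(landmarks, landmarks)  # m x m

# Precompute the m x m system: (K_mn K_nm + sigma^2 K_mm)^{-1} K_mn y
A = K_mn @ K_nm + sigma2 * K_mm
w = np.linalg.solve(A, K_mn @ y)  # m-vector, shared by all test points

x_t = np.random.randn(1, 3)       # a single test point
K_tm = rbf(x_t, landmarks)        # 1 x m
mu_t = (K_tm @ w).item()          # expected prediction mu_t'
print(mu_t)
</code>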
====Hints:====
  
Here is an example python script that will compute a simple prediction.  It creates a file called "mean_sub.csv", which you could upload to Kaggle.  **Note:** the ids in the prediction file use 1-based indexing, not 0-based indexing.
  
<code python>
  
    # a little "progress bar"
    print("%.2f (%d/%d)" % ( (1.0*id)/(1.0*N), id, N ))
  
  
sfile = open( 'mean_sub.csv', 'w' )
sfile.write( '"Id","Sales"\n' )
for id in range( 0, N ):
    sfile.write( '%d,%.2f\n' % ( id+1, my_preds[id] ) )  # add one for one-based indexing
sfile.close()
</code>
cs401r_w2016/lab14.1454513072.txt.gz · Last modified: 2021/06/30 23:40 (external edit)