====Deliverable:====
  
You will turn in an ipython notebook that uses large-scale Gaussian process regression to make predictions for the [[https://www.kaggle.com/c/rossmann-store-sales|Rossmann Store Sales Kaggle competition]].  Your notebook must do the following:
  
  - Construct a custom kernel that measures similarity between two different training data points
Finally, you should construct an actual prediction for the Kaggle competition, submit it, and have your notebook print the overall prediction MSE and ranking you would have received.
  
**Note:** because this lab is computationally intensive, you might not be able to use the full dataset.  Feel free to subsample the training data to whatever size is manageable (you still need to submit predictions to Kaggle for the entire test set), but I encourage you to, at some point, run your code on the largest dataset you can!
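
For example (a minimal sketch; the file name and subset size are just placeholders, not requirements), subsampling with ''pandas'' is a one-liner:

<code python>
import pandas

# load the full training set, then keep a random subset of (say) 50,000 rows;
# adjust the size to whatever your machine can handle
data = pandas.read_csv( 'store_train.csv' )
data = data.sample( n=50000, random_state=0 )
</code>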
  
----
====Description:====
  
The data you will use for this lab comes from the Rossmann Store Sales dataset from Kaggle.  It is posted on the class Dropbox, and direct links are posted here for convenience:
  
[[https://www.dropbox.com/s/wnew5x52zb13oqe/store_train.csv?dl=0|Rossmann store sales training data]]
  
[[https://www.dropbox.com/s/fzqnndl4dkxah0q/store_test.csv?dl=0|Rossmann store sales test data]]

[[https://www.dropbox.com/s/be5fakw1etwdjja/store_info.csv?dl=0|Rossmann store sales auxiliary data]]
  
Since there are 1,017,209 training points, we cannot use naive Gaussian process regression; we cannot construct or invert a matrix that large!
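
(To get a sense of scale: a full $n \times n$ kernel matrix would have roughly $10^{12}$ entries, or about 8 terabytes of double-precision values, before you even attempt to invert it.)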
===Part 1: Constructing a composite kernel===
  
You should begin by defining a kernel.  Remember that this must measure similarity between training data points.  If you loaded the data using ''pandas'', then the data points you pass in will be single data frames.  My kernel function looks like this:
  
<code python>
# this is our training data
data = pandas.read_csv( 'store_train.csv' )

kval = kernel( data.iloc[42], data.iloc[16] )
</code>
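
The kernel definition itself is not shown above.  As a rough illustration only (the features, weights, and bandwidth here are placeholders; your kernel should be your own design), a composite kernel over two rows might combine several per-feature similarities:

<code python>
import numpy as np

def kernel( a, b ):
    # a and b are single rows of the training DataFrame

    # same store counts for a lot
    store_sim = 1.0 if a['Store'] == b['Store'] else 0.0

    # same day of the week is also informative
    dow_sim = 1.0 if a['DayOfWeek'] == b['DayOfWeek'] else 0.0

    # Gaussian similarity on the promotion indicator
    promo_sim = np.exp( -( a['Promo'] - b['Promo'] )**2 )

    # a weighted sum of valid kernels is itself a valid kernel
    return 1.0*store_sim + 0.5*dow_sim + 0.25*promo_sim
</code>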
  
===Part 2: Landmark selection===
  
For this part you must implement a landmark selection algorithm.  You are welcome to use a simple random selection of $m$ data points, or you can do something more sophisticated.  Be creative!
  
$m$ should be a prominent parameter in your code, so that it is easily changed.  You should experiment with multiple values of $m$; you may want to use small values while you're debugging, and the largest value your computer can stomach for your final Kaggle submission.

//**Note:** Your landmark data points don't actually have to come from the data set.  You could, for example, create new landmark data points that are averages of existing ones.  Such hypothetical data points could arise, for example, if you used a clustering algorithm to find the landmarks.//
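
As a baseline, a purely random selection is only a couple of lines (a sketch, assuming your training data lives in a pandas DataFrame called ''data''):

<code python>
m = 50  # number of landmarks; keep this easy to find and change

# choose m training rows uniformly at random to serve as landmarks
landmarks = data.sample( n=m, random_state=0 )
</code>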
  
===Part 3: Display of kernel matrix===
  
You should display an $m\times m$ image of your $K$ matrix, where $K_{ij} = kernel(x_i,x_j)$ and $i$ and $j$ range over all of your landmarks.  You are welcome to reuse code from the MNIST KDE lab in order to do the displaying.  **Note:** make sure that you include a colorbar so that we can see the scale of the entries in the matrix.
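
For example (a sketch, assuming the ''kernel'' function and the ''landmarks'' DataFrame from the earlier snippets), you could build and show $K$ like this:

<code python>
import numpy as np
import matplotlib.pyplot as plt

# build the m x m landmark-vs-landmark kernel matrix
K = np.zeros( (m, m) )
for i in range( m ):
    for j in range( m ):
        K[i, j] = kernel( landmarks.iloc[i], landmarks.iloc[j] )

# show it as an image; the colorbar makes the scale of the entries visible
plt.imshow( K, interpolation='nearest' )
plt.colorbar()
plt.show()
</code>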
  
===Part 4: Implementation of Subset of Regressors===
  
Please follow this description of the subset of regressors approach.  In particular, on Monday we discussed partitioning your dataset into $m$ landmarks and the remaining $n$ data points.  Don't do that.  Instead, think of the $m$ landmarks as re-using points in your dataset, so that $m+n>n$.  In your dataset, you have $n$ training points, with $n$ x-values and $n$ y-values.  Depending on your landmark selection algorithm, the $m$ landmarks could be the same as some of the training points.  So, for example: if you have $n=1000$ training points and you randomly pick $m=5$ landmark points, you will effectively have $n+m=1005$ points, but $5$ of those are re-used.
  
So: in all of the math below, the number $n$ refers to **all** of your training data.

Given your set of $m$ landmarks, for each test point $x_t$ you will need to compute the expected prediction $\mu_t'$:
$$\mu_t' = K_{tm}\left( K_{mn} K_{nm} + \sigma^2 K_{mm}\right)^{-1} K_{mn} y$$
where
  * $K_{tm}$ is the $1\times m$ kernel matrix between $x_t$ and every landmark $x_m$
  * $K_{mn}$ is the $m\times n$ kernel matrix between every landmark $x_m$ and every datapoint $x_1, ..., x_n$
  * $K_{mm}$ is the $m\times m$ kernel matrix between every landmark $x_m$ and every other landmark
  * $y$ is a column vector containing the training sales figures (i.e., a vector of 1,017,209 sales numbers)
  * $\sigma^2$ is a parameter you may choose as you like

//Hint: think a bit about what depends on $x_t$ and what does not.  Which calculations can you do once and cache?//
  
Note that the predictive variances are not used; there's no way for Kaggle to accept them.
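
To make the caching hint concrete, here is a rough sketch (the array names ''K_mn'', ''K_mm'', and ''y'', and the ''kernel''/''landmarks'' objects, are assumed from the earlier snippets, not prescribed): everything inside the parentheses can be computed once, and each test prediction then reduces to a small dot product.

<code python>
import numpy as np

sigma_sqr = 1.0  # the sigma^2 noise parameter; choose as you like

# K_mn is m x n, K_mm is m x m, y is the length-n vector of training sales.
# The bracketed term does not depend on the test point, so solve for it once:
alpha = np.linalg.solve( K_mn @ K_mn.T + sigma_sqr * K_mm, K_mn @ y )  # length-m vector; cache this!

def predict( x_t ):
    # K_tm: kernel between the test point and each landmark
    K_tm = np.array( [ kernel( x_t, landmarks.iloc[j] ) for j in range( m ) ] )
    return K_tm @ alpha   # this is mu_t'
</code>

Using ''np.linalg.solve'' instead of forming the inverse explicitly is both faster and more numerically stable.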
====Hints:====
  
Here is an example python script that will compute a simple prediction.  It creates a file called "mean_sub.csv", which you could upload to Kaggle.  **Note:** the ids in the prediction file use 1-based indexing, not 0-based indexing.
  
<code python>
# ... (the beginning of the script, which defines N and fills my_preds, is omitted here) ...
  
    # a little "progress bar"
    print("%.2f (%d/%d)" % ( (1.0*id)/(1.0*N), id, N ))
  
  
sfile = open( 'mean_sub.csv', 'w' )
sfile.write( '"Id","Sales"\n' )
for id in range( 0, N ):
    sfile.write( '%d,%.2f\n' % ( id+1, my_preds[id] ) )  # add one for one-based indexing
sfile.close()
</code>
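
The top of the script is not shown above.  For reference, one way such a per-store mean baseline //might// begin (this is an assumption about the omitted portion, not necessarily the original code) is:

<code python>
import pandas

train = pandas.read_csv( 'store_train.csv' )
test  = pandas.read_csv( 'store_test.csv' )

# mean training sales for each store
store_means = train.groupby( 'Store' )['Sales'].mean()

N = len( test )
my_preds = []
for id in range( 0, N ):
    # predict the training-set mean sales of the corresponding store
    my_preds.append( store_means[ test.iloc[id]['Store'] ] )
    # (the progress-bar print from the snippet above goes here)
</code>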