====Description:====
  
The data you will use for this lab comes from the Rossmann Store Sales dataset from Kaggle.  It is posted on the class Dropbox, and direct links are posted here for convenience:
  
[[https://www.dropbox.com/s/wnew5x52zb13oqe/store_train.csv?dl=0|Rossmann store sales training data]]
  
[[https://www.dropbox.com/s/fzqnndl4dkxah0q/store_test.csv?dl=0|Rossmann store sales test data]]
  
[[https://www.dropbox.com/s/be5fakw1etwdjja/store_info.csv?dl=0|Rossmann store sales auxiliary data]]
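
If it's helpful, here is a minimal sketch of loading these files with pandas; the local file names are assumptions (use whatever names you saved the downloads under):

<code python>
import pandas as pd

# File names are assumptions -- adjust to wherever you saved the downloads.
train = pd.read_csv('store_train.csv')
test = pd.read_csv('store_test.csv')
info = pd.read_csv('store_info.csv')

print(train.shape[0])  # should report the 1,017,209 training rows
</code>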
  
Since there are 1,017,209 training points, we cannot use naive Gaussian process regression; we cannot construct or invert a matrix that large!
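
To get a feel for the scale (a back-of-the-envelope count, assuming double-precision storage): the dense $n \times n$ kernel matrix alone would require

$$ n^2 \times 8 \text{ bytes} = (1{,}017{,}209)^2 \times 8 \approx 8.3 \text{ TB}, $$

and a direct inversion would cost on the order of $n^3 \approx 10^{18}$ floating-point operations.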
===Part 4: Implementation of Subset of Regressors===
  
Please follow this description of the subset of regressors approach.  In particular, on Monday we discussed how you should partition your dataset into $m$ landmarks and the remaining $n$ data points.  Don't do that.  Instead, think of the $m$ landmarks as reusing points in your dataset -- so $m+n>n$.  In your dataset, you have $n$ training points, with $n$ x-values and $n$ y-values.  Depending on your landmark selection algorithm, the $m$ landmarks could be the same as some of the training points.  So, for example: if you have $n=1000$ training points and you randomly pick $m=5$ landmark points, you will effectively have $n+m=1005$ points, but $5$ of those are re-used.
  
So: in all of the math below, the number $n$ refers to **all** of your training data.
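
As a concrete illustration of the example above, here is a minimal sketch of one possible landmark-selection scheme (uniform random reuse of training points); the array names and shapes are assumptions, not part of the lab spec:

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real training data (shapes are hypothetical).
n, d = 1000, 3
X = rng.normal(size=(n, d))   # the n training x-values
y = rng.normal(size=n)        # the n training y-values

# Choose m landmarks by reusing m of the n training points.
# All n points stay in the training set, so n + m = 1005 points
# are in play, 5 of which are re-used.
m = 5
landmark_idx = rng.choice(n, size=m, replace=False)
X_m = X[landmark_idx]         # the m landmark x-values
</code>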