To understand how to cope with large amounts of data, and how to compositionally construct custom kernels that mix several different types of data.
To complete this lab, you will need to create a Kaggle account.
You will turn in an ipython notebook that uses large-scale Gaussian process regression to make predictions for the Rossmann Store Sales Kaggle competition . Your notebook must do the following:
Finally, you should construct an actual prediction for the Kaggle competition, submit it, and have your notebook print the overall prediction MSE and ranking you would have received.
Note: because this lab is computationally intensive, you might not be able to use the full dataset. Feel free to subsample the training data to whatever size is manageable (you still need to submit predictions to Kaggle for the entire test set) - but I encourage you to, at some point, run your code on the largest dataset you can!
Your notebook will be graded on the following:
The data you will use for this lab comes from the Rossmann Store Sales dataset from Kaggle. It is posted on the class Dropbox, and direct links are posted here for convenience:
Rossman store sales training data
Rossman store sales auxiliary data
Since there are 1,017,209 training points, we cannot use naive Gaussian process regression; we cannot construct or invert a matrix that large!
Instead, you will implement the “subset of regressors” algorithm from Lecture 12. This is much more computationally efficient.
You should begin by defining a kernel. Remember that this must measure similarity between training data points. If you loaded the data using pandas
, then the data points you pass in will be single data frames. My kernel function looks like this:
# xi and xj are both data frames consisting of a single element def kernel( xi, xj ): return store_kernel( xi.Store, xj.Store ) + date_kernel( xi.Date, xj.Date ) + dow_kernel( xi.DayOfWeek, xj.DayOfWeek ) + ...
And if you wanted to call the function on data points #42 and #16, it might look like this:
# this is our training data data = pandas.read_csv( 'store_train.csv' ) kval = kernel( data.iloc[42], data.iloc[16] )
For this part you must implement a landmark selection algorithm. You are welcome to use a simple random selection of $m$ data points, or you can do something more sophisticated. Be creative!
$m$ should be a prominent parameter in your code, so that it is easily changed. You should experiment with multiple values of $m$; you may want to use small values while you're debugging, and the largest value your computer can stomach for your final Kaggle submission.
Note: Your landmark data points don't actually have to come from the data set. You could, for example, create new landmark datapoints that involve averages. Such hypothetical data points could arise, for example, if you used a clustering algorithm to find the landmarks.
You should display an $m\times m$ image of your $K$ matrix, where $K_{ij} = kernel(x_i,x_j)$, where $i$ and $j$ should range over all of your landmarks. You are welcome to reuse code from the MNIST KDE lab in order to do the displaying. Note: make sure that you include a colorbar so that we can see the scale of the entries in the matrix.
Please follow this description of the subset of regressors approach. In particular, on Monday we discussed how you should partition your dataset into $m$ landmarks, and the $n$ rest of your data points. Don't do that. Instead, think of the $m$ landmarks as reusing points in your dataset – so $m+n>n$. In your dataset, you have $n$ training points, with $n$ x-values and $n$ y-values. Depending on your landmark selection algorithm, the $m$ landmarks could be the same as some of the training points. So, for example: if you have $n=1000$ training points, and you randomly pick $m=5$ landmark points, you will effectively have $n+m=1005$ points, but $5$ of those are re-used.
So: in all of the math below, the number $n$ refers to all of your training data.
Given your set of $m$ landmarks, and for each test point $x_t$, you will need to compute the expected prediction $\mu_t'$: $$\mu_t' = K_{tm}\left( K_{mn} K_{nm} + \sigma^2 K_{mm}\right)^{-1} K_{mn} y$$ where
Hint: think a bit about what depends on $x_t$ and what does not. What calculations can you do once, and cache?
Note that the predictive variances are not used; there's no way for Kaggle to accept them.
Here is some further reading on the subset of regressors method, along with much more information about scalable Gaussian processes in general.
In order to submit to Kaggle, you must prepare a simple CSV file that contains each prediction. It should contain 41,088 predictions, plus a single header line. To help you prepare this, a full example script is given in the hints section.
Once you have submitted your predictions, record the resulting MSE and your resulting ranking. Your notebook must print these out.
Here is an example python script that will compute a simple prediction. It creates a file called “mean_sub.csv”, which you could upload to Kaggle. Note: the ids in the prediction file use 1-based indexing, not 0-based indexing.
# # Predict the mean of each store/day-of-week combo. # # This results in a MSE of 0.23633, and would have resulted in placing # about 2785th (out of 3303) in the competition. Note that this code is super inefficient. # # Can you do better? # import numpy import pandas # this is our training data data = pandas.read_csv( 'store_train.csv' ) # these are what we need to make predictions for testd = pandas.read_csv( 'store_test.csv' ) N = testd.shape[0] my_preds = numpy.zeros(( N, 1 )) for id in range( 0, N ): # grab a data element one at a time pval = testd.iloc[id] sid = pval.Store dow = pval.DayOfWeek if dow == 7 or pval.Open == 0: # stores are closed on Sunday. Awesome! my_preds[ id, 0 ] = 0 else: # slurp out all data that matches Store and DayOfWeek tmp = data[ (data.Store == sid ) & ( data.DayOfWeek == dow ) ] # super inefficient. Cache these! my_preds[ id, 0 ] = numpy.mean( tmp.Sales ) # a little "progress bar" print("%.2f (%d/%d)" % ( (1.0*id)/(1.0*N), id, N )) sfile = open( 'mean_sub.csv', 'w' ) sfile.write( '"Id","Sales"\n' ) for id in range( 0, N ): sfile.write( '%d,%.2f\n' % ( id+1, my_preds[id] ) ) # add one for one-based indexing sfile.close()