cs501r_f2016:lab2

Differences

This shows you the differences between two versions of the page.

cs501r_f2016:lab2 [2016/08/31 18:23]
wingated
cs501r_f2016:lab2 [2021/06/30 23:42] (current)
Line 1: Line 1:
 ====Objective:====

-To gain experience with python, numpy, and linear classification.  Oh, and to remember all of that linear algebra stuff.  ;)
+To gain experience with python, numpy, and linear classification.
 + 
 +Oh, and to remember all of that linear algebra stuff. ​ ;)
  
 ----
 ====Deliverable:====

-You should turn in an iPython notebook that implements the perceptron algorithm on the Iris dataset.
+You should turn in an iPython notebook that implements the perceptron algorithm on two different datasets: the Iris dataset, and the CIFAR-10 dataset.  Because the perceptron is a binary classifier, we will preprocess the data and "squash" it to create two classes.
 + 
 +Your notebook should also generate a visualization that shows classification accuracy at each iteration, along with the log of the l2 norm of the weight vector, for two different values of the perceptron'​s step size.  Examples of both are shown at the right (for the CIFAR-10 dataset). ​ Since there are two datasets, and there are two visualizations per dataset, your notebook should produce a total of 4 plots. 
 + 
 +**Please cleanly label your axes!** 
 + 
 +{{ :​cs501r_f2016:​lab2_cacc.png?​direct&​200|}} 
 + 
 +{{ :​cs501r_f2016:​lab2_l2norm.png?​direct&​200|}} 
 + 
 +The Iris dataset can be downloaded at the UCI ML repository, or you can download a slightly simpler version here: 
 +[[http://​liftothers.org/​Fisher.csv|http://​liftothers.org/​Fisher.csv]]
  
-Your notebook should also generate a visualization that shows the loss function at each iteration.  This can be generated as a single plot, and shown in the notebook.
+The CIFAR-10 dataset can be downloaded at
 +[[https://​www.cs.toronto.edu/​~kriz/​cifar.html|https://​www.cs.toronto.edu/​~kriz/​cifar.html]]
  
-The dataset can be downloaded at
-[[https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data|The UCI ML repository]]
+**Note: make sure to download the python version of the data - it will simplify your life!**
  
 ----
Line 19: Line 32:

   * 70% Correct implementation of perceptron algorithm
-  * 20% Tidy and legible visualization of loss function
+  * 20% Tidy and legible visualization of weight norm
-  * 10% Tidy and legible final classification rate
+  * 10% Tidy and legible plot of classification accuracy over time

 ----
 ====Description:====
  
 +The purpose of this lab is to help you become familiar with ''​numpy'',​ to remember the basics of classification,​ and to implement the perceptron algorithm. ​ The perceptron algorithm is a simple method of learning a separating hyperplane. ​ It is guaranteed to converge iff the dataset is linearly separable - otherwise, you have to cross your fingers!
  
-For this lab, you will be experimenting with Kernel Density Estimators (see MLAPP 14.7.2).  These are a simple, nonparametric alternative to Gaussian mixture models, but which form an important part of the machine learning toolkit.
+You should implement the perceptron algorithm according to the description in Wikipedia:

-At several points during this lab, you will need to construct density estimates that are "class-conditional".  For example, in order to classify a test point $x_j$, you need to compute
+[[https://en.wikipedia.org/wiki/Perceptron|Perceptron]]

-$$p(\mathrm{class}=k | x_j, \mathrm{data}) \propto p( x_j | \mathrm{class}=k, \mathrm{data} ) p(\mathrm{class}=k | \mathrm{data} ) $$
+As you implement this lab, you will (hopefully!) learn the difference between numpy's matrices, numpy's vectors, and lists.  In particular, note that a list is not the same as a vector, and an ''n x 1'' matrix is not the same as a vector of length ''n''.

-where
+You may find the functions ''np.asmatrix'', ''np.atleast_2d'', and ''np.reshape'' helpful to convert between them.

-$$p( x_j | \mathrm{class}=k, \mathrm{data} )$$
+Also, you may find the function ''np.dot'' helpful to compute matrix-vector products, or vector-vector products.  You can transpose a matrix by using its ''.T'' attribute; note that ''.T'' leaves a 1-D vector unchanged.
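
To make the list / vector / matrix distinction concrete, here is a small sketch (the values and variable names are made up for illustration) that prints nothing but records the shapes you get from each conversion:

```python
import numpy as np

# A plain Python list, a 1-D array ("vector"), and an n x 1 array are different things.
lst = [1.0, 2.0, 3.0]
vec = np.array(lst)             # shape (3,)
col = np.atleast_2d(vec).T      # shape (3, 1)
row = vec.reshape(1, 3)         # shape (1, 3)

shapes = (vec.shape, col.shape, row.shape)

# .T is an attribute, not a call, and it does nothing to a 1-D array.
same_shape = vec.T.shape == vec.shape

inner = np.dot(vec, vec)        # vector-vector product: a scalar
A = np.arange(6.0).reshape(2, 3)
Av = np.dot(A, vec)             # matrix-vector product: shape (2,)
Acol = np.dot(A, col)           # matrix times an n x 1 column: shape (2, 1)
```

Checking ''.shape'' after every conversion is a cheap habit that catches most of the bugs this lab tends to produce.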
  
-is given by a kernel density estimator derived from all data of class $k$.
+Hint: you should start with the Iris dataset, then once you have your perceptron working, you should move to the CIFAR-10 dataset.
  
 +**Preparing the data:**
  
-The data that you will analyzing is the famous [[http://yann.lecun.com/exdb/mnist/|MNIST handwritten digits dataset]].  You can download some pre-processed MATLAB data files below:
+Both datasets are natively multiclass, but we need to convert them to binary classification problems.  To show you how we're going to do this, and to give you a bit of code to get started, here is how I loaded and converted the Iris dataset:
  
-[[http://hatch.cs.byu.edu/courses/stat_ml/mnist_train.mat|MNIST training data vectors and labels]]
-
-[[http://hatch.cs.byu.edu/courses/stat_ml/mnist_test.mat|MNIST test data vectors and labels]]
+<code python>
+data = pandas.read_csv( 'Fisher.csv' )
+m = data.as_matrix()
+labels = m[:,0]
+labels[ labels==2 ] = 1  # squash class 2 into class 1
+labels = np.atleast_2d( labels ).T
+features = m[:,1:5]
+</code>
  
-These can be loaded using the scipy.io.loadmat function, as follows:
+and the CIFAR-10 dataset:

 <code python>
-import scipy.io
+def unpickle( file ):
+    import cPickle
+    fo = open(file, 'rb')
+    dict = cPickle.load(fo)
+    fo.close()
+    return dict

-train_mat = scipy.io.loadmat('mnist_train.mat')
+data = unpickle( 'cifar-10-batches-py/data_batch_1' )
-train_data = train_mat['images']
+
-train_labels = train_mat['labels']
+features = data['data']
+labels = data['labels']
+labels = np.atleast_2d( labels ).T
+
+# squash classes 0-4 into class 0, and squash classes 5-9 into class 1
+labels[ labels < 5 ] = 0
+labels[ labels >= 5 ] = 1

-test_mat = scipy.io.loadmat('mnist_test.mat')
-test_data = test_mat['t10k_images']
-test_labels = test_mat['t10k_labels']
 </code>
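
The two snippets above were written for Python 2 and an older pandas. If you are on Python 3, something like the following sketch may be closer to what you need. The function names ''load_iris_binary'' and ''load_cifar_binary'' are made up here, and the sketch assumes the same file layouts as the code above (class label in column 0 of the CSV; ''data'' and ''labels'' entries in each CIFAR batch).

```python
import pickle
import numpy as np
import pandas as pd

def load_iris_binary(csv_source):
    """csv_source: a path or file-like object laid out like Fisher.csv
    (class label in column 0, four features in columns 1-4)."""
    m = pd.read_csv(csv_source).to_numpy()   # as_matrix() was removed in pandas 1.0
    labels = m[:, 0].copy()
    labels[labels == 2] = 1                  # squash class 2 into class 1
    return m[:, 1:5].astype(float), np.atleast_2d(labels).T

def load_cifar_binary(path):
    """path: one batch file from the python version of CIFAR-10."""
    with open(path, 'rb') as fo:
        # The batches were pickled under Python 2; with encoding='bytes'
        # the dictionary keys come back as bytes objects.
        batch = pickle.load(fo, encoding='bytes')
    features = np.asarray(batch[b'data'], dtype=float)
    labels = np.atleast_2d(batch[b'labels']).T
    # squash classes 0-4 into class 0, and classes 5-9 into class 1
    return features, (labels >= 5).astype(int)
```

Both functions return ''(features, labels)'' with ''labels'' as an ''n x 1'' array, matching the shapes produced by the original snippets.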
  
-The training data vectors are now in ''train_data'', a numpy array of size 784x60000, with corresponding labels in ''train_labels'', a numpy array of size 60000x1.
+**Running the perceptron algorithm**

-----
+Remember that if a data instance is classified correctly, there is no change in the weight vector.
-====Hints:====
+

-Here is a simple way to visualize a digit.  Suppose our digit is in variable ''X'', which has dimensions 784x1:
+In the Wikipedia description of the perceptron algorithm, notice the function ''f''.  That's the Heaviside step function.  What does it do?
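
As a rough illustration of the step function, here is one way it might be written in numpy. The weights, bias, and data points are hypothetical, and the sketch uses the convention that the output is 1 only when the input is strictly positive.

```python
import numpy as np

def heaviside(z):
    # 1 where z > 0, otherwise 0.
    return (np.asarray(z) > 0).astype(int)

w = np.array([0.5, -1.0])    # hypothetical weights
b = 0.25                     # hypothetical bias
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 1.0]])   # three made-up data points, one per row

# Thresholding w . x + b turns a real-valued score into a 0/1 class label.
predictions = heaviside(X.dot(w) + b)
```

Here the scores are 0.75, -0.75, and 0.25, so the predicted classes are 1, 0, and 1.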
 + 
 +You should run the perceptron for at least 100 steps.  Note that your perceptron will probably converge in far fewer on the Iris dataset!
 +
 +You should also test different step sizes.  Wikipedia doesn't discuss how to do this, but it should be straightforward for you to figure out; the algorithm description in the lecture notes includes the step size.  (But try to figure it out: consider the update equation for the weights, and ask yourself: where should I put a step size parameter, to be able to adjust the magnitude of the weight update?)
 + 
 +For the Iris dataset, you should test at least ''​c=1'',​ ''​c=0.1'',​ ''​c=0.01''​. 
 + 
 +For the CIFAR-10 dataset, you should test at least ''​c=0.001'',​ ''​c=0.00001''​. 
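
If you get stuck, the following is one possible sketch of a training loop with a step size ''c''. It is not the reference solution. It makes several assumptions that the lab does not specify: a "step" is taken to mean one full pass over the data, the bias is folded in as an extra constant feature, and the weights start from ''np.random.randn'' values controlled by a hypothetical ''seed'' parameter.

```python
import numpy as np

def train_perceptron(features, labels, c=0.1, n_steps=100, seed=0):
    """features: (n, d) array; labels: (n, 1) array of 0/1.
    Returns the weights plus per-step accuracy and log weight norm."""
    n, d = features.shape
    X = np.hstack([features, np.ones((n, 1))])   # constant feature acts as the bias
    y = labels.ravel()
    w = np.random.RandomState(seed).randn(d + 1)
    accuracy, log_norm = [], []
    for _ in range(n_steps):
        for x_i, y_i in zip(X, y):
            pred = 1 if np.dot(w, x_i) > 0 else 0
            # (y_i - pred) is 0 for a correct prediction, so w is unchanged.
            w = w + c * (y_i - pred) * x_i
        preds = (X.dot(w) > 0).astype(int)
        accuracy.append(np.mean(preds == y))
        log_norm.append(np.log(np.linalg.norm(w)))
    return w, np.array(accuracy), np.array(log_norm)
```

Calling this once per step size gives you one accuracy curve and one log-norm curve per value of ''c'', which is what the deliverable plots.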
 + 
 + 
 +**Computing the l2 norm of the weight vector**
 +
 +It is interesting to watch the weight vector as the algorithm progresses.
 +
 +Computing the norm should only take a single line of code.  Hint: can you rewrite the l2 norm in terms of dot products?
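
For reference, the identity behind the hint is $\|w\|_2 = \sqrt{w \cdot w}$. A minimal check with a made-up weight vector:

```python
import numpy as np

w = np.array([3.0, 4.0])                 # hypothetical weight vector
l2 = np.sqrt(np.dot(w, w))               # ||w||_2 = sqrt(w . w)
log_l2 = np.log(l2)                      # the quantity to plot
matches_builtin = np.isclose(l2, np.linalg.norm(w))
```

If your weights are stored as an ''n x 1'' matrix rather than a 1-D vector, ''np.dot(w.T, w)'' gives a ''1 x 1'' result instead of a scalar, so you may need to index into it.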
 + 
 +**Plotting results**
 +
 +You may use any notebook-compatible plotting function you like, but I recommend ''matplotlib''.  This is commonly imported as

 <code python>
 import matplotlib.pyplot as plt
-plt.imshow( X.reshape(28,28).T, interpolation='nearest', cmap=matplotlib.cm.gray)
 </code>
 +
 +To create a new figure, call ''​plt.figure''​. ​ To plot a line, call ''​plt.plot''​. ​ Note that if you pass a matrix into ''​plt.plot'',​ it will plot multiple lines at once, each with a different color; each column will generate a new line.
 +
 +Note that if you use matplotlib, you may have to call ''​plt.show''​ to actually construct and display the plot.
 +
 +Don't forget to label your axes!
 +
 +You may find [[http://​matplotlib.org/​users/​pyplot_tutorial.html|this tutorial on pyplot]] helpful.
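
A sketch of how the pieces above might fit together is shown below. The curves are made-up placeholder data, not real perceptron results, and the ''Agg'' backend line is only there so the sketch runs without a display; in a notebook you would normally drop it and call ''plt.show()'' at the end.

```python
import matplotlib
matplotlib.use("Agg")            # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

steps = np.arange(100)
# Placeholder curves (NOT real results): one column per step size.
acc = np.column_stack([1 - 0.5 * np.exp(-steps / 10.0),
                       1 - 0.5 * np.exp(-steps / 30.0)])

fig = plt.figure()
lines = plt.plot(steps, acc)     # a 2-column matrix yields two lines
plt.xlabel("Iteration")
plt.ylabel("Classification accuracy")
plt.legend(lines, ["c=0.001", "c=0.00001"])
```

Passing the list of line objects returned by ''plt.plot'' to ''plt.legend'' keeps each label attached to the right column.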
 +
 +----
 +====Hints:​====
 +
 +An easy way to load a CSV datafile is with the ''​pandas''​ package.
  
 Here are some functions that may be helpful to you:

 <code python>
+
+np.random.randn

 import matplotlib.pyplot as plt
-plt.subplot
-
-numpy.argmax
-
-numpy.exp

-numpy.mean
+plt.figure
+plt.plot
+plt.xlabel
+plt.ylabel
+plt.legend
+plt.show

-numpy.bincount

 </code>
cs501r_f2016/lab2.1472667824.txt.gz · Last modified: 2021/06/30 23:40 (external edit)