cs501r_f2016:lab2

Differences

This shows you the differences between two versions of the page.

cs501r_f2016:lab2 [2016/08/31 18:24]
wingated
cs501r_f2016:lab2 [2016/09/02 04:45]
wingated
Line 1: Line 1:
 ====Objective:====
 
-To gain experience with python, numpy, and linear classification.  Oh, and to remember all of that linear algebra stuff.  ;)
+To gain experience with python, numpy, and linear classification.
+
+Oh, and to remember all of that linear algebra stuff.  ;)
  
 ----
 ====Deliverable:====
 
-You should turn in an iPython notebook that implements the perceptron algorithm on the Iris dataset.
+You should turn in an iPython notebook that implements the perceptron algorithm on two different datasets: the Iris dataset, and the CIFAR-10 dataset.  Because the perceptron is a binary classifier, we will preprocess the data and "squash" it to create two classes.
 + 
 +Your notebook should also generate a visualization that shows classification accuracy at each iteration, along with the log of the l2 norm of the weight vector, for two different values of the perceptron'​s step size.  Examples of both are shown at the right. ​ Since there are two datasets, and there are two visualizations per dataset, your notebook should produce a total of 4 plots. 
 + 
 +**Please cleanly label your axes!** 
 + 
 +{{ :​cs501r_f2016:​lab2_cacc.png?​direct&​200|}} 
 + 
 +{{ :​cs501r_f2016:​lab2_l2norm.png?​direct&​200|}} 
 + 
 +The Iris dataset can be downloaded at the UCI ML repository, or you can download a slightly simpler version here: 
 +[[http://​liftothers.org/​Fisher.csv|http://​liftothers.org/​Fisher.csv]]
  
-Your notebook should also generate a visualization that shows the loss function at each iteration.  This can be generated as a single plot, and shown in the notebook.
+The CIFAR-10 dataset can be downloaded at
+[[https://www.cs.toronto.edu/~kriz/cifar.html|https://www.cs.toronto.edu/~kriz/cifar.html]]
 
-The dataset can be downloaded at
-[[https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data|The UCI ML repository]]
+**Note: make sure to download the python version of the data - it will simplify your life!**
  
 ----
Line 20: Line 33:
   * 70% Correct implementation of perceptron algorithm
   * 20% Tidy and legible visualization of loss function
-  * 10% Tidy and legible final classification rate
+  * 10% Tidy and legible plot of classification accuracy over time
 
 ----
 ====Description:====
  
 +The purpose of this lab is to help you become familiar with ''​numpy'',​ to remember the basics of classification,​ and to implement the perceptron algorithm. ​ The perceptron algorithm is a simple method of learning a separating hyperplane. ​ It is guaranteed to converge iff the dataset is linearly separable - otherwise, you have to cross your fingers!
  
-For this lab, you will be experimenting with Kernel Density Estimators (see MLAPP 14.7.2).  These are a simple, nonparametric alternative to Gaussian mixture models, but which form an important part of the machine learning toolkit.
+You should implement the perceptron algorithm according to the description in Wikipedia:
 
-At several points during this lab, you will need to construct density estimates that are "class-conditional".  For example, in order to classify a test point $x_j$, you need to compute
+[[https://en.wikipedia.org/wiki/Perceptron|Perceptron]]
 
-$$p(\mathrm{class}=k | x_j, \mathrm{data}) \propto p( x_j | \mathrm{class}=k, \mathrm{data} ) p(\mathrm{class}=k | \mathrm{data} ) $$
+As you implement this lab, you will (hopefully!) learn the difference between numpy's matrices, numpy's vectors, and lists.  In particular, note that a list is not the same as a vector, and an ''n x 1'' matrix is not the same as a vector of length ''n''.
 
-where
+You may find the functions ''np.asmatrix'', ''np.atleast_2d'', and ''np.reshape'' helpful to convert between them.
 
-$$p( x_j | \mathrm{class}=k, \mathrm{data} )$$
+Also, you may find the function ''np.dot'' helpful to compute matrix-vector products, or vector-vector products. You can transpose a matrix or a vector with the ''.T'' attribute.
 
-is given by a kernel density estimator derived from all data of class $k$.
+Hint: you should start with the Iris dataset, then once you have your perceptron working, you should move to the CIFAR-10 dataset.
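
The distinction between lists, vectors, and ''n x 1'' matrices can be checked directly by looking at shapes. The following is only a small illustrative sketch: the variable names and the numbers are made up, and it uses plain arrays rather than ''np.asmatrix''.

```python
import numpy as np

a_list = [1.0, 2.0, 3.0]             # plain Python list: no shape, no vector arithmetic
vec = np.array(a_list)               # 1-D array ("vector"), shape (3,)
col = np.reshape(vec, (3, 1))        # 3 x 1 column, shape (3, 1)
row = np.atleast_2d(vec)             # 1 x 3 row, shape (1, 3)

print(vec.shape, col.shape, row.shape)

# Transposing a 1-D array changes nothing; transposing a 2-D array swaps its axes.
vec_t_shape = vec.T.shape            # still (3,)
row_t_shape = row.T.shape            # (3, 1)

inner = np.dot(vec, vec)             # vector-vector product: a scalar
W = np.ones((2, 3))
mv = np.dot(W, vec)                  # matrix times vector: shape (2,)
mc = np.dot(W, col)                  # matrix times column: shape (2, 1)

# A common pitfall: a vector plus a column broadcasts to a full 3 x 3 array.
pitfall = vec + col
```

Printing ''.shape'' after every conversion is a cheap way to catch the broadcasting pitfall on the last line before it silently corrupts a weight update.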
  
 +**Preparing the data:**
  
-The data that you will be analyzing is the famous [[http://yann.lecun.com/exdb/mnist/|MNIST handwritten digits dataset]].  You can download some pre-processed MATLAB data files below:
+Both datasets are natively multiclass, but we need to convert them to binary classification problems.  To show you how we're going to do this, and to give you a bit of code to get started, here is how I loaded and converted the Iris dataset:
 
-[[http://hatch.cs.byu.edu/courses/stat_ml/mnist_train.mat|MNIST training data vectors and labels]]
+<code python>
+import numpy as np
+import pandas
+
+data = pandas.read_csv( 'Fisher.csv' )
+m = data.values  # as_matrix() was removed in pandas 1.0
+labels = m[:,0]
+labels[ labels==2 ] = 1  # squash class 2 into class 1
+labels = np.atleast_2d( labels ).T
+features = m[:,1:5]
+</code>
  
-[[http://hatch.cs.byu.edu/courses/stat_ml/mnist_test.mat|MNIST test data vectors and labels]]
-
-These can be loaded using the scipy.io.loadmat function, as follows:
+and the CIFAR-10 dataset:
  
 <code python>
-import scipy.io
+import pickle
+import numpy as np
+
+def unpickle( file ):
+    # Python 3 loader; the original lab used Python 2's cPickle
+    with open(file, 'rb') as fo:
+        # encoding='latin1' lets Python 3 read the Python 2 pickle with str keys
+        return pickle.load(fo, encoding='latin1')
+
+data = unpickle( 'cifar-10-batches-py/data_batch_1' )
+
+features = data['data']
+labels = data['labels']
+labels = np.atleast_2d( labels ).T
 
-train_mat = scipy.io.loadmat('mnist_train.mat')
-train_data = train_mat['images']
-train_labels = train_mat['labels']
+# squash classes 0-4 into class 0, and squash classes 5-9 into class 1
+labels[ labels < 5 ] = 0
+labels[ labels >= 5 ] = 1
 
-test_mat = scipy.io.loadmat('mnist_test.mat')
-test_data = test_mat['t10k_images']
-test_labels = test_mat['t10k_labels']
 </code>
  
-The training data vectors are now in ''train_data'', a numpy array of size 784x60000, with corresponding labels in ''train_labels'', a numpy array of size 60000x1.
+**Running the perceptron algorithm**
 
-----
-====Hints:====
+Remember that if a data instance is classified correctly, there is no change in the weight vector.
 
-An easy way to load a CSV datafile is with the ''pandas'' package.
+In the Wikipedia description of the perceptron algorithm, notice the function ''f''.  That's the Heaviside step function.  What does it do?
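
As a rough illustration of the step function mentioned above, here is a minimal sketch. The helper name ''heaviside_step'' and the sample scores are invented for this example, and mapping a score of exactly 0 to class 0 is an assumption that follows the "greater than zero" convention in the Wikipedia description.

```python
import numpy as np

def heaviside_step(z):
    """Hypothetical helper: 1 where z > 0, otherwise 0 (so z == 0 maps to 0)."""
    return (np.asarray(z) > 0).astype(int)

# Made-up scores, standing in for the values of w . x + b on four data points
scores = np.array([-2.0, 0.0, 0.5, 3.0])
predictions = heaviside_step(scores)
print(predictions)
```

The output is a vector of 0/1 class predictions, which can be compared element-wise against the squashed labels.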
  
-Here is a simple way to visualize a digit.  Suppose our digit is in variable ''X'', which has dimensions 784x1:
+You should run the perceptron for at least 100 steps.
 + 
 +You should also test different step sizes. Wikipedia doesn'​t discuss how to do this, but it should be straightforward for you to figure out; the algorithm description in the lecture notes includes the step size.  (But try to figure it out: consider the update equation for weight, and ask yourself: where should I put a stepsize parameter, ​to be able to adjust the magnitude of the weight update?)  
 + 
 +For the Iris dataset, you should test at least ''​c=1'',​ ''​c=0.1'',​ ''​c=0.01''​. 
 + 
 +For the CIFAR-10 dataset, you should test at least ''​c=0.001'',​ ''​c=0.00001''​. 
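
One possible way to organize the training loop is sketched below on a tiny made-up dataset. This is not the reference solution: the function name ''train_perceptron'', the bias handling (a constant column appended to the features), and the reading of "one step" as one full pass over the data are all assumptions made for illustration.

```python
import numpy as np

def train_perceptron(features, labels, c=0.1, n_steps=100):
    """Hypothetical helper: perceptron with step size c; labels must be 0 or 1."""
    n, d = features.shape
    X = np.hstack([features, np.ones((n, 1))])   # constant column acts as the bias
    y = np.asarray(labels).reshape(-1)           # flatten an n x 1 label matrix to length n
    w = np.zeros(d + 1)
    accuracy = []
    for _ in range(n_steps):                     # assumption: one step = one pass over the data
        for x_i, y_i in zip(X, y):
            pred = 1 if np.dot(w, x_i) > 0 else 0
            w += c * (y_i - pred) * x_i          # zero update when the prediction is correct
        preds = (np.dot(X, w) > 0).astype(int)
        accuracy.append(np.mean(preds == y))     # fraction classified correctly after this pass
    return w, accuracy

# Tiny, linearly separable toy data (invented for this sketch)
neg = np.array([[-1, -1], [-2, -1], [-1, -2], [-2, -2]])
pos = -neg
toy_X = np.vstack([neg, pos])
toy_y = np.atleast_2d([0] * 4 + [1] * 4).T       # n x 1 labels, like the loaders above produce

w, acc = train_perceptron(toy_X, toy_y, c=0.1, n_steps=100)
```

On the real datasets, calling the same kind of function once per value of ''c'' gives one accuracy curve per step size, which is what the plots compare.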
 + 
 + 
 +** Computing the l2 norm of the weight vector ** 
 + 
 +It is interesting to watch the weight vector as the algorithm progresses. ​  
 + 
+This should only take a single line of code.  Hint: can you rewrite the l2 norm in terms of dot products?
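
For reference, the l2 norm of a vector is the square root of its dot product with itself. The sketch below uses a made-up weight vector and shows both the 1-D case and the ''n x 1'' column case, where a transpose is needed for the shapes to line up.

```python
import numpy as np

w = np.array([3.0, 4.0])                 # made-up weight vector
l2_norm = np.sqrt(np.dot(w, w))          # ||w||_2 = sqrt(w . w)
log_l2_norm = np.log(l2_norm)            # the quantity the lab asks you to plot

# For an n x 1 column, transpose first so the shapes line up: (1 x n)(n x 1)
w_col = w.reshape(-1, 1)
l2_norm_col = np.sqrt(np.dot(w_col.T, w_col)).item()
```

Note that the log is undefined for an all-zero weight vector, so a curve that starts from zero weights needs the norm recorded only after the first update.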
 + 
 +** Plotting results ** 
 + 
 +You may use any notebook compatible plotting function you like, but I recommend ​''​matplotlib''​.  This is commonly imported as
  
 <code python>
 import matplotlib.pyplot as plt
-plt.imshow( X.reshape(28,28).T, interpolation='nearest', cmap=matplotlib.cm.gray)
 </code>
 +
 +To create a new figure, call ''​plt.figure''​. ​ To plot a line, call ''​plt.plot''​. ​ Note that if you pass a matrix into ''​plt.plot'',​ it will plot multiple lines at once, each with a different color; each column will generate a new line.
 +
 +Note that if you use matplotlib, you may have to call ''​plt.show''​ to actually construct and display the plot.
 +
 +Don't forget to label your axes!
 +
 +You may find [[http://​matplotlib.org/​users/​pyplot_tutorial.html|this tutorial on pyplot]] helpful.
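
A minimal sketch of the plotting calls described above, with placeholder curves rather than real results: the numbers, the step-size labels in the legend, and the axis titles are all invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder curves (made-up numbers): one column per step size
steps = np.arange(100)
accuracy = np.column_stack([
    1.0 - 0.5 * np.exp(-steps / 10.0),
    1.0 - 0.5 * np.exp(-steps / 30.0),
])

fig = plt.figure()
lines = plt.plot(steps, accuracy)        # a 2-column matrix yields two lines
plt.xlabel("Iteration")
plt.ylabel("Classification accuracy")
plt.legend(lines, ["c = 0.1", "c = 0.01"])
# plt.show()  # uncomment if the figure does not appear on its own
```

Storing each run as a column of one matrix keeps the plotting code to a single ''plt.plot'' call per figure.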
 +
 +----
 +====Hints:​====
 +
 +An easy way to load a CSV datafile is with the ''​pandas''​ package.
  
 Here are some functions that may be helpful to you:
 
 <code python>
+
+np.random.randn
 
 import matplotlib.pyplot as plt
-plt.subplot
-
-numpy.argmax
-
-numpy.exp
 
-numpy.mean
+plt.figure
+plt.plot
+plt.xlabel
+plt.ylabel
+plt.legend
+plt.show
 
-numpy.bincount
 
 </code>
cs501r_f2016/lab2.txt · Last modified: 2021/06/30 23:42 (external edit)