cs501r_f2016:lab3

Note a couple of things about this code: first, it is fully vectorized.  Second, the ''numerical_gradient'' function accepts a parameter called ''loss_function'' -- ''numerical_gradient'' is a higher-order function that accepts another function as an input.  This numerical gradient calculator could be used to calculate gradients for any function.  Third, you may wonder why my ''loss_function'' doesn't need the data!  Since the data never changes, I curried it into my loss function, resulting in a function that only takes one parameter -- the matrix ''W''.
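
To make the higher-order-function idea concrete, here is a minimal (deliberately unvectorized) sketch of what such a ''numerical_gradient'' could look like, assuming numpy and a ''loss_function'' already curried to take only ''W''.  The names and the simple forward-difference scheme are illustrative only; they are not the vectorized version used in the lab code.

<code python>
import numpy as np

def numerical_gradient(loss_function, W, delta=1e-6):
    # Finite-difference estimate of dL/dW: perturb one entry of W at a
    # time, re-evaluate the curried loss, and record the change.
    grad = np.zeros_like(W)
    base_loss = loss_function(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_perturbed = W.copy()
            W_perturbed[i, j] += delta
            grad[i, j] = (loss_function(W_perturbed) - base_loss) / delta
    return grad
</code>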
  
You should run your code for 1000 epochs.  (Here, by epoch, I mean "step in the gradient descent algorithm.")  Note, however, that for each step, you have to calculate the gradient, and in order to calculate the gradient, you will need to evaluate the loss function many times.
  
You should plot both the loss function and the classification accuracy at each step.
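
As a rough sketch of how the training loop might be organized -- assuming numpy and matplotlib, the ''numerical_gradient'' sketch above, a curried ''loss_function(W)'', and a hypothetical ''accuracy(W)'' helper; the step size, initialization, and placeholder sizes below are my assumptions, not prescribed by the lab:

<code python>
import numpy as np
import matplotlib.pyplot as plt

# loss_function(W) and accuracy(W) are assumed to be defined elsewhere,
# both curried over the fixed training data.
num_features, num_classes = 100, 10   # placeholders; match your data
step_size = 0.1                       # placeholder learning rate

W = 0.01 * np.random.randn(num_features, num_classes)
losses, accuracies = [], []

for epoch in range(1000):
    grad = numerical_gradient(loss_function, W)   # many loss evaluations
    W = W - step_size * grad                      # one gradient descent step
    losses.append(loss_function(W))
    accuracies.append(accuracy(W))

plt.plot(losses)
plt.title("Loss")
plt.show()

plt.plot(accuracies)
plt.title("Classification accuracy")
plt.show()
</code>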
  
**Preparing the data:**
You should use a linear score function, as discussed in class.  This should only be one line of code!
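
For example, with a data matrix ''X'' of shape (N, D) and a weight matrix ''W'' of shape (D, K) (these names and shapes are my assumption, not fixed by the lab), the score function can be as simple as:

<code python>
scores = X.dot(W)   # (N, K): one score per training instance per class
</code>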
  
You should use the log softmax loss function, as discussed in class.  For each training instance, you should compute the probability that the instance ''i'' is classified as class ''k'', using ''p(instance i = class k) = exp( s_ik ) / sum_j exp( s_ij )'' (where ''s_ij'' is the score of the i'th instance on the j'th class), and then calculate ''L_i'' as the log of the probability of the correct class.  Your overall loss is then the mean of the individual ''L_i'' terms.
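
A direct translation of that formula might look like the sketch below, assuming numpy, integer labels ''y'' of shape (N,), and the ''scores'' array from above.  The text defines ''L_i'' as the log probability itself; this sketch negates the mean so that plain gradient descent, which minimizes, pushes the probabilities of the correct classes up -- that sign convention is my choice.  This naive version is also exactly what the underflow warning below is about.

<code python>
import numpy as np

def log_softmax_loss(scores, y):
    # scores: (N, K) class scores; y: (N,) integer labels of the correct class
    exp_scores = np.exp(scores)                               # exp(s_ij)
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    L_i = np.log(probs[np.arange(len(y)), y])                 # log p(correct class)
    return -np.mean(L_i)                                      # lower is better
</code>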
  
**Note: you should be careful about numerical underflow!** To help combat that, you should use the **log-sum-exp** trick (or the **exp-normalize** trick):
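
One common way to apply that idea here -- shown only as a sketch, with the same assumed ''scores''/''y'' shapes as above -- is to subtract each row's maximum score before exponentiating; this leaves the probabilities unchanged but keeps every exponential in a safe range.

<code python>
def stable_log_softmax_loss(scores, y):
    # Shift each row so its maximum is 0: exp() can no longer overflow,
    # and the log-sum-exp is computed on the shifted scores.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(y)), y])
</code>
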
I used a delta of 0.000001.
  
Please feel free to search around online for resources to understand this better.  For example:
  
[[http://www2.math.umd.edu/~dlevy/classes/amsc466/lecture-notes/differentiation-chap.pdf|These lecture notes]] (see eq. 5.1)