**Part 3:** Implementing the Decoder
  
Again implement a standard GRU using GRUCell, except that for the first timestep you embed a tensor containing the SOS index. That embedding and the context vector will serve as the input and initial hidden state.
  
Unlike the encoder, for each time step take the output (GRUCell calls it h') and run it through a linear layer and then a softmax to get probabilities over the English corpus. Use the word with the highest probability as the input for the next timestep.
  
You may want to consider using a method called teacher forcing to help the network learn to connect source and reference words. If you decide to use it, then with some fixed probability at each iteration feed in the embedding of the correct reference word instead of the prediction from the previous time step.

Compute and return the prediction probabilities in either case so they can be used by the loss function.

Continue running the decoder GRU until the max sentence length is reached or EOS is first predicted. Return the probabilities at each time step regardless of whether teacher forcing was used.
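A minimal sketch of that decoding loop is below. The names self.embedding, self.cells, self.out, SOS_IDX, EOS_IDX, and MAX_LENGTH are assumptions standing in for whatever your own model defines; this is one way to wire the pieces together, not the required implementation.

<code python>
# Hypothetical decoder loop; assumes self.embedding (nn.Embedding), self.cells (a list
# of nn.GRUCell stacked n_layers deep), self.out (nn.Linear to the target vocab size),
# and the constants SOS_IDX, EOS_IDX, MAX_LENGTH exist in your model.
import random
import torch
import torch.nn.functional as F

def decode(self, context, reference=None, teacher_forcing_p=0.5):
    inp = self.embedding(torch.tensor([SOS_IDX]))   # embed SOS for the first timestep
    hidden = [context for _ in self.cells]          # context vector initializes each layer
    probs_per_step = []
    for t in range(MAX_LENGTH):
        x = inp
        for i, cell in enumerate(self.cells):       # n_layers of GRUCell, like the encoder
            hidden[i] = cell(x, hidden[i])
            x = hidden[i]                           # h' feeds the next layer
        probs = F.softmax(self.out(x), dim=1)       # linear layer + softmax over the vocab
        probs_per_step.append(probs)
        pred = probs.argmax(dim=1)                  # greedy choice for the next word
        if reference is not None and random.random() < teacher_forcing_p:
            next_word = reference[t].view(1)        # teacher forcing: feed the correct word
        else:
            next_word = pred
        inp = self.embedding(next_word)
        if pred.item() == EOS_IDX:                  # stop once EOS is predicted
            break
    return probs_per_step                           # used by the loss in Part 4
</code>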
  
**Part 4:** Loss, test metrics
Calculate accuracy by something similar to (target==reference).data.numpy(), but make sure to compensate for when the target and reference sequences are of different lengths.
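For example, a hypothetical helper (not part of the lab scaffolding) that pads the shorter index sequence with sentinels so extra or missing words count as errors:

<code python>
import numpy as np

def sequence_accuracy(predicted_idxs, reference_idxs):
    # Pad the shorter sequence with sentinel values that can never match a real
    # word index (or each other), so length mismatches count against accuracy.
    n = max(len(predicted_idxs), len(reference_idxs))
    pred = np.full(n, -1); pred[:len(predicted_idxs)] = predicted_idxs
    ref = np.full(n, -2); ref[:len(reference_idxs)] = reference_idxs
    return float((pred == ref).mean())
</code>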
  
Consider using perplexity in addition to cross entropy as a test metric. It's standard practice for NMT and language modelling, and is 2^cross_entropy.
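One way to compute it from the average per-word cross entropy; note that PyTorch's cross entropy loss uses the natural log, so exponentiating the nat value in base e gives the same number as 2 raised to the value measured in bits.

<code python>
import math

def perplexity(avg_cross_entropy_nats):
    # PyTorch losses are in nats, so exponentiate in base e;
    # this equals 2 ** (cross entropy measured in bits).
    return math.exp(avg_cross_entropy_nats)
</code>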
  
**Part 5:** Optimizer
loss.backward()

if j % batch_size == 0:
    for p in all_parameters:
        p.grad.div_(n) # in-place
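The fragment above comes from a gradient-accumulation loop. A hedged sketch of the surrounding code might look like the following; model, optimizer, training_pairs, and batch_size are assumed names rather than the lab's exact variables.

<code python>
# Hypothetical accumulation loop around the fragment above; assumes `model`,
# `optimizer`, `training_pairs`, and `batch_size` are defined elsewhere.
optimizer.zero_grad()
for j, (source, reference) in enumerate(training_pairs, start=1):
    loss = model(source, reference)      # forward pass on one sentence pair
    loss.backward()                      # gradients accumulate across iterations
    if j % batch_size == 0:
        for p in model.parameters():
            if p.grad is not None:
                p.grad.div_(batch_size)  # in-place average over the effective batch
        optimizer.step()
        optimizer.zero_grad()
</code>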
Debugging in PyTorch is significantly more straightforward than in TensorFlow. Tensors are available at any time to print or log.
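For example, you can drop a print into any forward pass to inspect shapes and values; logits here stands in for whatever intermediate tensor you care about.

<code python>
# Inspect any intermediate tensor during the forward pass; `logits` is hypothetical.
print(logits.shape, logits.min().item(), logits.max().item())
</code>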
  
Better hyperparameters to come. Started to converge after two hours on a K80 using Adam.
<code python>
learning_rate = .01 # decayed
batch_size = 40 # effective batch size
max_seq_length = 30
hidden_dim = 1024
</code>
  