cs501r_f2016:lab14

Differences

This shows you the differences between two versions of the page.

cs501r_f2016:lab14 [2017/11/17 23:16]
jszendre [Deliverable:]
cs501r_f2016:lab14 [2017/11/28 17:42]
wingated
Line 27: Line 27:
  
 In order to debug this lab, you will probably want to translate **Spanish to English.** That way, you will be able to judge the quality of the sentences that are coming out!
+
+----
+====Scaffolding code:====
+
+Some starter code is available for download via Dropbox:
+
+[[https://www.dropbox.com/s/g967xqzkmydatxd/nmt_scaffold_v2.py?dl=0|nmt_scaffold_v2.py]]
  
 ----
Line 43: Line 50:
 Some of the resources for this lab include [[https://arxiv.org/pdf/1409.3215.pdf|Sequence to Sequence Learning with Neural Networks]] and [[https://arxiv.org/pdf/1409.0473.pdf|D Bahdanau, 2015]]. The former will be of more use in implementing the lab. State of the art NMT systems use Bahdanau's attention mechanism, but context alone should be enough for our dataset.
  
-Seq2seq and encoder/decoder are nearly synonymous architectures and represent the first major breakthrough using RNNs to map between source and target sequences of differing lengths. The encoder will map input sequences to a fixed length context vector and the decoder will then map that to the output sequence. Standard softmax / cross entropy is used on the scores output by the decoder and compared against the reference sequence.
+Seq2seq and encoder/decoder are nearly synonymous architectures and represent the first major breakthrough using RNNs to map between source and target sequences of differing lengths. The encoder will map input sequences to a fixed length context vector and the decoder will then map that to the output sequence. The loss is standard cross entropy between the scores output by the decoder and the reference sentence.
  
 The hyperparameters used are given below.
Line 74: Line 81:
 Create an Encoder class that encapsulates all of the graph operations necessary for embedding and returns the context vector. Initialize both nn.GRUCell and nn.Embedding class members to embed the indexed source input sequence.
  
-For each time step use the last previous hidden state and the embedding for the current word as the initial hidden state and input tensor for the GRU. For the first time step use a tensor of zeros for the initial hidden state. Return the last layer's hidden state at the last time step.
+Implement a GRU using GRUCell, with the embedding of the source sentence as the input at each time step. Use a zero tensor as the initial hidden state. Return the last hidden state.
+
+You will probably want to use several layers for your GRU.
  
 <code python>
Line 81: Line 90:
         super(Encoder, self).__init__()
         # Instantiate nn.Embedding and nn.GRUCell
-
+
     def run_timestep(self, input, hidden):
         # implement gru here for the nth timestep
Line 92: Line 102:
 **Part 3:** Implementing the Decoder
  
-The decoder will be more involved, but will be similar to the encoder. This time there will be an additional intertemporal connection between each layer’s output and the subsequent first layer’s input between time steps.
+Again implement a standard GRU using GRUCell, with the exception that for the first timestep you embed a tensor containing the SOS index. That and the context vector will serve as the input and initial hidden state, respectively.
+
+Unlike the encoder, for each time step take the output (GRUCell calls it h') and run it through a linear layer and then softmax to get probabilities over the English vocabulary. Use the word with the highest probability as the input for the next timestep.
  
-For the first timestep embed a tensor containing the SOS index. That and the context vector will serve as the input and initial hidden state. Call GRUCell n_layers times like before, but for proceeding time steps use the prediction of the previous time step as the initial input. Like the autoencoder the initial hidden state at each time step will be the last hidden state from the previous time step.
+You may want to consider using a method called teacher forcing to begin connecting source/reference words together. If you decide to use this, then with a set probability at each iteration, input the embedding of the correct word instead of the prediction from the previous time step.
  
-Use a linear layer and then softmax to convert the output at each time step to a tensor of probabilities over all words in your target corpus and use those probabilities to create the prediction for the next word.
+Compute and return the prediction probabilities in either case so they can be used by the loss function.
  
-Stop the first time that EOS is predicted. Return the probabilities at each time step and the indices of predicted words.
+Continue running the decoder GRU until the max sentence length is reached or EOS is first predicted. Return the probabilities at each time step regardless of whether teacher forcing was used.
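The decoding loop described in Part 3 can be sketched in plain Python, independent of any tensor library. This is only an illustration of the control flow (greedy choice, optional teacher forcing, stopping at EOS or the maximum length), not the scaffold's implementation. The names `greedy_decode`, `step_fn`, and `toy_step` are hypothetical; `step_fn` stands in for the embed, GRUCell, linear, softmax pipeline and is assumed to return a list of probabilities and the new hidden state.

```python
import random


def greedy_decode(step_fn, context, sos_index, eos_index, max_length,
                  reference=None, teacher_forcing_ratio=0.0, rng=random):
    """Decode one token at a time, starting from SOS and the context vector.

    step_fn(token_index, hidden) -> (probs, new_hidden) is a hypothetical
    stand-in for: embed -> GRUCell -> linear -> softmax.
    """
    token, hidden = sos_index, context
    all_probs, predictions = [], []
    for t in range(max_length):
        probs, hidden = step_fn(token, hidden)
        all_probs.append(probs)  # kept at every step for the loss function
        predicted = max(range(len(probs)), key=probs.__getitem__)
        predictions.append(predicted)
        if predicted == eos_index:
            break  # stop the first time EOS is predicted
        # Teacher forcing: with a set probability, feed the correct word
        # instead of the model's own prediction.
        use_teacher = (reference is not None and t < len(reference)
                       and rng.random() < teacher_forcing_ratio)
        token = reference[t] if use_teacher else predicted
    return all_probs, predictions


def toy_step(token, hidden):
    """Toy stand-in for a decoder step over a 4-word vocabulary (assumption)."""
    probs = [0.0, 0.0, 0.0, 0.0]
    probs[(token + 1) % 4] = 1.0  # always predicts the "next" index
    return probs, hidden + 1


probs, predictions = greedy_decode(toy_step, context=0, sos_index=0,
                                   eos_index=3, max_length=10)
print(predictions)
```

In a real model the per-step probabilities would be tensors and the argmax would be taken with the library's own operation. The branching on `use_teacher` is the only difference between training with and without teacher forcing.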
  
 **Part 4:** Loss, test metrics
Line 106: Line 118:
 Calculate accuracy by something similar to (target==reference).data.numpy(), but make sure to compensate for when the target and reference sequences are of different lengths.
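One possible way to compensate for differing lengths, sketched here on plain lists of word indices, is to count every position that exists in only one of the two sequences as an error. This is an assumption about what "compensate" should mean, not the lab's required definition, and `token_accuracy` is a hypothetical helper name.

```python
def token_accuracy(predicted, reference):
    """Fraction of matching positions, dividing by the longer length.

    Positions present in only one sequence count as mismatches, so a
    prediction that stops early or runs long is penalized.
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0  # nothing to compare (arbitrary convention)
    # zip stops at the shorter sequence; the remainder counts as wrong.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / longest


print(token_accuracy([1, 2, 3], [1, 2, 3, 4]))
```

Dividing by the reference length alone is another reasonable choice, but it would not penalize predictions that run past the end of the reference.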
  
-Perplexity is a standard measure for NMT and Language Modelling, it is equivalent to 2^cross_entropy.
+Consider using perplexity in addition to cross entropy as a test metric. It's standard practice for NMT and Language Modelling and is 2^cross_entropy when the cross entropy is measured in bits.
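A small sketch of the perplexity calculation follows. PyTorch's cross entropy losses use the natural logarithm, so the loss value is in nats and the matching base is e rather than 2; both forms give the same perplexity. The function names are hypothetical.

```python
import math


def perplexity_from_nats(cross_entropy_nats):
    """Perplexity when the cross entropy was computed with natural log."""
    return math.exp(cross_entropy_nats)


def perplexity_from_bits(cross_entropy_bits):
    """Perplexity when the cross entropy was computed with log base 2."""
    return 2.0 ** cross_entropy_bits


# A model that is uniformly unsure among 8 words has cross entropy
# ln(8) nats, which is 3 bits.
nats = math.log(8)
bits = nats / math.log(2)
print(perplexity_from_nats(nats), perplexity_from_bits(bits))
```

Perplexity can be read as the effective number of equally likely words the model is choosing among at each step.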
  
 **Part 5:** Optimizer
Line 117: Line 129:
 loss.backward()
  
-if j % == 0:
+if j % batch_size == 0:
     for p in all_parameters:
         p.grad.div_(n) # in-place
Line 153: Line 165:
 Debugging in PyTorch is significantly more straightforward than in TensorFlow. Tensors are available at any time to print or log.
  
-Better hyperparameters to come. Started to converge after two hours on a K80.
+Better hyperparameters to come. Started to converge after two hours on a K80 using Adam.
 <code python>
-learning_rate = .01 # decayed, lowest .0001
+learning_rate = .01 # decayed
 batch_size = 40 # effective batch size
-max_seq_length = 40 # ambitious
+max_seq_length = 30
-hidden_dim = 1024 # can use larger
+hidden_dim = 1024
 </code>
  
cs501r_f2016/lab14.txt · Last modified: 2021/06/30 23:42 (external edit)