====PyTorch Overview:====
You will remember from class that PyTorch is an imperative (as opposed to declarative) automatic differentiation system. Thus all tensor operations occur only as they are initiated in Python (although they may run on a GPU).

A brief review of the low-level automatic differentiation system, Autograd, is in order. The Variable class (torch.autograd) wraps tensors as they flow through the computational graph, recording every operation and its dependent nodes. When .backward() is called on a variable, standard backpropagation leaves a .grad attribute initialized on all weight variables (those wrapped in a Parameter class). At this point the computational graph is released and is ready for an optimizer.

Since we're using the [[http://pytorch.org/docs/master/nn.html|nn.Module]] workflow, you should only need to initialize Variable objects to wrap your input tensors. Most of your trainable weights will typically be parameters of submodules, but whenever you initialize a Parameter as a class member of an nn.Module instance it is automatically accessible via module.parameters() and accumulates gradients.
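For concreteness, here is a minimal sketch of that flow (the module, sizes, and names are made up for illustration): wrap the input in a Variable, run it through a small nn.Module whose Parameter is registered as a class member, and call .backward():

<code python>
import torch
from torch import nn
from torch.autograd import Variable

class TinyModule(nn.Module):
    def __init__(self):
        super(TinyModule, self).__init__()
        # A Parameter assigned as a class member is registered automatically:
        # it shows up in self.parameters() and accumulates gradients.
        self.weight = nn.Parameter(torch.randn(3, 3))

    def forward(self, x):
        return self.weight.mm(x).sum()

module = TinyModule()
x = Variable(torch.randn(3, 1))   # wrap the input tensor in a Variable
loss = module(x)                  # the forward pass builds the graph
loss.backward()                   # backprop; the graph is then released
print(module.weight.grad)         # .grad now holds d(loss)/d(weight)
</code>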
----
====Deliverable:====
In this lab you will use PyTorch to implement a vanilla encoder / decoder neural machine translation system from Spanish to English.
Some of the resources for this lab include [[https://arxiv.org/pdf/1409.3215.pdf|Sequence to Sequence Learning with Neural Networks]] and [[https://arxiv.org/pdf/1409.0473.pdf|D Bahdanau, 2015]]. The former will be of more use in implementing the lab. State-of-the-art NMT systems use Bahdanau's attention mechanism, but context alone should be enough for our dataset.
Seq2seq and encoder/decoder are nearly synonymous architectures and represent the first major breakthrough using RNNs to map between source and target sequences of differing lengths. The encoder will map input sequences to a fixed-length context vector and the decoder will then map that to the output sequence. Standard softmax / cross entropy is used on the scores output by the decoder and compared against the reference sequence.
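As a rough sketch of the encoder half under those assumptions (all sizes and names below are placeholders, not prescribed by the lab), a GRUCell can be stepped over the embedded source sentence and its final hidden state taken as the context vector:

<code python>
import torch
from torch import nn
from torch.autograd import Variable

# Hypothetical sizes; choose your own hyperparameters.
spanish_vocab_size, embed_dim, hidden_dim = 10000, 256, 256

embedding = nn.Embedding(spanish_vocab_size, embed_dim)
encoder_cell = nn.GRUCell(embed_dim, hidden_dim)

source = Variable(torch.LongTensor([[4, 17, 89, 2]]))  # one (1 x T) sentence of word indices
hidden = Variable(torch.zeros(1, hidden_dim))          # initial hidden state

for t in range(source.size(1)):
    embedded = embedding(source[:, t])     # (1 x embed_dim) for this time step
    hidden = encoder_cell(embedded, hidden)

context = hidden  # fixed-length context vector handed to the decoder
</code>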
The hyperparameters used are given below.
**Part 0:** Data preprocessing
**Part 3:** Implementing the Decoder
Again implement a standard GRU using GRUCell, except that for the first time step you embed a tensor containing the SOS index. That embedding and the context vector serve as the first input and the initial hidden state, respectively.

Unlike in the encoder, at each time step take the output (GRUCell calls it h'), run it through a linear layer, and then apply softmax to get probabilities over the English corpus. Use the word with the highest probability as the input for the next time step.

You may want to consider using a method called teacher forcing to help the network begin connecting source and reference words. If you use it, then with some set probability at each iteration feed in the embedding of the correct reference word instead of the prediction from the previous time step.

Compute and return the prediction probabilities in either case to be used by the loss function.

Continue running the decoder GRU until the maximum sentence length is reached or EOS is first predicted. Return the probabilities at each time step regardless of whether teacher forcing was used.
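A hedged sketch of that decoding loop is below; SOS_index, EOS_index, max_length, the sizes, and the dummy context/reference are all placeholders standing in for values from your own encoder and data pipeline:

<code python>
import random
import torch
from torch import nn
import torch.nn.functional as F
from torch.autograd import Variable

# Placeholder values standing in for your real hyperparameters and data.
english_vocab_size, embed_dim, hidden_dim = 8000, 256, 256
SOS_index, EOS_index, max_length = 0, 1, 30
context = Variable(torch.zeros(1, hidden_dim))          # from the encoder
reference = Variable(torch.LongTensor([5, 9, 23, 1]))   # target sentence indices

embedding = nn.Embedding(english_vocab_size, embed_dim)
decoder_cell = nn.GRUCell(embed_dim, hidden_dim)
out_layer = nn.Linear(hidden_dim, english_vocab_size)

use_teacher_forcing = random.random() < 0.5             # set probability is up to you
hidden = context                                        # initial hidden state
word = Variable(torch.LongTensor([SOS_index]))          # first input is the SOS token
probs = []

for t in range(max_length):
    hidden = decoder_cell(embedding(word), hidden)      # h' for this time step
    scores = out_layer(hidden)                          # scores over the English corpus
    probs.append(F.softmax(scores, dim=1))              # kept for the loss in either case
    if use_teacher_forcing and t < reference.size(0):
        word = reference[t:t + 1]                       # feed the correct reference word
    else:
        _, word = scores.max(dim=1)                     # feed back the argmax prediction
        if word.data[0] == EOS_index:
            break
</code>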
**Part 4:** Loss, test metrics
Calculate accuracy by something similar to (target==reference).data.numpy(), but make sure to compensate for when the target and reference sequences are of different lengths.
Consider using perplexity in addition to cross entropy as a test metric. It's standard practice in NMT and language modelling, and it is equal to 2^cross_entropy.
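One possible sketch of both metrics (the helper names are made up; it assumes prediction and reference are 1-D LongTensor Variables of word indices and that cross entropy is measured in bits, matching the 2^cross_entropy definition above):

<code python>
import numpy as np

def sentence_accuracy(prediction, reference):
    # Pad the shorter sequence with distinct fill values so the elementwise
    # comparison is well defined; padded positions count as errors.
    length = max(prediction.size(0), reference.size(0))
    pred = np.full(length, -1, dtype=np.int64)
    ref = np.full(length, -2, dtype=np.int64)
    pred[:prediction.size(0)] = prediction.data.numpy()
    ref[:reference.size(0)] = reference.data.numpy()
    return (pred == ref).mean()

def perplexity(cross_entropy):
    # Assumes cross entropy in bits (base-2 logs), per the definition above.
    return 2.0 ** cross_entropy
</code>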
**Part 5:** Optimizer
<code python>
loss.backward()
if j % batch_size == 0:
    for p in all_parameters:
        p.grad.div_(batch_size)  # in-place
</code>
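For context, a hedged sketch of how that accumulation pattern might fit into a full update cycle (model, training_pairs, and compute_loss are placeholders for your own code; the exact placement of step/zero_grad in your training loop may differ):

<code python>
import torch.optim as optim

# Placeholders: model is your seq2seq nn.Module, training_pairs yields
# (source, reference) pairs, and compute_loss runs the forward pass.
all_parameters = list(model.parameters())
optimizer = optim.Adam(all_parameters, lr=1e-3)
batch_size = 32

for j, (source, reference) in enumerate(training_pairs, 1):
    loss = compute_loss(model, source, reference)
    loss.backward()                      # gradients accumulate across iterations
    if j % batch_size == 0:
        for p in all_parameters:
            p.grad.div_(batch_size)      # average the accumulated gradients in place
        optimizer.step()                 # apply the update
        optimizer.zero_grad()            # clear gradients for the next cycle
</code>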
While deeper architectures come with increased accuracy, state-of-the-art NMT requires lightning-fast inference times and often the ability to run on handheld devices. For these reasons most NMT systems use sparse encoder/decoders with at most 4 layers.
[[https://arxiv.org/pdf/1704.05119.pdf|Narang et al., 2017]] recently demonstrated in NMT that training wider networks and then pruning away low-activation weights during training increases both speed (using sparse operations) and accuracy relative to a comparable dense network.
After your net has been reasonably trained, iteratively prune your weight matrices, using a larger threshold value after every several training iterations. Save a new boolean mask for each parameter (by directly comparing each parameter with the threshold value) so that after each training step you can reset those tensor entries to 0. This simulates sparse tensor operations, but without the speed increase (see [[https://pytorch.org/docs/master/sparse.html|torch.sparse]] if you want to know more).
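A minimal sketch of that masking idea (the threshold value and when you recompute masks are up to you; applying the masks right after each optimizer step is one reasonable choice):

<code python>
import torch

def compute_masks(parameters, threshold):
    # Boolean mask per parameter: True where the weight magnitude survives.
    return [(p.data.abs() > threshold) for p in parameters]

def apply_masks(parameters, masks):
    # Zero out pruned entries so they stay at 0 after each update.
    for p, mask in zip(parameters, masks):
        p.data.mul_(mask.float())

# Hypothetical usage once the network is reasonably trained:
# masks = compute_masks(model.parameters(), threshold=0.05)
# ...then after every optimizer.step():
# apply_masks(model.parameters(), masks)
</code>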
Prune at least a few times using increasing thresholds, and print accuracies before and after each pruning. To see the full effect, use a deeper (3-4 layer) and wider (up to you) architecture before pruning. How many parameters can you remove before seeing a significant drop in performance? What's the final number of remaining parameters?