Implement a neural machine translation system using PyTorch.
For this lab you will implement an encoder/decoder neural machine translation (NMT) system to translate between English and Spanish. Because of the difficulties inherent in processing variable-length sentences, we will be using PyTorch instead of TensorFlow.
The data has been provided by the Church’s translation department. It is proprietary and should not be distributed.
You will turn in a Python notebook that contains your code, as well as 100 input / output / reference translation examples.
The dataset consists of 73,911 sentences drawn from general conference talks between 2010 and 2017. There are two files: one for English, one for Spanish. Each line represents one sentence.
In order to debug this lab, you will probably want to translate Spanish to English. That way, you will be able to judge the quality of the sentences that are coming out!
Some starter code is available for download via Dropbox:
You will remember from class that PyTorch is an imperative (as opposed to declarative) automatic differentiation system. Thus all tensor operations occur only as they are initiated in python (although they may run on a GPU).
A brief review of the low-level automatic differentiation system, Autograd, is in order. The Variable class (torch.autograd) wraps tensors as they flow through the computational graph, recording every operation and the nodes it depends on. When .backward() is called on a variable, standard backpropagation populates the .grad attribute of every weight variable (those wrapped in the Parameter class). At this point the computational graph is released and the gradients are ready for an optimizer.
Since we're using the nn.Module workflow, you should only need to initialize Variable objects to wrap your input tensors. Most of your trainable weights will typically be parameters of submodules, but whenever you assign a Parameter as a class member of an nn.Module instance, it is automatically accessible through module.parameters() and accumulates gradients.
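The two paragraphs above can be checked on a toy module. This is only a sketch: the Scale class is hypothetical, and it assumes a PyTorch release (0.4 or later) in which plain tensors carry autograd state, so no explicit Variable wrapper appears.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Hypothetical toy module with a single learnable scalar."""
    def __init__(self):
        super().__init__()
        # Assigning a Parameter as an attribute registers it with the module.
        self.w = nn.Parameter(torch.tensor(3.0))

    def forward(self, x):
        return self.w * x

model = Scale()
x = torch.tensor([1.0, 2.0])       # input tensor; no gradient needed
loss = model(x).sum()              # loss = w * (1 + 2)
loss.backward()                    # fills .grad on every Parameter

params = list(model.parameters())
print(len(params))                 # number of registered parameters
print(model.w.grad.item())         # d(loss)/dw = 1 + 2
```

Because the loss is w * (1 + 2), the gradient with respect to w is 3, which is what .grad should hold after the backward pass.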
In this lab you will use PyTorch to implement a vanilla encoder/decoder neural machine translation system from Spanish to English.
Resources for this lab include Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014) and Bahdanau et al., 2015. The former will be of more use in implementing the lab. State-of-the-art NMT systems use Bahdanau's attention mechanism, but the context vector alone should be enough for our dataset.
Seq2seq and encoder/decoder are nearly synonymous architectures and represent the first major breakthrough in using RNNs to map between source and target sequences of differing lengths. The encoder maps the input sequence to a fixed-length context vector, and the decoder then maps that vector to the output sequence. The loss is standard cross entropy between the scores output by the decoder and the reference sentence.
The hyperparameters used are given below.
Part 0: Data preprocessing
Like the char-rnn lab, we will index each token (word) with an integer and convert each sequence of tokens to a tensor of integer indices. NMT systems almost always specify a maximum length for input and target sentences, so filter out any source / reference pair in which either sentence exceeds that length.
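One way the preprocessing could look is sketched below in plain Python. The helper names (build_vocab, filter_pairs, to_indices), the special-token strings, and whitespace tokenization are all assumptions made for illustration; the starter code may organize this differently.

```python
def build_vocab(sentences, specials=("<SOS>", "<EOS>")):
    """Map each distinct token to an integer, reserving the special tokens first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in sentences:
        for tok in sentence.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def filter_pairs(src_lines, ref_lines, max_len):
    """Keep only pairs in which both sentences have at most max_len tokens."""
    return [(s, r) for s, r in zip(src_lines, ref_lines)
            if len(s.split()) <= max_len and len(r.split()) <= max_len]

def to_indices(sentence, vocab):
    """Convert a sentence to a list of token indices ending in EOS."""
    return [vocab[tok] for tok in sentence.split()] + [vocab["<EOS>"]]

src = ["hola mundo", "esta oración es demasiado larga para el límite"]
ref = ["hello world", "this sentence is too long for the limit"]
pairs = filter_pairs(src, ref, max_len=3)
src_vocab = build_vocab(s for s, _ in pairs)
print(pairs)
print(to_indices(pairs[0][0], src_vocab))
```

The resulting index lists can then be wrapped in a long tensor (for example with torch.tensor(indices, dtype=torch.long)) before being passed to nn.Embedding.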
Part 1: Create a linear layer.
As a further introduction to PyTorch, you will be required to implement your own version of torch.nn.Linear. Inheriting from nn.Module requires implementing only the __init__ and forward methods. When your layer is used as a submodule, its Parameter members will be available through the parent module's .parameters() method.

    import torch
    import torch.nn as nn

    class Linear(nn.Module):
        def __init__(self, in_length, out_length):
            super(Linear, self).__init__()
            # initialize the weight and bias Parameter class members

        def forward(self, input_):
            # use the weight and bias Parameter class members you created
            # use torch.matmul not mm (matmul can handle tensor matrix multiplies)
            # return the output, that's it!
            raise NotImplementedError  # placeholder until you fill this in
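The hint about torch.matmul versus torch.mm can be seen on a small example. The shapes below are arbitrary and chosen only for illustration.

```python
import torch

x = torch.randn(5, 3, 4)      # e.g. (batch, time, features)
w = torch.randn(4, 2)         # (in_length, out_length)

y = torch.matmul(x, w)        # broadcasts w across the leading dimensions
print(y.shape)                # torch.Size([5, 3, 2])

try:
    torch.mm(x, w)            # mm accepts 2-D matrices only
    mm_ok = True
except RuntimeError:
    mm_ok = False
print(mm_ok)
```

Using matmul in your forward method therefore lets the same layer accept a single vector, a matrix of vectors, or a batched tensor without reshaping.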
Part 2: Implementing the Encoder
Your encoder will return the last hidden state as the context vector. It will need to be of sufficient dimension so the decoder can use it to map to the longer sentences in your target corpus.
Create an Encoder class that encapsulates all of the graph operations necessary for embedding and returns the context vector. Initialize both nn.GRUCell and nn.Embedding class members to embed the indexed source input sequence.
Implement a GRU with GRUCell, using the embedding of each source token as the input at the corresponding time step. Use a zero tensor as the initial hidden state and return the last hidden state.
You will probably want to use several layers for your GRU.
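    class Encoder(nn.Module):
        def __init__(self, num_src_corpus, hidden_size, num_layers):  # can change these
            super(Encoder, self).__init__()
            # Instantiate nn.Embedding and nn.GRUCell

        def run_timestep(self, input, hidden):
            # implement gru here for the nth timestep
            raise NotImplementedError  # placeholder until you fill this in

        def forward(self, sentence_indices):
            # return the last hidden state as the context vector
            raise NotImplementedError  # placeholder until you fill this in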
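Since a single nn.GRUCell is one layer, a multi-layer GRU needs one cell per layer, with each layer's new hidden state feeding the layer above. The sketch below shows that stepping pattern outside of any class. The sizes are arbitrary, holding the cells in an nn.ModuleList is just one possible arrangement, and your Encoder may reasonably differ.

```python
import torch
import torch.nn as nn

hidden_size, num_layers, vocab_size = 8, 2, 20   # toy sizes, chosen arbitrarily

embedding = nn.Embedding(vocab_size, hidden_size)
# nn.ModuleList registers each cell's weights with the parent module.
cells = nn.ModuleList([nn.GRUCell(hidden_size, hidden_size) for _ in range(num_layers)])

tokens = torch.tensor([4, 7, 1])                   # one indexed sentence
hidden = [torch.zeros(1, hidden_size) for _ in range(num_layers)]

for tok in tokens:
    layer_input = embedding(tok.view(1))           # shape (1, hidden_size)
    for i, cell in enumerate(cells):
        hidden[i] = cell(layer_input, hidden[i])   # new state for layer i
        layer_input = hidden[i]                    # feeds the layer above

context = hidden[-1]                               # last hidden state, top layer
print(context.shape)
```

Here only the top layer's final state is kept as the context vector; passing every layer's final state to the decoder is another option.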
Part 3: Implementing the Decoder
Again implement a standard GRU using GRUCell, except that for the first time step you embed a tensor containing the SOS index. That embedding and the context vector serve as the first input and the initial hidden state, respectively.
Unlike the encoder, at each time step take the output (GRUCell calls it h') and run it through a linear layer and then a softmax to get probabilities over the English vocabulary. Use the word with the highest probability as the input for the next time step.
You may want to consider using a method called teacher forcing to begin connecting source and reference words. If you decide to use it, then with a set probability at each iteration feed in the embedding of the correct reference word instead of the prediction from the previous time step.
Compute and return the prediction probabilities in either case to be used by the loss function.
Continue running the decoder GRU until it reaches the maximum sentence length or first predicts EOS. Return the probabilities at each time step regardless of whether teacher forcing was used.
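The control flow of the decoding loop, including the optional teacher forcing, might be sketched as follows in plain Python. Everything here is illustrative: step is a hypothetical stand-in for your embedding, GRUCell, linear layer, and softmax, the SOS/EOS indices are assumed, and the choice to decide on teacher forcing once per sentence is only one reading of the description above.

```python
import random

SOS, EOS = 0, 1   # assumed special-token indices

def decode(step, context, reference=None, max_len=30, teacher_forcing_ratio=0.5):
    """Run a decoder one step at a time and collect per-step probabilities.

    `step(token, hidden)` is a hypothetical callable returning
    (probabilities, new_hidden).
    """
    token, hidden, all_probs = SOS, context, []
    # Decided once per sentence here; deciding per time step is also reasonable.
    force = reference is not None and random.random() < teacher_forcing_ratio
    for t in range(max_len):
        probs, hidden = step(token, hidden)
        all_probs.append(probs)                      # kept in either mode
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == EOS:
            break
        if force and t < len(reference):
            token = reference[t]                     # teacher forcing
        else:
            token = predicted                        # greedy feedback
    return all_probs

def toy_step(token, hidden):
    """Toy model over 5 tokens that always predicts 0 -> 2 -> 3 -> 4 -> EOS."""
    nxt = {0: 2, 2: 3, 3: 4, 4: 1}[token]
    return [0.9 if i == nxt else 0.025 for i in range(5)], hidden

print(len(decode(toy_step, context=None)))                # stops when EOS is predicted
print(len(decode(toy_step, context=None, max_len=2)))     # stops at the length limit
```

With the toy model, the first call runs four steps before EOS is predicted and the second is cut off by max_len after two.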
Part 4: Loss, test metrics
The loss is standard cross entropy between the output probabilities and the reference tensor, as in the char-rnn lab. Like nn.Embedding, the nn.NLLLoss class accepts indexed labels; note that it expects log-probabilities, so apply a log-softmax (rather than a plain softmax) to the decoder scores before passing them in. You may want to use the PyTorch counterparts of tf.concat and tf.reduce_sum (torch.cat and torch.sum).
Calculate accuracy with something similar to (target==reference).data.numpy(), but make sure to compensate for cases where the target and reference sequences have different lengths.
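One possible way to handle the length mismatch is to count any position that exists in only one of the two sequences as an error. The sketch below does this on plain Python lists of token indices. The function name and the choice of denominator are assumptions, and other conventions (such as padding to a fixed length) are equally defensible.

```python
def token_accuracy(predicted, reference):
    """Fraction of positions predicted correctly.

    Positions present in only one sequence count as errors, so the
    denominator is the longer of the two lengths (an assumed convention).
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / longest

print(token_accuracy([5, 9, 2, 1], [5, 9, 1]))   # 2 matches over 4 positions -> 0.5
```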
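Consider using perplexity in addition to cross entropy as a test metric. It is standard practice for NMT and language modelling, and equals 2^cross_entropy when cross entropy is measured in bits. PyTorch's loss functions use the natural logarithm, so there perplexity is exp(cross_entropy).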
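A short worked example shows that perplexity does not depend on the logarithm base, as long as the exponent base matches. The probabilities are made-up numbers for illustration.

```python
import math

# Probabilities a model assigned to the correct word at each of four positions
# (made-up numbers for illustration).
p_correct = [0.5, 0.25, 0.5, 0.25]

ce_nats = -sum(math.log(p) for p in p_correct) / len(p_correct)
ce_bits = -sum(math.log2(p) for p in p_correct) / len(p_correct)

print(math.exp(ce_nats))   # perplexity from natural-log cross entropy
print(2 ** ce_bits)        # same value from base-2 cross entropy
```

Here the cross entropy is 1.5 bits, so both lines give a perplexity of 2^1.5, roughly 2.83.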
Part 5: Optimizer
Notice that since we’ve used nn.Module classes to build our graph, we have access to all learnable parameters through .parameters(). Initialize an optimizer from torch.optim with the parameters from both your Encoder and Decoder modules.
If you want to perform gradient updates every n iterations, call optimizer.step() and optimizer.zero_grad() only every n iterations. You may have problems tuning your learning rate if you don’t divide each parameter's accumulated gradient by n before calling optimizer.step().

    loss.backward()
    if j % n == 0:  # n is the effective batch size
        for p in all_parameters:
            p.grad.div_(n)  # in-place
        optimizer.step()
        optimizer.zero_grad()
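The accumulation pattern can be tried end to end on a toy problem. In this sketch two nn.Linear layers stand in for the Encoder and Decoder, the data is random, and n = 4 is an arbitrary choice. It is meant only to show how one optimizer covers both modules and how often it steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder, decoder = nn.Linear(4, 4), nn.Linear(4, 2)   # stand-ins for the real modules
all_parameters = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(all_parameters, lr=0.01)

n = 4                                   # accumulate gradients over n examples
before = decoder.weight.detach().clone()
num_updates = 0

for j in range(1, 9):                   # eight single-example iterations
    x, y = torch.randn(1, 4), torch.randn(1, 2)
    loss = ((decoder(encoder(x)) - y) ** 2).mean()
    loss.backward()                     # adds to the existing .grad buffers
    if j % n == 0:
        for p in all_parameters:
            p.grad.div_(n)              # average rather than sum
        optimizer.step()
        optimizer.zero_grad()
        num_updates += 1

print(num_updates)
```

Collecting the parameters into a list (rather than keeping the generator returned by .parameters()) matters here, because the list is iterated again at every update.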
The attention mechanism makes use of all of the sequence and hidden outputs of the encoder GRU in the decoding phase. You may either implement the attention mechanism or create an innovative way of using the extra information generated by the encoder's GRU to aid translation in the decoder. You will need to return all outputs of the encoder's GRU from its .forward() method and pass that information to the decoder.
While deeper architectures come with increased accuracy, state of the art NMT requires lightning fast inference times and often the ability to run on handheld devices. For these reasons most NMT systems use sparse encoder/decoders with at most 4 layers.
Narang et al. (2017) recently demonstrated in NMT that training wider networks and then pruning away low-magnitude weights during training yields both greater speed (using sparse operations) and higher accuracy than a comparable dense network.
After your net has been reasonably trained, iteratively prune your weight matrices: every several training iterations, zero out the weights whose magnitude falls below a threshold, and raise the threshold each time. Save a boolean mask for each parameter (by directly comparing each parameter with the threshold value) so that after each subsequent training step you can reset the pruned entries to 0. This simulates sparse tensor operations but without the speed increase (see torch.sparse if you want to know more).
Prune at least a few times using increasing thresholds, and print accuracies before and after each pruning. To see the full effect, use a deeper (3-4 layer) and wider (up to you) architecture before pruning. How many parameters can you remove before seeing a significant drop in performance? What's the final number of remaining parameters?

    # use in reporting results
    num_total_params = sum(p.numel() for p in all_parameters)
    num_masked = sum((p == 0).sum().item() for p in all_parameters)
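The mask bookkeeping described above might look like the following on a single weight matrix. The layer, the threshold of 0.1, and the simulated update are all placeholders chosen for illustration; in the lab you would keep one mask per parameter and re-apply it after each real optimizer step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16)               # stand-in for one weight matrix in the model
threshold = 0.1                         # arbitrary value for illustration

with torch.no_grad():
    # True where the weight survives, False where it is pruned.
    mask = layer.weight.abs() >= threshold
    layer.weight.mul_(mask.to(layer.weight.dtype))   # zero the pruned entries

# A training step would go here and may make pruned weights nonzero again.
with torch.no_grad():
    layer.weight.add_(0.01)                          # simulate an update
    layer.weight.mul_(mask.to(layer.weight.dtype))   # re-apply the saved mask

num_total = layer.weight.numel()
num_masked = (layer.weight == 0).sum().item()
print(num_masked, "of", num_total, "weights pruned")
```

Multiplying by the mask after every update is what keeps the pruned entries at zero, since the optimizer itself knows nothing about the mask.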
Debugging in PyTorch is significantly more straightforward than in TensorFlow. Tensors are available at any time to print or log.
Better hyperparameters to come. With the settings below, training started to converge after two hours on a K80 using Adam.

    learning_rate = .01  # decayed
    batch_size = 40      # effective batch size
    max_seq_length = 30
    hidden_dim = 1024
The folks at the supercomputer center have installed `pytorch` and `torchvision`. To use pytorch, you'll need to use the following modules in your SLURM file:
    # these are dependencies:
    module load cuda/8.0 cudnn/6.0_8.0
    module load python/27
    module load python-pytorch python-torchvision
If you need more python libraries you can install them to your home directory with:
pip install --user libraryname