BYU CS classes


Objective:

Implement a neural machine translation system using PyTorch.

Deliverable:

For this lab you will implement an encoder/decoder neural machine translation (NMT) system to translate between English and Spanish. Because of the difficulties inherent in processing variable-length sentences, we will be using PyTorch instead of TensorFlow.

The data has been provided by the Church’s translation department. It is proprietary and should not be distributed.

You will turn in a Python notebook that contains your code, as well as 100 input / output / reference translation examples.

Grading Standards:

30% Proper design, creation and debugging of an encoder/decoder NMT system.
30% Correct GRU implementation using nn.GRUCell.
20% Correct test/loss function implementation.
20% Translation recognizable (print both source and reference sentences)

The dataset:

The dataset consists of 73,911 sentences drawn from general conference talks between 2010 and 2017. There are two files: one for English, one for Spanish. Each line represents one sentence.

In order to debug this lab, you will probably want to translate Spanish to English. That way, you will be able to judge the quality of the sentences that are coming out!

PyTorch Overview:

You will remember from class that PyTorch is an imperative (as opposed to declarative) automatic differentiation system. Thus all tensor operations occur only as they are initiated in Python (although they may run on a GPU).

A brief review of the low-level automatic differentiation system, Autograd, is in order. The Variable class (torch.autograd) wraps tensors as they flow through the computational graph, recording every operation and its dependent nodes. When .backward() is called on a variable, standard backpropagation leaves a .grad attribute populated on every weight variable (those wrapped in a Parameter class). At this point the computational graph is released and the gradients are ready for an optimizer.

Some particulars for this lab, since we're using the nn.Module workflow: always wrap input tensors in Variable and any trainable weights in Parameter. Parameter is a shallow wrapper for Variable(requires_grad=True) and is accessible through .parameters() on its module or any parent module. Note that since modules are callable, weight sharing is significantly easier than in TensorFlow.

Deliverable:

In this lab you will use PyTorch to implement a vanilla encoder / decoder neural machine translation system from Spanish to English.

Resources for this lab include Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014) and Bahdanau et al., 2015. State-of-the-art NMT systems use Bahdanau's attention mechanism, but the context vector alone should be enough for our dataset.

Seq2seq and encoder/decoder are nearly synonymous architectures and represent the first major breakthrough using RNNs to map between source and target sequences of differing lengths. The encoder will map input sequences to a fixed length context vector and the decoder will then map that to the output sequence. Standard softmax / cross entropy is used on the scores output by the decoder and compared against the reference sequence.

The hyperparameters used are given in the Notes at the end of this page.

Part 0: Data preprocessing

As in the char-rnn lab, we will index each token (word) with an integer and convert sequences of tokens to an integer index tensor. NMT systems almost always specify a maximum length for input and target sentences, so filter out any source / reference pairs in which either sentence exceeds that length.
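The indexing and filtering steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names, the special tokens and their indices, whitespace tokenization, and the toy sentences are all made up for the example and are not part of the lab's starter code or dataset.

```python
# Hypothetical special tokens; their names and indices are assumptions.
SOS, EOS, UNK = "<sos>", "<eos>", "<unk>"

def build_vocab(sentences):
    """Map each distinct whitespace-separated token to an integer index."""
    vocab = {SOS: 0, EOS: 1, UNK: 2}
    for sentence in sentences:
        for token in sentence.split():
            # Only assigns a new index if the token has not been seen yet.
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Convert a sentence to a list of indices, terminated by EOS."""
    return [vocab.get(tok, vocab[UNK]) for tok in sentence.split()] + [vocab[EOS]]

def filter_pairs(src_sentences, ref_sentences, max_len):
    """Keep only pairs in which both sentences have at most max_len tokens."""
    return [
        (s, r)
        for s, r in zip(src_sentences, ref_sentences)
        if len(s.split()) <= max_len and len(r.split()) <= max_len
    ]

# Toy data, invented for this example.
spanish = ["la fe es poder", "el amor de dios es grande y eterno"]
english = ["faith is power", "the love of god is great and eternal"]

pairs = filter_pairs(spanish, english, max_len=5)   # drops the 8-token pair
src_vocab = build_vocab(s for s, _ in pairs)
indexed = [encode(s, src_vocab) for s, _ in pairs]
```

Each list in indexed can then be converted to an integer tensor before it is passed to the encoder.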

Part 1: Create a linear layer.

As a further introduction to PyTorch, you will be required to implement your own version of torch.nn.Linear. Inheriting from nn.Module requires only implementing the __init__ and forward methods. As a submodule, its Parameter instance members will be available through the parent module's .parameters() method.

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, in_length, out_length):
        super(Linear, self).__init__()
        # initialize the weight and bias Parameter class members

    def forward(self, input_):
        # use the weight and bias Parameter class members you created
        # use torch.matmul not mm (matmul can handle tensor matrix multiplies)
        # return the output, that's it!
        raise NotImplementedError  # placeholder until you implement forward

Part 2: Implementing the Encoder

Your encoder will return the last hidden state as the context vector. It will need to be of sufficient dimension so the decoder can use it to map to the longer sentences in your target corpus.

Create an Encoder class that encapsulates all of the graph operations necessary for embedding the source sentence and returning the context vector. Initialize both nn.GRUCell and nn.Embedding class members to embed the indexed source input sequence.

For each time step, use the previous hidden state and the embedding for the current word as the hidden state and input tensor for the GRU, respectively. For the first time step use a tensor of zeros for the initial hidden state. Return the last layer's hidden state at the last time step.

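class Encoder(nn.Module):
    def __init__(self, num_src_corpus, hidden_size, num_layers): # can change these
        super(Encoder, self).__init__()
        # Instantiate nn.Embedding and nn.GRUCell

    def run_timestep(self, input_, hidden):
        # implement gru here for the nth timestep
        raise NotImplementedError  # placeholder until implemented

    def forward(self, sentence_indices):
        # embed each index, call run_timestep for every time step,
        # and return the last layer's hidden state at the last time step
        raise NotImplementedError  # placeholder until implemented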

Part 3: Implementing the Decoder

The decoder will be more involved, but will be similar to the encoder. This time there will be an additional connection across time steps: the output at one time step feeds the first layer's input at the next time step.

For the first time step, embed a tensor containing the SOS index. That embedding and the context vector will serve as the input and initial hidden state. Call GRUCell n_layers times as before, but for subsequent time steps use the prediction of the previous time step as the initial input. As in the encoder, the initial hidden state at each time step will be the last hidden state from the previous time step.

Use a linear layer and then softmax to convert the output at each time step to a tensor of probabilities over all words in your target corpus and use those probabilities to create the prediction for the next word.

Stop the first time that EOS is predicted. Return the probabilities at each time step and the indices of predicted words.
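The control flow of the decoding loop described above might look roughly like the plain-Python sketch below. It is not the required implementation: step_fn is a hypothetical stand-in for one decoder time step (embedding, GRU layers and linear layer), picking the most probable word (greedy argmax) is an assumption about how the prediction is made, and the max_len cap is added only so the loop always terminates.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(step_fn, context, sos_index, eos_index, max_len):
    """Decode until EOS is predicted or max_len steps have been taken.

    step_fn(prev_index, hidden) -> (scores, new_hidden) is a hypothetical
    stand-in for a single decoder time step.
    """
    prev, hidden = sos_index, context      # SOS and the context vector start the loop
    all_probs, predictions = [], []
    for _ in range(max_len):
        scores, hidden = step_fn(prev, hidden)
        probs = softmax(scores)
        prev = max(range(len(probs)), key=probs.__getitem__)  # argmax over the vocabulary
        all_probs.append(probs)
        predictions.append(prev)
        if prev == eos_index:              # stop the first time EOS is predicted
            break
    return all_probs, predictions

# Toy step function over a 4-word vocabulary: always favors index (prev + 1).
def toy_step(prev, hidden):
    scores = [0.0] * 4
    scores[(prev + 1) % 4] = 5.0
    return scores, hidden

probs, preds = greedy_decode(toy_step, context=None, sos_index=0, eos_index=3, max_len=10)
```

With the toy step function, decoding starts from index 0, predicts indices 1 and 2, and stops when index 3 (playing the role of EOS here) is predicted.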

Part 4: Loss, test metrics

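The loss is standard cross entropy between the output probabilities and the reference tensor, as in the char-rnn lab. Like the nn.Embedding class, the nn.NLLLoss class accepts indexed labels; note that it expects log-probabilities (for example, the output of log_softmax) rather than raw probabilities. You may want to use PyTorch methods similar to tf.concat and tf.reduce_sum (torch.cat and torch.sum).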
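To make the idea of indexed labels concrete, here is a small plain-Python sketch of a negative log-likelihood computed from log-probabilities. The function names and the toy scores are assumptions for illustration; the sketch mirrors the idea of selecting one log-probability per time step by index and is not a reimplementation of any PyTorch class.

```python
import math

def log_softmax(scores):
    """Log-probabilities for one time step, computed stably."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

def nll_loss(log_probs_per_step, reference_indices):
    """Mean negative log-likelihood of the reference word at each time step."""
    picked = [lp[idx] for lp, idx in zip(log_probs_per_step, reference_indices)]
    return -sum(picked) / len(picked)

# Two time steps over a 3-word vocabulary (toy numbers).
scores = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
reference = [0, 2]                      # indices of the reference words, not one-hot vectors

loss = nll_loss([log_softmax(s) for s in scores], reference)
```

The label for each time step is a single integer index, so no one-hot encoding of the reference sentence is needed.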

Calculate accuracy by something similar to (target==reference).data.numpy(), but make sure to compensate for when the target and reference sequences are of different lengths.
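One possible way to compensate for differing lengths is sketched below in plain Python. Treating positions that exist in only one sequence as mismatches, and dividing by the longer length, is an assumption made for this example; other conventions (such as padding or truncating) are equally defensible.

```python
def token_accuracy(predicted, reference):
    """Fraction of matching positions, counted over the longer sequence.

    Positions present in only one of the two sequences count as mismatches.
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0                      # two empty sequences agree trivially
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / longest

# The prediction is one word shorter than the reference (toy indices).
acc = token_accuracy([4, 7, 9, 1], [4, 7, 2, 5, 1])
```

Here two of the compared positions match and the longer sequence has five words, so the accuracy is 2/5.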

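Perplexity is a standard measure for NMT and language modelling. It is equivalent to 2^cross_entropy when the cross entropy is measured in bits (log base 2); with the natural-log cross entropy that PyTorch's loss functions compute, it is exp(cross_entropy).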
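The relationship between perplexity and the base of the logarithm can be checked with a short plain-Python calculation. The probabilities below are toy values standing in for the probability the model assigned to each reference word; the function name is hypothetical.

```python
import math

def cross_entropy(probs_of_reference, base=math.e):
    """Mean negative log-probability assigned to the reference words."""
    total = sum(math.log(p, base) for p in probs_of_reference)
    return -total / len(probs_of_reference)

probs_of_reference = [0.5, 0.25, 0.125]

ce_bits = cross_entropy(probs_of_reference, base=2)   # cross entropy in bits
ce_nats = cross_entropy(probs_of_reference)           # cross entropy in nats (natural log)

perplexity_from_bits = 2 ** ce_bits
perplexity_from_nats = math.exp(ce_nats)
```

Both expressions give the same perplexity (about 4 for these values), which shows that the base of the exponent has to match the base of the logarithm used for the cross entropy.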

Part 5: Optimizer

Notice that since we've used nn.Module classes to build our graph, we have access to all learnable parameters through .parameters(). Initialize an optimizer from torch.optim using the parameters from both your Encoder and Decoder modules.

If you want to perform gradient updates only every n iterations, call optimizer.step() and optimizer.zero_grad() every n iterations. You may have problems tuning your learning rate if you don't divide each parameter's calculated gradient by n before calling optimizer.step().

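# compute loss

loss.backward()          # gradients accumulate in p.grad across iterations
if j % n == 0:           # assumes j counts iterations starting at 1
    for p in all_parameters:
        p.grad.div_(n)   # in-place average of the n accumulated gradients

    optimizer.step()
    optimizer.zero_grad()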
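The effect of dividing by n can be seen with a small worked example in plain Python. It uses a single scalar weight and a squared-error loss, both chosen only for illustration, and sums gradients by hand in the same way that repeated backward passes add into .grad.

```python
def grad_squared_error(w, x, y):
    """Gradient of (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x

w = 0.5
examples = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]   # toy (x, y) pairs
n = len(examples)

accumulated = 0.0
for x, y in examples:
    accumulated += grad_squared_error(w, x, y)   # gradients add up across iterations

averaged = accumulated / n                       # what dividing each gradient by n achieves

learning_rate = 0.01
step_without_division = learning_rate * accumulated
step_with_division = learning_rate * averaged
```

Without the division, the update is n times larger than the update for the mean loss, so a learning rate tuned for one setting would not transfer to a different n.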

Bonus ideas:

Attention Mechanism

The attention mechanism makes use of all of the sequence and hidden outputs of the encoder GRU in the decoding phase. You may either implement the attention mechanism or create an innovative way of using the extra information generated from the encoder's GRU in the decoder to aid translation. You will need to return all outputs of the encoder's GRU in its .forward() method and also pass that information to the decoder.

Pruning

While deeper architectures come with increased accuracy, state of the art NMT requires lightning fast inference times and often the ability to run on handheld devices. For these reasons most NMT systems use sparse encoder/decoders with at most 4 layers.

Narang et al. (2017) recently demonstrated, in NMT and other settings, that training a sparse network while pruning away low-magnitude weights resulted in better performance and faster inference times than training a similar dense network.

After your net has been reasonably trained, prune your weight matrices iteratively: every several training iterations, zero out the weights whose magnitude falls below a threshold, and raise the threshold each time. Save new boolean masks for each parameter (by directly comparing each parameter with the threshold value) so that after each training step you can reset those tensor entries to 0. This simulates sparse tensor operations but without the speed increase (see pytorch.org/docs/master/sparse.html if you want to know more).

Prune at least a few times using increasing thresholds, and print accuracies before and after each pruning. To see the full effect, use a deeper (3-4 layer) and wider (up to you) architecture before pruning. How many parameters can you remove before seeing a significant drop in performance? What's the final number of remaining parameters?

# use in reporting results
num_total_params = sum(p.data.numel() for p in all_parameters)
num_masked = sum(int((p.data == 0).sum()) for p in all_parameters)
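The mask bookkeeping for pruning can be illustrated with nested Python lists standing in for a weight matrix. This is a sketch under assumptions: the helper names, the thresholds, and the fake "training step" that nudges every weight are invented for the example, and pruning by absolute value is one reading of comparing each parameter with the threshold.

```python
def make_mask(weights, threshold):
    """True where a weight's magnitude reaches the threshold (the weight is kept)."""
    return [[abs(w) >= threshold for w in row] for row in weights]

def apply_mask(weights, mask):
    """Reset every masked-out entry to 0 and leave the others unchanged."""
    return [
        [w if keep else 0.0 for w, keep in zip(row, mask_row)]
        for row, mask_row in zip(weights, mask)
    ]

def count_zeros(weights):
    return sum(w == 0.0 for row in weights for w in row)

weights = [[0.8, -0.05, 0.3], [-0.02, 0.6, -0.15]]   # toy 2x3 weight matrix

# First pruning round with a small threshold.
mask = make_mask(weights, 0.1)
weights = apply_mask(weights, mask)
zeros_after_first = count_zeros(weights)

# A stand-in for a training step that changes every weight, pruned ones included.
weights = [[w + 0.01 for w in row] for row in weights]
weights = apply_mask(weights, mask)                  # reset pruned entries to 0

# Second round with a larger threshold; already-pruned zeros stay pruned.
mask = make_mask(weights, 0.2)
weights = apply_mask(weights, mask)

num_total_params = sum(len(row) for row in weights)
num_masked = count_zeros(weights)
```

Reapplying the saved mask after each update is what keeps pruned entries at zero, since the optimizer would otherwise move them away from zero again.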

Notes:

Debugging in PyTorch is significantly more straightforward than in TensorFlow. Tensors are available at any time to print or log.

Better hyperparameters to come. Started to converge after two hours on a K80.

learning_rate = .01 # decayed, lowest .0001
batch_size = 40 # effective batch size
max_seq_length = 40 # ambitious
hidden_dim = 1024 # can use larger
