
Objective:

Implement the proximal policy optimization (PPO) algorithm and demonstrate that it learns on simple worlds from the OpenAI gym suite.

Deliverable:

For this lab, you will turn in a Colab notebook that implements the proximal policy optimization (PPO) algorithm. You must provide tangible proof that your algorithm is working.


Grading standards:

Your notebook will be graded on the following:


Description:

For this lab, you will implement the PPO algorithm and train it on a few simple worlds from the OpenAI gym suite of test problems.

You may use any code you want from the internet to help you understand how to implement this, but all final code must be your own.

To successfully complete this lab, you must do the following:

Part of this lab involves a demonstration that your implementation is working properly. Most likely, this will be a measure of return / reward / value over time; as the policy improves, we should see it increase.

Some domains have other, more natural measures of performance. For example, for the cart-pole problem, we can measure how many steps it takes before the pole falls over and we reach a terminal state; as the policy improves, this number should increase.

You are also welcome to be creative in your illustrations - you could use a video of the agent doing something reasonable, for example.
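For example, here is a minimal sketch of producing such a plot (it assumes you append the average episode return, or standing time, to a Python list once per training epoch; the variable names here are just placeholders):

import matplotlib.pyplot as plt

avg_returns = []   # one entry per training epoch, appended after each batch of rollouts

plt.plot(avg_returns)
plt.xlabel('training epoch')
plt.ylabel('average episode return (standing time for cart-pole)')
plt.title('PPO learning curve')
plt.show()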


Background and documentation:

Here are the OpenAI gym worlds.

Here is a blog post introducing the idea.

Here is the paper with a technical description of the algorithm (its clipped objective is restated just below for quick reference).

Here is a video describing it at a high level.
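For quick reference, the heart of that paper is the clipped surrogate objective (restated here in the paper's notation):

    L^{CLIP}(\theta) = \hat{E}_t[ \min( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t ) ], \quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

where \hat{A}_t is an estimate of the advantage at timestep t and \epsilon is a small clipping parameter (the paper uses 0.2, as does the outline below).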


Hints and helps:

Update: Here is our lab's implementation of PPO. NOTE: because this code comes with a complete implementation that runs on VizDoom, you may not use VizDoom as your additional test domain.

Here are some instructions for installing VizDoom on Colab.


Here is some code from our reference implementation. Hopefully it will serve as a good outline of what you need to do.

import gym
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from functools import reduce
...
 
class PolicyNetwork(nn.Module):
    ...
 
 
class ValueNetwork(nn.Module):
    ...
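

# One way the two skeletons above might be filled in for CartPole. This is a
# hedged sketch, not the course's reference implementation: it assumes plain
# fully-connected networks are enough, and that the policy's forward pass
# returns a probability distribution over the 2 actions (matching the
# PolicyNetwork(4, 2) / ValueNetwork(4) construction in main below).
class PolicyNetwork(nn.Module):
    def __init__(self, state_size, num_actions):
        super(PolicyNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, x):
        # action probabilities, so they can be sampled and reused in the PPO ratio
        return torch.softmax(self.net(x), dim=-1)


class ValueNetwork(nn.Module):
    def __init__(self, state_size):
        super(ValueNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        # scalar estimate of the state's value
        return self.net(x)
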
 
class AdvantageDataset(Dataset):
    """Flattens a list of rollouts into individual (state, return) samples for fitting the value network."""

    def __init__(self, experience):
        super(AdvantageDataset, self).__init__()
        self._exp = experience
        self._num_runs = len(experience)
        self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

    def __getitem__(self, index):
        # walk through the rollouts until we reach the one that contains this index
        idx = 0
        seen_data = 0
        current_exp = self._exp[0]
        while seen_data + len(current_exp) - 1 < index:
            seen_data += len(current_exp)
            idx += 1
            current_exp = self._exp[idx]
        chosen_exp = current_exp[index - seen_data]
        return chosen_exp[0], chosen_exp[4]

    def __len__(self):
        return self._length


class PolicyDataset(Dataset):
    """Flattens a list of rollouts into individual experience tuples for the policy update."""

    def __init__(self, experience):
        super(PolicyDataset, self).__init__()
        self._exp = experience
        self._num_runs = len(experience)
        self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

    def __getitem__(self, index):
        # walk through the rollouts until we reach the one that contains this index
        idx = 0
        seen_data = 0
        current_exp = self._exp[0]
        while seen_data + len(current_exp) - 1 < index:
            seen_data += len(current_exp)
            idx += 1
            current_exp = self._exp[idx]
        chosen_exp = current_exp[index - seen_data]
        return chosen_exp

    def __len__(self):
        return self._length
 
 
def main():
 
    env = gym.make('CartPole-v0')
    policy = PolicyNetwork(4, 2)   # CartPole: 4-dimensional observation, 2 discrete actions
    value = ValueNetwork(4)
 
    policy_optim = optim.Adam(policy.parameters(), lr=1e-2, weight_decay=0.01)
    value_optim = optim.Adam(value.parameters(), lr=1e-3, weight_decay=1)
 
    # ... more stuff here...
 
    # Hyperparameters
    epochs = 1000
    env_samples = 100
    episode_length = 200
    gamma = 0.9
    value_epochs = 2
    policy_epochs = 5
    batch_size = 32
    policy_batch_size = 256
    epsilon = 0.2   # PPO clipping parameter
 
    for _ in range(epochs):
        # generate rollouts
        rollouts = []
        standing_len = 0   # total steps survived across this epoch's rollouts
        for _ in range(env_samples):
            # don't forget to reset the environment at the beginning of each episode!
            # rollout for a certain number of steps!
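            # A sketch of one rollout, assuming the classic gym API where
            # env.reset() returns the state and env.step() returns
            # (next_state, reward, done, info), and assuming the policy's
            # forward pass returns action probabilities. Each step is stored
            # as [state, prob_of_chosen_action, action, reward]; the return
            # (index 4) and advantage (index 5) are appended later, which is
            # what AdvantageDataset above expects.
            current_rollout = []
            state = env.reset()
            with torch.no_grad():
                for _ in range(episode_length):
                    probs = policy(torch.tensor(state, dtype=torch.float32))
                    action = torch.distributions.Categorical(probs).sample().item()
                    next_state, reward, done, _ = env.step(action)
                    current_rollout.append([state, probs[action].item(), action, reward])
                    state = next_state
                    if done:
                        break
            standing_len += len(current_rollout)
            rollouts.append(current_rollout)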
 
        print('avg standing time:', standing_len / env_samples)
        calculate_returns(rollouts, gamma)
 
        # Approximate the value function
        value_dataset = AdvantageDataset(rollouts)
        value_loader = DataLoader(value_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
        for _ in range(value_epochs):
            # train value network
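            # A sketch of one pass over the (state, return) pairs. It assumes
            # the value network maps a batch of states to one scalar each and
            # is fit to the observed discounted returns with a squared error.
            for states, returns in value_loader:
                states = states.float()
                returns = returns.float().unsqueeze(1)
                value_optim.zero_grad()
                value_loss = nn.functional.mse_loss(value(states), returns)
                value_loss.backward()
                value_optim.step()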
 
        calculate_advantages(rollouts, value)
 
        # Learn a policy
        policy_dataset = PolicyDataset(rollouts)
        policy_loader = DataLoader(policy_dataset, batch_size=policy_batch_size, shuffle=True, pin_memory=True)
        for _ in range(policy_epochs):
            # train policy network
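            # A sketch of the PPO clipped-surrogate update. It assumes the
            # experience layout from the rollout sketch above: state at index
            # 0, the old probability of the chosen action at index 1, the
            # action at index 2, and the advantage appended at index 5.
            for states, old_probs, actions, _, _, advantages in policy_loader:
                states = states.float()
                old_probs = old_probs.float()
                advantages = advantages.float()

                policy_optim.zero_grad()
                current_probs = policy(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
                ratio = current_probs / (old_probs + 1e-8)
                clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
                # pessimistic bound: take the smaller of the clipped and unclipped objectives
                policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
                policy_loss.backward()
                policy_optim.step()


# Hedged sketches of the two helpers called in main. The experience layout
# (discounted return appended at index 4, advantage at index 5) is an
# assumption chosen to be consistent with the rollout sketch and with the
# datasets above; the reference implementation may differ.
def calculate_returns(rollouts, gamma):
    # Walk each rollout backwards, accumulating the discounted return, and
    # append it to every experience (landing at index 4).
    for rollout in rollouts:
        ret = 0.0
        for exp in reversed(rollout):
            ret = exp[3] + gamma * ret   # exp[3] is the immediate reward
            exp.append(ret)


def calculate_advantages(rollouts, value):
    # Advantage = discounted return minus the value network's estimate of the
    # state, appended to every experience (landing at index 5).
    with torch.no_grad():
        for rollout in rollouts:
            for exp in rollout:
                baseline = value(torch.tensor(exp[0], dtype=torch.float32)).item()
                exp.append(exp[4] - baseline)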