====Objective:====

  * To implement the Proximal Policy Optimization algorithm, and learn about the use of deep learning in the context of deep RL.

----

====Deliverable:====

For this lab, you will turn in a colab notebook that implements the proximal policy optimization (PPO) algorithm. You must provide tangible proof that your algorithm is working.

----

====Grading standards:====

Your notebook will be graded on the following:

  * 45% Proper design, creation, and debugging of actor and critic networks
  * 25% Proper implementation of the PPO loss function and objective on cart-pole ("CartPole-v0")
  * 20% Implementation and demonstrated learning of PPO on another domain of your choice (**except** VizDoom)
  * 10% Visualization of policy return as a function of training

----

====Description:====

For this lab, you will implement the PPO algorithm and train it on a few simple worlds from the OpenAI gym test suite of problems. You may use any code you want from the internet to help you understand how to implement this, but **all final code must be your own**.

To successfully complete this lab, you must do the following:

  * Implement a policy network. This should be a mapping from states to probabilities over actions.
  * Implement a value network. This should be a mapping from states to the value of that state.
  * Implement a loss function for the value network. This should compare the estimated value to the observed return.
  * Implement the PPO loss function.
  * Train the value and policy networks. This will at least involve:
    - Generating data by generating roll-outs (according to the current policy)
    - Calculating the actual discounted sum of rewards for each rollout
    - Using the value network to estimate the value of each rollout
    - Using actual and estimated value to calculate the advantage

Part of this lab involves a demonstration that your implementation is working properly. Most likely, this will be a measure of return / reward / value over time; as the policy improves, we should see it increase. Some domains have other, more natural measures of performance. For example, for the cart-pole problem, we can measure how many iterations it takes before the pole falls over and we reach a terminal state; as the policy improves, this number improves. You are also welcome to be creative in your illustrations - you could use a video of the agent doing something reasonable, for example.

----

====Background and documentation:====

Here are [[https://gym.openai.com/|the OpenAI gym worlds]].

Here is a [[https://blog.openai.com/openai-baselines-ppo/|blog post introducing the idea]].

Here is the [[https://arxiv.org/pdf/1707.06347.pdf|paper with a technical description of the algorithm]].

Here is a [[https://www.youtube.com/watch?v=5P7I-xPq8u8|video describing it at a high level]].

----

====Hints and helps:====

**Update**: Here is [[https://github.com/joshgreaves/reinforcement-learning|our lab's implementation of PPO]]. NOTE: because this code comes with a complete implementation of running on VizDoom, **you may not use that as your additional test domain.**

Here are some [[https://stackoverflow.com/questions/50667565/how-to-install-vizdoom-using-google-colab|instructions for installing vizdoom on colab]].

----

Here is some code from our reference implementation. Hopefully it will serve as a good outline of what you need to do.

import gym
import torch

...

class PolicyNetwork(nn.Module):
    ...

class ValueNetwork(nn.Module):
    ...
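# The stubs above omit the network bodies. As a hedged sketch (not the reference
# implementation), simple fully connected networks are enough for CartPole-v0;
# the hidden size, activations, and softmax output below are assumptions.
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over the discrete actions."""
    def __init__(self, state_dim, num_actions, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_actions),
            nn.Softmax(dim=-1),  # output action probabilities
        )

    def forward(self, state):
        return self.net(state)

class ValueNetwork(nn.Module):
    """Maps a state to a scalar estimate of that state's value."""
    def __init__(self, state_dim, hidden_size=64):
        super(ValueNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, state):
        return self.net(state)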
class AdvantageDataset(Dataset):
    def __init__(self, experience):
        super(AdvantageDataset, self).__init__()
        self._exp = experience
        self._num_runs = len(experience)
        self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

    def __getitem__(self, index):
        idx = 0
        seen_data = 0
        current_exp = self._exp[0]
        while seen_data + len(current_exp) - 1 < index:
            seen_data += len(current_exp)
            idx += 1
            current_exp = self._exp[idx]
        chosen_exp = current_exp[index - seen_data]
        return chosen_exp[0], chosen_exp[4]

    def __len__(self):
        return self._length


class PolicyDataset(Dataset):
    def __init__(self, experience):
        super(PolicyDataset, self).__init__()
        self._exp = experience
        self._num_runs = len(experience)
        self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

    def __getitem__(self, index):
        idx = 0
        seen_data = 0
        current_exp = self._exp[0]
        while seen_data + len(current_exp) - 1 < index:
            seen_data += len(current_exp)
            idx += 1
            current_exp = self._exp[idx]
        chosen_exp = current_exp[index - seen_data]
        return chosen_exp

    def __len__(self):
        return self._length


def main():
    env = gym.make('CartPole-v0')
    policy = PolicyNetwork(4, 2)
    value = ValueNetwork(4)
    policy_optim = optim.Adam(policy.parameters(), lr=1e-2, weight_decay=0.01)
    value_optim = optim.Adam(value.parameters(), lr=1e-3, weight_decay=1)

    # ... more stuff here...

    # Hyperparameters
    epochs = 1000
    env_samples = 100
    episode_length = 200
    gamma = 0.9
    value_epochs = 2
    policy_epochs = 5
    batch_size = 32
    policy_batch_size = 256
    epsilon = 0.2

    for _ in range(epochs):
        # generate rollouts
        rollouts = []
        for _ in range(env_samples):
            # don't forget to reset the environment at the beginning of each episode!
            # rollout for a certain number of steps!
            ...

        print('avg standing time:', standing_len / env_samples)
        calculate_returns(rollouts, gamma)

        # Approximate the value function
        value_dataset = AdvantageDataset(rollouts)
        value_loader = DataLoader(value_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
        for _ in range(value_epochs):
            # train value network
            ...

        calculate_advantages(rollouts, value)

        # Learn a policy
        policy_dataset = PolicyDataset(rollouts)
        policy_loader = DataLoader(policy_dataset, batch_size=policy_batch_size, shuffle=True, pin_memory=True)
        for _ in range(policy_epochs):
            # train policy network
            ...
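The outline above calls calculate_returns and calculate_advantages and leaves the two training loops as comments. Below is a hedged sketch of one way those pieces might look; it is an illustration under stated assumptions, not the reference implementation. It assumes each experience tuple is stored during the rollout as (state, action_probabilities, action, reward), with the discounted return and then the advantage appended afterwards (so index 0 is the state and index 4 is the return, matching AdvantageDataset above). The clipped surrogate loss follows the PPO paper with clipping parameter epsilon; the name ppo_clip_loss is my own, not part of the reference code.

import torch

def calculate_returns(rollouts, gamma):
    # Walk each episode backwards, accumulating the discounted sum of rewards,
    # and append the return to every experience tuple.
    for rollout in rollouts:
        ret = 0.0
        for i in reversed(range(len(rollout))):
            state, probs, action, reward = rollout[i]
            ret = reward + gamma * ret
            rollout[i] = (state, probs, action, reward, ret)

def calculate_advantages(rollouts, value):
    # Advantage = observed return minus the value network's estimate for that state.
    with torch.no_grad():
        for rollout in rollouts:
            for i, (state, probs, action, reward, ret) in enumerate(rollout):
                baseline = value(torch.as_tensor(state, dtype=torch.float32)).item()
                rollout[i] = (state, probs, action, reward, ret, ret - baseline)

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    # Clipped surrogate objective from the PPO paper, written as a loss to minimize.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()

Inside the "train policy network" loop, you would recompute the action probabilities for each batch of states with the current policy, take the log-probability of the action that was actually sampled, and compare it against the stored (old) log-probability via the clipped loss. The value network is trained separately in the "train value network" loop with a mean-squared-error loss between its estimate and the stored return.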