cs501r_f2018:lab9



Objective:

  • To implement the Proximal Policy Optimization algorithm

Deliverable:

For this lab, you will turn in a Colab notebook that implements the proximal policy optimization (PPO) algorithm.


Grading standards:

Your notebook will be graded on the following:

  • 45% Proper design, creation, and debugging of the actor and critic networks
  • 25% Proper implementation of the PPO loss function and objective on cart-pole (“CartPole-v0”)
  • 20% Implementation and demonstrated learning of PPO on another domain of your choice
  • 10% Visualization of policy return as a function of training
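
For the visualization requirement, the simplest approach is to record the average policy return after every training epoch and plot it when training finishes. The snippet below is a minimal sketch of that; the list name avg_returns is illustrative, not part of the assignment.

import matplotlib.pyplot as plt

avg_returns = []   # append the average (undiscounted) episode return after each epoch

# ... training loop goes here, appending one value per epoch ...

plt.plot(avg_returns)
plt.xlabel('epoch')
plt.ylabel('average policy return')
plt.title('Policy return as a function of training')
plt.show()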

Description:

For this lab, you will implement the PPO algorithm, and train it on a few simple worlds from the OpenAI gym test suite of problems.
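
If you have not used gym before, the basic interaction loop looks like the sketch below, shown here with a random policy on CartPole-v0. Your roll-outs will follow the same pattern, except that actions will be sampled from your policy network. This assumes the classic gym API, in which reset() returns an observation and step() returns (observation, reward, done, info).

import gym

env = gym.make('CartPole-v0')

state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()            # replace with a sample from your policy
    state, reward, done, info = env.step(action)  # advance the environment one step
    total_reward += reward
print('episode return:', total_reward)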

You may use any code you want from the internet to help you understand how to implement this, but all final code must be your own.

To successfully complete this lab, you must do the following:

  • Implement a policy network. This should be a mapping from states to probabilities over actions.
  • Implement a value network. This should be a mapping from states to the value of that state.
  • Implement a loss function for the value network. This should compare the estimated value to the observed return.
  • Implement the PPO loss function (the clipped surrogate objective; see the sketch after this list).
  • Train the value and policy networks. This will at least involve:
    1. Generating data by performing roll-outs (according to the current policy)
    2. Calculating the actual discounted sum of rewards for each rollout
    3. Using the value network to estimate the value of each state in each rollout
    4. Using the actual and estimated values to calculate the advantages
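
As a concrete reference for the PPO loss item above, here is a minimal sketch of the clipped surrogate objective. It assumes that, for each sampled state-action pair, you have the probability the action had under the policy that generated the rollout, its probability under the current policy, and an advantage estimate; the function and argument names are illustrative.

import torch

def ppo_loss(new_probs, old_probs, advantages, epsilon=0.2):
    # new_probs:  probability of each taken action under the current policy
    # old_probs:  probability of the same action under the rollout ("old") policy
    # advantages: advantage estimate for each state-action pair
    # The old probabilities and the advantages are treated as constants.
    old_probs = old_probs.detach()
    advantages = advantages.detach()

    ratio = new_probs / old_probs
    clipped_ratio = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)

    # elementwise minimum of the unclipped and clipped surrogate terms,
    # negated so the result can be minimized with a standard optimizer
    return -torch.mean(torch.min(ratio * advantages, clipped_ratio * advantages))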

Background and documentation:

Hints and helps:

import gym
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
...
 
class PolicyNetwork(nn.Module):
    # maps a state to a probability distribution over actions
    ...
 
 
class ValueNetwork(nn.Module):
    # maps a state to an estimate of that state's value
    ...
 
class AdvantageDataset(Dataset):
    # wraps the rollout data used to train the value network
    ...
 
class PolicyDataset(Dataset):
    # wraps the rollout data used to train the policy network
    ...
 
def main():
 
    env = gym.make('CartPole-v0')
 
    # Hyperparameters
    epochs = 1000
    env_samples = 100
    episode_length = 200
    gamma = 0.9
    value_epochs = 2
    policy_epochs = 5
    batch_size = 32
    policy_batch_size = 256
    epsilon = 0.2
 
    # construct your policy and value networks (and their optimizers) here
 
    for _ in range(epochs):
        # generate rollouts
        rollouts = []
        standing_len = 0   # total number of steps survived across all rollouts (used below)
        for _ in range(env_samples):
            # don't forget to reset the environment at the beginning of each episode!
            # rollout for a certain number of steps, recording whatever your PPO update
            # needs (e.g. state, action, the action's probability, reward), and
            # accumulate each episode's length into standing_len
            ...
 
        print('avg standing time:', standing_len / env_samples)
        calculate_returns(rollouts, gamma)
 
        # Approximate the value function
        value_dataset = AdvantageDataset(rollouts)
        value_loader = DataLoader(value_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
        for _ in range(value_epochs):
            # train the value network: regress its predictions toward the observed returns
            ...
 
        calculate_advantages(rollouts, value)
 
        # Learn a policy
        policy_dataset = PolicyDataset(rollouts)
        policy_loader = DataLoader(policy_dataset, batch_size=policy_batch_size, shuffle=True, pin_memory=True)
        for _ in range(policy_epochs):
            # train the policy network by maximizing the clipped PPO objective
            ...
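
The network classes in the skeleton are intentionally left for you to fill in. If you want a starting point, the sketch below shows one possible (deliberately small) architecture for CartPole-v0, which has a 4-dimensional observation and 2 discrete actions; the layer sizes and activations are assumptions, not requirements.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    # maps a state to a probability distribution over actions
    def __init__(self, state_size=4, action_size=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

class ValueNetwork(nn.Module):
    # maps a state to a scalar estimate of its value
    def __init__(self, state_size=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)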
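
The skeleton also calls two helpers, calculate_returns and calculate_advantages, that you must write yourself. The sketch below shows one plausible version, assuming each rollout is stored as a list of per-step dictionaries with 'state' and 'reward' keys; adapt it to whatever rollout representation you actually use.

import torch

def calculate_returns(rollouts, gamma):
    # walk each rollout backwards, attaching the discounted sum of future rewards
    for rollout in rollouts:
        running = 0.0
        for step in reversed(rollout):
            running = step['reward'] + gamma * running
            step['return'] = running

def calculate_advantages(rollouts, value_network):
    # advantage = observed discounted return - the value network's estimate for the state
    with torch.no_grad():
        for rollout in rollouts:
            for step in rollout:
                state = torch.as_tensor(step['state'], dtype=torch.float32)
                step['advantage'] = step['return'] - value_network(state).item()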