For this lab, you will turn in a Colab notebook that implements the proximal policy optimization (PPO) algorithm.
Your notebook will be graded on the following:
For this lab, you will implement the PPO algorithm and train it on a few simple worlds from the OpenAI Gym suite of test problems.
You may use any code you want from the internet to help you understand how to implement this, but all final code must be your own.
To successfully complete this lab, you must do the following:
Here are the OpenAI Gym worlds.
Here is a blog post introducing the idea.
Here is the paper with a technical description of the algorithm.
Here is a video describing it at a high level.
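For quick reference, the clipped surrogate objective from the PPO paper can be written as below. This is a summary of the standard formulation, not a replacement for reading the paper: here r_t is the probability ratio between the new and old policies, \hat{A}_t is an advantage estimate, and \epsilon is the clipping range (it corresponds to the `epsilon = 0.2` hyperparameter in the scaffold). The policy is trained to maximize this quantity, so a loss function would typically be its negative.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[
  \min\!\left(
    r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
  \right)
\right]
```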
import gym
import torch
...

class PolicyNetwork(nn.Module):
    ...

class ValueNetwork(nn.Module):
    ...

class AdvantageDataset(Dataset):
    ...

class PolicyDataset(Dataset):
    ...

def main():
    env = gym.make('CartPole-v0')

    # Hyperparameters
    epochs = 1000
    env_samples = 100
    episode_length = 200
    gamma = 0.9
    value_epochs = 2
    policy_epochs = 5
    batch_size = 32
    policy_batch_size = 256
    epsilon = 0.2

    for _ in range(epochs):
        # generate rollouts
        rollouts = []
        for _ in range(env_samples):
            # don't forget to reset the environment at the beginning of each episode!
            # rollout for a certain number of steps!

        print('avg standing time:', standing_len / env_samples)
        calculate_returns(rollouts, gamma)

        # Approximate the value function
        value_dataset = AdvantageDataset(rollouts)
        value_loader = DataLoader(value_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
        for _ in range(value_epochs):
            # train value network

        calculate_advantages(rollouts, value)

        # Learn a policy
        policy_dataset = PolicyDataset(rollouts)
        policy_loader = DataLoader(policy_dataset, batch_size=policy_batch_size, shuffle=True, pin_memory=True)
        for _ in range(policy_epochs):
            # train policy network
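The scaffold calls `calculate_returns(rollouts, gamma)` without showing what it does. One possible sketch is given below. The rollout layout is an assumption made purely for illustration (the lab does not prescribe one): each rollout is taken to be a list of per-step dicts with a `'reward'` entry, and the function writes a `'return'` entry into each step. Your own data structure may well differ.

```python
def calculate_returns(rollouts, gamma):
    """Attach a discounted return to every step of every rollout, in place.

    Assumed (hypothetical) layout: each rollout is a list of dicts, one per
    time step, each holding at least a 'reward' entry.
    """
    for rollout in rollouts:
        running = 0.0
        # Walk backwards so that G_t = r_t + gamma * G_{t+1}.
        for step in reversed(rollout):
            running = step['reward'] + gamma * running
            step['return'] = running


# Usage: one three-step episode with reward 1 at every step.
demo = [[{'reward': 1.0}, {'reward': 1.0}, {'reward': 1.0}]]
calculate_returns(demo, gamma=0.5)
print([step['return'] for step in demo[0]])  # [1.75, 1.5, 1.0]
```

Iterating in reverse keeps the computation linear in the episode length, since each return reuses the one computed for the following step.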