cs501r_f2018:lab9

- To implement the Proximal Policy Optimization algorithm, and learn about the use of deep learning in the context of deep RL.

For this lab, you will turn in a Colab notebook that implements the proximal policy optimization (PPO) algorithm. You must provide tangible proof that your algorithm is working.

Your notebook will be graded on the following:

- 45% Proper design, creation and debugging of actor and critic networks
- 25% Proper implementation of the PPO loss function and objective on cart-pole (“CartPole-v0”)
- 20% Implementation and demonstrated learning of PPO on another domain of your choice (**except** VizDoom)
- 10% Visualization of policy return as a function of training

For this lab, you will implement the PPO algorithm, and train it on a few simple worlds from the OpenAI gym test suite of problems.

You may use any code you want from the internet to help you understand how to implement this, but **all final code must be your own**.

To successfully complete this lab, you must do the following:

- Implement a policy network. This should be a mapping from states to probabilities over actions.
- Implement a value network. This should be a mapping from states to the value of that state.
- Implement a loss function for the value network. This should compare the estimated value to the observed return.
- Implement the PPO loss function.
- Train the value and policy networks. At a minimum, this involves:
  - Generating data by running roll-outs according to the current policy
  - Calculating the actual discounted sum of rewards for each step of each roll-out
  - Using the value network to estimate the value of each state visited in each roll-out
  - Using the actual and estimated values to calculate the advantage
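The return and advantage steps above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the function names `discounted_returns` and `advantages` are hypothetical, rewards and value estimates are assumed to be simple lists for a single roll-out, and the advantage is assumed to be the observed return minus the estimated value. The example uses `gamma=0.5` only because it keeps the arithmetic exact.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one roll-out."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(returns, value_estimates):
    """Advantage (assumed form): observed return minus estimated value."""
    return [g - v for g, v in zip(returns, value_estimates)]


rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)                               # [1.75, 1.5, 1.0]
print(advantages(returns, [1.0, 1.0, 1.0]))  # [0.75, 0.5, 0.0]
```

In a notebook, the value estimates would come from the value network rather than a hard-coded list, and the same arithmetic would typically be done on tensors.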

Part of this lab involves a demonstration that your implementation is working properly. Most likely, this will be a measure of return / reward / value over time; as the policy improves, we should see it increase.

Some domains have other, more natural measures of performance. For example, for the cart-pole problem, we can measure how many time steps pass before the pole falls over and we reach a terminal state; as the policy improves, this number increases.
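Per-episode measurements such as standing time tend to be noisy, so a smoothed curve is often easier to read. The following is one possible sketch of a trailing moving average; the helper name `moving_average` and the sample episode lengths are made up for illustration.

```python
def moving_average(values, window):
    """Average each value with up to `window - 1` values before it."""
    smoothed = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            # Drop the value that just fell out of the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


episode_lengths = [10, 20, 30, 40]  # hypothetical standing times per episode
print(moving_average(episode_lengths, window=2))  # [10.0, 15.0, 25.0, 35.0]
```

The smoothed list can then be passed to whatever plotting tool the notebook uses, with training iteration on the x-axis.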

You are also welcome to be creative in your illustrations - you could use a video of the agent doing something reasonable, for example.

Here are the OpenAI gym worlds.

Here is a blog post introducing the idea.

Here is the paper with a technical description of the algorithm.

Here is a video describing it at a high level.

**Update**: Here is our lab's implementation of PPO. NOTE: because this code comes with a complete implementation that runs on VizDoom, **you may not use that as your additional test domain.**

Here are some instructions for installing VizDoom on Colab.

Here is some code from our reference implementation. Hopefully it will serve as a good outline of what you need to do.

    import gym
    import torch
    ...

    class PolicyNetwork(nn.Module):
        ...

    class ValueNetwork(nn.Module):
        ...

    class AdvantageDataset(Dataset):
        def __init__(self, experience):
            super(AdvantageDataset, self).__init__()
            self._exp = experience
            self._num_runs = len(experience)
            self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

        def __getitem__(self, index):
            idx = 0
            seen_data = 0
            current_exp = self._exp[0]
            while seen_data + len(current_exp) - 1 < index:
                seen_data += len(current_exp)
                idx += 1
                current_exp = self._exp[idx]
            chosen_exp = current_exp[index - seen_data]
            return chosen_exp[0], chosen_exp[4]

        def __len__(self):
            return self._length

    class PolicyDataset(Dataset):
        def __init__(self, experience):
            super(PolicyDataset, self).__init__()
            self._exp = experience
            self._num_runs = len(experience)
            self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

        def __getitem__(self, index):
            idx = 0
            seen_data = 0
            current_exp = self._exp[0]
            while seen_data + len(current_exp) - 1 < index:
                seen_data += len(current_exp)
                idx += 1
                current_exp = self._exp[idx]
            chosen_exp = current_exp[index - seen_data]
            return chosen_exp

        def __len__(self):
            return self._length

    def main():
        env = gym.make('CartPole-v0')
        policy = PolicyNetwork(4, 2)
        value = ValueNetwork(4)

        policy_optim = optim.Adam(policy.parameters(), lr=1e-2, weight_decay=0.01)
        value_optim = optim.Adam(value.parameters(), lr=1e-3, weight_decay=1)

        # ... more stuff here...

        # Hyperparameters
        epochs = 1000
        env_samples = 100
        episode_length = 200
        gamma = 0.9
        value_epochs = 2
        policy_epochs = 5
        batch_size = 32
        policy_batch_size = 256
        epsilon = 0.2

        for _ in range(epochs):
            # generate rollouts
            rollouts = []
            for _ in range(env_samples):
                # don't forget to reset the environment at the beginning of each episode!
                # rollout for a certain number of steps!

            print('avg standing time:', standing_len / env_samples)
            calculate_returns(rollouts, gamma)

            # Approximate the value function
            value_dataset = AdvantageDataset(rollouts)
            value_loader = DataLoader(value_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
            for _ in range(value_epochs):
                # train value network

            calculate_advantages(rollouts, value)

            # Learn a policy
            policy_dataset = PolicyDataset(rollouts)
            policy_loader = DataLoader(policy_dataset, batch_size=policy_batch_size, shuffle=True, pin_memory=True)
            for _ in range(policy_epochs):
                # train policy network
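The outline sets `epsilon = 0.2` but leaves the policy update itself unwritten. As a rough illustration of where that value is used, here is a plain-Python sketch of the clipped surrogate objective described in the PPO paper. It is not the reference implementation: the function name `clipped_surrogate_loss` is hypothetical, the inputs are assumed to be lists of log-probabilities and advantages for a batch of sampled actions, and a real notebook would do the same arithmetic on tensors so that gradients can flow through the new log-probabilities.

```python
import math


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    # Negate: optimizers minimize, but the surrogate objective is maximized.
    return -total / len(advantages)


# First sample: ratio 1.5 with positive advantage is clipped to 1.2.
# Second sample: ratio 0.5 with negative advantage is clipped to 0.8.
loss = clipped_surrogate_loss(
    new_log_probs=[math.log(1.5), math.log(0.5)],
    old_log_probs=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
print(loss)  # approximately -0.2
```

The clipping removes any incentive to move the ratio further than `epsilon` away from 1 in the direction that the advantage favors, which is what keeps each policy update close to the policy that generated the roll-outs.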

cs501r_f2018/lab9.txt · Last modified: 2018/11/19 14:21 by wingated