====Objective:====

  * To implement the Proximal Policy Optimization algorithm

----
====Deliverable:====

For this lab, you will turn in a Colab notebook that implements the proximal policy optimization (PPO) algorithm.

----
====Grading standards:====

Your notebook will be graded on the following:

  * 45% Proper design, creation, and debugging of actor and critic networks
  * 25% Proper implementation of the PPO loss function and objective on cart-pole ("CartPole-v0")
  * 20% Implementation and demonstrated learning of PPO on another domain of your choice
  * 10% Visualization of policy return as a function of training (see the plotting sketch below)
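
For the visualization requirement, recording the average return after each epoch and plotting it at the end is enough. Here is a minimal sketch with matplotlib; the ''avg_returns'' list is a name used purely for illustration (one entry per epoch, e.g. the mean un-discounted return over that epoch's rollouts):

<code python>
import matplotlib.pyplot as plt

def plot_returns(avg_returns):
    # avg_returns: one average policy return per training epoch (illustrative name)
    plt.plot(avg_returns)
    plt.xlabel('epoch')
    plt.ylabel('average return')
    plt.title('Policy return as a function of training')
    plt.show()
</code>
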
----
====Description:====

For this lab, you will implement the PPO algorithm and train it on a few simple worlds from the OpenAI Gym suite of test problems.

You may use any code you want from the internet to help you understand how to implement this, but **all final code must be your own**.

To successfully complete this lab, you must do the following:

  * Implement a policy network.  This should be a mapping from states to probabilities over actions.
  * Implement a value network.  This should be a mapping from states to the value of that state.
  * Implement a loss function for the value network.  This should compare the estimated value to the observed return.
  * Implement the PPO loss function.  (A sketch of the networks and both losses appears after this list.)

  * Train the value and policy networks.  This will at least involve:
    - Generating data by performing roll-outs (according to the current policy)
    - Calculating the actual discounted sum of rewards for each rollout
    - Using the value network to estimate the value of each state in each rollout
    - Using the actual and estimated values to calculate the advantage (see the second sketch below)

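One possible shape for the networks and the two losses, in PyTorch. This is a sketch, not a required design: the layer sizes, the ''nn.Sequential'' layout, and the helper names ''value_loss'' and ''ppo_loss'' are illustrative choices. The PPO loss is the clipped surrogate objective from the paper: with r the ratio of the new to the old action probability and A the advantage, maximize min(r*A, clip(r, 1-epsilon, 1+epsilon)*A), i.e., minimize its negation.

<code python>
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over (discrete) actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))

    def forward(self, state):
        return self.net(state)


class ValueNetwork(nn.Module):
    """Maps a state to a scalar estimate of that state's value."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state)


def value_loss(predicted_values, observed_returns):
    # compare the value estimate against the observed discounted return
    return nn.functional.mse_loss(predicted_values, observed_returns)


def ppo_loss(new_probs, old_probs, advantages, epsilon=0.2):
    # clipped surrogate objective: discourage the new policy from changing
    # the probability of an action by more than a factor of (1 +/- epsilon)
    ratio = new_probs / old_probs
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -torch.mean(torch.min(ratio * advantages, clipped * advantages))
</code>
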
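The discounted-return and advantage calculations might look like the following. The per-step tuple layout (state, action_prob, action, reward) is an assumption of this sketch, matching the rollout format used above; any record that lets you walk an episode backwards will work.

<code python>
import torch

def calculate_returns(rollouts, gamma):
    # walk each episode backwards, accumulating the discounted sum of rewards
    # (each rollout is assumed to be a list of (state, prob, action, reward) tuples)
    for rollout in rollouts:
        ret = 0.0
        for t in reversed(range(len(rollout))):
            state, prob, action, reward = rollout[t]
            ret = reward + gamma * ret
            rollout[t] = (state, prob, action, reward, ret)

def calculate_advantages(rollouts, value_network):
    # advantage = observed return minus the value network's estimate of the state
    with torch.no_grad():
        for rollout in rollouts:
            for t, (state, prob, action, reward, ret) in enumerate(rollout):
                value = value_network(torch.as_tensor(state, dtype=torch.float32)).item()
                rollout[t] = (state, prob, action, reward, ret, ret - value)
</code>
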
----
====Background and documentation:====

Here are [[https://gym.openai.com/|the OpenAI Gym worlds]].

Here is a [[https://blog.openai.com/openai-baselines-ppo/|blog post introducing the idea]].

Here is the [[https://arxiv.org/pdf/1707.06347.pdf|paper with a technical description of the algorithm]].

Here is a [[https://www.youtube.com/watch?v=5P7I-xPq8u8|video describing it at a high level]].

----
====Hints and helps:====

<code python>

import gym
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
...

class PolicyNetwork(nn.Module):
    ...


class ValueNetwork(nn.Module):
    ...

class AdvantageDataset(Dataset):
    ...

class PolicyDataset(Dataset):
    ...

def main():

    env = gym.make('CartPole-v0')

    # Hyperparameters
    epochs = 1000
    env_samples = 100
    episode_length = 200
    gamma = 0.9
    value_epochs = 2
    policy_epochs = 5
    batch_size = 32
    policy_batch_size = 256
    epsilon = 0.2

    # create your policy and value networks (and their optimizers) here
    policy_network = ...
    value_network = ...

    for _ in range(epochs):
        # generate rollouts
        rollouts = []
        standing_len = 0  # total steps survived across this epoch's episodes
        for _ in range(env_samples):
            # don't forget to reset the environment at the beginning of each episode!
            # rollout for a certain number of steps!
            ...

        print('avg standing time:', standing_len / env_samples)
        calculate_returns(rollouts, gamma)

        # Approximate the value function
        value_dataset = AdvantageDataset(rollouts)
        value_loader = DataLoader(value_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
        for _ in range(value_epochs):
            # train value network
            ...

        calculate_advantages(rollouts, value_network)

        # Learn a policy
        policy_dataset = PolicyDataset(rollouts)
        policy_loader = DataLoader(policy_dataset, batch_size=policy_batch_size, shuffle=True, pin_memory=True)
        for _ in range(policy_epochs):
            # train policy network
            ...

</code>
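
To fill in the rollout-generation step, something like the sketch below works with the classic Gym API (where ''env.step'' returns a four-tuple and ''env.reset'' returns just the observation, as was the interface when this lab was written; newer gym/gymnasium versions differ). The helper name ''generate_rollout'' is hypothetical, and the stored tuple layout matches the sketches above.

<code python>
import torch

def generate_rollout(env, policy_network, episode_length):
    """Roll out one episode under the current policy
    (classic gym API: env.step returns (obs, reward, done, info))."""
    state = env.reset()
    rollout = []
    for _ in range(episode_length):
        with torch.no_grad():
            probs = policy_network(torch.as_tensor(state, dtype=torch.float32))
        action = torch.distributions.Categorical(probs).sample().item()
        next_state, reward, done, _ = env.step(action)
        # store the chosen action's probability -- the PPO ratio needs it later
        rollout.append((state, probs[action].item(), action, reward))
        state = next_state
        if done:
            break
    return rollout
</code>

Inside the ''env_samples'' loop, you would then append each result to ''rollouts'' and add its length to ''standing_len'' (for CartPole, the episode length equals the total un-discounted reward).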
  