====Objective:====
  * To implement the Proximal Policy Optimization algorithm, and learn about the use of deep learning in the context of deep RL.
----
====Deliverable:====
For this lab, you will turn in a Colab notebook that implements the proximal policy optimization (PPO) algorithm. You must provide tangible proof that your algorithm is working.
----
  * 45% Proper design, creation, and debugging of actor and critic networks
  * 25% Proper implementation of the PPO loss function and objective on cart-pole ("CartPole-v0")
  * 20% Implementation and demonstrated learning of PPO on another domain of your choice (**except** VizDoom)
  * 10% Visualization of policy return as a function of training
====Description:====
For this lab, you will implement the PPO algorithm and train it on a few simple worlds from the OpenAI gym suite of test problems.

You may use any code you want from the internet to help you understand how to implement this, but **all final code must be your own**.

To successfully complete this lab, you must do the following:

  * Implement a policy network. This should be a mapping from states to probabilities over actions.
  * Implement a value network. This should be a mapping from states to the value of that state.
  * Implement a loss function for the value network. This should compare the estimated value to the observed return.
  * Implement the PPO loss function (a sketch of both loss functions appears just after this list).
  * Train the value and policy networks. This will at least involve:
    - Generating data by performing roll-outs according to the current policy
    - Calculating the actual discounted sum of rewards for each rollout
    - Using the value network to estimate the value of each state in each rollout
    - Using the actual and estimated values to calculate the advantage
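
To make the two loss items above concrete, here is a minimal sketch of what they could look like, assuming the policy network outputs the probabilities of the actions that were actually taken and that advantages have already been computed. The function names, tensor shapes, and the small stabilizing constant are illustrative assumptions, not a required interface.

<code python>
import torch
import torch.nn.functional as F


def value_loss(predicted_values, observed_returns):
    # Regress the value network's estimates toward the observed discounted returns.
    return F.mse_loss(predicted_values, observed_returns)


def ppo_clip_loss(new_probs, old_probs, advantages, epsilon=0.2):
    # Clipped surrogate objective: the probability ratio r = pi_new(a|s) / pi_old(a|s)
    # is clipped to [1 - epsilon, 1 + epsilon]; we take the pessimistic (min) term
    # and negate it, since optimizers minimize.
    ratio = new_probs / (old_probs.detach() + 1e-8)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
</code>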
| + | |||
| + | Part of this lab involves a demonstration that your implementation is working properly. Most likely, this will be a measure of return / reward / value over time; as the policy improves, we should see it increase. | ||
| + | |||
| + | Some domains have other, more natural measures of performance. For example, for the cart-pole problem, we can measure how many iterations it takes before the pole falls over and we reach a terminal state; as the policy improves, this number improves. | ||
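
One straightforward way to produce that evidence is to record the average return (or, for cart-pole, the average standing time) once per epoch and plot it when training finishes. The helper below is only an illustrative sketch; the name ''plot_learning_curve'' is not part of any required interface.

<code python>
import matplotlib.pyplot as plt


def plot_learning_curve(avg_returns_per_epoch):
    # An upward trend in average return over training epochs is the kind of
    # tangible proof of learning this lab asks for.
    plt.plot(avg_returns_per_epoch)
    plt.xlabel('epoch')
    plt.ylabel('average return per rollout')
    plt.show()
</code>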
| + | |||
| + | You are also welcome to be creative in your illustrations - you could use a video of the agent doing something reasonable, for example. | ||
| + | |||
| + | ---- | ||
====Background and documentation:====

Here are [[https://gym.openai.com/|the OpenAI gym worlds]].

Here is a [[https://blog.openai.com/openai-baselines-ppo/|blog post introducing the idea]].

Here is the [[https://arxiv.org/pdf/1707.06347.pdf|paper with a technical description of the algorithm]].

Here is a [[https://www.youtube.com/watch?v=5P7I-xPq8u8|video describing it at a high level]].

----
====Hints and helps:====

**Update**: Here is [[https://github.com/joshgreaves/reinforcement-learning|our lab's implementation of PPO]]. NOTE: because this code comes with a complete implementation of running on VizDoom, **you may not use that as your additional test domain.**

Here are some [[https://stackoverflow.com/questions/50667565/how-to-install-vizdoom-using-google-colab|instructions for installing VizDoom on Colab]].

----
| + | |||
| + | Here is some code from our reference implementation. Hopefully it will serve as a good outline of what you need to do. | ||
| + | |||
<code python>
import gym
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from functools import reduce
| + | |||
| + | class PolicyNetwork(nn.Module): | ||
| + | ... | ||
| + | |||
| + | |||
| + | class ValueNetwork(nn.Module): | ||
| + | .... | ||
| + | |||

class AdvantageDataset(Dataset):
    # Flattens a list of rollouts into individual (state, return) pairs
    # for training the value network.
    def __init__(self, experience):
        super(AdvantageDataset, self).__init__()
        self._exp = experience
        self._num_runs = len(experience)
        self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

    def __getitem__(self, index):
        # Walk forward through the rollouts until we find the one containing
        # the requested index, then pick out the step within it.
        idx = 0
        seen_data = 0
        current_exp = self._exp[0]
        while seen_data + len(current_exp) - 1 < index:
            seen_data += len(current_exp)
            idx += 1
            current_exp = self._exp[idx]
        chosen_exp = current_exp[index - seen_data]
        return chosen_exp[0], chosen_exp[4]

    def __len__(self):
        return self._length


class PolicyDataset(Dataset):
    # Flattens a list of rollouts into individual experience tuples
    # for the PPO policy update.
    def __init__(self, experience):
        super(PolicyDataset, self).__init__()
        self._exp = experience
        self._num_runs = len(experience)
        self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

    def __getitem__(self, index):
        # Same indexing scheme as AdvantageDataset, but return the whole step.
        idx = 0
        seen_data = 0
        current_exp = self._exp[0]
        while seen_data + len(current_exp) - 1 < index:
            seen_data += len(current_exp)
            idx += 1
            current_exp = self._exp[idx]
        chosen_exp = current_exp[index - seen_data]
        return chosen_exp

    def __len__(self):
        return self._length


def main():

    env = gym.make('CartPole-v0')
    policy = PolicyNetwork(4, 2)   # CartPole-v0: 4-dimensional state, 2 discrete actions
    value = ValueNetwork(4)

    policy_optim = optim.Adam(policy.parameters(), lr=1e-2, weight_decay=0.01)
    value_optim = optim.Adam(value.parameters(), lr=1e-3, weight_decay=1)

    # ... more stuff here...

    # Hyperparameters
    epochs = 1000            # outer training iterations
    env_samples = 100        # rollouts generated per epoch
    episode_length = 200     # maximum steps per rollout
    gamma = 0.9              # discount factor
    value_epochs = 2         # passes over the data for the value network
    policy_epochs = 5        # passes over the data for the policy network
    batch_size = 32
    policy_batch_size = 256
    epsilon = 0.2            # PPO clipping parameter

    for _ in range(epochs):
        # generate rollouts
        rollouts = []
        standing_len = 0   # total episode length across this epoch's rollouts
        for _ in range(env_samples):
            # don't forget to reset the environment at the beginning of each episode!
            # rollout for a certain number of steps!
            pass  # <-- your rollout-generation code goes here

        print('avg standing time:', standing_len / env_samples)
        calculate_returns(rollouts, gamma)

        # Approximate the value function
        value_dataset = AdvantageDataset(rollouts)
        value_loader = DataLoader(value_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
        for _ in range(value_epochs):
            # train value network
            pass  # <-- your value-network update goes here

        calculate_advantages(rollouts, value)

        # Learn a policy
        policy_dataset = PolicyDataset(rollouts)
        policy_loader = DataLoader(policy_dataset, batch_size=policy_batch_size, shuffle=True, pin_memory=True)
        for _ in range(policy_epochs):
            # train policy network
            pass  # <-- your policy-network (PPO) update goes here


if __name__ == '__main__':
    main()
</code>
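
The outline above calls ''calculate_returns'' and ''calculate_advantages'' without defining them, and leaves the rollout-generation step empty. The sketch below shows one way those pieces might fit together, under the assumption that each step of a rollout is stored as a list ''[state, action_probs, action, reward]'' (so the discounted return lands at index 4, where ''AdvantageDataset'' reads it) and that the environment follows the classic gym API ("CartPole-v0"-era). The helper ''generate_rollout'' and the tuple layout are assumptions for illustration, not the reference implementation.

<code python>
import torch
from torch.distributions import Categorical


def generate_rollout(env, policy, episode_length):
    # Roll the current policy forward for at most episode_length steps.
    # Each step is stored as [state, action_probs, action, reward]; the
    # discounted return and the advantage are appended later.
    rollout = []
    state = env.reset()
    for _ in range(episode_length):
        with torch.no_grad():
            probs = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)).squeeze(0)
        action = Categorical(probs).sample().item()
        next_state, reward, done, _ = env.step(action)
        rollout.append([state, probs, action, reward])
        state = next_state
        if done:
            break
    return rollout


def calculate_returns(rollouts, gamma):
    # Append the discounted return-to-go to every step, so that index 4 of
    # each step holds the target the value network trains against.
    for rollout in rollouts:
        running = 0.0
        for step in reversed(rollout):
            running = step[3] + gamma * running
            step.append(running)


def calculate_advantages(rollouts, value_net):
    # Append advantage = observed return - predicted value to every step.
    with torch.no_grad():
        for rollout in rollouts:
            for step in rollout:
                state = torch.as_tensor(step[0], dtype=torch.float32).unsqueeze(0)
                step.append(step[4] - value_net(state).item())
</code>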