====Objective:====

  * To implement the Proximal Policy Optimization algorithm

----
====Deliverable:====

For this lab, you will turn in a Colab notebook that implements the Proximal Policy Optimization (PPO) algorithm.

----
====Grading standards:====

Your notebook will be graded on the following:

  * 45% Proper design, creation, and debugging of actor and critic networks
  * 25% Proper implementation of the PPO loss function and objective on cart-pole
  * 20% Implementation and demonstrated learning of PPO on another domain of your choice
  * 10% Visualization of policy return as a function of training (a minimal example follows this list)
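
For the visualization item, a simple plot of average policy return per epoch is enough. A minimal sketch, assuming you accumulate per-epoch average returns in a list (the name ''avg_returns'' is just an example, not part of any required API):

<code python>
import matplotlib.pyplot as plt

# avg_returns: a hypothetical list you fill with the average
# policy return after each training epoch
plt.plot(avg_returns)
plt.xlabel('epoch')
plt.ylabel('average policy return')
plt.show()
</code>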

----
====Description:====

For this lab, you will implement the PPO algorithm and train it on a few simple worlds from the OpenAI gym test suite of problems.

You may use any code you want from the internet to help you understand how to implement this, but **all final code must be your own**.


Here are [[https://gym.openai.com/|the OpenAI gym worlds]].
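
If you haven't used gym before, here is a bare-bones interaction loop with random actions, just to show the interface (this sketch assumes the classic 2018-era gym API, where ''env.step'' returns four values):

<code python>
import gym

env = gym.make('CartPole-v0')
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random action; your policy network goes here
    state, reward, done, info = env.step(action)
env.close()
</code>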

Here is a [[https://blog.openai.com/openai-baselines-ppo/|blog post introducing the idea]].

Here is the [[https://arxiv.org/pdf/1707.06347.pdf|paper with a technical description of the algorithm]].

Here is a [[https://www.youtube.com/watch?v=5P7I-xPq8u8|video describing it at a high level]].
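
The heart of the algorithm is the clipped surrogate objective from the paper. As a rough sketch (the variable names are ours; the log-probabilities and advantages come from your rollouts and policy network):

<code python>
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    # probability ratio pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # pessimistic (minimum) bound, negated because we minimize
    return -torch.min(ratio * advantages, clipped * advantages).mean()
</code>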

----
====Hints and helps:====

<code python>

import gym
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
...

class PolicyNetwork(nn.Module):
    ...

class ValueNetwork(nn.Module):
    ...

class AdvantageDataset(Dataset):
    ...

class PolicyDataset(Dataset):
    ...

def main():

    env = gym.make('CartPole-v0')

    # Hyperparameters
    epochs = 1000
    env_samples = 100
    episode_length = 200
    gamma = 0.9
    value_epochs = 2
    policy_epochs = 5
    batch_size = 32
    policy_batch_size = 256
    epsilon = 0.2

    for _ in range(epochs):
        # Run the current policy in the environment to gather experience
        rollouts = []
        standing_len = 0
        for _ in range(env_samples):
            # generate rollouts
            ...

        print('avg standing time:', standing_len / env_samples)
        calculate_returns(rollouts, gamma)

        # Approximate the value function
        value_dataset = AdvantageDataset(rollouts)
        value_loader = DataLoader(value_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
        for _ in range(value_epochs):
            # train value network
            ...

        # 'value' here is your ValueNetwork instance
        calculate_advantages(rollouts, value)

        # Learn a policy
        policy_dataset = PolicyDataset(rollouts)
        policy_loader = DataLoader(policy_dataset, batch_size=policy_batch_size, shuffle=True, pin_memory=True)
        for _ in range(policy_epochs):
            # train policy network
            ...

</code>
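
The skeleton calls two helpers you will need to write yourself. For ''calculate_returns'', the standard trick is a backwards pass over each trajectory. A sketch, assuming each rollout is a list of per-step dictionaries (that data layout is our assumption, not a requirement):

<code python>
def calculate_returns(rollouts, gamma):
    # Walk each trajectory backwards, accumulating the
    # discounted sum of rewards: G_t = r_t + gamma * G_{t+1}
    for rollout in rollouts:
        g = 0.0
        for step in reversed(rollout):
            g = step['reward'] + gamma * g
            step['return'] = g
</code>

''calculate_advantages'' can then subtract the value network's prediction from the stored return at each state to get a simple baseline-subtracted advantage estimate.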