====Objective:====
  * To implement the Proximal Policy Optimization algorithm, and to learn about the use of deep learning in the context of deep RL.
----
====Deliverable:====
For this lab, you will turn in a colab notebook that implements the proximal policy optimization (PPO) algorithm. You must provide tangible proof that your algorithm is working.
----
  * 45% Proper design, creation, and debugging of actor and critic networks (a sketch of possible networks follows this list)
  * 25% Proper implementation of the PPO loss function and objective on cart-pole ("CartPole-v0")
  * 20% Implementation and demonstrated learning of PPO on another domain of your choice (**except** VizDoom)
  * 10% Visualization of policy return as a function of training
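The reference code at the bottom of this page instantiates PolicyNetwork(4, 2) and ValueNetwork(4) without showing their definitions. As a starting point only, here is a minimal sketch of what actor and critic networks for CartPole might look like; the hidden layer size and the softmax output layer are our assumptions, not a required architecture.

<code python>
import torch.nn as nn

class PolicyNetwork(nn.Module):
    # Actor: maps a state vector to a probability distribution over discrete actions
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)


class ValueNetwork(nn.Module):
    # Critic: maps a state vector to a scalar estimate of its value
    def __init__(self, state_dim):
        super(ValueNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)
</code>

For CartPole-v0 the observation has 4 dimensions and there are 2 discrete actions, hence the (4, 2) and (4) constructor arguments; for your second domain, change these to match that environment's observation and action spaces.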
  * Implement a loss function for the value network. This should compare the estimated value to the observed return.
  * Implement the PPO loss function (see the loss-function sketch after this list).
  * Train the value and policy networks. This will at least involve:
    - Generating data by generating roll-outs (according to the current policy)
    - Using actual and estimated value to calculate the advantage
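To make the two loss items above concrete, here is a hedged sketch of one way to write the value loss and the PPO clipped surrogate loss. The batching scheme, the tensor names (states, actions, old_action_probs, advantages), and the clipping epsilon of 0.2 are our assumptions; the reference implementation may organize this differently.

<code python>
import torch
import torch.nn.functional as F

def value_loss(value_net, states, observed_returns):
    # critic loss: compare the estimated value to the observed return
    estimated_values = value_net(states).squeeze(-1)
    return F.mse_loss(estimated_values, observed_returns)

def ppo_loss(policy_net, states, actions, old_action_probs, advantages, epsilon=0.2):
    # probability of the actions actually taken, under the *current* policy
    new_probs = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # importance ratio between the current policy and the (fixed) rollout-time policy
    ratio = new_probs / old_action_probs
    # clipped surrogate objective: take the pessimistic minimum of the clipped and
    # unclipped terms, negated because optimizers minimize
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
</code>

Here the advantage is computed from the actual and estimated values, as in the last sub-step above.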
Part of this lab involves a demonstration that your implementation is working properly. Most likely, this will be a measure of return / reward / value over time; as the policy improves, we should see it increase.

Some domains have other, more natural measures of performance. For example, for the cart-pole problem, we can measure how many iterations it takes before the pole falls over and we reach a terminal state; as the policy improves, this number improves.

You are also welcome to be creative in your illustrations - you could use a video of the agent doing something reasonable, for example.
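As one simple way to produce such a plot, the sketch below assumes you append the average return (or, for cart-pole, the average standing time) to a Python list named avg_returns once per epoch during training; both the list name and the per-epoch bookkeeping are our assumptions.

<code python>
import matplotlib.pyplot as plt

# avg_returns is assumed to be built during training, e.g.
#   avg_returns.append(standing_len / env_samples)
# once per epoch
plt.plot(avg_returns)
plt.xlabel('epoch')
plt.ylabel('average return / standing time')
plt.title('PPO on CartPole-v0')
plt.show()
</code>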
----
====Hints and helps:====
**Update**: Here is [[https://github.com/joshgreaves/reinforcement-learning|our lab's implementation of PPO]]. NOTE: because this code comes with a complete implementation of running on VizDoom, **you may not use that as your additional test domain.**

Here are some [[https://stackoverflow.com/questions/50667565/how-to-install-vizdoom-using-google-colab|instructions for installing vizdoom on colab]].

----

Here is some code from our reference implementation. Hopefully it will serve as a good outline of what you need to do.
<code python>
# imports assumed by this snippet (the original elides them)
import gym
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from functools import reduce

....
class AdvantageDataset(Dataset):
    # Flattens a list of rollouts into individual experience tuples,
    # returning (state, advantage) pairs (likely used for the critic update).
    def __init__(self, experience):
        super(AdvantageDataset, self).__init__()
        self._exp = experience
        self._num_runs = len(experience)
        self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

    def __getitem__(self, index):
        # walk through the rollouts until we reach the one containing this index
        idx = 0
        seen_data = 0
        current_exp = self._exp[0]
        while seen_data + len(current_exp) - 1 < index:
            seen_data += len(current_exp)
            idx += 1
            current_exp = self._exp[idx]
        chosen_exp = current_exp[index - seen_data]
        # position 0 of each experience tuple is the state; position 4 is
        # (presumably) the advantage / return target
        return chosen_exp[0], chosen_exp[4]

    def __len__(self):
        return self._length


class PolicyDataset(Dataset):
    # Same flattening scheme, but returns the full experience tuple
    # (likely used for the policy / actor update).
    def __init__(self, experience):
        super(PolicyDataset, self).__init__()
        self._exp = experience
        self._num_runs = len(experience)
        self._length = reduce(lambda acc, x: acc + len(x), experience, 0)

    def __getitem__(self, index):
        idx = 0
        seen_data = 0
        current_exp = self._exp[0]
        while seen_data + len(current_exp) - 1 < index:
            seen_data += len(current_exp)
            idx += 1
            current_exp = self._exp[idx]
        chosen_exp = current_exp[index - seen_data]
        return chosen_exp

    def __len__(self):
        return self._length

def main():
    env = gym.make('CartPole-v0')
    policy = PolicyNetwork(4, 2)   # CartPole: 4 observation dimensions, 2 discrete actions
    value = ValueNetwork(4)

    policy_optim = optim.Adam(policy.parameters(), lr=1e-2, weight_decay=0.01)
    value_optim = optim.Adam(value.parameters(), lr=1e-3, weight_decay=1)

    # ... more stuff here...
    # Hyperparameters
    ....

    for _ in range(epochs):
        # generate rollouts
        rollouts = []
        standing_len = 0   # accumulated so we can report the average standing time below
        for _ in range(env_samples):
            # don't forget to reset the environment at the beginning of each episode!
            # rollout for a certain number of steps!
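            # --- a minimal rollout sketch (our assumption, not the reference code) ---
            # It assumes the policy network outputs action probabilities, and that each
            # experience tuple is later extended so that index 0 holds the state and
            # index 4 holds the advantage, to match the datasets above.
            current_rollout = []
            state = env.reset()
            done = False
            while not done:
                with torch.no_grad():
                    probs = policy(torch.from_numpy(state).float().unsqueeze(0)).squeeze(0)
                action = torch.multinomial(probs, 1).item()
                next_state, reward, done, _ = env.step(action)
                current_rollout.append((state, probs[action].item(), action, reward))
                standing_len += 1
                state = next_state
            # compute returns / advantages for this episode, extend each tuple with them,
            # and then store the finished episode: rollouts.append(current_rollout)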

        print('avg standing time:', standing_len / env_samples)