How do you incorporate exploration or curiosity in PPO?
Proximal policy optimization (PPO) is a popular reinforcement learning (RL) algorithm that can learn complex policies from high-dimensional observations and actions. However, one of the challenges of PPO is balancing exploration and exploitation, that is, discovering new and potentially rewarding states without sacrificing the performance of the current policy. In this article, you will learn how to incorporate exploration or curiosity in PPO using different methods and techniques.
One simple way to encourage exploration in PPO is to add an entropy bonus to the objective function. Entropy measures the randomness or uncertainty of a probability distribution, and in this case, it reflects how diverse the policy's action choices are. By maximizing the entropy of the policy, you prevent it from becoming too deterministic or greedy, and encourage it to try different actions. The entropy bonus is weighted by a coefficient, a hyperparameter you can tune to balance exploration and exploitation.
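To make this concrete, here is a minimal PyTorch sketch of a clipped PPO surrogate loss with an entropy bonus folded in; the function and argument names (and the default entropy_coef) are illustrative placeholders, not a specific library's API:

```python
import torch

def ppo_loss_with_entropy(new_log_probs, old_log_probs, advantages,
                          dist, clip_eps=0.2, entropy_coef=0.01):
    """Clipped PPO surrogate loss plus an entropy bonus (sketch)."""
    # Probability ratio between the current and old policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # `dist` is the policy's torch.distributions object, e.g.
    # Categorical(logits=...) for discrete actions. Subtracting its mean
    # entropy rewards more stochastic (exploratory) policies.
    entropy_bonus = dist.entropy().mean()
    return policy_loss - entropy_coef * entropy_bonus
```

Raising entropy_coef pushes the policy toward more random behavior; lowering it lets the policy commit to what it already knows.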
The use of entropy bonuses with proximal policy optimization (PPO) has both advantages and disadvantages. Encouraging exploration through the entropy bonus can help the agent learn a better policy faster, avoid local optima, and improve performance. However, it can also lead to suboptimal policies if the entropy bonus is too high, and it can slow down training by increasing the complexity of the objective function. It's important to consider the domain in which the entropy bonus is applied: simple, fully observed, closed domains may not require the robustness that complex, partially observed, open domains do. Ultimately, the entropy bonus needs to be carefully tuned to achieve the right balance between exploration and exploitation.
Another way to incorporate exploration or curiosity in PPO is to use intrinsic motivation, which is a reward signal that depends on the agent's own internal state and learning progress, rather than the external environment. Intrinsic motivation can capture the agent's curiosity or interest in novel or informative states, and drive it to explore them. There are different ways to define and compute intrinsic motivation, such as prediction error, information gain, empowerment, or novelty.
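As a toy illustration of the novelty flavor of intrinsic motivation, the sketch below pays a bonus that shrinks as a state is revisited; the class, its names, and the crude discretization are assumptions made for the example, and learned methods such as RND (next) replace the hand-written counting with a trained model:

```python
from collections import defaultdict

import numpy as np

class CountBasedBonus:
    """Toy novelty bonus: intrinsic reward decays with visit counts."""
    def __init__(self, scale=0.1):
        self.counts = defaultdict(int)
        self.scale = scale

    def _discretize(self, obs):
        # Crude binning so continuous observations become hashable keys;
        # a real system would use a hash, density model, or learned features.
        return tuple(np.round(obs, 1))

    def intrinsic_reward(self, obs):
        key = self._discretize(obs)
        self.counts[key] += 1
        return self.scale / np.sqrt(self.counts[key])

# During rollout collection the shaped reward might be:
# total_reward = extrinsic_reward + bonus.intrinsic_reward(obs)
```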
One specific example of intrinsic motivation is random network distillation (RND), which was proposed by Burda et al. (2018) and applied to PPO. RND consists of two neural networks: a fixed random network and a trainable predictor network. The random network maps the state observations to a random feature vector, and the predictor network tries to match this vector. The prediction error is then used as the intrinsic reward, which is high for novel states and low for familiar states. RND can help PPO explore large and sparse reward environments.
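Here is a minimal PyTorch sketch of the RND idea; the layer sizes and names are arbitrary choices for illustration, not the architecture from the paper:

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random network distillation: prediction error as curiosity (sketch)."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad = False
        # Trainable predictor that tries to match the target's features.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))

    def prediction_error(self, obs):
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # Large on rarely seen states, small on familiar ones.
        return (pred_feat - target_feat).pow(2).mean(dim=-1)
```

The same squared error serves two roles: per state (detached) it is the intrinsic reward added to the agent's return, and averaged over a batch it is the loss used to train the predictor on the states the agent actually visits.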
RND also introduces separate value heads for the intrinsic and extrinsic reward streams. This idea can be extended to any actor-critic style training algorithm that introduces intrinsic rewards, and it is especially useful when the extrinsic reward is episodic but the intrinsic reward is modeled as non-episodic.
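A sketch of what such a two-head critic could look like; the class, names, and mixing coefficients (ext_coef, int_coef) are illustrative assumptions:

```python
import torch.nn as nn

class TwoHeadCritic(nn.Module):
    """Shared body with separate value heads for the two reward streams."""
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.v_ext = nn.Linear(hidden, 1)  # fit to episodic extrinsic returns
        self.v_int = nn.Linear(hidden, 1)  # fit to non-episodic intrinsic returns

    def forward(self, obs):
        h = self.body(obs)
        return self.v_ext(h), self.v_int(h)

# Advantages computed from each head are then mixed with separate weights,
# e.g. advantage = ext_coef * adv_ext + int_coef * adv_int,
# before being plugged into the usual PPO policy loss.
```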
Another technique to incorporate exploration or curiosity in PPO is to use parameter space noise, which was proposed by Plappert et al. (2017) and applied to PPO. Parameter space noise adds noise to the policy network's parameters rather than to the actions themselves. This produces more consistent, temporally correlated exploration and avoids disrupting the action distribution too much. The noise scale can be adapted based on how far the perturbed policy's behavior drifts from the unperturbed policy, and the technique can improve the sample efficiency and robustness of PPO.
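A minimal sketch of this idea for a PyTorch policy; the standard deviation and the adaptation rule are simplified relative to the paper, and the function name is a placeholder:

```python
import copy

import torch

def perturb_policy(policy, stddev=0.05):
    """Return a copy of the policy with Gaussian noise added to its weights."""
    noisy = copy.deepcopy(policy)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * stddev)
    return noisy

# The perturbed copy collects rollouts; the original parameters are the
# ones actually updated by PPO. The noise scale is typically adapted:
# increase `stddev` when the perturbed policy barely deviates from the
# original, and decrease it when it drifts too far.
```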
The entry above speaks only to the upsides of adding parameter space noise, so I'll describe the downsides of using it with PPO. It can increase training time, lead to suboptimal policies if the noise isn't controlled, require careful tuning of hyperparameters, and be sensitive to the specific environment, generalizing poorly to others. Again, careful tuning is essential to achieve a balance between exploration and exploitation, but this can be time-consuming and require significant experimentation. So while parameter space noise can be a useful technique, weigh these downsides when pairing it with PPO, and consider other approaches to exploration or curiosity as well.
A final technique to incorporate exploration or curiosity in PPO is to use action space noise, which is a more traditional way of adding noise to the actions taken by the agent. Action space noise can be either additive or multiplicative, and can be sampled from different distributions, such as Gaussian, uniform, or Ornstein-Uhlenbeck. Action space noise can help PPO escape from local optima and explore more diverse actions, but it can also introduce more variance and instability.
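For continuous control, an Ornstein-Uhlenbeck sketch might look like the following (simple additive Gaussian noise would just be np.random.normal scaled by a standard deviation); the class and its parameter values are illustrative:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated action noise (sketch)."""
    def __init__(self, action_dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.state = np.zeros(action_dim)

    def reset(self):
        self.state[:] = 0.0

    def sample(self):
        # Mean-reverting random walk: drifts back toward zero over time.
        self.state += (-self.theta * self.state * self.dt
                       + self.sigma * np.sqrt(self.dt)
                       * np.random.randn(*self.state.shape))
        return self.state

# Illustrative use during a rollout:
# noise = OrnsteinUhlenbeckNoise(action_dim=env.action_space.shape[0])
# action = np.clip(policy_action + noise.sample(),
#                  env.action_space.low, env.action_space.high)
```

Whichever source of noise you choose, the common thread is the same: inject enough randomness to keep discovering better behavior, without drowning out the carefully clipped updates that keep PPO stable.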