How do you incorporate exploration or curiosity in PPO?
Proximal policy optimization (PPO) is a popular reinforcement learning (RL) algorithm that can learn complex policies from high-dimensional observations and actions. However, one of the challenges of PPO is balancing exploration and exploitation, that is, discovering new and potentially rewarding states without sacrificing the performance of the current policy. In this article, you will learn how to incorporate exploration or curiosity in PPO using different methods and techniques.
One simple way to encourage exploration in PPO is to add an entropy bonus to the objective function. Entropy measures the randomness or uncertainty of a probability distribution, and in this case, it reflects how diverse the policy's action choices are. By maximizing the entropy of the policy, you prevent it from becoming too deterministic or greedy, and encourage it to try different actions. The entropy bonus is weighted by a coefficient, a hyperparameter you can tune to balance exploration and exploitation.
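To make this concrete, here is a minimal PyTorch sketch of a clipped PPO surrogate loss with an entropy bonus folded in; the function and argument names (and the default entropy_coef) are illustrative placeholders, not a specific library's API:

```python
import torch

def ppo_loss_with_entropy(new_log_probs, old_log_probs, advantages,
                          dist, clip_eps=0.2, entropy_coef=0.01):
    """Clipped PPO surrogate loss plus an entropy bonus (sketch)."""
    # Probability ratio between the current and old policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # `dist` is the policy's torch.distributions object, e.g.
    # Categorical(logits=...) for discrete actions. Subtracting its mean
    # entropy rewards more stochastic (exploratory) policies.
    entropy_bonus = dist.entropy().mean()
    return policy_loss - entropy_coef * entropy_bonus
```

Raising entropy_coef pushes the policy toward more random behavior; lowering it lets the policy commit to what it already knows.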
The use of entropy bonuses with proximal policy optimization (PPO) has both advantages and disadvantages. Encouraging exploration through the entropy bonus can help the agent learn a better policy faster, avoid local optima, and improve performance. However, it can also lead to suboptimal policies if the entropy bonus is too high, and it can slow down training by increasing the complexity of the objective function. It's important to consider the domain in which the entropy bonus is applied: simple, fully observed, closed domains may not require the robustness that complex, partially observed, open domains do. Ultimately, the entropy bonus needs to be carefully tuned to achieve the right balance between exploration and exploitation.
Another way to incorporate exploration or curiosity in PPO is to use intrinsic motivation, which is a reward signal that depends on the agent's own internal state and learning progress, rather than the external environment. Intrinsic motivation can capture the agent's curiosity or interest in novel or informative states, and drive it to explore them. There are different ways to define and compute intrinsic motivation, such as prediction error, information gain, empowerment, or novelty.
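As a toy illustration of the novelty flavor of intrinsic motivation, the sketch below pays a bonus that shrinks as a state is revisited; the class, its names, and the crude discretization are assumptions made for the example, and learned methods such as RND (next) replace the hand-written counting with a trained model:

```python
from collections import defaultdict

import numpy as np

class CountBasedBonus:
    """Toy novelty bonus: intrinsic reward decays with visit counts."""
    def __init__(self, scale=0.1):
        self.counts = defaultdict(int)
        self.scale = scale

    def _discretize(self, obs):
        # Crude binning so continuous observations become hashable keys;
        # a real system would use a hash, density model, or learned features.
        return tuple(np.round(obs, 1))

    def intrinsic_reward(self, obs):
        key = self._discretize(obs)
        self.counts[key] += 1
        return self.scale / np.sqrt(self.counts[key])

# During rollout collection the shaped reward might be:
# total_reward = extrinsic_reward + bonus.intrinsic_reward(obs)
```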
One specific example of intrinsic motivation is random network distillation (RND), which was proposed by Burda et al. (2018) and applied to PPO. RND consists of two neural networks: a fixed random network and a trainable predictor network. The random network maps the state observations to a random feature vector, and the predictor network tries to match this vector. The prediction error is then used as the intrinsic reward, which is high for novel states and low for familiar states. RND can help PPO explore large and sparse reward environments.
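Here is a minimal PyTorch sketch of the RND idea; the layer sizes and names are arbitrary choices for illustration, not the architecture from the paper:

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random network distillation: prediction error as curiosity (sketch)."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad = False
        # Trainable predictor that tries to match the target's features.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))

    def prediction_error(self, obs):
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # Large on rarely seen states, small on familiar ones.
        return (pred_feat - target_feat).pow(2).mean(dim=-1)
```

The same squared error serves two roles: per state (detached) it is the intrinsic reward added to the agent's return, and averaged over a batch it is the loss used to train the predictor on the states the agent actually visits.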
RND also introduces separate value heads for the intrinsic and extrinsic reward streams. This idea can be extended to any actor-critic style training algorithm that introduces intrinsic rewards, and it is especially useful when the extrinsic reward is episodic but the intrinsic reward is modeled as non-episodic.
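A sketch of what such a two-head critic could look like; the class, names, and mixing coefficients (ext_coef, int_coef) are illustrative assumptions:

```python
import torch.nn as nn

class TwoHeadCritic(nn.Module):
    """Shared body with separate value heads for the two reward streams."""
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.v_ext = nn.Linear(hidden, 1)  # fit to episodic extrinsic returns
        self.v_int = nn.Linear(hidden, 1)  # fit to non-episodic intrinsic returns

    def forward(self, obs):
        h = self.body(obs)
        return self.v_ext(h), self.v_int(h)

# Advantages computed from each head are then mixed with separate weights,
# e.g. advantage = ext_coef * adv_ext + int_coef * adv_int,
# before being plugged into the usual PPO policy loss.
```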
Another technique to incorporate exploration or curiosity in PPO is to use parameter space noise, which was proposed by Plappert et al. (2017) and applied to PPO. Parameter space noise adds noise to the policy network's parameters rather than to the actions themselves. This produces more consistent, temporally correlated exploration and avoids disrupting the action distribution too much. The noise scale can be adapted based on how far the perturbed policy's behavior drifts from the unperturbed policy, and the technique can improve the sample efficiency and robustness of PPO.
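A minimal sketch of this idea for a PyTorch policy; the standard deviation and the adaptation rule are simplified relative to the paper, and the function name is a placeholder:

```python
import copy

import torch

def perturb_policy(policy, stddev=0.05):
    """Return a copy of the policy with Gaussian noise added to its weights."""
    noisy = copy.deepcopy(policy)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * stddev)
    return noisy

# The perturbed copy collects rollouts; the original parameters are the
# ones actually updated by PPO. The noise scale is typically adapted:
# increase `stddev` when the perturbed policy barely deviates from the
# original, and decrease it when it drifts too far.
```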
The entry above speaks only to the upsides of adding parameter space noise, so I'll describe the downsides of using it with PPO. It can increase training time, lead to suboptimal policies if the noise isn't controlled, require careful tuning of hyperparameters, and be sensitive to the specific environment, generalizing poorly to others. Again, careful tuning is essential to achieve a balance between exploration and exploitation, but this can be time-consuming and require significant experimentation. So while parameter space noise can be a useful technique, weigh these downsides when pairing it with PPO, and consider other approaches to exploration or curiosity as well.
A final technique to incorporate exploration or curiosity in PPO is to use action space noise, which is a more traditional way of adding noise to the actions taken by the agent. Action space noise can be either additive or multiplicative, and can be sampled from different distributions, such as Gaussian, uniform, or Ornstein-Uhlenbeck. Action space noise can help PPO escape from local optima and explore more diverse actions, but it can also introduce more variance and instability.
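For continuous control, an Ornstein-Uhlenbeck sketch might look like the following (simple additive Gaussian noise would just be np.random.normal scaled by a standard deviation); the class and its parameter values are illustrative:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated action noise (sketch)."""
    def __init__(self, action_dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.state = np.zeros(action_dim)

    def reset(self):
        self.state[:] = 0.0

    def sample(self):
        # Mean-reverting random walk: drifts back toward zero over time.
        self.state += (-self.theta * self.state * self.dt
                       + self.sigma * np.sqrt(self.dt)
                       * np.random.randn(*self.state.shape))
        return self.state

# Illustrative use during a rollout:
# noise = OrnsteinUhlenbeckNoise(action_dim=env.action_space.shape[0])
# action = np.clip(policy_action + noise.sample(),
#                  env.action_space.low, env.action_space.high)
```

Whichever source of noise you choose, the common thread is the same: inject enough randomness to keep discovering better behavior, without drowning out the carefully clipped updates that keep PPO stable.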