Exploration Strategies in Reinforcement Learning: An Overview

Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make sequential decisions by interacting with its environment, with the goal of maximizing cumulative reward. The agent develops a policy that maps states to actions in order to achieve long-term objectives.

The core elements of RL are:

  1. Agent: The entity making decisions while interacting with the environment.
  2. Environment: The external context in which the agent operates and receives feedback.
  3. State: The agent's perception of the current situation.
  4. Action: The choice made by the agent based on its observed state.
  5. Reward: The feedback signal from the environment, which evaluates the agent's actions (can be positive or negative).
  6. Policy: The strategy the agent employs to determine its actions.
  7. Value Function: The anticipated total reward an agent can obtain from a specific state or action.
  8. Episode: A full cycle of interactions between the agent and the environment, starting at an initial state and concluding at a terminal state.
  9. Timestep: A single interaction unit between the agent and the environment, typically reflecting one action and the subsequent feedback.

Example: In a game like Tic-Tac-Toe, the player (agent) interacts with the game board (environment) by choosing spots to place X's or O's (actions), aiming for a reward (winning the game (+1), losing (-1), or drawing (0)). The player's strategy (policy) informs its actions based on the current board state, while a value function assesses the potential of various board configurations.

Exploitation-Exploration Tradeoff

In reinforcement learning, the exploitation-exploration dilemma poses a significant challenge for agents as they learn to engage with their environment.

  • Exploitation refers to using the agent's existing knowledge to select actions that are expected to yield the highest immediate rewards.
  • Exploration, conversely, involves taking actions that may not provide immediate rewards but offer chances to learn more about the environment and possibly uncover better long-term strategies.

Striking the right balance between exploration and exploitation is essential for effective RL. Excessive exploration may slow learning and hinder convergence to optimal policies, while too much exploitation without sufficient exploration can lead to early convergence on suboptimal solutions.

In this article, I will introduce various exploration strategies. While this collection is not exhaustive, it serves as a valuable starting point for those interested in understanding exploration techniques.

Classic Exploration Strategies

These strategies work effectively in environments with a relatively small and well-defined action space and moderate uncertainty regarding the rewards associated with different actions.

  • Epsilon-Greedy: The agent selects the action with the highest estimated value with probability 1-ε (exploitation) and, with probability ε, chooses a random action (exploration).
  • Upper Confidence Bound (UCB): Balances exploration and exploitation by considering both the potential high rewards and the uncertainty level associated with each action, prioritizing actions with greater uncertainty.

The UCB action at time t can be defined as follows:

a_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]

Here, Q_t(a) denotes the action value function, N_t(a) is the number of times action a has been selected up to time t, and c controls the strength of the exploration bonus.
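To make ε-greedy and UCB concrete, here is a minimal sketch of both selection rules for a k-armed bandit with incrementally updated value estimates. The names (q_values, counts, c) and the Bernoulli toy bandit are illustrative assumptions, not part of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def ucb(q_values: np.ndarray, counts: np.ndarray, t: int, c: float = 2.0) -> int:
    """Pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).
    Actions never tried (N_t(a) = 0) are selected first."""
    if np.any(counts == 0):
        return int(np.argmin(counts))
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_values + bonus))

# Toy usage: a 3-armed bandit with unknown Bernoulli reward probabilities.
true_p = np.array([0.2, 0.5, 0.8])
q, n = np.zeros(3), np.zeros(3)
for t in range(1, 1001):
    a = ucb(q, n, t)                      # or: epsilon_greedy(q, 0.1)
    r = float(rng.random() < true_p[a])   # sample a Bernoulli reward
    n[a] += 1
    q[a] += (r - q[a]) / n[a]             # incremental mean update
```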

  • Thompson Sampling: Each action starts with a prior probability distribution that reflects the agent’s uncertainty about its true reward. At each timestep, a reward distribution for each action is sampled from its corresponding prior. The agent then selects the action with the highest sampled reward, balancing the goal of maximizing expected reward with exploring less certain actions (those with broader distributions). Once the outcome of the chosen action is observed, the agent updates the reward distributions using Bayesian inference.
  • Boltzmann Exploration: This strategy selects actions probabilistically based on their estimated values and a temperature parameter. The probability of selecting action a at time t is computed using the softmax function (Thompson Sampling and Boltzmann selection are sketched in code after this list):

P(a_t = a) = \frac{\exp(Q(a_t = a)/\tau)}{\sum_{i=1}^{n} \exp(Q(a_t = a_i)/\tau)}

Where:

  • (P(a_t = a)) is the probability of selecting action a at time t.
  • (Q(a_t = a)) is the estimated value of action a at time t.
  • n is the total number of actions.
  • τ is the temperature parameter, controlling the level of exploration. A higher τ encourages more exploration, while a lower τ promotes exploitation of actions with higher estimated values.

  • Exploration by Random Initialization: Initiating the agent's policy or value function parameters randomly can foster exploration, as the agent begins with a variety of strategies that are refined through learning.
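Continuing the bandit sketch above, the following illustrates Thompson Sampling with Beta priors over Bernoulli rewards and Boltzmann (softmax) selection; the Beta-Bernoulli model and the temperature value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_sampling(successes: np.ndarray, failures: np.ndarray) -> int:
    """Sample one reward estimate per action from its Beta posterior, pick the best."""
    samples = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

def boltzmann(q_values: np.ndarray, tau: float = 0.5) -> int:
    """Softmax action selection: P(a) = exp(Q(a)/tau) / sum_i exp(Q(a_i)/tau)."""
    logits = q_values / tau
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

# Toy usage on the same 3-armed Bernoulli bandit.
true_p = np.array([0.2, 0.5, 0.8])
wins, losses = np.zeros(3), np.zeros(3)
for _ in range(1000):
    a = thompson_sampling(wins, losses)
    r = rng.random() < true_p[a]
    wins[a] += r
    losses[a] += 1 - r
```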

Exploration Problems

Sparse Rewards

Classic exploration algorithms are designed for environments where rewards are frequent, meaning the agent typically receives feedback after almost every action. In contrast, in environments with sparse rewards, the agent may only receive a reward signal after a lengthy sequence of actions (for example, only receiving a reward after completing a game level). In such cases, the agent lacks local feedback to improve its policy, making naive exploration strategies ineffective.

The Noisy TV Problem

In reinforcement learning, the noisy TV problem illustrates a scenario in which an agent becomes trapped in meaningless exploration.

  • Imagine a TV that constantly changes channels (the noisy TV).
  • The agent can flip through channels (take actions) to view new content (experience surprises).
  • However, this constant shifting does not yield useful information related to the task the agent is meant to learn.

As a result, the agent becomes stuck in endless exploration of the noisy TV, failing to make progress toward its actual goal. This issue highlights challenges caused by stochasticity or randomness in the environment, as the agent may be drawn to stochastic transitions rather than genuinely new states.

Intrinsic Motivation

Intrinsic motivation (IM) is a concept developed to tackle specific limitations and challenges found in traditional RL methods, especially within complex and uncertain environments. It refers to the internal drive or curiosity that propels an agent to explore its surroundings and participate in learning activities.

Unlike extrinsic motivation, which arises from external rewards or penalties tied to the agent's interactions with the environment, intrinsic motivation stems from the agent's inherent desire to learn and discover new experiences.

In essence, intrinsic motivation assigns an internal reward to encourage the agent to autonomously acquire new knowledge and skills.

The total reward received by the agent at each timestep is:

r_t = r_e + \beta \, r_i

Where:

  • (r_e) is the extrinsic reward received at time t, given by the environment based on the agent's actions.
  • (r_i) is the intrinsic reward at time t, reflecting the agent's internal motivation or curiosity-driven exploration.
  • β is a weighting factor that balances extrinsic and intrinsic rewards.
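As a tiny illustration, the helper below mixes the two reward signals with a weighting factor β; the default value 0.1 is just an example.

```python
def total_reward(r_extrinsic: float, r_intrinsic: float, beta: float = 0.1) -> float:
    """r_t = r_e + beta * r_i; beta trades off exploration pressure against the task reward."""
    return r_extrinsic + beta * r_intrinsic
```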

Intrinsic motivation addresses several limitations in traditional RL:

  • Sparse rewards: IM provides an internal drive for exploration and learning, enabling agents to pursue new experiences and gather information even when external rewards are infrequent.
  • Building effective state representation: IM assists the agent in distinguishing relevant features from irrelevant noise. This allows the agent to create a more effective state representation that captures the environment's underlying structure while filtering out extraneous details.
  • Temporal abstraction of actions: IM encourages exploration of temporally extended actions or sequences, facilitating the discovery of meaningful temporal patterns in the environment. This enables agents to learn higher-level strategies and action sequences (referred to as options) that lead to long-term rewards.
  • Learning a curriculum: IM aids agents in learning to schedule or order tasks (or options) in multi-task RL, without requiring expert knowledge.

In the sections that follow, I will present various IM exploration strategies.

Count-Based Methods

The principle underlying these methods is to monitor how often states are encountered and assign rewards accordingly, favoring states that are visited less frequently by giving them higher intrinsic rewards.

Empirical Count Function

Let (N(s)) represent the empirical count function that tracks the actual encounters of state s. The intrinsic reward can be expressed as:

r_i(s) = \frac{1}{\sqrt{N(s)}}

While effective in tabular environments (with discrete state spaces), this method is impractical for continuous or large state spaces, where an agent rarely revisits exactly the same state. It is also inefficient, since most states keep a count of 0; what is needed is a way to assign non-zero counts even to states that have never been visited.
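In tabular settings the bonus above is straightforward to implement. The sketch below keeps a dictionary of visit counts and returns 1/sqrt(N(s)); it assumes states are hashable (e.g. tuples of discretized observations).

```python
from collections import defaultdict
from math import sqrt

class CountBonus:
    """Tabular count-based intrinsic reward: r_i(s) = 1 / sqrt(N(s))."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reward(self, state) -> float:
        self.counts[state] += 1
        return 1.0 / sqrt(self.counts[state])

# Usage: states must be hashable (e.g. tuples of discretized observations).
bonus = CountBonus()
print(bonus.reward((0, 0)))  # 1.0 on the first visit
print(bonus.reward((0, 0)))  # ~0.707 on the second visit
```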

Counting After Hashing

To address the limitations mentioned above and facilitate counting in high-dimensional state spaces, one solution is to hash the state space when it becomes too extensive. The Locality-Sensitive Hashing (LSH) function (h: S → Z) discretizes the state space, mapping states into hash codes for easier tracking.

The intrinsic reward is expressed as:

r_i(s) = \frac{1}{\sqrt{N(h(s))}}

However, results obtained through counting after hashing show only slight improvement compared to classic exploration policies.
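As an illustration of counting after hashing, the sketch below uses SimHash-style random sign projections as the LSH function and reuses the count bonus; the projection dimension k and the fixed Gaussian projection matrix are assumptions of this sketch.

```python
import numpy as np
from collections import defaultdict

class SimHashCounter:
    """Counts visits to hash codes of continuous states (SimHash-style LSH):
    h(s) = sign(A s) with a fixed random Gaussian matrix A, r_i(s) = 1 / sqrt(N(h(s)))."""

    def __init__(self, state_dim: int, k: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # fixed random projection
        self.counts = defaultdict(int)

    def reward(self, state: np.ndarray) -> float:
        code = tuple((self.A @ state > 0).astype(int))  # binary hash code
        self.counts[code] += 1
        return 1.0 / np.sqrt(self.counts[code])

# Usage: nearby states tend to share a code, so they share a count.
counter = SimHashCounter(state_dim=4)
print(counter.reward(np.array([0.1, -0.2, 0.3, 0.0])))
print(counter.reward(np.array([0.1, -0.2, 0.3, 0.01])))  # likely same code -> lower reward
```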

Density Model

This approach is effective in continuous scenarios as it incorporates a notion of generalization, where nearby states receive similar counts. The goal is to learn a probability density function and observe how this function changes upon visiting a state. The change in the probability density function can be used to create a pseudo-count estimation.

The intrinsic reward can be represented as:

r_i(s) = \frac{1}{\sqrt{\hat{N}(s)}}

The pseudo-count (\hat{N}(s)) is defined as:

\hat{N}(s) = \frac{p(s)\,(1 - p'(s))}{p'(s) - p(s)}

Where (p(s)) is the density model's probability of observing s, and (p'(s)) is its probability of observing s after one additional training pass on s.
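A minimal helper for the pseudo-count above: given the density model's probability of s before (p) and after (p') the update, it returns the pseudo-count and a corresponding bonus; the small constants added for numerical stability are assumptions of this sketch.

```python
from math import sqrt

def pseudo_count(p: float, p_prime: float, eps: float = 1e-8) -> float:
    """N_hat(s) = p(s) * (1 - p'(s)) / (p'(s) - p(s)); meaningful when p'(s) > p(s)."""
    return p * (1.0 - p_prime) / max(p_prime - p, eps)

def intrinsic_reward(p: float, p_prime: float) -> float:
    """r_i(s) = 1 / sqrt(N_hat(s) + 0.01), a common bonus shape for pseudo-counts."""
    return 1.0 / sqrt(pseudo_count(p, p_prime) + 0.01)

# Example: the modeled density of s rises from 0.010 to 0.012 after one update.
print(pseudo_count(0.010, 0.012))       # ~4.94 pseudo-visits
print(intrinsic_reward(0.010, 0.012))   # corresponding exploration bonus
```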

This method demands numerous assumptions regarding both training dynamics and the architectures employed:

  • Specifically, the training procedure for the density model must be fully online and can only update a state once.
  • The model used must provide normalized probability density estimates (no use of GANs or VAEs).

Although density model algorithms function well in environments with sparse rewards, they introduce a significant layer of complexity.

Prediction-Based Methods

Prediction-based methods utilize the agent's predictive abilities to promote exploration in regions of the environment where predictions are uncertain or diverge from reality.

Forward Dynamics Approaches

In forward dynamics, the agent learns to predict the outcomes of its actions by estimating the next state or state transition dynamics. Prediction-based intrinsic motivation methods rooted in forward dynamics assess prediction errors by comparing the anticipated next state with the actual observed state. High prediction errors indicate uncertainty or surprise, motivating the agent to explore new actions or states to mitigate uncertainty.

r_i(s_t, a_t, s_{t+1}) = \left\| \hat{f}(s_{t+1}) - f(s_{t+1}) \right\|_2^2

Where (f) is the feature embedding function and (\hat{f}(s_{t+1})) is the predicted embedding of the next state.
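As a simplified illustration (not any specific paper's architecture), the sketch below uses a fixed random linear embedding as f and a linear forward model trained online; the squared prediction error in feature space is returned as the intrinsic reward.

```python
import numpy as np

class ForwardDynamicsBonus:
    """Intrinsic reward = || f_hat(s_{t+1}) - f(s_{t+1}) ||^2 in a feature space,
    where f is a fixed random linear embedding and the forward model is linear."""

    def __init__(self, state_dim: int, action_dim: int, feat_dim: int = 32,
                 lr: float = 1e-2, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.f = rng.standard_normal((feat_dim, state_dim)) / np.sqrt(state_dim)  # embedding
        self.W = np.zeros((feat_dim, feat_dim + action_dim))  # linear forward model
        self.lr = lr

    def reward(self, s: np.ndarray, a_onehot: np.ndarray, s_next: np.ndarray) -> float:
        phi, phi_next = self.f @ s, self.f @ s_next
        x = np.concatenate([phi, a_onehot])     # forward-model input: features + action
        pred = self.W @ x                       # predicted next-state features
        err = pred - phi_next
        self.W -= self.lr * np.outer(err, x)    # one SGD step on the squared error
        return float(err @ err)                 # intrinsic reward: prediction error
```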

Dynamic Auto-Encoder: This model computes the distance between the predicted state and the actual state in a feature space compressed using an auto-encoder. This method yields marginal improvements compared to Boltzmann exploration on specific standard Atari games (a set of classic video games developed by Atari and used as benchmarks for testing RL algorithms). However, these methods struggle with stochasticity in the environment. For example, introducing random noise can distract the agent, causing it to focus on the noise rather than predicting the next state (the white-noise problem).

Intrinsic Curiosity Module (ICM): The agent's curiosity is framed as the error in its capacity to predict the outcomes of its actions within a feature space learned by an inverse dynamics model. At timestep t, the agent is in state (s_t), interacts with the environment by executing an action (a_t) sampled from its current policy, and transitions to state (s_{t+1}).

The objective of policy π is to maximize the sum of the extrinsic reward (r_e) provided by the environment and the curiosity-driven intrinsic reward (r_i) generated by ICM: (r_e + r_i).

The agent consists of two subsystems:

  1. ICM: A reward generator that produces a curiosity-driven intrinsic reward.
  2. A policy that outputs a sequence of actions to maximize that reward signal.

ICM incorporates two key components:

  1. Inverse Dynamics Model: This model learns to predict the agent's actions from observed state transitions, encoding states (s_t) and (s_{t+1}) into features (φ(s_t), φ(s_{t+1})) that aim to predict (a_t).
  2. Forward Dynamics Model: This model estimates the consequences of the agent's actions by predicting the next state or state transition dynamics. It inputs (φ(s_t)) and (a_t) to forecast the feature representation (φ̂(s_{t+1})) of state (s_{t+1}). The intrinsic reward assigned is the prediction error between (φ̂(s_{t+1})) and (φ(s_{t+1})).

The embedding function (φ(s_t)) lacks motivation to encode environmental features that the agent cannot influence through its actions, making the exploration strategy resilient to uncontrollable aspects of the environment (addressing the stochasticity issue).
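The sketch below is a condensed PyTorch version of the two ICM components for discrete actions; the network sizes, the reward scale eta, and the choice to detach the encoder in the forward loss are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Sketch of an Intrinsic Curiosity Module for discrete actions:
    an encoder phi, an inverse model predicting a_t from (phi(s_t), phi(s_{t+1})),
    and a forward model predicting phi(s_{t+1}) from (phi(s_t), a_t)."""

    def __init__(self, obs_dim: int, n_actions: int, feat_dim: int = 64, eta: float = 0.01):
        super().__init__()
        self.n_actions, self.eta = n_actions, eta
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        self.inverse = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_actions))
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 128), nn.ReLU(),
                                           nn.Linear(128, feat_dim))

    def forward(self, s, a, s_next):
        # s, s_next: float tensors (batch, obs_dim); a: long tensor of action indices.
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        # Inverse model: predict the action taken between the two states.
        logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(logits, a)
        # Forward model: predict the next feature vector; the encoder is detached here
        # so that only the inverse model shapes the features (controllable features).
        a_onehot = F.one_hot(a, self.n_actions).float()
        phi_pred = self.forward_model(torch.cat([phi.detach(), a_onehot], dim=-1))
        forward_loss = F.mse_loss(phi_pred, phi_next.detach())
        # Curiosity-driven intrinsic reward: per-sample forward prediction error.
        r_i = self.eta * 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=-1)
        return r_i.detach(), inverse_loss, forward_loss
```

In training, the policy would receive r_e + r_i while inverse_loss and forward_loss are minimized jointly with the policy objective.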

Exploration with Mutual Information (EMI): EMI constructs embedding representations for both state and action spaces (whether discrete or continuous). It maximizes mutual information to ensure that representations of functionally similar states are positioned closely together, while those of distinct states are spaced apart.

Let (φ_s: S → R^d) be the embedding function for states and (φ_a: A → R^d) the embedding function for actions, each with its own parameters. The aim is to minimize uncertainty regarding (φ_s(s')) given the embedding representations of the preceding state and action ([φ_s(s), φ_a(a)]), and vice versa.

The embedding functions are learned through the maximization of (I([φ_s(s), φ_a(a)], φ_s(s'))) and (I([φ_s(s), φ_s(s')], φ_a(a))).

The forward model F is constrained to function as a simple linear model in the representation space. However, some transitions remain highly nonlinear and challenging to predict. The error model (S: S × A → R) is another neural network that takes state and action as inputs, estimating the irreducible error under the linear model. EMI mitigates the white-noise problem; however, like ICM, it does not factor in features that relate to long-term control in the representation.

Random Networks Approaches

Instead of predicting the dynamics of the environment as an exploration strategy (forward dynamics), we can also make predictions regarding a random task. This section introduces random networks-based approaches for exploration.

Random Network Distillation (RND): RND evaluates state novelty by distilling a random neural network (with fixed weights) into another neural network. It consists of:

  1. A target function (f), which is a fixed, randomly initialized neural network mapping observations to features.
  2. A learner/predictor function (f̂), a neural network trained on the data collected by the agent to match the target's outputs.

For each state, the random network generates continuous random features. The predictor network learns to replicate the output of the random network for each state. The intrinsic reward is based on the prediction error.

As the prediction network becomes trained on certain states, it increasingly approximates the output of the random network for those states, resulting in lower prediction errors. Conversely, new states will exhibit higher prediction errors due to their absence in training.
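A minimal PyTorch sketch of this mechanism: a frozen random target network, a predictor trained to match it, and the per-state prediction error used as the intrinsic reward; the MLP sizes and optimizer settings are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def mlp(obs_dim: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

class RND(nn.Module):
    """Random Network Distillation: intrinsic reward = || f_hat(s) - f(s) ||^2,
    where f is a fixed random network and f_hat is trained to match it."""

    def __init__(self, obs_dim: int, out_dim: int = 64, lr: float = 1e-4):
        super().__init__()
        self.target = mlp(obs_dim, out_dim)
        self.predictor = mlp(obs_dim, out_dim)
        for p in self.target.parameters():        # the target network stays fixed
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def reward_and_update(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        error = (pred_feat - target_feat).pow(2).mean(dim=-1)  # per-state novelty
        loss = error.mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return error.detach()   # use as the intrinsic reward (often normalized in practice)
```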

RND performs exceptionally well on Montezuma's Revenge, a popular Atari game used for benchmarking exploration, but it requires a significantly larger number of training steps. A downside is that random features might not sufficiently capture the richness of an environment.

Memory-Based Methods

Memory-based exploration methods counteract the limitations associated with reward-based exploration by utilizing external memory. This approach addresses several drawbacks of reward bonus-based exploration, including slow function approximation, fluctuating exploration bonuses, and knowledge loss due to states losing novelty over time.

Never Give Up Algorithm (NGU): This method comprises two intrinsic reward modules:

  1. An episodic novelty module (short-term) that encourages the agent to explore new states within a single episode.
  2. A lifelong novelty module (inter-episodic) that gradually discourages revisiting states that have already been explored many times across episodes.

The episodic novelty module maintains an episodic memory that records an embedding for each state visited during an episode. The short-term intrinsic reward (r_t) is calculated by measuring the distance between the embedding of the current state and those stored in the memory buffer. A higher distance indicates a novel state, resulting in a greater reward.

In this module, the concept of novelty does not consider inter-episode interactions; a state visited multiple times in other episodes yields the same intrinsic reward as a completely new state, as long as both are novel concerning the current episode.

The Lifelong Novelty Module (inter-episodic) employs Random Network Distillation. A convolutional network (g) is trained to match the output of another randomly initialized, untrained convolutional neural network (h).

The episodic and inter-episodic rewards are combined according to the following formula:

r_t^i = r_t^{\text{episodic}} \cdot \min\{\max\{\alpha_t, 1\}, L\}

Where (α_t) is the lifelong novelty multiplier produced by the RND-based module and L is a fixed maximum scaling factor.
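A simplified sketch of how the two signals interact, assuming a state-embedding function and an RND-style lifelong modulator α are computed elsewhere; the inverse-kernel form of the episodic bonus is condensed relative to the paper, and the cap L = 5 is used only as an example.

```python
import numpy as np

class EpisodicNovelty:
    """Episodic memory of state embeddings; the bonus shrinks as the current
    embedding gets close to embeddings already stored in this episode."""

    def __init__(self, k: int = 10, eps: float = 1e-3):
        self.memory, self.k, self.eps = [], k, eps

    def reward(self, emb: np.ndarray) -> float:
        if self.memory:
            dists = np.sort([np.sum((emb - m) ** 2) for m in self.memory])[: self.k]
            similarity = np.sum(self.eps / (dists + self.eps))  # inverse-kernel similarity
            r = 1.0 / np.sqrt(similarity + 1e-8)                # novel state -> large bonus
        else:
            r = 1.0
        self.memory.append(emb)
        return r

    def reset(self):          # called at the start of every episode
        self.memory = []

def ngu_intrinsic_reward(r_episodic: float, alpha: float, L: float = 5.0) -> float:
    """Combine the episodic bonus with the lifelong modulator alpha (e.g. from RND):
    r_i = r_episodic * min(max(alpha, 1), L)."""
    return r_episodic * min(max(alpha, 1.0), L)
```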
NGU demonstrates competitive performance across various benchmark games.

Go-Explore: This method suggests that the primary barrier to effective exploration arises from algorithms experiencing “detachment,” where they forget how to return to previously visited states, and “derailment,” where they fail to revisit a state before further exploration. These challenges are tackled by explicitly retaining promising states and prioritizing returns to these states before continuing exploration.

The Go-Explore algorithm consists of two phases:

  1. Phase 1 (Exploration Phase): Go-Explore keeps an archive of diverse and high-performing states discovered during exploration, along with the trajectories leading to them. This archive serves as a repository of promising states that the agent can return to and continue random exploration from, updating the archive as better trajectories are found (a toy sketch of this phase follows the list). The process continues until the task is solved and at least one successful trajectory is found.
  2. Phase 2 ("Robustification"): The aim is to make the solution robust to stochasticity through imitation learning: Go-Explore retains the highest-scoring trajectory for each state, and these trajectories are used to train a robust and effective policy through Learning from Demonstrations, replacing human expert demonstrations.
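A toy sketch of Phase 1 under strong assumptions: a deterministic, resettable environment exposed through hypothetical reset()/step()/sample_action() methods and a hand-written cell() discretization; the real algorithm uses downscaled game frames as cells and more careful state selection.

```python
import random
from collections import namedtuple

Entry = namedtuple("Entry", "trajectory score")  # best known trajectory and score per cell

def go_explore_phase1(env, cell, n_iters: int = 1000, horizon: int = 50) -> dict:
    """Archive of cells -> best (trajectory, score); repeatedly return to a stored
    cell by replaying its trajectory in a deterministic env, then explore randomly."""
    archive = {}
    s = env.reset()
    archive[cell(s)] = Entry([], 0.0)
    for _ in range(n_iters):
        start = random.choice(list(archive.values()))   # select a promising cell
        s, score = env.reset(), 0.0
        for a in start.trajectory:                      # "go": replay to return to it
            s, r, done = env.step(a)
            score += r
        traj = list(start.trajectory)
        for _ in range(horizon):                        # "explore" from there
            a = env.sample_action()
            s, r, done = env.step(a)
            traj.append(a)
            score += r
            c = cell(s)
            if c not in archive or score > archive[c].score:
                archive[c] = Entry(list(traj), score)   # keep the best trajectory per cell
            if done:
                break
    return archive
```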

Go-Explore achieves impressive results on Montezuma's Revenge compared to other algorithms.

The policy-based Go-Explore represents an enhancement of the Go-Explore algorithm, achieving even better results on Montezuma’s Revenge.

As previously noted, this discussion of exploration strategies is not exhaustive. For additional implementations of various DRL and exploration algorithms, please visit: https://github.com/opendilab/DI-engine.

References

Aubret, Arthur, Laetitia Matignon, and Salima Hassas. 2019. A Survey on Intrinsic Motivation in Reinforcement Learning. https://doi.org/10.48550/arXiv.1908.06976.

Pathak, Deepak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. 2017. Curiosity-driven Exploration by Self-supervised Prediction. https://doi.org/10.48550/arXiv.1705.05363.

Badia, Adrià P., Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, et al. 2020. Never Give Up: Learning Directed Exploration Strategies. https://doi.org/10.48550/arXiv.2002.06038.

Burda, Yuri, Harrison Edwards, Amos Storkey, and Oleg Klimov. 2018. Exploration by Random Network Distillation. https://doi.org/10.48550/arXiv.1810.12894.

Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. 2021. First return, then explore. https://doi.org/10.1038/s41586-020-03157-9.

Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. 2021. Go-Explore: a New Approach for Hard-Exploration Problems. https://doi.org/10.48550/arXiv.1901.10995.

Jiang, Yiding, J. Zico Kolter, and Roberta Raileanu. 2023. On the Importance of Exploration for Generalization in Reinforcement Learning. https://doi.org/10.48550/arXiv.2306.05483.

Kim, Hyoungseok, Jaekyeom Kim, Yeonwoo Jeong, Sergey Levine, and Hyun Oh Song. 2019. EMI: Exploration with Mutual Information. https://doi.org/10.48550/arXiv.1810.01176.
