Exploration Strategies¶

GFlowNets must discover diverse modes in the reward landscape. Without exploration, the policy can collapse to a small number of high-reward modes while missing others. torchgfn provides several complementary exploration mechanisms.

Epsilon-Greedy¶

Add uniform random actions with probability epsilon during sampling:

trajectories = gflownet.sample_trajectories(env, n=batch_size, epsilon=0.1)

This mixes the learned policy with a uniform distribution. Higher epsilon means more exploration but noisier training signal.

For graph environments, epsilon can be specified per action component using a dictionary:

from collections import defaultdict
epsilon = defaultdict(float)
epsilon[GraphActions.ACTION_TYPE_KEY] = 0.1
epsilon[GraphActions.EDGE_INDEX_KEY] = 0.05

trajectories = gflownet.sample_trajectories(env, n=batch_size, epsilon=epsilon)

Note: Epsilon-greedy makes training off-policy — you must set recalculate_all_logprobs=True when computing the loss.

See: train_hypergrid_simple.py (--epsilon), train_with_example_modes.py (per-component epsilon).

Temperature Scaling¶

Soften or sharpen the policy distribution by scaling logits before sampling:

trajectories = gflownet.sample_trajectories(env, n=batch_size, temperature=2.0)

temperature > 1: Flatter distribution, more exploration
temperature < 1: Sharper distribution, more exploitation
temperature = 1: No change (on-policy)

Like epsilon-greedy, temperature scaling makes training off-policy.

See: train_hypergrid_exploration_examples.py (systematic comparison of temperature values).

Temperature Annealing¶

For continuous environments, a common pattern is to start with high temperature (exploration) and decay to 1.0 (exploitation) over training:

for iteration in range(n_iterations):
    progress = iteration / (n_iterations // 2)  # Anneal over first half
    temperature = max(1.0, initial_temp * (1 - progress))
    trajectories = gflownet.sample_trajectories(env, n=batch_size, temperature=temperature)

See: train_box.py (BoxCartesianPFEstimator.temperature attribute, linearly decayed).

Exploration Variance (Continuous Environments)¶

For continuous action spaces, inject additional variance into the sampling distribution via scale_factor or exploration_std:

trajectories = gflownet.sample_trajectories(env, n=batch_size, scale_factor=1.5)

This widens the action distribution without changing its center. A schedule that decays from high to zero works well:

scale_schedule = torch.linspace(2.0, 0.0, n_iterations)

See: train_line.py (decaying scale_factor schedule), train_diffusion_rtb.py (exploration_std warm-down).

Noisy Layers¶

Add learnable noise to network weights for state-dependent exploration:

from gfn.utils.modules import MLP
module = MLP(input_dim, output_dim, n_noisy_layers=2)

Noisy layers inject parametric noise into the final layers of the policy network. Unlike epsilon-greedy (which is state-independent), noisy layers enable the network to learn where to explore.

See: train_hypergrid_exploration_examples.py (comparison with other strategies).

Local Search Sampling¶

The LocalSearchSampler implements the back-and-forth heuristic: from a terminal state, sample backward K steps to a junction state, then sample forward to a new terminal state. This refines existing trajectories rather than generating from scratch.

from gfn.samplers import LocalSearchSampler

sampler = LocalSearchSampler(pf_estimator, pb_estimator)
trajectories = sampler.sample_trajectories(
    env,
    n=batch_size,
    n_local_search_loops=2,
    back_ratio=0.5,
    use_metropolis_hastings=True,
)

n_local_search_loops: Number of refinement rounds per trajectory
back_ratio: Fraction of trajectory length to walk backward (controls search depth)
use_metropolis_hastings: Accept/reject refined trajectories based on a probabilistic criterion

See: train_hypergrid_local_search.py, train_box.py (with --sampler local_search).

Replay Buffers as Exploration¶

Replay buffers provide implicit exploration by reusing diverse past experience. See the Off-Policy Training guide for details on ReplayBuffer, TerminatingStateBuffer, and expert data warm-starting.

Comparing Strategies¶

train_hypergrid_exploration_examples.py provides a systematic comparison framework that runs 9 configurations (on-policy, replay buffer, epsilon variants, noisy layers, temperature, and combinations) across multiple seeds and plots mode discovery, L1 distance, and logZ error. This is the best starting point for understanding which strategies work for your problem.

Summary¶

Strategy	Off-policy?	State-dependent?	Best for
Epsilon-greedy	Yes	No	Simple baseline, discrete environments
Temperature	Yes	No	Softening/sharpening action selection
Noisy layers	Yes	Yes	Learnable, adaptive exploration
Scale factor	Yes	No	Continuous action spaces
Local search	No (refines)	Yes	Improving existing trajectories
Replay buffer	Yes	N/A	Reusing diverse past experience