Exploration Strategies
GFlowNets must discover diverse modes in the reward landscape. Without exploration, the policy can collapse to a small number of high-reward modes while missing others. torchgfn provides several complementary exploration mechanisms.
Epsilon-Greedy
Add uniform random actions with probability epsilon during sampling:
trajectories = gflownet.sample_trajectories(env, n=batch_size, epsilon=0.1)
This mixes the learned policy with a uniform distribution. Higher epsilon means more exploration but a noisier training signal.
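To make the mixing concrete, here is a minimal standard-library sketch of the resulting sampling distribution. It is not torchgfn's implementation: the helper names and the example probabilities are hypothetical, and it ignores action masking, which a real environment would apply before building the uniform component.

```python
import random


def epsilon_mixed_probs(policy_probs, epsilon):
    """Mix policy probabilities with a uniform distribution over the same actions."""
    uniform = 1.0 / len(policy_probs)
    return [(1.0 - epsilon) * p + epsilon * uniform for p in policy_probs]


def sample_action(policy_probs, epsilon, rng):
    """Draw one action index from the epsilon-mixed distribution."""
    mixed = epsilon_mixed_probs(policy_probs, epsilon)
    return rng.choices(range(len(mixed)), weights=mixed, k=1)[0]


policy_probs = [0.90, 0.08, 0.02]  # hypothetical policy output for one state
mixed = epsilon_mixed_probs(policy_probs, epsilon=0.1)
action = sample_action(policy_probs, 0.1, random.Random(0))
```

The mixture raises the probability of rarely chosen actions and lowers that of the dominant one, so the sampled trajectories no longer follow the policy's own distribution.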
For graph environments, epsilon can be specified per action component using a dictionary:
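from collections import defaultdict

from gfn.actions import GraphActions

# Components that are not listed default to epsilon = 0.0 (no added exploration).
epsilon = defaultdict(float)
epsilon[GraphActions.ACTION_TYPE_KEY] = 0.1
epsilon[GraphActions.EDGE_INDEX_KEY] = 0.05
trajectories = gflownet.sample_trajectories(env, n=batch_size, epsilon=epsilon)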
Note: Epsilon-greedy makes training off-policy, so you must set recalculate_all_logprobs=True when computing the loss.
See: train_hypergrid_simple.py (--epsilon), train_with_example_modes.py (per-component epsilon).
Temperature Scaling
Soften or sharpen the policy distribution by scaling logits before sampling:
trajectories = gflownet.sample_trajectories(env, n=batch_size, temperature=2.0)
- temperature > 1: Flatter distribution, more exploration
- temperature < 1: Sharper distribution, more exploitation
- temperature = 1: No change (on-policy)
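Like epsilon-greedy, temperature scaling makes training off-policy whenever the temperature differs from 1, so recalculate_all_logprobs=True is needed when computing the loss.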
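The effect of temperature on a categorical policy can be checked with a small standard-library sketch. The logits and helper names below are hypothetical and only illustrate dividing logits by the temperature before the softmax. This is not torchgfn's own code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [2.0, 1.0, 0.0]  # hypothetical policy logits for one state
on_policy = softmax_with_temperature(logits, 1.0)
flatter = softmax_with_temperature(logits, 2.0)
sharper = softmax_with_temperature(logits, 0.5)
```

Comparing `entropy(flatter)`, `entropy(on_policy)`, and `entropy(sharper)` shows the ordering described in the list above: entropy grows with temperature.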
See: train_hypergrid_exploration_examples.py (systematic comparison of temperature values).
Temperature Annealing
For continuous environments, a common pattern is to start with high temperature (exploration) and decay to 1.0 (exploitation) over training:
for iteration in range(n_iterations):
    # Anneal linearly from initial_temp to 1.0 over the first half of training.
    progress = min(1.0, iteration / max(1, n_iterations // 2))
    temperature = initial_temp + (1.0 - initial_temp) * progress
    trajectories = gflownet.sample_trajectories(env, n=batch_size, temperature=temperature)
See: train_box.py (BoxCartesianPFEstimator.temperature attribute, linearly decayed).
Exploration Variance (Continuous Environments)
For continuous action spaces, inject additional variance into the sampling distribution via scale_factor or exploration_std:
trajectories = gflownet.sample_trajectories(env, n=batch_size, scale_factor=1.5)
This widens the action distribution without changing its center. A schedule that decays from a high value to zero works well:
scale_schedule = torch.linspace(2.0, 0.0, n_iterations)
See: train_line.py (decaying scale_factor schedule), train_diffusion_rtb.py (exploration_std warm-down).
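The following standard-library sketch illustrates the idea for a one-dimensional Gaussian policy. It assumes the exploration term is added to the policy's standard deviation, which is consistent with a schedule that ends at zero, but the example scripts may apply it differently. All names and numbers are hypothetical.

```python
import random
import statistics


def decaying_schedule(start, end, n_steps):
    """Linearly spaced values from start to end inclusive, like torch.linspace."""
    if n_steps == 1:
        return [start]
    step = (end - start) / (n_steps - 1)
    return [start + i * step for i in range(n_steps)]


def sample_action(mean, std, scale_factor, rng):
    """Sample from a Gaussian whose std is widened by an additive scale_factor (assumption)."""
    return rng.gauss(mean, std + scale_factor)


rng = random.Random(0)
mean, std = 0.3, 0.1  # hypothetical policy output for one state
narrow = [sample_action(mean, std, 0.0, rng) for _ in range(5000)]
wide = [sample_action(mean, std, 1.5, rng) for _ in range(5000)]
schedule = decaying_schedule(2.0, 0.0, 5)
```

Both sample sets are centered on the same mean; only the spread differs. Once the schedule reaches zero, sampling uses the policy's own standard deviation again.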
Noisy Layers
Add learnable noise to network weights for state-dependent exploration:
from gfn.utils.modules import MLP
module = MLP(input_dim, output_dim, n_noisy_layers=2)
Noisy layers inject parametric noise into the final layers of the policy network. Unlike epsilon-greedy (which is state-independent), noisy layers enable the network to learn where to explore.
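As a rough illustration of what parametric weight noise means, here is a standard-library sketch of a noisy linear layer in which each weight is a mean plus a scaled random perturbation. It is not the layer that MLP builds internally. The class name and initialization are hypothetical, and the noise scale is fixed here, whereas in a trained network it would be learned by gradient descent.

```python
import random


class NoisyLinear:
    """Linear layer (no bias) whose weights are mu + sigma * eps, with eps redrawn on every call."""

    def __init__(self, in_dim, out_dim, sigma_init, rng):
        self.rng = rng
        bound = 1.0 / in_dim ** 0.5
        self.weight_mu = [
            [rng.uniform(-bound, bound) for _ in range(in_dim)] for _ in range(out_dim)
        ]
        # Fixed here for simplicity; a learnable parameter in a real noisy layer.
        self.weight_sigma = [[sigma_init] * in_dim for _ in range(out_dim)]

    def forward(self, x, noisy=True):
        out = []
        for mu_row, sigma_row in zip(self.weight_mu, self.weight_sigma):
            total = 0.0
            for mu, sigma, xi in zip(mu_row, sigma_row, x):
                eps = self.rng.gauss(0.0, 1.0) if noisy else 0.0
                total += (mu + sigma * eps) * xi
            out.append(total)
        return out


layer = NoisyLinear(in_dim=4, out_dim=3, sigma_init=0.5, rng=random.Random(0))
x = [1.0, -0.5, 0.25, 2.0]  # hypothetical state features
noisy_a = layer.forward(x)
noisy_b = layer.forward(x)
clean_a = layer.forward(x, noisy=False)
clean_b = layer.forward(x, noisy=False)
at_origin = layer.forward([0.0, 0.0, 0.0, 0.0])
```

Because the perturbation multiplies the input, its effect on the output depends on the state: two noisy passes on `x` differ, while an all-zero input produces no output noise at all.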
See: train_hypergrid_exploration_examples.py (comparison with other strategies).
Local Search Sampling
The LocalSearchSampler implements the back-and-forth heuristic: from a terminal state, sample backward K steps to a junction state, then sample forward to a new terminal state. This refines existing trajectories rather than generating from scratch.
from gfn.samplers import LocalSearchSampler
sampler = LocalSearchSampler(pf_estimator, pb_estimator)
trajectories = sampler.sample_trajectories(
    env,
    n=batch_size,
    n_local_search_loops=2,
    back_ratio=0.5,
    use_metropolis_hastings=True,
)
- n_local_search_loops: Number of refinement rounds per trajectory
- back_ratio: Fraction of trajectory length to walk backward (controls search depth)
- use_metropolis_hastings: Accept/reject refined trajectories based on a probabilistic criterion
See: train_hypergrid_local_search.py, train_box.py (with --sampler local_search).
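The back-and-forth loop can be sketched on a toy problem with the standard library alone. Everything below is hypothetical and is not LocalSearchSampler's code. The environment builds a fixed-length bit string one bit at a time, the forward policy is uniform, and the backward step simply removes the last bit. Under those assumptions the proposal is symmetric, so the Metropolis-Hastings acceptance ratio reduces to a reward ratio. The greedy rule used when the flag is off is likewise an assumption.

```python
import random

N_BITS = 8  # hypothetical toy environment: trajectories append one bit per step


def reward(state):
    """Toy reward: strings with more ones are better."""
    return 1.0 + sum(state)


def sample_forward(prefix, rng):
    """Complete a partial state using a uniform forward policy."""
    state = list(prefix)
    while len(state) < N_BITS:
        state.append(rng.randint(0, 1))
    return tuple(state)


def local_search(state, n_loops, back_ratio, use_mh, rng):
    """Refine one terminal state by repeated backtrack-and-resample rounds."""
    n_back = max(1, int(back_ratio * N_BITS))
    for _ in range(n_loops):
        junction = state[: N_BITS - n_back]        # walk backward K steps
        candidate = sample_forward(junction, rng)  # sample forward to a new terminal state
        if use_mh:
            # Symmetric proposal in this toy setup: accept with probability min(1, R(x') / R(x)).
            accept = rng.random() < min(1.0, reward(candidate) / reward(state))
        else:
            # Greedy acceptance (assumption): keep the candidate only if it improves the reward.
            accept = reward(candidate) > reward(state)
        if accept:
            state = candidate
    return state


rng = random.Random(0)
start = sample_forward((), rng)
refined = local_search(start, n_loops=50, back_ratio=0.5, use_mh=False, rng=rng)
```

With `back_ratio=0.5` the first half of the string is never changed, which shows how the ratio controls search depth: small values make local edits near the end of the trajectory, while values near 1 approach sampling from scratch.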
Replay Buffers as Exploration
Replay buffers provide implicit exploration by reusing diverse past experience. See the Off-Policy Training guide for details on ReplayBuffer, TerminatingStateBuffer, and expert data warm-starting.
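For intuition only, the sketch below shows a bounded first-in, first-out buffer that stores past trajectories and returns a random subset to mix into a training batch. It is a hypothetical stand-in, not torchgfn's ReplayBuffer, and the string placeholders stand in for real trajectory objects.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Bounded FIFO store of past trajectories (illustrative only)."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first

    def add(self, trajectories):
        self.items.extend(trajectories)

    def sample(self, n, rng):
        """Return up to n stored trajectories, chosen without replacement."""
        pool = list(self.items)
        return rng.sample(pool, min(n, len(pool)))

    def __len__(self):
        return len(self.items)


buffer = SimpleReplayBuffer(capacity=4)
buffer.add(["t0", "t1", "t2"])
buffer.add(["t3", "t4", "t5"])
replayed = buffer.sample(2, random.Random(0))
```

Mixing `replayed` with freshly sampled trajectories exposes the loss to states the current policy may no longer visit, which is the sense in which a buffer acts as exploration.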
Comparing Strategies
train_hypergrid_exploration_examples.py provides a systematic comparison framework that runs 9 configurations (on-policy, replay buffer, epsilon variants, noisy layers, temperature, and combinations) across multiple seeds and plots mode discovery, L1 distance, and logZ error. This is the best starting point for understanding which strategies work for your problem.
Summary
| Strategy | Off-policy? | State-dependent? | Best for |
|---|---|---|---|
| Epsilon-greedy | Yes | No | Simple baseline, discrete environments |
| Temperature | Yes | No | Softening/sharpening action selection |
| Noisy layers | Yes | Yes | Learnable, adaptive exploration |
| Scale factor | Yes | No | Continuous action spaces |
| Local search | No (refines) | Yes | Improving existing trajectories |
| Replay buffer | Yes | N/A | Reusing diverse past experience |