gfn.utils.trust_pcl =================== .. py:module:: gfn.utils.trust_pcl .. autoapi-nested-parse:: Trust-PCL ↔ RTB parameter conversion utilities. Deleu et al. (2025, arXiv:2509.01632) proved that Relative Trajectory Balance (RTB) is mathematically equivalent to Trust-PCL, an off-policy reinforcement learning method with KL regularization toward a reference policy (Nachum et al., NeurIPS 2017). **The core identity** (Proposition 3.1): .. math:: \mathcal{L}_{\text{Trust-PCL}}(\phi, \psi) = \alpha^2 \,\mathcal{L}_{\text{RTB}}(\phi, \psi) where :math:`\alpha = 1/\beta` is the Trust-PCL temperature. **What this means:** Training a GFlowNet with RTB is *exactly* the same optimization problem as training a policy with Trust-PCL. The same parameters are updated, the same gradients flow, and the same fixed point is reached. Only the loss scale differs by the constant :math:`\alpha^2`. **Parameter correspondence:** +-----------------------+------------------------------------+-------------------------------------+ | Concept | RTB (GFlowNet) | Trust-PCL (RL) | +=======================+====================================+=====================================+ | Temperature | :math:`\beta` | :math:`\alpha = 1/\beta` | +-----------------------+------------------------------------+-------------------------------------+ | Learned scalar | :math:`\log Z_\psi` | :math:`V^{\text{soft}}_\psi(s_0) | | | | = \alpha \cdot \log Z_\psi` | +-----------------------+------------------------------------+-------------------------------------+ | Trainable model | Posterior :math:`p_\phi` | Policy :math:`\pi_\phi` | +-----------------------+------------------------------------+-------------------------------------+ | Fixed reference | Prior :math:`p_\theta` | Reference :math:`\pi_{\text{ref}}` | +-----------------------+------------------------------------+-------------------------------------+ | Target distribution | :math:`p(x) \propto p_\theta(x) | :math:`\pi^*(a|s) \propto | | | \cdot r(x)^\beta` | \pi_{\text{ref}}(a|s) | | | | \exp(Q^{\text{soft}}/\alpha)` | +-----------------------+------------------------------------+-------------------------------------+ **Derivation sketch:** The RTB balance condition for a trajectory :math:`\tau` is: .. math:: \log Z_\psi + \log p_\phi(\tau) = \beta \log r(x_T) + \log p_\theta(\tau) Multiplying both sides by :math:`\alpha = 1/\beta`: .. math:: \alpha \log Z_\psi + \alpha \log p_\phi(\tau) = \log r(x_T) + \alpha \log p_\theta(\tau) Rearranging with :math:`V^{\text{soft}}_\psi(s_0) = \alpha \log Z_\psi`: .. math:: -V^{\text{soft}}_\psi(s_0) + \sum_t r_t + \alpha \sum_t \log \frac{\pi_{\text{ref}}(a_t|s_t)}{\pi_\phi(a_t|s_t)} = 0 This is exactly the Trust-PCL consistency condition (Nachum et al. 2017, Equation 3). The KL regularization term :math:`\alpha \sum_t \log(\pi_{\text{ref}} / \pi_\phi)` emerges naturally from the ratio of prior to posterior trajectory log-probabilities in the original RTB equation — no separate KL penalty is added; it is an intrinsic consequence of the balance condition. .. rubric:: References Deleu et al. "Relative Trajectory Balance is equivalent to Trust-PCL" (2025, arXiv:2509.01632). Nachum et al. "Trust-PCL: An Off-Policy Trust Region Method for Continuous Control" (NeurIPS 2017, arXiv:1707.01891). Venkatraman et al. "Amortizing intractable inference in diffusion models for vision, language, and control" (NeurIPS 2024, arXiv:2405.20971). Functions --------- .. autoapisummary:: gfn.utils.trust_pcl.rtb_to_trust_pcl_params gfn.utils.trust_pcl.trust_pcl_to_rtb_params Module Contents --------------- .. py:function:: rtb_to_trust_pcl_params(logZ, beta) Convert RTB parameters to Trust-PCL parameters. :param logZ: RTB log-partition function :math:`\log Z_\psi`. :param beta: RTB reward scaling :math:`\beta`. :returns: - ``alpha = 1 / beta`` — Trust-PCL temperature - ``v_soft_s0 = alpha * logZ`` — soft value function at :math:`s_0` :rtype: Dictionary with keys ``"alpha"`` and ``"v_soft_s0"`` Example:: >>> rtb_to_trust_pcl_params(logZ=2.0, beta=0.5) {'alpha': 2.0, 'v_soft_s0': 4.0} .. py:function:: trust_pcl_to_rtb_params(alpha, v_soft_s0) Convert Trust-PCL parameters to RTB parameters. :param alpha: Trust-PCL temperature :math:`\alpha`. :param v_soft_s0: Soft value function at :math:`s_0`, i.e. :math:`V^{\text{soft}}_\psi(s_0)`. :returns: - ``beta = 1 / alpha`` — RTB reward scaling - ``logZ = v_soft_s0 / alpha`` — RTB log-partition function :rtype: Dictionary with keys ``"beta"`` and ``"logZ"`` Example:: >>> trust_pcl_to_rtb_params(alpha=2.0, v_soft_s0=4.0) {'beta': 0.5, 'logZ': 2.0}