The limits of passive filtering

In the previous posts on Probabilistic Language Programming (PLP) and its connection to energy-based models, we established that inference-time search is fundamentally about estimating and sampling from a verifier-reweighted target distribution:

\begin{equation} p_{\mathcal{D}}(\tau \mid x) \propto \pi_{\mathcal{D}}(\tau \mid x)\Phi(\tau) \end{equation}

where $\pi_{\mathcal{D}}$ is the base autoregressive proposal and $\Phi$ is the verifier potential. We explored how the continuation value $V(s) = \log Z(s)$ acts as local log-evidence, a free-energy functional guiding the generation process.

However, all filtering-based search methods—from simple rejection sampling to tree search and Sequential Monte Carlo (SMC)—suffer from a fundamental limitation: they are passive. They can only reallocate probability mass among the trajectories actually generated by the proposal $\pi_{\mathcal{D}}$. If the target distribution requires sampling a trajectory that the base model assigns effectively zero probability, filtering fails entirely. We diagnosed this previously as the Bellman support gap.
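To make the failure concrete, here is a minimal sketch with a made-up four-trajectory toy model (none of these numbers come from a real LLM): the verifier accepts only the one trajectory the proposal never emits, so rejection filtering returns nothing no matter how many samples we draw.

```python
import random

random.seed(0)

# Toy illustration (made-up numbers): the base model puts zero mass on the
# one trajectory the verifier accepts.
proposal = {"tau_A": 0.5, "tau_B": 0.3, "tau_C": 0.2, "tau_D": 0.0}
phi = {"tau_A": 0, "tau_B": 0, "tau_C": 0, "tau_D": 1}  # verifier potential

taus = list(proposal)
draws = random.choices(taus, weights=[proposal[t] for t in taus], k=10_000)
accepted = [t for t in draws if phi[t] > 0]  # rejection filtering

# Filtering can only reweight what the proposal generates: tau_D never
# appears, so no sample budget fixes this.
print(len(accepted))  # 0
```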

Recent work (Mukherjee et al., 2025) beautifully formalizes this exact problem as a transport problem governed by three interacting constraints. They constrain the optimal policy to a set of policies that are sufficiently covered by the reference LLM, bounded by a $\chi^2$-divergence constraint $\beta$. Within this framework, the effectiveness of any sampling algorithm $\mathfrak{A}$ is governed by:

  1. The generator’s coverage ($s_{\text{ver}}$): The probability mass the base model natively places on the verifier’s correct set.
  2. The verifier’s Receiver Operating Characteristic (ROC): Quantified by Youden’s index $J = \text{TPR} - \text{FPR}$, measuring the quality of the imperfect verifier’s reward signal.
  3. The sampling algorithm’s sub-optimality: Defined as the difference in average reward between the theoretically optimal constrained policy $\nu^\star$ and the empirical sampling distribution $\nu_{\mathfrak{A}}$, namely $\mathrm{SubOpt}(\mathfrak{A}) = \mathbb{E}_{\nu^\star}[r^\star] - \mathbb{E}_{\nu_{\mathfrak{A}}}[r^\star]$.

When the generator lacks coverage for high-reward trajectories ($s_{\text{ver}} \to 0$), we enter a “transport regime” where the sub-optimality strictly increases, bounded below by the Optimal Transport Cost (OTC) dictated by the Hamming distance between the reference and target distributions. In PLP terms: when the support gap is large, taking more samples just gives you more shots at the same blind spot. To genuinely overcome this, we need a mechanism not just to filter mass, but to move it. We need continuous optimization over the generative process itself.
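A back-of-the-envelope calculation shows why. If a perfect verifier checks $n$ i.i.d. samples, the success probability is $1 - (1 - s_{\text{ver}})^n$; this is a toy simplification of the coverage story, not a bound from the cited paper.

```python
# Probability that best-of-n sampling (with a perfect verifier) finds at
# least one correct trajectory, as a function of generator coverage s_ver.
# A toy calculation for intuition only.
def best_of_n_success(s_ver: float, n: int) -> float:
    return 1.0 - (1.0 - s_ver) ** n

for s_ver in (0.1, 1e-3, 1e-6):
    print(s_ver, [round(best_of_n_success(s_ver, n), 6) for n in (1, 16, 256)])

# As s_ver -> 0, even n = 256 barely moves the needle: more samples are
# just more shots at the same blind spot.
```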

The Schrödinger Bridge formulation

To formalize this “moving of mass,” we can look to the Schrödinger Bridge (SB) problem. Originally posed by Erwin Schrödinger in 1931, it asks: given a stochastic system whose prior dynamics carry an initial distribution $p_0$ to a terminal distribution $p_T$, what is the most likely evolution if we instead observe the system ending up at a different terminal distribution $p^\star_T$?

For inference-time steering, we can map this directly onto the language modeling process (Ksenofontov et al., 2025):

  • The prior process: The standard autoregressive decoding of the base LLM, $\pi_{\mathcal{D}}(\tau \mid x)$.
  • The target marginal: The verifier-reweighted distribution we actually want to sample from, $p_{\mathcal{D}}(\tau \mid x)$.

The Schrödinger Bridge seeks the optimal measure $\pi$ over sequences that minimizes the Kullback-Leibler divergence to the prior process, subject to matching the target terminal marginal:

\begin{equation} \min_{\pi} \mathrm{KL}(\pi \parallel \pi_{\mathcal{D}}) \quad \text{s.t.} \quad \pi(\text{final}) = p_{\mathcal{D}} \end{equation}

Solving this entropy-regularized optimal transport problem gives us the exact dynamic we need: the path of least resistance to steer the LLM’s autoregressive cascade towards high-reward outcomes.
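Because the prior process is fixed and only the terminal marginal is constrained, this bridge has a closed form: condition the prior on its endpoint and reweight endpoints to the target marginal. In our setting the completed sequence $\tau$ is the terminal state, so the closed form is a one-line check:

\begin{equation} \pi^\star(\tau \mid x) = \pi_{\mathcal{D}}(\tau \mid x)\,\frac{p_{\mathcal{D}}(\tau \mid x)}{\pi_{\mathcal{D}}(\tau \mid x)} = p_{\mathcal{D}}(\tau \mid x) \propto \pi_{\mathcal{D}}(\tau \mid x)\,\Phi(\tau) \end{equation}

which recovers exactly the verifier-reweighted target of Equation (1). The nontrivial part is realizing this reweighting step by step during decoding, which is where Doob’s $h$-transform enters.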

Doob’s $h$-transform and continuation values

The mathematical beauty of the Schrödinger Bridge is how it is solved. The optimal steered process can be derived using Doob’s $h$-transform. For a Markov process, Doob’s $h$-transform modifies the forward transition probabilities by multiplying them by a harmonic “backward message” $h(s)$ and normalizing:

\begin{equation} \pi(y \mid s) = \pi_{\mathcal{D}}(y \mid s) \frac{h(s \oplus y)}{h(s)} \end{equation}

In our context, what is this backward message $h(s)$? It is exactly the partition function of the remaining trajectory, which we identified in PLP as the exponential of the continuation value!

\begin{equation} h(s) = Z(s) = \exp(V(s)) \end{equation}

This reveals a profound equivalence: Doob’s $h$-transform is the theoretically optimal way to twist an autoregressive proposal to hit a verifier-weighted target, and the twisting function is precisely the soft-value function $V(s)$.
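We can verify this identity numerically. The sketch below builds a length-2 toy chain over a 2-token vocabulary (all probabilities made up for illustration), computes $h(s) = Z(s)$ by backward recursion, and checks that the twisted chain reproduces the verifier-reweighted target exactly.

```python
from itertools import product

# Toy check of the h-transform identity on length-2 sequences over a
# 2-token vocabulary (all numbers are illustrative).
V = ["a", "b"]
base = {            # pi_D(y | s): base transitions from each prefix
    "":  {"a": 0.9, "b": 0.1},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.2, "b": 0.8},
}
phi = {"aa": 0.1, "ab": 1.0, "ba": 1.0, "bb": 0.1}  # verifier potential

# Backward recursion for the message h(s) = Z(s) = E[Phi | prefix s].
h = dict(phi)                                  # full sequences: h(tau) = Phi(tau)
for s in V:
    h[s] = sum(base[s][z] * h[s + z] for z in V)
h[""] = sum(base[""][y] * h[y] for y in V)

def twisted(s, y):
    """Doob-twisted transition: pi(y|s) = pi_D(y|s) * h(s+y) / h(s)."""
    return base[s][y] * h[s + y] / h[s]

# The twisted chain reproduces p(tau) = pi_D(tau) * Phi(tau) / Z exactly.
for y, z in product(V, V):
    p_twisted = twisted("", y) * twisted(y, z)
    p_target = base[""][y] * base[y][z] * phi[y + z] / h[""]
    assert abs(p_twisted - p_target) < 1e-12
print("twisted chain matches the target distribution")
```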

Recent work on Twisted Sequential Monte Carlo for language models (Zhao et al., 2024) relies precisely on this insight. By training or approximating an auxiliary value function $V(s)$ and using it to twist the proposal distribution during SMC, we transition from passively filtering dead-ends to actively steering the generation path.
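A minimal sketch of the twisting pattern (the weight update only, not the actual algorithm or code of Zhao et al.): particles extend under the base proposal, each extension multiplies the weight by $h(s \oplus y)/h(s)$, and resampling concentrates mass on promising prefixes before dead ends are fully decoded. The toy chain and its numbers are made up.

```python
import random

random.seed(0)

# Toy 2-token chain with made-up probabilities.
V = ["a", "b"]
base = {"":  {"a": 0.9, "b": 0.1},
        "a": {"a": 0.5, "b": 0.5},
        "b": {"a": 0.2, "b": 0.8}}
phi = {"aa": 0.1, "ab": 1.0, "ba": 1.0, "bb": 0.1}

# Exact twist h(s) = Z(s), computed by backward recursion.
h = dict(phi)
for s in V:
    h[s] = sum(base[s][z] * phi[s + z] for z in V)
h[""] = sum(base[""][y] * h[y] for y in V)

N = 5_000
particles = [("", 1.0)] * N
for _ in range(2):                       # two decoding steps
    extended = []
    for s, w in particles:
        y = "a" if random.random() < base[s]["a"] else "b"
        extended.append((s + y, w * h[s + y] / h[s]))  # twist weight
    # Resampling steers mass toward promising prefixes early.
    weights = [w for _, w in extended]
    mean_w = sum(weights) / N
    particles = random.choices([(s, mean_w) for s, _ in extended],
                               weights=weights, k=N)

freq = {tau: sum(1 for s, _ in particles if s == tau) / N for tau in phi}
print(freq)  # dominated by "ab", matching pi_D * Phi / Z
```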

Table 1. The conceptual mapping between continuous generative modeling via Schrödinger Bridges and inference-time steering in PLP.
| Continuous Generative Modeling | Inference-Time LLM Steering | PLP Equivalent |
| --- | --- | --- |
| Prior stochastic process | Autoregressive decoding $\pi_{\mathcal{D}}$ | Proposal |
| Target marginal $p_T^\star$ | Verifier-weighted $p_{\mathcal{D}}$ | Target |
| Backward message $h(s)$ | Twisting function / value function | Partition function $Z(s)$ |
| Free Energy / Reward | Critic score / verification result | Log-potential $\log \Phi$ |
| Wasserstein gradient step | Iterative self-correction / Reflexion | Sampling with factor / guarantee constraints |

Reflexion and self-correction as Wasserstein Gradient Flow

If Doob’s $h$-transform gives us the optimal path, how do empirical techniques like Reflexion or iterative self-correction attempt to approximate it?

We can view iterative text refinement as a discrete, text-space approximation of a Wasserstein Gradient Flow (WGF). In continuous probability, a WGF describes how a distribution evolves to minimize a functional (such as a free energy) by moving probability mass along the gradient of the landscape, with each move penalized by the Wasserstein distance (the cost of transport).

In a typical self-correction loop (like Reflexion (Shinn et al., 2023)):

  1. The model generates a draft sequence $\tau_k$.
  2. A verifier (critic) evaluates $\tau_k$ and generates a critique string $c_k$.
  3. The model generates a new sequence $\tau_{k+1}$ conditioned on the draft and the critique.

Mathematically, this is an attempt to perform a gradient descent step in the space of probability distributions:

\begin{equation} \rho_{k+1} = \mathop{\mathrm{argmin}}_{\rho} \left[ \mathcal{F}(\rho) + \frac{1}{2\eta} \mathcal{W}_2^2(\rho, \rho_k) \right] \end{equation}

Here, $\mathcal{F}(\rho)$ represents the Free Energy (how well the text satisfies the verifier), and $\mathcal{W}$ is the Wasserstein distance. The “critique” string $c_k$ acts as a low-bandwidth, text-based surrogate for the continuous gradient $\nabla \mathcal{F}(\rho_k)$, pointing the model towards the region of lower free energy. The conditional generation step $p(\tau_{k+1} \mid \tau_k, c_k)$ is the numerical integrator executing the step.
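For intuition, collapse each $\rho_k$ to a point mass $\delta_{x_k}$; then $\mathcal{W}_2^2(\rho, \rho_k) = (x - x_k)^2$ and the update becomes an ordinary proximal (JKO) step. A one-dimensional toy with a quadratic stand-in for the verifier’s free energy (all values illustrative):

```python
# Toy reduction of the WGF step: for Dirac measures rho_k = delta_{x_k},
# the update is the proximal step
#   x_{k+1} = argmin_x [ F(x) + (x - x_k)^2 / (2 * eta) ].
# With F(x) = (x - mu)^2 / 2 as a stand-in for the verifier's landscape,
# setting the derivative to zero gives a closed form.
mu, eta = 3.0, 0.5   # target mode and "step size" (illustrative values)

def jko_step(x_k: float) -> float:
    # d/dx [ (x - mu)^2 / 2 + (x - x_k)^2 / (2 * eta) ] = 0  =>
    return (eta * mu + x_k) / (1.0 + eta)

x = 0.0  # initial draft
for _ in range(20):
    x = jko_step(x)
print(round(x, 4))  # converges toward mu = 3.0
```

Each iteration plays the role of one draft-critique-revise cycle: the quadratic penalty keeps the new “draft” close to the old one while the free-energy term pulls it toward the verifier’s mode.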

Dealing with tokens is a blessing and a curse

As theoretically elegant as framing self-correction as text-space WGF is, it is far from clear how to define a Wasserstein distance over tokens.

First, text space is discrete and highly non-smooth. You cannot take an infinitesimal “step” in text space. Any Wasserstein-like metric between two strings must fall back on edit distance or semantic similarity, neither of which supports smooth vector fields.

Second, textual critiques are incredibly low-bandwidth gradients. A natural language critique (e.g., “The second step of your math proof forgot to carry the 2”) must compress a complex, high-dimensional energy landscape into a narrow bottleneck of discrete tokens. This leads to inefficient optimization, often resulting in mode collapse where the model mindlessly agrees with the critic without actually improving the objective.

Third, hallucinations are failure modes of the integrator. When an agent confidently commits to a bad trajectory despite feedback, it is analogous to the step size $\eta$ being too large, causing the integrator to overshoot the local minimum of the free energy landscape and land in a high-confidence, low-accuracy mode.

This is why purely text-based iterative refinement often hits a ceiling. It is trying to do continuous gradient descent using a hammer and chisel.

The path forward: latent space steering

The hidden gem here—the logical synthesis of PLP, Doob’s $h$-transform, and Schrödinger Bridges—is that the optimization shouldn’t happen in discrete token space at all. It should happen in the continuous latent space of the model.

If we want to reliably fix the Bellman support gap without relying on passive filtering or clunky text-space iterations, we need to shift the Schrödinger Bridge flow into the continuous embedding space (e.g., the KV-cache of the Chain-of-Thought states).

Recent research is already converging on this realization. The community is beginning to explore continuous steering of LLM hidden states, avoiding discrete token bottlenecks. For example, some approaches use low-dimensional manifold gradients to steer reasoning trajectories towards higher-quality regions in latent space (Sun et al., 2026). Other methods focus on lightweight interventions applied directly to the key-value cache, constructing steering vectors from reasoning traces to induce chain-of-thought without fine-tuning (Huang et al., 2025). These methods form part of a broader shift toward formalizing latent space steering (Xu et al., 2025).

Instead of generating tokens, criticizing them, and generating new tokens, an advanced inference-time algorithm leveraging these continuous principles would:

  1. Roll out a latent trajectory.
  2. Evaluate the continuous gradient of the verifier’s free energy with respect to these continuous latents.
  3. Apply a Wasserstein gradient step (e.g., via continuous flow matching or diffusion) directly to the continuous internal states to repair the support gap.
  4. Only decode into discrete text after the latent probability mass has been successfully transported to the target distribution.
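As a cartoon of these four steps (every object here is a made-up stand-in, not a real LLM interface): treat the latent trajectory as a single vector, the verifier’s free energy as a quadratic bowl, and decoding as nearest-neighbor lookup against a tiny token-embedding table.

```python
import numpy as np

# Toy stand-ins: a 2-d latent, a quadratic free energy, and a 2-token
# embedding table. Illustrative only.
rng = np.random.default_rng(0)

emb = {"yes": np.array([1.0, 0.0]), "no": np.array([-1.0, 0.0])}
target = emb["yes"]                      # region the verifier rewards

def free_energy_grad(z):
    return z - target                    # grad of F(z) = ||z - target||^2 / 2

z = rng.normal(size=2)                   # 1) roll out a latent state
for _ in range(50):                      # 2)+3) gradient steps in latent space
    z = z - 0.2 * free_energy_grad(z)

# 4) decode only after the latent mass has been transported
token = min(emb, key=lambda t: float(np.linalg.norm(emb[t] - z)))
print(token)  # "yes"
```

The optimization happens entirely in the continuous state; the discrete bottleneck (decoding) is deferred until the end, which is exactly the point of latent-space steering.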

By moving inference-time steering from text-space heuristics to latent-space physics, we stop guessing at the Schrödinger Bridge and start directly integrating it.

References

  • Mukherjee, A., Bullo, M., Basu, D., & Gündüz, D. (2025). Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality. arXiv:2510.18982.
  • Ksenofontov, V., Gushchin, A., Burnaev, E., & Korotin, A. (2025). Categorical Schrödinger Bridge Matching. Proceedings of Machine Learning Research, 267. https://proceedings.mlr.press/v267/ksenofontov25a.html
  • Zhao, S., Brekelmans, R., Makhzani, A., & Grosse, R. (2024). Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo. arXiv:2404.17546. https://doi.org/10.48550/arXiv.2404.17546
  • Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366. https://doi.org/10.48550/arXiv.2303.11366
  • Sun, H., et al. (2026). GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients. arXiv:2601.10229.
  • Huang, Y., et al. (2025). KV Cache Steering for Controlling Frozen LLMs. The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=ryC6pXobCw
  • Xu, Y., et al. (2025). A Unified Understanding and Evaluation of Steering Methods. arXiv:2502.02716.
