Approximate sampling and inference in LLMs

In the previous note on Probabilistic Language Programming and energy-based models, I argued that there is a deep analogy between LLM scaffolds and verifier-reweighted distributions. The sharper statement is simple and useful:

PLP is inference-time approximation of the free energy.

The central object of PLP is the fundamental equation, which we reproduce here for a programmatic scaffold that builds a trace $\tau$:

\begin{equation} p_{\mathcal{D}}(\tau \mid x) \propto \pi_{\mathcal{D}}(\tau \mid x)\,\Phi(\tau,x). \end{equation}

A scaffold is a way to first sample traces from a proposal distribution $\pi_{\mathcal{D}}(\cdot \mid x)$ induced by the deployed model, and then reshape that mass with potentials supplied by verifiers, judges, or heuristics.

To make the rest precise, let us start from the basic PLP objects. Fix a deployment setup $\mathcal{D}$ (model, decoding hyperparameters etc.), an input prompt $x$, and a complete execution trace $\tau$. In PLP, the forward execution of the workflow induces a proposal distribution $\pi_{\mathcal{D}}(\tau \mid x)$ and the verifier, judge, or preference specification induces a nonnegative potential $\Phi(\tau,x)$. When the prompt $x$ is fixed, I will often write $\Phi(\tau)$ instead of $\Phi(\tau,x)$ to shorten formulas.

The semantic target is hence modeled as the product of two competing forces: the proposal force $\pi_{\mathcal{D}}(\tau \mid x)$ pushing the exploration of different trajectories in the semantic space, and the verifier force $\Phi(\tau,x)$ keeping the proposal on track with a warp signal:

\begin{equation} p_{\mathcal{D}}(\tau \mid x) = \frac{\pi_{\mathcal{D}}(\tau \mid x)\Phi(\tau,x)}{Z_{\mathcal{D}}(x)}, \qquad Z_{\mathcal{D}}(x)=\sum_{\tau} \pi_{\mathcal{D}}(\tau \mid x)\Phi(\tau,x). \label{eq:fundamental}\tag{1} \end{equation}

The normalization factor $Z_{\mathcal{D}}(x)$ is the partition function, namely the total verifier-weighted mass over all traces at fixed input and deployment state.
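To make equation \eqref{eq:fundamental} tangible, here is a minimal sketch with a made-up four-trace proposal and a made-up verifier potential (all names and numbers are illustrative, not part of PLP): self-normalized importance sampling recovers both the partition function and the reweighted target from proposal samples alone.

```python
import random

random.seed(0)

# Hypothetical proposal pi_D(. | x) over four complete traces (sums to 1).
proposal = {"t1": 0.4, "t2": 0.3, "t3": 0.2, "t4": 0.1}
# Hypothetical verifier potential Phi(tau, x): a nonnegative weight per trace.
phi = {"t1": 0.1, "t2": 1.0, "t3": 2.0, "t4": 0.0}

# Exact partition function Z(x) = sum_tau pi(tau | x) Phi(tau, x).
Z = sum(proposal[t] * phi[t] for t in proposal)

# Exact semantic target p(tau | x) = pi(tau | x) Phi(tau, x) / Z.
target = {t: proposal[t] * phi[t] / Z for t in proposal}

# Monte Carlo route: sample traces from the proposal, weight each by Phi.
# The mean weight estimates Z; the normalized weights estimate the target.
traces = random.choices(list(proposal), weights=proposal.values(), k=50_000)
weights = [phi[t] for t in traces]
Z_hat = sum(weights) / len(weights)

counts = {t: 0.0 for t in proposal}
for t, w in zip(traces, weights):
    counts[t] += w
total = sum(counts.values())
p_hat = {t: counts[t] / total for t in counts}
```

Note how the zero-potential trace `t4` receives proposal mass but no target mass: the potential vetoes it entirely, exactly as the fundamental equation dictates.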

If we stick to the soft reinforcement learning literature, we can define the reference distribution (the proposal) and the global reward $R$ as:

\begin{equation} p_{\mathrm{ref}}(\tau \mid x) := \pi_{\mathcal{D}}(\tau \mid x), \qquad R(\tau,x) := \log \Phi(\tau,x), \end{equation}

with the usual convention $R(\tau,x)=-\infty$ when $\Phi(\tau,x)=0$.

The same semantic target then becomes a softargmax distribution over reward-weighted proposals, a form that is well established in the soft reinforcement learning literature (Levine, 2018; Blondel et al., 2025):

\begin{equation} p_{\mathcal{D}} (\tau \mid x) = \frac{p_{\mathrm{ref}}(\tau \mid x)\exp(R(\tau,x))} {\sum_{\tau'} p_{\mathrm{ref}}(\tau' \mid x)\exp(R(\tau',x))}. \end{equation}

This is exactly the reference-measure energy-based form that appears in KL-regularized maximum-entropy reinforcement learning and in the recent ARM/EBM equivalence of Blondel et al. (Blondel et al., 2025) when, instead of just the answer $\mathbf{y}$, we work with the full trace $\tau$, which naturally accommodates the possible scaffold architectures.

So the PLP proposal is the reference model used to explore the sequence landscape (as in energy-based models), and the PLP potential is the exponentiated additive energy correction: the feedback mechanism that drives exploration toward the semantic target.

Hence, the central intractable object that any prompt engineer is unknowingly working with is the partition function:

\[Z_{\mathcal{D}}(x)=\sum_{\tau} \pi_{\mathcal{D}}(\tau \mid x)\Phi(\tau,x) \tag{2},\]

or, more precisely, its logarithm: the free energy!

scaffold_integration
Figure 1: Integration over traces. Approximating the continuation partition function

There is a difference in the equation above between integrating over the trace space and integrating over the string space. The trace space is a richer and vaster environment that one could, at least in theory, optimize over. Over the last few years, people have been involuntarily integrating over the trace space in search of better and better approximations of the (finite but very large) sum $\sum_\tau (\cdot)$ above!

Depending on convention, this object can be read as a log-evidence, a soft value, or as minus a free-energy objective. I will keep the $\log Z_{\mathcal{D}}(x)$ sign convention throughout, because it is the natural one for search and reweighting. Remember: any prompting technique or programmatic scaffold is just a way to approximate it.

This is the core conceptual jump between post-training alignment and inference-time compute. When we deploy a scaffold, we are not invoking a mysterious faculty of “reasoning” or treating the model as an anthropomorphic oracle. We are approximating, at inference time, a log-partition over future traces that the raw autoregressive policy cannot cheaply marginalize in one forward pass.

inference_time_reweighting
Figure 1: Inference-Time Trace Reweighting. (Left) The raw autoregressive policy induces a broad, unconstrained proposal distribution over reasoning traces. I write this baseline law as $q(\tau \mid x)$ to distinguish it from the deployed scaffold proposal $\pi_{\mathcal{D}}(\tau \mid x)$. (Center) A verifier, judge, or heuristic defines a non-negative potential field $\Phi(\tau, x)$ (orange), smoothly warping the energy landscape of available paths and highlighting promising basins. (Right) The normalized semantic target distribution $p(\tau \mid x)$. By applying the correct inference-time scaffold, probability mass (represented by line thickness and color intensity) is shifted away from dead ends and concentrated onto a smaller subset of high-value continuations.


The proposal is the reference measure

This exact rewriting clarifies the relation between post-training and inference-time engineering.

At training time, one tries to distill the verifier-reweighted target directly into the model weights. At inference time, PLP keeps the proposal fixed and approximates the same reweighted target procedurally by sampling, branching, judging, filtering, and resampling.
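The runtime route can be sketched as one round of sampling-importance-resampling. The `propose` and `judge` functions below are hypothetical stand-ins for the deployed model and the verifier; the plan names and scores are made up.

```python
import random

random.seed(1)

def propose():
    """Stand-in for sampling one trace from pi_D(. | x)."""
    return random.choice(["plan-A", "plan-B", "plan-C"])

def judge(trace):
    """Stand-in for a nonnegative potential Phi(trace, x)."""
    return {"plan-A": 0.1, "plan-B": 1.0, "plan-C": 3.0}[trace]

def sir_step(n):
    """Sample, judge, filter, resample: one procedural approximation of p_D."""
    candidates = [propose() for _ in range(n)]
    weights = [judge(t) for t in candidates]
    # Resample proportionally to the judge weights.
    return random.choices(candidates, weights=weights, k=n)

population = sir_step(10_000)
frac_C = population.count("plan-C") / len(population)
```

Under the uniform proposal, the target mass on `plan-C` is $3/(0.1+1+3) \approx 0.73$, and the resampled population concentrates near that fraction without ever touching the model's weights.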

The two views are therefore not competing stories. They are two computational routes toward the same normalized distribution.

One route is:

\[\text{change the weights so that } q_{\theta} \approx p_{\mathcal{D}}.\]

Here $q_{\theta}$ denotes a parametric model distribution over traces (or over answers) with parameters $\theta$.

The other is:

\[\text{keep } \pi_{\mathcal{D}} \text{ as the proposal and approximate } p_{\mathcal{D}} \text{ at runtime}.\]

This makes the scope of PLP much clearer. It is the semantics-first theory of what to do when the target is known only through a potential and the proposal is accessible mainly through a sampling oracle. In other words, PLP is the runtime side of the same mathematics that post-training methods try to absorb into parameters.

There is also an important caveat here. If the potential $\Phi$ is produced by an imperfect LLM judge, then the resulting free-energy landscape is judge-relative, not necessarily truth-relative (Lee et al., 2025): the geometry is still real, but it is the geometry of the deployed verifier. Simply speaking, an imperfect judge is a process that warps the energy landscape itself.

The closer bridge is path-integral control

At this point it is tempting to import Friston’s free energy principle wholesale (Friston, 2010). I think there is a closer and cleaner bridge for the present argument: Kappen’s path-integral view of stochastic optimal control (Kappen, 2005).

For a class of noisy control problems with quadratic control cost, Kappen showed that the nonlinear Hamilton-Jacobi-Bellman equation can be linearized through a log transform of the cost-to-go. Let $\xi$ denote a physical state and $t$ denote time in his continuous-time setup (this $\xi$ is not the prompt $x$ elsewhere in the note). Let $\lambda>0$ denote the temperature parameter that appears in Kappen’s log transform (it ties control cost to noise strength in his construction (Kappen, 2005)). If $\Psi(\xi,t)$ denotes the forward diffusion partition function in that setting, then the optimal cost-to-go reads

\[J(\xi,t) = -\lambda \log \Psi(\xi,t),\]

so the control problem becomes a log-partition over future trajectories.

PLP has the same algebraic structure, but the state is a trace prefix. Let $s$ denote such a prefix (a string state in the MDP picture). Define the continuation partition

\[Z(s)=\sum_{\tau \succ s} \pi_{\mathcal{D}}(\tau \mid s)\,\Phi(\tau), \qquad V(s)=\log Z(s),\]

where $\tau \succ s$ means that the complete trace $\tau$ extends $s$, and $\pi_{\mathcal{D}}(\tau \mid s)$ is the conditional proposal law for those completions.
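A rollout sketch of the continuation partition, assuming a toy two-step branching process in place of the model: from prefix $s$ we pick one of two branches, then one of two leaves, and $\Phi$ scores the leaf. All probabilities and potentials are made up for illustration.

```python
import math
import random

random.seed(2)

branch_probs = {"b1": 0.7, "b2": 0.3}
leaf_probs = {"b1": {"l1": 0.5, "l2": 0.5}, "b2": {"l3": 0.9, "l4": 0.1}}
phi = {"l1": 0.0, "l2": 0.2, "l3": 1.0, "l4": 5.0}

def rollout():
    """Sample one complete continuation tau ~ pi_D(. | s), return Phi(tau)."""
    b = random.choices(list(branch_probs), weights=branch_probs.values())[0]
    l = random.choices(list(leaf_probs[b]), weights=leaf_probs[b].values())[0]
    return phi[l]

# Exact continuation partition by enumeration (feasible only in toy settings).
Z_exact = sum(
    branch_probs[b] * leaf_probs[b][l] * phi[l]
    for b in branch_probs for l in leaf_probs[b]
)
V_exact = math.log(Z_exact)

# Monte Carlo estimate: V(s) is approximately log of the mean Phi over rollouts.
n = 100_000
V_hat = math.log(sum(rollout() for _ in range(n)) / n)
```

The estimator converges on the exact $V(s)$, but notice how much of the verifier-weighted mass sits on the rare branch `b2`: a greedy walk down the likelier branch `b1` would badly underestimate the prefix.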

The PLP potential need not be a literal Boltzmann factor. Earlier we wrote $R(\tau,x)=\log \Phi(\tau,x)$, so $\Phi(\tau)=\exp(R(\tau))$ always holds. To align PLP with Kappen's path-cost form, suppose in addition that we can write

\[R(\tau)=-\frac{S(\tau)}{\lambda}\]

for some nonnegative path cost functional $S(\tau)$ and a temperature $\lambda>0$. Equivalently,

\[\Phi(\tau)=\exp\!\left(-\frac{S(\tau)}{\lambda}\right).\]

With that identification, the PLP continuation value $V(s)=\log Z(s)$ matches Kappen’s log-partition structure up to sign and scale. Writing $J_{\mathrm{PLP}}(s)$ for the corresponding cost-to-go under the same sign convention as $J(\xi,t)$,

\[J_{\mathrm{PLP}}(s)=-\lambda\, V(s).\]

The parameter $\lambda$ here is the same kind of object as Kappen’s temperature: it sets the units that convert log-masses into costs.

This is the closest control-theoretic bridge in this post. Both formalisms start from a reference law over futures, reweight those futures by an exponential score, and summarize the remaining downstream options in a log-partition. The comparison to active inference remains useful, but it is a step further away from the runtime mechanics discussed here.

Table 1. The same formal structure viewed from four literatures. The Kappen row is the closest control-theoretic analogue for the present note. The Friston row remains an analogy, not an identity.
| view | base law | correction or evidence term | state scalar | operational meaning |
| --- | --- | --- | --- | --- |
| PLP | proposal $\pi_{\mathcal{D}}(\tau \mid s)$ | potential $\Phi(\tau)$ or energy correction $R=\log \Phi$ | $V(s)=\log \sum_{\tau \succ s} \pi_{\mathcal{D}}(\tau \mid s)\Phi(\tau)$ | future verifier-weighted continuation mass |
| path-integral control | uncontrolled diffusion or reference dynamics | path cost $S$ with weight $e^{-S/\lambda}$ | $J(\xi,t)=-\lambda \log \Psi(\xi,t)$ | stochastic cost-to-go under noise (Kappen, 2005) |
| ARM/EBM | reference measure or local policy | sequence reward / energy | soft value $V_q(s)$ under local policy $q$ | future summary that makes local logits look ahead (Blondel et al., 2025) |
| active inference | prior or generative model | likelihood / sensory evidence | $-F$ (variational free energy) or log evidence, depending on sign convention | quantity whose optimization reduces surprise (Friston, 2010; Friston et al., 2012) |

Read this way, a scaffold that branches, calls tools, or queries a verifier is not merely producing more text. It reallocates probability mass across future traces in much the same broad sense that a stochastic controller reallocates mass across future trajectories. That is the level at which the analogy is strongest.

What the factor primitive is really estimating

The new paper by Blondel et al. shows that local autoregressive logits must absorb a future-looking soft value term (Blondel et al., 2025). To avoid colliding with the PLP proposal notation $\pi_{\mathcal{D}}$, let me call that local quantity $Q(s,y)$ instead of $q(s,y)$. Here $s$ is again a trace prefix and $y$ is the next token (or decoded action) that extends $s$ by one step.

For a trace prefix or state $s$, the continuation partition is the $Z(s)$ introduced above:

\[Z(s) := \sum_{\tau \succ s} \pi_{\mathcal{D}}(\tau \mid s)\Phi(\tau),\]

where the sum runs over all complete future continuations extending the current prefix. The continuation value is

\[V(s) := \log Z(s).\]

This represents the exact prefix free energy at $s$. While this raw log-partition produces an extensive quantity that can suffer from length bias across traces of vastly different lengths, it establishes the correct geometric structure of the search problem. In agentic tool-use, this is typically replaced by an intensive operator like MellowMax to ensure scale-free comparisons, as we explore in a subsequent post. Moreover, because pre-training acts as a backward dynamic programming pass, the autoregressive model caches this future value directly in its immediate logits. Evaluating the one-step soft value at inference time therefore does not strictly require expensive Monte Carlo rollouts; the forward pass simply reads the internalized global energy.
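The extensive/intensive distinction can be checked numerically. Below, `logsumexp` aggregation grows with the number of continuations even when every continuation carries the same score, while a MellowMax-style operator (log of the *mean* of exponentials, with a temperature `omega`) is invariant to the count. The score lists are made up; only the two operators matter.

```python
import math

def logsumexp(xs):
    """Extensive aggregation: grows with len(xs)."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mellowmax(xs, omega=1.0):
    """Intensive aggregation: log of the mean of exp(omega * x), over omega."""
    m = max(xs)
    return (omega * m
            + math.log(sum(math.exp(omega * (x - m)) for x in xs) / len(xs))) / omega

short = [1.0] * 4    # few continuations, identical per-continuation score
long = [1.0] * 64    # many continuations, the same identical score

# logsumexp(short) = 1 + log 4 and logsumexp(long) = 1 + log 64: length-biased.
# mellowmax(short) = mellowmax(long) = 1.0: scale-free comparison.
```

As `omega` grows, MellowMax interpolates toward a hard max, which is the usual knob for trading soft aggregation against greedy selection.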

Now the meaning of the PLP factor primitive becomes much sharper. Whenever we score a partial chain of thought, a partial program, a partial proof sketch, or an intermediate plan, we are not really trying to estimate “local goodness” in isolation. We are trying to estimate how much future verifier-weighted mass remains reachable from that prefix.

That is why simple local fluency is often useless for hard tasks. A prefix can look elegant and still lead into a dead end. Conversely, a clumsy-looking prefix can be extremely valuable if it opens a broad basin of correct continuations.

In this sense, factor is best understood as a runtime surrogate for continuation free energy. Tree-of-Thoughts, self-consistency, Reflexion, self-backtracking, and many other scaffolds (Wang et al., 2022; Yao et al., 2023; Shinn et al., 2023; Yang et al., 2025) are all different numerical schemes for approximating the same intractable object:

\[V(s)=\log \sum_{\tau \succ s} \pi_{\mathcal{D}}(\tau \mid s)\Phi(\tau).\]
Local score plus continuation free energy
Figure 2. A local decision only becomes meaningful once it is augmented by the free energy of its downstream subtree. The practical role of runtime scaffolds is to estimate that future mass better than plain next-token decoding can.

This viewpoint turns a vague design question into a measurable one. Rather than describing intermediate heuristics as vague “critics”, we can ask a concrete question:

How good is this heuristic as an estimator of continuation free energy?

That criterion is concrete enough to compare heuristics, train better surrogates, and evaluate search procedures on common ground.

Why raw free energy should not be factored at every step

Index the unfolding trace by time steps $t=0,1,\ldots,T$. Let $s_t$ denote the prefix after $t$ steps, let $y_t$ denote the token (or action) taken at step $t$, and let $r(s_t,y_t)$ denote any additive reward used in a reinforcement-learning view of the same trajectory.

If we repeatedly add the raw continuation value $V(s_t)$ at many intermediate steps, we generally double count future mass. A long trace would then accumulate multiple copies of essentially the same downstream partition, and that would change the target in an uncontrolled way.

The right object is not raw future value, but a telescoping shaping term. Let $G(s)$ be a heuristic potential on prefixes (I use $G$ here to avoid clashing with Kappen’s partition notation $\Psi(\xi,t)$). Then the semantics-preserving way to inject it is

\[r'(s_t,y_t) = r(s_t,y_t) + G(s_{t+1}) - G(s_t),\]

so that along a full trace

\[\sum_t r'(s_t,y_t) = \sum_t r(s_t,y_t) + G(s_T)-G(s_0).\]

The extra terms telescope.
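A quick numeric check of the telescoping identity, with a made-up reward sequence and a made-up prefix potential $G$ (both are illustrative, not tied to any model):

```python
# r(s_t, y_t) for t = 0..3, hypothetical additive step rewards.
rewards = [0.0, 1.0, -0.5, 2.0]
# G(s_0), ..., G(s_4 = s_T): a hypothetical heuristic potential on prefixes.
G = [0.3, 1.1, 0.2, 0.9, 2.5]

# Shaped reward r'(s_t, y_t) = r(s_t, y_t) + G(s_{t+1}) - G(s_t).
shaped = [rewards[t] + G[t + 1] - G[t] for t in range(len(rewards))]

raw_return = sum(rewards)
shaped_return = sum(shaped)
boundary = G[-1] - G[0]  # the only surviving term: G(s_T) - G(s_0)
```

However wild the intermediate potential values are, the shaped return differs from the raw return only by the boundary term, so the semantic target is preserved up to that controllable constant.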

This matters because it tells us how to guide search without repeatedly rewarding the same future basin over and over again. In reinforcement learning this is the logic of potential-based shaping. In PLP it gives a principled answer to how one should place soft factors on intermediate states.

This suggests a clean semantic distinction in PLP. The factor primitive has two roles:

  1. Target-defining factors, which really modify the semantic target.
  2. Shaping factors, which are introduced only to improve inference and should telescope or otherwise preserve the intended target up to controllable boundary terms.

The distinction is mathematically clean and practically important. It separates modeling choices from inference aids and makes it easier to see when a scaffold has changed the task itself.

A Bellman-style support gap

In the PLP paper, the support gap is defined at the level of completed outputs. That is useful, but the ARM/EBM connection suggests a sharper diagnostic at the level of prefixes.

Many failures of greedy decoding happen before the final answer becomes unreachable. They begin when the system underestimates the future value of a promising prefix and therefore never enters the right basin.

This suggests the prefix-level discrepancy

\[\Delta_V(s) := V^\star(s) - \widehat{V}(s),\]

where $V^\star(s)$ is the ideal continuation free energy under the semantic target and $\widehat{V}(s)$ is the value implicitly assigned by the deployed heuristic, judge, or local policy.

This quantity measures a local value mismatch in the same variational geometry. It asks whether the deployed system assigns enough mass to the good continuations that remain reachable from $s$. In that sense, a large positive $\Delta_V(s)$ means that the prefix contains more downstream evidence than the system currently credits it with (Friston, 2010).

That is the Bellman support gap. The prefix lies above a rich continuation basin, but the deployed system fails to see it. A scaffold can then discard a promising branch too early, before the good mass has time to unfold.

This perspective suggests a more informative diagnostic than answer-level accuracy alone. Instead of asking only whether the correct answer appears in samples, we should also ask whether promising prefixes are persistently undervalued. That would tell us more directly when beam search, tree search, or verifier-guided branching are likely to buy useful compute.
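The prefix-level gap can be computed exactly in a toy setting. Here $V^\star$ is the true continuation free energy from enumeration, and $\widehat{V}$ is a hypothetical myopic heuristic that scores only local fluency (the prefix's own proposal probability); all names and numbers are illustrative.

```python
import math

# Per prefix: list of (conditional proposal mass, potential) continuations.
continuations = {
    "fluent-prefix": [(0.5, 0.01), (0.5, 0.0)],  # elegant, but a near dead-end
    "clumsy-prefix": [(0.5, 2.0), (0.5, 1.0)],   # opens a broad correct basin
}
prefix_prob = {"fluent-prefix": 0.9, "clumsy-prefix": 0.1}

def V_star(s):
    """Exact continuation free energy log sum pi(tau | s) Phi(tau)."""
    return math.log(sum(p * phi for p, phi in continuations[s]))

def V_hat(s):
    """Myopic heuristic: local fluency only, no lookahead."""
    return math.log(prefix_prob[s])

# Bellman support gap Delta_V(s) = V_star(s) - V_hat(s).
gap = {s: V_star(s) - V_hat(s) for s in continuations}
```

The clumsy prefix gets a large positive gap (rich basin, undervalued locally), while the fluent prefix gets a large negative one: exactly the failure mode where greedy decoding commits to elegance and never enters the right basin.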

Tempered targets, delayed choice, and symmetry breaking

Another immediate consequence is that PLP admits a natural temperature family. Given the same proposal $\pi_{\mathcal{D}}$ and potential $\Phi$, define

\[p_\beta(\tau \mid x) \propto \pi_{\mathcal{D}}(\tau \mid x)\Phi(\tau,x)^\beta, \qquad 0 \le \beta \le 1.\]

At $\beta=0$ we recover the raw proposal. At $\beta=1$ we recover the original semantic target from the fundamental equation \eqref{eq:fundamental}. Intermediate $\beta$ values define softened bridges between exploration and strict verification.

If we write the potential as $\Phi(\tau, x)=e^{R(\tau, x)}$, then

\[p_\beta(\tau \mid x) \propto \pi_{\mathcal{D}}(\tau \mid x)e^{\beta R(\tau, x)},\]

so $\beta$ plays the role of an inverse temperature. Small $\beta$ gives a broad, high-entropy law. As $\beta$ increases, mass concentrates on higher-reward trajectories.
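The weight-collapse effect of large $\beta$ is easy to see through the effective sample size of tempered weights $\Phi^\beta = e^{\beta R}$. The reward spread below is a made-up Gaussian; the Kish effective sample size is a standard diagnostic, not a PLP primitive.

```python
import math
import random

random.seed(3)

# Hypothetical rewards R for 1000 sampled trajectories.
rewards = [random.gauss(0.0, 3.0) for _ in range(1_000)]

def ess(beta):
    """Kish effective sample size of self-normalized weights exp(beta * R)."""
    m = max(rewards)
    w = [math.exp(beta * (r - m)) for r in rewards]  # shift by max for stability
    s = sum(w)
    return s * s / sum(x * x for x in w)

ess_soft = ess(0.1)  # high-entropy, low-beta regime: weights nearly uniform
ess_hard = ess(1.0)  # sharp verifier applied at once: a few particles dominate
```

At $\beta=0$ the effective sample size equals the nominal count; at $\beta=1$ it collapses to a handful of dominant trajectories, which is the quantitative face of applying an overly sharp verifier too early.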

Kappen’s 2005 analysis clarifies the control meaning of this continuation quantity (Kappen, 2005). In his path-integral treatment, the optimal stochastic policy can change qualitatively as the noise level or the time-to-go changes. The important example in the paper is a delayed-choice problem with two slits or targets. When the product of noise level and time-to-go is large, the optimal controller steers toward the middle and postpones the final commitment. Only later does the symmetry break and one route become preferable.

This lesson transfers directly to scaffold design. When early reasoning states still support several plausible continuation basins, hard commitment to one branch can be suboptimal even from a rational control viewpoint. A good scaffold should often keep several futures alive a little longer and let the symmetry break later, once extra evidence, tool outputs, or verifier signals separate the basins more clearly.

This gives a clean interpretation of annealed search, progressive filtering, verifier ramp-up, and soft-to-hard planning schedules. Population methods such as importance weighting, sequential Monte Carlo, or twisted SMC can then be read as procedures that move mass through a sequence of tempered targets (Tokdar & Kass, 2010; Loula et al., 2025; Zhao et al., 2024).

This also gives a principled language for iterative answer refinement. Progressive-Hint Prompting and later work on refined answer distributions can be read as sequential procedures that repeatedly sharpen an empirical answer law rather than trusting a single first-pass sample (Zheng et al., 2023; Pal et al., 2024).

Many tree-search and multi-sample workflows fail because they apply an overly sharp verifier too early. The result is weight collapse. A small number of trajectories dominate before the system has explored enough of the space.

The free-energy perspective suggests a principled fix: start from a high-entropy, low-$\beta$ regime and only gradually sharpen the potential. Temperature schedules then become a controlled way of delaying commitment in a noisy planning problem.

From free energy to a decomposition policy

The companion note Scaffolding is all you need made the reliability point. AND-style decompositions add essential failure points, whereas OR-style search buys alternatives. Kappen’s control picture sharpens that argument.

If a node keeps several alternative solution basins alive, the effective continuation cost has the schematic OR form

\[J_{\mathrm{OR}}(s)\approx -\lambda \log \sum_{b\in\mathcal{B}(s)} \exp\!\left(-\frac{J_b(s)}{\lambda}\right) + \Delta_{\mathrm{sel}}(s),\]

where $\mathcal{B}(s)$ indexes disjoint basins of future traces (for example, distinct high-level plans), $J_b(s)$ is the cost-to-go if the scaffold commits to basin $b$, and $\Delta_{\mathrm{sel}}$ summarizes selection and verification costs. The temperature $\lambda$ is the same scale as in $\Phi(\tau)=\exp(-S(\tau)/\lambda)$ whenever that representation is used. This is the option-value term. Several basins can coexist, and uncertainty can make delayed commitment rational.

If a node instead commits to mandatory subtasks that all must succeed, the effective continuation cost is closer to

\[J_{\mathrm{AND}}(s)\approx \sum_{i=1}^K J_i(s) + \Delta_{\mathrm{valid}}(s)+\Delta_{\mathrm{comp}}(s),\]

where $K$ is the number of subtasks, $J_i(s)$ is the cost-to-go carried by the $i$th child interface after decomposition, and $\Delta_{\mathrm{valid}}$ and $\Delta_{\mathrm{comp}}$ summarize decomposition validity and composition risk. This is the same essential-node tax that appeared in the reliability note, now written in cost language rather than failure-probability language.
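The two schematic cost forms can be written down directly, with the $\Delta$ overheads set to zero and made-up basin and subtask costs; the soft-min structure of the OR mode is what carries the option value.

```python
import math

LAM = 1.0  # temperature lambda, tying log-masses to costs

def j_or(basin_costs, lam=LAM):
    """OR mode: -lambda * log sum_b exp(-J_b / lambda), a soft minimum."""
    m = min(basin_costs)
    return m - lam * math.log(sum(math.exp(-(c - m) / lam) for c in basin_costs))

def j_and(subtask_costs):
    """AND mode: mandatory subtasks, costs simply add up."""
    return sum(subtask_costs)

basins = [3.0, 3.2, 5.0]
# Keeping several basins alive can only lower the effective cost below the
# best single commitment: j_or(basins) < min(basins). As lam -> 0 the soft
# minimum hardens into committing to the single best basin.
```

By contrast, every extra mandatory subtask in the AND mode strictly adds cost: the essential-node tax, now in cost language.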

The practical consequence is simple. A good scaffold should not jump directly from “one-shot answer” to “mandatory decomposition”. It should choose among four control modes:

  1. answer directly when direct answer reliability is already high;
  2. use OR-style search when several whole-solution basins can be compared by a verifier or selector;
  3. delay commitment when the landscape is still multimodal and extra information is cheap;
  4. use AND-style decomposition only after symmetry breaks, or when child interfaces are locally verifiable and composition is stable.

Decomposition is therefore not a default formatting choice. It is a control action whose value depends on uncertainty, horizon, verifier quality, and the geometry of the continuation landscape.

Replica thinking and the geometry of reasoning paths

Another useful consequence comes from the replica trick, a standard device in statistical physics for analyzing log partition functions.

For a token prefix $s$, the log partition function is again

\begin{equation} V(s)=\log Z(s). \end{equation}

It turns out that formally, one can rewrite it with the replica trick as

\begin{equation} V(s) = \lim_{n\to 0}\frac{Z(s)^n-1}{n}. \end{equation}

For integer $n$, the quantity $Z(s)^n$ is a sum over $n$ replicated future continuations. In PLP language, this looks almost natural: it is a plate(n) over future reasoning traces conditioned on the same prefix.

Of course, the formal limit $n\to 0$ is not yet a practical inference algorithm. But finite replicas are already illuminating.

Write the normalized continuation law downstream of $s$ as

\[p(\tau \mid s) = \frac{\pi_{\mathcal{D}}(\tau \mid s)\,\Phi(\tau)}{Z(s)}.\]

Suppose we draw two independent samples from $p(\tau \mid s)$. The probability that the two draws land on the same complete trace $\tau$ is

\begin{equation} C_2(s) := \sum_{\tau \succ s} p(\tau \mid s)^2, \end{equation}

and its inverse

\begin{equation} N_{\mathrm{basins}}(s):=\frac{1}{C_2(s)} \end{equation}

can be read as an effective number of continuation basins.

This quantity has an immediate interpretation. If $N_{\mathrm{basins}}(s)\approx 1$, then most samples collapse into the same hidden plan. Self-consistency then produces many surface variations of the same mistake because the continuation landscape remains concentrated in one basin. If $N_{\mathrm{basins}}(s)$ is large, several qualitatively distinct reasoning paths contribute downstream, and additional samples can provide genuinely new evidence.
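The two-replica diagnostic is cheap to estimate by sampling pairs. The two continuation laws below are hypothetical (a concentrated landscape versus a diverse one); the collision estimator itself is the generic one.

```python
import random

random.seed(4)

# Two hypothetical continuation laws p(tau | s) over three basins.
concentrated = {"basin-A": 0.97, "basin-B": 0.02, "basin-C": 0.01}
diverse = {"basin-A": 0.4, "basin-B": 0.35, "basin-C": 0.25}

def n_basins(law, n_pairs=50_000):
    """Estimate N_basins = 1 / C_2 from independent sample pairs."""
    outcomes = list(law)
    weights = list(law.values())
    hits = 0
    for _ in range(n_pairs):
        a, b = random.choices(outcomes, weights=weights, k=2)
        hits += a == b
    c2 = hits / n_pairs  # collision probability, estimates sum_tau p(tau)^2
    return 1.0 / c2

n_conc = n_basins(concentrated)  # close to 1: one hidden basin dominates
n_div = n_basins(diverse)        # close to 3: several basins contribute
```

No extra machinery is needed beyond the samples a scaffold already draws, which makes the diagnostic easy to bolt onto existing multi-sample workflows.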

Single versus multiple reasoning basins
Figure 3. A single verbalized path may give the illusion of diversity while remaining trapped in one hidden basin. Finite-replica thinking asks a deeper question: how many distinct high-mass continuation families are actually contributing downstream of a prefix?

This connects directly to the dependence analysis already present in PLP. In Probabilistic Language Programming, $\rho$ denotes a pairwise correlation between scaffold outputs and $K_{\mathrm{eff}}$ denotes an effective sample size that adjusts the nominal draw count for dependence. Replica overlap suggests a more geometric, prefix-level version of the same story.

One practical consequence follows: replicated reasoning traces can improve answers and can also measure the ruggedness of the continuation landscape itself.

If two or more replicas keep collapsing into the same basin, the main bottleneck is inferential diversity rather than sample count. That points to different interventions: new decompositions, different retrieved evidence, alternative latent strategies, or a different verifier placement.

Implications for prompting

Once seen through this lens, many prompting techniques become much easier to classify.

Chain-of-thought introduces a latent sequential state, allowing the model to externalize part of the free-energy computation into the token stream. That interpretation is now explicit in work that treats rationales as latent variables and optimizes answer likelihood by marginalizing over them (Phan et al., 2023), and in work that casts CoT adaptation more broadly as amortized inference over intractable posteriors (Hu et al., 2024). This also puts a clear bound on what one should expect from pure chain-of-thought at inference time: externalizing rationales can help the model access structure that was already latent in its training distribution, but it does not manufacture arbitrary new competence. The method is strongest when training has already shaped a useful high-reward landscape over latent traces, and correspondingly weaker as a route to robustly out-of-distribution reasoning.

Self-consistency samples multiple reasoning paths and marginalizes them at the answer level. Later variants such as Progressive-Hint Prompting and refined answer distributions can be read as more deliberate ways of refining that sampled answer distribution over multiple rounds (Zheng et al., 2023; Pal et al., 2024; Wang et al., 2022).
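A minimal sketch of answer-level marginalization, assuming a hypothetical path sampler in which individual chains are often wrong but the errors are spread across distinct wrong answers (the answer strings and probabilities are made up):

```python
import random
from collections import Counter

random.seed(5)

def sample_path_answer():
    """Stand-in for sampling one CoT path and extracting its final answer."""
    # The correct answer "42" appears on 40% of paths; errors split the rest.
    return random.choices(["42", "41", "43", "57"],
                          weights=[0.4, 0.25, 0.2, 0.15])[0]

def self_consistency(k):
    """Marginalize paths at the answer level and return the modal answer."""
    answers = Counter(sample_path_answer() for _ in range(k))
    return answers.most_common(1)[0][0]

voted = self_consistency(1_000)
```

The marginal over paths concentrates on the correct answer even though no single path is reliable, which is precisely the reweighting-by-aggregation reading of self-consistency; it fails in the replica-collapse regime where wrong paths share one basin.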

Tree-of-Thoughts allocates compute across prefixes, attaches heuristic or uncertainty estimates to partial states, and searches over the resulting frontier (Yao et al., 2023; Mo & Xin, 2023; Zhou et al., 2024). In older approximate-inference language, this is closer to Monte Carlo tree search with value guidance than to a new cognitive primitive (Buesing et al., 2020).

Reflexion and self-backtracking repeatedly revise the proposal so that more mass moves toward regions with better downstream verifier-weighted continuation mass (Shinn et al., 2023; Yang et al., 2025).

Finally, SMC-style steering methods make the probabilistic interpretation explicit. They treat LM control as sampling from an unnormalized target distribution and use learned twist or future-value estimates to guide particles toward high-mass regions (Zhao et al., 2024; Loula et al., 2025).

In short:

Prompting techniques are numerical methods for approximating the right partition function over traces under finite compute.

This perspective places chain-of-thought, search, refinement, and particle methods within the same mathematical frame.

Conclusions

Autoregressive models can successfully plan ahead only when their local decisions already contain a compressed summary of the future partition over continuations, the soft-continuation value as in (Blondel et al., 2025). When that summary is imperfect, inference-time scaffolds can spend runtime compute, search, tools, and verification to approximate the missing continuation values more accurately.

Kappen’s control perspective sharpens the design rule. Under uncertainty, the right scaffold often delays commitment, keeps several continuation basins alive, and turns search into hard decomposition only once the landscape becomes easier to separate.

The practical question is therefore not only which scaffold is most accurate, but which control policy over answer, search, delay, and decomposition best approximates verifier-weighted free energy under a finite compute budget.


References

  • Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv:1805.00909. https://arxiv.org/pdf/1805.00909
  • Blondel, M., Sander, M. E., Vivier-Ardisson, G., Liu, T., & Roulet, V. (2025). Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction. arXiv:2512.15605. https://arxiv.org/abs/2512.15605
  • Lee, C., Zeng, T., Jeong, J., Sohn, J.-yong, & Lee, K. (2025). How to Correctly Report LLM-as-a-Judge Evaluations. arXiv:2511.21140. https://arxiv.org/abs/2511.21140
  • Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138. https://doi.org/10.1038/nrn2787
  • Kappen, H. J. (2005). Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11), P11011.
  • Friston, K., Samothrakis, S., & Montague, R. (2012). Active inference and agency: optimal control without cost functions. Biological Cybernetics, 106(8-9), 523–541. https://doi.org/10.1007/s00422-012-0512-8
  • Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171.
  • Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of thoughts: Deliberate problem solving with large language models. arXiv:2305.10601. https://doi.org/10.48550/arxiv.2305.10601
  • Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366. https://doi.org/10.48550/arxiv.2303.11366
  • Yang, X.-W., Zhu, X.-Y., Wei, W.-D., Zhang, D.-C., Shao, J.-J., Zhou, Z., Guo, L.-Z., & Li, Y.-F. (2025). Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models. arXiv:2502.04404. https://doi.org/10.48550/arxiv.2502.04404
  • Zhao, S., Brekelmans, R., Makhzani, A., & Grosse, R. (2024). Probabilistic inference in language models via twisted sequential Monte Carlo. arXiv:2404.17546. https://doi.org/10.48550/arxiv.2404.17546
  • Tokdar, S. T., & Kass, R. E. (2010). Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1), 54–60.
  • Loula, J., LeBrun, B., Du, L., Lipkin, B., Pasti, C., Grand, G., Liu, T., Emara, Y., Freedman, M., Eisner, J., & others. (2025). Syntactic and semantic control of large language models via sequential Monte Carlo. arXiv:2504.13139. https://doi.org/10.48550/arxiv.2504.13139
  • Zheng, C., Liu, Z., Xie, E., Li, Z., & Li, Y. (2023). Progressive-hint prompting improves reasoning in large language models. arXiv:2304.09797. https://doi.org/10.48550/arxiv.2304.09797
  • Pal, S., Chételat, D., Zhang, Y., & Coates, M. (2024). Refining answer distributions for improved large language model reasoning. arXiv:2412.13292. https://doi.org/10.48550/arxiv.2412.13292
  • Phan, D., Hoffman, M. D., Dohan, D., Douglas, S., Le, T. A., Parisi, A., Sountsov, P., Sutton, C., Vikram, S., & A Saurous, R. (2023). Training chain-of-thought via latent-variable inference. Advances in Neural Information Processing Systems, 36, 72819–72841. https://doi.org/10.48550/arxiv.2312.02179
  • Hu, E. J., Jain, M., Elmoznino, E., Kaddar, Y., Lajoie, G., Bengio, Y., & Malkin, N. (2024). Amortizing intractable inference in large language models. arXiv:2310.04363. https://doi.org/10.48550/arxiv.2310.04363
  • Mo, S., & Xin, M. (2023). Tree of uncertain thoughts reasoning for large language models. arXiv:2309.07694. https://doi.org/10.48550/arxiv.2309.07694
  • Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., & Wang, Y.-X. (2024). Language agent tree search unifies reasoning, acting, and planning in language models. Proceedings of the 41st International Conference on Machine Learning, 235, 62138–62160. https://proceedings.mlr.press/v235/zhou24r.html
  • Buesing, L., Heess, N., & Weber, T. (2020). Approximate inference in discrete distributions with Monte Carlo tree search and value functions. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 108, 624–634. https://proceedings.mlr.press/v108/buesing20a.html
