Kevin Murphy’s treatment of inference on pp. 435-471 of Probabilistic Machine Learning: Advanced Topics gives us a clean gold standard for what exact and approximate inference algorithms are supposed to mean. Those pages cover the forwards-backwards algorithm for hidden Markov models, sum-product and max-product message passing on trees, factor-graph belief propagation, loopy belief propagation and its caveats, variable elimination, and the junction-tree viewpoint. They do not present these methods as vague metaphors. They present them as specific operators with specific mathematical guarantees.
This post audits the current src/plp/inference implementation against that standard.
The conclusion is mixed, but encouraging. PLP has already made one very important move correctly: it separates a proposal law induced by prompt-generating sample sites from factor potentials induced by soft scoring functions. That is the right abstraction for probabilistic language programming. But once we move from semantics to algorithms, several current engine names and correctness claims are too strong. In particular, today’s LoopyParticleBP is not belief propagation in Murphy’s sense, and today’s MaxProduct is not max-product message passing. They are useful particle heuristics, but they are not the textbook algorithms their names suggest.
If PLP wants to become the gold standard for mathematically correct prompting methods, the path is clear: keep the current chain-graph semantics, but distinguish sharply between
- exact message-passing algorithms when the local distributions are explicit and tractable, and
- particle or sampling heuristics when the local distributions are black-box LLM conditionals.
That distinction will make the library both more honest and more powerful.
The textbook baseline
Murphy’s pages 437-441 present the forwards and forwards-backwards recursions for a discrete HMM. The central equations are
\[\alpha_{t|t-1}(j) = \sum_i A_{i,j}\alpha_{t-1}(i),\] \[\alpha_t(j) = \frac{1}{Z_t}\lambda_t(j)\alpha_{t|t-1}(j), \qquad Z_t = \sum_j \lambda_t(j)\alpha_{t|t-1}(j),\] \[\beta_{t-1}(i) = \sum_j A_{i,j}\lambda_t(j)\beta_t(j),\]and
\[\gamma_t(j) \propto \alpha_t(j)\beta_t(j).\]The point is not just that these equations exist. The point is that they are exact dynamic programs on chain-structured models. Murphy is explicit that normalizing the forwards message at every step is the right implementation, both mathematically and numerically.
On pp. 448-453 the same idea is generalized from chains to trees and then to factor graphs. For a tree, the sum-product equations are
\[\mathrm{bel}_s(z_s) \propto \psi_s(z_s)\prod_{t \in \mathrm{ch}_s} m_{t \to s}(z_s),\] \[m_{s \to t}(z_t) = \sum_{z_s}\psi_{st}(z_s, z_t)\,\mathrm{bel}_s(z_s).\]For a factor graph, Murphy writes the message updates as
\[m_{x \to f}(x) = \prod_{h \in \mathrm{nbr}(x)\setminus\{f\}} m_{h \to x}(x),\] \[m_{f \to x}(x) = \sum_{x_{\setminus x}} f(x_{\mathrm{nbr}(f)})\prod_{x' \in \mathrm{nbr}(f)\setminus\{x\}} m_{x' \to f}(x'),\] \[\mathrm{bel}(x) \propto \prod_{f \in \mathrm{nbr}(x)} m_{f \to x}(x).\]On pp. 455-457 he then makes the usual caution explicit: loopy BP can work well, but it may fail to converge, and damping or residual scheduling are engineering tools for managing that fixed-point iteration. Finally, pp. 462-468 give the exact variable-elimination and junction-tree story: exact inference is exponential in treewidth, elimination order matters, and message passing on trees is dynamic programming that avoids redundant recomputation.
That is the standard. So how does PLP compare?
What PLP has done right
The core semantic move in PLP is good, and it is more important than it may look at first glance. The package models a prompt program as a directed proposal together with soft factors:
\[p(\tau) \propto \pi_D(\tau)\,\Phi(\tau).\]Since factors in PLP contribute additive log-weights, the more explicit form is
\[p(\tau) \propto \pi_D(\tau)\exp\left(\sum_f \ell_f(\tau_{\mathrm{pa}(f)})\right),\]where each factor returns a log-potential contribution. That matches the code exactly:
raw_logf = fn(state, *args, **kwargs)
scaled_logf = raw_logf * scale
new_state = state.copy()
new_state.log_weight += scaled_logf
This is the right abstraction for probabilistic language programming because prompt sites naturally behave like conditional proposal kernels, while critique, preference, format, consistency, or task-success signals naturally behave like soft factors.
PLP also gets several structural decisions right.
First, it represents the model as a chain graph with directed sample-parent edges and undirected factor cliques. That is a principled way to unify directed prompt generation with undirected scoring couplings.
Second, the DAG-based engines enforce a topological order. That is the correct structural precondition for ancestral sampling:
for site_name in order:
particles = await propose_site_for_particles(
graph, site_name, particles, **sample_kwargs,
)
apply_newly_ready_factors(graph, particles, applied)
Third, the implementation stores weights in log-space and uses stable normalization helpers such as logmeanexp and systematic resampling. Those are good numerical choices.
Fourth, choose_engine uses the moralized graph as a structural signal. That is a reasonable first heuristic. It is not the full exact-inference story, but it is much better than pretending the graph structure does not matter.
So the foundation is not the problem. The problem is that the current algorithms are sometimes named or described as though they were Murphy’s algorithms, when they are really proposal-based particle approximations.
The PLP target distribution
Before critiquing the engines, it helps to state clearly what they are trying to approximate.
If the @sample sites define a forward proposal
and the @factor sites contribute log-potentials \(\ell_f\), then the unnormalized target is
Under that interpretation, simple importance sampling from \(\pi_D\) with weight
\[w(\tau) = \exp\left(\sum_f \ell_f(\tau_f)\right)\]is mathematically coherent. This is one of the strongest ideas in the current package: the proposal is the prompt-generation law, and the factor stack is the correction from “what the model says” to “what we want.”
But this also means that exact message passing requires more than the current API exposes. Murphy’s algorithms assume explicit local probability tables, potentials, or messages that can be multiplied, summed, and maximized over. For open-ended LLM text outputs, PLP often has only a black-box sampler. Once that is true, exact forwards-backwards or exact sum-product are no longer automatically available.
This is not a flaw. It is a fact. The mistake is when the library uses the names of exact message-passing algorithms for engines that do not implement those equations.
Engine-by-engine audit
Forward: right as ancestral sampling, wrong as posterior inference
The current Forward engine is a valid ancestral sampler on a DAG. It samples each site in topological order and applies factors after the relevant parents exist. As a sampler from the proposal law \(\pi_D\), it is correct.
But it is not Murphy’s forwards algorithm. Murphy’s forwards algorithm propagates a belief state \(\alpha_t(j) = p(z_t=j \mid y_{1:t})\) using matrix-vector recursion and normalization at each step. PLP’s Forward propagates particles, not filtered marginals.
That difference matters. The code says:
"""
Correct when the sample-parent graph is acyclic (Bayes net).
If factors are present, they are applied but weights are ignored.
"""
This is too loose. If there are no factors, then yes: the engine returns exact samples from the directed model. If factors are present, the engine is still a correct ancestral sampler from \(\pi_D\), but it is not an inference algorithm for the posterior target \(p(\tau) \propto \pi_D(\tau)\Phi(\tau)\). The factors are merely recorded; no resampling, normalization, or marginalization uses them.
So the right mathematical description is:
\[\tau^{(k)} \sim \pi_D(\tau), \qquad \log w^{(k)} = \sum_f \ell_f(\tau_f^{(k)}),\]with no posterior correction step. That is useful, but it should be called what it is: forward proposal sampling, not textbook filtering.
ImportanceSampling: coherent Monte Carlo, but not exact tree inference
The current ImportanceSampling engine is mathematically defensible, and in spirit it is the cleanest approximation in the package. It samples from the forward proposal and then weights by the factors:
That is exactly what the code does before its optional final resampling step.
This is good probabilistic programming practice. But the docstring says it is “correct” for acyclic parents with tree-shaped factor cliques. Relative to Murphy, that is still too strong.
On a tree-structured discrete graphical model with explicit local probabilities, Murphy’s answer is not importance sampling. It is sum-product, which gives exact marginals in linear time in the number of nodes and quadratic time in the local state size. The existence of a tree does not make importance sampling exact. It only makes exact dynamic programming available.
So ImportanceSampling is a good approximate engine for PLP’s black-box proposal setting, but it should be described as
- asymptotically consistent as \(K \to \infty\) under the PLP target above,
- finite-sample and variance-sensitive, and
- an approximation even on tree-structured graphs unless the proposal already equals the posterior.
That is not a small wording issue. It is the difference between a Monte Carlo approximation and an exact inference algorithm.
SMC: a reasonable particle method, but outside the scope of these Murphy pages
The current SMC engine performs sequential propose-weight-resample updates across a topological order:
where \(\mathcal{F}_t\) denotes factors that become ready at step \(t\).
This is a sensible particle method for long directed chains with local factors. The implementation also acknowledges its own weakness: if important factors arrive late, early resampling can kill good trajectories too soon. That is exactly the right warning.
But Murphy pp. 435-471 do not present SMC as the reference algorithm here. On those pages, the exact chain-structured gold standard is forwards-backwards, not particle filtering. So the right comparison is not “SMC is wrong.” The right comparison is “SMC is an approximate fallback for cases where exact message passing is unavailable or impractical.”
For PLP, that can be the correct engineering choice, especially because LLM outputs are open-ended. But the library should say so explicitly: SMC is an approximation for black-box prompt kernels, not the canonical exact solution that Murphy develops for discrete chains.
LoopyParticleBP: the biggest naming error in the module
This is the engine whose current name diverges most sharply from Murphy’s theory.
Murphy’s belief propagation, whether on trees or factor graphs, is a local message-passing algorithm. It updates messages \(m_{s \to t}\) or \(m_{f \to x}\) using sums, products, and possibly maxima over neighboring state spaces. The whole point of BP is that it passes localized summaries across edges.
PLP’s LoopyParticleBP does something else:
for t in range(T):
particles = self._score_all_factors(graph, particles)
log_z = logmeanexp([p.log_weight for p in particles])
log_evidence.append(log_z)
particles = systematic_resample(particles, K)
if t < T - 1:
particles = await repropose_by_groups(
graph, particles, groups, sample_kwargs,
)
This is not message passing. It is repeated full-particle scoring, resampling, and reproposal. No message objects exist. No variable-to-factor or factor-to-variable messages exist. No fixed-point equations of the form Murphy gives on p. 453 are implemented.
That does not mean the method is useless. It may be a practical heuristic for cyclic prompt systems. But mathematically it is much closer to a population Monte Carlo or rejuvenated particle heuristic than to belief propagation.
The docstring goes further and says:
Across rounds this is Gibbs sampling on an implicit MRF defined by the conditional kernels \(K_i(y_i \mid y_{\mathrm{pa}(i)})\).
This is not justified by the code. A true Gibbs sampler draws each variable from its full conditional under the target distribution. Here, each site is redrawn from its prompt proposal, then the whole particle is rescored, then particles are resampled. There is no acceptance ratio and no proof that the update kernel leaves the desired target invariant. So the method should not be described as Gibbs unless that invariant-measure argument is actually established.
This is the clearest place where PLP should become more conservative and more precise. A better name would be something like LoopyParticleReweighting, PopulationReproposal, or ParticleGraphHeuristic.
MaxProduct: useful best-of-\(K\), but not Murphy’s max-product
Murphy’s max-product algorithm replaces summation by maximization in a message-passing recursion and is closely related to Viterbi on chains. It computes max-marginals or MAP configurations through dynamic programming.
PLP’s MaxProduct does this instead:
indexed = sorted(
enumerate(particles), key=lambda pair: pair[1].log_weight, reverse=True,
)
top = [p for _, p in indexed[:K]]
This is top-\(K\) selection among an oversampled set of proposal trajectories. In other words, if the engine draws \(N\) ancestral samples, then it returns
\[\operatorname{TopK}\left\{\tau^{(1)}, \ldots, \tau^{(N)}\right\}\]ranked by total factor score. That is a perfectly reasonable best-of-\(K\) search heuristic. But it is not max-product message passing.
There is also a direct inconsistency between documentation and implementation. The file header claims a cyclic fallback based on Gibbs chains, but the implementation immediately enforces acyclicity and raises on cyclic parents. So even as documentation, the current file overpromises.
Again, the right fix is not to delete the engine. The right fix is to rename and reframe it, perhaps as TopKProposalSearch or BeamMAP.
What Murphy’s theory says PLP is currently missing
The biggest gap is not a bug in the current engines. It is the absence of the exact algorithms Murphy puts at center stage.
Missing exact chain inference
There is no engine in src/plp/inference that implements exact forwards-backwards smoothing or Viterbi-style dynamic programming on discrete chain models. That means the package currently lacks the exact analog of the most classical inference problem in the whole chapter.
For prompt programs with finite state spaces or enumerable local supports, PLP should have this.
Missing exact tree inference
There is no true sum-product or max-product implementation over tree-structured factor graphs. Today, a two-node tree with a pairwise factor is sent to ImportanceSampling by choose_engine. Murphy would send that model to exact sum-product immediately.
The current choice is understandable because LLM outputs are usually not enumerable. But the library should separate two cases:
- black-box sample sites, where importance sampling may be the only practical option, and
- explicit discrete sites, where exact sum-product should be available and preferred.
Missing variable elimination and junction tree
Murphy’s pp. 462-468 make an important point that PLP does not yet exploit: loops do not imply “use loopy BP.” Some loopy graphs still admit efficient exact inference when the induced width is small enough. That is the whole point of variable elimination and the junction-tree algorithm.
Current engine selection is therefore too blunt:
if graph.has_cycle_in_moralised_graph():
return LoopyParticleBP
That is a fine heuristic for black-box prompt spaces. It is not a mathematically complete inference policy.
A precise summary of what is right, wrong, and improvable
The cleanest way to summarize the audit is in a table.
| PLP component | Verdict | Why |
|---|---|---|
| Chain-graph semantics | Right | The split between proposal law and factor potentials is mathematically sound and well suited to prompt programs. |
| Log-weight factors | Right | Additive log-factors correspond naturally to multiplicative potentials. |
| DAG enforcement for ancestral engines | Right | Topological ordering is the correct structural requirement. |
| Stable numerics and systematic resampling | Right | Good implementation discipline. |
Forward naming as “forward inference” |
Misleading | It is ancestral proposal sampling, not Murphy’s filtered belief recursion. |
ImportanceSampling correctness claims |
Too strong | It is a valid approximation, not exact tree inference. |
LoopyParticleBP name |
Wrong | It does not implement BP messages or Murphy’s factor-graph equations. |
LoopyParticleBP “Gibbs” claim |
Unjustified | The code does not sample exact full conditionals of the target. |
MaxProduct name |
Wrong | It is top-\(K\) sampled search, not max-product dynamic programming. |
choose_engine cycle heuristic |
Incomplete | It ignores exact VE/JTA possibilities on low-treewidth loopy graphs. |
| Test suite | Not enough for inference claims | Mostly structural and smoke tests; not enough exact-posterior validation. |
The most important implementation snippets
To make the comparison concrete, it helps to place the implementation beside the theory.
Murphy’s factor-graph BP on p. 453 is explicitly local:
m_{x -> f}(x) = prod_{h in nbr(x) \ {f}} m_{h -> x}(x)
m_{f -> x}(x) = sum_{x' in nbr(f) \ {x}} f(x, x') prod m_{x' -> f}(x')
bel(x) propto prod_{f in nbr(x)} m_{f -> x}(x)
PLP’s current “LBP” engine is global and particle-based:
for p in particles:
state = p
for factor in graph.factors:
state = apply_factor(factor, state)
out.append(state)
Murphy’s max-product is dynamic programming over local maxima. PLP’s current MaxProduct is ranking over sampled trajectories:
N = max(K * 2, 4)
particles = [base.copy() for _ in range(N)]
...
top = [p for _, p in indexed[:K]]
And Murphy’s exact tree and junction-tree story says that cached local messages are what avoid repeated work. PLP’s current exact-style caching is limited to “apply each factor once when ready” in DAG sweeps, which is useful but not a replacement for general message reuse:
if all(p in particles[0].vars for p in factor.parents):
particles[:] = [apply_factor(factor, p) for p in particles]
applied.add(factor.name)
This is operationally clean, but it is not variable elimination, not forwards-backwards, and not junction-tree propagation.
What PLP should do next to become the gold standard
The right roadmap is not to force everything into exact message passing. The right roadmap is to make the boundary between exact and approximate inference mathematically explicit.
1. Rename the current stochastic engines honestly
This is the fastest high-value improvement.
Forward should become something like AncestralSampling or ForwardProposal.
ImportanceSampling can keep its name, but its docstring should say “approximate posterior inference by self-normalized importance sampling.”
LoopyParticleBP should be renamed to something that does not claim belief propagation, unless it is rewritten to actually pass messages.
MaxProduct should become TopKProposalMAP, BestOfKMAP, or BeamMAP.
Honest names are not cosmetic. They prevent users from making false theoretical assumptions.
2. Add an exact discrete-inference layer
To match Murphy properly, PLP should add exact engines for cases where local supports are tractable.
The minimal exact layer is:
ForwardBackwardfor discrete chains,Viterbifor MAP on chains,TreeSumProductfor tree-structured factor graphs,TreeMaxProductfor tree MAP,VariableEliminationfor exact marginals on small-treewidth graphs,JunctionTreefor reusable exact marginals when many queries are needed.
The key architectural requirement is that a sample site must optionally expose more than a sampler. It must expose either
- an enumerable support together with local log-probabilities, or
- a tractable local potential representation.
Without that, Murphy-style exact algorithms are impossible on open-ended LLM outputs.
3. Introduce a two-tier API: black-box sites vs explicit sites
PLP currently treats all sample sites as generative black boxes. That is realistic for LLM calls, but it collapses together two very different use cases.
The library should support:
- black-box prompt sites, for which only particle methods are available, and
- explicit probabilistic sites, for which exact message passing becomes available.
This two-tier design would let PLP be mathematically exact where possible and pragmatically approximate where necessary.
4. If keeping a loopy particle engine, define its invariant distribution
If a particle rejuvenation method is going to remain in the core library, it should come with a precise target and update kernel.
At minimum, the documentation should specify whether each round is intended to approximate
\[p(\tau) \propto \pi_D(\tau)\exp\left(\sum_f \ell_f(\tau_f)\right),\]and by what argument the update kernel preserves or approximates that target.
If no such argument exists, the method should be presented as a heuristic search-and-reweight scheme, not as Gibbs or BP.
5. Add exact-vs-approximate tests on small closed-form models
This is the second fastest high-value improvement.
Right now the tests are mostly structural or smoke-based. To make inference claims credible, PLP should include small models with closed-form answers:
- two-state HMMs with known filtering and smoothing marginals,
- tiny tree factor graphs where exact sum-product can be computed by hand,
- small loopy graphs where exact marginals are available by brute-force enumeration,
- MAP tests where Viterbi or VE gives the known optimum.
Then each approximate engine can be benchmarked against exact reference values.
6. Move choose_engine from topology-only to capability-aware
Murphy’s chapter is really about a triple:
- graph structure,
- algebra of local factors, and
- query type.
PLP currently looks mostly at structure. A stronger selector would ask:
- is the graph acyclic?
- are local state spaces finite and enumerable?
- is the moralized graph a tree?
- what is the induced width?
- is the query marginal, evidence, or MAP?
Only then should it pick between exact dynamic programming, exact elimination, or stochastic approximation.
The deeper opportunity for PLP
The most exciting part of this audit is not the criticism. It is the opportunity.
Prompt engineering has lacked a semantics-first library that says, cleanly:
\[\text{prompting method} = \text{proposal family} + \text{factor family} + \text{inference algorithm}.\]PLP is already very close to having that language. The proposal side is there. The factor side is there. The missing step is to align the algorithm names and guarantees with classical probabilistic inference.
If that alignment happens, PLP can become more than a useful experimentation library. It can become the place where prompting methods are described with the same mathematical discipline that we expect in graphical models:
- exact when exact is possible,
- approximate when approximation is necessary,
- explicit about the target distribution,
- explicit about invariance, bias, consistency, and convergence,
- and benchmarked against textbook reference algorithms.
That is how probabilistic language programming stops being a metaphor and becomes a real inference discipline.
Final verdict
PLP has done the hardest conceptual part correctly: it treats prompt generation as a probabilistic proposal process and prompt evaluation as factor-based reweighting on a chain graph. That is a strong and publishable abstraction.
What it has not done yet is implement Murphy’s exact inference algorithms under those names. The current module mostly contains
- ancestral proposal sampling,
- importance sampling,
- sequential Monte Carlo,
- a loopy particle heuristic, and
- a top-\(K\) MAP heuristic.
Those are valuable tools. But they should be presented as such.
The path to becoming the gold standard is therefore straightforward:
- keep the current PLP semantics,
- rename the approximate engines honestly,
- add exact discrete message-passing and elimination algorithms where tractable,
- expose the local support and scoring interfaces needed for exact inference,
- and validate every approximate engine against exact toy references.
Murphy’s chapter gives the blueprint. PLP already has the right object. The next step is to make the algorithms match the mathematics as closely as the abstraction already does.
Let's talk!
I'm Carlo Nicolini — I am interested on the reliability of AI reasoning systems (interpretability, inference-time methods, probabilistic language programming) and on quantitative portfolio optimization (I am a maintainer of skfolio). If you're working on something in these areas and think we might collaborate, chat, discuss, I'm happy to talk about it!
The best way to reach me is on via DM on LinkedIn.