1. Introduction
Sequential decision making under uncertainty is a central problem across artificial intelligence, robotics, and cognitive science. The dominant formalism for decades has been dynamic programming, which casts planning as the optimization of a value function via the Bellman equation [8, 83]. A parallel tradition, which has grown from a theoretical curiosity into a major research programme, recasts planning and control not as optimization problems but as inference problems in probabilistic graphical models. Under this view, an optimal agent does not maximize cumulative reward directly. It conditions a probabilistic model on the event of behaving optimally and performs posterior inference to recover a plan or policy. This conceptual reframing, broadly termed planning as inference, has yielded new algorithms, tighter theoretical connections between control and estimation, and practical systems that rival or exceed the performance of conventional reinforcement learning methods [54, 87, 93].
The appeal of the inference perspective is both theoretical and practical. Theoretically, it reveals deep dualities between stochastic optimal control and probabilistic estimation [48, 89], provides principled mechanisms for exploration via information-theoretic objectives [26, 33], and connects sequential decision making to the broader toolkit of approximate inference (variational methods, Monte Carlo sampling, message passing, and expectation-maximization) [54, 68, 93]. Practically, algorithms derived from this viewpoint (Soft Actor-Critic [33], model-predictive path integral control [99], and diffusion-based planners [44]) have achieved state-of-the-art results in continuous control, robotic manipulation, and locomotion.
This scoping review addresses the research question of how planning problems have been formulated and solved as inference problems in probabilistic models, and what the key algorithmic frameworks, theoretical foundations, and empirical advances are in this paradigm. We cover the period from 2006 to 2026, spanning artificial intelligence, robotics and control, probabilistic machine learning, and cognitive science. The lower bound is motivated by a cluster of foundational publications in 2006 (Todorov's linearly solvable MDPs [87] and Toussaint and Storkey's graphical model formulation [93]) that crystallized the modern form of the framework, although we reference earlier antecedents where relevant [4, 17].
The survey is organized as follows. Section 2 establishes background and definitions. Section 3 examines the theoretical foundations of the control-as-inference framework, including KL-regularized control, linearly solvable MDPs, path integral methods, and expectation-maximization approaches. Section 4 covers active inference and the free energy principle as a planning framework grounded in neuroscience. Section 5 treats maximum entropy reinforcement learning and soft methods, which have produced the most widely deployed algorithms from this paradigm. Section 6 reviews model-based probabilistic planning, including Gaussian process dynamics, world models, generative trajectory models, and probabilistic programming. Section 7 provides a cross-cutting analysis of connections and tensions between themes. Section 8 identifies open problems and future directions, and Section 9 concludes.
The single most important takeaway is that the choice of inference method in planning as inference is not a computational convenience but a constitutive design decision. It shapes the resulting agent's epistemic and pragmatic behavior [49, 54, 59]. Curvature-sensitive Gaussian approximations generate ambiguity-avoiding active inference agents, while curvature-blind ones do not [49]. Variational families, sampling schemes, and amortization choices in soft reinforcement learning collectively determine policy expressiveness and exploration [33, 54]. The next generation of inference-based planners will be evaluated as much by their inference approximations as by their generative models.
2. Background and Definitions
The planning-as-inference framework operates at the intersection of several formalisms. A Markov decision process (MDP) is defined by states (\(s \in \mathcal{S}\)), actions (\(a \in \mathcal{A}\)), a transition kernel (\(p(s' \mid s, a)\)), a reward function (\(r(s, a)\)), and a discount factor (\(\gamma\)). The objective is to find a policy (\(\pi(a \mid s)\)) that maximizes expected cumulative reward. In the partially observed setting (POMDP), the agent receives observations (\(o\)) rather than states, introducing a belief-state computation. The standard approach solves the Bellman equation for a value function (\(V(s)\)) or action-value function (\(Q(s, a)\)), from which a policy is extracted [8, 83]. Multiagent extensions (Dec-POMDPs) introduce multiple agents with independent observation histories, yielding NEXP-complete planning problems even for two agents [50].
Planning as inference augments the MDP with binary optimality variables (\(\mathcal{O}_t \in \{0, 1\}\)) at each time step, where (\(p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp(r(s_t, a_t))\)) [54]. The joint probability of a trajectory (\(\tau = (s_1, a_1, \ldots, s_T, a_T)\)) under the graphical model, conditioned on all optimality variables being 1, encodes the planning problem. Computing the posterior (\(p(\tau \mid \mathcal{O}_{1:T} = 1)\)) or the per-step marginal (\(p(a_t \mid s_t, \mathcal{O}_{t:T} = 1)\)) yields the optimal policy. This formulation transforms policy computation into posterior inference in a probabilistic graphical model [54, 82, 93].
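Concretely, following the construction in [54], conditioning on the optimality variables tilts the trajectory prior exponentially by cumulative reward. Writing the prior action distribution as \(\pi_0\), the conditioned trajectory distribution factorizes (up to normalization) as

\[
p(\tau \mid \mathcal{O}_{1:T} = 1) \;\propto\; p(s_1)\,\Bigg[\prod_{t=1}^{T-1} p(s_{t+1} \mid s_t, a_t)\Bigg] \prod_{t=1}^{T} \pi_0(a_t \mid s_t)\,\exp\!\big(r(s_t, a_t)\big),
\]

so that trajectories with higher cumulative reward receive exponentially more posterior mass under the same dynamics.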
KL-regularized control refers to optimal control problems where the objective includes a Kullback-Leibler divergence penalty between the controlled and uncontrolled dynamics [47, 87]. This regularization is what makes the control problem exactly solvable as an inference problem. A linearly solvable MDP is the special case where this KL structure linearizes the Bellman equation [87]. The free energy principle posits that biological agents minimize a variational free energy functional, and active inference extends this to action selection, where planning amounts to minimizing expected free energy over future trajectories [25, 27]. Maximum entropy reinforcement learning augments the standard RL objective with an entropy bonus (\(\mathcal{H}[\pi(\cdot \mid s)]\)) at each step, yielding policies that are both reward-maximizing and maximally stochastic [32, 54, 102]. This objective is dual to the inference formulation, with the entropy bonus arising naturally from the KL structure of the graphical model [54].
Probabilistic graphical models represent joint distributions over random variables through graph structure (directed Bayesian networks or undirected Markov random fields). Inference amounts to computing marginal or conditional distributions, typically intractable in exact form and hence approximated via variational inference (minimizing a KL divergence between an approximate and true posterior), message passing, or Monte Carlo sampling. Variational free energy, an upper bound on negative log-evidence, serves as the central objective in variational inference, providing both a tractable optimization target and an information-theoretic interpretation of the gap between model and data [48, 54].
This review covers work that explicitly formulates planning or control as inference in a probabilistic model. We exclude pure model-free reinforcement learning that does not invoke inference-based reasoning, Bayesian optimization for black-box functions, and inference methods for supervised or unsupervised learning that lack a sequential decision-making component. We include work from the active inference tradition when it addresses planning and action selection, but not purely perceptual applications of the free energy principle. Approaches that use the graphical model view to integrate motion planning with trajectory estimation, or to unify reasoning across abstraction levels, are included because they draw the same formal equivalence between plans and posteriors.
3. Theoretical Foundations
3.1 The Graphical Model Formulation
The central insight of the planning-as-inference paradigm is that the Bellman optimality equation can be recovered as a message-passing computation in a suitably defined graphical model. Early intimations appeared in the EM-based RL literature of the late 1990s, where reward-weighted likelihood maximization was shown to yield policy improvement [17, 65]. Attias [4] showed that plans in goal-directed tasks could be obtained by inference in a directed graphical model over observations, actions, and goals, an early formulation that predated the now-dominant optimality-variable construction. The modern form crystallized in work by Toussaint and Storkey [93], who constructed a graphical model over state-action trajectories augmented with binary reward events, and demonstrated that backward messages correspond to value functions while forward messages correspond to state visitation distributions. This bidirectional message-passing view unifies dynamic programming (backward pass) with trajectory sampling (forward pass) within a single probabilistic framework [91, 93].
The generality of this formulation has been appreciated gradually. Subsequent formalizations showed that the specific choice of how to encode reward (as log-probability of an optimality variable, as exponential tilting of a reference measure, or as a likelihood in an EM procedure) determines which inference algorithm recovers which control algorithm [54, 68]. The mapping is remarkably tight. Belief propagation yields soft value iteration, variational inference yields policy gradient methods, and expectation-maximization yields reward-weighted regression [48, 54]. This taxonomy, developed most comprehensively in Levine's [54] tutorial and reinforced by the graphical-model-and-variational-inference survey of Sun et al. [82], has become the standard reference for the field. The framework accommodates discrete and continuous state-action spaces, finite and infinite horizons, and fully and partially observed settings, although computational tractability varies sharply across these cases [68, 93].
A critical subtlety is the role of the reference policy (or prior policy). In the graphical model, the distribution over actions before conditioning on optimality is a prior (\(\pi_0(a \mid s)\)), and the posterior policy deviates from this prior only insofar as evidence from optimality variables demands [28, 54]. When the prior is uniform, the framework reduces to standard maximum entropy control. When the prior is a previously learned policy, the framework yields KL-regularized policy updates reminiscent of trust-region methods [1, 28, 75]. This prior-posterior structure provides a natural language for curriculum learning, hierarchical control, and transfer, although these applications have been explored unevenly [28, 86].
3.2 KL-Regularized Control and Linearly Solvable MDPs
The mathematical backbone of planning as inference is the theory of KL-regularized stochastic optimal control. The key result, developed independently by Todorov [87, 88] and Kappen [47], is that when the cost of control is measured as the KL divergence between controlled and uncontrolled dynamics, the nonlinear Bellman equation linearizes. In terms of the desirability function (\(z(s) = \exp(-V(s))\)), where (\(V\)) is the optimal cost-to-go, the Bellman equation becomes linear in (\(z\)), solvable by matrix exponentiation or eigenvalue methods in discrete settings [87] and by path integral computation in continuous settings [47].
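A minimal numerical sketch of the discrete first-exit case illustrates the linearity. The Python fragment below is illustrative only; the passive transition matrix, state-cost vector, and all names are assumptions supplied for the example rather than values from [87].

```python
import numpy as np

def solve_lmdp(P, q, absorbing, max_iters=10_000, tol=1e-10):
    """Discrete first-exit linearly solvable MDP; a minimal sketch.

    P         : (N, N) passive (uncontrolled) transition matrix, rows sum to 1.
    q         : (N,) state costs; absorbing states carry their exit cost.
    absorbing : (N,) boolean mask of absorbing (goal) states.
    Returns the desirability z(s) = exp(-V(s)), the cost-to-go V, and the
    optimally controlled transitions u(s'|s) ∝ p(s'|s) z(s').
    """
    N = len(q)
    z = np.ones(N)
    for _ in range(max_iters):
        # Linear Bellman backup: z = exp(-q) * (P z)
        z_new = np.exp(-q) * (P @ z)
        z_new[absorbing] = np.exp(-q[absorbing])   # boundary condition at exits
        if np.max(np.abs(z_new - z)) < tol:
            z = z_new
            break
        z = z_new
    V = -np.log(np.clip(z, 1e-300, None))          # optimal cost-to-go
    U = P * z[None, :]                             # tilt passive dynamics by desirability
    U /= U.sum(axis=1, keepdims=True)              # optimal controlled transition matrix
    return z, V, U
```

The backup is a fixed-point iteration on a linear operator, in contrast to the max-nonlinearity of the standard Bellman equation.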
This linearization is not merely a mathematical convenience. It establishes a formal duality between optimal control and Bayesian inference. Todorov [89] proved that the KL-regularized control problem is dual to a maximum-likelihood estimation problem, and that the optimal controller can be recovered by inferring the posterior distribution over trajectories given observations generated by an optimally controlled process. This duality is exact, not approximate, and holds for a broad class of stochastic systems [48, 89]. The practical consequence is that efficient inference algorithms (sampling, variational methods, spectral methods) can be directly repurposed for control [48].
The linearly solvable MDP framework has been extended in several directions. Dvijotham and Todorov [21] generalized it to handle state-dependent control costs and constraints. Todorov [90] showed connections to information-theoretic bounded rationality, where agents with limited computational resources naturally exhibit KL-regularized behavior. The compositionality of the linear framework (that optimal policies for composite tasks can be obtained by combining the desirability functions of subtasks) has been exploited for multi-task and hierarchical control [73, 90]. Despite its elegance, the framework's restriction to KL-structured costs limits direct applicability, and bridging this gap to general reward functions is an active area of research (Section 8).
3.3 Path Integral Control
The continuous-time, continuous-state limit of KL-regularized control leads to path integral control, where the optimal control signal is expressed as an expectation over stochastic trajectories weighted by their exponentiated costs [47, 84]. The path integral formulation is attractive because it converts a Hamilton-Jacobi-Bellman PDE into a sampling problem. One simulates forward trajectories under the uncontrolled (or noise-injected) dynamics, evaluates their costs, and takes a cost-weighted average to obtain the control [84, 99].
This approach has been particularly influential in robotics, where high-dimensional continuous control is the norm. The model-predictive path integral (MPPI) framework [99] combines path integral control with model-predictive control, sampling trajectories from a learned or known dynamics model and recomputing the control at each step. MPPI has demonstrated strong performance on aggressive autonomous driving [100], quadrotor flight [60], and manipulation tasks [9]. Its main limitation is computational. Good performance requires many sampled trajectories, and sample complexity grows with task horizon and dimensionality. Importance sampling and covariance adaptation have been proposed to ameliorate this [67, 99], but scaling to very high-dimensional problems remains challenging.
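The receding-horizon MPPI update is compact enough to sketch directly. The Python fragment below is schematic; `dynamics`, `cost`, and `terminal_cost` are assumed user-supplied callables, and the hyperparameters are placeholders rather than values from [99].

```python
import numpy as np

def mppi_step(x0, u_nominal, dynamics, cost, terminal_cost,
              num_samples=512, noise_std=0.5, temperature=1.0, rng=None):
    """One model-predictive path integral update; a minimal sketch.

    x0        : current state.
    u_nominal : (H, du) nominal control sequence from the previous step.
    """
    rng = rng or np.random.default_rng()
    H, du = u_nominal.shape
    noise = rng.normal(scale=noise_std, size=(num_samples, H, du))
    costs = np.zeros(num_samples)
    for k in range(num_samples):
        x = x0
        for t in range(H):
            u = u_nominal[t] + noise[k, t]
            costs[k] += cost(x, u)
            x = dynamics(x, u)
        costs[k] += terminal_cost(x)
    # Exponentiated-cost weights: the importance-weighted inference step.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    # Cost-weighted average of the sampled perturbations updates the plan.
    return u_nominal + np.einsum('k,khd->hd', weights, noise)
```

At each control cycle the first action of the updated sequence is executed and the remainder is reused as the next nominal plan.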
The connection between path integrals and inference is deep. The trajectory-weighting step in path integral control is formally equivalent to importance-weighted inference in a graphical model where trajectories are latent variables and costs serve as likelihood terms [48, 54]. This insight has motivated the use of more sophisticated inference methods (sequential Monte Carlo [66], Stein variational gradient descent [52, 81], and amortized importance sampling) as drop-in improvements to the basic path integral computation.
3.4 Expectation-Maximization Approaches
An influential subfamily of planning-as-inference algorithms uses the expectation-maximization (EM) algorithm to iteratively improve a policy. In the E-step, trajectories are sampled and weighted by exponentiated returns. In the M-step, a parametric policy is fit to the weighted samples by maximum likelihood. This template, which includes reward-weighted regression [65], relative entropy policy search [64], and cross-entropy methods as special cases, makes the inference interpretation of planning explicit. The E-step performs inference (computing the posterior over trajectories given high reward), and the M-step performs learning (updating the policy to match this posterior).
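A minimal sketch of one such EM iteration makes the two steps explicit. The example below assumes a linear-Gaussian policy \(a \sim \mathcal{N}(Ws, \Sigma)\); the names, regularizers, and temperature are illustrative rather than drawn from [64, 65].

```python
import numpy as np

def em_policy_search_step(states, actions, returns, temperature=1.0):
    """One reward-weighted EM iteration for a linear-Gaussian policy (sketch).

    states : (N, ds), actions : (N, da), returns : (N,) sampled returns.
    """
    # E-step: posterior weights over samples given "optimality"
    # (softmax of returns, shifted for numerical stability).
    w = np.exp((returns - returns.max()) / temperature)
    w /= w.sum()
    # M-step: weighted maximum likelihood fit of the policy.
    S = states * w[:, None]
    W = np.linalg.solve(states.T @ S + 1e-6 * np.eye(states.shape[1]),
                        S.T @ actions).T            # weighted least squares mean
    resid = actions - states @ W.T
    Sigma = (resid * w[:, None]).T @ resid + 1e-6 * np.eye(actions.shape[1])
    return W, Sigma
```

Iterating this step alternates inference over high-return samples with supervised refitting of the policy, which is the template that MPO and related algorithms elaborate with KL constraints.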
The EM view has been particularly fruitful for deriving practical policy search algorithms with principled update rules. Maximum a Posteriori Policy Optimisation (MPO) [1] uses the EM decomposition to separate policy evaluation (a supervised learning problem) from policy improvement (a KL-constrained optimization), achieving competitive performance on continuous control benchmarks with excellent stability properties. Hoffman et al. [42] extended the EM approach to continuous MDPs with arbitrary reward structures, while Vlassis et al. [97] demonstrated Monte Carlo EM for model-free robot control. The EM perspective also provides a natural framework for incorporating constraints, priors, and hierarchical structure into policy optimization [1, 68].
A limitation of EM-based approaches is a familiar one: EM is prone to local optima, and the quality of the solution depends on initialization and the expressiveness of the policy class. The EM viewpoint nonetheless continues to generate new algorithms, most recently in diffusion-based planning (Section 6.4), where the iterative denoising process can be interpreted as a form of iterative posterior refinement [44].
3.5 Probabilistic Programming as a Planning Language
The inference-based view of planning finds a natural computational substrate in probabilistic programming, which provides general-purpose languages for specifying generative models and performing automated inference. Belle and Levesque [7] argued that probabilistic planning (planning in domains with stochastic actions, partial observability, and noisy sensors) can be expressed directly as probabilistic programs, with plan synthesis delegated to the language's inference engine. This offers expressiveness (probabilistic programming languages accommodate rich generative models including continuous variables, recursive structures, and open-universe representations) and modularity (improvements to general-purpose inference engines automatically improve planning performance).
The probabilistic programming approach inherits the computational limitations of general-purpose inference. Current inference engines struggle with the long sequential dependencies characteristic of planning problems, and the expressiveness that makes probabilistic programs attractive as a modeling language also makes inference intractable without strong structural assumptions [7]. The tension between modeling flexibility and computational tractability remains a defining challenge for this approach, one that subsequent work in active inference and soft RL has addressed by imposing structured generative models that admit efficient inference at the cost of reduced generality.
3.6 Scaling to Multiagent Settings
The intractability of multiagent planning (NEXP-complete for Dec-POMDPs) has motivated the search for conditions under which the inference view yields practical algorithms. Kumar et al. [50] identified sufficient conditions under which multiagent planning can be decomposed into tractable inference sub-problems. By encoding agent interactions as factors in a graphical model and exploiting conditional independence structure, their approach transforms the exponential cost of joint policy enumeration into a polynomial-cost message-passing computation. The key insight is that many realistic multiagent problems exhibit interaction structure that is sparse relative to the worst case. Agents influence each other through shared state variables rather than through arbitrary joint policies, and this sparsity maps onto the factorization structure of a graphical model [50].
This line of work illustrates a recurring motif. Intractable planning problems become tractable when reformulated as inference in graphical models with exploitable structure. The source of tractability is not approximate inference per se but the identification of conditional independence relationships that the original planning formulation obscures. Whether similar structural insights can be identified for other classes of hard planning problems (continuous multiagent coordination, adversarial settings, open-ended environments) remains an important open question [29, 85].
4. Active Inference and the Free Energy Principle
4.1 From Perception to Planning
The free energy principle posits that self-organizing systems, including biological organisms, can be described as minimizing a variational free energy functional, thereby maintaining themselves in a restricted set of characteristic states [25]. What began as a unifying theory of brain function, encompassing perception, learning, and attention, was extended to action selection through active inference. Agents act to minimize the surprise (or free energy) associated with their sensory observations [25, 27]. Under this account, planning is not reward maximization but surprise minimization, with the agent's generative model encoding preferences as prior beliefs about future observations.
The relevance to planning as inference is direct. In active inference, an agent maintains a generative model (\(p(o_{1:T}, s_{1:T}, \pi)\)) over observations, hidden states, and policies. Planning amounts to evaluating the expected free energy (\(G(\pi)\)) of each candidate policy and selecting the policy that minimizes (\(G\)) [26, 27]. This evaluation is an inference computation. The agent infers the posterior distribution over states conditioned on the policy, then scores the policy by the expected divergence between predicted and preferred observations plus an epistemic term that drives exploration [26, 62]. The mathematical apparatus is variational inference, specifically variational message passing on a partially observed Markov decision process encoded as a factor graph [16, 27].
4.2 Expected Free Energy and Exploration
The expected free energy (EFE) objective is the distinctive contribution of active inference to the planning-as-inference landscape. It decomposes into two terms, a pragmatic (or instrumental) term that measures the divergence between predicted and preferred outcomes, analogous to expected reward, and an epistemic (or information-seeking) term that measures the expected information gain about hidden states, driving exploration [26, 59, 62]. This decomposition provides a principled, unified account of the exploration-exploitation trade-off that does not require auxiliary mechanisms such as epsilon-greedy exploration, upper confidence bounds, or intrinsic reward bonuses.
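In one commonly cited form (see [26, 59] for derivations and the caveats attached to them), the EFE of a policy sums, over future time steps \(\tau\), a pragmatic term penalizing improbable outcomes under the preference prior \(p(o_\tau)\) and an epistemic term rewarding expected information gain about hidden states:

\[
G(\pi) \;\approx\; \sum_{\tau} \Big( -\,\mathbb{E}_{q(o_\tau \mid \pi)}\big[\ln p(o_\tau)\big] \;-\; \mathbb{E}_{q(o_\tau \mid \pi)}\, D_{\mathrm{KL}}\big[\,q(s_\tau \mid o_\tau, \pi)\,\big\|\,q(s_\tau \mid \pi)\,\big] \Big).
\]

Minimizing \(G\) therefore trades realizing preferences against resolving uncertainty about hidden states; the exact grouping and justification of these terms depend on modeling assumptions that are scrutinized below.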
The formal properties of the EFE have been scrutinized in detail. Millidge, Tschantz, and Buckley [59] provided a systematic derivation showing that the EFE can be motivated from multiple starting points (as a KL divergence between predictive and preferred distributions, as a variational bound, or as an information-theoretic objective) but that no single derivation simultaneously justifies all of its properties. This analysis revealed that the EFE is not simply the negative expected reward plus information gain, but involves subtle interactions between the agent's model of the environment and its prior preferences [59]. Sajid et al. [72] provided a comparative analysis showing that active inference under EFE can recover Bayesian RL, information-directed sampling, and KL-regularized control as special cases depending on the choice of generative model and prior preferences.
Empirically, the exploration properties of the EFE have been demonstrated in grid-world navigation [26], T-maze tasks [27], multi-armed bandits [77], and ecological foraging simulations [41]. Active inference agents navigating ambiguous environments where multiple locations produce identical observations spontaneously generate epistemic actions that disambiguate their position, a behavior absent in greedy planners that select actions based solely on expected reward [49, 76, 95]. Schubert [76] extended this finding to autonomous reconnaissance missions, showing that EFE-based route planning naturally prioritizes areas of high evidential uncertainty, maintaining situational awareness without explicit exploration objectives. Most demonstrations remain in small-scale discrete domains, however, raising questions about the scalability of the EFE computation to high-dimensional problems (see Section 4.5).
4.3 Theoretical Connections and Critiques
Active inference's relationship to established frameworks has been a subject of both productive synthesis and ongoing debate. Several analyses have demonstrated formal equivalences between active inference and specific RL formulations under appropriate conditions. Tschantz et al. [94] showed that variational policy gradients, a deep RL algorithm derived from the active inference framework, recover standard policy gradient methods when the generative model is parameterized as a neural network and the variational bound is optimized by gradient descent. Millidge [58] demonstrated that deep active inference with amortized inference yields algorithms closely related to actor-critic methods, with the free energy objective playing the role of the critic.
These equivalences obscure important differences in perspective and emphasis. Active inference is fundamentally a model-based framework. It requires an explicit generative model of the environment, and planning is performed by simulating future trajectories under this model [16, 27]. This stands in contrast to model-free maximum entropy RL, where the inference interpretation is used to derive update rules but planning-time inference over future states is not performed [33]. The emphasis on generative models also means that active inference naturally handles partial observability, state estimation, and model learning within a single framework [62, 79], whereas these are typically handled by separate components in conventional RL architectures.
A particularly sharp unification result comes from Watson et al. [98], who showed that active inference reduces to partially observed control-as-inference when costs are defined in observation space. This mathematical equivalence has a provocative practical implication. Scalable, uncertainty-aware optimal control solvers developed in the probabilistic numerics community are not merely analogous to active inference but are valid implementations of it. The equivalence collapses the apparent gap between neuroscience-inspired active inference and engineering-oriented control-as-inference into a difference of emphasis rather than substance [98]. Sennesh et al. [78] reinforced this from the active inference side, deriving the proper time-averaged active inference objective from optimal control principles and formally reconnecting active inference to optimal feedback control theory. Their formulation recovers a sensorimotor objective that accommodates time-varying reference states, a capability absent from finite-horizon formulations, and demonstrates that standard discounted or finite-horizon EFE is best understood as a tractable approximation whose theoretical justification is weaker than typically acknowledged [78].
Critics have raised several concerns. The EFE's derivation has been called ad hoc, with the pragmatic and epistemic terms lacking a single coherent variational justification [10, 59]. The computational complexity of evaluating the EFE over all candidate policies grows combinatorially with the planning horizon, limiting applicability without approximations [16]. The claim that active inference subsumes all of RL as a special case remains contentious. While certain mappings exist, they require specific assumptions that do not hold universally [58, 72].
4.4 Approximation Shapes Behavior
A subtle but consequential finding is that the choice of approximate inference method in active inference is not an incidental implementation detail but is partially constitutive of the resulting behavior. Kouw [49] showed that when non-linear measurement functions are approximated with curvature-sensitive Gaussian methods (for example, second-order Taylor expansion), a state-dependent ambiguity term emerges automatically in the variational free energy, even without an explicit prior over preferred observations. The degree of ambiguity-avoidance embedded in the agent's planning is therefore directly determined by the choice of approximation scheme. Curvature-sensitive approximations produce ambiguity-avoidance, curvature-blind ones do not [49].
This observation inverts the usual relationship between model and algorithm in planning. In classical planning, the model specifies the problem and the algorithm solves it. The algorithm is a tool whose properties (convergence rate, memory usage) are distinct from the solution's properties (optimality, feasibility). In planning as inference, the choice of approximate inference scheme is partially constitutive of the solution itself, that is, of the resulting behavior. This entanglement is simultaneously a feature (it provides a principled design knob for shaping behavior) and a challenge (it means behavior analysis requires joint analysis of model and inference method, not model alone) [49, 54].
4.5 Scaling Active Inference
Recognizing the scalability challenge, recent work has pursued deep active inference, using neural networks for both the generative model and the inference computations. Fountas et al. [23] combined deep generative models with Monte Carlo tree search to scale active inference to visual environments, demonstrating performance comparable to model-based RL on Animal-AI and dSprites tasks. Catal et al. [12] used variational autoencoders as the generative model and showed that active inference agents can learn to navigate from pixel observations. Millidge [58] and Tschantz et al. [94] showed that amortized inference (training a neural network to directly output approximate posteriors) can bypass the combinatorial policy evaluation problem, yielding scalable deep active inference agents.
The introduction of sophisticated generative model architectures has further expanded the scope. Champion et al. [13] implemented active inference using variational message passing on factor graphs with neural network factors, demonstrating more principled approximate inference than direct gradient-based optimization. Paul et al. [63] proposed an active inference formulation compatible with transformer-based sequence models, and several groups have explored connections to world models in the Dreamer family [37, 71]. Van der Himst and Lanillos [96] coupled active inference with variational autoencoders to handle high-dimensional continuous observation spaces in partially observable domains, achieving performance comparable to or exceeding deep Q-learning. This is the first direct empirical evidence that the EFE framework is practically competitive with state-of-the-art deep RL in domains of non-trivial complexity, although deep active inference has not yet matched the performance of state-of-the-art deep RL on the hardest benchmarks [23, 72].
Integration with learned cognitive map representations has further relaxed the hand-crafted generative model constraint. Van de Maele et al. [95] combined EFE-based planning with clone-structured cognitive graphs (CSCGs) that learn environmental structure from observation sequences. The resulting system acquires a world model from data and plans over it using EFE minimization, demonstrating that the free energy framework is compatible with data-driven representations rather than requiring hand-specification. CSCGs naturally capture the aliased observation structure that makes planning difficult, the same representational property that EFE-based planning exploits for epistemic action generation [95].
4.6 Extensions
Two recent extensions probe the boundaries of the active inference framework along orthogonal dimensions. Heins et al. [40] investigated whether collective behavior in multi-agent active inference systems can be understood as inference at the population level. Using spin glass models as a testbed, they established a formal correspondence between individual-scale free energy minimization and collective-scale Boltzmann distributions, showing that, in effect, a collection of active inference agents can implement sampling-based inference at the population level. This multi-scale equivalence is fragile. It breaks under modifications to individual generative models or interaction topology that might appear minor from an engineering perspective [40]. The implication is that emergent collective inference is a special rather than generic property of multi-agent active inference, a sobering finding for those hoping the free energy principle provides a universal bridge between individual and collective computation.
Schubert [76] extended active inference in a different direction, integrating Dempster-Shafer evidential belief functions into the generative model via pignistic probability transformations. This extension demonstrates that variational free energy can be computed against non-Bayesian uncertainty representations, belief functions that distinguish between ignorance and conflicting evidence in ways that standard probability distributions cannot. The practical motivation is autonomous reconnaissance, where sensor data may be ambiguous, contradictory, or absent, and the standard Bayesian apparatus of precise probability assignments may be epistemically overcommitted [76]. Stochastic shortest-path formulations further generalize the temporal structure. Baioumy et al. [6] showed that SSP MDPs, which do not require a fixed horizon or discount factor, can be solved as probabilistic inference problems, enabling both online and offline planning under the inference-as-planning paradigm. These extensions suggest that while the core free energy minimization imperative is robust, its interface with diverse uncertainty representations and temporal structures requires careful mathematical engineering.
5. Maximum Entropy Reinforcement Learning and Soft Methods
5.1 The Maximum Entropy Objective
The maximum entropy framework augments the standard RL objective with a causal entropy term, yielding (\(J(\pi) = \sum_t \mathbb{E}[r(s_t, a_t) + \alpha \mathcal{H}[\pi(\cdot \mid s_t)]]\)), where (\(\alpha\)) is a temperature parameter controlling the stochasticity-reward trade-off [32, 102]. This augmented objective has a direct inference interpretation. Maximizing (\(J(\pi)\)) is equivalent to computing the posterior over actions in the optimality-variable graphical model described in Section 2, with the temperature (\(\alpha\)) corresponding to the weight of the prior relative to the likelihood [54]. The entropy bonus emerges naturally from the KL divergence between the policy and a uniform reference distribution.
The theoretical grounding for maximum entropy RL is rich. Ziebart et al. [102] introduced the framework in the context of inverse reinforcement learning, showing that the maximum causal entropy distribution over trajectories is the unique distribution that matches feature expectations while remaining maximally uncertain about the demonstrator's intent. Todorov [87] and Kappen [47] independently arrived at equivalent formulations from the stochastic optimal control perspective. The convergence of these threads (from inverse RL, from linearly solvable MDPs, and from the graphical model formulation) provided strong theoretical support for the maximum entropy objective as the correct way to incorporate the inference perspective into RL [54, 61].
Several desirable properties follow from the entropy regularization. Policies are stochastic and multi-modal, capturing multiple near-optimal solutions rather than collapsing to a single deterministic action [32]. This stochasticity provides built-in exploration, a form of robustness to model misspecification, and improved compositionality across tasks [22, 33]. The connection to robustness is particularly notable. Eysenbach and Levine [22] proved that maximum entropy RL solves certain robust RL problems, providing a formal justification for the common intuition that stochastic policies are more robust than deterministic ones.
5.2 Soft Value Functions and Soft Q-Learning
The inference interpretation leads to soft counterparts of standard RL quantities. The soft value function (\(V_{\text{soft}}(s) = \alpha \log \sum_a \exp(Q_{\text{soft}}(s, a)/\alpha)\)), a log-sum-exp (softmax) over Q-values rather than a hard maximum, arises as the normalization constant (log-partition function) in the posterior over actions [32, 54]. The soft Bellman equation replaces the max operator with this soft operator, and convergence of soft value iteration follows from contraction arguments analogous to the standard case [32]. Fox et al. [24] developed an alternative soft update through G-learning, taming overestimation bias via KL regularization against a baseline policy.
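In tabular form the soft backup is a one-line change to value iteration. The Python sketch below (illustrative shapes and names, not code from [32]) replaces the hard max with a log-sum-exp and reads off the maximum entropy policy as a softmax over Q-values.

```python
import numpy as np

def soft_value_iteration(P, R, gamma=0.99, alpha=1.0, iters=500):
    """Tabular soft (maximum entropy) value iteration; a minimal sketch.

    P     : (A, S, S) transition probabilities P[a, s, s'].
    R     : (S, A) rewards.
    alpha : temperature; alpha -> 0 recovers the hard Bellman backup.
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Soft Q backup: Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        # Soft value: log-sum-exp replaces the max; this is the log-partition
        # function of the action posterior pi*(a|s) ∝ exp(Q(s,a)/alpha).
        m = Q.max(axis=1, keepdims=True)
        V = (m + alpha * np.log(np.exp((Q - m) / alpha).sum(axis=1, keepdims=True)))[:, 0]
    pi = np.exp((Q - V[:, None]) / alpha)          # optimal maximum entropy policy
    pi /= pi.sum(axis=1, keepdims=True)            # guard against numerical drift
    return V, Q, pi
```

The log-sum-exp backup inherits the contraction property of the standard Bellman operator, which is what the convergence argument in [32] relies on.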
Soft Q-learning [32] operationalized this framework using deep neural networks and amortized Stein variational gradient descent to sample from the energy-based optimal policy (\(\pi^*(a \mid s) \propto \exp(Q_{\text{soft}}(s, a)/\alpha)\)). While theoretically appealing, Soft Q-learning's reliance on approximate sampling from the energy-based policy limited its practical stability and scalability. The key algorithmic breakthrough came with Soft Actor-Critic (SAC) [33, 34], which replaced the sampling step with an explicit actor network trained to minimize the KL divergence to the soft Q-induced policy. SAC maintains separate Q-networks and a policy network, uses the reparameterization trick for gradient estimation, and includes automatic temperature tuning to adaptively set (\(\alpha\)).
The impact of SAC on the field has been substantial. It achieved state-of-the-art sample efficiency and asymptotic performance on continuous control benchmarks including MuJoCo locomotion, dexterous manipulation, and real-world robotic tasks [34, 35]. Its off-policy nature and entropy regularization provide better exploration and training stability compared to on-policy maximum entropy methods. SAC remains, as of 2026, one of the most widely used baseline algorithms in continuous RL, and its influence extends to downstream applications in robotic learning [45], autonomous driving, and game playing.
5.3 Policy Optimization as Posterior Inference
Beyond specific algorithms, the maximum entropy framework has illuminated the general relationship between policy optimization and posterior inference. Schulman, Chen, and Abbeel [74] demonstrated a formal equivalence between policy gradients and soft Q-learning, showing that both can be viewed as performing approximate inference in the same graphical model but with different variational families. Nachum et al. [61] further bridged the gap between value-based and policy-based methods through the path consistency learning framework, which enforces soft Bellman consistency along sampled trajectories, combining the strengths of both approaches.
The EM perspective on policy optimization, discussed in Section 3.4, was extended to the maximum entropy setting by Abdolmaleki et al. [1] in the MPO algorithm. MPO decouples policy improvement (computing the posterior over actions given the Q-function as likelihood) from policy fitting (a supervised learning problem), yielding an algorithm with excellent stability properties and strong performance on both simulated and real robotic tasks [1, 70]. The information-geometric structure of these KL-constrained updates has been analysed through the lens of natural gradient methods and trust regions [28, 75], revealing that many apparently distinct algorithms are performing different approximations to the same underlying inference computation. Sun et al. [82] consolidated these derivations in a tutorial that walks through the PGM structure of deep RL, explicitly showing how variational inference over trajectory PGMs recovers SAC-style entropy regularization and how uncertainty-driven exploration emerges from posterior variance over value functions.
A recent direction extends the maximum entropy principle to structured and hierarchical policies. Galashov et al. [28] introduced information asymmetry between levels of a hierarchy through KL-regularized objectives, encouraging higher levels to communicate only task-relevant information. Tirumala et al. [86] used the inference perspective to derive algorithms for learning default behaviours that serve as priors for downstream tasks. Grau-Moya et al. [30] extended soft Q-learning to two-player stochastic games. These developments demonstrate the continuing fertility of the inference viewpoint for generating new algorithmic ideas, even as the specific algorithms evolve.
5.4 Early Bayesian Value Functions
The soft RL branch of the planning-as-inference paradigm evolved in part from earlier efforts to treat value functions as objects of probabilistic inference. Xia et al. [101] proposed modeling action-value functions as latent variables under a Gaussian process prior, with observed rewards serving as likelihood terms. Bayesian posterior updates replaced standard TD learning. The Q-function posterior mean provided the value estimate, while the posterior variance furnished an exploration bonus: states with high value uncertainty are preferentially visited [101]. This is an early and concrete instantiation of planning as inference applied directly to the value function, predating the systematic PGM-based formulations that would later provide its theoretical context.
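The mechanics are those of standard GP regression with an uncertainty bonus. The sketch below is a generic reconstruction of the idea rather than the exact algorithm of [101]; the `kernel` callable and bonus weight `beta` are assumptions for illustration.

```python
import numpy as np

def gp_q_scores(X_train, y_train, X_query, kernel, noise=0.1, beta=1.0):
    """GP posterior over Q-values with a variance exploration bonus (sketch).

    X_train : (N, d) visited state-action features, y_train : (N,) return targets.
    kernel  : callable k(A, B) -> covariance matrix between rows of A and B.
    """
    K = kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = kernel(X_query, X_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star @ alpha                                   # posterior mean Q estimate
    v = np.linalg.solve(L, K_star.T)
    var = np.diag(kernel(X_query, X_query)) - np.sum(v ** 2, axis=0)
    # Optimistic scoring: mean plus uncertainty bonus drives directed exploration.
    return mean + beta * np.sqrt(np.clip(var, 0.0, None))
```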
The significance of the GP-based approach extends beyond its immediate algorithmic contribution. By deriving exploration from posterior uncertainty rather than from ad hoc (\(\epsilon\))-greedy or Boltzmann strategies, it anticipates the central insight of maximum entropy RL. Entropy and uncertainty are not obstacles to be overcome but information-theoretic quantities to be managed within a coherent probabilistic framework [82, 101]. The limitation of the GP approach, poor scaling to high-dimensional state spaces due to cubic computational cost, motivated the shift to parametric approximations and eventually to the deep network-based methods that characterize modern soft RL.
5.5 Limitations and the Temperature Problem
Despite its successes, the maximum entropy framework faces well-known limitations. The temperature parameter (\(\alpha\)) controls the trade-off between reward maximization and entropy, and its optimal value depends on the reward scale, the effective action dimensionality, and the stage of training [34]. While automatic temperature tuning via a constrained optimization has proven effective in practice [34], the target entropy is itself a hyperparameter, and poor temperature schedules can lead to either insufficient exploration (low (\(\alpha\))) or excessive stochasticity (high (\(\alpha\))).
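In practice the constrained objective of [34] is optimized by a simple dual update on the log-temperature. The PyTorch fragment below is a minimal sketch of that update as it appears in common SAC implementations; the target-entropy heuristic, action dimensionality, and learning rate are placeholders.

```python
import torch

action_dim = 6                               # placeholder: continuous action dimension
target_entropy = -float(action_dim)          # common heuristic target from [34]
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def temperature_update(log_pi):
    """log_pi: log-probabilities of actions sampled from the current policy."""
    # J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]: alpha grows when policy
    # entropy falls below the target and shrinks when it exceeds the target.
    alpha_loss = -(log_alpha.exp() * (log_pi + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()
```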
More fundamentally, the maximum entropy objective does not always align with the true desiderata of planning. In safety-critical settings, maximal entropy is precisely what one does not want. Deterministic, high-confidence policies are required [2]. The connection between entropy regularization and robustness, while formally established [22], holds only for specific classes of perturbations, and the practical robustness gains vary across domains. The assumption of a uniform reference policy is restrictive. When the natural baseline behavior is non-uniform, the standard entropy bonus can be inappropriate, motivating KL-regularized variants with learned priors [28, 86]. A recurring critique also concerns the distinction between undirected entropy-based exploration and directed epistemic exploration. Maximum entropy RL provides the former, while active inference's EFE decomposition targets the latter. The two are not equivalent, and Lee et al. [53] connected maximum entropy RL to state marginal matching as an alternative view of exploration.
6. Model-Based Probabilistic Planning
6.1 Integrated Reasoning and Structured Planning
The graphical-model view of planning provides a natural substrate for integrating reasoning across abstraction levels. Toussaint [92] demonstrated that probabilistic inference over a single graphical model can integrate motor control, grasp planning, and symbolic reasoning in a blocks-world domain, levels of abstraction that classical approaches treat with separate, loosely coupled algorithms. By representing kinematic constraints, contact physics, and high-level action sequences as factors in a joint distribution, inference-based planning achieves a degree of integration that hierarchical optimization approaches struggle to match. The key mechanism is bidirectional uncertainty propagation. Beliefs about high-level action feasibility propagate downward to constrain motor plans, while sensory evidence about physical state propagates upward to inform symbolic reasoning [92]. This bidirectional information flow is a natural consequence of inference in graphical models but must be explicitly engineered in classical planning architectures.
Toussaint [91] extended this line of work to robot trajectory optimization using approximate inference over a trajectory graphical model, while Solway and Botvinick [80] developed an account of goal-directed decision making as probabilistic inference, bridging planning as inference with models of prefrontal function in cognitive neuroscience. Botvinick and Toussaint [11] surveyed the cognitive-science implications of the planning-as-inference view, arguing that reformulating planning as inference clarifies long-standing puzzles about goal representation, habit formation, and hierarchical control. The emergent claim across these threads is that the inference formulation is not only computationally useful but cognitively natural, providing a common currency for integrating perception, action, and memory.
6.2 Gaussian Process Motion Planning and Trajectory Inference
An application of planning as inference with particular traction in robotics casts continuous motion planning as posterior inference over a trajectory distribution with a Gaussian process prior. Dong et al. [20] introduced Gaussian process motion planning (GPMP), representing trajectories as samples from a GP prior defined by a linear stochastic differential equation. Obstacle avoidance, kinematic constraints, and goal-reaching terms enter as likelihoods, and maximum a posteriori inference via factor graphs recovers smooth, collision-free trajectories. Mukadam et al. [57] extended this to continuous-time trajectory estimation and planning, exploiting the sparse block-tridiagonal structure induced by the GP prior to enable efficient iSAM-style incremental inference. This approach unifies trajectory estimation (a classical SLAM problem) with trajectory optimization (a classical planning problem) under a single Bayesian umbrella.
Earlier entropy-regularized trajectory optimization methods such as CHOMP [69] and STOMP [46] can be understood retrospectively through the same lens. The cost-weighted averaging that defines STOMP's noisy rollouts is formally an importance-sampling estimate of the posterior over trajectories under a Gaussian prior, and CHOMP's functional gradient updates correspond to MAP inference in a related model [46, 48, 69]. Guided policy search [55] further connected trajectory optimization with policy learning, using trajectory-optimal controllers as teachers for a neural policy trained via KL-regularized supervision.
A more recent line of work generates distributions over trajectories rather than single solutions. Lambert et al. [52] proposed entropy-regularized motion planning via Stein variational inference, addressing a practical challenge that the theoretical literature often elides. Generating not a single optimal trajectory but a distribution over diverse, high-quality trajectories is often essential for downstream tasks such as imitation learning, where broad coverage of the trajectory space is needed for robust policy transfer. By formulating the trajectory distribution as a Boltzmann posterior with an energy function defined by collision costs, goal-reaching objectives, and smoothness priors, Lambert et al. [52] transformed motion planning into a sampling problem amenable to Stein Variational Gradient Descent (SVGD). The resulting algorithm produces a particle-based approximation to the posterior trajectory distribution that is both diverse (covering multiple homotopy classes in environments with obstacles) and high-quality, with each trajectory satisfying kinematic and collision constraints. The entropy regularization here serves a distinct purpose compared to soft RL. Rather than promoting exploration during learning, it ensures coverage of the solution space for downstream use.
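The core computation is the standard SVGD update applied to flattened trajectory particles. The sketch below uses an RBF kernel with a median-heuristic bandwidth; `grad_log_post` (the gradient of the negative trajectory energy) is assumed to be supplied by the planning problem and is not part of any published interface in [52].

```python
import numpy as np

def svgd_step(particles, grad_log_post, step_size=1e-2, bandwidth=None):
    """One Stein variational gradient descent update over trajectory particles.

    particles     : (K, D) flattened trajectory parameters.
    grad_log_post : callable returning (K, D) gradients of the log-posterior.
    """
    K, D = particles.shape
    diffs = particles[:, None, :] - particles[None, :, :]        # (K, K, D)
    sq_dists = (diffs ** 2).sum(-1)                              # (K, K)
    if bandwidth is None:                                        # median heuristic
        bandwidth = np.median(sq_dists) / max(np.log(K + 1), 1e-8)
    kern = np.exp(-sq_dists / (bandwidth + 1e-12))               # RBF kernel matrix
    grads = grad_log_post(particles)                             # (K, D)
    # Attractive term pulls particles toward high posterior density;
    # repulsive (kernel-gradient) term keeps the particle set diverse.
    repulsive = (kern[:, :, None] * diffs).sum(axis=1) * (2.0 / (bandwidth + 1e-12))
    phi = (kern @ grads + repulsive) / K
    return particles + step_size * phi
```

The repulsive term is what delivers the coverage property emphasized above: without it, all particles would collapse onto the single MAP trajectory.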
6.3 Gaussian Process Dynamics and Probabilistic Neural Network Models
Model-based approaches that learn an explicit dynamics model and use it for planning make the inference interpretation particularly natural. Planning is inference over trajectories in a learned probabilistic model. The paradigmatic early work is PILCO [18], which uses Gaussian process dynamics models and performs analytic moment matching to propagate uncertainty through the planning horizon. By maintaining calibrated uncertainty estimates over dynamics, PILCO achieves remarkable sample efficiency, learning to swing up a cart-pole from scratch in under 30 seconds of interaction [18]. The GP-based approach makes the inference nature of planning explicit. Policy optimization in PILCO involves computing expected returns under the posterior predictive distribution of the GP, which is an integration problem. The cubic scaling of GPs with data size, the difficulty of handling high-dimensional observations, and the analytic intractability of moment matching for complex nonlinear systems limit scalability [15, 18], motivating the transition to neural-network-based dynamics models.
The transition from GP to neural network dynamics models required new approaches to uncertainty quantification. PETS (Probabilistic Ensemble Trajectory Sampling) [15] combined ensembles of probabilistic neural networks with trajectory sampling for planning, achieving sample efficiency competitive with PILCO but scaling to higher-dimensional tasks. The ensemble approach captures both aleatoric uncertainty (via the probabilistic outputs of each network) and epistemic uncertainty (via disagreement across ensemble members), providing a practical approximation to the posterior over dynamics [15].
Planning in these models uses sampling-based inference methods, typically random shooting or the cross-entropy method, to find action sequences that maximize expected return under the dynamics ensemble [15, 43]. This is directly analogous to path integral control (Section 3.3), with the learned ensemble replacing a known dynamics model. MBPO (Model-Based Policy Optimization) [43] extended this by using the learned model to generate synthetic data for model-free policy learning, effectively amortizing the planning computation into the policy network. The resulting algorithm achieved strong performance on MuJoCo tasks while retaining much of the sample efficiency of pure model-based methods [43].
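The sampling-based inner loop is typically a few lines of cross-entropy method over action sequences. In the sketch below, `rollout_return` stands in for return prediction under the learned ensemble and is an assumed, user-supplied callable; hyperparameters are illustrative.

```python
import numpy as np

def cem_plan(rollout_return, horizon, action_dim, iters=5,
             pop_size=400, elite_frac=0.1, init_std=0.5, rng=None):
    """Cross-entropy method planning over action sequences; a minimal sketch.

    rollout_return : callable mapping (pop_size, horizon, action_dim) action
                     sequences to (pop_size,) predicted returns under the model.
    """
    rng = rng or np.random.default_rng()
    mean = np.zeros((horizon, action_dim))
    std = np.full((horizon, action_dim), init_std)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(iters):
        # Sample candidate plans from the current diagonal-Gaussian proposal.
        plans = mean + std * rng.standard_normal((pop_size, horizon, action_dim))
        returns = rollout_return(plans)
        elites = plans[np.argsort(returns)[-n_elite:]]
        # Refit the proposal to the elite set: a hard-thresholded posterior
        # update over action sequences.
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]        # execute the first action, then replan (MPC style)
```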
The accuracy of the learned dynamics model is critical, and model errors compound over long planning horizons. This problem, studied under the rubric of model exploitation or model bias [15, 43, 51], has motivated shorter planning horizons, model ensembles, and uncertainty-aware planning objectives. The inference perspective provides a natural framework for addressing model bias. Rather than treating the learned dynamics as ground truth, one should marginalize over model uncertainty when computing expected returns, a computation that is itself an inference problem [15, 19].
6.4 Latent-Space World Models
A major advance in model-based probabilistic planning has been the development of world models that learn compact latent representations of the environment and plan in this latent space. Ha and Schmidhuber [31] demonstrated that agents can learn a variational autoencoder for perception and a recurrent dynamics model in latent space, then train a policy entirely within the dream of this world model. The Dreamer line of work [36, 37, 38, 39] formalized and scaled this approach, using the recurrent state-space model (RSSM) to learn latent dynamics and performing policy optimization by differentiating through the imagined trajectories.
The RSSM architecture separates deterministic and stochastic components of the state, allowing the model to capture both the predictable structure and the inherent uncertainty of the environment [36]. Planning in this framework involves sampling from the learned posterior over latent states (perception), rolling out trajectories under the learned dynamics (imagination), and optimizing the policy to maximize imagined returns (acting), a pipeline that maps cleanly onto the perception-inference-action loop of the planning-as-inference framework [37]. DreamerV3 [39] demonstrated that this approach, with appropriate architectural choices, achieves competitive or superior performance across a range of domains (Atari games, continuous control, Minecraft) using a single set of hyperparameters.
The latent-space planning approach has deep connections to active inference (Section 4), where the generative model over observations and hidden states is central. Several authors have noted that the Dreamer architecture can be interpreted as implementing a form of active inference, with the RSSM as the generative model and the actor-critic policy optimization as an amortized approximation to expected free energy minimization [23, 71]. Whether these connections are merely analogical or reflect a deeper structural equivalence remains debated, although the mathematical unification reported by Watson et al. [98] suggests the latter.
6.5 Generative Models for Trajectory Optimization
The most recent development in model-based probabilistic planning replaces explicit dynamics models with generative models over entire trajectories, making the inference interpretation literal. Planning is sampling from a conditional distribution over trajectories. The Diffuser framework [44] trains a diffusion model on a dataset of trajectories and plans by sampling from this model conditioned on desired properties (start state, goal state, or return). The denoising process of the diffusion model progressively refines a noisy trajectory into a coherent plan, naturally handling multi-modality and long-horizon dependencies.
This trajectory-level approach offers several advantages over step-by-step dynamics models. The diffusion model captures the joint distribution over entire trajectories, avoiding the compounding error problem of autoregressive rollouts [44]. Conditioning can be applied flexibly (on any subset of states, actions, or returns at any timestep) by composing the diffusion model with classifiers or by inpainting [3, 44]. Decision Diffuser [3] further demonstrated that conditioning a trajectory diffusion model on return, constraints, and skills enables flexible decision-making without explicit reward optimization.
The diffusion-for-planning paradigm has expanded rapidly. Chi et al. [14] applied diffusion models to visuomotor policy learning in robotic manipulation, achieving strong results on multi-modal manipulation tasks. Liang et al. [56] extended the framework to model-based offline RL with AdaptDiffuser, demonstrating self-evolving diffusion planners that adapt to new tasks via online rollouts. The theoretical foundation connects to score-based generative modeling and Langevin dynamics, which can be viewed as performing posterior sampling in the trajectory space, exactly the planning-as-inference computation, with the score function playing the role of the gradient of the log-posterior [3, 44]. Diffusion-based planning is computationally expensive due to the iterative denoising process, and its reliance on offline datasets limits applicability in online learning settings. Accelerating inference and extending to online, interactive settings are active areas of development.
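Goal conditioning by inpainting amounts to clamping the known entries of the trajectory array after every reverse-diffusion step. The fragment below is a schematic reconstruction of that idea; `denoise_step` abstracts one step of a pretrained trajectory diffusion model and is not an API from [44].

```python
import numpy as np

def inpainting_guided_denoise(denoise_step, tau_init, constraints, num_steps):
    """Plan by conditional sampling from a trajectory diffusion model (sketch).

    denoise_step : callable (tau, t) -> tau, one reverse-diffusion step.
    tau_init     : (H, dim) array of Gaussian noise over the full trajectory.
    constraints  : list of (timestep, column_indices, value) triples, e.g. the
                   start state clamped at t = 0 and the goal state at t = H - 1.
    """
    tau = tau_init.copy()
    for t in reversed(range(num_steps)):
        tau = denoise_step(tau, t)              # refine the whole plan jointly
        for idx, cols, value in constraints:
            tau[idx, cols] = value              # clamp observed entries (inpainting)
    return tau
```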
6.6 Probabilistic Programming for Planning
A complementary line of model-based work embeds planning inside probabilistic programming languages. Belle and Levesque [7] showed that stochastic planning with partial observability can be expressed as a probabilistic program, with policy synthesis delegated to the language's inference engine. This approach offers extreme expressiveness (continuous variables, recursive structures, and open-universe semantics are all supported) at the cost of general-purpose inference inefficiency. The tension between expressive modeling and efficient inference is resolved on a case-by-case basis by imposing structural assumptions (factor graphs, conjugate priors, or conditional independence) that reduce the general inference problem to a tractable one [7, 50]. In practice, the probabilistic-programming-for-planning approach complements rather than replaces the more specialized frameworks. Its value is greatest when the model is rich, data is scarce, and tractability is not the binding constraint.
7. Cross-Cutting Analysis
7.1 The Control-Inference Duality as Unifying Thread
The most striking pattern across the themes surveyed is the repeated rediscovery and reformulation of the same fundamental insight. Stochastic optimal control and probabilistic inference are two views of the same mathematical structure. This duality, established rigorously by Todorov [89] and Kappen [47] for KL-regularized control, manifests differently across traditions. In the maximum entropy RL literature, it appears as the equivalence between soft Bellman equations and message passing [32, 54]. In active inference, it appears as the claim that perception, learning, and action all minimize a single free energy functional [25, 27]. In model-based planning, it appears as the use of posterior sampling over trajectories as a planning algorithm [15, 44]. Each community has developed its own notation, terminology, and preferred approximations, but the underlying mathematical object (a posterior distribution over actions or trajectories conditioned on optimality) is shared.
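Written in one notation (essentially that of Levine's tutorial [54], up to constants), the shared object is explicit. With binary optimality variables $p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp r(s_t, a_t)$, the trajectory posterior is

$$
p(\tau \mid \mathcal{O}_{1:T} = 1) \;\propto\; p(\tau)\,\exp\!\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big),
$$

and exact backward message passing yields the softmax-flavored recursions

$$
Q(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}\mid s_t, a_t)}\big[e^{V(s_{t+1})}\big], \qquad
V(s_t) = \log \mathbb{E}_{p(a_t\mid s_t)}\big[e^{Q(s_t, a_t)}\big],
$$

with posterior policy $\pi(a_t \mid s_t) \propto p(a_t \mid s_t)\, e^{Q(s_t, a_t) - V(s_t)}$. The soft Bellman equations of maximum entropy RL arise when the optimistic $\log\mathbb{E}\exp$ over the dynamics is replaced by an ordinary expectation via the variational argument in [54], and the desirability function of linearly solvable MDPs is an exponentiated form of the same backward message.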
This convergence has been productive. Ideas flow between communities, and the same algorithm can be derived from multiple starting points, each derivation offering different insights. SAC can be derived from the soft Bellman equation [33], from an EM procedure [1], or from a variational inference perspective [54]. The EFE of active inference can be derived from a KL-regularized objective [59], from information-theoretic first principles [26], or from a Bayesian decision-theoretic framework [72]. The unification result of Watson et al. [98] collapses the distinction between active inference and control-as-inference when costs are specified in observation space. This redundancy of derivations provides cross-validation and suggests the framework has reached a degree of mathematical maturity.
7.2 Model-Free versus Model-Based Instantiations
A central tension in the field is between model-free and model-based instantiations of the inference view. Maximum entropy RL methods like SAC [33] use the inference framework to derive update rules but do not perform explicit inference over future states at decision time. The posterior computation is amortized into the policy network during training. Active inference [27] and model-based methods like Dreamer [37] and MPPI [99] perform online inference over future trajectories at planning time, using an explicit model of the world.
This distinction has practical consequences. Model-free inference-based methods inherit the computational efficiency of model-free RL (fast action selection at test time) but sacrifice the ability to adapt rapidly to new tasks or changes in the environment. Model-based methods can flexibly replan but face computational costs and model accuracy challenges [15, 43]. The trajectory of the field suggests a convergence. Methods like MBPO [43] and Dreamer [37] combine model-based planning with model-free policy learning, using the model to generate training data or compute targets but amortizing the final policy into a fast neural network. Diffusion-based planners [44] further blur the boundary by learning to sample from the posterior over trajectories without maintaining a separate dynamics model.
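A minimal sketch of the hybrid pattern this convergence points toward, in the spirit of MBPO [43]: short model rollouts branch from real states and feed a model-free, soft actor-critic style update, so the online inference is amortized into a fast policy. All interfaces here (`model.fit`, `model.step`, `agent.act`, `agent.update`, the buffers) are hypothetical placeholders, not any particular library's API.

```python
def hybrid_training_loop(env, model, agent, real_buffer, model_buffer,
                         epochs=100, rollout_len=5, steps_per_epoch=1000):
    """Model-based data generation + model-free amortized policy learning
    (illustrative sketch with placeholder interfaces)."""
    s = env.reset()
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            # Collect real experience with the current amortized policy.
            a = agent.act(s)
            s_next, r, done = env.step(a)
            real_buffer.add(s, a, r, s_next, done)
            s = env.reset() if done else s_next

        # Refit the probabilistic dynamics model on real data only.
        model.fit(real_buffer)

        # Branch short imagined rollouts from real states to limit model error.
        for s0 in real_buffer.sample_states(batch=256):
            sm = s0
            for _ in range(rollout_len):
                am = agent.act(sm)
                sm_next, rm = model.step(sm, am)   # sampled from the learned model
                model_buffer.add(sm, am, rm, sm_next, False)
                sm = sm_next

        # Model-free (soft actor-critic style) updates on mostly imagined data.
        agent.update(model_buffer, real_buffer)
```

Keeping the imagined rollouts short is what bounds compounding model error, the same concern that motivates the trajectory-level generative models of Section 6.5.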
7.3 Exploration-Exploitation as Inference
A third cross-cutting theme is the treatment of exploration-exploitation as a problem naturally solved by the inference framework. All three major traditions offer accounts of exploration. Active inference provides the most explicit treatment. The epistemic term of the EFE directly quantifies the value of information [26, 77]. Maximum entropy RL provides exploration through the entropy bonus, which encourages broad action distributions and prevents premature commitment to suboptimal modes [33]. Model-based methods leverage posterior uncertainty in the dynamics model to drive directed exploration [15, 19].
These approaches differ in important respects. The EFE's epistemic drive is state-directed. It explicitly values visiting states where the agent's model is uncertain [26]. The entropy bonus in maximum entropy RL is action-directed. It encourages diverse actions regardless of their epistemic value [33]. Model-based uncertainty drives a form of Thompson sampling, sampling models from the posterior and planning optimistically under each sample [15]. Recent work has attempted to combine these mechanisms. Tschantz et al. [94] showed that active inference agents can be trained with deep RL algorithms while retaining the epistemic drive of the EFE. Lee et al. [53] connected maximum entropy RL to state marginal matching, providing an alternative perspective on exploration. The relationship between these approaches, and which provides the most efficient exploration in practice, remains an active area of investigation.
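In rough form (notation simplified from [26] and [33]; the sign convention is that lower expected free energy is better), the first two mechanisms optimize visibly different quantities:

$$
G(\pi, \tau) \;=\; -\,\underbrace{\mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\big[\log q(s_\tau \mid o_\tau, \pi) - \log q(s_\tau \mid \pi)\big]}_{\text{epistemic value (expected information gain)}} \;-\; \underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\big[\log p(o_\tau \mid C)\big]}_{\text{pragmatic value}},
$$

$$
J(\pi) \;=\; \mathbb{E}_{\pi}\Big[\sum_t r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big].
$$

The epistemic term of $G$ rewards observations that would revise beliefs about hidden states, whereas the entropy term of $J$ rewards spread in the action distribution itself; the Thompson-style mechanism is different again, injecting exploration through the draw $\theta \sim p(\theta \mid \mathcal{D})$ rather than through the objective.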
7.4 Approximation as Design
A recurring insight across all themes is that the choice of approximate inference method is not merely an implementation detail but a design decision that shapes agent behavior in substantive ways. Kouw [49] showed this most explicitly for active inference. Curvature-sensitive Gaussian approximations produce ambiguity-avoiding agents while curvature-blind approximations do not. The same principle operates in soft RL, where the choice of variational family determines the expressiveness of the resulting policy [54], and in motion planning, where the choice between SVGD and alternative samplers affects trajectory diversity [52]. The implication is that behavior analysis in planning as inference requires joint analysis of model and inference method, not model alone.
Exact inference is tractable only in restricted settings (small discrete MDPs, linear-Gaussian dynamics, KL-structured costs). In all other cases, approximate inference is necessary, and the choice of approximation (variational, sampling-based, amortized) fundamentally shapes the resulting algorithm and its properties [16, 54]. Variational methods (mean-field, structured variational inference) provide fast, deterministic approximations but can underestimate uncertainty and miss multi-modal posteriors [27, 54]. Sampling methods (MCMC, SMC, path integrals) can capture multi-modality but are computationally expensive and suffer from high variance in high-dimensional spaces [66, 99]. Amortized inference (training a neural network to approximate the posterior) offers fast test-time computation but requires substantial training data and may not generalize to out-of-distribution states [33, 58].
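The variational route, for example, maximizes a lower bound on the log-evidence of optimality (following the exposition in [54], with the same optimality likelihood as above):

$$
\log p(\mathcal{O}_{1:T} = 1) \;\ge\; \mathbb{E}_{q(\tau)}\Big[\sum_{t} r(s_t, a_t)\Big] \;-\; \mathrm{KL}\big(q(\tau)\,\|\,p(\tau)\big),
$$

and the looseness of this bound under a restricted family, for instance a mean-field factorization that cannot represent multi-modal trajectory posteriors, is exactly where the choice of approximation leaves its imprint on behavior.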
7.5 Scalability Trajectories
The evolution of scalability across the planning-as-inference literature follows a common arc. Initial formulations target discrete, small-scale domains [50, 91], followed by extensions to continuous state spaces via Gaussian process or variational methods [49, 101], and finally integration with deep learning for high-dimensional domains [33, 44, 52, 96]. Soft RL has progressed furthest along this arc, with SAC and related algorithms routinely applied to problems with hundreds of state dimensions. Active inference has reached medium-scale domains through deep variational methods [23, 96] but has not yet demonstrated the scalability of its soft RL counterparts on complex continuous control benchmarks. Model-based probabilistic planning occupies an intermediate position, with scalability varying dramatically depending on the generative model's structure [7, 39, 92].
7.6 Methodological Trends
Several methodological shifts are evident across the two-decade survey period. First, there has been a movement from exact or structured inference (belief propagation, EM) toward amortized and sampling-based methods (VAEs, SVGD, particle methods), mirroring the broader machine learning community's shift toward scalable approximate methods. Second, the unit of inference has expanded from single actions or policies to full trajectory distributions, reflecting an increasing appreciation that distributional properties (coverage, diversity, robustness) are as important as point optimality [44, 52]. Third, the boundary between model-based and model-free has blurred. Deep active inference [96] learns its generative model from data, while model-free soft RL implicitly maintains a soft model through its entropy-regularized value functions [54]. Fourth, cross-pollination between communities has accelerated. Unification results [78, 98] and tutorial expositions [54, 82] explicitly bridge the active-inference and soft-RL derivations that once felt distant.
| Theme | Inference formulation | Typical inference method | Exploration mechanism | Representative references |
|---|---|---|---|---|
| Control-as-inference foundations | Posterior over trajectories given optimality variables | Message passing, EM, variational | Entropy of prior policy | [54, 68, 82, 93] |
| KL-regularized / linearly solvable MDPs | Linearized Bellman via desirability function | Spectral, path integral | KL to uncontrolled dynamics | [21, 47, 84, 87, 89] |
| Active inference | Generative model over observations, hidden states, policies | Variational message passing, amortized VI | Epistemic term in EFE | [25, 26, 27, 62, 95, 98] |
| Maximum entropy RL | Posterior over actions with optimality likelihood | Actor-critic with entropy regularization | Policy entropy bonus | [22, 32, 33, 61, 102] |
| Model-based probabilistic planning | Inference over trajectories in learned model | MPPI, CEM, random shooting, diffusion sampling | Model-ensemble disagreement | [15, 37, 43, 44, 99] |
| Trajectory-level GP and Stein methods | Posterior over trajectories with GP or Boltzmann prior | MAP inference on factor graphs, SVGD | Posterior variance / diversity | [20, 52, 57, 69] |
8. Open Problems and Future Directions
Closing the approximation gap. The formal elegance of the planning-as-inference framework rests on exact inference being possible, at least in principle, but practical implementations rely on coarse approximations. The quality of these approximations is rarely characterized, and the relationship between approximation error in inference and suboptimality in control is poorly understood [16, 54]. Developing tighter error bounds for approximate planning-as-inference, analogous to the performance bounds available for approximate dynamic programming [8], is an important theoretical direction. The finding that approximation choices constitutively shape agent behavior [49] further demands a systematic taxonomy linking classes of approximate inference methods to classes of guaranteed behavioral properties.
Beyond KL-structured costs. The exact duality between control and inference holds only for KL-regularized costs [47, 87]. General reward functions require approximations that break the duality, and it is unclear how much is lost in this approximation [68]. Extending the exact duality to broader classes of cost functions, or characterizing the approximation error for general costs, would significantly expand the framework's theoretical reach. Non-Bayesian and hybrid uncertainty representations, such as Schubert's [76] Dempster-Shafer integration, point to a largely unexplored design space. Planning under deep uncertainty, where agents cannot assign precise probabilities to outcomes, is common in real-world applications but is poorly served by standard Bayesian formulations.
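Concretely, in the linearly solvable setting (Todorov's formulation [87, 89], notation lightly adapted) the per-step cost is a state cost $q(s)$ plus a control penalty $\mathrm{KL}\big(u(\cdot \mid s)\,\|\,\bar p(\cdot \mid s)\big)$ on the controlled transition distribution $u$ relative to the passive dynamics $\bar p$; with desirability $z(s) = \exp(-V(s))$ the Bellman equation collapses to the linear relation

$$
z(s) \;=\; e^{-q(s)} \sum_{s'} \bar p(s' \mid s)\, z(s'),
$$

so computing the value function reduces to solving a linear system (or an eigenvalue problem in the infinite-horizon case). For general action-dependent costs no such linearization exists, which is exactly the gap this open problem names.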
Scalable active inference. Active inference provides the most principled treatment of exploration-exploitation and model-based planning within the inference framework, but has not yet scaled to the complex, high-dimensional benchmarks routinely handled by deep RL [23, 72]. Whether this is a fundamental limitation (the computational cost of the EFE evaluation) or an engineering challenge (the need for better approximate inference and generative model architectures) remains to be determined. Integrating active inference with the large-scale world models and diffusion-based planning methods of Section 6 is a promising direction, and initial results from deep active inference with VAE encoders [96] suggest feasibility. Deep RL algorithms that maintain explicit epistemic value, perhaps through information gain estimation in learned latent models, could combine the scalability of soft RL with the directed exploration of active inference.
Online generative planning. Diffusion-based and other generative planning methods currently rely on offline datasets [3, 44]. Extending these approaches to online settings, where the generative model is continually updated from new experience while maintaining coherent planning, is a largely unsolved problem. The interaction between generative model training and data collection, an inference-flavored version of the exploration problem, requires new theoretical and algorithmic tools.
Multi-agent planning as inference. Most work in planning as inference considers single-agent settings. Extending the framework to multi-agent systems, where planning involves inference over the joint behavior of multiple agents, raises fundamental challenges related to equilibrium concepts, communication, and computational complexity [29, 85]. Initial work has explored game-theoretic extensions [30] and factored inference under sparsity [50], and Heins et al. [40] established a formal but fragile correspondence between individual free energy minimization and collective-scale Boltzmann distributions. Characterizing the conditions under which multi-agent active inference systems exhibit robust emergent coordination, using tools from statistical physics and network science, would clarify when emergent collective inference is a generic property rather than a special case [40].
Unification of temporal formulations. The gap between time-averaged and finite-horizon active inference [78] parallels similar gaps in RL (average-reward vs. discounted vs. finite-horizon formulations). A unified treatment that relates these temporal formulations within the inference-as-planning framework, perhaps via connections to ergodic theory or large-deviation principles, would resolve ongoing confusion about which results transfer across formulations and which are artifacts of particular temporal assumptions. Similar clarity is needed about the relationship between the stochastic shortest-path formulation [6], the discounted formulation, and the finite-horizon formulation, each of which is tractable under different inference strategies.
Cognitive science validation. Active inference and planning-as-inference have been proposed as models of human and animal cognition [5, 11, 27]. Rigorous empirical validation against behavioral and neural data, beyond the small-scale demonstrations typical of the current literature, is needed to assess whether these models provide genuinely predictive accounts of biological decision-making or merely flexible redescriptions [10]. Solway and Botvinick [80] and Baker et al. [5] offer concrete targets for such validation in the cognitive science tradition, but sustained cross-disciplinary benchmarking is still scarce.
Foundation-model planners through the inference lens. Large language models and diffusion models increasingly serve as implicit planners in robotics and embodied AI. Their behavior, sampling trajectories or action sequences from a learned conditional distribution, is a clear instance of planning as inference but sits outside the canonical theoretical framework. Extending the inference-as-planning formalism to cover foundation-model planners, and characterizing when their sampling behavior implements principled posterior inference versus heuristic generation, would unify contemporary empirical practice with the theoretical tradition surveyed here [44, 63].
9. Conclusion
Two decades of research have established planning as inference as a mature and productive paradigm for sequential decision-making. The theoretical foundations (the duality between KL-regularized control and probabilistic inference, the graphical model formulation, and the path integral perspective) provide a unified mathematical language that connects disparate algorithmic traditions in reinforcement learning, control theory, and cognitive science [47, 54, 87]. The three major research programmes within this paradigm (active inference, maximum entropy RL, and model-based probabilistic planning) have each contributed distinctive algorithmic ideas and empirical advances, from the exploration-driving expected free energy of active inference [26], through the practically dominant Soft Actor-Critic [33], to the trajectory-level generative planning of diffusion models [44]. Recent unification results [78, 98] have revealed that these trajectories converge mathematically, even as their communities remain sociologically distinct.
The most important insight of this body of work is that the choice to frame planning as inference is not merely a mathematical trick but a design choice with concrete algorithmic consequences. It yields policies that are naturally stochastic, exploratory, and composable, and it enables the direct application of the rich toolkit of approximate inference methods to decision-making problems. Crucially, the choice of inference method within this framework is itself constitutive of behavior [49, 54], making the design of planners a joint task over model and approximation. The central challenge going forward is scaling, making inference-based planning practical in the high-dimensional, partially observed, multi-agent settings that characterize real-world decision-making, while preserving the principled exploration, uncertainty-awareness, and cross-abstraction integration that motivate the inference view in the first place.
Citation
If you find this survey useful, please cite it as
@misc{planning_as_inference_survey_2026,
author = {Hu Tianrun},
title = {Planning as Inference in Probabilistic Models},
year = {2026},
publisher = {GitHub},
url = {https://h-tr.github.io/blog/surveys/planning-as-inference.html}
}
References
- Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., & Riedmiller, M. (2018). “Maximum a Posteriori Policy Optimisation.” ICLR 2018.
- Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). “Constrained Policy Optimization.” ICML 2017, pp. 22–31.
- Ajay, A., Du, Y., Gupta, A., Tenenbaum, J. B., Jaakkola, T., & Agrawal, P. (2023). “Is Conditional Generative Modeling All You Need for Decision Making?” ICLR 2023.
- Attias, H. (2003). “Planning by Probabilistic Inference.” AISTATS 2003.
- Baker, C. L., Saxe, R., & Tenenbaum, J. B. (2009). “Action Understanding as Inverse Planning.” Cognition, 113(3), 329–349.
- Baioumy, M., Mattamala, M., & Hawes, N. (2021). “On Solving a Stochastic Shortest-Path Markov Decision Process as Probabilistic Inference.” arXiv preprint arXiv:2109.05866.
- Belle, V., & Levesque, H. J. (2018). “Probabilistic Planning by Probabilistic Programming.” arXiv preprint arXiv:1801.08365.
- Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. II, 4th ed. Athena Scientific.
- Bhardwaj, M., Sundaralingam, B., Mousavian, A., Ratliff, N., Fox, D., Ramos, F., & Boots, B. (2022). “STORM. An Integrated Framework for Fast Joint-Space Model-Predictive Control for Reactive Manipulation.” CoRL 2022.
- Biehl, M., Pollock, F. A., & Kanai, R. (2021). “A Technical Critique of Some Parts of the Free Energy Principle.” Entropy, 23(3), 293.
- Botvinick, M., & Toussaint, M. (2012). “Planning as Inference.” Trends in Cognitive Sciences, 16(10), 485–488.
- Çatal, O., Wauthier, S., De Boom, C., Verbelen, T., & Dhoedt, B. (2019). “Learning Perception and Planning with Deep Active Inference.” ICAART 2019.
- Champion, T., Grześ, M., & Bowman, H. (2022). “Branching Time Active Inference with Bayesian Filtering.” Neural Computation, 34(10), 2132–2144.
- Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., & Song, S. (2023). “Diffusion Policy. Visuomotor Policy Learning via Action Diffusion.” RSS 2023.
- Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). “Deep Reinforcement Learning in a Handful of Trials Using Probabilistic Dynamics Models.” NeurIPS 2018.
- Da Costa, L., Parr, T., Sajid, N., Veselic, S., Neacsu, V., & Friston, K. (2020). “Active Inference on Discrete State-Spaces. A Synthesis.” Journal of Mathematical Psychology, 99, 102447.
- Dayan, P., & Hinton, G. E. (1997). “Using Expectation-Maximization for Reinforcement Learning.” Neural Computation, 9(2), 271–278.
- Deisenroth, M. P., & Rasmussen, C. E. (2011). “PILCO. A Model-Based and Data-Efficient Approach to Policy Search.” ICML 2011.
- Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., & Udluft, S. (2018). “Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-Sensitive Learning.” ICML 2018.
- Dong, J., Mukadam, M., Dellaert, F., & Boots, B. (2016). “Motion Planning as Probabilistic Inference Using Gaussian Processes and Factor Graphs.” RSS 2016.
- Dvijotham, K., & Todorov, E. (2010). “Inverse Optimal Control with Linearly-Solvable MDPs.” ICML 2010.
- Eysenbach, B., & Levine, S. (2022). “Maximum Entropy RL (Provably) Solves Some Robust RL Problems.” ICLR 2022.
- Fountas, Z., Sajid, N., Mediano, P. A., & Friston, K. (2020). “Deep Active Inference Agents Using Monte-Carlo Methods.” NeurIPS 2020.
- Fox, R., Pakman, A., & Tishby, N. (2016). “Taming the Noise in Reinforcement Learning via Soft Updates.” UAI 2016.
- Friston, K. (2010). “The Free-Energy Principle. A Unified Brain Theory?” Nature Reviews Neuroscience, 11(2), 127–138.
- Friston, K., Rigoli, F., Ognibene, D., Mathys, C., FitzGerald, T., & Pezzulo, G. (2015). “Active Inference and Epistemic Value.” Cognitive Neuroscience, 6(4), 187–214.
- Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2017). “Active Inference. A Process Theory.” Neural Computation, 29(1), 1–49.
- Galashov, A., Jayakumar, S. M., Hasenclever, L., Tirumala, D., Schwarz, J., Desjardins, G., Czarnecki, W. M., Teh, Y. W., Pascanu, R., & Heess, N. (2019). “Information Asymmetry in KL-Regularized RL.” ICLR 2019.
- Gmytrasiewicz, P. J., & Doshi, P. (2005). “A Framework for Sequential Planning in Multi-Agent Settings.” Journal of Artificial Intelligence Research, 24, 49–79.
- Grau-Moya, J., Leibfried, F., & Bou-Ammar, H. (2018). “Balancing Two-Player Stochastic Games with Soft Q-Learning.” IJCAI 2018.
- Ha, D., & Schmidhuber, J. (2018). “World Models.” NeurIPS 2018.
- Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). “Reinforcement Learning with Deep Energy-Based Policies.” ICML 2017.
- Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018a). “Soft Actor-Critic. Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” ICML 2018.
- Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2018b). “Soft Actor-Critic Algorithms and Applications.” arXiv preprint arXiv:1812.05905.
- Haarnoja, T., Ha, S., Zhou, A., Tan, J., Tucker, G., & Levine, S. (2019). “Learning to Walk via Deep Reinforcement Learning.” RSS 2019.
- Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019). “Learning Latent Dynamics for Planning from Pixels.” ICML 2019.
- Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). “Dream to Control. Learning Behaviors by Latent Imagination.” ICLR 2020.
- Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). “Mastering Atari with Discrete World Models.” ICLR 2021.
- Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). “Mastering Diverse Domains through World Models.” arXiv preprint arXiv:2301.04104.
- Heins, C., Millidge, B., Da Costa, L., Mann, R., Friston, K., & Couzin, I. (2022). “Spin Glass Systems as Collective Active Inference.” arXiv preprint arXiv:2207.06970.
- Heins, C., Millidge, B., Da Costa, L., Mann, R., Friston, K., & Couzin, I. (2024). “Active Inference and the Ecology of Foraging.” PLoS Computational Biology, 20(1), e1011727.
- Hoffman, M., de Freitas, N., Doucet, A., & Peters, J. (2009). “An Expectation Maximization Algorithm for Continuous Markov Decision Processes with Arbitrary Reward.” AISTATS 2009.
- Janner, M., Fu, J., Zhang, M., & Levine, S. (2019). “When to Trust Your Model. Model-Based Policy Optimization.” NeurIPS 2019.
- Janner, M., Du, Y., Tenenbaum, J. B., & Levine, S. (2022). “Planning with Diffusion for Flexible Behavior Synthesis.” ICML 2022.
- Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., & Levine, S. (2018). “Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation.” CoRL 2018.
- Kalakrishnan, M., Chitta, S., Theodorou, E., Pastor, P., & Schaal, S. (2011). “STOMP. Stochastic Trajectory Optimization for Motion Planning.” ICRA 2011.
- Kappen, H. J. (2005). “Path Integrals and Symmetry Breaking for Optimal Control Theory.” Journal of Statistical Mechanics. Theory and Experiment, 2005(11), P11011.
- Kappen, H. J., Gómez, V., & Opper, M. (2012). “Optimal Control as a Graphical Model Inference Problem.” Machine Learning, 87(2), 159–182.
- Kouw, W. M. (2024). “Planning to Avoid Ambiguous States through Gaussian Approximations to Non-Linear Sensors in Active Inference Agents.” arXiv preprint arXiv:2409.01974.
- Kumar, A., Zilberstein, S., & Toussaint, M. (2011). “Scalable Multiagent Planning Using Probabilistic Inference.” IJCAI 2011.
- Lambert, N., Amos, B., Yadan, O., & Calandra, R. (2020). “Objective Mismatch in Model-Based Reinforcement Learning.” L4DC 2020.
- Lambert, A., Fishman, A., Fox, D., Boots, B., & Ramos, F. (2021). “Entropy Regularized Motion Planning via Stein Variational Inference.” arXiv preprint arXiv:2107.05146.
- Lee, L., Eysenbach, B., Parisotto, E., Xing, E., Levine, S., & Salakhutdinov, R. (2019). “Efficient Exploration via State Marginal Matching.” arXiv preprint arXiv:1906.05274.
- Levine, S. (2018). “Reinforcement Learning and Control as Probabilistic Inference. Tutorial and Review.” arXiv preprint arXiv:1805.00909.
- Levine, S., & Koltun, V. (2013). “Guided Policy Search.” ICML 2013.
- Liang, Z., Mu, Y., Ding, M., Ni, F., Tomizuka, M., & Luo, P. (2023). “AdaptDiffuser. Diffusion Models as Adaptive Self-Evolving Planners.” ICML 2023.
- Mukadam, M., Dong, J., Yan, X., Dellaert, F., & Boots, B. (2018). “Continuous-Time Gaussian Process Motion Planning via Probabilistic Inference.” International Journal of Robotics Research, 37(11), 1319–1340.
- Millidge, B. (2020). “Deep Active Inference as Variational Policy Gradients.” Journal of Mathematical Psychology, 96, 102348.
- Millidge, B., Tschantz, A., & Buckley, C. L. (2021). “Whence the Expected Free Energy?” Neural Computation, 33(2), 447–482.
- Mohamed, A., Amer, K., Elruby, M., Zeng, N., & Zahran, M. (2020). “Model-Predictive Path Integral Control Framework for Partially Observable Environments.” ICRA 2020.
- Nachum, O., Norouzi, M., Xu, K., & Schuurmans, D. (2017). “Bridging the Gap Between Value and Policy Based Reinforcement Learning.” NeurIPS 2017.
- Parr, T., & Friston, K. (2019). “Generalised Free Energy and Active Inference.” Biological Cybernetics, 113(5), 495–513.
- Paul, A., Isomura, T., & Friston, K. J. (2024). “Active Inference as a Unified Theory of Sequential Decision-Making.” arXiv preprint arXiv:2402.03955.
- Peters, J., Mülling, K., & Altun, Y. (2010). “Relative Entropy Policy Search.” AAAI 2010.
- Peters, J., & Schaal, S. (2007). “Reinforcement Learning by Reward-Weighted Regression.” ICML 2007.
- Piché, A., Thomas, V., Ibrahim, C., Bengio, Y., & Pal, C. (2019). “Probabilistic Planning with Sequential Monte Carlo Methods.” ICLR 2019.
- Pinneri, C., Sawant, S., Blaes, S., Achterhold, J., Stückler, J., Rolínek, M., & Martius, G. (2021). “Sample-Efficient Cross-Entropy Method for Real-Time Planning.” CoRL 2021.
- Rawlik, K., Toussaint, M., & Vijayakumar, S. (2012). “On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference.” RSS 2012.
- Ratliff, N., Zucker, M., Bagnell, J. A., & Srinivasa, S. (2009). “CHOMP. Gradient Optimization Techniques for Efficient Motion Planning.” ICRA 2009.
- Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., van de Wiele, T., Mnih, V., Heess, N., & Springenberg, J. T. (2018). “Learning by Playing. Solving Sparse Reward Tasks from Scratch.” ICML 2018.
- Safron, A. (2020). “An Integrated World Modeling Theory of Consciousness.” Frontiers in Artificial Intelligence, 3, 30.
- Sajid, N., Ball, P. J., Parr, T., & Friston, K. J. (2021). “Active Inference. Demystified and Compared.” Neural Computation, 33(3), 674–712.
- Saxe, A. M., Earle, A. C., & Rosman, B. (2017). “Hierarchy through Composition with Multitask LMDPs.” ICML 2017.
- Schulman, J., Chen, X., & Abbeel, P. (2017). “Equivalence Between Policy Gradients and Soft Q-Learning.” arXiv preprint arXiv:1704.06440.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). “Trust Region Policy Optimization.” ICML 2015.
- Schubert, J. (2025). “Active Inference for an Intelligent Agent in Autonomous Reconnaissance Missions.” arXiv preprint arXiv:2510.17450.
- Schwartenbeck, P., Passecker, J., Hauser, T. U., FitzGerald, T. H., Kronbichler, M., & Friston, K. J. (2019). “Computational Mechanisms of Curiosity and Goal-Directed Exploration.” eLife, 8, e41703.
- Sennesh, E., Hoffman, J., & van de Meent, J.-W. (2022). “Deriving Time-Averaged Active Inference from Control Principles.” arXiv preprint arXiv:2208.10601.
- Smith, R., Friston, K. J., & Whyte, C. J. (2022). “A Step-By-Step Tutorial on Active Inference and Its Application to Empirical Data.” Journal of Mathematical Psychology, 107, 102632.
- Solway, A., & Botvinick, M. (2012). “Goal-Directed Decision Making as Probabilistic Inference.” Psychological Review, 119(1), 120–154.
- Liu, Q., & Wang, D. (2016). “Stein Variational Gradient Descent. A General Purpose Bayesian Inference Algorithm.” NeurIPS 2016.
- Sun, X., Shen, M., Tintarev, N., & Bischl, B. (2019). “Tutorial and Survey on Probabilistic Graphical Model and Variational Inference in Deep Reinforcement Learning.” arXiv preprint arXiv:1908.09381.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning. An Introduction, 2nd ed. MIT Press.
- Theodorou, E., Buchli, J., & Schaal, S. (2010). “A Generalized Path Integral Control Approach to Reinforcement Learning.” Journal of Machine Learning Research, 11, 3137–3181.
- Tian, Z., Wen, Y., Gong, Z., Punber, F., Zhang, S., & Wang, J. (2020). “Learning to Communicate Implicitly by Actions.” AAAI 2020.
- Tirumala, D., Galashov, A., Noh, H., Hasenclever, L., Pascanu, R., Schwarz, J., Desjardins, G., Czarnecki, W. M., Ahuja, A., Teh, Y. W., & Heess, N. (2020). “Behavior Priors for Efficient Reinforcement Learning.” arXiv preprint arXiv:2010.14274.
- Todorov, E. (2006). “Linearly-Solvable Markov Decision Problems.” NeurIPS 2006.
- Todorov, E. (2007). “Linearly-Solvable Markov Decision Problems.” Advances in Neural Information Processing Systems, Vol. 19.
- Todorov, E. (2008). “General Duality Between Optimal Control and Estimation.” IEEE CDC 2008.
- Todorov, E. (2009). “Efficient Computation of Optimal Actions.” Proceedings of the National Academy of Sciences, 106(28), 11478–11483.
- Toussaint, M. (2009). “Robot Trajectory Optimization Using Approximate Inference.” ICML 2009.
- Toussaint, M. (2010). “Integrated Motor Control, Planning, Grasping and High-Level Reasoning in a Blocks World Using Probabilistic Inference.” ICRA 2010.
- Toussaint, M., & Storkey, A. (2006). “Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes.” ICML 2006.
- Tschantz, A., Millidge, B., Seth, A. K., & Buckley, C. L. (2020). “Reinforcement Learning Through Active Inference.” arXiv preprint arXiv:2002.12636.
- Van de Maele, T., Verbelen, T., Mazzaglia, P., Ferraro, S., & Dhoedt, B. (2023). “Integrating Cognitive Map Learning and Active Inference for Planning in Ambiguous Environments.” arXiv preprint arXiv:2308.08307.
- van der Himst, O., & Lanillos, P. (2020). “Deep Active Inference for Partially Observable MDPs.” arXiv preprint arXiv:2009.03622.
- Vlassis, N., Toussaint, M., Kontes, G., & Piperidis, S. (2009). “Learning Model-Free Robot Control by a Monte Carlo EM Algorithm.” Autonomous Robots, 27(2), 123–130.
- Watson, J., Abdulsamad, H., & Peters, J. (2020). “Active Inference or Control as Inference? A Unifying View.” arXiv preprint arXiv:2010.00262.
- Williams, G., Aldrich, A., & Theodorou, E. A. (2017). “Model Predictive Path Integral Control. From Theory to Parallel Computation.” Journal of Guidance, Control, and Dynamics, 40(2), 344–357.
- Williams, G., Drews, P., Goldfain, B., Rehg, J. M., & Theodorou, E. A. (2018). “Information-Theoretic Model Predictive Control. Theory and Applications to Autonomous Driving.” IEEE Transactions on Robotics, 34(6), 1603–1622.
- Xia, Z., & Zhao, D. (2016). “Online Reinforcement Learning Control by Bayesian Inference.” IET Control Theory and Applications, 10(12), 1331–1338.
- Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008). “Maximum Entropy Inverse Reinforcement Learning.” AAAI 2008.