Explaining the Free Energy Principle to my past self


Crossposted on Substack
 
💡
Meta
Much of this article is basically me re-explaining the content in Friston, 2019 and Millidge et al., 2021, but in a slower, restructured, more accessible way with added context.
Notes (important! please read!)
  1. Look out for background sections
    1. Weird terminology, annoying jargon or new and important concepts will be underlined. If you’re confused, look for dedicated background toggles in each section that look like the following:
      Background on X, Y, Z
  2. Footnotes are in the comments (displayed on the right side of the page)
  3. Intended audience
    1. I wrote this for my past self who really wanted to understand FEP, but only knew single-variable calculus, a bit of probability, and “learned” linear algebra only from watching 3b1b videos. This is meant to be accessible. Let me know if something is confusing or unclear.
    2. If you’re really really really interested in this topic, I recommend that you dig through the papers yourself (I’ve provided links in the references section). This explainer post is meant to be semi-introductory. My guess is reading this would help you dig through the papers faster, if you currently don’t know anything about how the FEP is derived.
  4. Nitpicking, nerd-sniping and asking the author questions is highly encouraged!
    1. Reach me at [email protected]
  5. Take breaks. Eat snacks. This is really long. Reading it in one sitting might possibly kill you.
Acknowledgments
Thank you to Jan Kulveit for clearing up my initial confusions around the FEP; to Mihaly Barasz for explaining basic information theory to me; and to the poor people in the SPARC ‘24 server who had to read all my terrible drafts.
 
If you are reading this sentence, then you exist.
What information can we deduce from this fact? It turns out, quite a lot. We can deduce that the universe is governed by laws such that your existence was possible, which constrains the value of some fundamental physical constants. This is the anthropic principle.
The key moves here are to (1) make a really obvious observation – notice that “hey, I exist!” – (2) formalize that observation somehow, and (3) squeeze the formalization really hard to see what else you can deduce about the world based on it.
 
But why stop here? Why stop at observing that you exist? Bacteria also exist. So do fungi, and other humans, and cities, and plants… Maybe we can generalize what we did with the anthropic principle to all of these “things” somehow, i.e. we can (1) observe that these “things” exist, (2) formalize this observation, (3) deduce some cool stuff about how the world must be, given this observation.
This is the core idea of Karl Friston’s Free Energy Principle (FEP). The specific question we are asking here is:
If things exist, what must they do?
 
To answer this question, we need to formalize what exactly is meant by
  • things exist,
  • what must things do
First, what kinds of things are we even talking about? We are curious about the kind of thing that is self-organizing and maintains its “thingness” regardless of external perturbations, up to a point. We are not, for the moment, concerned with particles or snowflakes, even though they are things in some sense (we could apply similar logic to them, since they are still things, just things that are a lot worse at maintaining their thingness). We are concerned with things like bacteria, plants, humans, cultures… etc.
By what must things do, we just mean “what must the thing’s mechanics be”, where “mechanics” refers to how the states that comprise a thing “move” or change through space and time.
 
We can now rephrase the extremely vague question we had at the start into a less vague but still really vague question:
If self-organizing things that maintain their thingness exist, what must the mechanics of those things be?
Overall, we will try to answer this question by:
  1. Formalizing “thingness”
  2. Slotting the formalized definition into a general mathematical description of physical systems
  3. Rearranging a bunch of equations and interpreting what pops out
Eventually, we will get to a theorem like “self-organizing things that maintain their thingness must minimize a quantity called variational free energy”. Hence why the FEP is called the “Free Energy” principle.
 
Also, note that the FEP is unfalsifiable since it is mathematically true. It is a principle, not a theory.
  • The right question is not to ask “is the FEP true?”, because the answer to this question is yes: it is tautologically true in math-land, like all theorems, if you include all the assumptions.
  • The right question is to ask “do the assumptions of the FEP apply to the system I am modeling?”. If the assumptions do in fact hold true, you can then construct falsifiable theories about the system by making specific modeling choices (i.e. picking a specific generative model – this sentence will make sense later).
 

I. What’s a thing?

In this section we are trying to define precisely what it means for a thing to exist (in space, over some timescale) by formulating a set of “thingness conditions”.

Bounding things in space

Intuitively, for a thing to exist in space, there must be:
  • Stuff inside the thing – ex: lungs, plant cells, stem cells…
  • Stuff outside the thing (i.e. the environment in which the thing exists) – ex: petri dish…
  • Stuff that separates the inside from the outside (i.e. the thing’s boundary) – ex: cell membrane, skin, sensory receptors…
Once you’ve defined the boundary of the thing you’re talking about, you’ve basically defined everything else, since now you know that reality can be neatly carved into three subsets: (1) a boundary, (2) stuff on one side of the boundary, and (3) stuff on the other side of the boundary. Usually, we call the side of the boundary with less stuff in it “inside”, while the side that has more stuff is called “outside”.
 
The boundary divides everything into three regions: “inside”, “outside”, and “the region on the boundary”.
Now to formalize this idea. If we model the system of interest as a bunch of random variables, some of which influence other random variables causally (specifically if we model the system as a causal graph of random variables), then we can define:
  • internal states $\mu$ as the subset of states that comprise the “inside” of the system; we pick these states manually based on what we define as “inside”
  • blanket states $b$ as the subset of states which insulate/separate the internal states from the rest of the network, i.e. the “boundary” of the system; we know what these states are after picking the internal states
  • external states $\eta$ as the subset of states which are neither internal states nor blanket states, i.e. the states left over after considering the internal and blanket states
 
Background on causal graph of random variables
What does it mean for event $A$ to cause event $B$? A reasonable first stab at defining causation probabilistically might be to say that event $A$ causes event $B$ iff. event $A$ occurring raises the probability of event $B$ occurring, i.e.
$$P(B \mid A) > P(B)$$
But there are two issues with this:
The statement above is true iff. $P(A \mid B) > P(A)$, which means that under our definition of causation $A$ causing $B$ is equivalent to $B$ causing $A$. Our definition of causation should be asymmetrical: $A$ causing $B$ should not imply that $B$ causes $A$.
1 - Show that $P(B \mid A) > P(B) \implies P(A \mid B) > P(A)$
By the definition of conditional probability, $P(B \mid A) > P(B)$ is equivalent to $P(A, B) > P(A)\,P(B)$; dividing both sides by $P(B)$ gives $P(A \mid B) > P(A)$.
2 - Show that $P(A \mid B) > P(A) \implies P(B \mid A) > P(B)$
We can use the same proof as above and just swap $A$ and $B$.
The statement also holds true if there is some confounding variable $C$ that raises both the probability of $A$ and of $B$, or if there is some variable $C$ between $A$ and $B$ (i.e. $A$ causes $C$, which causes $B$). But ideally, we’d only want our definition of causation to hold when $A$ directly causes $B$.
So we need some additional conditions on our definition of causation to (1) satisfy asymmetry and (2) deal with confounding variables.
Here’s one way to fix our definition of causality. We can say that $A$ (directly) causes $B$ iff. three conditions are met:
1. $P(B \mid A) > P(B)$, i.e. $A$ occurring raises the probability of $B$ occurring
Same condition as above.
2. The time $t_A$ at which $A$ occurs must be before the time $t_B$ at which $B$ occurs
This makes our definition of causation asymmetrical since time only flows one way.
3. There are no variables that screen off $A$ from $B$
A variable $C$ screens off $A$ and $B$ iff. $P(B \mid A, C) = P(B \mid C)$. There are two cases in which this might be true:
  • $C$ is causally between $A$ and $B$, i.e. the causal structure of the system being modeled is $A \to C \to B$, where $X \to Y$ indicates $X$ causes $Y$.
  • $C$ is a common cause (or confounder) of $A$ and $B$, i.e. the causal structure of the system being modeled is $A \leftarrow C \to B$.
 
 
 
Visually:
 
The blanket states $b$ – also called the Markov blanket of the system – are defined such that
$$p(\eta \mid b, \mu) = p(\eta \mid b)$$
$$p(\mu \mid b, \eta) = p(\mu \mid b)$$
where $\mu$ = internal states, $\eta$ = external states. This means that to describe the probability distribution on external states or internal states, we only need to know the blanket states $b$ and nothing further (see lines 1 and 2). This also means that the distributions on $\eta$ and $\mu$ are independent, conditional on blanket states $b$: $p(\eta, \mu \mid b) = p(\eta \mid b)\,p(\mu \mid b)$. In most cases – but not always – blanket states correspond to physical barriers that enclose a system (e.g. a cell membrane).
We can further partition the blanket states $b$ into sensory states $s$ and active states $a$, where sensory states are blanket states that influence but are not influenced by internal states, and active states are blanket states that are influenced by but do not influence internal states. If the system being modeled is a human, examples of sensory states are the states that characterize sensory receptors (e.g. rods and cones in the eye), and examples of active states are the states that characterize physical actuators (e.g. muscle cells). Visually:
(figure: internal states separated from external states by the sensory and active blanket states)
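The conditional-independence property of a Markov blanket can be checked numerically. Below is a minimal Python sketch (with hypothetical, randomly generated conditional probability tables – purely illustrative choices) of a chain in which a blanket variable sits between an external and an internal variable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary chain: external -> blanket -> internal
p_ext = rng.dirichlet(np.ones(2))                # p(external)
p_bla_ext = rng.dirichlet(np.ones(2), size=2)    # p(blanket | external), rows index external
p_int_bla = rng.dirichlet(np.ones(2), size=2)    # p(internal | blanket), rows index blanket

# Joint distribution p(external, blanket, internal)
joint = (p_ext[:, None, None]
         * p_bla_ext[:, :, None]
         * p_int_bla[None, :, :])

# Condition on (external, blanket): p(internal | external, blanket)
cond = joint / joint.sum(axis=2, keepdims=True)

# The blanket screens off internal from external: the conditional
# distribution over internal states is identical for both external values
print(np.allclose(cond[0], cond[1]))  # True
```

Once you know the blanket state, learning the external state tells you nothing further about the internal state, which is exactly the screening-off condition from the causation background above.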
Broadly speaking, all of the waffling we’ve done above allows us to partition all the states $x$ we’re considering into subsets of conceptual significance (e.g. the subset $b = \{s, a\}$ corresponds to the boundary between the system and its environment). Specifically, we can write $x = \{\eta, s, a, \mu\}$ since $b = \{s, a\}$. This is important because now we’ve identified the subsets $a$ and $\mu$ – let’s group them together and call them autonomous states $\alpha = \{a, \mu\}$ – that we care about. We care about $\alpha$ because it comprises the “inside” of the thing ($\mu$), as well as how the thing can influence its environment (through $a$). We can now be more specific with our initial question, and turn it from
If self-organizing things that maintain their thingness exist, what must the mechanics of those things be?
to
If self-organizing things that maintain their Markov blanket exist, what must the mechanics of those things’ autonomous states be?
 
Thus far, we’ve formalized what it means for a system to be statistically separated (and distinct) from its environment in space. Our answer is that, for the system to exist continually, it has to maintain a Markov blanket that surrounds its internal states. This Markov blanket assumption is the first “thingness condition”.

Bounding things in time

By stipulating that the system we’re considering maintains its Markov blanket over time, we’ve not only implicitly bounded the system in space, but also implicitly bounded the system in time. (Because Markov blankets are not maintained forever!) But is this enough?
Consider the scenario:
  • The system we’re modeling is a human that maintains its Markov blanket. (We’re modeling at a sufficiently high level of abstraction that the constantly dying skin cells don’t count as the Markov blanket not being maintained.)
  • The human joins a cult and goes through some serious trauma that they never recover from.
Arguably, on some intuitive level, the human before the trauma and the human after the trauma are not the same “thing”, even though the human’s Markov blanket was maintained.
 
From this we can infer that we need more stringent bounds on “thingness” in time, and that more specifically we need to formalize the notion of a constant phenotype (since a constant Markov blanket does not necessarily imply a constant phenotype). We can do this pretty easily by assuming that the probability distribution over all the states is constant over time, i.e.
$$p(x, t) = p(x)$$
Note that this implicitly defines phenotype to be equivalent to a specific probability distribution function. This assumption is called the steady state condition, because the probability distribution function doesn’t change w.r.t. time.
 
🔄
Recap
In this section we’ve formalized what it means to be a “thing” (more precisely, what it means to be the “same thing”) over some timescale. The two thingness conditions are:
  • Markov blanket assumption: you need a constant Markov blanket, and
  • Steady state condition: you need to have a probability distribution that doesn’t change w.r.t. time.
We can now rephrase our really general initial question
If self-organizing things that maintain their thingness exist, what must the mechanics of those things be?
into a less general question
Given a self-organizing thing that maintains its Markov blanket, what must the mechanics of that thing’s autonomous states be at steady state?
 

II. From extremely general physics to slightly less general but still really general physics

In this section, we’ll (1) consider an extremely general description of a physical system, (2) introduce the thingness conditions established in the previous section one by one, and (3) reshuffle/rearrange a bunch of equations to see what we get.

Langevin dynamics

Our starting point is to assume that our system follows Langevin dynamics, i.e. that the change of states with respect to time can be described with the following differential equation
$$\dot{x}(t) = f(x) + \omega$$
where $x$ is a vector of all the states we’re considering, $f$ is some function (a vector field), and $\omega$ is a random variable drawn from a normal distribution with variance $2\Gamma$, where $\Gamma = \gamma I$ is a scaled identity matrix and $\gamma$ is some scalar constant. Essentially we’re assuming that the dynamics of the states we’re describing (i.e. how the states we’re describing change over time) can be split into two parts: a part described by some function $f$, plus another part that consists of well-behaved randomness $\omega$.
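As a quick sanity check on this setup, here is a small Python sketch of Langevin dynamics (an Euler–Maruyama discretization, with a hypothetical linear flow $f(x) = -x$ – an illustrative choice, not anything from the FEP papers). For this flow, the stationary density is a Gaussian whose variance equals the noise scale, and the simulation recovers that:

```python
import numpy as np

rng = np.random.default_rng(0)

# Langevin dynamics dx/dt = f(x) + noise, with hypothetical flow f(x) = -x
# and Gaussian noise of variance 2 * Gamma. For this f the stationary
# density is N(0, Gamma), so the sample variance should approach Gamma.
Gamma = 0.5
dt, steps, chains = 0.01, 5_000, 10_000

x = np.zeros(chains)  # many independent chains, simulated in parallel
for _ in range(steps):
    x += -x * dt + np.sqrt(2 * Gamma * dt) * rng.standard_normal(chains)

print(np.var(x))  # ≈ 0.5, the analytic stationary variance Gamma
```

The point is just that the deterministic part of the flow and the random fluctuations balance out into a stable probability distribution over states, which is exactly the object the Fokker-Planck equation tracks.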
 
We can rewrite the equation above in terms of the rate of change w.r.t. time of the probability density of states $p(x, t)$, instead of the rate of change w.r.t. time of the states themselves, through the Fokker-Planck equation:
$$\dot{p}(x, t) = \nabla \cdot \big( \Gamma \nabla p(x, t) - f(x)\, p(x, t) \big)$$
where $\nabla$ is the gradient operator and $\nabla \cdot$ is the divergence operator. (Why we’re doing this will become clear in the next section.)
 
Background on partial derivatives $\partial$, gradient operator $\nabla$, vector fields
Say we have some function $f$ and we want to describe how it changes. In the single-variable case, e.g. $f(x)$, we can just calculate the derivative $\frac{df}{dx}$. In the multi-variable case, e.g. $f(x, y)$, we need a new concept to do the same thing: the partial derivative.
If we want to know how $f$ changes in the $x$ direction, we take the partial derivative w.r.t. $x$ and write $\frac{\partial f}{\partial x}$, which is computed by applying the usual rules of differentiation while treating $y$ as a constant; and we compute $\frac{\partial f}{\partial y}$ if we want to know how $f$ changes in the $y$ direction.
Visually:
But what if we wanted to talk about how $f$ changes in general, and not how it changes specifically w.r.t. $x$ or $y$? It would be useful to have some single quantity that packages all the partial derivatives of $f$ into one. This is basically the gradient operator $\nabla$, which is defined as a vector of all the partial derivatives:
$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$
The gradient operator takes as input a scalar field – i.e. a function that maps $n$ inputs to $1$ output – and outputs a vector field (a function that maps $n$ inputs to $n$ outputs). Visually:
On the left is the gradient field (i.e. the result of the gradient operator); on the right is the scalar field that the gradient operator is being applied to. Source: https://youtu.be/v0_LlyVquF8?si=O8hjwbuk29R18hcd
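To make the “direction of steepest ascent” idea concrete, here is a tiny Python sketch using a hypothetical scalar field $f(x, y) = -(x^2 + y^2)$ (an illustrative choice with a single peak at the origin):

```python
import numpy as np

# Hypothetical scalar field f(x, y) = -(x^2 + y^2): one peak, at the origin
xs = np.linspace(-2.0, 2.0, 401)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs, indexing="ij")
f = -(X**2 + Y**2)

# The gradient operator packages both partial derivatives together
fx, fy = np.gradient(f, dx)  # (df/dx, df/dy) via finite differences

# At (1, 1) the analytic gradient is (-2, -2): it points back toward the
# peak at the origin, i.e. in the direction of steepest ascent
i = int(np.argmin(np.abs(xs - 1.0)))
print(fx[i, i], fy[i, i])  # ≈ -2.0, -2.0
```

Every vector in the gradient field of this bump points uphill toward the peak, which is exactly the picture the figure above is showing.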
 
 
Now, we have a really general description – the Fokker-Planck equation – that describes how lots of physical systems behave in terms of the probability distribution on the states of the system. The next step is to slot in our “thingness assumptions” and to see what comes out.
 

Solving the Fokker-Planck equation

Recall from before that the second thingness condition was the steady state assumption, i.e. that $\dot{p}(x, t) = 0$. We can directly apply this as a condition on the Fokker-Planck equation above:
$$0 = \nabla \cdot \big( \Gamma \nabla p(x) - f(x)\, p(x) \big)$$
Now we just need to solve this nasty-looking differential equation and find $p(x)$! Right?
But wait…
  • We have two unknowns – the probability density $p$ and the flow $f$ – and only one equation. So if we did solve the differential equation, we’d have to find $p$ in terms of $f$ somehow.
  • Also, do we even care about $p$ at all? Remember that our initial question was “Given a self-organizing thing that maintains its Markov blanket, what must the mechanics of that thing’s autonomous states be at steady state?” where “mechanics” here means something like “how does the thing change w.r.t. time” (so we can actually replace “mechanics” with “dynamics”). Well, $f$ is quite literally the flow function that defines how $x$ changes w.r.t. time – it’s the only non-stochastic term in both the Langevin and Fokker-Planck differential equations. So technically, finding an expression for $f$ would answer our question exactly.
So our actual goal now is to find an expression for $f$ in terms of $p$ that, when plugged into the equation above, satisfies the steady-state condition (i.e. it makes $\dot{p}(x, t) = 0$), and then to figure out what that expression means.
 
It turns out that
$$f(x) = (\Gamma - Q)\,\nabla \ln p(x)$$
has the properties we want, where
  • $Q$ is an antisymmetric matrix (i.e. a matrix equal to its negative transpose, $Q = -Q^\top$) and the gradient of $Q$ is orthogonal to the gradient of the log density $\nabla \ln p(x)$, i.e. $\nabla Q \cdot \nabla \ln p(x) = 0$.
 
We can verify this:
Proof that the steady-state condition is satisfied (from Appendix C of Millidge et al., 2021)
We will show that $\nabla \cdot \big( \Gamma \nabla p(x) - f(x)\, p(x) \big) = 0$:
$$\nabla \cdot \big( \Gamma \nabla p - f p \big) = \nabla \cdot \big( \Gamma \nabla p - (\Gamma - Q)(\nabla \ln p)\, p \big) = \nabla \cdot \big( \Gamma \nabla p - (\Gamma - Q)\nabla p \big) = \nabla \cdot \big( Q \nabla p \big) = 0$$
Explanation of last line
The gradient of $Q$ is orthogonal to the gradient of the log density by definition, which means the dot product of $\nabla Q$ and $\nabla \ln p$ is zero, i.e. $\nabla Q \cdot \nabla \ln p = 0$. If we apply the chain rule to $\nabla \ln p$ we get $\nabla \ln p = \frac{1}{p} \nabla p$. Since $p$ is just a scalar – the probability density $p$ is a function that maps a vector $x$ to a scalar – this means $\nabla p = p\, \nabla \ln p$.
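We can also check this claim numerically. The Python sketch below assumes a hypothetical 2-D standard-Gaussian steady-state density, a scaled-identity diffusion matrix, and a constant antisymmetric solenoidal matrix (all illustrative choices), builds the flow from the claimed expression, and verifies that the right-hand side of the Fokker-Planck equation vanishes on a grid:

```python
import numpy as np

# Grid over a 2-D state space
n = 201
xs = np.linspace(-4.0, 4.0, n)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs, indexing="ij")

gamma = 0.5
Gamma = gamma * np.eye(2)                 # scaled-identity diffusion
Q = np.array([[0.0, 1.0], [-1.0, 0.0]])   # constant antisymmetric matrix

# Hypothetical steady-state density: standard 2-D Gaussian
p = np.exp(-(X**2 + Y**2) / 2)
p /= p.sum() * dx * dx

# Flow from the claimed solution: f = (Gamma - Q) grad(ln p)
grad_lnp = np.stack(np.gradient(np.log(p), dx))
f = np.einsum("ij,jkl->ikl", Gamma - Q, grad_lnp)

# Fokker-Planck right-hand side at steady state: div(Gamma grad(p) - f p)
flux = gamma * np.stack(np.gradient(p, dx)) - f * p
div = (np.gradient(flux[0], dx, axis=0)
       + np.gradient(flux[1], dx, axis=1))

# Away from the grid boundary the divergence is ~0: this flow leaves
# the density p unchanged, so p really is a steady state
print(np.abs(div[20:-20, 20:-20]).max())  # ≈ 0, up to finite-difference error
```

The probability current that remains after the cancellation is the solenoidal piece, which moves probability around level sets of the density without changing the density itself.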
 
 
The expression for the vector field $f$ we have above is actually a generalization of the Helmholtz decomposition, a.k.a. the fundamental theorem of vector calculus, which states that any “well-behaved” vector field can be split into the sum of two vector fields, where one of these vector fields is curl-free and the other is divergence-free.
 
Background on Helmholtz decomposition
Divergence and curl
Divergence and curl are vector operators, i.e. functions that take vector fields as inputs. Suppose we have some vector field $F$. Then:
  • Divergence is a measure of how much that vector field behaves like a “source” or a “spring” at a given point. If the vector field behaves like a “sink” or a “plughole”, the divergence at that point is negative. More broadly, it is a measure of whether the rate at which things flow “out of” that point is faster than the rate at which things flow “into” that point. Divergence turns a vector field into a scalar field.
Source: Wikipedia
  • Curl is a measure of how much that vector field rotates counterclockwise at a given point. If it rotates clockwise at a given point, the curl is negative. (This is based on the right hand rule.) Curl turns an input vector field into another vector field.
The Helmholtz decomposition states that any “well-behaved” vector field $F$ can be written as
$$F = \nabla \phi + \nabla \times A$$
where $\nabla \phi$ is a curl-free vector field, i.e. $\nabla \times (\nabla \phi) = 0$
  • In our expression for $f$ this corresponds to $\Gamma \nabla \ln p(x)$.
and $\nabla \times A$ is a divergence-free vector field, i.e. $\nabla \cdot (\nabla \times A) = 0$.
  • In our expression for $f$ this corresponds to $-Q \nabla \ln p(x)$.
 
This decomposition is quite intuitive visually (left = curl-free, right = divergence-free):
 
We actually have a preliminary answer to our question in the form of an expression for $f$:
$$f(x) = (\Gamma - Q)\,\nabla \ln p(x)$$
But what does this expression actually mean? Right now it’s just a pile of symbols. We’ll explore this in the next section.
     

The Helmholtz decomposition

We now have an expression that describes the flow function $f$ – i.e. how all the states we’re considering change – at steady state:
$$f(x) = (\Gamma - Q)\,\nabla \ln p(x)$$
Graphing what this looks like (in three dimensions) will give us more intuition and insight on what this expression means, and the significance of each component.
Visually:
(figure: the $\Gamma \nabla \ln p(x)$ component of the flow climbing the log density while the $-Q \nabla \ln p(x)$ component circulates around it)
We can thus see the geometric significance of the $\Gamma \nabla \ln p(x)$ and $-Q \nabla \ln p(x)$ parts of the flow:
  • The $\Gamma \nabla \ln p(x)$ part climbs the log density
  • The $-Q \nabla \ln p(x)$ part, also called the solenoidal flow, circulates around the log density
The former means that the flow on the states will try to maximize the log density $\ln p(x)$. But recall that, all the way back in the Langevin equation, there was another component of the change in states w.r.t. time: the normally distributed random fluctuations $\omega$ with variance $2\Gamma$. So while the $\Gamma \nabla \ln p(x)$ component of the flow tries to ascend the log density, $\omega$ will try to oppose that component of the flow by descending the log density.
Why is this significant? Well, because the “things” we are considering – humans, bacteria, plants… – only remain “things” (can only maintain their Markov blankets in) a small number of states (call these states the characteristic states of a “thing”). If I teleported a person from Massachusetts to Egypt, they’d feel quite hot, but still survive and maintain their Markov blanket due to the $\Gamma \nabla \ln p(x)$ part of the flow. If I teleported this person to the sun, however, their Markov blanket would dissolve. This is because I’ve changed the internal states of the person – specifically, their temperature state – so much that I’ve moved the “thing” out too far from its small region of characteristic states.
The kinds of things we’re considering are able to continue to exist in the face of small fluctuations (Massachusetts → Egypt), but dissolve when faced with very large fluctuations (Massachusetts → Sun). This concept of resistance to external perturbations offers an intriguing way to quantify how “thing-like” (agent-like?) a given “thing” is.
     
Background – but why do $\Gamma \nabla \ln p(x)$ and $-Q \nabla \ln p(x)$ do that on the graph?
Why does $\Gamma \nabla \ln p(x)$ climb up the log density?
Recall that $\Gamma$ is defined as a scaled identity matrix, so the only effect it has on $\nabla \ln p(x)$ is that it scales up the magnitude of the vectors (the direction of the vectors remains unchanged).
Also recall from earlier that the gradient operator on some scalar field by definition outputs a vector field, where each vector points in the direction of steepest ascent w.r.t. the original scalar field.
Thus, $\Gamma \nabla \ln p(x)$ is a vector field where all the vectors point towards the peak of the probability density, scaled in magnitude by $\Gamma$, which means that this component of the flow “climbs up” the density.
Why does $-Q \nabla \ln p(x)$ circulate around the log density?
Recall that $Q$ is defined as an antisymmetric matrix, i.e. $Q = -Q^\top$. Antisymmetric matrices are linear transformations that (geometrically) result in infinitesimal rotations, which is why the $Q$ portion of the flow circulates around the log density. It’s quite easy to show why the infinitesimal rotation matrix (which rotates a given vector by $d\theta$) is antisymmetric in two dimensions.
The matrix below rotates a given vector by $\theta$:
$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
We want to find a matrix that rotates an input vector by $d\theta$ as $\theta \to 0$. As $\theta \to 0$, $\cos\theta \to 1$ and $\sin\theta \to \theta$. Thus our original rotation matrix becomes
$$R(d\theta) = \begin{pmatrix} 1 & -d\theta \\ d\theta & 1 \end{pmatrix}$$
We can rewrite this as
$$R(d\theta) = I + d\theta\, S, \qquad S = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$$
where $I$ is the identity matrix and $S$ is the matrix on the right. Note that $S$ is antisymmetric!
Hopefully this example gives you some idea of why antisymmetric matrices represent infinitesimal rotation.
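The small-angle claim above is easy to check in a few lines of Python: for a tiny angle, the exact 2-D rotation matrix is the identity plus the angle times an antisymmetric matrix, up to an error of order angle squared:

```python
import numpy as np

def rotation(theta):
    """Exact 2-D rotation matrix by angle theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# Antisymmetric generator of rotations: S == -S.T
S = np.array([[0.0, -1.0], [1.0, 0.0]])

# For a small angle, rotation(theta) ≈ I + theta * S
theta = 1e-4
error = np.abs(rotation(theta) - (np.eye(2) + theta * S)).max()
print(error)  # on the order of theta**2 / 2, i.e. tiny
```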
     
     
The Helmholtz decomposition of the flow function, in addition to telling us that the flow on the states will try to maximize $\ln p(x)$, also tells us something else.
Specifically, the solenoidal flow tells us that our system is not only at steady state, but actually at a specific type of steady state: non-equilibrium steady state (NESS).
  • At equilibrium steady state, the behavior of a system is time-reversible. That is, if you observe the dynamics of the system at equilibrium, you cannot tell if time is running forwards or backwards (ex: inside a cup, when tea has fully dissolved into milk).
  • Non-equilibrium steady states are not time-reversible: the dynamics of the system at steady state do not look the same if time runs forwards vs backwards (ex: inside a cup, when tea has fully dissolved into milk, but someone is constantly stirring the tea; if you reverse the dynamics, the stirring will go from clockwise to counterclockwise, or vice versa).
The solenoidal flow indicates that we are at NESS because this part of the flow renders the system time-irreversible – if you reverse clockwise motion it looks counterclockwise, and vice versa. The $\Gamma \nabla \ln p(x)$ portion of the flow, on the other hand, is completely time-reversible (since $\omega$ opposes it exactly by descending the log density with the same magnitude $\Gamma$; on average, the flows will look the same whether time runs forwards or backwards).
Why is the fact that we’re at NESS and not ESS important? Well, it actually reflects a pretty deep property of the types of things we’re considering. Maintaining a Markov blanket against external perturbations – as, for example, biological systems do – requires a constant influx of energy. This is why biological systems are called dissipative structures: they increase the entropy of the environment to maintain constant entropy within the system.
 
Finally, notice that $f(x) = (\Gamma - Q)\,\nabla \ln p(x)$ is equivalent to $f(x) = (Q - \Gamma)\,\nabla \big({-\ln p(x)}\big)$. The former expression tells us the flow at NESS will try to maximize the log density $\ln p(x)$; the latter expression tells us the flow at NESS will try to minimize the negative log density $-\ln p(x)$, since the $\Gamma$ portion of the flow (which originally climbed up the log density) has been flipped into descending the negative log density. Why I’m pointing this out might seem mysterious and trivial for now, but I promise it will make sense later. The second interpretation is actually quite important.
Now we are left with some questions. What’s the significance of $-\ln p(x)$? And why does the flow minimize this quantity at NESS? The answer to these questions will (finally!!) lead us to variational free energy.
     
🔄
Recap
In this section we’ve (1) laid out a really general setup that describes a wide range of physical systems, (2) slotted in our first thingness condition (the steady-state assumption) and (3) interpreted the resultant equation that describes the dynamics at steady state.
  • Langevin dynamics: A wide range of physical systems can be described by the differential equation $\dot{x}(t) = f(x) + \omega$, where $f$ is some flow function and $\omega$ is normally distributed noise with variance $2\Gamma$.
  • Fokker-Planck equation: We can rewrite the Langevin equation in terms of the change of the probability density on states $p(x, t)$ w.r.t. time, instead of the change of the states themselves w.r.t. time. This is useful because it allows us to neatly slot in one of our thingness conditions.
    • Steady-state assumption: The steady-state thingness condition is that $\dot{p}(x, t) = 0$. Thus we can set the Fokker-Planck equation to $0$ and solve for $f$ in terms of $p$. This directly answers our initial question, since $f$ is the flow function that describes the dynamics of the thing at steady state.
  • Helmholtz decomposition: We find via the Helmholtz decomposition that $f(x) = (\Gamma - Q)\,\nabla \ln p(x)$ satisfies the steady-state condition, i.e. it makes $\dot{p}(x, t) = 0$.
    • The $\Gamma \nabla \ln p(x)$ part of the flow climbs the log density, allowing the thing to stay within the small set of its characteristic states (which keeps its Markov blanket intact) in the face of random fluctuations $\omega$.
    • The $-Q \nabla \ln p(x)$ part of the flow circulates around the log density, making the dynamics time-irreversible, rendering the steady state a non-equilibrium steady state (NESS). This property reflects the fact that biological systems are dissipative structures that perform processes to increase the entropy of their environment to maintain the system’s internal entropy.
Our initial question
Given a self-organizing thing that maintains its Markov blanket, what must the dynamics of that thing’s autonomous states be at steady state?
now has a preliminary answer: the dynamics of that thing are described by the equation
$$f(x) = (\Gamma - Q)\,\nabla \ln p(x)$$
or equivalently as
$$f(x) = (Q - \Gamma)\,\nabla \big({-\ln p(x)}\big)$$
But notice that:
  • Our equation describes the flow on all the states, whereas we really only care about the flow on autonomous states. We haven’t applied the Markov blanket condition yet! (We will do this later.)
  • We don’t know what the significance of $-\ln p(x)$ is, and why the dynamics at NESS minimize this quantity. We will explore this in the next section.
     

III. Interlude: variational free energy

In this section I’ll be trying to explain an important concept – variational inference – from the bottom up. Eventually this will lead us to why $-\ln p(x)$ is important. Things might not immediately make sense or connect for a while, but they will eventually.

I want to infer stuff but I hate adding

Suppose that the world has a simple causal structure $x \to y$, that we can only observe $y$ but not $x$, and that we’re trying to compute the posterior probability $p(x \mid y)$. (Thus $x$ is called a hidden state and $y$ is called an observable state.) Assume we only have access to some generative model $p(x, y)$ factored in terms of the prior distribution $p(x)$ and the likelihood distribution $p(y \mid x)$, i.e. $p(x, y) = p(y \mid x)\,p(x)$.
To compute $p(x \mid y)$ exactly we’d have to use Bayes’ theorem:
$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}$$
We’d have to compute the $p(y)$ term in the denominator by marginalizing over the joint probability distribution, i.e. calculating $p(y) = \sum_x p(x, y)$. In many cases this quantity is intractable because the space of possible $x$ values is really large, so the sum is hard to calculate. It gets even worse if we’re dealing with probability densities instead of probability distributions and $p(y) = \int p(x, y)\, dx$ – ew, disgusting integrals.
So… how can we compute $p(x \mid y)$ without calculating $p(y)$?
The solution is to turn this inference problem into an optimization problem. That is, suppose we create some simpler probability distribution $q_\phi(x)$, parametrized by $\phi$. And further suppose we compute $p(x \mid y)$ by adjusting $\phi$ such that the approximate distribution $q_\phi(x)$ gets closer and closer to the true posterior $p(x \mid y)$. This way we won’t have to directly calculate $p(y)$ at all!

KL Divergence

But first, we need a way to measure the difference between two probability distributions so that we have a metric to minimize. The standard way to measure how different a probability distribution $P$ is from another distribution $Q$ is the KL divergence, which is defined as the expected value on $P$ of $\ln \frac{P(x)}{Q(x)}$, i.e.
$$D_{KL}(P \,\|\, Q) = \mathbb{E}_{P(x)}\!\left[ \ln \frac{P(x)}{Q(x)} \right]$$
Note that KL divergence is not symmetric, i.e. $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$ in general, and that KL divergence is always non-negative.
 
Background on KL divergence
There are many ways to explain and arrive at KL divergence. This is only one of them, and probably(?) the most relevant in this context.
The big picture idea here is that $D_{KL}(P \,\|\, Q)$ measures how surprised you would be, if you found out that $P$ was the true probability distribution instead of $Q$. To rephrase this another way: you’re treating $P$ as the territory and $Q$ as your map, and measuring how surprised you’d be (by the territory) if you treated your map as the territory.
The first thing we need to do is formalize a measure of surprise. There’s already a concept for this called surprisal from information theory, defined as $-\ln p(x)$.
From here, it’s quite straightforward to see (intuitively) why the definition of KL divergence measures what we want it to measure. We’re taking the expectation over $P$ because we want to take the weighted average of how surprised we are over each possible state of reality.
 
The quantity that we will be trying to minimize is $D_{KL}\big(q_\phi(x) \,\|\, p(x \mid y)\big)$. But wait… why not go the other way? Why not minimize $D_{KL}\big(p(x \mid y) \,\|\, q_\phi(x)\big)$ instead? KL divergence is not symmetric, so how do we know which way is “right”?
We actually cannot go the other way, i.e. we cannot compute $D_{KL}\big(p(x \mid y) \,\|\, q_\phi(x)\big)$, because the fact that we don’t know $p(x \mid y)$ is why we’re trying to minimize KL divergence in the first place! Thus we are forced to minimize the reverse KL divergence $D_{KL}\big(q_\phi(x) \,\|\, p(x \mid y)\big)$, instead of the forward KL divergence $D_{KL}\big(p(x \mid y) \,\|\, q_\phi(x)\big)$.
    We still have a problem, though. Calculating still requires us to know which… we obviously don’t, because this is what we are trying to compute in the first place. So somehow, we need to rewrite the reverse KL divergence into a quantity we do know how to minimize.

    Rewriting KL divergence

$$\begin{aligned} D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z \mid o)\big) &= \sum_z q_\phi(z)\big[\ln q_\phi(z) - \ln p(z \mid o)\big] \\ &= \sum_z q_\phi(z)\big[\ln q_\phi(z) - \ln p(o, z) + \ln p(o)\big] \\ &= \sum_z q_\phi(z)\big[\ln q_\phi(z) - \ln p(o, z)\big] + \ln p(o) \end{aligned}$$

In the second line we used $\ln p(z \mid o) = \ln p(o, z) - \ln p(o)$. In the last line, the $\sum_z q_\phi(z)$ in front of $\ln p(o)$ disappears since $\ln p(o)$ does not depend on $z$ and $q_\phi(z)$ sums to $1$ because it is a probability distribution.
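This algebra is easy to sanity-check numerically on a tiny made-up model (one fixed observation, hidden $z \in \{0, 1\}$; all numbers are for illustration only):

```python
import math

p_joint = {0: 0.08, 1: 0.32}                  # made-up p(o, z) for one fixed o
p_o = sum(p_joint.values())                   # p(o) = 0.4, by marginalizing
posterior = {z: p_joint[z] / p_o for z in p_joint}
q = {0: 0.5, 1: 0.5}                          # an arbitrary approximate distribution

kl = sum(q[z] * (math.log(q[z]) - math.log(posterior[z])) for z in q)
rewritten = sum(q[z] * (math.log(q[z]) - math.log(p_joint[z])) for z in q) + math.log(p_o)
print(kl, rewritten)   # equal: the rewrite only used Bayes' theorem and sum-to-one
```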
Notice that the final expression we obtain has two parts, one of which is dependent on $\phi$ and the other of which is not:

$$D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z \mid o)\big) = \underbrace{\mathbb{E}_{q_\phi}\big[\ln q_\phi(z) - \ln p(o, z)\big]}_{\text{depends on } \phi} + \underbrace{\ln p(o)}_{\text{does not depend on } \phi}$$

So since we’re trying to adjust $\phi$ to minimize $D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z \mid o)\big)$, we actually only care about minimizing the part of the expression that depends on $\phi$. Let’s give this part a special name: variational free energy (VFE), and denote it $F(\phi)$. (Yay! We finally got to the “free energy” part of “Free Energy Principle”.)
Notice that

$$F(\phi) = \mathbb{E}_{q_\phi}\big[\ln q_\phi(z) - \ln p(o, z)\big] \quad \Longrightarrow \quad \arg\min_\phi F(\phi) = \arg\min_\phi D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z \mid o)\big)$$

What just happened here? After a bunch of symbol shuffling, we’ve rewritten the quantity we initially wanted to minimize by adjusting $\phi$ but couldn’t, $D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z \mid o)\big)$, into a quantity that we actually can minimize: $F(\phi)$, which only involves $q_\phi(z)$ and the joint $p(o, z)$. We’ve figured out how to approximate the posterior $p(z \mid o)$ when you can’t marginalize over $z$ to calculate $p(o)$: just minimize VFE.
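Here is a minimal sketch of that recipe on the same kind of toy model: parametrize $q_\phi(z{=}1) = \sigma(\phi)$ and minimize $F(\phi)$ by brute-force grid search. (The joint is made up, and real implementations use gradients rather than grids; this is only to show the shape of the computation.)

```python
import math

p_joint = {0: 0.08, 1: 0.32}          # made-up p(o, z); exact posterior p(z=1|o) = 0.8

def vfe(phi):
    """F(phi) = E_q[ln q(z) - ln p(o, z)], with q_phi(z=1) = sigmoid(phi)."""
    q1 = 1 / (1 + math.exp(-phi))
    q = {0: 1 - q1, 1: q1}
    return sum(q[z] * (math.log(q[z]) - math.log(p_joint[z])) for z in q)

# Grid search over phi -- note that p(o) is never computed anywhere.
best_phi = min((i / 100 for i in range(-500, 501)), key=vfe)
q1 = 1 / (1 + math.exp(-best_phi))
print(q1)   # ~0.8: minimizing VFE recovered the exact posterior
```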
There turn out to be lots of other interesting ways to rewrite VFE: some have interesting theoretical implications and map nicely onto biological theories and theories of cognition; others are practically useful for computing it.
Also notice that if we rearrange the last line of our initial attempt to rewrite $D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z \mid o)\big)$, we get

$$F(\phi) = D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z \mid o)\big) - \ln p(o)$$

This decomposition is impractical if we want to calculate VFE, but is conceptually significant for our attempt to understand how the FEP is derived. Notice that since KL divergence is always non-negative,

$$F(\phi) \geq -\ln p(o)$$

In other words, the surprisal $-\ln p(o)$ is a lower bound on VFE. This is very important. Keep this in mind.
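A quick numeric check of the bound, on the same made-up toy model as before: whatever $q$ we pick, $F$ never dips below the surprisal, and it touches the bound exactly when $q$ equals the true posterior:

```python
import math
import random

random.seed(0)
p_joint = {0: 0.08, 1: 0.32}          # made-up p(o, z); p(o) = 0.4
bound = -math.log(0.4)                # the surprisal -ln p(o)

def vfe(q1):
    q = {0: 1 - q1, 1: q1}
    return sum(q[z] * (math.log(q[z]) - math.log(p_joint[z])) for z in q)

# F >= -ln p(o) for any q (up to floating-point slack)...
assert all(vfe(random.uniform(1e-6, 1 - 1e-6)) >= bound - 1e-12 for _ in range(1000))
print(vfe(0.8) - bound)               # ~0 at the exact posterior p(z=1|o) = 0.8
```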
     
    🔄
    Recap
In this section we’ve established that (1) minimizing VFE is equivalent to approximate Bayesian inference – i.e. approximating the posterior $p(z \mid o)$ without computing the intractable marginal $p(o)$ – and (2) the surprisal $-\ln p(o)$ is a lower bound on VFE.
    Our initial question was
    Given a self-organizing thing that maintains its Markov blanket, what must the dynamics of that thing’s autonomous states be at steady state?
    We are quite close to answering it! In the next section, we will:
• Find an expression for the flow on autonomous states at NESS, specifically $f_\alpha(\pi)$ (rather than our previous equation $f(x) = (Q - \Gamma)\nabla \mathfrak{I}(x)$, which expressed the flow on all the states at NESS)
    • Conceptually link that expression with VFE and interpret what this means
     

    IV. Finally answering the question!

    In this section, we will finally obtain a formal answer to our question posed at the beginning.

    Marginal flow lemma

    Given a self-organizing thing that maintains its Markov blanket, what must the dynamics of that thing’s autonomous states be at steady state?
We can almost answer our question with the Helmholtz decomposition we obtained earlier:

$$f(x) = (Q - \Gamma)\,\nabla \mathfrak{I}(x), \qquad \mathfrak{I}(x) = -\ln p(x)$$
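As a sanity check of the decomposition (my own illustration, not from the papers), here is a sketch that simulates the Langevin dynamics $dx = f(x)\,dt + d\omega$ with drift $f(x) = (Q - \Gamma)\nabla\mathfrak{I}(x)$, fluctuation covariance $2\Gamma\,dt$, and a Gaussian steady-state surprisal $\mathfrak{I}(x) = \tfrac{1}{2}x^\top \Pi x$. The empirical covariance of the trajectory should match $\Pi^{-1}$ regardless of the antisymmetric $Q$ we choose – the solenoidal flow circulates along level sets of $\mathfrak{I}$ without changing the steady-state density:

```python
import numpy as np

rng = np.random.default_rng(0)

Pi = np.array([[1.0, 0.3],
               [0.3, 1.5]])            # NESS precision: I(x) = 0.5 * x^T Pi x
gamma = 1.0                            # Gamma = gamma * I (random fluctuations)
Q = np.array([[0.0, 0.5],
              [-0.5, 0.0]])            # antisymmetric solenoidal coupling

A = (Q - gamma * np.eye(2)) @ Pi       # f(x) = (Q - Gamma) grad I(x) = A x

dt, n_steps, burn_in = 0.01, 200_000, 20_000
x = np.zeros(2)
samples = []
for t in range(n_steps):
    noise = rng.normal(size=2) * np.sqrt(2 * gamma * dt)   # <dw dw^T> = 2 Gamma dt
    x = x + A @ x * dt + noise                             # Euler-Maruyama step
    if t >= burn_in:
        samples.append(x)

emp_cov = np.cov(np.array(samples).T)
print(emp_cov)                         # close to inv(Pi), independent of Q
print(np.linalg.inv(Pi))
```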
    The only thing left to do now is to find an equation that describes the flow of autonomous states at NESS. But to do this, we first need to define what it even means to express the flow of a subset of states.
Suppose we want to express the flow of some subset of states $x_1$, conditioned on us knowing the values of some larger subset of states $x_2 \supseteq x_1$ (write $x_2 = (x_1, x_2')$). To compute this quantity $f_{x_1}(x_2)$, we have to marginalize out the effects of the remaining states $x_3$, where $x = (x_2, x_3)$, i.e. we need to integrate over $x_3$ and weight the integrand by $p(x_3 \mid x_2)$. Thus we can define $f_{x_1}(x_2)$ as

$$f_{x_1}(x_2) := \int f_{x_1}(x)\, p(x_3 \mid x_2)\, dx_3 = (Q_{x_1 x_1} - \Gamma_{x_1 x_1})\,\nabla_{x_1} \mathfrak{I}(x_2) + Q_{x_1 x_2'}\,\nabla_{x_2'} \mathfrak{I}(x_2)$$

The second equality holds due to the marginal flow lemma, which we prove below.
     
    Proof of marginal flow lemma
We want to prove

$$\int f_{x_1}(x)\, p(x_3 \mid x_2)\, dx_3 = (Q_{x_1 x_1} - \Gamma_{x_1 x_1})\,\nabla_{x_1} \mathfrak{I}(x_2) + Q_{x_1 x_2'}\,\nabla_{x_2'} \mathfrak{I}(x_2)$$

We will start by obtaining an expression for $f_{x_1}(x)$ by partitioning the Helmholtz decomposition $f(x) = (Q - \Gamma)\nabla\mathfrak{I}(x)$ into a system of equations

$$\begin{bmatrix} f_{x_1}(x) \\ f_{x_2'}(x) \\ f_{x_3}(x) \end{bmatrix} = \begin{bmatrix} Q_{x_1 x_1} - \Gamma_{x_1 x_1} & Q_{x_1 x_2'} & Q_{x_1 x_3} \\ Q_{x_2' x_1} & Q_{x_2' x_2'} - \Gamma_{x_2' x_2'} & Q_{x_2' x_3} \\ Q_{x_3 x_1} & Q_{x_3 x_2'} & Q_{x_3 x_3} - \Gamma_{x_3 x_3} \end{bmatrix} \begin{bmatrix} \nabla_{x_1} \mathfrak{I}(x) \\ \nabla_{x_2'} \mathfrak{I}(x) \\ \nabla_{x_3} \mathfrak{I}(x) \end{bmatrix}$$

so that

$$f_{x_1}(x) = (Q_{x_1 x_1} - \Gamma_{x_1 x_1})\,\nabla_{x_1} \mathfrak{I}(x) + Q_{x_1 x_2'}\,\nabla_{x_2'} \mathfrak{I}(x) + Q_{x_1 x_3}\,\nabla_{x_3} \mathfrak{I}(x)$$

Notes
• The off-diagonal blocks $\Gamma_{x_1 x_2'}$, $\Gamma_{x_1 x_3}$, etc. are matrices of all zeroes because $\Gamma$ is a scaled identity matrix
• $p(x) = p(x_3 \mid x_2)\, p(x_2)$ by definition, so $\mathfrak{I}(x) = \mathfrak{I}(x_3 \mid x_2) + \mathfrak{I}(x_2)$
Substituting the expression for $f_{x_1}(x)$ into the definition of marginal flow, we get

$$\begin{aligned} f_{x_1}(x_2) &= \int f_{x_1}(x)\, p(x_3 \mid x_2)\, dx_3 \\ &= \int \Big[ (Q_{x_1 x_1} - \Gamma_{x_1 x_1})\,\nabla_{x_1} \mathfrak{I}(x) + Q_{x_1 x_2'}\,\nabla_{x_2'} \mathfrak{I}(x) + Q_{x_1 x_3}\,\nabla_{x_3} \mathfrak{I}(x) \Big]\, p(x_3 \mid x_2)\, dx_3 \\ &= \int \Big[ (Q_{x_1 x_1} - \Gamma_{x_1 x_1})\,\nabla_{x_1} \big( \mathfrak{I}(x_2) + \mathfrak{I}(x_3 \mid x_2) \big) + Q_{x_1 x_2'}\,\nabla_{x_2'} \big( \mathfrak{I}(x_2) + \mathfrak{I}(x_3 \mid x_2) \big) + Q_{x_1 x_3}\,\nabla_{x_3} \mathfrak{I}(x_3 \mid x_2) \Big]\, p(x_3 \mid x_2)\, dx_3 \end{aligned}$$

Notes
• Line 2: substitute expression for $f_{x_1}(x)$ from the partitioned matrix system of equations above
• Line 3: application of chain rule on $\mathfrak{I}(x) = \mathfrak{I}(x_3 \mid x_2) + \mathfrak{I}(x_2)$ (and $\nabla_{x_3}\mathfrak{I}(x_2) = 0$, since $\mathfrak{I}(x_2)$ does not depend on $x_3$)
We can simplify the integrals over the $\mathfrak{I}(x_3 \mid x_2)$ terms. For any $x_i \in x_2$,

$$\begin{aligned} \int \nabla_{x_i} \mathfrak{I}(x_3 \mid x_2)\, p(x_3 \mid x_2)\, dx_3 &= -\int \frac{\nabla_{x_i}\, p(x_3 \mid x_2)}{p(x_3 \mid x_2)}\, p(x_3 \mid x_2)\, dx_3 \\ &= -\int \nabla_{x_i}\, p(x_3 \mid x_2)\, dx_3 \\ &= -\nabla_{x_i} \int p(x_3 \mid x_2)\, dx_3 \\ &= -\nabla_{x_i} 1 = 0 \end{aligned}$$

and similarly $\int \nabla_{x_3} \mathfrak{I}(x_3 \mid x_2)\, p(x_3 \mid x_2)\, dx_3 = -\int \nabla_{x_3}\, p(x_3 \mid x_2)\, dx_3 = 0$.
Notes
• We can take $\nabla_{x_i}$ out of the integral because it doesn’t differentiate with respect to the variable we’re integrating over
• The integral over the entire domain of any probability density is $1$, and the gradient of a constant is $0$
• The average change of any probability density over the entire domain is $0$ (the density vanishes at the boundary of its domain)
Substituting back into our expression for $f_{x_1}(x_2)$, we get

$$f_{x_1}(x_2) = (Q_{x_1 x_1} - \Gamma_{x_1 x_1})\,\nabla_{x_1} \mathfrak{I}(x_2) + Q_{x_1 x_2'}\,\nabla_{x_2'} \mathfrak{I}(x_2)$$

Notes
• The $\mathfrak{I}(x_2)$ terms survive unchanged because $\mathfrak{I}(x_2)$ doesn’t depend on $x_3$ and $\int p(x_3 \mid x_2)\, dx_3 = 1$
       
       
       
If we further assume that the second solenoidal coupling term $Q_{x_1 x_2'} = 0$, we can express the marginal flow of $x_1$ purely in terms of the gradient of $\mathfrak{I}(x_2)$ with respect to $x_1$:

$$f_{x_1}(x_2) = (Q_{x_1 x_1} - \Gamma_{x_1 x_1})\,\nabla_{x_1} \mathfrak{I}(x_2)$$
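The lemma (in its full two-term version, solenoidal coupling included) is easy to verify numerically for a Gaussian NESS, where both sides are linear maps of $x_2$. This is my own check, with arbitrarily chosen matrices, not an example from the papers; it uses the fact that for a zero-mean Gaussian, $\mathbb{E}[x_3 \mid x_2]$ is linear in $x_2$, so the conditional expectation of the flow has a closed form:

```python
import numpy as np

# Partition: x1 = {0} (flow of interest), x2 = {0, 1} (known), x3 = {2} (marginalized).
Pi = np.array([[1.2, 0.3, 0.2],
               [0.3, 1.0, 0.4],
               [0.2, 0.4, 1.5]])          # NESS precision: I(x) = 0.5 * x^T Pi x
Sigma = np.linalg.inv(Pi)
gamma = 0.7                                # Gamma = gamma * I
Q = np.array([[ 0.0,  0.5, -0.2],
              [-0.5,  0.0,  0.3],
              [ 0.2, -0.3,  0.0]])         # antisymmetric
A = (Q - gamma * np.eye(3)) @ Pi           # full flow: f(x) = A x

# LHS: f_{x1}(x2) = E[f_{x1}(x) | x2]. Since E[x3 | x2] = Sigma_32 inv(Sigma_22) x2,
# the map x2 -> E[x | x2] is linear:
S22, S32 = Sigma[:2, :2], Sigma[2:, :2]
cond = np.vstack([np.eye(2), S32 @ np.linalg.inv(S22)])
lhs = A[0, :] @ cond

# RHS of the lemma: row x1 of (Q - Gamma) over the x2 columns, applied to
# grad_{x2} I(x2), where I(x2) = 0.5 * x2^T inv(S22) x2:
rhs = (Q[0, :2] - gamma * np.eye(3)[0, :2]) @ np.linalg.inv(S22)

print(lhs, rhs)                            # the same linear map on x2
```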

      Free energy lemma

      We can apply the marginal flow lemma to finally answer our question.
To find the flow of autonomous states given that we know the values of all the states that make up the system (internal states, active states, sensory states), we can apply the marginal flow lemma, setting $x_1 = \alpha$ and $x_2 = \pi$, to get

$$f_\alpha(\pi) = (Q_{\alpha \alpha} - \Gamma_{\alpha \alpha})\,\nabla_\alpha \mathfrak{I}(\pi)$$

or equivalently

$$f_\alpha(\pi) = -(Q_{\alpha \alpha} - \Gamma_{\alpha \alpha})\,\nabla_\alpha \ln p(\pi)$$

This expression means that at NESS, the flow of autonomous states given the values of the states that make up the system will descend on, or minimize, $\mathfrak{I}(\pi)$: the $\Gamma$ term performs gradient descent on $\mathfrak{I}(\pi)$, while the solenoidal $Q$ term circulates along its level sets.
Recall that $\mathfrak{I}(\pi) = -\ln p(\pi)$ is the lower bound on VFE. This follows from our bound $F \geq -\ln p(o)$ above, if we set the hidden variable $z$ to the external states $\eta$ and the observable variables $o$ to the particular states $\pi$.
Thus if we assume that $q(\eta) = p(\eta \mid \pi)$, which implies $F = \mathfrak{I}(\pi)$, then

$$f_\alpha(\pi) = (Q_{\alpha \alpha} - \Gamma_{\alpha \alpha})\,\nabla_\alpha F$$

In other words, we can interpret the flow of autonomous states at NESS as descending on (or minimizing) variational free energy, i.e. performing approximate Bayesian inference on the external states. This is the free energy lemma! Yay, we have finally arrived at what we’ve set out to show!
       
Let’s take a step back to see what we’ve actually done here. We’ve formalized the reason why things that maintain their thingness appear to model (or represent) their environments. Our answer is: for a certain definition of thing, the FEP is true, which means that the flow on the autonomous states of the thing at NESS performs approximate Bayesian inference on hidden (external) states. This is very much related to Ashby’s good regulator theorem.
We’ve also established what things must do, by virtue of their thingness: any thing which fits our definition of thing must minimize variational free energy at NESS. This tells us a surprising amount, because VFE can be decomposed in a surprisingly large number of (meaningful) ways:

$$\begin{aligned} F &= \underbrace{D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid o)\big)}_{\text{divergence}} \; \underbrace{-\; \ln p(o)}_{\text{evidence}} \\ &= \underbrace{D_{\mathrm{KL}}\big(q(z) \,\|\, p(z)\big)}_{\text{complexity}} - \underbrace{\mathbb{E}_{q}\big[\ln p(o \mid z)\big]}_{\text{accuracy}} \\ &= \underbrace{\mathbb{E}_{q}\big[-\ln p(o, z)\big]}_{\text{energy}} - \underbrace{\mathbb{H}\big[q(z)\big]}_{\text{entropy}} \end{aligned}$$
       
There is a way in which we’ve looped back around to right where we started. We started by (1) assuming that things exist over time, and (2) asking ourselves what things must do given this. The conclusion we’ve obtained is essentially a (very precise) tautology: the free energy lemma formally states that “things which exist over time will continue existing over time” is equivalent to “things will minimize VFE”.
But we have also, somewhat paradoxically, gained some new information from this tautology. We’ve obtained a general, falsifiable method for modeling what things (which fall under the FEP definition of thingness) must do: (1) specify the internal, external, sensory and active states of the thing of interest and (2) specify the probability distribution over them.
Finally, the FEP is also useful because it suggests that every behavior a “thing” exhibits (even behaviors conventionally thought of as having different objectives, such as action and perception) can be cast as some form of minimization of variational free energy. This has inspired various process theories that aim to make more concrete predictions, such as active inference, predictive coding, and even theories of autism.
      Personally, I think that the FEP is interesting because it prompts us to zoom out drastically on the scale of abstraction and generality – and in doing so we now have a better idea of what, where, and how to zoom in when trying to predict how particular systems will behave.