Summary of "The Brain’s Learning Algorithm Isn’t Backpropagation"
Concise summary — main ideas and lessons
Overview
- The video contrasts backpropagation (the dominant machine‑learning learning algorithm) with a biologically plausible alternative called predictive coding.
- Backprop is mathematically powerful but incompatible with several biological facts about real brains. Predictive coding provides a framework that better matches neurophysiology and can also serve as a useful machine‑learning algorithm.
- Core technical viewpoint:
- An energy-based formulation: the brain (or network) minimizes a global "energy" equal to the sum of squared prediction errors; inference and learning proceed by locally reducing that energy.
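In symbols (a minimal formalization using assumed notation, not taken verbatim from the video): writing x_l for the activity of layer l, W_l for the top-down weights that predict layer l from layer l+1, and f for the neuronal activation function,
```latex
% Assumed notation: x_l = layer-l activity, W_l = top-down weights from layer l+1,
% f = activation function; the sensory layer is l = 0.
E \;=\; \tfrac{1}{2}\sum_{l} \lVert \varepsilon_l \rVert^{2},
\qquad
\varepsilon_l \;=\; x_l - W_l\, f(x_{l+1}).
```
The factor of 1/2 is a convention that only rescales gradients. Inference lowers E by adjusting the activities x_l; learning lowers it by adjusting the weights W_l.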
Why backprop is considered biologically implausible
- The fundamental learning problem is credit assignment: determining which synapses to change to improve performance.
- Backpropagation solves credit assignment via automatic differentiation and a backward pass of error signals, but this requires biologically unrealistic mechanisms:
- Discrete, separated phases (forward pass, error computation, backward pass, then simultaneous weight updates). Real neurons do not “freeze” activity to perform a backward pass.
- Precise global coordination and temporally ordered updates across individual neurons. Brains operate with local autonomy, not cell‑by‑cell globally coordinated sequences.
- The weight-transport problem: backprop requires the feedback pathway to use an exact copy (the transpose) of the forward weights, a symmetry that is not plausibly available in biological tissue.
- Because brains process information slowly, continuously, and massively in parallel — without circuitry for per‑neuron phase control — exact backprop is unlikely to be implemented in cortex.
Predictive coding: the alternative
- Basic idea: the brain continually predicts incoming sensory data; only unexpected signals (prediction errors) require extensive processing. This saves metabolic cost and supports efficient inference.
- Architecture: hierarchical layers in which each layer predicts the activity of the layer below.
- Top‑down connections carry predictions.
- Bottom‑up connections carry prediction errors (differences between actual and predicted activity).
- Energy function: total energy = sum over units of squared prediction errors. The system evolves to minimize that energy.
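A minimal NumPy sketch of that energy under the same assumed notation (layer 0 at the bottom, top-down predictions; an illustration, not code from the video):
```python
import numpy as np

def energy(xs, Ws, f=np.tanh):
    """Total prediction-error energy for a top-down predictive-coding hierarchy.

    xs : list of activity vectors, xs[0] = bottom (sensory) layer, xs[-1] = top layer.
    Ws : list of top-down weight matrices; Ws[l] maps layer l+1 activity to a
         prediction of layer l, so Ws[l].shape == (len(xs[l]), len(xs[l + 1])).
    """
    E = 0.0
    for l in range(len(xs) - 1):
        eps = xs[l] - Ws[l] @ f(xs[l + 1])   # prediction error at layer l
        E += 0.5 * np.sum(eps ** 2)          # squared errors sum into the energy
    return E
```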
Detailed mechanics
Network components
- Representational neurons: encode the current “state” or prediction passed down the hierarchy.
- Dedicated error neurons: each representational neuron is paired with an error neuron that explicitly encodes its prediction error; predictive coding requires such explicit error-encoding units.
Neural dynamics (inference step)
- Each representational neuron updates its activity to reduce its local prediction error and to better predict the layer below.
- Intuition for the update: the activity change is roughly the negative of that neuron's own local prediction error plus a weighted sum of the prediction errors arriving from the layer below, a compromise between aligning to top-down predictions and improving bottom-up predictions (sketched in code after this list).
- Error neurons compute a local comparator: error = representational activity − prediction (prediction = weighted sum of activities from the layer above).
- All updates are local: neurons only need their own activity, the paired local error neuron, and inputs from immediate neighboring layers.
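A minimal sketch of one such relaxation step, under the assumptions above (NumPy, tanh activation, bottom layer clamped to data; illustrative only, not the video's code):
```python
import numpy as np

def relax_step(xs, Ws, lr=0.05, f=np.tanh, df=lambda x: 1 - np.tanh(x) ** 2):
    """One local relaxation step on the activities (gradient descent on the energy).

    xs[0] is assumed clamped to sensory data and is not updated here.
    """
    # Local prediction errors: eps[l] = actual activity - top-down prediction.
    eps = [xs[l] - Ws[l] @ f(xs[l + 1]) for l in range(len(xs) - 1)]

    new_xs = [xs[0]]                                          # bottom layer stays clamped
    for l in range(1, len(xs)):
        own_err = eps[l] if l < len(xs) - 1 else 0.0          # top layer has no own error
        below_err = df(xs[l]) * (Ws[l - 1].T @ eps[l - 1])    # errors fed back from below
        new_xs.append(xs[l] - lr * (own_err - below_err))     # compromise between the two
    return new_xs
```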
Synaptic (weight) learning rule
- Weight changes follow a local, Hebbian‑like rule:
- Δw ∝ presynaptic_activity × postsynaptic_error, with the sign chosen so that the update descends the energy (see the sketch after this list).
- This rule comes from taking the gradient of the energy with respect to the weights, i.e., steepest descent on the same energy that drives inference.
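The corresponding local weight update, continuing the same sketch (assumed learning rate and activation; Ws[l] predicts layer l from layer l+1):
```python
import numpy as np

def update_weights(xs, Ws, lr=0.001, f=np.tanh):
    """Local, Hebbian-like weight update applied after the activities have settled.

    For each top-down connection Ws[l] (layer l+1 -> prediction of layer l):
      dW ∝ postsynaptic error (at layer l) x presynaptic activity (from layer l+1),
    which equals -dE/dW, i.e. steepest descent on the energy.
    """
    for l in range(len(Ws)):
        eps = xs[l] - Ws[l] @ f(xs[l + 1])          # local prediction error
        Ws[l] += lr * np.outer(eps, f(xs[l + 1]))   # Δw ∝ pre-activity × post-error
    return Ws
```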
Addressing the weight‑transport/symmetry issue
- Predictive coding benefits from symmetric forward and backward connections, but exact symmetry is not strictly necessary.
- Feedforward and feedback synapses, when trained locally with similar rules, can converge approximately toward useful alignment; approximate symmetry suffices in practice.
- Nonlinearities complicate exact symmetry; empirical and theoretical work suggests approximate symmetry works well enough.
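One common way to illustrate the relaxed symmetry requirement (an assumption-laden sketch, not a mechanism described in the video): keep independent forward and feedback matrices and train both with the same local outer-product rule, so they drift toward approximate alignment rather than being copied.
```python
import numpy as np

rng = np.random.default_rng(0)
n_below, n_above = 20, 10                             # toy dimensions, assumed for illustration

# Independent forward (prediction) and feedback (error-routing) pathways.
W = rng.normal(scale=0.1, size=(n_below, n_above))    # predicts the layer below from the layer above
B = rng.normal(scale=0.1, size=(n_above, n_below))    # carries errors back up, replacing W.T

def local_update(W, B, eps_below, act_above, lr=0.01):
    """Apply the same local outer-product rule to both pathways.

    Both matrices accumulate (transposed) copies of the same error x activity
    products, so they drift toward approximate symmetry without being copied.
    """
    W += lr * np.outer(eps_below, act_above)
    B += lr * np.outer(act_above, eps_below)
    return W, B
```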
Training and inference procedure (practical steps)
- Clamp sensory inputs at the bottom layer (fix these nodes to data).
- Optionally clamp the top layer to labels for supervised learning.
- Let activities (both representational and error neurons) iteratively relax via local dynamics until equilibrium (an energy minimum) is reached.
- Apply local weight updates: Δw ∝ presynaptic_activity × postsynaptic_error.
- Repeat across examples; the weights gradually encode the statistical structure of the data (a compact end-to-end sketch follows this list).
- For generative sampling: unclamp the top/output layer and run the dynamics to equilibrium to synthesize data consistent with the learned model.
- For classification: freeze the weights, let the network settle, and read out labels from the top-layer activities.
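A compact, self-contained sketch tying these steps together (NumPy; the layer sizes, learning rates, relaxation schedule, and tanh activation are all assumptions for illustration, not details from the video):
```python
import numpy as np

def relax(xs, Ws, clamp_top, steps=50, lr=0.05, f=np.tanh,
          df=lambda x: 1 - np.tanh(x) ** 2):
    """Let activities settle toward an energy minimum via purely local updates."""
    for _ in range(steps):
        eps = [xs[l] - Ws[l] @ f(xs[l + 1]) for l in range(len(xs) - 1)]
        for l in range(1, len(xs)):
            if l == len(xs) - 1 and clamp_top:
                continue                                      # label layer stays clamped
            own = eps[l] if l < len(xs) - 1 else 0.0
            below = df(xs[l]) * (Ws[l - 1].T @ eps[l - 1])
            xs[l] = xs[l] - lr * (own - below)
    return xs

def train_step(x_data, y_label, Ws, sizes, f=np.tanh, lr_w=0.001):
    """One supervised step: clamp bottom and top, relax, then update weights locally."""
    xs = [x_data] + [np.zeros(n) for n in sizes[1:-1]] + [y_label]
    xs = relax(xs, Ws, clamp_top=True)
    for l in range(len(Ws)):
        eps = xs[l] - Ws[l] @ f(xs[l + 1])
        Ws[l] += lr_w * np.outer(eps, f(xs[l + 1]))           # local Hebbian-like update
    return Ws

def classify(x_data, Ws, sizes):
    """Freeze weights, clamp only the input, let the network settle, read out the top layer."""
    xs = [x_data] + [np.zeros(n) for n in sizes[1:]]
    xs = relax(xs, Ws, clamp_top=False)
    return int(np.argmax(xs[-1]))

# Toy usage on random data (shapes only; no claim about task performance).
sizes = [8, 16, 3]                                            # sensory, hidden, label layers
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(sizes[l], sizes[l + 1])) for l in range(len(sizes) - 1)]
for _ in range(100):
    x = rng.normal(size=sizes[0])
    y = np.eye(sizes[-1])[rng.integers(sizes[-1])]            # one-hot label
    Ws = train_step(x, y, Ws, sizes)
print(classify(rng.normal(size=sizes[0]), Ws, sizes))
```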
Advantages claimed
- Biological plausibility: continuous, parallel, local computation and local learning rules resemble cortical physiology and plasticity (Hebbian‑like).
- No need for global phase switching; inference and learning can occur simultaneously and continuously.
- Highly parallelizable and potentially computationally efficient.
- May mitigate catastrophic forgetting and, in some cases, find better solutions than an algorithm focused only on a global output loss.
Caveats and limitations
- Real brains have richer, more complex connectivity than the simplified hierarchical model used in many predictive‑coding accounts.
- Nonlinear activation functions complicate theoretical equivalences with backprop and make perfect weight symmetry unlikely.
- The account does not prove that the brain uses predictive coding; it provides a plausible, testable alternative consistent with many observations.
Takeaway
- Predictive coding reframes inference and learning as local energy minimization over prediction errors using dedicated error‑encoding neurons and local weight updates.
- It addresses two major biological conflicts with backprop — the need for discontinuous processing/phases and strict global coordination — while offering machine‑learning advantages worth exploring further.
- Predictive coding is a promising bridge between neuroscience and next‑generation learning algorithms.
Speakers / sources featured
- Video narrator / host (unnamed in the subtitles; the channel’s presenter who references an earlier video).
- Brilliant.org (sponsor mentioned in the video).
Category
Educational