So far, reality has been depicted as a disorderly mist, broken by a scatter of low-entropy pockets – “systems” – that feed on each other in a swirling, co-adaptive dance towards ever-increasing complexity. We have seen that probability theory, and its role in information theory, provides a powerful body of concepts with which we can begin to formalize how systems interact and learn about each other. But probability theory can be soul-crushingly difficult to grasp, and its mathematical simplicity makes its slippery logic no less aggravating. Let us therefore recapitulate some probability fundamentals.
“Probability”, we have seen, is a ratio representing the relative frequency of some states out of all the states possible, for some unpredictable dynamical system (referred to as a “random variable”). By “state” we mean a human category of states, since no two states are objectively identical, and the head/tail distinction has significance only for humans. The relative frequencies are normally obtained empirically, through observation, because we usually lack insight into the system’s internal dynamics. Textbook problems that run along the lines of “You have 34 lollipops in a bag, and 12 of them are strawberry-flavored…” are therefore misleadingly contrived. In real life, we rarely get to peek inside the bag. Instead, the internal logic of the system’s dynamics cryptically causes random initial conditions to converge on certain long-term states, and these “attractors” will be relatively more frequent in our data.
Caught in the present, an observer’s predicament is to predict what the random variable’s future state will be. Its best guess must be based on these obtained probability-values. However, an observer may often be curious about the probability of some complex event, and this is where the basic probability theorems enter the equation. The observer could, for example, collapse several outcome states into a single category and wonder about its relative frequency. Such an event is known as a disjunction and amounts to the probability of event A OR event B occurring. The relevant probabilities depend crucially on whether the systems lurking behind the variables are assumed to be interdependent: in general, P(A or B) = P(A) + P(B) − P(A and B), and only when the variables are independent does the joint term reduce to the product P(A and B) = P(A) × P(B).
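As a quick check, the two cases can be computed directly; the probabilities below are made up for illustration:

```python
# Probability of a disjunction (A OR B), sketched with made-up numbers.
# Inclusion-exclusion always holds:
#     P(A or B) = P(A) + P(B) - P(A and B)
# Only the joint term depends on whether the variables are interdependent.

def p_disjunction(p_a, p_b, p_joint):
    """Inclusion-exclusion for the probability of A OR B."""
    return p_a + p_b - p_joint

p_a, p_b = 0.5, 0.4

# Independent systems: the joint term factorizes into a product.
p_or_independent = p_disjunction(p_a, p_b, p_a * p_b)  # 0.5 + 0.4 - 0.2 = 0.7

# Mutually exclusive outcomes: the joint term vanishes.
p_or_exclusive = p_disjunction(p_a, p_b, 0.0)          # 0.5 + 0.4 - 0.0 = 0.9
```

Note that ignoring the interdependence – here, the difference between 0.7 and 0.9 – is exactly the kind of error an observer makes when it wrongly assumes the systems behind the variables are unrelated.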
The fact that a hypothesized predictor “co-occurs” with an outcome can be potentially confusing. Normally, a hypothesis precedes the data in time. However, if you think of them as co-occurring in short-term memory, where the predictor is still present, they are symmetrical while still distinct. The conjunction P(hypothesis AND data) is therefore equal to P(data AND hypothesis) – a property known as “commutativity” – while P(data|hypothesis) does not usually equal P(hypothesis|data). The proportion of party nights followed by a hangover does not equal the proportion of hangovers preceded by party nights, for example. So while they have the same mathematical status, the two – termed “likelihood” and “posterior probability”, respectively – vary greatly in their usefulness. Moreover, the likelihood is often much easier to obtain than the posterior probability, since it is based only on observed data, but using the following trick, which exploits the commutativity of conjunction, a posterior probability can be calculated from the likelihood and the prior (let the hypothesis be A and the data be B):
P(A|B) = P(B|A) × P(A) / P(B)
The equation is known as Bayes’ theorem. Intuitively, it tells us how to interpret evidence in the context of previous knowledge. The “likelihood” corresponds to plausibility – how well the evidence fits the hypothesis – but to estimate the probability of the hypothesis, plausibility must be weighted by the hypothesis’ base rate. The theorem thus allows us to reason backwards from measurements to the most probable cause of the data.
For example, in a noisy room, an observer may hear a sequence of words – acoustic data – and be uncertain whether to interpret it as “por mi chica” or “pour me chicken”. “Pour me chicken” is, considered on its own, more plausible: 80% of the times you have encountered someone meaning that, it has sounded just like this. However, living as you are in a Spanish-speaking country, “por mi chica” is considerably more frequent among all the phrases you have heard. It will therefore have the higher posterior probability, and you are wise to opt for this interpretation.
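The comparison can be made concrete in a few lines of code. The 80% likelihood comes from the example above; the rival likelihood and both priors are invented for illustration:

```python
# Bayes in the noisy room - a sketch. The 80% likelihood for "pour me
# chicken" comes from the example; the rival likelihood and both priors
# are invented for illustration.

def posteriors(likelihood, prior):
    """P(h | data) for each hypothesis h, by normalizing likelihood * prior."""
    unnorm = {h: likelihood[h] * prior[h] for h in prior}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

likelihood = {"pour me chicken": 0.80,   # fits the sound very well (given)
              "por mi chica":    0.40}   # fits it less well (assumed)
prior      = {"pour me chicken": 0.01,   # rarely heard around you (assumed)
              "por mi chica":    0.20}   # common in a Spanish-speaking country

post = posteriors(likelihood, prior)
# the humbler fit wins on the prior: P("por mi chica" | sound) is about 0.91
```

The interpretation with the worse acoustic fit still wins, because its prior is twenty times larger.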
Iterated Bayesian inference constitutes learning. For example, the virtue of the cryptographic “one-time pad” mentioned earlier is that nothing is learned, since the prior and posterior probabilities are equal. A difference between prior and posterior means that the probability-value has been revised, that the event carried self-information, and that something has been learned.
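The loop of revision can be sketched in code: each round’s posterior becomes the next round’s prior, and equal likelihoods – the one-time pad case – leave the belief untouched. All numbers are invented:

```python
# Iterated Bayesian inference as learning - a minimal sketch in which each
# round's posterior becomes the next round's prior. All numbers are invented.

def update(prior, p_data_if_true, p_data_if_false):
    """One Bayesian update of belief in a binary hypothesis."""
    numerator = p_data_if_true * prior
    evidence = numerator + p_data_if_false * (1.0 - prior)
    return numerator / evidence

belief = 0.5                        # initially undecided
history = [belief]
for _ in range(5):                  # five observations, each favoring it 3:1
    belief = update(belief, 0.75, 0.25)
    history.append(belief)
# belief climbs toward certainty: 0.5, 0.75, 0.9, 0.96, ...

# With equal likelihoods (the one-time pad case) the posterior equals the
# prior: the data carry no self-information, and nothing is learned.
no_learning = update(0.5, 0.5, 0.5)
```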
Digital implementations of Bayesian inference are what allow software such as Siri, Google Translate and Amazon’s recommender system – as studied in disciplines like “robotics”, “machine learning” and “artificial intelligence” – to modify themselves as a result of past interactions and, from the comfort of their silicon shells, infer systematic structures with which they have never been in direct contact. Given its impressive achievements in these fields, Bayesian inference provides a sort of “existence proof” that complex behavior can emerge from physical systems, thanks to the richness of incoming data, without the aid of some mystical force. This makes it extremely attractive to argue that it constitutes something of a grand organizational principle with which we can begin to understand how the brain, and complex adaptive systems generally, model their environment in order to maintain minimal entropy and minimal surprise.
One Bayes-centered account is the “predictive coding” approach in computational neuroscience. According to this approach, the many hundreds of megabits of sensory data that at every waking moment flow into the brain from peripheral receptors are compared with the sensory states that the brain has predicted to receive as input, based on the long-term models it has stored in synaptic weights. The goal is to maximize the statistical dependency (the mutual information) between internal state and external state, so that, for example, the firing pattern for “apple” occurs only when an actual apple is in sight. To achieve this, the brain must assess and minimize the discrepancy between sensory data and its own predictions. The image is one of an almost Darwinian struggle between candidate hypotheses, evaluated both on their fitness (i.e. how small the discrepancy is) and on how well they have fared historically, so as to continually update their posterior probabilities.
The perceptual system is known to be hierarchical in structure, mirroring the multi-layered reality it is meant to model. Causal invariances in external reality occur at different spatio-temporal frequencies. Some are long-term, like the seasons; some are mid-term, like how the irritable behavior of a friend tends to foreshadow a tantrum. Others are almost instantaneous, like how a certain pattern of light would change if you tilted your head. Sensory cortex therefore has distinct levels of processing that vary in their grain and extent, binding together an internal representation bottom-up, all the way from individual receptor signals, via basic geometric features, to complex conceptual categories. What the “predictive coding” approach argues, based on evidence of feedback connections between the levels, is that multiple competing hypotheses simultaneously cascade in the opposite direction: from a vague, general gist of the situation, to fine-grained predictions about what low-level attributes will be perceived in particular areas of the visual field. At each level, the sensed state is then compared with the state predicted by a higher, slower-changing level, and the discrepancy gives rise to a feedback signal called “prediction error”.
Candidate hypotheses vary in their prior probabilities, which weight the corresponding error signals. Once the Bayesian calculations have taken place, the best-performing hypothesis passes an error signal to the level above it, thus feeding a posterior probability into the likelihood-value higher up and influencing which hypothesis will perform best there. If the error signal climbs through many levels, indicating that high-level predictions have been far off the mark, the hypothesis will be changed globally; but if it climbs only a little, only local, short-term hypotheses will have to respond and be updated. Finally, in order to actually recognize the stimulus, the high-level source of the winning hypotheses – those with the highest posterior probability and the smallest error signal – would have to be identified and determine perceptual content.
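The looping exchange of predictions and errors can be caricatured with a toy two-level model. This is a sketch under heavy simplification – scalar states and hand-picked learning rates – not a claim about neural implementation:

```python
# A toy two-level "predictive coding" loop - a sketch, not a neural model.
# The higher, slower level predicts the lower one; mismatches ("prediction
# errors") flow upward and nudge both estimates until they settle.

def settle(sensory_input, steps=500, lr=0.05):
    low, high = 0.0, 0.0            # fast low-level, slow high-level estimate
    for _ in range(steps):
        error_low = sensory_input - low   # error at the sensory interface
        error_high = low - high           # error between the two levels
        low += lr * (error_low - error_high)   # pulled by data, anchored above
        high += lr * 0.5 * error_high          # the slower-changing level
    return low, high

low, high = settle(1.0)
# both estimates settle near the input, the slow level lagging slightly
```

The point of the caricature is the division of labor: large, persistent errors propagate upward and move the slow level, while small fluctuations are absorbed locally.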
Presumably, once a plausible model emerges, competing models are inhibited via lateral connections through some form of positive feedback mechanism. This may be illustrated with Douglas Hofstadter’s concept of a “parallel terraced scan”, which he used in a computer program able to flexibly draw analogies between letter strings like ABC and FGH. “Alphabetical consecutiveness” is one of many basic concepts (high-level hypotheses) in a space of possible connections to explore. First, the whole space of potential pathways is explored randomly, cheaply and unfocusedly. As probes accumulate, this information is used to assess how promising each pathway seems, and a proportionate amount of resources is allocated to it, so that successive stages are increasingly focused and computationally expensive – making the scan “terraced” (or hierarchical). At no point, however, is the exploration of other possibilities neglected: the chosen path can, after all, turn out to be a dead end. If any of the other paths is found sufficiently promising, it will compete with the current viewpoint, and may ultimately override the positive feedback of the first. Hofstadter uses the metaphor of an ant colony: scout ants make random forays into the forest and report back how strong the scent of food is, so the colony allocates more scouts in that direction but makes sure that some continue to wander around unconcerned, should the path later prove fruitless.
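The allocation logic of such a scan can be sketched as follows; the paths, their hidden quality values and the probe budgets are all invented for illustration:

```python
import random

# A sketch of a "parallel terraced scan": cheap random probes estimate how
# promising each path is, resources are allocated in proportion to promise,
# but every path keeps a floor of exploration. The paths and their hidden
# quality values are invented for illustration.

random.seed(0)
true_quality = {"A": 0.2, "B": 0.7, "C": 0.5}   # hidden from the scanner

def probe(path):
    """One cheap, noisy look at a path."""
    return true_quality[path] + random.uniform(-0.1, 0.1)

allocation = {p: 10 for p in true_quality}       # equal, unfocused first pass
estimates = {}

for stage in range(5):                           # successive "terraces"
    for path, n in allocation.items():
        samples = [probe(path) for _ in range(n)]
        estimates[path] = sum(samples) / n
    total = sum(estimates.values())
    # focus resources on promising paths, but never abandon one entirely
    allocation = {p: max(2, round(30 * estimates[p] / total))
                  for p in estimates}

best = max(estimates, key=estimates.get)
```

The `max(2, …)` floor plays the role of the scouts who keep wandering: even the least promising path retains a trickle of attention in case the current favorite proves a dead end.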
In order to judge how much fitting is enough for a model to be reliable, the brain must also store data about the input variability of different situations, compute the variability of incoming data, and continually revise its prior expectations about precision. For example, the brain should hold apparent regularities in data from cocktail parties to lower standards than regularities in data from a silent room, and not let highly variable data revise first-order hypotheses. In other words, error signals are weighted by their reliability. The result is a regress of statistics-about-statistics, a hierarchy of higher-order statistics, where, for example, the prediction error signal at the second level would be the difference between expected and observed prediction error. If incoming data are precise, they are deemed reliable, strengthening the prediction error signal. However, when faced with imprecise, ambiguous data, perceptual inference falls back on prior knowledge: our perception becomes “theory-laden” and may fall victim to the well-documented “confirmation bias”, where what we see is determined by our anticipations, and where we become more attentive to evidence that confirms our preconceptions.
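Precision-weighting can be illustrated with the textbook Gaussian case, where the weight on the prediction error is a Kalman-style gain; the variances below are invented:

```python
# Precision-weighted prediction error - the textbook Gaussian case, where
# the weight on the error is a Kalman-style gain. All variances invented.

def precision_weighted(prior_mean, prior_var, obs, obs_var):
    """Fuse a Gaussian prior with a Gaussian observation."""
    gain = prior_var / (prior_var + obs_var)       # relative precision of data
    mean = prior_mean + gain * (obs - prior_mean)  # weighted prediction error
    var = (1.0 - gain) * prior_var
    return mean, var

# Silent room: precise data, so the observation dominates the percept.
quiet_mean, _ = precision_weighted(0.0, 1.0, 10.0, obs_var=0.1)   # ~9.1

# Cocktail party: noisy data, so the prior dominates ("theory-laden").
party_mean, _ = precision_weighted(0.0, 1.0, 10.0, obs_var=20.0)  # ~0.48
```

The same surprising observation barely moves the estimate at the noisy party but almost completely captures it in the silent room.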
The “predictive coding” theory only makes claims about unconscious processing. Cognitive scientists call this “System 1”. It operates automatically and involuntarily, as opposed to “System 2”, which is conscious and effortful but capable of more complex cognition. System 1 provides impressions and intuitions to System 2, but their different applications of Bayesian logic do not always agree. Perhaps because its evolutionary benefits outweigh its costs (either that, or due to “evolutionary noise”), a “bug” in our pattern-recognition apparatus makes us over-sensitive to the presence of patterns. For example, in what is known as the “availability bias”, System 1 encodes the frequency of events based on how salient they are, causing System 2 to over-estimate the probability of terrorist attacks as a result of their vividness and disproportionate media reporting. If two events co-occur on very memorable occasions, this bias may also lead us to conclude that there is a causal relationship between them, and to preferentially seek evidence that confirms it. Across science, this poses a real threat to the ambition to model reality in the predictively most powerful way. Explanations in disciplines where experimental data are hard to obtain, such as politics and history, are at particular risk of confusing noise with cause – of reading symbolism into ink-blots.
Our reliance on salience makes our conscious selves poor Bayesians. We systematically fail to take prior probabilities into account when evaluating probability. In what is known as the “representativeness fallacy”, we judge the probability of a stereotypical nerd being a computer scientist as high, even when informed about how rare computer scientists are relative to social scientists. People with paranoid schizophrenia are particularly deficient at Bayesian reasoning, exhibiting something like an extreme form of confirmation bias. The evidence they hyper-selectively attend to may fit their theory that they are part of a CIA conspiracy, but a non-delusional individual can recognize that the base rate for conspiracies is almost non-existent compared with that of schizophrenia.
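The arithmetic of base-rate neglect is easy to make explicit; all rates below are invented for illustration:

```python
# Base-rate neglect in numbers - all rates invented for illustration.
# Even if nerdy traits are ten times as diagnostic of computer scientists,
# a 2% base rate keeps the posterior modest.

def p_cs_given_nerdy(p_nerdy_cs, p_nerdy_ss, p_cs):
    """P(computer scientist | nerdy), over two exhaustive groups."""
    p_ss = 1.0 - p_cs
    evidence = p_nerdy_cs * p_cs + p_nerdy_ss * p_ss
    return p_nerdy_cs * p_cs / evidence

p = p_cs_given_nerdy(p_nerdy_cs=0.60, p_nerdy_ss=0.06, p_cs=0.02)  # ~0.17
```

Under these assumed numbers, the stereotype-matching nerd is still far more likely to be a social scientist, simply because there are so many more of them.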
Finally, psychologists have demonstrated that humans have a very poor intuitive understanding of randomness. We struggle to accept that HHHHH is just as likely an outcome of five coin tosses as HTHTT. This seems to be the source of a whole range of biases impacting our decision-making. We think mega-rich people like Steve Jobs are smarter than those with failed start-ups, and that famous actors are more talented than the thousands of equally deserving drama-school graduates now serving tables, when fortune plays an enormous role in our fates. The brain’s own sensitivity to random factors tends to make expert judgments, like those of clinicians, inferior to those of simple but noise-resistant statistical algorithms. And we have a self-serving tendency, related to the “fundamental attribution error”, to ignore external and random factors when explaining negative behaviors by others, wrongly attributing the behavior to durable personality traits, while readily blaming circumstances when it is our turn to misbehave.
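The coin-toss claim can be verified by brute force: every one of the 32 five-toss sequences has the same probability, and intuition errs because “mixed-looking” is a category covering many sequences, while HHHHH is a category of one.

```python
import itertools

# Brute-force check: every specific five-toss sequence of a fair coin has
# probability (1/2)^5 = 1/32, streaky-looking or not.

sequences = ["".join(s) for s in itertools.product("HT", repeat=5)]
p_each = 0.5 ** 5                   # identical for all 32 sequences

# What intuition actually tracks: "3 heads and 2 tails in some order" is a
# category matched by many sequences, unlike the category {HHHHH}.
n_mixed = sum(1 for s in sequences if s.count("H") == 3)   # 10 sequences
p_mixed = n_mixed * p_each                                 # 10/32
```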
Perhaps a more sober worldview is that of people as grains of pollen, drunkenly walking through life’s joys and hardships in a Brownian zig-zag between celebratory, champagne-fueled highs of “I’m awesome!” and alcohol-drowned lows of sorrow and self-blame – all over outcomes for which we can claim no credit and bear no blame. Sometimes randomness conspires in our favor, sometimes against us – in the former case, we are wise to stay hard-to-impress and humble about our own contributions, and in the latter, to forgive and believe the best about ourselves and others.