VII. Randomness, Probability, Compression and Redundancy

“Order” and “disorder”, we have seen, are observer-dependent categories of a dynamical system’s state space. What characterizes disordered states, relative to any observer, is that different disordered states do not differ in any meaningful way. “Disorder” is one single input category, whereas order could span many different input categories. Within a disordered configuration, a change would not be detectable, whereas in a low-entropy configuration any rearrangement is highly noticeable. A gas could be ordered in many different ways – packed into a corner, concentrated in the middle – but the microstates realizing any such ordered arrangement are vastly outnumbered by the disordered ones, making an observer far more sensitive to changes in an ordered configuration. Meanwhile, if the observer has seen one high-entropy configuration, he has seen them all. This is seen in how it is easier to draw a cloud than a portrait – even a child can draw a somewhat realistic cloud, but when drawing a portrait, the slightest failure in proportion can distort the face beyond recognition, because a face has lower entropy than a cloud.

If all of the information is invisible – if an observer’s knowledge of any particle is zero and entropy is maximal – then we call the system “random” or, alternatively, “stochastic”. The concept is made clearer if we consider a bit-string. If it is completely random, each bit is a 50–50 affair – just as unpredictable as, and uncorrelated with, the previous one. Typically, but not necessarily, such a string appears disordered and dull, making its details inconsequential. By contrast, if the sequence were highly structured, say 101010101010…, where the pattern results from the generating system, then a shuffling would cause a meaningful qualitative change. A generating process in which each new output is independent of the previous one – the ideal that the pseudo-random algorithms in computers try to mimic – can still produce apparently structured stretches like 10101010…, but its entropy rate remains 1 bit per character.
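
A minimal Python sketch makes the point concrete (the string length and seed are arbitrary choices): a fair, independent bit source carries one bit per symbol no matter how its output happens to look, and patterned-looking runs crop up in it purely by chance.

```python
import random

random.seed(42)                                  # arbitrary seed, for repeatability
bits = "".join(random.choice("01") for _ in range(100_000))

# Each symbol is independent and 50-50, so the source's entropy rate is
# 1 bit per character, however the string happens to look locally.
print(bits[:40])

# "Structured" stretches still occur by chance: a 12-character alternating
# run can be expected to appear a couple of dozen times in a string this long.
print(bits.count("101010101010"))
```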

Like a paper snowball, randomly generated sequences are incompressible, but do not vary in any meaningful way.

This highlights an important point. For the same reason that absence of evidence is not the same as evidence of absence (a principle that in philosophy is known as the “problem of induction”, associated with the philosopher David Hume), we can prove a string to be nonrandom simply by successfully compressing it, but we cannot prove a string to be random, that is, to be generated by a process where each new output is independent of the previous. The digits of pi are notoriously pattern-less, but they are far from random. In fact, even the random-number generators used in computers are deterministic algorithms.
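
This asymmetry is easy to demonstrate with an off-the-shelf compressor (here Python’s zlib; the strings are arbitrary examples): successful compression is positive proof of structure, while failure to compress proves nothing.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of a zlib-compressed copy of the data."""
    return len(zlib.compress(data, 9))

patterned = b"10" * 500                 # the highly structured string 1010...
random.seed(0)
noise = bytes(random.getrandbits(8) for _ in range(1000))   # 1000 random bytes

print(compressed_size(patterned))       # far below 1000 bytes: provably non-random
print(compressed_size(noise))           # around 1000 bytes or slightly more: proves
                                        # nothing, merely absence of evidence
```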

In practice, we rarely have insight into the generating mechanism. The link between randomness and the observer’s knowledge may be illustrated by a thought experiment. Suppose a person is told that a bag contains one white and one black ball, and that one of them has been removed and put in a box. Lacking any further information, the observer judges the chance that the boxed ball is black to be the same as the chance that it is white. However, if allowed to peek into the bag and see that the remaining ball is white, this single bit of information instantly makes the observer certain that the ball in the box is black – without anything metaphysically extraordinary taking place. And by returning the ball to the bag and repeating the procedure, a random process has easily been created.

Lacking insight into the generating mechanism, what an observer system can do is record the frequencies of observed outputs and use relative frequency as a basis for quantifying his uncertainty about the next output – a theoretical construct known as probability, represented as a number between 0 and 1. The probabilities of the different outputs – of the different input categories – constitute the probability distribution of a random variable. As new data come in, this distribution is continually updated. Philosophers of the frequentist persuasion argue that it approximates a “true” distribution (a parameter), and that the notion of randomness only truly applies to processes in which an infinite number of trials would yield each possible outcome equally often.
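
In code, this bookkeeping is nothing more than counting (the stream of symbols below is a made-up example):

```python
from collections import Counter

# A hypothetical stream of outputs from an unknown generating mechanism.
observations = "ABAACABBAACA"

counts = Counter()
for symbol in observations:
    counts[symbol] += 1                     # update as each new datum comes in
    total = sum(counts.values())
    estimate = {s: n / total for s, n in counts.items()}

# The relative frequencies serve as the estimated probability distribution:
print(estimate)      # {'A': 0.583..., 'B': 0.25, 'C': 0.166...}
```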

According to the “frequentist” interpretation, probability is an objective property of reality. A more modern interpretation is that it reflects the dynamical organization of the system behind the random variable. That system may have attractors – regions of its state space on which randomly selected initial conditions converge – so that some outcomes acquire a high relative frequency, and hence a high probability.

Systems theory provides an alternative explanation for the existence of stable relative frequencies. The systems that variables represent vary in their dynamics. If a dynamical organization is robust, it is insensitive to surrounding disorder, and very many of its possible initial states eventually lead to the same state (a feature known as an “attractor”). If the initial state is randomly selected (by non-systemic dynamics), then the chance of an event depends on the number of ways in which it can be produced.

For example, in a well-organized bus company, the buses are more likely to arrive on time despite the many random disturbances that the company may face. Similarly, when throwing two dice, achieving a total of 9 is more likely than one of 10, partly because you are a reliable counter and partly because 9 can occur in more ways (four) than 10 (three). Real-life outcomes are composites of systematic and random factors. For this reason an extraordinary event – such as a stunning performance by a player in a football match – is likely to be succeeded by a more normal one, as the random component “selects” a state that is more frequent in the phase space (a phenomenon known as “regression towards the mean”). Freak events are freaky because there are so few sets of circumstances that support them among all circumstances possible. Only in some systems, of the trivially discrete kind, does it really make sense to speak of approaching some true parameter.
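
The dice claim can be checked by brute enumeration (a small Python sketch):

```python
from collections import Counter
from itertools import product

# All 36 equally likely ways two fair dice can land.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

print(ways[9], ways[10])              # 4 ways to make 9, only 3 ways to make 10
print(ways[9] / 36, ways[10] / 36)    # so P(9) ≈ 0.111 > P(10) ≈ 0.083
```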

The Brownian motion of pollen grains is due to random collisions with water molecules that sometimes become lop-sided.

Stable relative frequencies, therefore, are partly due to the sameness of system dynamics, partly due to how the random factors that determine system state are orderly when aggregated. Brownian motion provides a vivid mental image of this. It refers to the jagged, haphazard wandering, sometimes termed “the drunkard’s walk”, of pollen grains suspended in a liquid. When first observed, scientists were unsure whether it was evidence of some stable mechanism that made a pollen grain’s trajectory predictable. Albert Einstein, however, published a paper in 1905 in which he explained the walk as caused by stochastic bumping by water molecules – collisions that mostly cancel out but sometimes, by chance, are lop-sided enough to induce a change of direction. The totally structureless nature of the high-entropy molecular motion makes the path unpredictable on small scales, but en masse it gains exquisite statistical invariance. A histogram of the grains’ displacements from their starting points will, for example, be approximately normally distributed, and a walk will almost certainly not be a straight line. Inability to walk in a straight line therefore remains a reliable rule of thumb for determining whether your friend has had one drink too many.
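
A simulated drunkard’s walk shows the same statistical invariance (the step counts and seed below are arbitrary):

```python
import random
import statistics

random.seed(1)                                 # arbitrary seed, for repeatability

def walk(steps: int) -> int:
    """Net displacement of a one-dimensional walk of unit steps."""
    return sum(random.choice((-1, 1)) for _ in range(steps))

displacements = [walk(1000) for _ in range(5000)]

# Each walk is unpredictable, but the ensemble is exquisitely regular:
print(round(statistics.mean(displacements), 1))    # close to 0
print(round(statistics.stdev(displacements), 1))   # close to sqrt(1000) ≈ 31.6
# A histogram of these displacements is approximately bell-shaped (normal).
```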

Recall that when we previously quantified information, we only considered scenarios in which the outcomes had equal probability. But what happens to the information content of a coin toss if the coin is biased? Clearly, if it necessarily produces heads, the toss resolves no uncertainty and contains zero bits. A data source guaranteed to generate a long string of 1’s likewise has an entropy of zero bits. Similarly, when you engage in a conversation and expect a reply to something you just said, you unconsciously have statistical knowledge of which word is likely to succeed the current one, for there are inter-word dependencies. “Barking up the wrong…” has a high probability of being followed by “tree”. Like a 1 in a string of only 1’s, the word “tree” should therefore carry less information than, say, “dog-owner”. And if a message source is modelled as random with equal probabilities even though it outputs apparently structured data, e.g. 10101010…, then each new character counts as 1 bit of information, even though intuitively the next bit is hardly uncertain at all. How can we make quantitative sense of this?

Well, first of all, each event can be said to possess self-information – an information content determined by how improbable the event is. The more surprising, the more self-information. This is not the same as entropy. A biased coin has an entropy between 0 and 1 bits, but if tails comes up despite having a probability of 1/6, its self-information is much higher than that of heads. Specifically, to maintain additivity (the surprise of two independent events is the sum of their individual surprises), self-information is defined as I(w) = log2(1/P(w)). This is why “information” in normal parlance, seen from the observer’s point of view, means “knowledge”, whereas for an information theorist, concerned with the microstates, it means “uncertainty”. The two meanings are compatible, because we learn more from surprise than we do from fulfilled expectations.
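
For the biased coin of the example, the numbers fall out directly from this definition (a short sketch using base-2 logarithms):

```python
from math import log2

def self_information(p: float) -> float:
    """I(w) = log2(1 / P(w)), in bits."""
    return log2(1 / p)

p_heads, p_tails = 5 / 6, 1 / 6              # the biased coin from the text

print(round(self_information(p_heads), 3))   # ≈ 0.263 bits: little surprise
print(round(self_information(p_tails), 3))   # ≈ 2.585 bits: much more surprise

# Additivity for independent events: two tails in a row carry
# 2.585 + 2.585 ≈ 5.17 bits, the same as log2(1 / (1/36)).
print(round(self_information(p_tails ** 2), 3))
```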

Given the estimated probability distribution {p1, p2, p3, …} of the different outcomes of a random variable, what is the minimum number of bits required to specify the state of the system? We may return to the weather-report example. If the 20 different weathers have different probabilities, then the self-information of the more frequent weathers would be lower, while the self-information of very improbable weathers would be considerably higher. Clearly, we still require 5 bits to distinguish fully between all possible outcomes with a fixed-length code, but what about the average number of bits required? If you encode each possible outcome with a bit-string whose length matches the outcome’s self-information, then this average, p1*log2(1/p1) + p2*log2(1/p2) + …, equals the expected value of self-information – and this is, more specifically, what is meant by “Shannon entropy”.
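
A sketch with a made-up distribution over the 20 weathers (the probabilities are invented for illustration) shows how the average drops below the 5-bit fixed-length code:

```python
from math import log2

def shannon_entropy(probs) -> float:
    """Expected self-information: the sum of p * log2(1/p), in bits."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

# An invented distribution over 20 weather types: a few common, many rare.
weather_probs = [0.30, 0.20, 0.15, 0.10] + [0.25 / 16] * 16

print(round(log2(20), 2))                        # ≈ 4.32 bits if all were equally likely
print(round(shannon_entropy(weather_probs), 2))  # ≈ 3.23 bits on average for this distribution
```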

Shannon, working at Bell Labs, the research arm of the US telephone giant AT&T, observed that if you let less frequent messages be encoded in expensive ways (more bits) and more frequent messages in cheap ways (shorter strings) – in other words, let code length track the improbability of occurrence – you may compress a message in a distortion-free (“lossless”) way that would save the corporation enormous amounts of money. Since the letter “e” is the most common in English, it is efficient in terms of channel capacity to represent it with a shorter string. Morse code, and languages in general, are built on the same principle: the word “the” is short in English owing to its high probability of occurring.
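
Huffman coding, devised a few years after Shannon’s paper, is the textbook realization of this principle; the sketch below builds such a code for an arbitrary sample sentence.

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Build a prefix code in which code length tracks symbol improbability."""
    # Each heap entry: [combined count, tie-breaking id, {symbol: code-so-far}].
    heap = [[count, i, {symbol: ""}]
            for i, (symbol, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)        # the two least frequent subtrees...
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])   # ...are merged
        next_id += 1
    return heap[0][2]

code = huffman_code("this is an example of a huffman tree")
for symbol in sorted(code, key=lambda s: len(code[s])):
    print(repr(symbol), code[symbol])   # frequent symbols get the shortest strings
```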

Kolmogorov complexity defines the complexity of a sequence as the size of the shortest computer program that outputs it. A highly ordered system, like an origami figure, can similarly be compressed into its folding instructions.

In effect, entropy defines the boundary of the best lossless “compression” possible. As we have seen, for a uniform probability distribution, the entropy (the minimum number of bits) equals the logarithm of the number of outcomes we need to distinguish between. This makes an equiprobable random variable characteristically “incompressible”. However, a finite-length sequence, even though randomly generated, may be ordered. Consider, for example, a 1000-character string with the pattern “101010…”. Judged by its symbol frequencies alone (half 1’s, half 0’s), its Shannon entropy is maximal, but intuitively it feels very compressible. To provide for this, there is a measure known as “Kolmogorov complexity”, which is independent of any probability model and defined as the length of the shortest computer program that outputs the sequence. Writing a loop that prints out “10” 500 times is shorter than the actual string. A maximally complex string is thus one whose shortest representation is its own complete, explicit printout.
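
The intuition behind the 1000-character example can be made literal by comparing the string with a program that prints it (a Python sketch; the exact character counts depend on the language chosen):

```python
# The pattern written out explicitly:
explicit = "10" * 500
print(len(explicit))                 # 1000 characters

# A "program" that regenerates it, held here as a string of Python source:
program = 'print("10" * 500)'
print(len(program))                  # 17 characters -- far shorter than its output

# A maximally complex string is one for which no such shortcut exists: its
# shortest description is essentially the explicit printout itself.
```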

Other compressions are lossy. For example, computer images may be stored as JPEG files, which use a scheme involving “chroma subsampling” and “transform coding”. These exploit the fact that human eyes are less sensitive to variation in color than in luminance, so color detail can be almost imperceptibly reduced by averaging out blocks of color and allocating more bandwidth to the brightness component. The method causes a loss of information, since the decoded output is not identical to the input – a loss sometimes visible as distortions.
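
A toy version of the chroma-subsampling step (a minimal NumPy sketch on an invented 8×8 chroma plane, ignoring the rest of the JPEG pipeline) shows both the saving and the loss:

```python
import numpy as np

def subsample_chroma(chroma: np.ndarray) -> np.ndarray:
    """Average a chroma channel over 2x2 blocks (4:2:0-style subsampling)."""
    h, w = chroma.shape
    blocks = chroma[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
chroma = rng.uniform(0, 255, size=(8, 8))          # an invented 8x8 chroma plane

small = subsample_chroma(chroma)                   # 4x4: a quarter of the data
restored = np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)   # naive upsampling

print(float(np.abs(restored - chroma).mean()))     # nonzero: the loss is real,
                                                   # though often imperceptible
```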

To be compressible is to contain “information redundancies” that compression serves to reduce. Redundancy does not usually imply waste. It is used to counterbalance noise and equivocation (the disappearance of data) and so ensure fidelity in transmission. The genetic code has redundancies in how, for example, several codons code for the same amino acid, which prevents many errors in DNA from translating into deformed proteins. Spoken and written English has redundancies in how we can communicate even in a nightclub with loud music, and in how the “k” in “ck” adds so little information that the pair could be replaced by a single letter without any loss of meaning.
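
The simplest possible engineered redundancy – sending every bit three times and taking a majority vote, far cruder than either the genetic code or English spelling – already shows how redundancy buys fidelity on a noisy channel:

```python
import random

def add_redundancy(bits: str) -> str:
    """Transmit every bit three times: deliberate redundancy, not waste."""
    return "".join(b * 3 for b in bits)

def noisy_channel(bits: str, flip_prob: float = 0.1) -> str:
    """Flip each transmitted bit with some probability (the noise)."""
    return "".join(b if random.random() > flip_prob else "10"[int(b)] for b in bits)

def decode(received: str) -> str:
    """A majority vote over each triple repairs most corrupted bits."""
    triples = (received[i:i + 3] for i in range(0, len(received), 3))
    return "".join("1" if t.count("1") >= 2 else "0" for t in triples)

random.seed(7)
message = "10110011" * 8
plain_errors = sum(a != b for a, b in zip(message, noisy_channel(message)))
coded = decode(noisy_channel(add_redundancy(message)))
coded_errors = sum(a != b for a, b in zip(message, coded))
print(plain_errors, coded_errors)   # the redundant transmission typically ends up with fewer errors
```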

Natural language has information redundancies that act like padding to prevent the signal from deteriorating in noisy conditions.

Languages belong to a class of systems that are called “ergodic”. This means that their statistical properties are uniform throughout the process (“e” is the most common letter in English almost regardless of which sample you take, provided it is long enough). However, different languages vary in their entropy, and that of English is comparatively low. This reduces the space that a cryptographer needs to explore in order to decipher an encrypted message. For this reason, the only cryptographic protocol known to be unbreakable in principle is the so-called “one-time pad”, in which the secret key must have maximal entropy – the next digit should be completely unpredictable given the previous ones – must be as long as the message itself, and must only be used once. As a result, the one-time pad is of limited practical use, because any secure method for distributing the key could itself have been used to share the message!
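
The mechanics are almost trivially simple (a Python sketch using XOR and the standard secrets module; the message is an arbitrary example) – the entire difficulty lies in distributing and protecting the key:

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Combine message and key bytewise; applying the same key again undoes it."""
    return bytes(d ^ k for d, k in zip(data, key))

message = b"MEET AT DAWN"
key = secrets.token_bytes(len(message))   # maximal-entropy key, as long as the message

ciphertext = xor_bytes(message, key)      # without the key, every plaintext of this
                                          # length is equally consistent with it
print(xor_bytes(ciphertext, key))         # b'MEET AT DAWN': the key decrypts it
```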

Just as languages reserve longer codes for infrequent messages, our nervous system reserves expensive processes for infrequent sensations. Surprise is expensive to mediate, so we would rather save firing for improbable events, letting “non-firing” map by default onto the probable state of affairs. As we habituate to stimuli, we fire less. To prevent over-firing, we are numbed to disorderly systems and experience them as dull and uninteresting. Entropy is a useful concept precisely because it links microstate properties to input categories of the beholder. Somehow our perceptual system is attuned to these pockets of virtual, hologram-like phase spaces projected onto external dynamics.

Our numbness to randomly behaving, time-varying quantities is perhaps most obvious in the case of sound. What physicists mean by “noise” is not statistical error, but sound sequences that sound the same at any playing speed. Human voices clearly do not, since speeding them up makes them sound like Donald Duck. Mathematically, such a process is said to be scale-free or fractal. This is seen in its power spectrum, which describes how the average behavior (e.g. the variation in loudness) is distributed across frequencies: for noise it is proportional to an inverse power of the frequency, written 1/f^a. If you look at the high-frequency part of the graph, it has the same statistical properties as the low-frequency part.

Random sound sequences are a specific type of noise known as white noise, for which a=0. Different a-values mean different amounts of correlation. White noise is what you get if you let a random number generator select every note: it is equally hissing and featureless at all frequencies – the signal is serially uncorrelated – and at low intensities it tends to have a calming effect on a human listener. The universality of musical aesthetics appears to be partly explained by our having evolved optimal sensitivity to sound sequences with particular power-spectral properties, offering just the right amounts of surprise and confirmation. Analyses of classical Western music indicate that these lie between a=1 (“pink noise”) and a=2 (“brown noise”), reflecting preferences in correlations – ultimately the grain and extent of the boundaries of the human sound-processing system (the potential reasons for this preference are fascinating and will be explored later).
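
The different noise “colors” can be manufactured by shaping white noise in the frequency domain and then checking that the measured spectral slope comes out near the intended a (a NumPy sketch; the sample length and seed are arbitrary):

```python
import numpy as np

def colored_noise(n: int, a: float, rng) -> np.ndarray:
    """Shape white noise so that its power spectrum falls off roughly as 1/f**a."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]                    # avoid dividing by zero at f = 0
    spectrum *= freqs ** (-a / 2)          # power goes as the amplitude squared
    return np.fft.irfft(spectrum, n)

rng = np.random.default_rng(0)
for a, name in [(0, "white"), (1, "pink"), (2, "brown")]:
    x = colored_noise(2 ** 14, a, rng)
    # Estimate the slope of log-power versus log-frequency; it should be near -a.
    f = np.fft.rfftfreq(len(x))[1:]
    p = np.abs(np.fft.rfft(x))[1:] ** 2
    slope = np.polyfit(np.log(f), np.log(p), 1)[0]
    print(name, round(-slope, 1))          # roughly 0, 1 and 2, respectively
```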

Because order itself is improbable, it has high self-information in the context of a generally chaotic dynamic. We don’t expect to find order in nature. To reduce information-processing load, an umwelt is under pressure to sensitize the organism to low-probability events, both so that it can take pre-emptive action and because information redundancies are opportunities for re-designing reality. Our fixation with order in a Universe that otherwise succumbs to the 2nd law of thermodynamics previously led “vitalists” to attribute it to mystical forces and divine assistance. Life, a marvel of order, seemed like a miraculously local counter-current in an aging, self-undermining sea of erosion and decay, where energy spreads and evens out into a lukewarm soup of uselessness.

The loophole in the 2nd law that life exploits to maintain non-equilibrium is that order can arise at the cost of more disorder elsewhere. For example, minutes after the Big Bang, the Universe was filled with gas of staggeringly low entropy. Because of its high density, every region of gas pulled on every other, causing orderly clumps to form while heat increased (increasing entropy). The heat allowed nuclear processes to arise, such as the stellar fusion that forges carbon, which, under the orderly sunlight, plants can use to produce well-ordered organic compounds. Animals digest these compounds to maintain internal order, powering semipermeable membranes that hinder free diffusion with energy-consuming pumps and filters, thereby manipulating concentrations of molecules and catalyzing reactions that would normally be too improbable to occur. By feeding on negative entropy from its supportive environmental conditions, an organism consumes their potential to do mechanical work, at the cost of dissipating heat and increasing entropy in the larger thermodynamic context.