The title of this post is that of a book by Jan von Plato, on the history of probability theory from the start of the twentieth century to the 1930s, starting with Borel and Einstein (but with many backward looks) and continuing through the advent of quantum mechanics to discuss in detail the contributions of von Mises, Kolmogorov, and de Finetti. It was published in 1994 by Cambridge University Press; I bought it in 1999 (I don’t remember where, quite likely at a reduced price at a conference somewhere) but never got around to reading it. Now, with downsizing on my mind, I dug it out. (I have a huge pile of elementary textbooks on probability and statistics sent to me by publishers, and this book had got amongst them somehow.) So I am reading it now, in the cracks of all the other things I am doing.

It is not an easy read. Von Plato is not a stylist; his sentences are long and a bit turgid, and the mathematics is sometimes a bit suspect. But the basic information is very interesting, about things that I didn’t know as well as I should have. I will discuss here a few highlights. (I must add: Von Plato has read the primary sources; I have not. So I will just accept what he says.)

**Borel**, who introduced Borel measure and encouraged **Lebesgue** to extend it, is one of the great pioneers of probability theory. But he was ambivalent and, according to the book, a bit muddled too. As a constructivist, he was reluctant to admit the existence of real numbers unless there is a rule for calculating their digits. This leads to the trap that the set of numbers one is allowed to consider is countable. But then there cannot be a countably additive measure which takes the value 0 on each single number and 1 on the whole unit interval! On the other hand, he was happy with what he described as the *geometric probability* of subsets of the unit interval, that is, their Lebesgue measure. Great mathematician as he was, he got his part of the Borel–Cantelli lemma right, though it is a subtle alternation of quantifiers; and he proved the Strong Law of Large Numbers correctly apart from a slight technical problem. (He added up normal approximations to the relevant probabilities without taking care about the errors.) He showed that almost all numbers are normal in any base, but as a constructivist he worried that he could not construct one. (Later **Sierpiński** filled this gap.)
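Normality is easy to probe empirically even if constructing a provably normal number is hard. The sketch below (my illustration, not from the book) tallies the first 10,000 decimal digits of √2, a number widely believed, but still not proven, to be normal in base 10; for a normal number each digit frequency tends to 1/10.

```python
from collections import Counter
from decimal import Decimal, getcontext

# Compute the first 10,000 significant decimal digits of sqrt(2),
# conjectured (not proven) to be normal in base 10, and tally
# how often each digit 0..9 appears.
getcontext().prec = 10_000
digits = str(Decimal(2).sqrt()).replace(".", "")[:10_000]

counts = Counter(digits)
for d in sorted(counts):
    print(d, counts[d] / len(digits))   # each frequency close to 0.1
```

Of course a finite digit count proves nothing; it merely fails to refute the conjecture, which is exactly the constructivist's complaint.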

Incidentally, it was **Hardy** and **Littlewood** who introduced the terminology “almost all” for “all except a null set”.

In statistical mechanics, **Maxwell** came up with the notion of an ensemble: to work out probabilities of sets of microscopic states for a system with specified macroscopic parameters such as temperature and pressure, Maxwell considered a very large set of systems with the same macroscopic parameters, and calculated what fraction of them are in the set of interest. But **Boltzmann** realised that what was needed was an *ergodic theorem* stating that the probability that the system is in a given set of microstates is equal to the limiting proportion of the time the system spends in the set. This allows the large ensemble of systems to be replaced by a single system which we watch for a long time. Indeed, **Einstein** regarded the time averages, or *statistical probabilities* as he called them, as more important than probabilities derived from equipartition assumptions. It is clear with hindsight that these time averages dovetail well with the frequentist interpretation of probability.
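The time-average/ensemble-average equivalence is easy to see in a toy setting. Here is a sketch (my own illustration; the transition probabilities are arbitrary choices) using a two-state Markov chain, where one long trajectory and a large ensemble of independent copies agree on the proportion of time spent in a state.

```python
import random

random.seed(1)

# A two-state Markov chain (states 0 and 1): from state 0 we jump to 1
# with probability 0.25, from state 1 back to 0 with probability 0.5,
# so the stationary distribution is (2/3, 1/3).  Ergodicity says the
# long-run fraction of time ONE trajectory spends in state 1 agrees
# with the fraction of a large ensemble of independent copies found
# in state 1 at some late time.
LEAVE = {0: 0.25, 1: 0.5}   # probability of leaving each state

def step(s):
    return 1 - s if random.random() < LEAVE[s] else s

def run(s, n):
    for _ in range(n):
        s = step(s)
    return s

# Time average along a single long trajectory.
s, visits, N = 0, 0, 100_000
for _ in range(N):
    s = step(s)
    visits += s
time_avg = visits / N

# Ensemble average over many independent copies after a short burn-in.
M = 10_000
ensemble_avg = sum(run(0, 50) for _ in range(M)) / M

print(time_avg, ensemble_avg)   # both close to 1/3
```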

About Maxwell, von Plato says:

Maxwell’s talk of a large number of systems, instead of a continuous one, was a characteristic way of expression of a physicist who tends to see the infinite as an approximation to the sufficiently large finite. (A mathematician might think exactly the other way around.)

Incidentally, von Plato claims that it is widely thought that Boltzmann used the assumption that the unique trajectory of the system passes through every point in phase space to justify ergodicity; but that actually this is based on a misunderstanding in a review of Boltzmann’s work by Paul and Tatiana Ehrenfest. This misunderstanding led to claims that ergodicity was impossible, since space-filling trajectories were impossible on physical grounds. Of course, from a modern perspective, what is required is that *almost all* trajectories are *dense* in phase space, a weakening at two points of the Ehrenfests’ claim. There seems no doubt that the liberation of the ergodic hypothesis from details of the underlying dynamics was important in the development of probability theory.

The last three chapters of the book deal in detail with the contributions of von Mises, Kolmogorov, and de Finetti. Most of this was new to me.

The approach of **von Mises** was to attempt to model the frequentist interpretation of probability by defining a “random” sequence. He requires that the relative frequency of any set of values among the first *n* terms tends to a limit as *n* → ∞; and that the same is true for any subsequence “selected” in a certain way. Von Plato gives the impression that von Mises is not terribly clear about how this selection is done. If a particular value occurs infinitely often but not with density 1, then choosing just the subsequence where that value occurs would violate the condition, so this is not allowed. A subsequence where the choice of a term depends only on its index and on values of earlier terms would be OK. (We have to stop a gambler improving the odds by using information currently available.)
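The distinction between admissible and inadmissible selection rules can be simulated. In this sketch (my illustration; the success probability 0.3 is an arbitrary choice) an admissible rule, which looks only at earlier terms, leaves the limiting frequency unchanged, while the forbidden rule "keep the terms that equal 1" obviously does not.

```python
import random

random.seed(0)

# An i.i.d. 0/1 sequence with success probability 0.3.  An admissible
# von Mises selection rule may use only the index and the values of
# EARLIER terms; here we keep a term exactly when the previous term
# was a 1.  The relative frequency in the selected subsequence still
# tends to 0.3.  Selecting on the term's OWN value gives frequency 1.
p = 0.3
xs = [int(random.random() < p) for _ in range(200_000)]

admissible = [xs[i] for i in range(1, len(xs)) if xs[i - 1] == 1]
forbidden = [x for x in xs if x == 1]

print(sum(admissible) / len(admissible))  # close to 0.3
print(sum(forbidden) / len(forbidden))    # exactly 1.0
```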

This approach led eventually to Kolmogorov’s definition of randomness for a finite sequence: a sequence is random if it cannot be generated (by a Turing machine) with a program substantially shorter than the length of the sequence. There are some difficulties in extending this to infinite sequences; these were overcome by Martin-Löf. But it is fair to say that this approach (which appeared just before Kolmogorov’s own) has not been very influential.
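Kolmogorov complexity itself is uncomputable, but compressed length gives a crude upper bound in the same spirit: a highly structured string has a short description, while random data does not. A minimal sketch, using off-the-shelf `zlib` compression as the stand-in for "shortest program":

```python
import os
import zlib

# A very regular 100,000-byte string versus 100,000 random bytes.
# The compressor finds the short description of the first; the
# second is essentially incompressible.
structured = b"01" * 50_000
random_bytes = os.urandom(100_000)

print(len(zlib.compress(structured)))    # a few hundred bytes
print(len(zlib.compress(random_bytes)))  # nearly 100,000 bytes
```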

**Kolmogorov** published his approach to probability in the book *Grundbegriffe der Wahrscheinlichkeitsrechnung* in 1933. I thought that Kolmogorov defined probability as a countably additive measure on a σ-field of subsets of a set (i.e. closed under countable unions and intersections and complement) with total measure 1. The truth is more subtle. Kolmogorov was an intuitionist (he had already published a paper translating classical mathematics into intuitionistic mathematics by replacing statements by their double negations), and regarded the infinite as an “ideal” object for obtaining information about the finite, without having any physical reality itself. His axioms only require a field of subsets (closed under finite unions and intersections and complement) and a finitely additive measure. Later he adds an axiom of continuity, asserting that if a descending sequence of sets *A_{i}* has empty intersection, then **P**(*A_{i}*) → 0. This is equivalent to countable additivity. It is then natural to close the field under countable unions and intersections, regarded as “ideal” events.
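The equivalence is worth spelling out; here is a sketch of the direction from continuity to countable additivity, assuming only finite additivity on a field.

```latex
% Continuity implies countable additivity (sketch).
% Let A = \bigcup_{n \ge 1} B_n with the B_n pairwise disjoint, all
% lying in the field.  Put A_i = A \setminus (B_1 \cup \dots \cup B_i),
% which is again in the field; the A_i decrease to the empty set, so
% the continuity axiom gives P(A_i) \to 0.  Finite additivity yields
\[
  \mathbf{P}(A) \;=\; \sum_{n=1}^{i} \mathbf{P}(B_n) + \mathbf{P}(A_i)
  \;\xrightarrow[\;i \to \infty\;]{}\; \sum_{n=1}^{\infty} \mathbf{P}(B_n).
\]
```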

There is far more in Kolmogorov’s book than just the axioms: the main contributions are a definition of conditional probability where the conditioning is on an event with probability 0, a detailed discussion of continuous-time stochastic processes, and a chapter on zero-one laws.

A not-unrelated fact: in 1940 he showed how some data “obtained and misinterpreted by a student of Lysenko” were actually a clear confirmation of Mendel’s laws. This paper was censored from his list of published works in 1953.

And so to the last founding father, **de Finetti**. Apparently he is known as the champion of “subjective probability” and Bayesianism, though I knew him for a more technical achievement: he originated the concept of exchangeability and the representation theorem for exchangeable trials. Somehow these things are reconciled, though I don’t really understand how.

Exchangeability means that the distribution of the number of successes in a sequence of trials is independent of the order. An example would be: choose one of a number of urns with different proportions of black and white balls according to some probability distribution; then sample with replacement from the chosen urn. De Finetti’s representation shows that you can recover the probability distributions of the urns and the proportions of black balls in them from the probability distributions of the number of black balls selected in the overall experiment. Somehow this shows that the “objective” data is unnecessary and can be deduced.
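The urn example can be simulated directly. In this sketch (my illustration; the two urn compositions 0.2 and 0.8 are arbitrary choices) the sequence of draws is exchangeable: every reordering of a given outcome occurs with the same probability.

```python
import random
from collections import Counter

random.seed(2)

# De Finetti's urn example: pick an urn uniformly at random, then
# sample three balls with replacement (1 = black, 0 = white).  The
# resulting sequence is exchangeable, so the three orderings of
# "one black, two white" should all be equally likely: each has
# probability (1/2)(0.2*0.8^2) + (1/2)(0.8*0.2^2) = 0.08.
urns = [0.2, 0.8]   # probability of black in each urn

def trial(n=3):
    p = random.choice(urns)
    return tuple(int(random.random() < p) for _ in range(n))

T = 300_000
counts = Counter(trial() for _ in range(T))
for pattern in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
    print(pattern, counts[pattern] / T)   # each close to 0.08
```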

I reported in an earlier post how I spent a long time without success looking for a measure on countable graphs which is concentrated on Henson’s universal homogeneous triangle-free graph but which is independent of the ordering of the vertices. Such a measure is exchangeable, and this (and a representation theorem) feature in the construction by Petrov and Vershik.

My own view, for what it is worth, is this. Saying that the Bayesian viewpoint entails accepting subjective probability involves a similar fallacy to saying that the law of cause and effect entails accepting a First Cause. Bayes’ Theorem simply updates probabilities in the light of new information.
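The mechanical nature of the update is the point. A minimal sketch (my illustration; the prior and the urn compositions are arbitrary choices, echoing the urn example above): whatever prior you start from, observing a black ball reweights it by the likelihoods, nothing more.

```python
# Bayes' theorem as a mechanical update: start from ANY prior over
# two urns and condition on the evidence "a black ball was drawn".
prior = {"urn1": 0.5, "urn2": 0.5}       # illustrative prior
likelihood = {"urn1": 0.2, "urn2": 0.8}  # P(black | urn)

evidence = sum(prior[u] * likelihood[u] for u in prior)
posterior = {u: prior[u] * likelihood[u] / evidence for u in prior}
print(posterior)   # {'urn1': 0.2, 'urn2': 0.8}
```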

The book concludes with a short chapter outlining the remarkable insights of one of my real heroes, Nicole **Oresme**, in the fourteenth century. Oresme realised that, in modern terminology,

- almost all numbers in the unit interval are irrational;
- integer multiples of an irrational mod 1 are dense in the unit interval.

These supported his argument against astrology. The first assertion says that exact planetary conjunctions are extremely improbable; the second, that approximate planetary conjunctions are very common.
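Oresme's second observation is easy to watch happening. A small sketch (my illustration, with √2 as the irrational and a grid of 100 subintervals): the multiples *n*√2 mod 1 quickly visit every subinterval of width 0.01.

```python
import math

# Multiples of an irrational, reduced mod 1, are dense in [0, 1):
# count how many multiples of sqrt(2) are needed before every one of
# the 100 subintervals [k/100, (k+1)/100) has been visited.
alpha = math.sqrt(2)
hit = set()
n = 0
while len(hit) < 100:
    n += 1
    hit.add(int((n * alpha) % 1 * 100))
print(n, "multiples suffice to land in all 100 subintervals")
```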

Oresme did much more; for example, building on the work of the Merton Calculators, he discovered the formula for uniformly accelerated motion, two centuries before Galileo.

I will end with a paradox that puzzled me for some time. To define a probability measure on a space, all you need are the notions of “null set” and “sequence of independent trials”: the numerical probabilities follow from this non-numerical data. For let *A* be an event, and let *A_{n}* be independent copies of *A* for natural numbers *n*. Then there is a unique number *p* such that, if *I_{A}* is the indicator function of *A*, the event

{*s* ∈ Ω* : (*I_{A}*(*s*_{1}) + … + *I_{A}*(*s*_{n}))/*n* does not tend to *p* as *n* → ∞}

is null. This value *p* is the probability of *A*. (The strong law of large numbers says that the relative frequency of *A* in independent trials tends to the probability of *A* almost surely. Here Ω* is the set of infinite sequences of elements of Ω, with the product structure.)
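The recovery of *p* from the trials alone can be watched numerically. A sketch (my illustration; the value *p* = 1/3 is an arbitrary choice): the relative frequency of occurrences of *A* in independent trials converges to *p*, because the sequences where it does not form a null set.

```python
import random

random.seed(3)

# Independent copies of an event A with fixed probability p = 1/3.
# The relative frequency recovers the numerical value of p: the set
# of sequences where it fails to converge to 1/3 is null.
p = 1 / 3
N = 1_000_000
hits = sum(random.random() < p for _ in range(N))
print(hits / N)   # close to 0.3333
```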

Somehow the numerical probabilities have been slipped in, in the process of constructing the null sets on the space of infinite sequences.

What this in fact shows is that the null sets on the sequence space are rather more subtle than I thought!

The alternative theory of large and small, Baire category, does not share this “feature”.

So interesting.

I don’t understand your paradox example. You make reference to the independent trials A_n, but then the event you refer to makes no reference to A_n. Also, if I understand the point correctly, if \Omega is {H, T} and A = {H} then presumably p is 1/2, but isn’t it perfectly possible to define a probability measure on \Omega such that P({A}) takes on any value between 0 and 1?

What we are looking at is the proportion of occurrences of A in infinitely many independent trials. Maybe I didn’t say it very well.

The paradox is that, if \Omega={H,T} and A={H}, then p could be anything, say 1/3. Then the event that the proportion of occurrences of A either fails to converge or converges to anything other than 1/3 is a null set. So the null sets are not just determined by the structure of the set of sequences, but depend on numbers fed in at the start.

This contrasts with Baire category where the meagre sets are determined independently of any structure on \Omega (if it is discrete).