## Entropy, partitions and groups

Terence Chan, from the University of South Australia in Adelaide, visited London for our recent workshop on “Information Flows and Information Bottlenecks”. I want to give an exposition here of a theorem from his talk, connecting entropy with groups. He stated it without proof in one of his talks, and showed us a proof on the (writeable) coffee tables in the common room.

The problem is to describe the set of points in real space of dimension 2r−1 given by the entropy of r random variables and their interactions on a finite probability space. The theorem asserts that, at least projectively, such points can be approximated arbitrarily closely by points arising from families of subgroups in finite groups.

These are notes from the first of three talks this term in the Combinatorics Study Group. Following this one, Rosemary Bailey will talk about what happens if you require the partitions to form a lattice rather than just a semilattice, and make connections with association schemes; then Søren Riis will say something, I don’t know what yet.

#### Probability and entropy

A finite probability space consists of a finite set Ω with a function P from the subsets of Ω to the real numbers satisfying a (simplified version of) Kolmogorov’s axioms:

• P(A) ≥ 0 for any subset A;
• P(Ω) = 1;
• If AB = ∅ then P(AB) = P(A)+P(B).

Of course it is easy to see that it can be specified by defining P on elements of Ω (these values being non-negative and summing to 1) and then letting P(A) be the sum of the values of P over the elements of A. (We blur the distinction between the probability of an element of Ω and of the corresponding singleton subset.)

It is worth pointing out that, while the origin of probability theory is usually credited to Pascal, it was Fermat who proposed this model; see Keith Devlin’s book The Unfinished Game.

A probability space Ω is uniform if P(a) is constant for a ∈ Ω (the constant, of course, being 1/|Ω|).

Voltaire said of the Holy Roman Empire that it is “neither holy, nor Roman, nor an empire”; in a similar way, a random variable is neither random nor a variable; it is a function on the set Ω. In much of probability theory, the codomain of a random variable has either an order (so that we can talk about the median and other quantiles) or an arithmetic structure (so that we can talk about mean and variance), or often both, typically the real numbers. But in what follows, we are not interested in the codomain of the function.

The identity function on Ω is a random variable, corresponding to “picking a random element of Ω”.

A random variable F defines a partition π = {A1,…,Ak} of Ω into inverse images of points in the range of F. Let pi = P(Ai), and define

h(p1,…,pk) = −∑i pi log pi.

The base of the logarithms is not too important; it is usually taken to be either e or 2, or in coding theory to be the size of the alphabet. Then the entropy of F is defined to be H(F) = h(p1,…,pk).

Entropy is a measure of our uncertainty about the value of F. In particular, if F takes k different values, then H(F) ≤ log k, with equality if and only if p1 = … = pk = 1/k. (We call a random variable satisfying this condition uniform.)

#### Entropy of several random variables

Now let F1,…,Fr be random variables on Ω suppose that the codomain of Fi is Si. For I ⊆ {1,…,r}, we define the joint random variable FI to be the function from Ω to ∏iI Si defined coordinatewise, so that the ith coordinate of FI(a) is Fi(a) for iI.

Partitions of a set are ordered by refinement; they form a lattice, in which the meet of two partitions is their coarsest common refinement. Now the partition corresponding to FI is easily seen to be the meet of the partitions corresponding to Fi for iI. So, given a family of random variables, the partitions of the probability space corresponding to the joint random variables form the meet-semilattice generated by the partitions for the original variables.

Now the entropy of the joint random variables defines a point in the positive
orthant of R2r, with one coordinate for each subset of {1,…,r}. Since the entropy of F is 0, we can omit the first coordinate, and regard the point as lying in R2r−1. Let Γr be the set of all such points (arising from all r-tuples of random variables on finite probability spaces). We will also consider a “projective” version PΓr, where we represent a point by the line through the origin (a point in projective space of dimension 2r−2).

A big problem of information theory is:

Problem Describe the set Γr.

#### Groups

Now let G be a finite group; we regard it as a uniform probability space. (This corresponds to the instruction “choose an element of G uniformly at random”.) If H is any subgroup of G, then H defines a random variable FH which maps each element gG to the right coset Hg of H which contains it. The proof of Lagrange’s Theorem shows that all cosets have the same number of elements; so this is a uniform random variable, and its entropy is log |G:H|, where |G:H| is the index of H in G (the number of cosets).

If H1,…,Hr are subgroups of G, then the joint random variable of the family {FHi:iI} is just FK, where K is the subgroup ∩iI Hi.

We will say that a family of random variables is a group family if it
is defined in this way by a family of subgroups of a finite group.

The theorem states

Theorem The projective point corresponding to any point of Γr can be approximated arbitrarily closely by a group family of random variables.

Corollary Any linear inequality satisfied by the entropy functions of all group families of random variables is satisfied by all entropy functions.

#### Proof of the theorem

First reduction: We may assume that all the probabilities are rational. Simply approximate the real probabilities of elements of the space arbitrarily closely by rationals; it is easy to see that the corresponding entropy function doesn’t change by very much.

Second reduction: We may assume that the probability space is uniform. For suppose that n is the least common multiple of the denominators of all the probabilities. Then replace a point a with P(a) = k/n by k points, each with probability 1/n.

Note for later that we could replace n in this argument by any multiple.

Now each random variable is specified precisely by the corresponding partition of Ω.

Recall that, if n1,…ns are positive integers with sum n, then the multinomial coefficient {n choose n1,…,ns)} is equal to n!/(n1!…ns!). It is the number of partitions of the set {1,…,n} into parts of sizes n1,…,ns. The subgroup of the symmetric group Sn consisting of permutations fixing each part of such a partition is a Young subgroup corresponding to the above partition λ of n. Since the symmetric group acts transitively on partitions of shape λ, the index of a Young subgroup is the corresponding multinomial coefficient.

Note that the intersection of a collection of Young subgroups is the Young subgroup corresponding to the meet of the associated partitions.

We need a rather crude approximation to the multinomial coefficients.

Lemma Let p1,…,ps be positive rational numbers summing to 1. Then

log {n choose p1n,…,psn} ∼ nh(p1,…,ps)

as n→∞.

Warning: the symbol ∼ here is used in the asymptotic sense, where f ∼ g if f/g tends to 1. Do not confuse this with the sense in probability theory, where it means that two random variables have the same distribution!

The interpretation is that n runs through values for which all pin are integers for i =1,…,s.

The proof is straightforward from a weak form of Stirling’s formula:
log n! = n log nn+O(log n).

Proof of the theorem We are given a family of r random variables on Ω, which we may assume (by our first two reductions) is uniform, and (by the remark after the second reduction) is arbitrarily large. A joint random variable associated with the family has certain probabilities p1,…ps for its values, and it corresponds to a partition with parts of sizes p1n,…psn.

We take our group G to be the symmetric group Sn, and the family of subgroups to be the Young subgroups corresponding to the partitions associated with the random variables in the family. By our remark connecting intersections of Young subgroups with meets of partitions above, the same holds for the joint random variables as well. Now the entropy of one of the group random variables is the log of the index of the corresponding Young subgroup, which is asymptotically n times the entropy of the random variable we are trying to approximate. So the theorem is proved.

To take a simple example, suppose that we have two random variables whose joint distribution is given in the table:

 1/2 1/4 0 1/4

We take n = 4k, and associate with the two random variables the Young subgroups fixing the partitions {{1,…3k},{3k+1,…4k}} (corresponding to the rows) and {{1,…2k},{2k+1,…4k}} (corresponding to the columns); the meet of these two is the Young subgroup fixing the partition {{1,…,2k},{2k+1,…3k},{3k+1,…4k}}. We have

• log {n choose (3/4)n,(1/4)n} ∼ nh(3/4,1/4),
• log {n choose (1/2)n,(1/2)n} ∼ nh(1/2,1/2),
• log {n choose (1/2)n,(1/4)n,(1/4)n} ∼ nh(1/2,1/4,1/4),

where the constants multiplying n on the right are the entropies of the original random variables and their joint distribution.

Problems:

• What families of groups other than symmetric groups have families of subgroups whose entropy functions can projectively approximate any entropy function?
• Given a family of groups, what are the restrictions on the entropy functions arising?

For example, Terence Chan pointed out that, for abelian groups, a linear inequality holds for the entropy functions which is not valid in general (the Ingleton inequality). But that’s another story! 