## Estimation and accuracy

I am back in London, desperately trying to throw off a bad cold in time to start lecturing to 305 students on Monday morning.

Various reading matter, chiefly Nature and Significance, provoked a few thoughts on the topic of estimation and accuracy.

Significance reports a survey on people’s perception of statistics. It seems that “most people” think that around 15% of under-16 girls in Britain fall pregnant each year. No, I don’t know what that means either, but it is only a news report, and even the Royal Statistical Society are not very careful in their reporting of news. But the dramatic point is that the actual percentage of under-16 girls who become pregnant is 0.6%, so whatever the first figure means, there is a huge disparity. This does raise interesting questions.

The report gives other examples too. “The public thinks” that 31% of the population of Britain is immigrants; the official figure is 13%, and even a generous allowance for illegal immigrants only brings this up to 15%.

I find this more understandable. I am sure that almost nobody who sees my descendants would think that they could be immigrants; and yet many of my students at Queen Mary, who are completely British in birth, passport, attitude, accent, and every other way, have brown skins and would no doubt be classified as “immigrant” by many people, especially in parts of the country where their skin colour is more unusual. Someone from the deep country coming to East London could get an inflated view of the proportion of immigrants. (I am often embarrassed when students come to me to get a passport application signed, and I have to explain that I can’t sign because I am not British.)

This reminds me of something Bernard Silverman said in his Gresham lecture the year before last. He was Chief Scientific Adviser at the Home Office, and as well as showing videos of bomb blasts, he told us that each immigrant to Britain creates 0.8 jobs, on average. (This is another badly-estimated figure; most people think the figure is close to either 0 or 1, depending on their political prejudice.) Now we can do a back-of-the-envelope calculation. 64% of families in Britain receive some kind of benefit, so it seems either that immigrants are much less likely to be “benefit-scroungers” than natives, or that, rather than “coming here taking our jobs”, they are actually creating jobs, some of which are taken by natives. Xenophobes can’t have it both ways!

Anyway, back to the story. The news article also mentions that people think that 24% of benefits are claimed fraudulently, whereas the true figure is 0.7%: another huge discrepancy which I can’t explain.

It is certainly true that people are not very good at estimating percentages. An article in the same issue describes asking people to give a 70% confidence interval for various figures which they do not know exactly. Even when the notion of a confidence interval is explained, people tend to be overconfident. This is illustrated with ten questions of the same type, of which the first is “I am 70% sure that Winston Churchill was born between the years — and —”. If you follow the instructions correctly, then (assuming you don’t know the exact answer to any question) you should typically get 7 questions correct. Most people get fewer than 7 correct, suggesting that they are overconfident (or perhaps that they have misunderstood the instructions).
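The arithmetic behind “you should typically get 7 correct” is just the binomial distribution. A small sketch (my own illustration, not from the article) shows that even a perfectly calibrated respondent scores below 7 about a third of the time:

```python
from math import comb

# If every 70% interval genuinely has 70% coverage, the number of
# correct answers out of 10 questions is Binomial(10, 0.7).
n, p = 10, 0.7
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pmf[k] for k in range(n + 1))  # expected number correct
p_below_7 = sum(pmf[:7])                      # P(fewer than 7 correct)

print(f"expected correct: {mean:.1f}")       # 7.0
print(f"P(fewer than 7):  {p_below_7:.2f}")  # 0.35
```

So a single score of 6 out of 10 is entirely consistent with good calibration; it is the fact that most people score below 7 that points to overconfidence.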

I am a mathematician, and for a mathematician the only satisfactory confidence level is 100%. I did indeed get more than 7 questions right. But the test made me uneasy. One question was about the average distance from the earth to the moon; I remember reading this in an encyclopaedia when I was a child, so I gave what I thought was a very small interval; the answer was given more accurately than the figure I remembered, but fell well within my interval. Another question concerned the GDP of the USA in 2011. I had no knowledge and no interest, so I gave a very wide interval, wide enough to catch the true value (I would have had very little confidence in a smaller one).

Indeed, the very next article reports evidence for the well-known fact that confidence is more important than accuracy for the popularity of media pundits (Nate Silver notwithstanding).

Could it be that all the ways I am atypical boil down in the end to the fact that I am a mathematician?

On the subject of accuracy, Nature had an article about modern machines for sequencing DNA. This is a task that has become enormously faster and cheaper in the last few years. But I was shocked to learn that, unlike the situation with computers, this increase in speed and decrease in cost has been accompanied by a substantial drop in accuracy.

The technique is that the DNA being sequenced is chopped up into a lot of tiny pieces, each piece is sequenced, and then the sequences are reassembled. The article describes it as like taking ten copies of “A Tale of Two Cities”, putting them through the shredder, and reassembling the novel from the resulting fragments. But there are several features that are difficult for this method to get right, including:

• Many genes have long repetitive sections. If the fragments are smaller than a repeat, it is hard to establish the number of repeats.
• Some types of DNA are hard to sequence, for example near the end of a chromosome, or stretches made up mainly of the bases C and G (for some unexplained reason).
• The human genome is diploid: we have two copies of each gene, and the two copies are not identical. Some plants have six or more copies. If you put ten different editions of “A Tale of Two Cities” through the shredder, what exactly are you trying to reconstruct?
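The repeat problem in the first bullet can be made concrete with a toy example (my own sketch, using made-up “genomes”): if the fragments are no longer than the repeated unit, two genomes with different numbers of repeats produce exactly the same collection of distinct fragments, so no reassembly method, however clever, can tell them apart.

```python
def kmers(genome, k):
    """Return the set of distinct length-k fragments of a string."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}

# Two hypothetical genomes differing only in the number of "GGA" repeats.
three_repeats = "ATC" + "GGA" * 3 + "TCA"
four_repeats  = "ATC" + "GGA" * 4 + "TCA"

# Fragments of length 3 (the length of the repeated unit): identical sets.
print(kmers(three_repeats, 3) == kmers(four_repeats, 3))  # True
```

With longer fragments spanning the whole repeated region the ambiguity disappears, which is why read length matters so much for assembly.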

I suspect that part of the problem is that, when a process like this is automated, you feed in the DNA and out comes the answer, and people have a tendency to think “It must be right because the machine says so.” For those of us who work with computers, normally very reliable if correctly programmed, it is salutary to learn of machines which are inherently inaccurate.