Data science and statistics

Words change their meanings. Once “biometrics” referred to the use of statistics in studying biological systems such as agriculture (as opposed to, say, “technometrics”): now only one item on the Google top ten refers to the International Biometric Society, so completely has a different meaning taken over.

This is an example where the change in meaning has been driven by technology. Now it seems that the rise of “data science” may be forcing a similar change on “statistics”.

These rather brief thoughts are provoked by reading Peter Diggle’s presidential address to the Royal Statistical Society (published in their journal, series A, volume 178(4)), entitled “Statistics: a data science for the 21st century”.

By way of introduction, Diggle remarks that the start of his career more or less coincided with the first statistical software, and some people at the time thought that the discipline of statistics was no longer necessary, since computers would do the work. (This is still an issue; there have been recent stories of established research institutes sacking their statisticians because the scientists all have Excel on their desktops.)

Is there a difference between statistics and data science? After a caution that Wikipedia definitions “may not be authoritative, but they are often illuminating”, Diggle cites Wikipedia for definitions of data science, information science, and statistics, and remarks that they show considerable overlap. So what can statistics offer? Among much else, he says:

Crucially, we can assert that uncertainty is ubiquitous and that probability is the correct way to deal with uncertainty. We understand the uncertainty on our data by building stochastic models, and in our conclusions by probabilistic inference. And on the principle that prevention is better than cure we also minimize uncertainty by the application of the design principles that Fisher laid down 80 years ago, and by using efficient methods of estimation.

What is striking is that, in contrast to my (and his) opening comments, he urges statisticians to embrace software:

Principally, we learn that a published article is no longer a complete solution to a practical problem. We need our solutions to be implemented in software, preferably open source so that others can not only use but also test and, if need be, improve our solutions. We also need to provide high quality documentation for the software. And in many cases we need to offer an accessible, bespoke user interface.

This ties in with many things I know a little bit about. But it raises another crucial issue. I have always felt (and my experience includes being on two RAE panels) that software is not fairly judged by the UK research assessment. Mathematics panels feel that writing code is not really of the same importance as proving a theorem, even though (for example) GAP has resulted in far more research than a typical mathematics paper in a good journal. I think that statistics is treated in something of the same way: statisticians are technicians rather than scientists making an essential contribution. I hope that Diggle’s words will be used as a weapon in the fight to put this right.

The penultimate section of Diggle’s paper addresses “statistics in context”, starting with John Nelder’s comment that mathematical statistics should really be called statistical mathematics. But it is much more than playing with words. Diggle refers approvingly to the former organisation of CSIRO in Australia (alas, no longer the case), where most statisticians had two offices, “one co-located with other statisticians; one co-located with scientists in another discipline … The result was a symbiotic relationship in which statisticians brought to our weekly meetings challenging problems from many disciplines and took back to those disciplines solutions informed by a very wide range of statistical expertise.” He continues:

I would like to see every research-led university in the UK create a statistics institute. Each statistician on the university’s staff would have a dual appointment, to the institute and to an appropriate second discipline, be it mathematics, computer science or any one of the natural, biomedical or social sciences.

Well, a version of the first sentence was argued for in the most recent International Review of UK mathematics. (The link is to my comments on it; EPSRC have moved it twice, and I don’t know where it resides now.) The review was commissioned by EPSRC, but a long and strongly-worded section on statistics was kicked into the long grass by EPSRC, along with various other things which they really didn’t want to hear (such as support for the current diversity of UK mathematics). So I won’t hold my breath waiting for this.

In a talk a couple of years ago, John Stufken presented a “bicycle-wheel” model of statistics, where the hub is the mathematical underpinnings, the tyre is the interaction with scientists of every description, and the spokes are the rare people who can operate in both modes.

A final thing is on my mind in view of the memorial day for Donald Preece next week. Donald was an advocate of “data sniffing”: running your eyes over the data, looking for anomalies. (There is more chance of correcting errors you catch in this way if you do it while the scientists are still around and the lab books haven’t been ditched.) How do you do data sniffing if your data is “big data” contained in a data file which may be gigabytes or even terabytes in size? Can computers be programmed to do “data sniffing”? There is a real question for the AI specialists!
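As a very crude illustration of what a programmed "sniff" might look like, here is a minimal sketch in Python. It is not Donald's method, nor anything from Diggle's paper. It flags entries in one column that are missing, that are not numbers, or that lie far from the median. The function name `sniff_column`, the made-up yield data, and the choice of the modified z-score with a cutoff of 3.5 are all my own assumptions for the sake of the example.

```python
import csv
import io
import statistics


def sniff_column(rows, column, threshold=3.5):
    """Flag entries in one column that are missing, non-numeric, or far from the median.

    `rows` is any iterable of dicts (for example a csv.DictReader).
    The threshold of 3.5 on the modified z-score is an assumed, conventional choice.
    """
    values = []   # (row_number, value) for entries that parse as numbers
    flagged = []  # (row_number, raw_text, reason)
    for row_number, row in enumerate(rows, start=1):
        raw = (row.get(column) or "").strip()
        if not raw:
            flagged.append((row_number, raw, "missing"))
            continue
        try:
            values.append((row_number, float(raw)))
        except ValueError:
            flagged.append((row_number, raw, "not a number"))
    if values:
        median = statistics.median(v for _, v in values)
        # Median absolute deviation: a spread measure that outliers cannot inflate much
        mad = statistics.median(abs(v - median) for _, v in values)
        for row_number, v in values:
            # Modified z-score; 0.6745 rescales the MAD to match a normal standard deviation
            if mad > 0 and 0.6745 * abs(v - median) / mad > threshold:
                flagged.append((row_number, str(v), "far from median"))
    return flagged


# Made-up data: one blank entry, one "n/a", and one misplaced decimal point
sample = "plot,yield\n1,4.1\n2,4.3\n3,\n4,41.0\n5,4.0\n6,n/a\n7,4.2\n"
flagged = sniff_column(csv.DictReader(io.StringIO(sample)), "yield")
for row_number, raw, reason in flagged:
    print(row_number, repr(raw), reason)
```

This keeps a whole column in memory, so it would not do as it stands for terabytes; one would presumably need to work in chunks or with approximate medians. And it only catches the anomalies somebody thought to code for, which is precisely what a trained eye does not need.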

But I see from Wikipedia that this brings me back to my starting point. The term “data sniffing” has a different meaning today; it is one of the things that the surveillance industry does to communications they intercept.

About Peter Cameron

I count all the things that need to be counted.

4 Responses to Data science and statistics

  1. Jon Awbrey says:

    So many issues so close to the heart, so little time, and where to start? So I’ll just jot a note here, to remind me.

    • Good, I’d like your comments. And to add one more topic which Peter Diggle discussed but I didn’t mention, there is the issue of reproducibility of computational results…

  2. Jon Awbrey says:

    To start with an issue long and close to my heart, that would be the nature of scientific inquiry, its relation to everyday reasoning, its present evolution and potential facilitation, its role in the sustainability of life and society and civilization and all that good stuff.

    Jumping to the place of technology, instrumental and informational, in that endeavor, designing tools to extend and facilitate a natural function, say vision or reason, requires good models of that function, models that capture the essence of the function over and above the accidents of its scattered implementations. That is where the job of the software designer enters into scientific inquiry.

  3. Jon Awbrey says:

    There is a lot that could be said here about the role of research and scholarship in sustaining civil society, but the upshot at this juncture in history is something like this:

    We have to come to terms with the fact that a small but disproportionately powerful sector of society gets what it desires — and what it desires knows no bounds — by painting and propagating a false picture of reality and getting the rest of society to act on that picture.
