Open research data

Another day, another big issue.

I’ve just been reading a draft concordat on open research data, produced by a committee representing UK research councils, funders, and universities. I won’t give you the whole thing, but here for context is the definition they adopt:

Research Data are quantitative information or qualitative statements collected by researchers in the course of their work by experimentation, observation, interview or other methods. Data may be raw or primary (e.g. direct from measurement or collection) or derived from primary data for subsequent analysis or interpretation (e.g. cleaned up or as an extract from a larger data set). The purpose of open research data is to provide the information necessary to support or validate a research project’s observations, findings or outputs. Data may include, for example, statistics, collections of digital images, sound recordings, transcripts of interviews, survey data and fieldwork observations with appropriate annotations.

And here are the principles:

  1. Open access to research data is an enabler of high quality research, a facilitator of innovation and safeguards good research practice.
  2. Good data management is fundamental to all stages of the research process and should be established at the outset.
  3. Data must be curated so that they are accessible, discoverable and useable.
  4. Open access to research data carries a significant cost, which should be respected by all parties.
  5. There are sound reasons why the openness of research data may need to be restricted but any restrictions must be justified and justifiable.
  6. The right of the creators of research data to reasonable first use is recognised.
  7. Use of others’ data should always conform to legal, ethical and regulatory frameworks including appropriate acknowledgement.
  8. Data supporting publications should be accessible by the publication date and should be in a citeable form.
  9. Support for the development of appropriate data skills is recognised as a responsibility for all stakeholders.
  10. Regular reviews of progress towards open access to research data should be undertaken.

I was invited to comment on this, presumably from the perspective of a pure mathematician.

My first reaction was that the definition appears to exclude the kind of data that pure mathematicians generate, such as the list of finite simple groups. (In fact, it is sufficiently woolly that it doesn’t actually exclude anything.)

My second was that we have an exemplary open data source in the Atlas of Finite Group Representations. This is a huge repository of data, including matrices and/or permutations for generators of large numbers of almost simple groups in many different representations. It is well-curated and useable in the sense of Principle 3: computer algebra systems such as Magma and GAP can directly import the appropriate generators from the site, in a way which is almost transparent to the user.

But our great good fortune in having this resource shouldn’t make us complacent: I am aware that there are many other sources of mathematical data which are not managed as well. We are lucky in having Rob Wilson and his team running this resource. So perhaps we do have something to learn from the principles above.

This is an extreme case. Most data sets that discrete mathematicians produce are likely to be sequences of integers, and the OEIS already provides a well-curated repository for these. But there are other cases. Lists of small Latin squares or Steiner triple systems involve huge amounts of data. The roughly 11 billion Steiner triple systems of order 19 have been compressed, by Patric Östergård and his colleagues, into 39 gigabytes using some clever compression techniques, but it still takes a certain amount of courage to embark on a research project which uses this data in a non-trivial way.
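To get a feel for the figures just quoted, here is a back-of-envelope calculation in Python. It is only a sketch: the count is the rounded "roughly 11 billion", a gigabyte is assumed to be 10^9 bytes, and the "naive" encoding (one byte per point, three points per triple) is a hypothetical baseline of my own, not anything the compressors used.

```python
# Figures quoted in the text (rounded).
NUM_SYSTEMS = 11_000_000_000       # "roughly 11 billion" systems of order 19
COMPRESSED_BYTES = 39 * 10**9      # assumption: 1 gigabyte = 10^9 bytes

V = 19                             # order of the Steiner triple systems

# Each of the V*(V-1)/2 pairs of points lies in exactly one triple,
# and each triple covers 3 pairs, so a system has V*(V-1)/6 triples.
TRIPLES_PER_SYSTEM = V * (V - 1) // 6

# Hypothetical naive encoding: one byte per point, three points per triple.
NAIVE_BYTES_PER_SYSTEM = 3 * TRIPLES_PER_SYSTEM


def bytes_per_system_compressed():
    """Average storage per system in the compressed catalogue."""
    return COMPRESSED_BYTES / NUM_SYSTEMS


def naive_total_terabytes():
    """Total size of the naive encoding, in units of 10^12 bytes."""
    return NUM_SYSTEMS * NAIVE_BYTES_PER_SYSTEM / 10**12


print(TRIPLES_PER_SYSTEM)                        # 57
print(NAIVE_BYTES_PER_SYSTEM)                    # 171
print(round(bytes_per_system_compressed(), 2))   # 3.55
print(round(naive_total_terabytes(), 3))         # 1.881
```

On these assumptions the compressed catalogue spends about three and a half bytes on each system, where the naive listing would need nearly two terabytes in all.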

So we are probably well ahead of the game in some respects, and well behind in others.

Posted in Uncategorized | 2 Comments

Data science and statistics

Words change their meanings. Once “biometrics” referred to the use of statistics in studying biological systems such as agriculture (as opposed to, say, “technometrics”): now only one item on the Google top ten refers to the International Biometric Society, so completely has a different meaning taken over.

This is an example where the change in meaning has been driven by technology. Now it seems that the rise of “data science” may be forcing a similar change on “statistics”.

These rather brief thoughts are provoked by reading Peter Diggle’s presidential address to the Royal Statistical Society (published in their journal, series A, volume 178(4)), entitled “Statistics: a data science for the 21st century”.

By way of introduction, Diggle remarks that the start of his career more or less coincided with the first statistical software, and some people at the time thought that the discipline of statistics was no longer necessary, since computers would do the work. (This is still an issue; there have been recent stories of established research institutes sacking their statisticians because the scientists all have Excel on their desktops.)

Is there a difference between statistics and data science? After a caution that Wikipedia definitions “may not be authoritative, but they are often illuminating”, Diggle cites Wikipedia for definitions of data science, information science, and statistics, and remarks that they show considerable overlap. So what can statistics offer? Among much else, he says:

Crucially, we can assert that uncertainty is ubiquitous and that probability is the correct way to deal with uncertainty. We understand the uncertainty on our data by building stochastic models, and in our conclusions by probabilistic inference. And on the principle that prevention is better than cure we also minimize uncertainty by the application of the design principles that Fisher laid down 80 years ago, and by using efficient methods of estimation.

What is striking is that, in contrast to my (and his) opening comments, he urges statisticians to embrace software:

Principally, we learn that a published article is no longer a complete solution to a practical problem. We need our solutions to be implemented in software, preferably open source so that others can not only use but also test and, if need be, improve our solutions. We also need to provide high quality documentation for the software. And in many cases we need to offer an accessible, bespoke user interface.

This ties in with many things I know a little bit about. But it raises another crucial issue. I have always felt (and my experience includes being on two RAE panels) that software is not fairly judged by the UK research assessment. Mathematics panels feel that writing code is not really of the same importance as proving a theorem, even though (for example) GAP has resulted in far more research than a typical mathematics paper in a good journal. I think that statistics is treated in something of the same way: statisticians are regarded as technicians rather than as scientists making an essential contribution. I hope that Diggle’s words will be used as a weapon in the fight to put this right.

The penultimate section of Diggle’s paper addresses “statistics in context”, starting with John Nelder’s comment that mathematical statistics should really be called statistical mathematics. But it is much more than playing with words. Diggle refers approvingly to the former organisation of CSIRO in Australia (alas, no longer the case), where most statisticians had two offices, “one co-located with other statisticians; one co-located with scientists in another discipline … The result was a symbiotic relationship in which statisticians brought to our weekly meetings challenging problems from many disciplines and took back to those disciplines solutions informed by a very wide range of statistical expertise.” He continues:

I would like to see every research-led university in the UK create a statistics institute. Each statistician on the university’s staff would have a dual appointment, to the institute and to an appropriate second discipline, be it mathematics, computer science or any one of the natural, biomedical or social sciences.

Well, a version of the first sentence was argued for in the most recent International Review of UK mathematics. (The link is to my comments on it; EPSRC have moved it twice, and I don’t know where it resides now.) The review was commissioned by EPSRC, but a long and strongly-worded section on statistics was kicked into the long grass by EPSRC, along with various other things which they really didn’t want to hear (such as support for the current diversity of UK mathematics). So I won’t hold my breath waiting for this.

In a talk a couple of years ago, John Stufken presented a “bicycle-wheel” model of statistics, where the hub is the mathematical underpinnings, the tyre the interaction with scientists of every description, and the spokes are the rare people who can operate in both modes.

A final thing is on my mind in view of the memorial day for Donald Preece next week. Donald was an advocate of “data sniffing”: running your eyes over the data, looking for anomalies. (There is more chance of correcting errors you catch in this way if you do it while the scientists are still around and the lab books haven’t been ditched.) How do you do data sniffing if your data is “big data” contained in a data file which may be gigabytes or even terabytes in size? Can computers be programmed to do “data sniffing”? There is a real question for the AI specialists!

But I see from Wikipedia that this brings me back to my starting point. The term “data sniffing” has a different meaning today; it is one of the things that the surveillance industry does to the communications it intercepts.
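Returning to data sniffing in Donald’s sense: one very crude way to get a computer to run its eyes over a column of numbers is to flag values that lie far from the median, measured in units of the median absolute deviation. The sketch below is an illustration only, not a substitute for a trained eye; the function name `sniff`, the sample readings and the threshold of 3.5 are my own assumptions (the constant 0.6745 is the usual scaling for the “modified z-score”), and it needs the whole column in memory, so it does not yet answer the question for terabytes of data.

```python
from statistics import median


def sniff(values, threshold=3.5):
    """Return (index, value) pairs that look anomalous.

    A value is flagged when its modified z-score,
    0.6745 * |value - median| / MAD, exceeds the threshold.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values coincide with the median:
        # flag everything that differs from it.
        return [(i, v) for i, v in enumerate(values) if v != med]
    return [(i, v) for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical lab readings, with one slipped decimal point.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 101.0, 10.0]
print(sniff(readings))   # [(5, 101.0)]
```

The median and the median absolute deviation are used rather than the mean and standard deviation because a single wild value drags the latter pair towards itself, which can hide the very anomaly one is looking for.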

Posted in mathematics and ..., technology | 4 Comments

Silly season, 2

I received a very respectful email enclosing a paper and soliciting my comments.

I wouldn’t usually treat the sender of such an email like this, but I happened to notice that all the other addressees had names beginning pj, so I doubt that I had been carefully selected.

A glance at the attachment showed two things.

  • First, it contained two proofs that π = (14−√2)/4.
  • Second, it was not a manuscript or preprint, but a reprint from an international journal with the spuriously precise impact factor of 3.785.
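
For the record, a quick numerical check of the claimed value (a sketch in Python, using the standard library’s `math.pi` as the reference; it has nothing to do with the paper’s own arguments):

```python
import math

# The value of pi "proved" in the paper.
claimed = (14 - math.sqrt(2)) / 4

print(f"claimed value: {claimed:.6f}")            # 3.146447
print(f"math.pi:       {math.pi:.6f}")            # 3.141593
print(f"difference:    {claimed - math.pi:.6f}")  # 0.004854
```

So the claimed value already goes wrong in the third decimal place.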

Posted in publishing | 4 Comments



I already posted part of this picture, a signpost on the edge of the Olympic Park back in 2011. I tagged it with the last phrase of the second verse of Procol Harum’s apocalyptic song Homburg: “The signposts cease to sign”.

I decided that, instead of scrubby trees by an east London waterway, it deserved a more dramatic background, so I used a photo of the sun setting into the sea taken at St Kilda, a suburb of Melbourne, later that year.

Posted in Uncategorized

Silly season gleanings

EPSRC have finally made it into Private Eye, into Pseuds Corner to be precise; the current edition quotes part of their definition of a sandpit. (The entire definition is several pages long. For instance, sample activities for the first stage of the process, called “Interact, Create mission statement”, might be “Site visits to further explore the issues. Vision setting through creative ‘cartoon strip’ workshop, followed by a night learning how to play a musical instrument … and some friendly team competition.”)

Three kings

The picture above is not friendly team competition over dinner at a sandpit, but shows the kings of Hungary, Bohemia and Poland meeting in the castle of Visegrád in 1335 to hammer out an agreement. Did the agreement include relaxing the Hungarian quota on imports of Czech beer?

Posted in Uncategorized

The other St Andrews


Szentendre is a small town at the end of a suburban railway line from Budapest. I first visited it 22 years ago, in the winter; it was a beautiful place of artists’ studios and galleries, old Serbian churches full of icons, and wide views across an arm of the Danube to a large island.

We went back yesterday, on a summer Sunday at the end of the St Stephen’s Day long weekend, and could hardly move for tourists.

Posted in geography

Budapest moment

Rose and yew

The moment of the rose and the moment of the yew tree
Are of equal duration

T. S. Eliot, “Little Gidding”

This picture is of a corner of the sadly neglected gardens around the tomb of Gül Baba, the person who introduced the cultivation of roses to Budapest.

Posted in geography, history