Another day, another big issue.
I’ve just been reading a draft concordat on open research data, produced by a committee representing UK research councils, funders, and universities. I won’t give you the whole thing, but here for context is the definition they adopt:
Research Data are quantitative information or qualitative statements collected by researchers in the course of their work by experimentation, observation, interview or other methods. Data may be raw or primary (e.g. direct from measurement or collection) or derived from primary data for subsequent analysis or interpretation (e.g. cleaned up or as an extract from a larger data set). The purpose of open research data is to provide the information necessary to support or validate a research project’s observations, findings or outputs. Data may include, for example, statistics, collections of digital images, sound recordings, transcripts of interviews, survey data and fieldwork observations with appropriate annotations.
And here are the principles:
- Open access to research data is an enabler of high quality research, a facilitator of innovation and safeguards good research practice.
- Good data management is fundamental to all stages of the research process and should be established at the outset.
- Data must be curated so that they are accessible, discoverable and useable.
- Open access to research data carries a significant cost, which should be respected by all parties.
- There are sound reasons why the openness of research data may need to be restricted but any restrictions must be justified and justifiable.
- The right of the creators of research data to reasonable first use is recognised.
- Use of others’ data should always conform to legal, ethical and regulatory frameworks including appropriate acknowledgement.
- Data supporting publications should be accessible by the publication date and should be in a citeable form.
- Support for the development of appropriate data skills is recognised as a responsibility for all stakeholders.
- Regular reviews of progress towards open access to research data should be undertaken.
I was invited to comment on this, presumably from the perspective of a pure mathematician.
My first reaction was that the definition appears to exclude the kind of data that pure mathematicians generate, such as the list of finite simple groups. (In fact, it is sufficiently woolly that it doesn’t actually exclude anything.)
My second was that we have an exemplary open data source in the Atlas of Finite Group Representations. This is a huge repository of data, including matrices and/or permutations for generators of large numbers of almost simple groups in many different representations. It is well-curated and useable in the sense of Principle 3: computer algebra systems such as Magma and GAP can directly import the appropriate generators from the site, in a way which is almost transparent to the user.
But our great good fortune in having this resource shouldn’t make us complacent: I am aware that there are many other sources of mathematical data which are not managed as well. We are lucky in having Rob Wilson and his team running this resource. So perhaps we do have something to learn from the principles above.
The Atlas is an extreme case. Most data sets that discrete mathematicians produce are likely to be sequences of integers, and the OEIS already provides a well-curated repository for these. But there are other cases. Lists of small Latin squares or Steiner triple systems involve huge amounts of data. The roughly 11 billion Steiner triple systems of order 19 have been compressed, by Patric Östergård and his colleagues, into 39 gigabytes using some clever techniques (see here), but it still takes a certain amount of courage to embark on a research project which uses this data in a non-trivial way.
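A rough back-of-envelope sketch in Python gives a feel for the scale of the Steiner triple system figures. The naive encoding used for comparison (each triple stored as three point labels in fixed-width binary) is an assumption made purely for illustration, and is not the scheme Östergård and his colleagues used. The sketch also assumes that a gigabyte means 10^9 bytes and takes "roughly 11 billion" at face value.

```python
# Back-of-envelope storage estimate for Steiner triple systems of order 19.
# All constants below are assumptions based on the rounded figures in the text.

V = 19                          # number of points
NUM_SYSTEMS = 11_000_000_000    # "roughly 11 billion" systems
COMPRESSED_BYTES = 39 * 10**9   # 39 gigabytes, assuming 1 GB = 10^9 bytes


def blocks_in_sts(v):
    """Number of triples in a Steiner triple system on v points: v(v-1)/6."""
    return v * (v - 1) // 6


def naive_bits_per_system(v):
    """Bits needed if each triple is stored as three fixed-width point labels.

    This is a hypothetical naive encoding, used only as a baseline.
    """
    bits_per_point = (v - 1).bit_length()  # labels 0..v-1
    return blocks_in_sts(v) * 3 * bits_per_point


naive_bits = naive_bits_per_system(V)
naive_total_bytes = NUM_SYSTEMS * naive_bits / 8
compressed_bits_per_system = COMPRESSED_BYTES * 8 / NUM_SYSTEMS

print(f"triples per system:         {blocks_in_sts(V)}")
print(f"naive bits per system:      {naive_bits}")
print(f"naive total (terabytes):    {naive_total_bytes / 10**12:.2f}")
print(f"compressed bits per system: {compressed_bits_per_system:.1f}")
```

Under these assumptions each system has 57 triples and would take 855 bits if stored naively, a little under 1.2 terabytes in total. The 39 gigabyte figure works out to about 28 bits per system, roughly a thirtyfold saving over the naive baseline.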
So we are probably well ahead of the game in some respects, and well behind in others.