Open research data

Another day, another big issue.

I’ve just been reading a draft concordat on open research data, produced by a committee representing UK research councils, funders, and universities. I won’t give you the whole thing, but here for context is the definition they adopt:

Research Data are quantitative information or qualitative statements collected by researchers in the course of their work by experimentation, observation, interview or other methods. Data may be raw or primary (e.g. direct from measurement or collection) or derived from primary data for subsequent analysis or interpretation (e.g. cleaned up or as an extract from a larger data set). The purpose of open research data is to provide the information necessary to support or validate a research project’s observations, findings or outputs. Data may include, for example, statistics, collections of digital images, sound recordings, transcripts of interviews, survey data and fieldwork observations with appropriate annotations.

And here are the principles:

  1. Open access to research data is an enabler of high quality research, a facilitator of innovation and safeguards good research practice.
  2. Good data management is fundamental to all stages of the research process and should be established at the outset.
  3. Data must be curated so that they are accessible, discoverable and useable.
  4. Open access to research data carries a significant cost, which should be respected by all parties.
  5. There are sound reasons why the openness of research data may need to be restricted but any restrictions must be justified and justifiable.
  6. The right of the creators of research data to reasonable first use is recognised.
  7. Use of others’ data should always conform to legal, ethical and regulatory frameworks including appropriate acknowledgement.
  8. Data supporting publications should be accessible by the publication date and should be in a citeable form.
  9. Support for the development of appropriate data skills is recognised as a responsibility for all stakeholders.
  10. Regular reviews of progress towards open access to research data should be undertaken.

I was invited to comment on this, presumably from the perspective of a pure mathematician.

My first reaction was that the definition appears to exclude the kind of data that pure mathematicians generate, such as the list of finite simple groups. (In fact, it is sufficiently woolly that it doesn’t actually exclude anything.)

My second was that we have an exemplary open data source in the Atlas of Finite Group Representations. This is a huge repository of data, including matrices and/or permutations for generators of large numbers of almost simple groups in many different representations. It is well-curated and useable in the sense of Principle 3: computer algebra systems such as Magma and GAP can directly import the appropriate generators from the site, in a way which is almost transparent to the user.

But our great good fortune in having this resource shouldn’t make us complacent: I am aware that there are many other sources of mathematical data which are not managed as well. We are lucky in having Rob Wilson and his team running this resource. So perhaps we do have something to learn from the principles above.

The Atlas is an extreme case. Most data sets that discrete mathematicians produce are likely to be sequences of integers, and the OEIS already provides a well-curated repository for these. But there are other cases. Lists of small Latin squares or Steiner triple systems involve huge amounts of data. The roughly 11 billion Steiner triple systems of order 19 have been compressed, by Patric Östergård and his colleagues, into 39 gigabytes using some clever techniques (see here), but it still takes a certain amount of courage to embark on a research project which uses this data in a non-trivial way.
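To get a feel for those numbers, here is a rough back-of-envelope calculation in Python. It uses only the rounded figures quoted above, and it assumes that "gigabyte" means 10^9 bytes. The "naive" encoding (one byte per point, three bytes per triple) is purely hypothetical, introduced for comparison; it says nothing about how the compression was actually done.

```python
# Back-of-envelope estimate using the rounded figures quoted in the post.
NUM_SYSTEMS = 11_000_000_000     # "roughly 11 billion" STS(19)
COMPRESSED_BYTES = 39 * 10**9    # 39 gigabytes, assuming decimal gigabytes


def triples_in_sts(v: int) -> int:
    """Number of triples in a Steiner triple system on v points: v(v-1)/6."""
    return v * (v - 1) // 6


blocks = triples_in_sts(19)                      # triples per system
naive_bytes = blocks * 3                         # hypothetical: one byte per point
bytes_per_system = COMPRESSED_BYTES / NUM_SYSTEMS
total_naive_gb = NUM_SYSTEMS * naive_bytes / 10**9

print(f"triples per system:          {blocks}")
print(f"naive bytes per system:      {naive_bytes}")
print(f"compressed bytes per system: {bytes_per_system:.2f}")
print(f"naive total size (GB):       {total_naive_gb:.0f}")
```

On these assumptions each system has 57 triples, so the naive listing would take 171 bytes per system and nearly two terabytes in all, whereas the compressed file averages about three and a half bytes per system. That also explains the courage required: any use of the data has to begin by decompressing it.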

So we are probably well ahead of the game in some respects, and well behind in others.

About Peter Cameron

I count all the things that need to be counted.

2 Responses to Open research data

  1. John Cremona says:

    Another well-organised mathematical database and “open data source” is the L-functions and modular forms database LMFDB at http://www.lmfdb.org!

  2. One of the flaws I see in the way researchers in mathematics exchange data is that we currently do it individually, on our web pages. Projects like Wikipedia show that it is not hopeless to try to draw a map of everything at once, and find our way there. I long for a central repository that would index mathematical data, in the same way that I long for a central repository that would index implemented algorithms (solving this or that specific problem).

    I contribute to the free software Sage, where we build plenty of data. I implemented a *lot* of known constructions of designs, in the hope that every design that is known to exist (given a set of parameters) will eventually be obtainable by a simple command. Similarly, we try these days to build all strongly regular graphs indexed in Andries Brouwer’s website in Sage. Nothing better than having an example on your computer to make sure that this or that object exists.

    But Sage, like Magma or GAP, is not a database, though it can produce data. Installing any of them (and learning how it works) is not entirely trivial, and in my experience it does not replace something as simple as raw data. And we *must* find a way to index the mathematical databases.
