"Data Publishing 2020:" Four case statements

22 Jan 2014

Dear RDA Members,
 
After several months of hard work, the RDA-WDS Publishing Data Interest Group has produced a proposal comprising four coordinated Working Groups under its umbrella for consideration and hopefully endorsement by RDA after review. 
 
On behalf of the Co-chairs of the RDA-WDS Publishing Data IG, the Chairs and Co-chairs of the proposed Working Groups, and all the individual Members who contributed to this work, please review the following 5 documents for your consideration and comments:  
  1. Data Publishing 2020: Proposal for a Coordinated Approach. This document articulates the links and dependencies between the four proposed Working Groups and expands on the holistic approach to establish data publishing as part of the scholarly record.
  2. Workflows Working Group Case Statement
  3. Bibliometrics Working Group Case Statement
  4. Cost Recovery for Data Centres Working Group Case Statement
  5. Data Publication Services Case Statement
The PDF portfolio is also available in the RDA website file repository here: https://www.rd-alliance.org/filedepot/folder/114?fid=373
 
Please review and comment as a response to this post.
 
- submitted on behalf of Mustapha Mokrane
  • Author: Elizabeth Griffin

    Date: 23 Jan, 2014

    It is clear that a very significant amount of work has gone into discussing this issue and into framing a response that should meet all presently foreseen needs. Almost certainly, not everyone will be satisfied with the proposed operations for managing the publication of data, and almost certainly the desiderata, and the objectives for attaining them, will need to be modified and updated as the situation evolves. The topic bothers some scientists a great deal more than it bothers (or even impresses itself upon the consideration of) others, and that is something that will be hard to mend, and certainly not quickly.

    Having said that, there are three points that I would like to make. Two concern perception, while the third is more practical.

    1. 'Overall objectives' starts by saying that "In the empirical sciences" data have "traditionally been an integral part of scholarly publishing". My own background is astronomy, and I am left rather puzzled by that statement, not because astronomy is a world unto itself, but because - like all other empirical sciences - its 'tradition' stretches back to the decades before observations were recorded digitally. Observations that were hand-written, printed on pro-forma records, entered in log-books, recorded as photographic images, etc., could not be 'published' per se; being unique copies they were probably kept by a laboratory assistant, a plate archivist or a librarian in a well-protected location. Many were made 'public' in the sense that researchers other than those who were responsible for making the observations (or for having them made) could be permitted physical access to the papers, books, prints, charts, photographs, etc., and visiting scientists were often allowed similar privileges too, but that could not be construed as 'publishing the data'. My raising this matter is not to split hairs, but is rather because the concept of 'publishing data' has its roots rather firmly in how things have actually evolved from those 'traditional' beginnings.

    2. Closely connected to the above is the question of what is meant by 'data'. This point is absolutely central to every discussion, document and organization that uses the term 'data', yet it probably has more different interpretations than are good for any body that is trying to define the management of 'data'. The observations that we scientists make are objective quantities: they describe, in whatever units, the target - whatever it is - as it was at that moment, bearing specific properties of space and time, brightness, colour, shape, etc., but telling us no more than that. Those properties are *objective* and are unalterable, while everything that is deduced about the target or object - its classification, evolutionary state, development, age, etc. - is *subjective* information that is liable to be modified or updated as theories or standards evolve. Every observation is of course circumscribed in some way since it must carry signatures of the recording or detecting instrumentation (e.g., wavelength range, resolution) or of decisions made by the observer (such as exposure duration or focus). The whole - the raw observation along with its detector signatures - cannot be altered, though the record itself can be manipulated (the signatures removed, often referred to as 'calibration') in order to purify it (of cosmic-ray events, for instance) or to render it in a form that can be compared more closely with other observations. It is vital to maintain a clear, and preferably tangible, separation between the objective and subjective elements of 'data'; too often one meets pressure to include subjective quantities like classification in a catalogue of raw observations. It is also argued that the raw observations should accompany the literature about the object, but that is a dangerous move since it will tend to weaken the essential distinction between objectivity and subjectivity. A pointer to where the relevant observations can be found is the best plan for the present, but ....

    3. what is meant by 'publishing data'? Observations in my own field nowadays take the form of CCD images, and they soon become rather bulky, especially when all the necessary calibration files are included. If it means quite literally that a copy of every relevant observation should be sent to the publisher along with the text of a paper that presents measurements and analyses of them, then the publishers are soon going to be overwhelmed by replicates of all those original observations. In astronomy, what is beginning to happen is that an observatory puts the observations from its own telescopes onto some public website in a commonly used format like FITS, and they can be accessed from there - so why 'publish' them again along with a paper?
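
    To make that access route concrete, the following is a minimal sketch of how a reader might inspect an observation that an observatory has exposed as a FITS file (the file name is hypothetical, and the Python astropy package is assumed):

    ```python
    # Minimal sketch: open a published CCD observation stored as a FITS file
    # and inspect the instrument signatures carried in its header.
    # "example_observation.fits" is a hypothetical file name; requires astropy.
    from astropy.io import fits

    with fits.open("example_observation.fits") as hdul:
        header = hdul[0].header   # observing circumstances and detector signatures
        image = hdul[0].data      # the raw CCD frame itself
        # Typical keywords record when and how the observation was made.
        print(header.get("DATE-OBS"), header.get("EXPTIME"), header.get("INSTRUME"))
        print("image shape:", None if image is None else image.shape)
    ```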

    While there is of course merit in principle in validating (quality-checking) and peer-reviewing data submitted for publication, I am unclear as to what is actually meant and how it can be carried out in practice. It will be very difficult for another scientist - even at the same observatory - to confirm after the event that a telescope and detector had been used correctly or that its observations at any given moment were not contaminated in some (perhaps subtle) way.

    A bit of extra clarification of the above points will go a long way towards turning a strong proposal into a landmark one that can be endorsed and acted on comprehensively by all the empirical sciences.


  • Author: Jonathan Tedds

    Date: 23 Jan, 2014

    Hi remgriffin,

    I think you raise important issues. As one of the co-chairs for the proposed case statements I hope I can help clarify some of the thinking behind this (and we may need to add something further to the cover note and/or statements).

    The overall area is one of considerable debate, as you rightly point out, so I would certainly not assert that this is in any way a last word! Rather, the proposed Working Groups are part of an international effort by those involved (including researchers, data centres, publishers, institutions and funders) to describe what is already happening (which is already very diverse across disciplines), to look for common features, and to begin testing some proposed and working solutions, which many of us are involved in, that could be (re)used in new settings. I think the brief for what data are, and what publishing them means, is therefore as narrow or as wide as researchers wish it to be. 

    Consider a traditional scientific subject such as astronomy, with which I am very familiar, or Earth Sciences, say. There will be cases where facility data, assuming they are well described, will be published by means of their metadata alone, e.g. through facilities and/or data centres, according to well-described schemas that allow direct reuse. However, given that most disciplines do not publish datasets in such ways, and that in either case producers may wish to accrue appropriate academic credit for reuse via more traditional means, an associated data paper might be written.

    Indeed there is a long tail of astronomy data, as much as in any other discipline, where data are produced in non-standard ways by individuals or small groups of researchers, and where sufficient metadata and/or associated description is necessary to show how the raw facility data have been collected, calibrated and processed for subsequent investigation and potential reuse. These datasets are often not sufficiently well described to allow reuse without more extensive description. 

    I hope that gives a slightly broader context for what we hope to achieve, as set out in the case statements, but this is just one opinion among many in this rapidly evolving area!

    Jonathan Tedds, University of Leicester

    Co-chair: RDA-WDS Publishing Data Interest Group & proposed RDA-WDS Workflows Working Group

  • Author: Rafael Mayo-Garcia

    Date: 24 Jan, 2014

    With regard to the Workflows Working Group, we at the European Commission co-funded CHAIN-REDS project (http://www.chain-project.eu/) are working on a proposed Data Accessibility, Reproducibility and Trustworthiness challenge workflow, which aims to cover the whole cycle of research based on previous data: find a specific dataset (or a publication with associated data); retrieve it; use these data either to reproduce some experiments or to perform new calculations on a computational platform; obtain new results and publish them; and store any newly produced data.

     

    This cycle is being made possible by the use of data and metadata standards (OAI-PMH, Dublin Core, SPARQL...) and widely accepted mechanisms such as Persistent Identifiers. More information can be found in the public document 'D4.2 Analysis of Data Infrastructures and Data Repositories' (scroll down at http://www.chain-project.eu/deliverables to find Deliverable D4.2).
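
    As a rough illustration of the 'find and retrieve' step of that cycle, the sketch below harvests Dublin Core records from an OAI-PMH endpoint and lists the persistent identifiers (often DOIs or handles) that they expose; the repository URL is a placeholder, not a real CHAIN-REDS service, and the Python requests library is assumed:

    ```python
    # Rough sketch: harvest Dublin Core metadata records over OAI-PMH and
    # print the titles and identifiers they expose.
    # The base URL is a placeholder, not a real CHAIN-REDS endpoint.
    import requests
    import xml.etree.ElementTree as ET

    BASE_URL = "https://repository.example.org/oai"   # hypothetical endpoint
    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    resp = requests.get(BASE_URL,
                        params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
                        timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)

    for record in root.iter(OAI + "record"):
        titles = [t.text for t in record.iter(DC + "title")]
        identifiers = [i.text for i in record.iter(DC + "identifier")]  # often a DOI or handle
        print(titles, identifiers)
    ```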

     

    The CHAIN-REDS project has already signed an MoU with the EUDAT initiative and would be interested in collaborating with the RDA on this topic.

     

    Rafael Mayo-García, CIEMAT

    CHAIN-REDS WP4 'Data Infrastructure' Manager (on behalf of the project)

     

  • Author: Jonathan Tedds

    Date: 24 Jan, 2014

    Dear Rafael,

    Many thanks for describing the CHAIN-REDS project; it sounds like it could be a very interesting workflow to follow up on. I've passed your message on to the Workflows WG members (not on this site yet, pending RDA approval) for consideration, so I hope to be in touch soon.

    Thanks

    Jonathan Tedds for Workflows WG

  • Author: Leticia Cruz

    Date: 27 Jan, 2014

    The project's timeline is adequate for implementation, as are its mission and vision. I do have questions regarding the key deliverables, which are to ascertain effective data and to use such data to advance the scientific community's goals and benefit society; therefore, I have a question regarding how the credentials of persons seeking research data, and their purpose for doing so, will be verified. I am, however, impressed by the project in all its aspects, and I am eager to assist in its completion and confident in its mission.

     

    Kind Regards

    Ms Leticia Cruz RN MSN/ed DNP 
