Case Statement: Data Citation - Making Research Data Citable

29 May 2013

Posted: Wed May 08, 2013 4:41 pm

by rauber

Dear Colleagues,

Following the BoF Session in Gothenburg (see discussion thread at http://forum.rd-alliance.org/viewtopic.php?f=5&t=58) and several individual follow-up discussions we have now created a first draft of the Case Statement for the Working Group on Data Citation - Making Research Data Citable, attached below.

We look forward to receiving your comments and questions, as well as expressions of interest in potential pilots for the solutions to be developed and tested, either as individual institutions or as pairs of content owners and solution implementers.

best regards,
Andreas Rauber


Re: Case Statement: Data Citation - Making Research Data Cit

Posted: Mon May 13, 2013 3:18 pm

by tobiasweigel

Thanks for uploading this. I mostly agree with the overall charter and the list of issues described on the first pages.
However, I feel that there is some sort of topical break when it comes to the actual work plan. This might just be the result of cutting down the topic towards a subset that is actually achievable within the limited timeframe (something many WGs seem to be struggling with). I agree with the general strategy, which sounds to me achievable and pragmatic (roughly: develop - test - iterate). But as far as the content goes, I might be missing or misunderstanding something important, so here's a detail question:

I am not sure what exactly is intended with the reference model in WP2. As you mention in the beginning, data may come in various formats, from databases to individual netCDF/HDF5 files. Does the reference model provide the means to describe which data is a subset of which other data? And how do PIDs relate to this, as they may be assigned at differing levels of granularity? Perhaps the unclear point for me is whether it's a data model of some kind or something else, such as a set of policies or a process model.
You are also mentioning data collections to which new elements can be added in a manner that is time-stamped. How is this reflected in the work plan/work packages? 

Best, Tobias


Re: Case Statement: Data Citation - Making Research Data Cit

Posted: Tue May 14, 2013 2:01 pm

by rauber

Hi Tobias,

Thanks for raising an important issue - maybe we should clarify this a bit better in the wording of the actual work.
While not wanting to preclude any other solution, there was a feeling that approaches assigning PIDs at different levels of granularity would not scale, as they would either require enormous numbers of PIDs being cited (e.g. when assigning PIDs on a data item level) or not support citation at a sufficient level of granularity (e.g. when assigning a PID to an entire data set).

The approach proposed, and that most partners seem to want to test initially, would thus assign the PID to the query (or whichever other means of identifying a respective subset of data in any given data file/set), and ensure that this selection statement can be re-executed with identical results at a later point in time. In the case of SQL-style DBMS, this requires versioned and time-stamped databases, time-stamped queries, and potentially unique sorts if the query itself is not sufficient and the order of result tuples is essential, plus, potentially, hash keys computed over the result set (IDs) to allow verification of the identity of the result sets obtained. This should allow identification of arbitrary subsets of data in a transparent manner (i.e. not requiring specific actions on behalf of the researcher), while being scalable to even very large data sets. The goal is to implement this (and analogous solutions for other types of data sets/systems) in initial pilots, evaluate them and see whether this principle can be recommended as a generic approach, or which specific advantages/disadvantages etc. have to be observed.
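To make the SQL-style case more concrete, here is a minimal sketch of how such a pilot might look. It is only an illustration under stated assumptions, not the WG's agreed design: the table `readings`, the integer "logical" timestamps, the PID string `pid:example/1` and all function names are hypothetical, and SQLite merely stands in for any SQL-style DBMS.

```python
import hashlib
import sqlite3

# Hypothetical versioned table: a row is never overwritten. Each version
# records when it became current (valid_from) and when it was superseded
# (valid_to, NULL while still current). Timestamps are plain integers here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "id INTEGER NOT NULL, value REAL NOT NULL, "
    "valid_from INTEGER NOT NULL, valid_to INTEGER)"
)

def insert(rec_id, value, ts):
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)", (rec_id, value, ts))

def correct(rec_id, value, ts):
    # Close the current version instead of overwriting it, then add the new one.
    conn.execute(
        "UPDATE readings SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
        (ts, rec_id),
    )
    insert(rec_id, value, ts)

def run_as_of(min_value, ts):
    # Time-stamped query: only versions that were current at time ts are
    # visible, and ORDER BY id fixes the order of the result tuples.
    return conn.execute(
        "SELECT id, value FROM readings "
        "WHERE value >= ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY id",
        (min_value, ts, ts),
    ).fetchall()

def result_hash(rows):
    # Hash over the ordered result set, used to verify a later re-execution.
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

citations = {}  # hypothetical PID registry: PID -> query parameters + hash

def cite(pid, min_value, ts):
    rows = run_as_of(min_value, ts)
    citations[pid] = {"min_value": min_value, "ts": ts, "hash": result_hash(rows)}
    return rows

def resolve(pid):
    entry = citations[pid]
    rows = run_as_of(entry["min_value"], entry["ts"])
    if result_hash(rows) != entry["hash"]:
        raise ValueError("result set differs from the one originally cited")
    return rows

insert(1, 10.0, ts=1)
insert(2, 20.0, ts=2)
cited = cite("pid:example/1", min_value=5.0, ts=2)
correct(1, 11.0, ts=3)   # later correction of an existing item
insert(3, 30.0, ts=4)    # later addition of a new item
assert resolve("pid:example/1") == cited
```

The PID is attached to the query parameters and the timestamp rather than to individual data items, so the number of PIDs grows with the number of citations, not with the size of the data set.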

Which PID is to be used, as well as which metadata needs to be added to the query identified by the PID in order to allow attribution as well as human-readable interpretation of a citation, is to be dealt with in a separate WG, all under the umbrella of the Data Publication IG.

Does this help and make sense? We should probably add this more verbatim in the WG description.

Andi


Re: Case Statement: Data Citation - Making Research Data Cit

Posted: Tue May 14, 2013 2:10 pm

by rauber

A quick comment also on other comments: thanks to all of you who have contacted me via email with comments on the draft Case Statement. I have collected these and will incorporate them. In a nutshell, they specifically concern the extension of the stakeholder communities to include

- Libraries, who provide data citation related services, and who link between the various communities, i.e. researchers, publishers, data providers, etc.
- Funding Agencies and Evaluators of Research, who want to know what they get for their money
- Entrepreneurs, who may see value added / businesses enabled, such as currently observed with software companies providing services based on open data.

I hope I've been able to answer most of the other questions submitted by email - and I'd like to encourage everybody to use this public channel for questions, answers and suggestions to share the information and views right away, so that we can shape this together, and see which pilots would be most interesting to launch initially.

best, andi


Re: Case Statement: Data Citation - Making Research Data Cit

Posted: Thu May 23, 2013 12:16 pm

by pcruse

Thanks to the group for putting the case statement together. My comment concerns the 4 stakeholder communities (data providers, solution providers, researchers, community) that have so far been identified. I would suggest that the case statement also includes libraries/information providers in the mix of stakeholders. Libraries are a neutral services entity and work with researchers, publishers, and the broader community to provide enduring access to information, including data. This is part of our core mission. Libraries have the ability to work equally with publishers and societies on many of the items included in the case statement outlines. Libraries can also work with researchers to create incentives to take up many of the actions that you have identified. Finally, DataCite, an important component of data citation, is a group of libraries that have come together to push forward data citation.

Here are a handful of specific actions that libraries can take: 
- encourage data citation with outreach and services, where appropriate and possible
- educate researchers on importance and benefits of data citation
- work with publishers to encourage the inclusion of cited data in traditional scholarly literature
- foster an environment of attribution for all scholarly works by encouraging scholars to cite their data and to ask for data citation in the journals where they publish.

Trisha Cruse
UC Curation Center (UC3)
California Digital Library


Re: Case Statement: Data Citation - Making Research Data Cit

Posted: Tue May 28, 2013 7:59 am

by tobiasweigel

rauber wrote: The approach thus proposed and that most partners seem to want to test initially thus would assign the PID to the query (or whichever other means of identifying a respective subset of data in any given data file/set), and ensure that this selection statement can be re-executed with identical results at a later point in time (requiring, in the case of SQL-style DBMS, versioned and time-stamped databases, time-stamped queries, potentially ensuring unique sorts if the query itself should not be sufficient and order of result tuples is essential), plus, potentially, hash keys computed over the result set (IDs) to allow verification of the identity of the result sets obtained. This should allow identification of arbitrary subsets of data in a transparent manner (i.e. not requiring specific actions on behalf of the researcher), while being scalable to even very large data sets. The goal is to implement this (and analogous solutions for other types of data sets/systems) in initial pilots, evaluate them and see whether this principle can be recommended as a generic approach, or which specific advantages/disadvantages etc. have to be observed.

Hi Andi,

thanks for the explanation. You are right, this should be added more verbatim and more prominently to the description. It seems you already have a detailed solution in mind when you talk about re-executable queries, which is helpful given the limited timeframe; I'd guess not everyone caring about the larger citation issue may agree with this solution, but then again, RDA WGs do not work in the "one size fits all" fashion and in the end, actually working solutions count. I think this particular view should be clarified early on so you can control eventual scope creep. Later WGs/the IG should pick up differing aspects.

Best, Tobias


Re: Case Statement: Data Citation - Making Research Data Cit

Posted: Tue May 28, 2013 11:14 am

by rauber

pcruse wrote:Thanks to the group for putting the case statement together. My comment concerns the 4 stakeholder communities (data providers, solution providers, researchers, community) that have so far been identified. I would suggest that the case statement also includes libraries/information providers in the mix of stakeholders. ...

Hi Trisha,

Thanks for pointing this out - libraries will for sure be included as a stakeholder community in the next version of the draft case statement. We also need to ensure a close collaboration with the other WGs working on PIDs and metadata to be associated with data citation to support attribution etc. Maybe some libraries who also actually hold data would be interested in testing the feasibility of the approaches elaborated in this WG.

Ciao, andi


Re: Case Statement: Data Citation - Making Research Data Cit

Posted: Tue May 28, 2013 11:25 am

by rauber

tobiasweigel wrote:
Hi Andi,

thanks for the explanation. You are right, this should be added more verbatim and more prominently to the description. It seems you already have a detailed solution in mind when you talk about re-executable queries, which is helpful given the limited timeframe; I'd guess not everyone caring about the larger citation issue may agree with this solution, but then again, RDA WGs do not work in the "one size fits all" fashion and in the end, actually working solutions count. I think this particular view should be clarified early on so you can control eventual scope creep. Later WGs/the IG should pick up differing aspects.

Best, Tobias

Hi Tobias,

Ok, we'll revise the description to make this more specific!

But, just to make sure: 
We definitely do not want to limit the WG to one specific implementation/approach, and definitely not just to one example for SQL-style DBMS! This was just meant as a specific example of how to implement it for that type of data. The basic notion is to have some form of data representation and some form of accessing subsets of that data - which is the situation giving rise to the need for citing subsets of data, when data is being changed, updated, growing, etc. (if it's only static, indivisible blocks of data, there is little need for absolutely new models).
Now, as soon as you have some data and some means of access, the principle should work: assign the PID to the specific "selection", i.e. the operation that identifies the subset of the data, and ensure that this process is deterministically repeatable at later points in time. In the SQL example, that would be a versioned and time-stamped database, potentially with unique sorting. In the WG we definitely want to explore several options, particularly for non-SQL style data.
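For non-SQL data, the same principle could be sketched roughly as follows. This is only an illustration, assuming a time-stamped collection to which items are appended; the `Selection` and `Collection` classes, the station names, the file names and the PID string are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Selection:
    """The operation that identifies a subset; this is what the PID points to."""
    station: str
    as_of: int  # logical timestamp at which the selection was made

class Collection:
    """Append-only collection; every item records when it was added."""

    def __init__(self):
        self._items = []  # (added_at, station, payload) tuples

    def add(self, added_at, station, payload):
        self._items.append((added_at, station, payload))

    def select(self, selection):
        # Only items that already existed at selection.as_of are visible,
        # and the result is sorted so that its order is reproducible.
        hits = [
            (added_at, payload)
            for added_at, station, payload in self._items
            if station == selection.station and added_at <= selection.as_of
        ]
        return sorted(hits)

collection = Collection()
collection.add(1, "vienna", "file-001.nc")
collection.add(2, "vienna", "file-002.nc")
collection.add(2, "hamburg", "file-003.nc")

# Hypothetical registry: the PID resolves to the selection, not to the items.
registry = {"pid:example/a": Selection(station="vienna", as_of=2)}

first = collection.select(registry["pid:example/a"])
collection.add(3, "vienna", "file-004.nc")  # the collection keeps growing
again = collection.select(registry["pid:example/a"])
assert first == again
```

Nothing here depends on SQL: any data representation with a repeatable access operation and time-stamped additions would fit the same pattern.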

I'd be happy to see some members stepping forward to propose a solution or run a pilot for other types of data - as well as, of course, discussing deficiencies of the current concept! We have had a few rounds of sanity checks on the principles outlined above, and they seem to be fine so far, in terms of stability across different systems, scalability in some scenarios, etc. - but we definitely want to keep discussing this, identifying potential downsides - and actually testing it on a few pilots.

Ciao, andi

 

  • Author: Gary Berg-Cross

    Date: 14 Jul, 2013

    I'm part of the Data Foundations and Terminology WG and would like to keep in touch/liaise with this group on concepts and terms of interest.

    The prior discussion by Rauber on assigning PIDs at different levels of granularity is of immediate interest, at least to me, since I think that we need some agreement across RDA on how to define data identity and its relation to various granular silos of data like "data sets", "data collections" and the like.

    One worry, as noted by Rauber, is that simple models of identity might not scale when considering all the ways that one identified piece of data is considered in various aggregations. It seems that multiple WGs (e.g. Policy, Metadata Repositories etc.) will be considering this issue, and I'm not sure that we are yet keeping in touch with each other's thinking to advance together.

    Some of the ideas may be on the Forum over time, but perhaps we need some virtual meeting to discuss common interests.

  • Author: Andreas Rauber

    Date: 03 Aug, 2013

    Hi Gary,

    Having some joint discussion will definitely be necessary. The best option for this would be, in my opinion, as part of a session by the IG on Data publishing, which should subsume/integrate the various WG activities in this area. So much for formalities. On a pragmatic level: should we aim for a short joint session to discuss the core concepts, or merely see that we align the individual WG sessions so that they form a consistent session schedule to follow? I will be mostly off-line in August, but having a prep virtual mtg in September prior to the RDA mtg sounds like a nice option to me.

    best regards, Andi

  • Author: Gary Berg-Cross

    Date: 30 Jul, 2013

    Has the group developed any core list of concepts, identified by terms, that the DFT group may include in the scope of its work?

  • Author: Andreas Rauber

    Date: 03 Aug, 2013

    Gary,

    We haven't really gotten our heads around terminology yet, probably partly because so far it has not arisen as a specific issue. We had some debates on different types of "dynamics" in data, i.e. data only being added to vs. (historic) data items being deleted or corrected/amended - but basically all these concepts have existed in the field of databases for quite some time, as have timestamping and versioning concepts. The situation may be different for other types of data (i.e. non-standard SQL), but the concepts are likely to be similar.
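The different types of dynamics mentioned above (additions, deletions, corrections) can be illustrated with a small sketch. It assumes, purely for illustration, that every change is recorded as a time-stamped entry in a change log; the log format, the action names and the function `state_as_of` are hypothetical.

```python
def state_as_of(log, ts):
    """Replay all changes up to and including ts and return the visible items."""
    state = {}
    # sorted() is stable, so entries with equal timestamps keep their log order.
    for when, action, key, value in sorted(log, key=lambda entry: entry[0]):
        if when > ts:
            break
        if action in ("add", "correct"):
            state[key] = value
        elif action == "delete":
            state.pop(key, None)
    return state

# Hypothetical change log: (timestamp, action, item key, value).
log = [
    (1, "add", "a", 1.0),
    (2, "add", "b", 2.0),
    (3, "correct", "a", 1.5),   # a historic item is amended
    (4, "delete", "b", None),   # a historic item is removed
]

before_changes = state_as_of(log, 2)
after_changes = state_as_of(log, 4)
```

Because nothing is ever removed from the log itself, the state that a citation referred to at an earlier time can still be reconstructed after later deletions or corrections.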

     

  • Author: Andreas Rauber

    Date: 08 Aug, 2013

    Dear all,

    A quick update on the collaboration between our WG initiative and other activities in the field of data citation: a new synthesis group has been formed under the coordination of FORCE11, bringing together a range of initiatives and projects in this field (including, beyond RDA, also CODATA, W3C, DataCite, OpenAire, APARSEN, and others). The goal is to establish some consensus on common principles in data citation. These include, but obviously reach beyond, the core topics touched upon by our WG. Details on this synthesis group are available at http://www.force11.org/node/4381

    Andi

  • Author: Andreas Rauber

    Date: 08 Aug, 2013

    We will have a slot during the upcoming RDA plenary in Washington, Sep 16-18. If you have any specific items for the agenda for that meeting, want to raise points to be discussed, or want to show some pilots, please let me know. (https://rd-alliance.org/future-events)

    Also, I'll give a short presentation at the RDA Europe Workshop in Munich on Sep 10, prior to the RDA mtg in Washington. If you want me to include some specific aspects or demos/progress reports from your side, please let me know. (https://europe.rd-alliance.org/Content/Events.aspx?id=145)

    Andi

  • Author: Andreas Rauber

    Date: 10 Sep, 2013

    Dear all,

    To ease communication we have established a mailing list for our WG. You can subscribe to it at the following webpage:

    http://lists.lists.rd-alliance.org/mailman/listinfo/rda-wg-dc

     

    best regards, Andreas
