Making Dynamic Data Citeable

29 Jan 2014

We have published the new version of the Case Statement of the WG on Data citation: Making Dynamic Data Citeable.

It is available in the RDA File repository under

https://www.rd-alliance.org/filedepot/folder/99?fid=379

We appreciate any comments and suggestions on things to add, clarify, etc.

best regards,
Andreas Rauber
Ari Asmi
Dieter van Uytvanck

  • Author: Mark Parsons

    Date: 31 Jan, 2014

    Fraser Taylor writes:

     

    I have read the document and find it very interesting.

    I have the following comments:

    1. The document properly emphasizes the need for cooperation with existing initiatives in the citation field.

    It is not, however, entirely clear what particular niche the proposed activities would occupy. What will be done that is not being done elsewhere? This is perhaps implicit in some of the rhetoric of the proposal but in my view needs to be more explicitly stated.

    2. A great deal of interesting work is being done in the library/archives community on both data citation and especially on data archiving and preservation. There is perhaps a need to ensure that the WG is more aware of these initiatives and can ensure that cooperation takes place.

    3. Data sharing and access to data are key issues and it should be a central objective of the WG to ensure that this aspect of their work is even more strongly emphasised.

    4. The work of the group should be primarily demand driven. This is there in the proposal but needs to be even more explicit.

    5. The WG has explicitly stated that it will not look at social, economic and related issues. I understand this and appreciate the need for a clear focus. However, technical solutions which do not consider the operational context in which they are to be used are less likely to be useful. If a citation system which is technologically eloquent and complete but difficult or costly to apply is created then its impact is likely to be less.

    6. The issue of data ownership is a complex one and the WG might well consider how this can be effectively included in a citation system. I note that Paul Uhlir is a member and his insights on this issue will be especially valuable.

    7. Standards and specifications are critical to interoperability and the WG might reach out to OGC in discussing these.

    8. I like the idea of case studies which, if carefully selected, will answer some of the concerns I have outlined above. Starting with these can help ensure a demand driven approach to the work.

    I hope that you find these comments useful. I am supportive of the proposal of the WG but would like to see it refined and strengthened. Above all we must avoid duplication of effort with other initiatives. It would not hurt to get reactions from groups like CODATA to the proposal in advance rather than seeking cooperation once the WG has been established and has started its work.

    Fraser.

  • Author: Andreas Rauber

    Date: 24 Feb, 2014

    Dear Fraser,

    thanks a lot for your comments - I'll try to answer/reflect on them in-line below:

    > I have read the document and find it very interesting.
    >
    > I have the following comments:
    >
    > 1. The document properly emphasizes the need for cooperation with
    > existing initiatives in the citation field.
    >
    > It is not, however, entirely clear what particular niche the proposed
    > activities would occupy. What will be done that is not being done
    > elsewhere? This is perhaps implicit in some of the rhetoric of the
    > proposal but in my view needs to be more explicitly stated.

    We had assumed that the revised version already stressed this, but maybe more emphasis is needed. The core topic really is to identify means of making arbitrary subsets of data citeable, with a focus on data sets that are large-volume and dynamically changing, i.e. new data being added or existing data being updated.
    Judging from the discussions so far, none of the standard approaches to assigning PIDs seems to work: assigning a PID to the entire database requires a verbal description of the subset and does not address the update issue. Citing individual cells (i.e. assigning a PID to each data element/cell) would lead to citations that are the same size as the data set being cited.
    The approach of assigning PIDs to the queries used to identify the subset of data, executed against a time-stamped and versioned database, seems to be working in the settings elaborated and discussed so far.
    Also, we haven't seen any other group working on this particular aspect. Most approaches assume that there is something one can assign a PID to that identifies the data in question; how to do this is usually not elaborated. This is why this WG was created: to discuss approaches for doing this and to test whether they work in different settings (types of data, different sizes and levels of dynamics, different forms of data representation, from CSV via SQL to XML, RDF, ...).

    > 2. A great deal of interesting work is being done in the
    > library/archives community on both data citation and especially on data
    > archiving and preservation. There is perhaps a need to ensure that the
    > WG is more aware of these initiatives and can ensure that cooperation
    > takes place.

    We are definitely aware of the activities in the field of digital preservation (personally, through involvement in a number of projects in that domain, but also via the membership of the WG). One of the activities foreseen, for example, is to test the stability of the proposed approach across technology changes, i.e. testing whether citations can feasibly be resolved when the underlying system for representing/storing the data changes.

    > 3. Data sharing and access to data are key issues and it should be a
    > central objective of the WG to ensure that this aspect of their work is
    > even more strongly emphasised.

    Obviously, but that applies to the entire set of RDA activities.
    The goal of this WG is to enable researchers to easily assign a citation to the specific subset of data that they are using in a specific experiment/analysis, and to allow others to resolve it (others including machines, i.e. machine-actionable reference resolution). Legal barriers, access rights, etc. are, however, excluded from the discussion in this WG to keep it focused.

    > 4. The work of the group should be primarily demand driven. This is there
    > in the proposal but needs to be even more explicit.

    Obviously, as it will only work this way.
    We have started collecting specific pilot settings: not hypothetical use cases, but concrete data sets and institutional settings that want to support data citation at an arbitrary subset level. These will serve as the basis for both a conceptual verification and, where possible, even a real implementation-based verification (although this may well be outside the scope of the WG activities for most production-level settings).
    One example of this happening has been the inclusion of CSV-based data. This was not initially seen as a focus, as we were originally aiming for large-volume and highly dynamic settings, yet we have come across the need for support in this setting as well, so it is now being included, and work on an actual prototype solution seems to be underway.

    > 5. The WG has explicitly stated that it will not look at social, economic
    > and related issues. I understand this and appreciate the need for a clear
    > focus. However technical solutions which do not consider the operational
    > context in which they are to be used are less likely to be useful. If a
    > citation system which is technologically eloquent and complete but
    > difficult or costly to apply is created then its impact is likely to be
    > less.

    This may be a misunderstanding created by an unfortunate formulation on our side. When stating that we would exclude economic aspects, we did not mean to ignore the economic feasibility of the solution. On the contrary, this is a core aspect: evaluating the cost of different types of versioning, history tables, etc., to mention one concrete example.
    What we meant to exclude was the economic impact of data citation, e.g. the amount of money that may be saved by supporting re-use of datasets. It was mainly meant to distinguish the activities in this WG from the myriad of activities related to data citation in other WGs (including one specifically on cost issues), which we also tried to address by extending the name of the WG to include the focus on making dynamic data citeable.

    > 6. The issue of data ownership is a complex one and the WG might well
    > consider how this can be effectively included in a citation system. I
    > note that Paul Uhlir is a member and his insights on this issue will be
    > especially valuable.

    This is an open issue to be discussed. While data ownership is an absolutely essential issue, I feel that it is currently dealt with much more comprehensively in other groups.
    Also, as far as I have been following discussions both within our WG and within the other groups you are referring to, the solutions seem to be mutually compatible, i.e. the approach currently followed in this WG (time-stamped and versioned data sources and PID assignment to time-stamped "queries") will provide the means to cite data, pointing to e.g. a landing page, where a lot of data citation activities will come together to provide the required information in a machine-processable way. We currently tend not to broaden the list of questions discussed, to keep things focused, but to cross-link with other initiatives to make sure that the individual building blocks can connect and complement each other.
     
    > 7. Standards and specifications are critical to interoperability and the
    > WG might reach out to OGC in discussing these.

    Acknowledged. The principles developed within this WG will, however, be even more generic, as they will need to be adoptable by all kinds of communities and for all types of data. The core aim is therefore to come up with a solution that is compatible with/can be mapped to OGC standards, but is not solely an OGC solution. Such a mapping would be highly welcome! (Without more input at this time I would be hesitant to include this in the Case Statement already. As soon as members of the OGC deem this feasible, I'd be very happy to pick this up. For the time being I need more input and feedback from the OGC community on our approach to judge its viability and to hear their views, which is exactly what this WG is about.)

    > 8. I like the idea of case studies which, if carefully selected, will
    > answer some of the concerns I have outlined above. Starting with these
    > can help ensure a demand driven approach to the work.

    Absolutely!
    We hope to have a range of what we refer to as "pilots" from different communities, with different types of data and differing citation requirements.
    In a first iteration we will discuss the feasibility of the proposed solution and its limitations on a conceptual basis for these pilots, and then move on to a technical implementation for those pilots that have the resources to do so.
     
    > I hope that you find these comments useful. I am supportive of the
    > proposal of the WG but would like to see it refined and
    > strengthened. Above all we must avoid duplication of effort with other
    > initiatives. It would not hurt to get reactions from groups like CODATA
    > to the proposal in advance rather than seeking cooperation once the WG
    > has been established and has started its work.

    This contact is established, and we are following each other's discussions and meetings, having contributed to the FORCE11 synthesis principles, a potential follow-up group on implementation, etc. So these contacts are not sought ex post; rather, there is cross-membership to ensure that the individual building blocks will match, and to save each group effort by allowing each to address a specific sub-topic.

    I hope this helps to clarify some of the concerns identified.
    We will need to think about how to make these things still more explicit in the Case Statement. Any suggestions for specific wording would be highly welcome!

    best, Andreas
     

  • Author: Keith Jeffery

    Date: 31 Jan, 2014

    An important aspect, not least to allow the community to persuade funders to consider data, as well as publications, as a product of research.

     

    There is no mention of metadata (as such) - the group should liaise with MIG, MSDWG and DICIG (I have already spoken with Andi Rauber about this). A very important aspect of citation is ensuring all contextual information is known - who cited what, why, when, how - this implies rich metadata describing not only the dataset but also the citation (or utilisation) of it.

    Keith

  • Author: Andreas Rauber

    Date: 24 Feb, 2014

    Keith,

     

    You are absolutely correct! But the topic of attribution, the type of information to be added to a data citation to trace provenance etc., is addressed in the WGs and IGs mentioned by you. We would like to add to their work those fields that we deem necessary to ensure correct citability (such as, potentially, a hash key for result set verification purposes, still under debate).

    How a citation is being used, what information will be provided on the resulting landing page, etc. is competently being worked upon by these groups. We would like to contribute principles/recommended approaches for how to actually be able to cite any arbitrary subset of data.
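
One way the hash key for result set verification mentioned above could look, as a hedged sketch only: the thread leaves this field under debate and does not specify an algorithm. SHA-256, the JSON row serialization, the order-insensitive sorting and the names `result_set_hash` and `verify` are all assumptions chosen for illustration.

```python
import hashlib
import json


def result_set_hash(rows):
    # Serialize each row in a fixed, compact form and sort the rows so that
    # the hash does not depend on the order in which they were returned.
    # Assumes every cell is JSON-serializable (numbers, strings, None).
    canonical = sorted(json.dumps(list(row), separators=(",", ":")) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # row separator, so row boundaries affect the hash
    return digest.hexdigest()


def verify(rows, expected_hash):
    # True if a re-executed query returned the same subset that was cited.
    return result_set_hash(rows) == expected_hash


if __name__ == "__main__":
    cited = [(1, 10.0), (2, 20.0)]
    stored = result_set_hash(cited)  # would be kept alongside the citation
    print(verify([(2, 20.0), (1, 10.0)], stored))  # same subset, other order
    print(verify([(1, 5.0), (2, 20.0)], stored))   # a value has changed
```

Sorting makes the check insensitive to row order; if the order of the result set is itself part of what is cited, the sort would have to be dropped.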

     

    Talking to funders and researchers: the key scenario is a researcher selecting a subset of data via whatever workbench is being used and obtaining, together with the data, a PID that identifies that subset - allowing immediate reference to it, and thus supporting re-use, integration, etc. without putting a heavy load onto the underlying repositories.

     

    Andreas