Data Description Registry Interoperability: Draft Case Statement

11 Dec 2013

This working group focuses on the challenge of interoperability at the discovery phase and investigates the problem of enabling data description exchange between research data registry systems. This group aims to provide working services and pragmatic methods that enable finding datasets across multiple registry systems.

Attachment: DDRI-CaseStatement-V1-4.docx (30.78 KB)
  • Author: Simon Cox

    Date: 17 Dec, 2013

    This is a well-supported proposal on a topic of key significance.

    My one concern is that its work appears to overlap with the scope of two existing WGs: 

    1. in the Discovery Phase with the Metadata Standards Directory

    2. in the Access Phase with the Data Types Registry

    The program of the proposed WG appears to be taking a more integrated approach, but needs to ensure that it takes advantage of the outputs from the existing groups, and does not repeat work done elsewhere in RDA.  

  • Author: Amir Aryani

    Date: 18 Dec, 2013

    Hi Simon,

    Thanks for the comment. About the overlaps:

    1. in the Discovery Phase with the Metadata Standards Directory

    There is a strong point of collaboration here. I hope that the framework from the Metadata Standards Directory WG will help us to provide a better model for the bilateral interoperability projects; however, metadata interoperability is only part of the problem.

    At this stage we know that to complete the bilateral interoperability projects in this WG, we would need solutions for identifiers (collaboration with PID Information Types WG), disambiguation (author + work), web service protocols (synchronisation and authentication) and multilingual search. Also, as part of the interoperability project with DataCite, we have noticed the problem of search ranking. Each of these issues can be addressed as a separate research problem in collaboration with experts from different groups.

    2. in the Access Phase with the Data Types Registry

    Addressing the challenge of the access phase will be outside the scope of this WG. Even the discovery phase is a major challenge for the limited timeframe of the group. However, I hope that the outcomes of this WG can benefit the Data Type Registries and groups who work on large-scale data exchange between research facilities.

    Being mindful of the existing activities in the sector and particularly the effort by other RDA members, open collaboration with other groups is a key enabler for the projects in this WG.

    Best Regards,
    Amir

  • Author: Keith Jeffery

    Date: 18 Dec, 2013

    As co-chair of MSDWG (Metadata Standards Directory WG - Jane and Rebecca please jump in to correct anything) and MIG (Metadata Interest Group - Rebecca please jump in as necessary) I note the strong overlaps not only with those two groups (and the types group already mentioned) but also with the data in context group, PID group, the citation group, the preservation group, the terminology group, the legal interoperability group, the brokering group and all the groups concerning domain specific metadata (marine, agriculture, wheat, biodiversity, urban, proton and neutron, structural biology, toxicology, history, materials...).

    (I note some of these are mentioned in Amir's response above)

    In the discovery phase I see more-or-less complete overlap with MSDWG: that group's mission statement is clear. The current workplan and objectives are short-term (get the directory in place) but the longer term plan is directory development and maintenance (essentially from a human-readable inventory to computer-understandable directory) and using it for interoperable access. There has been a great amount of research in discovery technology and most of the advanced projects are moving beyond 'flat' metadata formats like DC, DCAT, INSPIRE, eGMS, CKAN etc. to more complex metadata needed for assisting in the automation of discovery.

    In the access phase it gets even more interesting, and even more strongly multilingual (including multi-characterset) metadata standards are needed. In particular there is a requirement for declared semantics (just about impossible to get common semantics across multicultural actors) over formal syntax. In Europe quite a lot of projects concerning research information (including research datasets) use CERIF: examples are OpenAIRE, MERIL, ENGAGE, EPOS, PaaSage and many of the research funder systems (including ERC) and university research management systems. Some of these projects use a 3-layer architectural model for metadata, with discovery at the top generated from contextual in the middle, which points to domain-specific at the bottom. The top level is DC, DCAT, INSPIRE, eGMS, CKAN etc. The middle is CERIF. The bottom layer is 'schema level' and very specific to a community (there are many examples in the UK DCC catalog and progressively in the MSDWG directory).
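    As a rough illustration of the 3-layer idea only, the sketch below generates a flat discovery-level view from a richer contextual record that carries a pointer to the domain-specific layer. The class, field names, identifiers and values are all hypothetical; this is not the CERIF data model or any project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ContextualRecord:
    """Middle layer: richer description (all field names are hypothetical)."""
    identifier: str
    titles: dict              # language code -> title (multilingual)
    people: list              # (role, name) pairs
    domain_metadata_ref: str  # pointer to the community-specific bottom layer


def to_discovery(record, language="en"):
    """Top layer: flat key/value view generated from the contextual record."""
    # Fall back to the first available title if the requested language is missing.
    title = record.titles.get(language) or next(iter(record.titles.values()), "")
    return {
        "identifier": record.identifier,
        "title": title,
        "creator": [name for role, name in record.people if role == "creator"],
        "source": record.domain_metadata_ref,
    }


# Invented example data.
record = ContextualRecord(
    identifier="id:dataset-1",
    titles={"en": "Reef survey", "fr": "Relevé du récif"},
    people=[("creator", "A. Researcher"), ("funder", "Example Fund")],
    domain_metadata_ref="id:schema-record-9",
)
flat = to_discovery(record, language="fr")
```

    The point of the sketch is the direction of derivation: the flat view is computed from the contextual record and discards roles and alternative languages, while the `source` pointer keeps a route down to the domain-specific description.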

    I think we have a real problem in RDA with the 'organic growth' model; it will lead to divergence, not convergence, and although 'letting a thousand flowers bloom' is great for research advances, it is not great for what RDA is trying to achieve, especially interoperability, which requires coordination more than advances.

    At the last plenary in Washington I made a plea for a cross-group effort on metadata generally (to come up with an architectural model and best practice) using MIG (since it is long-lasting and general rather than task or domain specific) as the vehicle; I have proposed such a session for Plenary3 in Dublin (as well as sessions for MSDWG and MIG for themselves).

    To conclude, although the proposal for this WG is well-structured and well-supported, I believe what it is trying to achieve is already well-covered (admittedly not holistically but in various aspects) by other groups. On the other hand, this proposal is not as holistic as the sum of the existing groups.

    Perhaps TAB has a view on how to reduce divergence and increase convergence in RDA?

    best wishes

    Keith

     

  • Author: Amir Aryani

    Date: 19 Dec, 2013

    Hi Keith,

    I like your question about divergence versus convergence, and I do not know the answer. Should we aim for large groups that include many projects and deliverables? If we aim for these mega groups, will the coordination effort be manageable? Or do we prefer small groups with limited and scoped deliverables? I guess in the latter case there will be overlaps, as groups will pursue connected goals.

    As you pointed out about the proposed WG, it is hard for us to avoid overlaps. We try to deliver working software solutions that enable cross-platform discovery, and to achieve this goal we need expertise from multiple groups including MSDWG, PID, Citation, Publishing Data and external resources such as the web service and distributed software architecture community.

    For example, this group should have an active collaboration with PID Information Types, as persistent identifiers are a significant element in enabling cross-platform discovery.
     
    Best Regards
    Amir
     

  • Author: Leticia Cruz

    Date: 24 Jan, 2014

    How will the data be categorized for data set production, and how will the existing research data included in the construction of data sets be validated?

     

    Kind Regards

    Ms Leticia Cruz RN MSN/Ed

  • Author: Amir Aryani

    Date: 25 Jan, 2014

    Hi Leticia

    I am not sure I understood the question correctly. Are you asking about publishing new datasets in repositories?

    The problem that this group aims to address is creating software solutions for cross-platform discovery. For example, enabling Australian scientists who use the Research Data Australia portal (http://researchdata.ands.org.au) to find international datasets in http://datadryad.org, NARCIS.nl, ...

    However, the main problem is providing scientists with search results that are actually relevant to their research. When a researcher searches for datasets on a specific topic, there is no value in returning an overwhelming number of results. As with general-purpose search systems, the relevance of the top results is the key element in making discovery useful, and this is the main focus of this group.
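    To make the ranking problem concrete, here is a minimal sketch, assuming each registry returns hits with its own local relevance score on its own scale. It rescales scores per registry and keeps only the top few results. The registry labels, titles, scores and the min-max normalisation are illustrative assumptions, not the working group's method or any portal's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    registry: str  # hypothetical registry label
    title: str
    score: float   # registry-local relevance score; scales differ per registry


def normalise(hits):
    """Rescale one registry's scores to [0, 1] so they can be compared."""
    if not hits:
        return []
    lo = min(h.score for h in hits)
    hi = max(h.score for h in hits)
    span = hi - lo
    return [
        Hit(h.registry, h.title, 1.0 if span == 0 else (h.score - lo) / span)
        for h in hits
    ]


def merge_top_k(result_lists, k=3):
    """Merge per-registry result lists and keep only the k most relevant hits."""
    merged = [hit for hits in result_lists for hit in normalise(hits)]
    merged.sort(key=lambda h: h.score, reverse=True)  # stable: ties keep input order
    return merged[:k]


# Invented example results from two registries with different score scales.
registry_a = [
    Hit("registry-a", "Coral reef survey", 12.0),
    Hit("registry-a", "Reef fish counts", 8.0),
    Hit("registry-a", "Harbour sediment cores", 4.0),
]
registry_b = [
    Hit("registry-b", "Coral bleaching images", 0.9),
    Hit("registry-b", "Ocean temperature logs", 0.3),
]
top = merge_top_k([registry_a, registry_b], k=3)
```

    Min-max rescaling is only one simple option and it ignores how well each registry's scores reflect true relevance, which is part of why cross-registry ranking is a research problem in its own right.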

    So back to the question: although identifying the dataset categories is quite important in discovery, as it provides context to the query results, making decisions about categorizing new datasets is outside the scope of this working group. I believe the Publishing Data Interest Group is the best community to focus on this problem.

    Kind Regards,

    Amir

     

     

     

  • Author: Amir Aryani

    Date: 25 Jan, 2014

    We have a revised case statement for this working group. The problem statement and scope are clarified; also, following the discussion with Simon, and the comment by Keith, I think we can use an alternative title – ‘Research Data Cross-Platform Discovery’ to be more precise about the deliverables of this working group.

    You can find the new case statement (Research Data Cross-Platform Discovery - Case Statement - Version 1.8.docx) at https://www.rd-alliance.org/filedepot?cid=176&fid=381 or from my dropbox account: https://www.dropbox.com/s/moyc2yk4ubjftyu/DDRI-CaseStatement-V1-8.docx

     
     

  • Author: Larry Lannom

    Date: 25 Jan, 2014

    Good discussion. Two comments, the first specific to this thread and the second more general and specific to RDA:

    1. One of the main use cases for the Data Type Registries WG (I am one of the co-chairs) is registering and explicating the types that form half of the type/value pairs returned from identifier resolution. This is the strong connection with the PIT WG. Was glad to see Simon pointing out the overlap issue and just wanted to fill it in a bit. I gather from the Case Statement that the group wisely plans to look at discovery through a network of connections, as opposed to, e.g., a single search interface. Starting with an item of known interest one would like to be able to retrieve additional current information about the item (who, what, where, citations, etc.) and one good way to do that would be to resolve the identifier of the item to type/value pairs that connect directly or indirectly to the additional contextual information.

    2. I thought Keith's comments were spot on and illustrate a good use case for RDA itself and how to organize going forward. I have been dubious about the term 'technical road map' because it implies a top down detailed technical agenda which I feel we really must try to avoid. But Keith describes the other side of that coin - insufficient coordination can lead to equally unfortunate results. So the question is how to walk the middle path and exactly what can we put in place organizationally to optimize RDA results. This is clearly a TAB activity, but exactly what that activity consists of is really a question for RDA as a whole. In the end I think we have to allow for the possibility of overlapping or even conflicting WGs and WG outputs, but only as a last resort and after as much coordination effort as we can bring to bear on the issue.
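    The idea in point 1 of resolving an identifier to type/value pairs and following them to contextual information can be sketched roughly as below. The in-memory `RECORDS` table stands in for a real resolver, and every identifier, type name and value in it is invented for illustration; no actual PID service or type registry is being described.

```python
from collections import deque

# Hypothetical stand-in for an identifier resolver: each identifier resolves
# to a list of (type, value) pairs. All entries are invented for illustration.
RECORDS = {
    "id:dataset-1": [
        ("title", "Reef survey 2012"),
        ("creator", "id:person-7"),
        ("citedBy", "id:paper-3"),
    ],
    "id:person-7": [("name", "A. Researcher"), ("affiliation", "id:org-2")],
    "id:paper-3": [("title", "Reef decline study")],
    "id:org-2": [("name", "Example Institute")],
}


def resolve(identifier):
    """Return the type/value pairs for an identifier (empty list if unknown)."""
    return RECORDS.get(identifier, [])


def discover(start, max_hops=2):
    """Breadth-first walk that follows values which are themselves identifiers."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []  # entries are (hops, source identifier, type, value)
    while queue:
        identifier, hops = queue.popleft()
        for type_name, value in resolve(identifier):
            found.append((hops, identifier, type_name, value))
            # Only follow links to resolvable, unvisited identifiers within range.
            if value in RECORDS and value not in seen and hops < max_hops:
                seen.add(value)
                queue.append((value, hops + 1))
    return found


context = discover("id:dataset-1", max_hops=1)
```

    With `max_hops=1` the walk collects the dataset's own pairs plus those of the directly linked person and paper; raising the limit pulls in indirectly connected context such as the person's organisation.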

  • Author: Stefanie Kethers

    Date: 28 Jan, 2014

    Dear all,

    This is just a heads up that this case statement will go to TAB and Council for their reviews at the end of this month.

    Please note that the latest version is V1.8.1, which is available here: https://www.rd-alliance.org/filedepot?cid=176&fid=381 

    Thanks!

    Best wishes,

    Stefanie

     

  • Author: Tobias Weigel

    Date: 30 Jan, 2014

    A quick and very late comment on the PID Information Types overlap: I do not see particular problems with what is written in the case statement. For me, this doesn't read like an overlap in terms of duplication of effort. I'd rather see the proposed WG's work in terms of potential adoption and use at a higher architectural layer. You want to use identifiers, and the types on them, but how that is realized may not be particularly important. The PIT WG will provide the required interface, and also the essential tools to organize the information you want to store and access. Also, as Larry pointed out, the idea of discovery through a network of connections resonates well with the PIT work.