A data curator's workflow to engage researchers in RDM: Results from 13 multi-domain data description sessions
The application of the FAIR principles depends heavily on rich metadata, yet domain vocabularies are still underused in several disciplines. As a result, many data reuse opportunities are missed because researchers are not engaged in data description. The Expert Group on FAIR Data therefore recommends, in its Turning FAIR into Reality report, defining use cases to further engage communities in the FAIR ecosystem. With this in mind, data curators must play an essential role in strengthening the RDM practices of researchers, within the limited capacity that researchers may have to commit to such practices. The main goal of this work is to foster collaboration between researchers and data curators by engaging researchers in a data curator's workflow for the development of domain-specific metadata models.
These metadata models lead to the selection of concepts that are familiar to the researchers and that they can use in more casual descriptions, preferably from the time they start to collect data, to mitigate existing barriers to metadata creation. The workflow entails meetings and interviews, the development of metadata models formalized as lightweight ontologies, and data description sessions with the researchers, with content analysis of domain publications as a complementary task to overcome communication shortcomings.
To assess the merits of this data curator's workflow, 13 data description sessions were carried out in Dendro*, a staging RDM platform developed at the University of Porto, between January 2018 and September 2019, with researchers from a variety of domains. The participating researchers also completed a questionnaire to measure their attitude towards data description.
Overall, researchers produced metadata records of satisfactory or good quality. A total of 178 fields were completed and 89 different descriptors were used. On average, researchers needed 27 minutes to fill in 14 descriptors. Metadata elements regarding the context of data production, i.e. the study design, were the most used, corresponding to 55% of the total metadata created. Researchers characterized data description as a slightly demotivating and slightly time-consuming, yet somewhat interesting, moderately easy and moderately practical activity. They rated the usefulness of data description as high.
Altogether, the quality of the metadata records produced and the researchers' feedback on data description allow the conclusion that metadata creation is a realistic activity for researchers to perform, as long as adequate tools are provided to them. Therefore, this data curator's workflow is regarded as a promising approach to engage researchers in RDM through data description.
* https://github.com/feup-infolab/dendro
The collaborative nature of this work, between researchers and data curators, is aligned with the overall vision of RDA. It is an inclusive approach that fosters the engagement of researchers from several domains in data description, thus promoting a culture of sustainable data sharing and re-use.
This contribution also fits with work done in some IGs and WGs, namely the Engaging Researchers with Data IG. Several researchers without prior metadata experience were able to create metadata records of satisfactory or good quality. Moreover, by being included in the proposed data curator's workflow, researchers have increased their RDM awareness, which is essential to enable cultural change.
The knowledge acquired in the various interactions with researchers, as well as the overall experience, is also worth sharing with the RDA community.
Author: Shanmugasundara...
Date: 01 Apr, 2020
Hi Joao,
the need to engage researchers to contribute more to the curation of data is very important, and your research is an important step towards this goal. From the results of the metadata quality assessments, would it be accurate to say that no single field of research was better than another at producing good quality metadata (e.g. natural sciences being better than social sciences)?
Also, do any of the subjects you used in this study have pre-existing metadata standards that you could compare to?
Thank you,
Venkat
Author: João Aguiar Castro
Date: 02 Apr, 2020
Hi Venkat,
thank you for the comment and pertinent questions.
Regarding the quality of metadata, I must say that the criteria for the assessment took into account: 1) a minimum number of descriptors filled in; 2) a balanced distribution of metadata categories; 3) the usage of descriptors expected in mainstream data repositories; 4) overall rigour in the information provided.
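The first two criteria lend themselves to a mechanical check. What follows is only a minimal sketch, not the procedure used in the assessment: the thresholds, the category labels, the record layout and the function name `check_record` are all assumptions made for illustration, and the example record is invented.

```python
from collections import Counter

# Hypothetical thresholds: the assessment does not state exact values.
MIN_DESCRIPTORS = 5
MAX_CATEGORY_SHARE = 0.8


def check_record(record, min_descriptors=MIN_DESCRIPTORS,
                 max_share=MAX_CATEGORY_SHARE):
    """Check criteria 1 and 2 for one metadata record.

    `record` maps a descriptor name to a (category, value) pair.
    Descriptors with an empty value are treated as not filled in.
    """
    filled = {name: (category, value)
              for name, (category, value) in record.items()
              if str(value).strip()}
    counts = Counter(category for category, _ in filled.values())
    total = sum(counts.values())
    largest_share = max(counts.values()) / total if total else 1.0
    return {
        # Criterion 1: a minimum number of descriptors filled in.
        "enough_descriptors": len(filled) >= min_descriptors,
        # Criterion 2: more than one category, none dominating the record.
        "balanced_categories": len(counts) > 1 and largest_share <= max_share,
    }


# Invented example record, for illustration only.
example = {
    "title": ("generic", "Adhesive joint fatigue tests"),
    "creator": ("generic", "A. Researcher"),
    "method": ("study design", "Cyclic loading"),
    "instrument": ("study design", "Servo-hydraulic machine"),
    "sample size": ("study design", "12 specimens"),
    "temperature": ("study design", ""),  # left empty, so not counted
}
result = check_record(example)
```

Criteria 3 and 4 depend on the target repository and on a human reading of the values, so they are left out of the sketch.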
I also have to consider my own limitations in assessing metadata quality in specific domains, and that for most participants it was their first contact with RDM tools.
That said, my results do not show that metadata created in one field was better than in another. But I can make some general observations. Overall, the participants resorted to domain-specific metadata and were not very interested in filling in complementary metadata (generic metadata from Dublin Core, for instance). Participants from the social sciences were more hesitant to choose descriptors and read the definitions of the DDI elements in detail (ambiguity, I guess), while researchers from experimental domains were quicker to choose and fill in the metadata fields, since these are very fine-grained. The Magnetic Dynamics researcher was an exception: this researcher produced poor-quality metadata because they would have preferred more generic descriptors, so they filled in only a few DDI elements and provided valuable feedback instead.
As for the pre-existing metadata standards, when preparing the sessions I wanted to make sure that the necessary descriptors were available. DDI covered most of the metadata requirements I gathered from interviewing the participants, as did Dublin Core, Friend of a Friend and other more generic vocabularies. Descriptors from the Ecological Metadata Language and Darwin Core were also available from previous collaborations. When standards were not available or were too complex to adopt, I designed the domain ontologies based on interviews and content analysis of the researchers' publications.
From the researchers' perspective, most were not aware of metadata creation and none was familiar with standards. The researcher from Structural Adhesive Joints and his group have already defined some metadata that their instruments should generate automatically, and metadata from instruments was also mentioned by the Sustainable Chemistry and Magnetic Material researchers. Another one used project documentation to create the metadata in his session.
I guess this answers the second question, but let me know if it's not the case.
All the best
Author: Sarah Jones
Date: 02 Apr, 2020
Hi João
Really nice concept and poster.
I wondered how you support (or plan to support) disciplinary metadata via Dendro. Do you offer different disciplinary profiles or do any kind of crosswalking between standards?
As you found, there's a fair amount that you can do with generic standards to support basic description and discovery but reuse often requires some domain specifics
All best
Sarah
Author: João Aguiar Castro
Date: 02 Apr, 2020
Hi Sarah,
thanks for the interest.
In the early stages, around 2014, generic ontologies like Dublin Core and Friend of a Friend were the core vocabularies in Dendro. From there, we have incrementally added vocabularies as we engage new researchers.
There are two scenarios. If there are already suitable standards in the DCC metadata directory, we design the corresponding ontology based on them. For instance, we have selected a subset of descriptors from the DDI, EML and MIBBI standards and developed lightweight ontologies (in order to reduce the complexity of such standards). The subsets are also validated by the researchers.
In domains where we cannot find suitable standards we define them from scratch in close collaboration with researchers. If the concepts are available in a given ontology, we reuse them as much as possible.
So, a few vocabularies will cover many domains, and we add new ones when needed. Researchers mostly search descriptors by browsing the vocabularies.
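To make the two scenarios concrete, here is a minimal sketch of how a lightweight profile could be assembled: a validated subset of an existing standard, plus domain descriptors defined with the researchers, reusing an existing concept whenever one is available. It is not how Dendro implements its ontologies; `build_profile`, the dictionary layout and the `specimenGeometry` descriptor are hypothetical, and the Dublin Core entries are short glosses used only as an example vocabulary.

```python
# Example vocabulary: a few Dublin Core elements with short glosses.
DUBLIN_CORE = {
    "title": "A name given to the resource.",
    "creator": "An entity primarily responsible for making the resource.",
    "date": "A point or period of time associated with the resource.",
    "description": "An account of the resource.",
}


def build_profile(standard, selected, custom=None):
    """Return a lightweight profile as {descriptor: definition}.

    `selected` is the subset of the standard chosen with the researchers.
    `custom` holds descriptors defined from scratch; if the standard
    already has a concept with that name, the standard one is reused.
    """
    missing = [name for name in selected if name not in standard]
    if missing:
        raise KeyError(f"not in the standard: {missing}")
    profile = {name: standard[name] for name in selected}
    for name, definition in (custom or {}).items():
        # Prefer an existing concept over a new definition.
        profile.setdefault(name, standard.get(name, definition))
    return profile


# Hypothetical profile for an experimental domain.
profile = build_profile(
    DUBLIN_CORE,
    selected=["title", "creator"],
    custom={
        "specimenGeometry": "Shape and dimensions of the tested specimen.",
        "date": "Day the experiment was run.",  # already in the standard
    },
)
```

In this sketch `profile["date"]` keeps the Dublin Core definition rather than the custom one, which mirrors the preference for reusing concepts over redefining them.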
All the best