Practices for improving the discovery of datasets

The results of a small-scale survey conducted for the Long Tail of Research Data Interest Group found that Dublin Core and DataCite metadata were the most common schemas used and that fewer than half of the respondents were using DOIs. In terms of discovery, most respondents indicated that the metadata was sufficient for users to find the datasets when searching directly in the repository; however, the metadata may not support widespread discovery via search engines or dataset directories.

In Dublin, we discussed and brainstormed strategies to improve the discoverability of datasets. The following practices were mentioned:

Linking data to the related publication
Build an extra discovery layer that describes the data
Link to or attach related Data Management Plans (DMPs) to the data
DOIs or data citation
Enable searching in repository to limit to datasets only
Enable machine readability
Improve quality and comprehensiveness of metadata (through researcher education or by repository staff)
Have your repository be harvested by aggregators

I would like to collect some examples of these practices, mainly for the first three areas.

If you know of good examples, please let me know.

I will post them on the Interest Group website.

Thanks!

Kathleen

Kathleen Shearer

co-chair of RDA Interest Group Long Tail of Research Data

Executive Director, Confederation of Open Access Repositories


Author: Annemiek van de...

Date: 29 Apr, 2014

In the 3TU.Datacenter (a data repository for the technical sciences), datasets are linked to their publications.
Good examples are:
· the thesis ‘CFD in drinking water treatment’ of Bas Wols, with links between the thesis, a publication, data sets, several videos:
· link between data and publication of Leo Kouwenhoven ‘Signatures of Majorana fermions in hybrid superconductor-semiconductor nanowire devices’ and his article in Science.
Kind regards
Annemiek
Annemiek van der Kuil
Research Data Officer
3TU.Datacentrum
TU Delft | Research Data Services
T +31 (0)15 27 85 540
E ***@***.***
E ***@***.***
W www.datacentrum.3tu.nl
@AnnemiekvdKuil
@3TUDatacentrum

Author: Tim Smith

Date: 30 Apr, 2014

Dear Kathleen,
I can offer a Zenodo record as an example: https://zenodo.org/record/7531
There you will see:
1) the link to the related publication (via its DOI)
2) the description field, which is indexed and searchable (not sure if this is what you meant by "discovery layer")
4) Minted DOI
5) Browse to Datasets and search only there
6) Several machine readable formats, including the DataCite link
7) Submitters can re-edit metadata, and curator can enrich
8) Harvested through OAI-PMH by several aggregators
Other important things demonstrated but not mentioned in your list are
+) The funder information
+) The rights information
Both of which are in the machine readable part as well
Best Regards,
Tim

Author: Simon Hodson

Date: 30 Apr, 2014

Dear All,
An example from Dryad might also be of interest.
Here is the full metadata record for a Dryad data package (as it happens the most downloaded package): http://datadryad.org/handle/10255/dryad.38181
1) All Dryad data packages provide a collection of data relating to a specific publication. There are links to the article and, if you scroll down, a recommendation to cite both the article and the data package if the data is reused.
2) There is metadata for the package and descriptions of each of the data items. See the full metadata at http://datadryad.org/handle/10255/dryad.38181?show=full
4) Data package DOI (e.g. here it is doi:10.5061/dryad.6p76c3pb - note the dryad suffix).
Dryad only includes 'data packages' but these contain a wide variety of 'data' types. The metadata (including DOI) is machine readable. As Tim notes the funder and rights information is important. Dryad has rights information with CC0 recommended, but I'm not sure funder information is widely provided by Dryad. Authors are encouraged to provide as much appropriate metadata as possible at submission, but there is also a curation process.
Hope that's useful and of interest.
With very best wishes,
Simon.
___________________________
SciDataCon 2014 Call for Proposals: http://www.scidatacon2014.org/submissions
___________________________
Dr Simon Hodson | Executive Director CODATA | http://www.codata.org
E-Mail: ***@***.*** | Twitter: @simonhodson99 | Skype: simonhodson99
Blog: http://www.codata.org/blog
Diary: http://bit.ly/simonhodson99-calendar
Tel (Office): +33 1 45 25 04 96 | Tel (Cell): +33 6 86 30 42 59
CODATA (ICSU Committee on Data for Science and Technology), 5 rue Auguste Vacquerie, 75016 Paris, FRANCE

Author: Jochen Schirrwagen

Date: 20 May, 2014

Dear All,
this is a late arrival; I hope this interesting thread is still alive ;-)
In recent years Bielefeld University, Germany (UNIBI) has been working
on services for research data management that include:
* UNIBI as a publication agent for DOIs for datasets as a service for
its researchers
* tool to create DMPs
https://data.uni-bielefeld.de/en/data-management-plan
* advancement of the "institutional repository" to support registration,
deposit, description, exposition of research data sets and making links
to publications and projects if possible
You can find a general overview of these services:
https://data.uni-bielefeld.de/en/researchdata
examples of published data sets with a number of filter options:
http://pub.uni-bielefeld.de/data/
The data can be either stored in our PUB "institutional repository"
and linked to a publication:
http://pub.uni-bielefeld.de/data/2670491
or linked to data services of a department or external data archives:
in the following example the record links to a dataset stored in the
CITEC - Cognitive Interaction Toolkit
http://pub.uni-bielefeld.de/data/2639459
exposition of dataset metadata using the DataCite metadata kernel via
OAI-PMH
http://pub.uni-bielefeld.de/oai?verb=ListRecords&metadataPrefix=oai_datacite
Best,
Jochen
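
For readers who want to try harvesting such an endpoint, here is a minimal sketch of how an OAI-PMH ListRecords request URL can be assembled. It only builds the URL and makes no network call. The function name `build_list_records_url` is hypothetical; the base URL and the `oai_datacite` prefix are taken from the message above, and other repositories may offer different prefixes.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper).

    In OAI-PMH, a resumptionToken is an exclusive argument: when continuing
    a paged response, only the verb and the token are sent.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Base URL and prefix as given in the message above.
first_page = build_list_records_url("http://pub.uni-bielefeld.de/oai", "oai_datacite")
print(first_page)
```

A harvester would fetch this URL, read the records, and repeat the request with any resumptionToken found in the response until none is returned.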

Author: Kathleen Shearer

Date: 20 May, 2014

Thanks Jochen.
Your information has come just in time. I will be posting a summary soon on the RDA long tail website.
At the meeting in Dublin we said we would start identifying priorities. I hope to get started with this next step very soon.
All the best, Kathleen

Author: Herman Stehouwer

Date: 20 May, 2014

Dear Jochen,
thanks for the examples of activities.
I find the data management plan support tool very helpful.
Thanks,
Herman
--
Dr. ir. Herman Stehouwer
Rechenzentrum Garching @ Max Planck for Plasmaphysics
RDA Secretariat
***@***.*** 0031-619258815

Author: Chris Taylor

Date: 20 May, 2014

Dear all,
I've been searching DataCite's metadata registry/catalogue and it seems
that links confirming published status either don't get passed back at all
in some formats (e.g., JSON), or come back as a generic link (no indication
that it is anything special). The datacite format gives most I think, but
even that only has this:

<relatedIdentifier relationType="IsReferencedBy" relatedIdentifierType="DOI">10.1016/J.YMPEV.2011.06.012</relatedIdentifier>
Running down that link could confirm that a dataset is linked to a
published paper, but it's a bit flaky.
Also, does anyone have a sense of the 'index' publication for a dataset (as
opposed, at the other extreme, to a paper that reanalyses a dataset years
later but is still perhaps linked)? An example might be a big genome paper
whose data get reanalysed to death post hoc.
Chris Taylor.
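
As a rough illustration of "running down that link", the sketch below pulls the relatedIdentifier entries out of a DataCite-style XML record. The sample record is hypothetical and trimmed to the relevant element; its namespace is an assumption, so the code matches on local tag names only. The function names are also made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down record: only the relatedIdentifiers part is shown,
# and the namespace URI is an assumption.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-3">
  <relatedIdentifiers>
    <relatedIdentifier relationType="IsReferencedBy" relatedIdentifierType="DOI">10.1016/J.YMPEV.2011.06.012</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def related_identifiers(xml_text):
    """Return (relationType, relatedIdentifierType, value) for each link."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():
        if local_name(element.tag) == "relatedIdentifier":
            links.append((
                element.get("relationType"),
                element.get("relatedIdentifierType"),
                (element.text or "").strip(),
            ))
    return links


for relation, id_type, value in related_identifiers(SAMPLE):
    print(relation, id_type, value)
```

Note that this only reports that a related DOI exists and how it is related; it does not tell you what kind of object sits behind that DOI, which is the gap raised above.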

Author: Angus Whyte

Date: 20 May, 2014

I believe the DCC tools were mentioned when the IG met in Dublin, so I
guess they may already be in the summary. If not, and if it's not too
late, it would be good to mention the DCC tool DMPonline
(http://dmponline.dcc.ac.uk/) and Checklist
(http://www.dcc.ac.uk/resources/data-management-plans/checklist). Of
course both are designed to help researchers to create data management
plans, and also to help institutions to support them with their specific
guidance.
The DMP outline and checklist from Bielefeld University is an
interesting example (as is the data publication repository), and we
should also link to it from the DCC website.
Jochen, is your university's service planning to link DMPs to datasets in
your repository?
Thanks,
Angus
Dr Angus Whyte
Senior Institutional Support Officer
Digital Curation Centre
University of Edinburgh

Author: Chris Taylor

Date: 20 May, 2014

Update: Seems that 'relatedIdentifier:issupplementto\:*' (
http://search.datacite.org/ui?&q=relatedIdentifier%3Aissupplementto\%3A*)
is DataCite's preferred way to link out to a paper, but that doesn't seem
to me to exclusively indicate that a journal publication is on the end of
the link.
[From @datacite]
'If the author knows about a data-paper link, it is in the metadata
http://search.datacite.org/ui?&q=relatedIdentifier%3Aissupplementto\%3A*
Often they do not know'
(Original Tweet: https://twitter.com/datacite/status/468691940960391168)

Author: Stefan Kramer

Date: 21 May, 2014

FWIW, the DataCite folks are very open to questions & suggestions for improvements to the metadata schema:

https://groups.google.com/forum/#!forum/datacite-metadata

Stefan Kramer
***@***.***
Research Data Librarian
American University
4400 Massachusetts Ave. NW
Washington, DC 20016
www.american.edu/profiles/faculty/skramer.cfm

Author: Chris Taylor

Date: 21 May, 2014

Cheers, I might just start annoying people about this then...
The ideal would be a boolean somewhere up top, with a regularized way to
elaborate further down. And for it to become standard across catalogues and
repositories. Certainly relying on seeing a DOI in context isn't enough
(for example, conference papers frequently lack them).
Incidentally, does anyone know how much effort goes into Thomson Reuters'
Data Citation Index?

Author: Chris Taylor

Date: 21 May, 2014

Actually probably better to have a series of values rather than a boolean
so we could specify the manner of publication (0/null = none; 1 =
peer-reviewed journal; 2 = conference proceedings; 3 = thesis for a higher
degree; other values for white papers, book chapters, monographs,
self-published, pre-print, etc.). I'm guessing someone already did that
somewhere for something...
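
A minimal sketch of what such a coded field could look like, using only the four values spelled out above. The names `PublicationStatus` and `is_published` are hypothetical, and no existing metadata schema is implied.

```python
from enum import IntEnum


class PublicationStatus(IntEnum):
    """Manner of publication for the work a dataset is linked to.

    Only the values listed in the message above are defined; codes for
    white papers, book chapters and so on are left unassigned here.
    """
    NONE = 0
    PEER_REVIEWED_JOURNAL = 1
    CONFERENCE_PROCEEDINGS = 2
    THESIS = 3


def is_published(status):
    """Collapse the coded value back to the simple boolean; null counts as none."""
    return status is not None and PublicationStatus(status) != PublicationStatus.NONE


print(is_published(1), is_published(0), is_published(None))
```

Keeping the boolean derivable from the code means catalogues that only need a yes/no answer do not have to know the full value list.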

Author: Kathleen Shearer

Date: 21 May, 2014

Forwarding this for Angus Whyte as it didn't come through the list the first time.

Author: Jochen Schirrwagen

Date: 23 May, 2014

Dear Angus, All,
my reply inline:
On 20.05.2014 13:39, sangusa wrote:
>
> I believe the DCC tools were mentioned when the IG met in Dublin so I
> guess may already be in the summary. If not it would be good to mention
> if it's not too late the DCC tool DMPonline
> (http://dmponline.dcc.ac.uk/) and Checklist
> (http://www.dcc.ac.uk/resources/data-management-plans/checklist). Of
> course both are designed to help researchers to create data management
> plans, and also to help institutions to support them with their specific
> guidance.
>
> The DMP outline and checklist from Bielefeld University is an
> interesting example (as is the data publication repository), and we
> should also link to it from the DCC website.
great.
The tool itself (built as a module for Drupal 6) is intended for
researchers at our university, but it is possible to request demo
access via
***@***.***-bielefeld.de
>
> Jochen is your university's service planning to link DMPs to datasets in
> your repository?
This is an interesting but also delicate question.
Of course it makes a lot of sense to make links between datasets and the
associated DMP plus the project.
But is a DMP in general a public or internal document?
Maybe the status is internal during the lifetime of a project where the
data gets created and the DMP might be dynamically adapted according to
the work progress, but public at the time when datasets are published?
Best,
Jochen