Practices for improving the discovery of datasets

28 Apr 2014

Hi all

The results of a small-scale survey conducted for the Long Tail of Research Data Interest Group found that Dublin Core and DataCite metadata were the most commonly used schemas and that fewer than half of the respondents were using DOIs. In terms of discovery, most respondents indicated that the metadata was sufficient for users to find datasets when searching directly in the repository; however, the metadata may not support widespread discovery via search engines or dataset directories.

In Dublin, we discussed and brainstormed strategies to improve the discoverability of datasets. The following practices were mentioned:

  1. Linking data to related publications
  2. Build an extra discovery layer that describes the data
  3. Link to or attach related Data Management Plans (DMPs) to the data
  4. DOIs or data citation
  5. Enable searching in repository to limit to datasets only
  6. Enable machine readability
  7. Improve quality and comprehensiveness of metadata (through researcher education or by repository staff)
  8. Have your repository be harvested by aggregators

I would like to collect some examples of these practices, mainly for the first three areas.

If you know of good examples, please let me know.

I will post them on the Interest Group website.

Thanks!

Kathleen

 

Kathleen Shearer

co-chair of RDA Interest Group Long Tail of Research Data

Executive Director, Confederation of Open Access Repositories

  • Annemiek van der Kuil's picture

    Author: Annemiek van de...

    Date: 29 Apr, 2014

    In the 3TU.Datacenter (data repository for the technical sciences) datasets are linked to their publication.
    Good examples are:
    · the thesis ‘CFD in drinking water treatment’ of Bas Wols, with links between the thesis, a publication, data sets, several videos:
    · link between data and publication of Leo Kouwenhoven ‘Signatures of Majorana fermions in hybrid superconductor-semiconductor nanowire devices’ and his article in Science.
    Kind regards
    Annemiek
    Annemiek van der Kuil
    Research Data Officer
    3TU.Datacentrum
    TU Delft | Research Data Services
    T +31 (0)15 27 85 540
    E ***@***.***
    E ***@***.***
    W www.datacentrum.3tu.nl
    @AnnemiekvdKuil
    @3TUDatacentrum

  • Tim Smith's picture

    Author: Tim Smith

    Date: 30 Apr, 2014

    Dear Kathleen,
    I can offer a Zenodo record as an example: https://zenodo.org/record/7531
    There you will see:
    1) the link to the related publication (via its DOI)
    2) the description field which is indexed and searchable (not sure if this is what you meant by "discovery layer")
    4) Minted DOI
    5) Browse to Datasets and search only there
    6) Several machine readable formats, including the DataCite link
    7) Submitters can re-edit metadata, and curator can enrich
    8) Harvested through OAI-PMH by several aggregators
    Other important things demonstrated but not mentioned in your list are
    +) The funder information
    +) The rights information
    Both of which are in the machine readable part as well
    Best Regards,
    Tim

  • Simon Hodson's picture

    Author: Simon Hodson

    Date: 30 Apr, 2014

    Dear All,
    An example from Dryad might also be of interest.
    Here is the full metadata record for a Dryad data package (as it happens the most downloaded package): http://datadryad.org/handle/10255/dryad.38181
    1) All Dryad data packages provide a collection of data relating to a specific publication. There are links to the article and, if you scroll down, a recommendation to cite both the article and the data package if the data is reused.
    2) There is metadata for the package and descriptions of each of the data items. See the full metadata at http://datadryad.org/handle/10255/dryad.38181?show=full
    4) Data package DOI (e.g. here it is doi:10.5061/dryad.6p76c3pb - note the dryad suffix).
    Dryad only includes 'data packages' but these contain a wide variety of 'data' types. The metadata (including DOI) is machine readable. As Tim notes the funder and rights information is important. Dryad has rights information with CC0 recommended, but I'm not sure funder information is widely provided by Dryad. Authors are encouraged to provide as much appropriate metadata as possible at submission, but there is also a curation process.
    Hope that's useful and of interest.
    With very best wishes,
    Simon.
    ___________________________
    SciDataCon 2014 Call for Proposals: http://www.scidatacon2014.org/submissions
    ___________________________
    Dr Simon Hodson | Executive Director CODATA | http://www.codata.org
    E-Mail: ***@***.*** | Twitter: @simonhodson99 | Skype: simonhodson99
    Blog: http://www.codata.org/blog
    Diary: http://bit.ly/simonhodson99-calendar
    Tel (Office): +33 1 45 25 04 96 | Tel (Cell): +33 6 86 30 42 59
    CODATA (ICSU Committee on Data for Science and Technology), 5 rue Auguste Vacquerie, 75016 Paris, FRANCE

  • Jochen Schirrwagen's picture

    Author: Jochen Schirrwagen

    Date: 20 May, 2014

    Dear All,
    Here is a late arrival; I hope this interesting thread is still alive ;-)
    In recent years Bielefeld University, Germany (UNIBI) has been working
    on services for research data management that include:
    * UNIBI as a publication agent for DOIs for datasets as a service for
    its researchers
    * tool to create DMPs
    https://data.uni-bielefeld.de/en/data-management-plan
    * advancement of the "institutional repository" to support registration,
    deposit, description, exposition of research data sets and making links
    to publications and projects if possible
    You can find a general overview of these services:
    https://data.uni-bielefeld.de/en/researchdata
    examples of published data sets with a number of filter options:
    http://pub.uni-bielefeld.de/data/
    The data can be either stored in our PUB "institutional repository"
    and linked to a publication:
    http://pub.uni-bielefeld.de/data/2670491
    or linked to data services of a department or external data archives:
    in the following example the record links to a dataset stored in the
    CITEC - Cognitive Interaction Toolkit
    http://pub.uni-bielefeld.de/data/2639459
    exposition of dataset metadata using the DataCite metadata kernel via
    OAI-PMH
    http://pub.uni-bielefeld.de/oai?verb=ListRecords&metadataPrefix=oai_datacite
    Best,
    Jochen
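
The OAI-PMH harvesting mentioned in this thread can be sketched roughly as follows. This is a minimal illustration, not the code of any repository or aggregator named here: it builds a ListRecords request URL for a given metadata prefix and pulls record identifiers and the resumption token out of a response. The function names and the sample response (including its `oai:example.org` identifiers) are hypothetical; the base URL and the `oai_datacite` prefix are taken from the link above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it replaces metadataPrefix on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.iterfind(".//oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return identifiers, (token or None)


# Hypothetical, heavily trimmed response used only for illustration.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    print(list_records_url("http://pub.uni-bielefeld.de/oai", "oai_datacite"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

A real harvester would fetch each URL over HTTP and keep requesting with the returned token until none is left; that loop is omitted here to keep the sketch self-contained.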

  • Kathleen Shearer's picture

    Author: Kathleen Shearer

    Date: 20 May, 2014

    Thanks Jochen.
    Your information has come just in time. I will be posting a summary soon on the RDA long tail website.
    At the meeting in Dublin we said we would start identifying priorities. I hope to get started with this next step very soon.
    All the best, Kathleen

  • Herman Stehouwer's picture

    Author: Herman Stehouwer

    Date: 20 May, 2014

    Dear Jochen,
    thanks for the examples of activities.
    I find the data management plan support tool very helpful.
    Thanks,
    Herman
    --
    Dr. ir. Herman Stehouwer
    Rechenzentrum Garching @ Max Planck for Plasmaphysics
    RDA Secretariat
    ***@***.*** 0031-619258815

  • Chris Taylor 's picture

    Author: Chris Taylor

    Date: 20 May, 2014

    Dear all,
    I've been searching DataCite's metadata registry/catalogue and it seems
    that links confirming published status either don't get passed back at all
    in some formats (e.g., JSON), or come back as a generic link (no indication
    that it is anything special). The datacite format gives most I think, but
    even that only has this:

    <relatedIdentifier relationType="IsReferencedBy" relatedIdentifierType="DOI">10.1016/J.YMPEV.2011.06.012</relatedIdentifier>
    Running down that link could confirm that a dataset is linked to a
    published paper, but it's a bit flaky.
    Also, does anyone have a sense of the 'index' publication for a dataset (as
    opposed, at the other extreme, to a paper that reanalyses a dataset years
    later but is still perhaps linked)? An example might be a big genome paper
    whose data get reanalysed to death post hoc.
    Chris Taylor.
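
For readers who want to check a DataCite record for such links programmatically, here is a minimal sketch. It assumes the record is available as DataCite XML; the function names, the sample record, its identifier `10.1234/example`, and the `kernel-3` namespace URI are illustrative assumptions, while the `relatedIdentifier` element is the one quoted in this message. Elements are matched by local name so the code does not depend on a particular schema version.

```python
import xml.etree.ElementTree as ET


def related_identifiers(datacite_xml):
    """Collect every relatedIdentifier element in a DataCite XML record."""
    root = ET.fromstring(datacite_xml)
    links = []
    for element in root.iter():
        # Strip any "{namespace}" prefix and compare the local tag name.
        if element.tag.rsplit("}", 1)[-1] == "relatedIdentifier":
            links.append({
                "relation": element.get("relationType"),
                "type": element.get("relatedIdentifierType"),
                "value": (element.text or "").strip(),
            })
    return links


def dois_with_relation(links, relation):
    """Return DOI values whose relationType matches, ignoring case."""
    wanted = relation.lower()
    return [
        link["value"]
        for link in links
        if link["type"] == "DOI" and (link["relation"] or "").lower() == wanted
    ]


# Hypothetical record; only the relatedIdentifier element comes from the thread.
SAMPLE_RECORD = """
<resource xmlns="http://datacite.org/schema/kernel-3">
  <identifier identifierType="DOI">10.1234/example</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relationType="IsReferencedBy" relatedIdentifierType="DOI">10.1016/J.YMPEV.2011.06.012</relatedIdentifier>
  </relatedIdentifiers>
</resource>
"""

if __name__ == "__main__":
    found = related_identifiers(SAMPLE_RECORD)
    print(dois_with_relation(found, "IsReferencedBy"))
```

As noted above, a matching relation only shows that a link was recorded. It does not by itself confirm that the target is a published journal article.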

  • Angus Whyte's picture

    Author: Angus Whyte

    Date: 20 May, 2014

    I believe the DCC tools were mentioned when the IG met in Dublin so I
    guess may already be in the summary. If not it would be good to mention
    if it's not too late the DCC tool DMPonline
    (http://dmponline.dcc.ac.uk/) and Checklist
    (http://www.dcc.ac.uk/resources/data-management-plans/checklist). Of
    course both are designed to help researchers to create data management
    plans, and also to help institutions to support them with their specific
    guidance.
    The DMP outline and checklist from Bielefeld University is an
    interesting example (as is the data publication repository), and we
    should also link to it from the DCC website.
    Jochen is your university's service planning to link DMPs to datasets in
    your repository?
    Thanks,
    Angus
    Dr Angus Whyte
    Senior Institutional Support Officer
    Digital Curation Centre
    University of Edinburgh

  • Chris Taylor 's picture

    Author: Chris Taylor

    Date: 20 May, 2014

    Update: Seems that 'relatedIdentifier:issupplementto\:*' (
    http://search.datacite.org/ui?&q=relatedIdentifier%3Aissupplementto\%3A*)
    is DataCite's preferred way to link out to a paper, but that doesn't seem
    to me to exclusively indicate that a *journal publication* is on the end of
    the link.
    [From @datacite]
    'If the author knows about a data-paper link, it is in the metadata
    http://search.datacite.org/ui?&q=relatedIdentifier%3Aissupplementto\%3A*
    Often they do not know'
    (Original Tweet: https://twitter.com/datacite/status/468691940960391168)

  • Stefan Kramer's picture

    Author: Stefan Kramer

    Date: 21 May, 2014

    FWIW, the DataCite folks are very open to questions & suggestions for improvements to the metadata schema:

    https://groups.google.com/forum/#!forum/datacite-metadata

    Stefan Kramer
    ***@***.***
    Research Data Librarian
    American University
    4400 Massachusetts Ave. NW
    Washington, DC 20016
    www.american.edu/profiles/faculty/skramer.cfm

  • Chris Taylor 's picture

    Author: Chris Taylor

    Date: 21 May, 2014

    Cheers I might just start annoying people about this then...
    The ideal would be a boolean somewhere up top, with a regularized way to
    elaborate further down. And for it to become standard across catalogues and
    repositories. Certainly relying on seeing a DOI in context isn't enough
    (for example, conference papers frequently lack them).
    Incidentally, does anyone know how much effort goes into Thomson Reuters'
    Data Citation Index?

  • Chris Taylor 's picture

    Author: Chris Taylor

    Date: 21 May, 2014

    Actually probably better to have a series of values rather than a boolean
    so we could specify the manner of publication (0/null = none; 1 =
    peer-reviewed journal; 2 = conference proceedings; 3 = thesis for a higher
    degree; other values for white papers, book chapters, monographs,
    self-published, pre-print, etc.). I'm guessing someone already did that
    somewhere for something...
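
One way to picture the coded field suggested here is a small enumeration. This is only a sketch of the idea, not an existing standard or DataCite property: values 0 to 3 follow the numbering in this message, while the class name, the helper functions, and every value from 4 upward are hypothetical.

```python
from enum import IntEnum


class PublicationStatus(IntEnum):
    """Manner of publication associated with a dataset."""
    NONE = 0                    # 0/null = none
    PEER_REVIEWED_JOURNAL = 1
    CONFERENCE_PROCEEDINGS = 2
    HIGHER_DEGREE_THESIS = 3
    # The message names these categories without numbers;
    # the values below are assumptions made for illustration.
    WHITE_PAPER = 4
    BOOK_CHAPTER = 5
    MONOGRAPH = 6
    SELF_PUBLISHED = 7
    PREPRINT = 8


def parse_status(value):
    """Map a raw metadata value (None, int, or numeric string) to a status."""
    if value is None:
        return PublicationStatus.NONE
    return PublicationStatus(int(value))


def is_published(status):
    """Boolean view of the field: anything other than NONE counts as published."""
    return status != PublicationStatus.NONE


if __name__ == "__main__":
    status = parse_status("2")
    print(status.name, is_published(status))
```

Deriving the boolean from the coded value keeps both views consistent, so a catalogue could expose a simple yes/no flag without storing it separately.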

  • Kathleen Shearer's picture

    Author: Kathleen Shearer

    Date: 21 May, 2014

    Forwarding this for Angus Whyte as it didn't come through the list the first time.

  • Jochen Schirrwagen's picture

    Author: Jochen Schirrwagen

    Date: 23 May, 2014

    Dear Angus, All,
    my reply inline:
    On 20.05.2014 13:39, sangusa wrote:
    >
    > I believe the DCC tools were mentioned when the IG met in Dublin so I
    > guess may already be in the summary. If not it would be good to mention
    > if it's not too late the DCC tool DMPonline
    > (http://dmponline.dcc.ac.uk/) and Checklist
    > (http://www.dcc.ac.uk/resources/data-management-plans/checklist). Of
    > course both are designed to help researchers to create data management
    > plans, and also to help institutions to support them with their specific
    > guidance.
    >
    > The DMP outline and checklist from Bielefeld University is an
    > interesting example (as is the data publication repository), and we
    > should also link to it from the DCC website.
    great.
    The tool itself (built as a module for Drupal 6) is intended for
    researchers at our university, but it is possible to request demo
    access via
    ***@***.***-bielefeld.de
    >
    > Jochen is your university's service planning to link DMPs to datasets in
    > your repository?
    This is an interesting but also delicate question.
    Of course it makes a lot of sense to make links between datasets and the
    associated DMP plus the project.
    But is a DMP in general a public or internal document?
    Maybe the status is internal during the lifetime of a project where the
    data gets created and the DMP might be dynamically adapted according to
    the work progress, but public at the time when datasets are published?
    Best,
    Jochen
