Data Citation - Participate in shaping BibLaTeX

01 Apr 2021

Dear WG Data Citation members,

I work at a repository for the long-term preservation of digital research data. We are currently preparing citation information for each data collection and also for the individual parts of a collection.

We decided to use BibLaTeX because it provides a 'dataset' entry type that is already supported by some citation styles (e.g. APA 6th). BibLaTeX is actively developed, which makes it possible to shape the dataset type further. This is needed because there is currently no good way to, for example, collect reference information for parts of a dataset or to record fingerprints, unique queries, etc.
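To illustrate what is already possible, here is a minimal sketch of a stock @dataset entry using only standard fields (the key, names, DOI and URL are invented for the example):

@dataset{example2021,
  author    = {Doe, Jane},
  title     = {Example Survey Data},
  year      = {2021},
  publisher = {Example Repository},
  doi       = {10.1234/example},
  url       = {https://repository.example.org/collection/1},
  urldate   = {2021-04-01},
}

What is missing are agreed fields for pointing to a specific part of such a collection, or for recording a fingerprint or the query used to derive a subset.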

I've started a discussion on GitHub: https://github.com/plk/biblatex/issues/1103

Any feedback (here or maybe directly on GitHub) would be highly appreciated!

Best from Vienna
Martina Trognitz

  • Author: Hugh Paterson

    Date: 01 Apr, 2021

    Greetings,
    There is an overarching problem in dataset citation, and you point out half
    of it. Datasets are frequently aggregate works (collections), and there is
    no easy way to reference the components of the aggregate unit. While I
    advocate making bibliographic metadata available, BibLaTeX is only
    marginally better than BibTeX: neither was designed to be the authoritative
    source format for archival records.
    The second and more frequently ignored problem in dataset referencing is
    that datasets as archival objects are often miscategorized.
    Some take the position that all objects in digital form are data… but is
    software data? And this leads to an important philosophical question about
    the role of institutional repositories: should they persist data, or should
    they persist the evidentiary record? That is, is the term data even useful
    at all?
    Consider an aggregate work of audio materials that may be cited/referenced
    as an album and each sub-unit as a track. There is no reason to categorize
    this as a “dataset”. The same is true of a set of ethnographic interviews
    that are simply dumped into a repository: they are interviews, not just
    recordings or a “dataset”. So, depending on the media type, some things
    should not be datasets. I find the dcmitype vocabulary very helpful in
    this regard. Dublin Core says that every artifact should have a one-to-one
    record in the catalogue. So each audio recording should get its own record
    and a relationship to the record for the aggregate work, which would be
    the album.
    Datasets are a legitimate item type, but as the dcmitype vocabulary
    identifies them, they are tabular data, ready for ingest into a computer
    application. In this manner they are distinct from the dcmitype for text
    in that they are not designed for human literary consumption.
    The need to accurately identify item types comes back to repositories, how
    they identify content, and how they make those identifications available
    via pre-formatted bibliographic records. If the repository says that
    everything is a “dataset”, then the choice between BibLaTeX @dataset and
    BibTeX @misc is a moot point, because both are equally unhelpful and
    ambiguous to the end user who might look to reuse the bibliographic
    metadata.
    Also note that APA 7th is out. I don’t like it, but it is out.
    Some food for thought,
    All the best,
    - Hugh

  • Author: Roberto Di Cosmo

    Date: 01 Apr, 2021

    Dear all,
    similar discussions have been taking place about software (which is not just
    a special case of data), and the following short summary may be of interest
    for your work in this area.
    Of course, Bib(La)TeX entries are not meant to contain all relevant metadata
    about software; there are other standards for that (like CodeMeta): for
    example, we do not expect to find individual author roles or affiliations
    recorded there. Nonetheless, they are extremely useful when it comes to
    citing software in publications, creating bibliographic reference lists,
    producing activity reports, etc.
    Hence the importance of having proper support for software entries in
    BibLaTeX. But, like @dataset, the @software entry in stock BibLaTeX is just
    another name for @misc and does not fit the bill at all.
    A significant amount of work has been done to determine:
    - the fields needed to describe software in a bibliography
    - the kinds of entries needed to capture the different facets of software
      (@software alone is not enough)
    - the best way to have these new fields and entries supported in existing
      styles
    This has led to the development of the biblatex-software package, included in
    all recent TeXLive distributions, and also separately available from CTAN at
    https://ctan.org/pkg/biblatex-software
    This package is a /style extension/ that can be used to add support for the
    following four entry types to any existing BibLaTeX bibliographic style:
    @software
    @softwareversion
    @softwaremodule
    @codefragment
    Biblatex-software supports inheritance between these entries and provides a
    broad set of parameters for tweaking the rendering of bibliographies as
    desired; see the extensive documentation at [1] for more information.
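    As a rough illustration (the keys and values below are invented; the
    package documentation at [1] has the authoritative list of fields), a
    released version can be referenced by combining @software and
    @softwareversion, with the version entry inheriting shared information
    from its parent entry (here via crossref):

    @software{mytool,
      author = {Doe, Jane},
      title  = {MyTool},
      url    = {https://example.org/mytool},
      year   = {2021},
    }

    @softwareversion{mytool-1.2,
      crossref = {mytool},
      version  = {1.2},
      date     = {2021-03-15},
    }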
    It would be great to see similar work done for @dataset, and I hope this
    information about what we did for software (not only the technical
    implementation, but also the process that led to it) may be of help.
    All the best
    --
    Roberto Di Cosmo
    [1] https://ctan.gutenberg.eu.org/macros/latex/contrib/biblatex-contrib/bibl...
    ------------------------------------------------------------------
    Computer Science Professor (on leave at INRIA from IRIF/Université de Paris)
    Director, Software Heritage https://www.softwareheritage.org
    ------------------------------------------------------------------

  • Author: Andreas Rauber

    Date: 02 Apr, 2021

    Dear all,
    This is absolutely correct! In a nutshell, we need to differentiate
    between two aspects, namely (1) **identifying** the precise
    subset/collection of a (potentially changing) dataset, and (2) the
    actual information making up a citation to that dataset.
    For (1) this WG has come up with an answer, a single principle, that so
    far seems to work across all types of data and all solutions for
    implementing a data repository, whereas for (2) some recommendations can
    be made, but it will ultimately depend strongly on the domain, the type
    of data and its use.
    (1) For the identification we again have two sub-challenges, namely (a)
    the evolution of data (new data being added, errors corrected, ...) and
    (b) the identification of arbitrary subsets. (1a) is solved by versioning
    all changes to the data, whereas (1b) is addressed by resolving any
    subsets dynamically via a reproducible operation (referred to, in the
    guidelines, as a query) that was executed at a certain timestamp, that
    needs to be stored and that is associated with a persistent identifier.
    This could be the check-out from a Git repository at a certain point in
    time, it could be a "list-directory" command against a versioned file
    system, it could be an SQL query against a temporal table, we have seen
    solutions using slice/dice operators against a NetCDF file, and it could
    also, to use the audio example referred to before, even be a pointer to a
    specific offset in an audio file (or, e.g., a 30-second segment starting
    at minute 1 for all audio files with a 44 kHz sampling rate in the
    collection, as used to be done for music retrieval experiments). This
    also works across distributed repositories, as each only needs to keep
    the queries it processed as well as the timestamps locally, without any
    need to synchronize clocks. An aggregator would then simply store the
    individual PIDs of the responses from a federated system.
    (2) Concerning the actual citation text, we may want to revisit that
    topic to see how much more specific we can get while still making sure
    that any recommendation works across possibly all types of data and
    domains. Currently, we stayed with a very limited set of metadata,
    borrowing from the analogy of citations to literature and recommending
    the use of two identifiers: one for the (continuously evolving) data
    source, and one for the specific subset extracted from it at a given
    point in time (the analogy being a specific, static paper identified
    e.g. via a DOI in an evolving journal, i.e. one growing with new issues
    being added, identified e.g. via an ISSN). The creator of the subset may
    be compared to the author of a paper, whereas the owner/operator of the
    data source may be likened to the editor of proceedings, but any such
    mapping will already differ quite a lot across repositories and types of
    data, so it is not part of the general recommendations, beyond the
    statement that each data center should provide a recommended way of
    phrasing/expressing a citation. This may well be worth picking up, now
    that we have a better understanding of the identification and resolving
    process, if we have a feeling that this core can be extended.
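    To make the two-identifier recommendation concrete with stock BibLaTeX
    only, a subset citation could be approximated along the following lines
    (all keys, PIDs and values are invented; the subset's own PID goes into
    the doi field, while the evolving source and the query timestamp can only
    be squeezed into a note until dedicated fields exist):

    @dataset{subset2021,
      author = {Doe, Jane},
      title  = {Subset of the Example Observation Archive},
      year   = {2021},
      doi    = {10.1234/subset-pid},
      note   = {Subset created 2021-03-01T12:00Z via stored query q-42;
                source collection: hdl:21.T12345/source-pid},
    }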
    The pre-print version of the paper we've prepared on the recommendations,
    as well as on reference implementations and deployed adoptions, could be
    useful for reviewing these principles in different settings:
    http://doi.org/10.5281/zenodo.4571616
    best regards,
    Andi

  • Author: Martina Trognitz

    Date: 14 Apr, 2021

    Dear Hugh, Roberto, Andi and Mark,

    thank you very much for your thoughts, comments and pointers to other recommendations. They do provide some useful information to consider for shaping BibLaTeX's @dataset type.

    I feel that I should elaborate a bit on my use case to clarify what I am trying to achieve. The repository hosts data from the (digital) humanities, with collections from disciplines like oriental studies, archaeology or history. The sizes of the collections vary significantly (1 GB to 10 TB; or a few up to 100 000 (and more) resources), and a collection can contain multiple resource types (we use a vocabulary based on DCMI Type). We developed a dedicated metadata schema to describe the objects both at the collection and at the resource level, and as long as the resources are publicly accessible they also get a PID (Handle). The PID points to the respective object's landing page with all its metadata; machine-readable endpoints are also available.

    To aid proper attribution when a collection (or a subcollection or a single resource) is re-used, the repository provides a citation suggestion. This is comparable to what you can find on Zenodo or on Dataverse instances, for example. The suggested citation is automatically computed from the metadata, and we decided to provide it in BibLaTeX format because most reference management software supports this and many citation styles already exist.

    During the process of mapping our metadata to BibLaTeX, I found that most of the principles of this WG's recommendations can be met, but not all. As BibLaTeX is still actively developed, I saw the chance to shape the @dataset type into something that could then help citation style developers to provide sound and useful citations of 'datasets'. One of the developers pointed out: "I realise that with some things we might have a bit of a chicken-or-egg problem: Certain things might not be popular yet, because they are not properly supported by the software yet." -- By working on enhancing the data model for the @dataset type, and possibly even introducing a new type like @datasubset, we could pave the way for better citation styles for data. I myself was thinking of promoting this in the German-speaking archaeology community, for example, but this only makes sense if the technological basis is there.
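    Purely as a thought experiment (nothing like this exists in BibLaTeX yet, and all keys and values are invented), such a @datasubset type might be used roughly like this, inheriting shared information from its parent collection:

    @dataset{collection2021,
      author = {Doe, Jane},
      title  = {Example Excavation Archive},
      year   = {2021},
      doi    = {10.1234/collection-pid},
    }

    @datasubset{subset2021,
      crossref = {collection2021},
      title    = {Pottery photographs, trench 3},
      doi      = {10.1234/subset-pid},
    }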

    Best
    Martina

     

  • Author: Hugh Paterson

    Date: 14 Apr, 2021

    Greetings Martina,
    given your further description, my impression is that your DCMIType should
    be "collection" rather than "dataset", since DCMIType suggests that a
    dataset cannot be further broken down and described; this is inferred
    because "collection" is the only DCMIType class which can be further
    broken down and "contain"/"hasPart" individually describable items. In
    that case the most appropriate BibLaTeX type would be @collection, and
    each part could be either @incollection or another, more appropriate
    BibLaTeX entry type for when the item is referenced as a single entity.
    Looking at the linked reference [1], I notice that there is no default
    @audio or @recording type, which would be appropriate for audio artifacts
    in collections (an album is a type of audio collection). However, even
    though these are non-default, they are used in some style sheets (see
    link [2]). This leads me to ask whether you want to stay with the default
    settings of BibLaTeX or whether you are willing to provide data in formats
    used within "standard" secondary communities of the BibLaTeX community.
    That is, if there is a BibLaTeX style that is common among the major
    audience of the content within your archive, then venturing into providing
    BibLaTeX in that dialect might be acceptable. Another thing to note, if
    you are serving archaeology data, is that you might have DCMIType
    InteractiveResource material [3], assuming that some of the larger
    artifacts are visualizations from lidar or other 3D imaging tools used in
    modern archaeology.
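    To make the @collection/@incollection suggestion concrete (all keys and
    values below are invented, and note that the stock definitions of these
    types assume book-like works), an album and one of its tracks could be
    expressed as:

    @collection{recordings2019,
      editor    = {Doe, Jane},
      title     = {Example Field Recordings},
      year      = {2019},
      publisher = {Example Archive},
    }

    @incollection{recordings2019-track03,
      crossref = {recordings2019},
      author   = {Doe, Jane},
      title    = {Story of the River (track 3)},
    }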
    You mention that collection sizes vary significantly (1 GB to 10 TB), but
    I suggest that the extent of a collection is not the number of bytes it
    contains but rather the number of objects which are uniquely described
    within the next lower level of the collection (collections in Dublin Core
    can be recursive). Unfortunately, the number of items in a collection does
    not fit into the allowable options of the extent field within DCTerms
    (see [4]); one must use the property tableOfContents [5]. Different
    referencing styles handle this sort of information in different ways.
    APA 6th edition [6] provides the following template on page 212, which I
    have emulated to show how I would apply it to a collection of field
    recordings in linguistics. My application shows how I would phrase the
    collection summary statement to include the tableOfContents/extent
    information.
    Author, A. A. (Year, Month Day). Title of material. [Description of
    material]. Name of collection (Call number, Box number, File name or
    number, etc.). Name and location of repository.

    Paterson III, H. J. (2018-2019). Western Kainji Oral Stories. [435 audio
    and video recordings, 5 hours, 8 languages]. African Voices (ark:12025,
    DOI: 10.1234/780912). Pangloss, Paris, France.
    An audio or video artifact (or set of artifacts) would not be helpfully
    described as 1 GB to 10 GB, but would be more helpfully described with a
    time-based extent, e.g., 1h3m35s. Note that Zenodo does not currently
    allow a depositor to distinguish between audio and video materials; it's
    painful.
    all the best,
    - Hugh Paterson III
    [1]:
    http://tug.ctan.org/info/biblatex-cheatsheet/biblatex-cheatsheet.pdf
    [2]:
    https://tex.stackexchange.com/questions/74766/how-define-biblatex-entry-...
    [3]: http://purl.org/dc/dcmitype/InteractiveResource
    [4]:
    https://www.dublincore.org/specifications/dublin-core/dcmi-terms/terms/e...
    [5]:
    https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http:/...
    [6]: VandenBos, Gary R, ed. 2010. *Publication Manual of the American
    Psychological Association*. 6th edn. Washington, DC: American Psychological
    Association.

  • Author: Martina Trognitz

    Date: 16 Apr, 2021

    Good morning Hugh,

    thank you again!

    One thing I would like to stress in this discussion is the purpose of references: they should identify and point to some source or resource. IMHO, references are not intended to fully and thoroughly describe the object they identify, as this is done either with an imprint (or other means) in a printed resource or with the landing page of a digital resource.

    Dublin Core's DCMI Type and BibLaTeX's Entry Types are two different pairs of shoes with very different backgrounds and terminology. While the DCMI Type vocabulary was developed with digital data collections and different resource types in mind, the Entry Types of BibLaTeX originate in BibTeX, which itself was first released in 1985. BibTeX and BibLaTeX were developed with printed contributions in mind, and the Entry Types therefore concentrate on those; e.g. @collection is defined as:

    A single-volume collection with multiple, self-contained contributions by distinct authors which have their own title. The work as a whole has no overall author but it will usually have an editor.

    From this definition, it becomes clear that it is not possible to simply use that type for electronic data collections, especially keeping in mind that a user might include various different kinds of references in a bibliography. The best fit is @dataset, which was introduced as a fully supported Entry Type in BibLaTeX in 2019. It is quite vaguely defined as:

    A data set or a similar collection of (mostly) raw data.

    For the purpose of including a reference to a data collection, I think this vague definition is fine. I think introducing corresponding BibLaTeX Entry Types for each of the DCMI Types should be avoided in order to (1) prevent the list of Entry Types from exploding, (2) avoid having to disambiguate between Entry Type definitions, and (3) keep the adoption barrier low. Adoption is key: it does not suffice to have a proper Entry Type; after properly defining it, respective citation styles (e.g. with the Citation Style Language (CSL)) should also be available for convenient use.

    What I am trying to achieve is (a) to provide a convenient way for users of our repository to save a reference to a data collection to their bibliography, and (b) to do this in a way that can be widely used without much tinkering, like installing extra packages etc. This is why I thought that shaping and expanding the @dataset Entry Type (and the necessary and optional Entry Fields) directly for one of the next releases of BibLaTeX (see the issue on GitHub) might be the best way. (Another task would be to advocate for proper data referencing in the respective communities and to try to influence citation guidelines.)

    To wrap up, here are the key points that came up in this discussion, which I will suggest over there (a rough sketch of how they might look in an entry follows below the list):
    * a way to reference the components of an aggregate unit (i.e. part of a data set)
    * an Entry Field to indicate the type of @dataset (e.g. with terms from DCMI Type)
    * an Entry Field for storing a "query"
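    As a rough and purely hypothetical sketch (the field names datatype, query and querydate do not exist in BibLaTeX; all keys and values are invented), the three points combined could look something like this:

    @dataset{collection2021,
      author   = {Doe, Jane},
      title    = {Example Data Collection},
      year     = {2021},
      doi      = {10.1234/collection-pid},
      datatype = {Collection},
    }

    @dataset{subset2021,
      crossref  = {collection2021},
      title     = {Images from trench 3},
      doi       = {10.1234/subset-pid},
      query     = {resource type = image AND trench = 3},
      querydate = {2021-04-01},
    }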

    By the way: BibLaTeX has an Entry Type @software, but it is treated as an alias of @misc, and it could also be worked on and expanded, if wanted.

    Best
    Martina
