Use cases and identifier schemes for persistent software source code identification (V1.1)

RDA/FORCE11 Software Source Code Identification WG

As this WG has now completed, please direct any questions or comments about this Output to the Software Source Code Interest Group, which is ongoing.

Group co-chairs: Roberto Di Cosmo, Martin Fenner, Daniel S. Katz

Supporting Output Title: Use cases and identifier schemes for persistent software source code identification

Authors: Research Data Alliance/FORCE11 Software Source Code Identification WG, Alice Allen, Anita Bandrowski, Peter Chan, Roberto Di Cosmo, Martin Fenner, Leyla Garcia, Morane Gruenpeter, Catherine M Jones, Daniel S. Katz, John Kunze, Moritz Schubotz, Ilian T. Todorov

Impact: Provides an overview of the current state of the art in software identification practice, including use cases and identifier schemes from different academic domains and from industry.

DOI: 10.15497/RDA00053

Citation and download: Research Data Alliance/FORCE11 Software Source Code Identification WG, Allen, A., Bandrowski, A., Chan, P., Di Cosmo, R., Fenner, M., Garcia, L., Gruenpeter, M., Jones, C. M., Katz, D. S., Kunze, J., Schubotz, M. & Todorov, I. T. (2020). Use cases and identifier schemes for persistent software source code identification (V1.1). Research Data Alliance. https://doi.org/10.15497/RDA00053

Summary

Software, and in particular source code, plays an important role in science: it is used in all research fields to produce, transform and analyse research data, and is sometimes itself an object of research and/or an output of research.

This output, with inputs from a broad panel of stakeholders, provides an overview of the current state-of-the-art practice in software identification, including use cases and identifier schemes from different academic domains and in industry.

Context:

The SCID WG was spawned from discussions both on the RDA’s Software Source Code IG and FORCE11’s Software Citation Implementation WG, recognizing that software is a special kind of object, and that its identification needs to be specifically addressed taking into account the various existing identifier schemes for software.

Objectives:

The goal of this output of the working group is to survey different systems of identifiers for software, and their usage in different use cases, in a harmonized way. We hope that this will provide solid ground on which to build recommendations for the academic community, and help academic and industrial stakeholders adopt solutions that are compatible with each other and especially with the software development practice of tens of millions of developers worldwide.

Request for comments:

We invite the RDA & FORCE11 community to review and comment on the SCID WG output as part of the open process for endorsement and recognition by RDA and FORCE11.

Comments are welcome and should be made no later than September 4th 2020. If you are an RDA member, we would appreciate having your review in the comments below for the record. You can also add comments directly on the Google document that contains the first stable version. All comments in the document will be transcribed in this post during or after the community review:

https://docs.google.com/document/d/1MpWGgxet1A0qFhPFJoIs0363wXOUKgzwQIinKc8QqWI/edit?usp=sharing

Note that this is a different document from the one used internally by the working group: the old document will remain accessible for the record, but new comments and edits have been disabled.

Please note that Version 1.0 of the Output underwent community review, and that Version 1.1 is the final version of the Supporting Output based on these comments.

Output Status:

RDA Supporting Outputs

Review period start:

Friday, 17 July, 2020 to Friday, 4 September, 2020

Group content visibility:

Public - accessible to all site users

Primary Domain/Field of Expertise:

Engineering and Technology

Primary WG Focus / Output focus:

Identity, Store, and Preserve

Domain Agnostic:

File:

Attachment	Size
SCID WG output for community review (v1.0).pdf	2.3 MB
Casos de uso y esquemas de identificación del código fuente del software persistente.pdf	130.19 KB

Attachment	Size
Card RDA 20200910.pdf	1.02 MB


Author: Francoise Genova

Date: 17 Aug, 2020

Dear colleagues,

Thank you for putting all the information together. The last sentence of the document is "The next step would be to produce a set of recommendations based on these findings." Are there plans to continue the work in that direction?

Best

Francoise

Author: Morane Gruenpeter

Date: 27 Aug, 2020

Dear Francoise,

Thank you for your kind comment.

The SCID WG has ended its lifespan and will not continue at the moment with other outputs. There are many initiatives with the aim of producing recommendations, including the FAIR4RS WG, that can use this output as a basis for software identification.

Best,

Morane

Author: Morane Gruenpeter

Date: 02 Oct, 2020

Edwin Henneken: An additional argument is that a significant part of research knowledge is encoded in software

Edwin Henneken: I think Asclepias belongs in this list (e.g. https://doi.org/10.5281/zenodo.1011088)

Edwin Henneken: The Astrophysics Data System (ADS) should also be in this list

Anonymous: Agreed. It makes sense to have at least one thematic service cited, since they provide high added value thanks to their deep knowledge of their community's needs and assets. NASA ADS is an excellent example which includes links to data and software.

Manodeep Sinha: Typo: Should be PyPI

Lorraine Hwang: and maintenance

Neil Chue Hong: Is this the level at which computational/mathematical libraries would sit?

Morane Gruenpeter: A mathematical library (like SageMath or matplotlib) is a project, and in that project you can find modules. The modules can be found as separate repositories (as we do with Software Heritage), as designated directories, or even as single files (as you can find in matplotlib). Here the distinction is for a module that focuses on one part of the software functionality, where you might want to identify this part without identifying the complete software. A modular architecture is quite common in software engineering. If you want to reference a mathematical library, you should ask yourself whether you want to identify the complete library or a specific module (which might have more specific authors).

Manodeep Sinha: Unsure what this means. Is it that the "executable" is accessible through a download link, or that a "download link" is an example of an executable?

Morane Gruenpeter: The download link is an example to locate and access the executable

Jose Benito Gonzalez Lopez: Should 'tag' be added to this list as a separated item as well?

Edwin Henneken: Also, within the context of GitHub (and maybe other repos), releases come with "assets", which can be downloaded separately.

Manodeep Sinha: Would test datasets be under this category? Particularly, datasets that might be externally hosted and not within version control.

Morane Gruenpeter: No, this is only about identifying the code or the part of the software that lives within the software folder (which can be test data).

Tom Honeyman: This only occurs as an acronym in this document.

Lorraine Hwang: Is this parenthetical or distinctly different?

Lorraine Hwang: "referenced" could be significantly different than reuse. A publication may be referenced in the context of a review of prior methods, for example.

Morane Gruenpeter: I agree and will separate this into two use cases for the next document version.

Lorraine Hwang: Can you differentiate how this would be different from an RSA's needs?

Morane Gruenpeter: I think we should change the actor to RSA

Lorraine Hwang: For maintenance, it would also be important in tracking and implementing changes of dependencies.

Morane Gruenpeter: I agree and will add a use case

Kat Thornton: I would like to supply a list of all Wikidata properties related to software.

Jose Benito Gonzalez Lopez: Maybe I am not reading this table properly, but just in case: a user can manually upload software to Zenodo at almost any GL. The user is free to upload a full project, a file, a module, or even a script. Choosing only the "Release" GL corresponds to the GitHub-Zenodo integration, but as said, let's not forget that users can upload manually whatever they wish.

Jose Benito Gonzalez Lopez: Even more, taking only the GitHub-Zenodo integration, there is no representation here for the Concept DOIs that are generated per software. For a given piece of software there is always 1 DOI that represents the software (project) and multiple DOIs that represent the versions (in the case of GitHub those are releases; in the case of manual upload they could be anything).

Manodeep Sinha: Agree with what Jose said about the Zenodo DOI + GitHub covering both GL1 and GL2 automatically, and potentially any level via manual uploads

Morane Gruenpeter: This table was completed with the Zenodo-GitHub integration where only releases are uploaded in Zenodo. Seems like a "typo" not having the GL1 noted for the concept DOI (I'll check the draft).

Morane Gruenpeter: I can add directory and file to the possible GLs. Concerning a code fragment, technically it will be a file if uploaded into Zenodo. A revision is a point in time with access to the full past development history, which is not supported in Zenodo. The same goes for a snapshot: you can't capture different branches in Zenodo (technically).


