Data Versioning Interest Group Charter
The Data Versioning WG has decided to transition to an Interest Group. The proposed Charter, which underwent community review, can be found here. The final version of the Charter, which was updated after the community review, can be found here.
For more background information on the Data Versioning WG, please refer to the comment by Jens Klump on this page.
Author: Jens Klump
Date: 07 May, 2021
The RDA Data Versioning WG has come to a close. The outcomes of the Working Group were published as RDA Supporting Outputs.
To continue working on questions relating to data versioning, and for the development of actionable recommendations, we propose to close the Working Group and transform it into an Interest Group and publish a new Interest Group Charter for community review.
Author: Francoise Genova
Date: 07 May, 2021
I fully support the creation of an RDA Data Versioning IG, to provide a venue to discuss the topic following the successful work of the WG.
A few comments on the proposed Charter:
- Introduction:
. The charter indicates that an RDA IG on Data Versioning was formed in 2017. I cannot find it on the RDA site, neither in the IG list nor in the 'historical group' list. What was its fate at that time?
. A link to the WG white paper and other outputs could be provided as a footnote to the Introduction.
- Participation: in the point about provenance, the P17 BoF was held
- Timeline: I don't understand what 'task force align deliverables' are. Task forces aligned with deliverables?
Thanks for moving this important topic forward in the RDA.
Francoise Genova
Author: Mingfang Wu
Date: 07 Jun, 2021
Hi Francoise,
Thank you very much for reviewing the Charter.
For your first comment, I think this is what happened: the Data Versioning IG, which was formed in 2017, transitioned into the WG in 2018, and the WG kept the URL and members of the IG. As a result, the IG was never classified as a "historical group". This may explain why you couldn't find the 2017 IG anywhere.
We have added a section for the WG outputs, and citations to the use cases and the white paper in the introduction. We also corrected the typos that you spotted.
Author: Sarah Davidson
Date: 14 May, 2021
I am glad to see this effort to start a data versioning IG! And particularly getting at the challenge of republication/mirroring, which is something I'd like to learn from the RDA community about.
My main comment on the draft would be to say "reproducibility and re-use" wherever you say "reproducibility". From my perspective, the focus on reproducibility is overstated. It is certainly important to highlight, both because reproducible results are central to science and because reproducibility is a focus of publisher data accessibility policies. However, in a practical sense, I think it is important to also state "re-use" specifically because
(1) It is one of the core FAIR data principles.
(2) In my experience, end users are far more interested in re-using and aggregating data for other purposes, using the largest and most current versions of relevant datasets, regardless of which of those data records were used in particular existing published analyses. (In 10 years of curating research data, I have never been approached by someone wanting to reproduce a published result.)
(3) I also think that for reproducibility, it is often the methods, and not necessarily the exact set of source data, that are critical to preserve. In many cases, robust findings should be reproducible using the same methods across a variety of relevant data sources. And this I think falls outside the scope of the proposed IG, so a general focus on data re-use could also help to maintain the intended scope of effort.
(4) By including other types of re-use, we can better explain the full extent of challenges posed when overlapping, duplicated, or reprocessed data are published and copied within and across platforms. A basic example of the problem is this: A researcher publishes several overlapping subsets of the same research dataset, which underlie different published analyses. One or more of those subsets might also involve a different reprocessing of the data, use of different identifiers for study subjects/samples, etc. Some of the published data might also be harvested by other data platforms. The full research dataset, which is ongoing across multiple grant cycles, is not published or publicly available, because journals and funders don't or can't require this. Another researcher wants to run an aggregate analysis across published data within their domain of interest. But they cannot assume uniqueness of data records across published datasets, and there may be no feasible or documented way to identify unique records. They are thus either limited in their meta-analysis options, or need to skip the published data and request access to the unpublished datasets, and there is a risk that incorrect assumptions are made and lead to flawed research conclusions. If the IG can address these kinds of situations, this makes a strong case for the importance of the IG and how its outputs could scale the growth and quality of future research opportunities and new knowledge.
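As a rough illustration of the overlap problem described in point (4), the sketch below flags records that appear in more than one published subset even when the subject identifiers differ. Everything in it is an assumption made for illustration: the field names (`subject_id`, `timestamp`, `lat`, `lon`), the sample records, and the helper functions `record_fingerprint` and `find_overlaps` are hypothetical and do not come from the Charter or from any particular repository.

```python
import hashlib
import json
from collections import defaultdict


def record_fingerprint(record, ignore_fields=("subject_id",)):
    """Hash a record's content, skipping fields that may be relabelled
    between subsets (here, a hypothetical subject identifier)."""
    content = {k: v for k, v in record.items() if k not in ignore_fields}
    canonical = json.dumps(content, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_overlaps(datasets):
    """Return fingerprints seen in more than one dataset, mapped to the
    sorted names of the datasets that contain them."""
    seen = defaultdict(set)
    for name, records in datasets.items():
        for record in records:
            seen[record_fingerprint(record)].add(name)
    return {fp: sorted(names) for fp, names in seen.items() if len(names) > 1}


# Hypothetical published subsets of one underlying research dataset.
subset_a = [
    {"subject_id": "A1", "timestamp": "2020-01-01T00:00:00Z", "lat": 47.1, "lon": 9.2},
    {"subject_id": "A1", "timestamp": "2020-01-02T00:00:00Z", "lat": 47.3, "lon": 9.4},
]
subset_b = [
    # Same observation as the second record above, under a different subject label.
    {"subject_id": "bird-7", "timestamp": "2020-01-02T00:00:00Z", "lat": 47.3, "lon": 9.4},
    {"subject_id": "bird-7", "timestamp": "2020-01-03T00:00:00Z", "lat": 47.5, "lon": 9.6},
]

overlaps = find_overlaps({"subset_a": subset_a, "subset_b": subset_b})
```

Exact content hashing of this kind only catches records that were copied unchanged. If a subset was reprocessed so that the values themselves differ, the fingerprints no longer match, which is the situation where documented versioning and provenance would be needed.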
Author: Mingfang Wu
Date: 07 Jun, 2021
Hi Sarah,
Thank you for reviewing the Charter and your thoughtful comments.
We totally agree with you that the proposed work will impact not only reproducibility, but also data re-use. For scientific reproducibility, yes, we agree that the methodology is important, but that is clearly outside of our scope. We feel it is essential that a user be able to identify the exact version of the data that was used in a scientific investigation, or to decide which version should be re-used. This can only be done through a more precise definition of 1) the exact version of the data used and its processing provenance, 2) the data format and 3) the actual data repository/data service that the dataset was accessed from.
Your data re-use use cases provide very good examples of why we propose this interest group. We proposed in the Charter to collect use cases related to citation, identification, authority and ethics when versioning, overlapping, duplicating, or reprocessing datasets; this would align with the use cases you discussed.
We have added the "re-use" wherever appropriate in the proposed charter.
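To make the three elements named in this reply concrete (exact version and processing provenance, data format, and the repository or service the data was accessed from), here is a minimal sketch of a record that keeps them together. The class `DatasetReference`, its field names, the helper `is_mirror_of`, and all example values are hypothetical; they are one possible way to model the idea, not a schema proposed by the WG or the Charter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetReference:
    """Hypothetical description of exactly which data a study used."""
    identifier: str    # identifier of the dataset as a whole
    version: str       # exact version that was used
    provenance: str    # processing that produced this version
    data_format: str   # format in which the data was obtained
    repository: str    # repository or service the data was accessed from


def is_mirror_of(a, b):
    """True when two references describe the same version, processing and
    format of a dataset, but obtained from different repositories."""
    same_content = (
        (a.identifier, a.version, a.provenance, a.data_format)
        == (b.identifier, b.version, b.provenance, b.data_format)
    )
    return same_content and a.repository != b.repository


# Example values are invented for illustration only.
used_in_study = DatasetReference(
    identifier="example-dataset",
    version="2.1",
    provenance="calibrated, outliers removed",
    data_format="CSV",
    repository="repository-one",
)
harvested_copy = DatasetReference(
    identifier="example-dataset",
    version="2.1",
    provenance="calibrated, outliers removed",
    data_format="CSV",
    repository="repository-two",
)
```

With a record like this, `used_in_study == harvested_copy` is false because the repositories differ, while `is_mirror_of(used_in_study, harvested_copy)` is true. That distinction between "the same access" and "the same version held elsewhere" is what matters for the republication and mirroring cases raised in this thread.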
Author: Coralie VINCENT
Date: 18 May, 2021
Thank you very much for the proposed Charter.
Just to be sure: are the dates in the table p. 4 chronological? If so, shouldn't it be "Jan-June 2022" instead of "Jan-June 2021"? (just trying to understand the timeline)
Author: Mingfang Wu
Date: 06 Jun, 2021
Thank you, Coralie, for spotting the typo. We will correct it in the final version.