community collaboration on licensing for data reuse

24 Sep 2017

Dear All,
Some of you were involved in developing a community letter to NIH regarding data licensing impediments to data reuse. Here, I am including additional folks who I think might like to join this conversation. Please see the letter to NIH here and sign the letter here if you are supportive (we will leave the signatories open). Thanks to all who helped get this letter written and sent in the first place. Since then, we’ve been working to better understand the licensing barriers to data reuse and redistribution, and have also had some meetings with NIH and the Technology Transfer community.
We have been evaluating data sources (specifically relevant to the Monarch Initiative and the NCATS Data Translator, but this is just a starting place) against a new evaluation rubric. You can see the results of our efforts at http://reusabledata.org/. We debuted this effort at the recent Research Data Alliance Legal-Interoperability group meeting in Montreal; there are many other non-biomedical communities that have the same issues and I was very grateful to find them and have their assistance. The slides from RDA are posted here.
We would really appreciate review of the rubric and the addition of new data sources - the rubric is meant to evolve. You can make tickets or pull requests for your own sources of interest in the GitHub repo. The goal is to help other downstream data re-users understand what barriers they might expect in redistributing data from these sources. Note that very few sources actually fare well according to the rubric - especially those of us with large, integrated data sets where we don’t have the legal authority to redistribute. This is a community-level problem, and the rubric is not intended to call out any individual source as being good or bad.
While there have been various efforts to understand how to evaluate data sources for “FAIRness,” and an NIH RFI on data repository evaluation (responses here), we believe that data integration, reuse, and redistribution require a deep level of understanding of not only content, interoperability, and access issues, but especially the licensing ones. Our response to the RFI is at: Metrics to Assess Value of Biomedical Digital Repositories; our evaluation of the open science prizes and some related slides presented to the Biocuration Society are also available.
The 2018 annual meeting of the Association of University Technology Managers will be held Feb. 18-21 in Phoenix, Arizona: https://www.autm.net/events-courses/annual-meeting/2018-annual-meeting/ There, we hope to hold a round table and would welcome participants and attendees, especially those with legal, business, and technology transfer expertise. Please let us know if you are interested.
Special thanks to Seth Carbon, Julie McMurry, Robin Champieux, Letisha Wyatt and Lilly Winfree for the hard work necessary to launch reusabledata.org.
Please post to the brand-new listserv or the GitHub repo if you have questions, concerns, ideas, fears, etc. We will be writing a manuscript on this soon and would very much like your input and feedback. We need all of your help to make our publicly funded data resources fundamentally more reusable!
Very best,
Melissa
Melissa Haendel, PhD
Associate Professor
Library & Dept. of Medical Informatics and Clinical Epidemiology
***@***.***
503-407-5970
www.monarchinitiative.org
Appointments: Shanez De Silva
***@***.***

  • Author: Melissa Haendel

    Date: 25 Sep, 2017

    Thanks Justin,
    On Sep 25, 2017, at 9:08 AM, Justin B Starren <***@***.***> wrote:
    Melissa, et al.,
    A review of the letter conveys a data consumer perspective. Having lived on both the consumer and producer sides, I can see arguments on both sides.
    The reusable data project is for both users and providers. We need to be educating each other as to the barriers; most providers I work with don’t realize that their licensing terms actually require legal intervention and this was not really their intent. Data integrators are often also not aware that their redistribution might not be legal or allowed.
    If the only way I get tenure is based on writing papers about the data I have published, I am strongly disincentivized from sharing anything until after I get enough papers accepted to ensure my future promotion and tenure.
    Agreed. We need to change this: one CV at a time, one hire at a time, and one T&P committee at a time. For example, I generally don’t hire people who don’t exhibit signs of collaboration, sharing, and open science. It has to become part of our value and evaluation system. You at Northwestern have been leaders in this area, with the T&P guidelines being enhanced to include data sharing and collaboration as key criteria for T&P. You may want to discuss this with Kristi Holmes at Northwestern; she’s been a leader in this area and is cc’d.
    It is worth remembering that the Large Hadron Collider, which is often presented as an example of data sharing, is actually a large, and very closed, consortium. Only after the consortium felt it had wrung every possible publication from the data was it released to the unrestricted public.
    Rather than focusing on prescriptive solutions, a focus on rewards is likely to be more productive in the long run.
    I agree completely that we need to focus on sustainability and incentives. This is part of where we are heading with this project, but to get there we first, as a community, must understand the nature of the licensing problem. We hope that this will assist us in conversations about how to get the data out faster and better, in a sustainable and high-quality fashion.
    Scientists are smart. We are very good at complying with the letter but not the spirit of rules if we view those rules as harmful to our careers. Simply asking federally funded academic institutions to report how many scientists received tenure or promotion primarily for data sharing would help move the needle.
    Agreed.
    Finally, your 5 star rating requires unfettered use.
    It is actually aiming to understand what requires legal intervention for redistribution. There are many other relevant aspects of data sharing (see the FAIR-TLC rubric, for example), and many others have published on this as well (there is a reading list on the site to start).
    That implies no need to cite the data producer.
    This implication is definitely not intended and goes against not only the FAIR-TLC rubric described below but also all of our and others’ work on data citation and attribution for such things. See related articles:
    Achieving human and machine accessibility of cited data in scholarly publications
    How to cite individual resources such as a dataset, a blog related to identifiers: Bad Identifiers are the Potholes of the Information Superhighway: Take-Home Lessons for Researchers and related work on identifiers, which includes attribution and provenance: Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data
    Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products
    Publishing: Credit where credit is due
    Outputs of the NISO Alternative Assessment Metrics Project
    This is equivalent to eliminating the requirement to cite quotes or figures when reused in publication.
    I don’t think that is your intent.
    Definitely not. Remember, the reusable data project is focused solely on the licensing issues; if anything there is confusing in this respect, it is not intended, and we would very much appreciate you making a ticket or PR to help make things clearer. Please also join the listserv, as I would like to move this conversation to a public place :-).
    Perhaps there was a separate discussion of data citation, but having that expectation unambiguously stated in the rubric would help.
    There have been very many discussions about this, in fact :-). This is why I included the FAIR-TLC content: to provide this type of context. The above citations can help get you started on the various community attribution/citation conversations; I’m happy to send additional citations and project information related to these efforts.
    Best,
    Melissa
    -justin
    Justin Starren
    Northwestern University
    From: <***@***.***-groups.org> on behalf of Melissa Haendel <***@***.***>
    Date: Sunday, September 24, 2017 at 10:18 AM
    To: Benedict Paten <***@***.***>, David Haussler <***@***.***>, Mark Diekhans <***@***.***>, "Hunter, Larry" <***@***.***>, John Wilbanks <***@***.***>, "***@***.***" <***@***.***>, "***@***.***" <***@***.***>, "Srinivasan, Subhashini" <***@***.***>, "Sean D. Mooney" <***@***.***>, "Sinha, Saurabh" <***@***.***>, "Jongeneel, C Victor" <***@***.***>, "***@***.***" <***@***.***>, David Ellison <***@***.***>, Bill Hersh <***@***.***>, Shannon McWeeney <***@***.***>, Adrienne Zell <***@***.***>, Robin Champieux <***@***.***>, Nicole Weiskopf <***@***.***>, Justin Guinney <***@***.***>, Lara Mangravite <***@***.***>, Andrew Su <***@***.***>, Eric Topol <***@***.***>, Chunlei Wu <***@***.***>, Adam Wilcox <***@***.***>, Peter Robinson <***@***.***>, Christopher Chute <***@***.***>, Janet Palmer <***@***.***>, Kristi Holmes <***@***.***>, Philip Payne
    <***@***.***>, Robert Schuff <***@***.***>, David Dorr <***@***.***>, "Eichmann, David A" <***@***.***>, Patrick Barlow
    <***@***.***>, Benjamin Good <***@***.***>, Keith Alan Herzog <***@***.***>, Ali Torkamani <***@***.***>, Ted Laderas <***@***.***>, Michel Dumontier <***@***.***>, Reece Hart <***@***.***>, Reece Hart <***@***.***>, ISB Exec <***@***.***>, Mark Lawler <***@***.***>, Tudor Oprea <***@***.***>, Casey Greene <***@***.***>, Daniel Himmelstein <***@***.***>, "Katz, Daniel S" <***@***.***>, Daniel Mietchen <***@***.***>, "***@***.*** Committee" <***@***.***>, "Stephan Schürer, PhD" <***@***.***>, Danielle Robinson <***@***.***>, Sean McDonald <***@***.***>, Philip Bourne
    <***@***.***>, Keith Porcaro <***@***.***>, "Sofia, Heidi (NIH/NHGRI) [E]" <***@***.***>, "Dearry, Allen (NIH/NIEHS) [E]" <***@***.***>, "Sherri De Coronado," <***@***.***>, Warren Kibbe <***@***.***>, "Jessie Tenenbaum, Ph.D." <***@***.***>, "Lawler, Cindy (NIH/NIEHS) [E]" <***@***.***>, "***@***.***" <***@***.***>, "***@***.***" <***@***.***>, Arvin Paranjpe
    <***@***.***>, Fran Berman <***@***.***>, RDA/CODATA Legal Interoperability IG <***@***.***-groups.org>, Hassan Naqvi <***@***.***>, "Austin, Christopher (NIH/NCATS) [E]" <***@***.***>, "Kim, John (JP) (NIH/OD) [E]" <***@***.***>, John Kunze <***@***.***>, "Colvis, Christine (NIH/NCATS) [E]" <***@***.***>, "Southall, Noel (NIH/NCATS) [E]" <***@***.***>, "Gersing, Kenneth (NIH/NCATS) [E]" <***@***.***>, "Green, Eric (NIH/NHGRI) [E]" <***@***.***>, Paul Clemons
    <***@***.***>, Christian von Mering <***@***.***>, "Moore, Jason H." <***@***.***>, "Resnick, Adam C" <***@***.***>, "George A. Polisner" <***@***.***>, Andrew Hoppin <***@***.***>, Vivien Bonazzi <***@***.***>, Letisha Wyatt <***@***.***>, Avi Maayan <***@***.***>, "Greenberg,Jane" <***@***.***>, "***@***.***" <***@***.***>, "***@***.***" <***@***.***>, "***@***.***" <***@***.***>, "***@***.***" <***@***.***>, Justin B Starren <***@***.***>, Firas Wehbe <***@***.***>, Thongsy Singvongsa <***@***.***>, "***@***.***" <***@***.***>, "***@***.***" <***@***.***>
    Subject: [rda-legalinterop-ig] community collaboration on licensing for data reuse

  • Author: Melissa Haendel

    Date: 25 Sep, 2017

    Adding a few more folks and therefore pasting the conversation below.
    Please join the mailing list :-)
    There is also the work that was done by the Biden Blue Ribbon Panel on data sharing; we discussed licensing, attribution, and provenance quite a lot in that context. The report is here: https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiati...
    On Sep 25, 2017, at 11:11 AM, Hunter, Larry <***@***.***> wrote:
    On Sep 25, 2017, at 10:08 AM, Justin B Starren <***@***.***> wrote:
    If the only way I get tenure is based on writing papers about the data I have published, I am strongly disincentivized from sharing anything until after I get enough papers accepted to ensure my future promotion and tenure.
    Justin,
    Jeffrey Flier just published a very important (and I hope influential) editorial in Nature about exactly that: “Faculty promotion must assess reproducibility,” complete with several specific proposals about how. https://www.nature.com/news/faculty-promotion-must-assess-reproducibilit...
    Larry
