New RDA working group on FAIRification of genome sequence annotations (tracks) - Kickoff meetings in September

Date: 25 Sep 2023 - 14:00 to 15:30 UTC

These online meetings will kick off the work of the new RDA Working Group on 'FAIRification of Genome Sequence Annotations'. Through a community effort, we want to harness and mobilise the wealth of functional genomics and other genome-mapped datasets across projects, genomes and species, spanning both biomedical and biodiversity domains. We aim to achieve this through consolidated FAIR (Findable, Accessible, Interoperable, and Reusable) metadata and connected tools, expanding on previous work from ELIXIR and EMBL-EBI (https://fairtracks.net).

 

We are eager to share our vision and technical prototypes, facilitate an open exchange of ideas, and encourage you to think big about the potential your expertise might bring to the table. The work will continue as an open collaboration in a new Working Group (WG) in the Research Data Alliance (RDA), with an estimated duration of 18 months. The initiative has been pre-selected as a demonstrator WG to take part in the Horizon Europe-funded RDA TIGER project, which will provide various support services to the WG throughout its lifespan.

 

Prior to the session, attendees have the opportunity to provide input on the group’s Case Statement, which defines the group’s charter, value proposition, and engagement with other initiatives in the area. More information and relevant links can be found in the working document of the initiative.

 

September 25 @ 14:00-15:30 UTC: REGISTER HERE

 

Please read on for more information on the background of this Working Group.

 

----------------------------------------------------------------------------------

 

Motivation

Immense research funding and countless work hours are invested in generating genomic datasets, in both large-scale consortium initiatives and smaller-scale research projects, past, present, and future. Mapping experimental and aggregated datasets onto genome sequences provides a powerful unifying model for data-driven analysis across data types, single cells, cell lines, tissues, genomes, and species. The advantages of anchoring functional elements to genomic coordinates remain even as reference genomes give way to personal genomes and reference pangenomes in biomedical research, or to per-species genome samples arising from the many biodiversity undertakings.

 

As AI methodologies grow rapidly, so does the potential of consolidating metadata for sequence-mapped datasets across projects, omics and species. In pursuit of this potential, we invite you to join an open collaboration facilitated by the Research Data Alliance (RDA). This collaboration seeks to adopt and enhance the existing schemas and infrastructure established by the FAIRtracks project (https://fairtracks.net) by connecting them with new user communities, data sources, and tools.

 

Our collaborative efforts will be in line with cutting-edge standards and advancements for achieving Findable, Accessible, Interoperable, and Reusable (FAIR) data, such as high-precision genome identifiers and recognised ontologies for precise conceptual descriptions, and will enable granular and unified data discovery at the level of individual files.

 

FAIRtracks

FAIRtracks started as a proof-of-concept implementation study funded by ELIXIR, the pan-European research infrastructure for biological information, and is now recognized as an ELIXIR Recommended Interoperability Resource. Starting from the point of view of researchers, we recognized the major practical difficulties in locating track data relevant to specific analytical contexts, despite the considerable efforts of larger consortia and smaller research projects to make their data public through repositories, data portals, genome browsers and track hubs. We developed the FAIRtracks infrastructure as a proposed solution to these issues. At the core is a set of schemas proposed as a metadata exchange standard. Around this, we built a set of services and tools, including a central search service, a validation service and a library for building and deploying scalable data flows to continuously transform metadata from various sources.
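
To make the idea of transforming source metadata and then validating it against a shared schema more concrete, here is a minimal Python sketch. It is an illustration only and does not reproduce the actual FAIRtracks schemas, services or library: the field names, the `CROSSWALK` mapping, the `REQUIRED` list and the `transform` and `validate` helpers are all hypothetical.

```python
from typing import Any

# Hypothetical crosswalk: source field name -> unified field name.
# These names are assumptions for illustration, not FAIRtracks fields.
CROSSWALK = {
    "assembly": "genome_assembly",
    "cell": "sample_type",
    "url": "file_url",
}

# Hypothetical set of fields that a unified record must contain.
REQUIRED = ("genome_assembly", "sample_type", "file_url")


def transform(source: dict[str, Any]) -> dict[str, Any]:
    """Rename known source fields; drop fields that have no mapping."""
    return {CROSSWALK[key]: value for key, value in source.items() if key in CROSSWALK}


def validate(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    return [
        f"missing required field: {name}"
        for name in REQUIRED
        if not record.get(name)
    ]


# A source record that lacks a file URL and carries an unmapped field.
record = transform({"assembly": "GRCh38", "cell": "K562", "lab": "x"})
print(record)
print(validate(record))  # reports that file_url is missing
```

Keeping the transformation and the validation as separate steps means that each metadata source only needs its own crosswalk, while the check against the shared schema stays the same for all sources.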

 

For more information on FAIRtracks, please see the FAIRtracks.net website, the blog post and paper published by F1000Research or, for a lighter introduction, the recent poster and presentation.

 

Going global

The current initiative for a 'FAIRification of Genomic Sequence Annotations' Working Group in the RDA is a step up from implementing prototypes in a smaller group. We now invite data producers, tool developers, domain experts, RDM/FAIR specialists and analytical end users to a broad collaboration on global solutions that significantly enhance the potential for data-driven life science. The exact use cases we will support are still open for debate, but here are some examples to give an idea of our vision:

  • Precise categorical search based on unified metadata across data portals, track hubs and genomes

  • Persistent and versioned long-term storage of metadata

  • Integration of search capabilities in genome browsers, analysis tools and programming libraries

  • Precise sequence-derived identifiers of genomes and genome browser instances through adoption of the upcoming GA4GH sequence collection standard

  • Support for genome annotation processes for novel biodiversity assemblies, meeting challenges recently raised by the Earth Biogenome Project

  • Maintainable and scalable metadata transformation flows, following current best practices for metadata crosswalks, ontology mapping, etc.
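
As an illustration of what a sequence-derived identifier can look like, the sketch below computes a truncated SHA-512 digest (the first 24 bytes, base64url-encoded), the digest style used in GA4GH specifications. It is a simplified example under stated assumptions: the `sequence_identifier` helper, the uppercasing step and the `SQ.` prefix are illustrative choices for a single sequence, and the sequence collection standard itself defines a more involved procedure over whole collections that is not shown here.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """Truncated SHA-512 digest: first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_identifier(sequence: str) -> str:
    """Derive an identifier from sequence content alone.

    Assumption for this sketch: the sequence is uppercased before hashing
    and the result is prefixed with "SQ.".
    """
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


# The identifier depends only on the sequence content, not on its name or source.
print(sequence_identifier("ACGT"))
print(sequence_identifier("acgt") == sequence_identifier("ACGT"))
```

Because the identifier is computed from the content, two resources that refer to the same sequence arrive at the same identifier without consulting a central registry.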

 

A biodiversity challenge

The recent surge in biodiversity projects and initiatives raises a particular challenge that this working group might be well placed to tackle: maintaining a uniform metadata schema that supports ongoing annotation of novel genome assemblies, following the recommendations of the Earth Biogenome Project (EBP). A recent report from EBP directly states:

 

There is a great need for metadata standards, file format standards, versioning and annotation quality metrics to be formalized for annotation. Currently there is little standardization in any of these areas.

"Report on Annotation Standards" (June 2023), Earth Biogenome Project

 

In the best-case scenario, the output of the 'FAIRification of Genomic Sequence Annotations' WG has the potential to become such a metadata standard, or at least to provide a uniform mapping between the different metadata schemas arising in different biodiversity projects.