Composed by: Pedro Mendes (RDA/EOSC Future Ambassador for Catalysis), Daniel Costa
Downloadable disciplinary info sheet: Chemistry
What is chemistry data?
Chemistry studies the composition, structure, and properties of substances and the transformations that they undergo [1]. Thus, all data generated throughout such a study can be considered chemistry data. For more specific terminology questions, NFDI4Chem provides a terminology service, and a list of all terminologies tagged with Chemistry and registered with FAIRsharing (an output of the RDA FAIRsharing WG) is also available.
For a machine-interpretable definition, an ontology should be employed. A review of current ontologies for chemistry can be found here.
Where is chemistry data shared?
In addition to general-purpose repositories/databases, like Zenodo [2], there are specific ones for chemistry. Some sub-domains, like analytical chemistry and computational chemistry, have mature repositories, so sharing via these specific repositories is highly recommended. FAIRsharing provides a searchable registry of Chemistry databases as part of its larger registry of databases, standards and policies across all subject areas. Re3data provides a searchable database of repositories, while NFDI4Chem offers guidance on choosing “the right repository” in chemistry.
How is chemistry data shared (e.g. standards, guidelines, trusted examples)?
For a quick start, check the knowledge database of NFDI4Chem. You can also search the FAIRsharing registry, in particular its ecosystem of Chemistry standards. While full standardisation is still far off, some standards have been developed within the community. For instance, for chemical reactions, the Unified Data Model standard format can be used.
When no standards are available, generic guidelines based on the FAIR principles [3] should be applied (check the RDA ones listed below; see also here [4]). The underlying principle for chemistry is to report the full raw data of experimental or computational results together with the key determining variables. IUPAC, which also takes part in the WorldFAIR project, is running a series of projects that should issue chemistry-specific guidelines in the next couple of years.
Additionally, a variety of trusted examples are provided by NFDI4Chem and the Spotlight section can also serve as inspiration.
What are typical data and file formats for chemistry data?
Most chemistry data is best reported in tabular formats, but there are many dedicated formats for specific kinds of chemistry data. For instance, molecular structures can be represented in the InChI [5] or SMILES [6] formats.
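As a minimal illustration of these line notations, the sketch below stores ethanol in both forms and splits its InChI into layers using only the Python standard library. The helper name `split_inchi_layers` is hypothetical and the parsing is deliberately simplified; for real work a dedicated cheminformatics toolkit is the safer choice.

```python
def split_inchi_layers(inchi: str) -> dict[str, str]:
    """Split an InChI string into its '/'-separated layers.

    Simplified sketch: the first field is taken as the version, the second
    as the chemical formula, and each later layer is keyed by its one-letter
    prefix (e.g. 'c' for connectivity, 'h' for hydrogens). If a prefix
    occurs more than once, the later layer overwrites the earlier one.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    fields = inchi[len("InChI="):].split("/")
    layers = {"version": fields[0]}
    if len(fields) > 1:
        layers["formula"] = fields[1]
    for field in fields[2:]:
        if field:  # skip empty fields
            layers[field[0]] = field[1:]
    return layers


# Ethanol in both notations
ethanol = {
    "smiles": "CCO",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}

layers = split_inchi_layers(ethanol["inchi"])
print(layers)
```

The SMILES string is compact and easy to type, whereas the layered InChI is meant to serve as a standard identifier for a given structure, which is why both are often stored side by side.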
For data types and formats in chemistry, check DataCC and NFDI4Chem lists.
Tips on the best file formats for a given generic data format are provided by OpenAIRE and the 5-star Open Data classification. Formats should be accessible and interoperable, e.g. ideally CSV or JSON for tabular data.
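To make the point about accessible, interoperable formats concrete, here is a small sketch that writes the same table to CSV and JSON with the Python standard library. The column names and the two rows (water and ethanol with their molar masses) are illustrative choices, not a community standard.

```python
import csv
import io
import json

# Illustrative records; the unit is part of the column name so the table
# stays interpretable without extra documentation.
records = [
    {"name": "water", "smiles": "O", "molar_mass_g_per_mol": 18.015},
    {"name": "ethanol", "smiles": "CCO", "molar_mass_g_per_mol": 46.069},
]

# CSV: one header row, then one row per substance.
csv_buffer = io.StringIO()
writer = csv.DictWriter(
    csv_buffer, fieldnames=list(records[0]), lineterminator="\n"
)
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: the same table as a list of objects.
json_text = json.dumps(records, indent=2)

# CSV values come back as strings, so numeric columns are converted explicitly.
rows_from_csv = [
    {**row, "molar_mass_g_per_mol": float(row["molar_mass_g_per_mol"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]
rows_from_json = json.loads(json_text)

print(csv_text)
print(rows_from_csv == rows_from_json == records)
```

CSV keeps the table readable in any spreadsheet tool but loses type information, while JSON preserves numbers and nesting; which one fits better depends on who will consume the data.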
Collaborating disciplines
Chiefly, biology and materials science. Both disciplines border on chemistry topic-wise, and biology in particular is several steps ahead [4,7] in terms of open science practices, thus serving as inspiration.
RDA Groups active in this discipline
RDA Groups in this discipline that are no longer active
[TBC]
If your Working Group or Interest Group may be of relevance to those working in Chemistry, please email enquiries[at]rd-alliance.org to have your group added to this page.
Highlighted RDA Outputs
- FAIRsharing - an output of the FAIRsharing WG, the FAIRsharing Registry provides visualisation and description of the ecosystem of resources across all research domains, including the Chemistry subject area and a collection of Chemistry resources linked to the RDA Chemistry Research Data Interest Group (CRDIG). Is FAIRsharing missing any Chemistry standards, databases or policies? If so, please register them with FAIRsharing to enhance their visibility and discoverability. Some examples of resources useful for chemists available in the FAIRsharing Registry are:
  - Chemical Markup Language (CML): An XML language developed for chemistry, designed to hold chemistry concepts like molecules, reactions and properties.
  - Core Scientific MetaData model (CSMD): A metadata model used to support data collected within a scientific workflow. It is generic enough to work with different chemistry subfields.
  - IUPAC International Chemical Identifier (InChI): A machine-readable string that encodes information about a chemical compound. It works as a barcode for chemical species.
- FAIRsharing Community Champions - Launched under the auspices of the RDA / EOSC Future Domain Ambassadorship for standards, databases and policies, it includes representatives from a number of RDA WGs and EOSC clusters.
Are you an expert in Chemistry resources and believe you can contribute to FAIRsharing? If so, please consider joining the community champions.
Selected tools
Online machine-actionable tool developed by OpenAIRE to facilitate Research Data Management (RDM) activities concerning the implementation of Data Management Plans (DMPs). |
B2FIND | EUDAT metadata indexing service that provides a discovery portal which allows users to find data collections within an international and inter-disciplinary scope. |
OpenAIRE Mining Service | Service that performs text mining (entity resolution) on the metadata and the text of publications and extracts information on: links to projects/grants and funders; data citations or links to scientific database entries (e.g. links to entries in PDB - Protein Data Bank); document classification according to several taxonomies; software citations; author affiliations; references; document similarity. |
Useful resources
Training
- To get started with open science, check the Open Science Primers. Additional concept definitions are available at FAIRsharing.
- For chemistry-specific training, check NFDI4Chem and the WorldFAIR Chemistry community.
- For introductory workshops for groups, contact Pedro Mendes (RDA/EOSC ambassador for catalysis) pedro.f.mendes@tecnico.ulisboa.pt.
- Catalogues of open access training on open science are made available by both EOSC and OpenAIRE.
Services
- EOSC provides a catalogue of open science-related services.
- For an open electronic laboratory notebook, check Chemotion.
- To transform scholarly articles in PDF format into XML to facilitate document analysis, check GROBID (https://grobid.readthedocs.io/en/latest/) and CERMINE (http://cermine.ceon.pl/index.html).
- For data processing and analysis, both simple and relatively complex tasks can be performed with existing Python libraries like scikit-learn and seaborn. For more complex, ready-to-use tools, check the EOSC catalogue.
Networks
Join the Chemistry Research Data Interest Group (CRDIG) at RDA.
Developers and curators for FAIR data in chemistry can be found at FAIRsharing as members of the Community Champions programme.
References
1. Merriam-Webster dictionary, https://www.merriam-webster.com/dictionary/chemistry, on 11/01/2023.
2. Zenodo in FAIRsharing: https://doi.org/10.25504/FAIRsharing.wy4egf
3. FAIR Principles in FAIRsharing: https://doi.org/10.25504/FAIRsharing.WWI10U
4. P. S. F. Mendes, S. Siradze, L. Pirro, J. W. Thybaut, ChemCatChem 2021, 13, 836.
5. InChI in FAIRsharing: https://doi.org/10.25504/FAIRsharing.ddk9t9
6. SMILES in FAIRsharing: https://doi.org/10.25504/FAIRsharing.qv4b3c
Archived content
Chemical research data is fundamentally the most important product that chemists create as it guides future research and allows us to understand how chemistry works, what we can do with it and how it affects our lives. Chemistry, as a discipline, has long focused on its own specific needs in analyzing and communicating chemical data, especially the representation and identification of chemical structures. Today’s chemical informaticians (cheminformaticians) have pushed technology to make storing and searching chemical information easier and more standardized for increasing amounts of data.
The time has come, however, to take a step back and look at chemical informatics in relation to three needs: dealing with large amounts of heterogeneous data that are stored in different ways, transmitting chemical information to the many other disciplines that need it, and representing it semantically in digital form. Looking to the efforts of other disciplines, and understanding and appreciating the importance and impact of generalized data technologies and RDA outcomes, will strengthen chemistry and help move it toward open data in a way that is interoperable and forward thinking.
The Chemistry Research Data Interest Group (CRDIG) is focused on mechanisms by which we can improve chemical informatics and highlight its importance in the global data economy, specifically:
- Bringing together important stakeholders relative to open chemical data (e.g., the American Chemical Society - Division of Chemical Information, the International Union of Pure and Applied Chemistry (IUPAC), and others)
- Bridging the chemical informatics and RDA communities to help appreciate and understand what each has to offer
- Development of both RDA Working Group and IUPAC project proposals for important domain activities such as:
  - Establishing new or revised metadata, ontology, chemical structure, or data format standards
  - Characterization of the different chemical information types, identification of the critical points in the data life-cycle, and mapping of gaps in interoperability