EPFL researchers propose an open platform for chemical data management

Chemistry laboratories generate a significant amount of data. However, some of it is still in paper format and is difficult to access in its entirety. Three scientists from EPFL present a modular open science platform to manage the large amounts of data produced in chemistry research. Their study, entitled “Making the collective knowledge of chemistry open and machine-actionable”, was published in Nature Chemistry.

Managing data in modern chemistry is challenging. Take the synthesis of a new compound: finding the right reaction conditions involves a great deal of trial and error, which generates large amounts of raw data. This data is very important because, like humans, machine learning algorithms also learn from failed or partially successful experiments.

Currently, only the most successful experiments are published. Artificial intelligence, in particular machine learning, can make it possible to process the data from failed experiments, provided that it is stored in a machine-readable format that can be used by everyone.

Professor Berend Smit, who heads the Molecular Simulation Laboratory at EPFL Valais Wallis, explains:

“For a long time, we had to compress data because of the limited number of pages in paper journal articles. Today, many journals no longer even have paper editions. Yet chemists still face reproducibility problems because journal articles miss important details. Researchers waste time and resources reproducing the failed experiments of the authors. They have difficulty building on published results because the raw data are rarely published.”

Berend Smit, Luc Patiny and Kevin Jablonka of EPFL have published a perspective that presents an open platform for the entire chemistry workflow: from project initiation to publication.

Machine-readable FAIR data

Their main thesis is that, if we want to advance chemistry with data-intensive research and also solve reproducibility problems, we need to change the way experimental data is collected and reported.

Three steps are essential: data collection, processing, and publication, all at minimal cost to the researchers. The guiding principle is that data must be findable, accessible, interoperable and reusable (FAIR).

Berend Smit states:

“At the time of data collection, the data will be automatically converted into a standard FAIR format, which will allow for automatic publication of all failed or partially successful experiments, as well as the most successful experiment.”

The authors propose that the data should also be machine-readable.
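
To make the idea concrete, here is a minimal sketch of what a machine-readable experiment record could look like. It is an illustration only: the `ReactionRecord` layout, its field names and the sample values are hypothetical placeholders and are not taken from the authors' platform or from any community standard. The point it shows is that failed and partially successful runs are stored in the same structured form as the successful one, so that software can read them all back.

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical record layout, for illustration only. The field names are
# assumptions and do not come from the authors' platform.
@dataclass
class ReactionRecord:
    experiment_id: str
    temperature_c: float
    solvent: str
    outcome: str          # "success", "partial" or "failed"
    yield_percent: float


def to_json(records):
    """Serialize every record, including failed ones, to a JSON string."""
    return json.dumps([asdict(r) for r in records], indent=2)


def from_json(text):
    """Rebuild the records from the JSON string."""
    return [ReactionRecord(**item) for item in json.loads(text)]


# Invented placeholder values, not real measurements.
records = [
    ReactionRecord("exp-001", 25.0, "water", "failed", 0.0),
    ReactionRecord("exp-002", 60.0, "ethanol", "partial", 35.0),
    ReactionRecord("exp-003", 80.0, "ethanol", "success", 90.0),
]

serialized = to_json(records)
restored = from_json(serialized)

# A program can now select the failed or partial runs as easily as the
# successful one.
not_successful = [r for r in restored if r.outcome != "success"]
print(len(restored), "records,", len(not_successful), "not fully successful")
```

A real deployment would rely on existing community formats and richer metadata rather than an ad hoc layout like this one.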

Kevin Jablonka states:

“We are seeing more and more data science studies in chemistry. Indeed, the latest results from machine learning are trying to tackle some of the problems that chemists think are intractable. For example, our group has made significant progress in predicting optimal reaction conditions using machine learning models. These models would be much more valuable if they could also learn about failed reaction conditions, but they remain biased because only successful conditions are published.”

To establish a FAIR data management plan, the researchers propose five measures:

  • The chemistry community should adopt its own standards and solutions;
  • Journals should make it mandatory to deposit reusable raw data, where community standards exist;
  • We must accept the publication of “failed” experiments;
  • Electronic lab notebooks that do not allow all data to be exported in an open machine-readable form should be avoided;
  • Data-driven research must be part of our curriculum.

Luc Patiny states:

“We believe that there is no need to invent new file formats or technologies. In principle, we have all the technologies. We need to adopt them and make them interoperable.”

The authors point out that storing data in an electronic lab notebook, which is the current trend, does not mean that humans and machines can reuse it. Structuring and publishing the data in a standardized format is the best alternative provided there is sufficient context.

Berend Smit adds:

“Our perspective offers a vision of what we think are the key elements to bridge the gap between data and machine learning for fundamental problems in chemistry. We also provide an open science solution in which EPFL can lead by example.”

Article source:

Kevin Maik Jablonka, Luc Patiny, Berend Smit. Making the collective knowledge of chemistry open and machine-actionable. Nature Chemistry 4 April 2022. DOI: 10.1038/s41557-022-00910-7

Translated from “Des chercheurs de l’EPFL proposent une plateforme ouverte pour la gestion des données chimiques”