Meta AI and Papers with Code present Galactica, an open source scientific language model

Meta AI and Papers with Code, an autonomous team within Meta AI Research, introduced Galactica on Nov. 15: an open source language model with 120 billion parameters, trained on a large scientific corpus, that can store, combine and reason about scientific knowledge. The goal is to help researchers find useful information in the mass of available literature. The announcement has already stirred controversy in the scientific community.

Galactica was trained on a corpus of over 360 million contextual citations and over 50 million unique references normalized across a diverse set of sources, allowing it to suggest citations and help discover related articles. Among these sources is NatureBook, a new high-quality scientific dataset that exposed the model to scientific terminology, mathematical and chemical formulas, and source code.

Managing the plethora of scientific information

Information overload is a major obstacle to scientific progress. Researchers are buried under a mass of articles and find it difficult to locate information useful to their research.

Galactica is a large language model (LLM) trained on more than 48 million articles, textbooks, reference materials, compounds, proteins and other sources of scientific knowledge. Academic researchers can use it to explore the literature, ask scientific questions, write scientific code, and more.
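As a concrete illustration, the released checkpoints can be queried through the Hugging Face transformers library. The sketch below is a minimal example assuming the publicly released facebook/galactica-1.3b checkpoint (a smaller sibling of the 120B model); the prompt and generation settings are illustrative, not the authors' recommended usage.

```python
# Minimal sketch: free-form scientific generation with a released
# Galactica checkpoint via Hugging Face transformers.
# The 1.3B checkpoint is used here for illustration; larger sizes
# (up to 120B) were released under the same naming scheme.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompt = "The attention mechanism in neural networks"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding, kept short; sampling parameters are a matter of taste.
outputs = model.generate(input_ids, max_new_tokens=60)
print(tokenizer.decode(outputs[0]))
```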

The dataset

The dataset was created by tokenizing information from various scientific sources. For the interface, the team used task-specific tokens to support different types of knowledge. Citations, for example, are wrapped in a special token, which allows the model to predict a citation from any input context.
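In the released model this citation token is written [START_REF]: placing it in a prompt asks the model to complete the reference that fits the surrounding context. A minimal sketch, reusing the model and tokenizer loaded above:

```python
# Citation prediction: the [START_REF] token from Galactica's interface
# cues the model to fill in a reference suited to the context.
prompt = "The Transformer architecture [START_REF]"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(outputs[0]))
```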

Step-by-step reasoning was also wrapped in a special token, which mimics an internal working memory.
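That token is written <work> in the paper: appending it after a question asks the model to emit its intermediate working before the final answer. A sketch along the same lines, with an illustrative question:

```python
# Step-by-step reasoning: the <work> token cues the model to produce
# a scratchpad of working steps before its final answer.
prompt = "Question: What is the derivative of x^3 + 2x?\n\n<work>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```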

The results

Galactica has performed very well in many scientific fields.

In tests of technical knowledge such as recalling LaTeX equations, Galactica outperformed the latest GPT-3, scoring 68.2% versus 49.0%. It also performed well on reasoning, beating Chinchilla on MMLU math (41.3% vs. 35.7%) and PaLM 540B on MATH (20.4% vs. 8.8%).

It also sets a new state of the art on downstream tasks such as PubMedQA and MedMCQA, with scores of 77.6% and 52.9% respectively. And although it was not trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench.

For the researchers, these results demonstrate the potential of language models as a new interface for science. They have released the model as open source for the benefit of the scientific community.

The controversy

The Galactica website itself warns that there is no guarantee of truthful or reliable output from language models, and that their suggestions should be verified before being followed: “Some of the texts generated by Galactica may seem very authentic and confident, but may be subtly wrong in many ways. This is especially the case with highly technical content.”

Galactica should be seen as a writing tool, as Yann LeCun, Meta's Chief AI Scientist, noted on Twitter:

“This tool is to writing on paper what driver assistance is to driving. It won’t automatically write papers for you, but it will dramatically reduce your cognitive load while you write them.”

Gary Marcus, an AI scientist, and Michael Black, Director of the Max Planck Institute for Intelligent Systems, nevertheless reacted on Twitter, warning that false information generated by Galactica could be picked up in scientific submissions and mislead readers.

Meta AI and Papers with Code have not yet commented, but they have taken the public Galactica demo offline.

Article sources:

“Galactica: A Large Language Model for Science”
arXiv:2211.09085v1, 16 Nov 2022

Authors:
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic.
Meta AI

Translated from Meta AI et Papers with Code présentent Galactica, un modèle de langage scientifique open source