China: artificial intelligence used to design language databases

23 April 2021

China has launched the second phase of its project to protect national language resources as part of an extensive programme. The country aims to preserve all of its dialects and ethnic minority languages. Artificial intelligence technology is being used to great effect.

Protecting language resources

In 2019, in Beijing, UNESCO and the Chinese Ministry of Education jointly issued the “Yuelu Proclamation”, a document aimed at promoting and protecting linguistic diversity worldwide. The text encouraged national language institutions, universities, NGOs and other public or private institutions in UNESCO member countries to apply a range of techniques and methods to protect linguistic diversity within their countries.

China's own language protection project has existed since 2015. Launched by the Ministry of Education and the Chinese Language Commission, it initially aimed to identify, present and develop language resources and to protect endangered languages. Gradually, however, the corpus has expanded to include all the country’s languages and dialects.

Cao Zhiyun, director of the Center for the Protection of Language Resources of China, explained:

“Languages and dialects are disappearing rapidly. A language dies out every two weeks, so we are racing against time to save them. This is also a good way to protect and pass on Chinese culture.”

The use of artificial intelligence

More than 350 colleges, universities and research institutes have joined the project so far, involving over 4,500 professionals. During the first phase of the project, a large data collection and recording platform was designed to catalogue all the language resources existing in the country. As of October 2020, 1,712 sites, including 103 with endangered Chinese dialects, had been surveyed and their data collected. The programme has covered 34 provinces or regions of China and 123 languages.

In particular, speech recognition and speech synthesis were used to better preserve languages and dialects by building up written and spoken-language databases, including those of all ethnic minorities in the country. The second phase will use these databases to promote standard written and spoken Chinese across the country while protecting local languages and dialects, which will also be made available for anyone who wishes to learn them.

Translated from Chine : l’intelligence artificielle utilisée pour concevoir des bases de données linguistiques