AINA is born, the project that will guarantee the survival of the Catalan language in the digital age

10 December 2020
The project is promoted by the Department of Digital Policies with the collaboration of BSC.

The AINA project will generate the digital and linguistic resources necessary to facilitate the development of voice assistants, automatic translators or conversational agents in Catalan

The ultimate goal is for citizens to be able to participate in Catalan in the digital world at the same level as speakers of a global language, such as English, and thus avoid the digital extinction of the language

The project has a budget of €13.5M, to be financed with NextGenerationEU funds, and starts with an initial contribution of €250,000 from the Department of Digital Policies

The first resource generated is the Catalan corpus for training Artificial Intelligence (AI) algorithms, the largest created so far, comprising 1,770 million words

The next step will be to generate the language models, speech models and models for translation using multilayer neural networks

Providing Catalan with the digital and linguistic resources it needs to become a competitive language in the digital world, and thus ensuring its future survival, is the objective of the AINA project. The Minister of Digital Policies and Public Administration, Jordi Puigneró, presented it today at a press conference, accompanied by the Director General of Digital Society, Joana Barbany; the associate director of the Barcelona Supercomputing Center - National Supercomputing Center (BSC), Josep Maria Martorell; and the researcher and co-leader of the BSC data mining unit, Marta Villegas, who is responsible for the project.

Promoted by the Department of Digital Policies, with the collaboration of the BSC, the AINA project will generate corpora and computational models of the Catalan language so that companies that create applications based on artificial intelligence (AI), such as voice assistants, automatic translators and conversational agents, can easily do so in Catalan.
 

Budget and scope of the project

The AINA project has an overall budget of 13.5 million euros for the period 2020 to 2024 and is one of the projects that the Department of Digital Policies has prioritized for financing with European NextGenerationEU funds. For the moment, it starts with an initial contribution of €250,000 that the Department of Digital Policies has assigned to the BSC to expand the corpus of the Catalan language and thus obtain linguistic models that cover its different variants and registers.

The BSC already has a first textual corpus of Catalan, consisting of 1,770 million words gathered in 95 million sentences. This corpus, the largest ever compiled in the Catalan language, was obtained by downloading texts from different digital sources (web pages, files, etc.), cleaning them and removing duplicates.

The Catalan Government has provided all the information on its web pages and the DOGC, which accounted for 33% of all downloaded content. It took 2,000 processor hours on the MareNostrum supercomputer to review the data obtained and to eliminate duplicates and anything that was not a proper sentence in Catalan.
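As an illustration only, the cleaning step described above (removing duplicates and anything that is not a Catalan sentence) could be sketched in Python as follows. This is not the BSC's actual pipeline: the function names, the small word list `CATALAN_HINTS` and the `min_ratio` threshold are hypothetical, and a real system would use a proper language-identification model rather than this crude heuristic.

```python
import hashlib
import re
import unicodedata

# Hypothetical, tiny list of frequent Catalan function words (an assumption
# for this sketch; a real pipeline would use a language-identification model).
CATALAN_HINTS = {
    "el", "la", "els", "les", "de", "que", "i",
    "és", "amb", "per", "una", "un", "en", "no",
}


def normalize(sentence: str) -> str:
    """Apply Unicode NFC, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", sentence).lower()
    return re.sub(r"\s+", " ", text).strip()


def looks_catalan(sentence: str, min_ratio: float = 0.2) -> bool:
    """Crude check: share of words that are common Catalan function words."""
    words = re.findall(r"\w+", normalize(sentence))
    if len(words) < 3:
        return False
    hits = sum(1 for word in words if word in CATALAN_HINTS)
    return hits / len(words) >= min_ratio


def clean_corpus(sentences):
    """Keep the first copy of each sentence that passes the language check."""
    seen = set()
    kept = []
    for sentence in sentences:
        # Hash the normalized form so the set of seen keys stays compact.
        key = hashlib.sha1(normalize(sentence).encode("utf-8")).hexdigest()
        if key in seen or not looks_catalan(sentence):
            continue
        seen.add(key)
        kept.append(sentence.strip())
    return kept


raw = [
    "El gat és a la teulada.",
    "el gat  és a la teulada.",   # duplicate after normalization
    "The cat is on the roof.",    # not Catalan
    "La llengua és de tots i per a tots.",
]
cleaned = clean_corpus(raw)
```

Hashing a normalized form catches copies that differ only in case or spacing. Near-duplicates with small wording changes would need a fuzzier technique, which this sketch does not attempt.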

This first corpus was built with funding from the Plan for the Promotion of Language Technologies of the Vice-Presidency of Economic Affairs and Digital Transformation of the Spanish Government.

Now, with the backing of the Department of Digital Policies, new corpora will be created to incorporate the different dialectal variants of Catalan, different linguistic registers (colloquial, literary, administrative, etc.) and voice and image files. The Catalan Audiovisual Media Corporation will provide its entire documentary repository.

With all this information, the next step will be to train multilayer neural networks so that they "learn Catalan" and generate language models, speech models and translation models. These models are also very expensive to build because they require great computational capacity (the one being built from the first textual corpus will use 9,000 GPU hours). They will be the foundation on which AI-based applications can be developed, such as voice assistants, text predictors and language checkers, chatbots, automatic summarization applications, intelligent search, sentiment analysis applications, translation engines and automatic subtitling, among others.

All the models that the BSC will create will be available to all those companies or entities that want to use them, since they will be published openly and with permissive licenses.
 

The digital world, an opportunity and a challenge for Catalan

This should allow Catalan to make a qualitative and quantitative leap in the digital ecosystem. The digital world is today both an opportunity and a challenge for the Catalan language. Voice technologies and voice applications and interfaces for accessing the digital world are now strategic for the full development of the language in all sectors. Interaction between people and technology has entered a new phase in which less and less is done through devices such as the keyboard, mouse or touch screen, giving way to a new, more natural form of interaction through voice and speech. This gives special relevance to language, which becomes one of the main vehicles of interaction.

This new interaction must also be possible in Catalan. To that end, the Government firmly intends to guarantee that citizens can speak and interact in Catalan in the digital world at the same level as speakers of other languages, such as English or Spanish, whose digital survival is currently guaranteed because they are backed by states that have invested in providing sufficient resources for learning techniques and neural networks in Artificial Intelligence.

A study carried out in 2011 by the European network of excellence META-NET, with the contribution of more than 200 experts in Language Technologies, warns that more than 20 European languages, including Catalan, face digital extinction if they do not receive more technological support in four areas: machine translation, voice interaction, textual analysis and the availability of linguistic resources.
 

AI and Language Technologies

Language Technologies are those that we already use in our day-to-day lives when we automatically correct a text in an email, use a web browser on the Internet, automatically translate a web page, give voice commands on a mobile phone, interact with virtual assistants or follow the directions of a GPS navigator, among others. They are also the technologies that will allow us to converse naturally with computers, domestic appliances and even our vehicles.

The new Artificial Intelligence technologies and Language Technologies are based on applying algorithms to large, high-quality data sets, but the data sets on which the algorithms are trained are specific to each language.

For example, large multinationals such as Google, Apple and Microsoft use digital English resources created by the Defense Advanced Research Projects Agency (DARPA) of the US Department of Defense, which have been the linguistic base of AI worldwide, since it would be very expensive for a single company to generate these resources itself.

 

A strategic project

The AINA project is part of the Government's digital strategy, through two initiatives led by the Department of Digital Policies: the Interdepartmental Board of Directors for the promotion of Catalan on the Internet and in advanced digital technologies, approved in December 2018, and the Artificial Intelligence Strategy of Catalonia (Catalonia.AI), approved in February 2020.

The first involves the General Directorate of Digital Society, the General Directorate of Language Policy, the General Directorate of Media, the Catalan Cybersecurity Agency, the General Directorate of Citizen Attention and the puntCAT Foundation, and has among its objectives promoting the presence of Catalan in voice assistants. For its part, one of the priority axes of the Catalonia.AI strategy is linked to the standardized use of the Catalan language in interfaces as a key element in the development of AI, since language is the basic means of communication for accessing, using and interacting with these technologies.

 

AINA, a name where language and technology converge

The project has been named AINA in homage to the Menorcan philologist Aina Moll, a central figure in the promotion and normalization of Catalan and the first Director General of Language Policy of the Generalitat of Catalonia, from 1980 to 1988. She was the architect of the launch, in 1982, of the first institutional awareness campaign on the use of the language, 'el català, cosa de tots', which, with the popular character of “Norma, al capdavant”, aimed to raise awareness in society about the sociolinguistic situation of Catalan. A year later, the first language normalization law was approved.

The name AINA also contains a reference to the technology (AI: Artificial Intelligence) that will make it possible to normalize Catalan in the digital sphere.