"The completion of AINA will ensure that never again will any company large or small have an excuse for not incorporating Catalan into their textual, visual or audio services," said BSC associate director Josep M. Martorell
The goal of AINA is to ensure the future of the Catalan language in the digital world at the same level as other languages with global reach.
The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) will receive 12 million euros over the next four years to continue developing the AINA project. The Generalitat de Catalunya (Catalan Government) announced on Monday an annual investment of 3 million euros until 2026 to guarantee the continuity and completion of the initiative, which aims to secure the future of the Catalan language in the digital world at the same level as other global languages.
The announcement was made by the Catalan Minister of Business and Labour, Roger Torrent i Ramió, and the associate director of the BSC, Josep Maria Martorell, following a working meeting. Also taking part in the meeting were the secretary of Digital Policies, Gina Tost i Faus; the secretary of Linguistic Policy, Francesc Xavier Vila Moreno; the director of the BSC, Mateo Valero; and the head of the AINA project and co-leader of the Text Mining Unit of the BSC, Marta Villegas.
"Having multi-year funding until 2026 is exceptional news that allows us to give continuity to the team working on the AINA project and to continue investing in developing new technology. The completion of AINA will ensure that never again will any company large or small have an excuse for not incorporating Catalan into their textual, visual or audio services," said BSC associate director Josep M. Martorell after the meeting.
AINA is a project led by the BSC that is based on data technologies and artificial intelligence (AI) with the ultimate goal of making technology understand and speak Catalan, so that citizens can fully participate in the digital world in Catalan.
To achieve this goal, the AINA project is developing the infrastructure needed to make the inclusion of Catalan in AI applications sufficiently attractive and viable, both for large technology companies and for local industry. Any company or organization will be able to use the resources generated by AINA, such as corpora (massive data sets) and models of the Catalan language, to develop specific solutions or services (translators, personal assistants, speech synthesizers, text classifiers, etc.) in Catalan.
To date, the AINA project has created the largest text corpus ever compiled for the Catalan language. The corpus was built, and continues to grow, by downloading texts from different digital sources in Catalan (web pages, files, etc.) and processing them for use as training data for the neural networks behind the language models.
AINA has also started to build a large Catalan speech corpus, fed mainly by data obtained through the initiative "La nostra llengua és la teva veu" ("Our language is your voice"). The initiative calls on Catalan-speaking citizens to volunteer by donating their voice and validating the recordings contributed by other people through the Mozilla Common Voice platform.
Among the first prototypes developed in 2022, the following stand out: new synthetic voices trained by AINA and used in a virtual assistant from the company Bookline; an automatic transcription tool (oTranscribe+) that makes editing easier while guaranteeing data privacy; and a voice chatbot that answers questions about the AINA project and can serve as a basis for creating other conversational experiences in Catalan.
In 2023, work will continue along these lines to expand the text and speech corpora and the language models trained on them. At the end of the AINA project, in 2026, all the necessary pieces will be available so that any company or organization can combine them to create its own solutions or services, guaranteeing that they understand and speak Catalan correctly in any of its variants.