BSC uses the latest advances in AI to develop Spanish terminology learning systems for the TeresIA project

09 July 2024

The development of medical terminologies, essential to the digitalisation of the healthcare sector, is one of the application scenarios in which BSC is participating

There is a growing interest in the use of artificial intelligence (AI) and language processing technologies for the creation of terminologies related to high-impact subject areas such as health, biomedical research or the legal domain. Resources such as dictionaries, ontologies or controlled vocabularies play a fundamental role in characterising, classifying and exploiting the huge volumes of textual data generated on a daily basis.

Terminologies are embedded in a wide variety of technological solutions such as advanced search tools, machine translation systems, conversational AI or question-answering systems.

It is therefore necessary to promote the development of AI systems that learn, for content in Spanish, to automatically recognise scientific or technical terms and to detect the relationships between them. This would allow the development of computational resources capable of systematically enriching existing terminologies. These are among the basic objectives of the TeresIA project (Terminologies in Spain and artificial intelligence services), in which BSC's Natural Language Processing for Biomedical Information Analysis (NLP4BIA) unit is participating.

The TeresIA project is part of the strategic axes of the National Artificial Intelligence Strategy (ENIA), which aims to develop data platforms and technological infrastructures that support AI and contribute to boosting the country's digital transformation process. The project was presented to the European Commission last December and obtained funding of 1.4 million euros from the Secretary of State for Digitalisation and Artificial Intelligence of the Ministry of Digital Transformation. TeresIA is also part of the Artificial Intelligence Strategy 2024 approved by the government in May, and has received an award in the entrepreneurship and research category of the Internet Prizes.

TeresIA, led by Elea Giménez's group at the Spanish National Research Council (CSIC), also involves a variety of groups, including the Cervantes Institute, the Polytechnic University of Madrid, the Spanish Terminology Association and the European Commission's Directorate-General for Translation.

The TeresIA project is directly linked to the development of a public terminology platform that will provide unified access to all Spanish terminology through a powerful meta-search engine. The development of this portal therefore contributes to opening up terminologies and to their shared use, in the first instance by the scientific community and additionally by the Spanish productive sector. To this end, TeresIA combines language models and AI systems with the qualitative work of linguists and specialists in different subject areas to implement intelligent systems for generating terminology resources. Access to computational resources such as those of MareNostrum 5, part of the BSC facilities, is essential for implementing such systems based on AI models.

BSC's NLP4BIA unit is involved in key technological aspects of the TeresIA project, such as access to content, the development of datasets for training intelligent systems and the implementation of tools that use language models to extract terms and semantic relations. BSC is also in charge of the technical evaluation of term extraction algorithms, analysing the quality, robustness, interoperability and scalability of the implemented solutions. BSC's participation in TeresIA includes the development of KeyCARE, a library designed for key term extraction, term classification and the extraction of relationships between terms. This library has already been published and will be presented in November at the Congress of the Spanish Society of Biomedical Engineering 2024 in Seville.

The application scenarios and domains of use contemplated by the TeresIA project in which BSC participates include the development of medical terminologies, which play an essential role in the digitalisation of the healthcare sector, and the use of terminologies in multilingual scientific information retrieval systems.