The Language Technologies Unit at BSC aims to advance the field of natural language processing (NLP) through cutting-edge research and development and the use of high-performance computing (HPC). The Unit has extensive experience in several NLP areas, such as massive language model building, machine translation, speech technologies, HPC, and unsupervised learning for under-resourced languages and domains. High-throughput language processing and massive language data annotation are part of the unique expertise of a team of computational linguists, data scientists, software engineers and researchers from various backgrounds and domains.
The Language Technologies Unit has been entrusted by the Spanish and Catalan governments with the mission of developing essential open-source resources and technologies that form a language technology and artificial intelligence infrastructure for the Spanish and Catalan languages. As the Technical Office coordinating these national projects, the Unit hosts benchmarking platforms (EvalES and CLUB) that help define and establish the state of the art for these technologies. In addition, the Unit participates in various EU-funded international projects and is deeply interested in promoting and transferring these technological advances to industry and to society in general.
The Language Technologies Unit maintains two portals that make its demos and benchmarks publicly accessible: planTL.bsc.es and aina.bsc.es. In addition, the Unit has developed a number of open-source resources that can be found in reference software and data repositories: Hugging Face (PlanTL-GOB-ES and projecte-aina), GitHub (PlanTL-GOB-ES, TeMU-BSC and projecte-aina) and Zenodo (spanish-ai and catalan-ai).
R+D Lines
- Language modeling: The language modeling research line is responsible for the development of large language models and their applications in natural language processing tasks. This team conducts research on deep learning and machine learning algorithms to enhance the accuracy, speed, efficiency, and versatility of these models.
- Speech technologies: The speech technologies team is responsible for developing machine learning models for tasks such as speech recognition, speech synthesis and speaker detection, as well as any other task that supports speech-related applications. The team builds full end-to-end pipelines for these tasks, from data gathering through to the deployment of efficient models in production.
- Machine translation: The translation technologies team is responsible for the collection and curation of parallel data, the creation of synthetic corpora, and the training and evaluation of translation models, including translation of sign language to oral language. Its main lines of research include data-scarce scenarios, fine-tuning of multilingual models, adaptation of large language models to the translation task, detection and mitigation of bias in translation, and tools for accessibility.
- Tech Transfer & Innovation: This line promotes the innovative use of artificial intelligence, and of language technologies in particular, to help companies, administrations and institutions adopt advanced tools that will help them grow and become more productive and competitive in the global digital world.
- Machine learning digital infrastructure: The Unit has a dedicated team of MLOps (Machine Learning Operations) experts who streamline and accelerate the development of machine learning models through the application of good practices. The team develops innovative solutions for the unique problems that HPC environments pose.
Objectives
- Advance the field of natural language processing through cutting-edge research and development: The Unit aims to continue its work in developing innovative solutions to NLP tasks, such as language modeling, speech technologies, and machine translation.
- Develop open-source resources and technologies with a special focus on large language models for Spanish and Catalan, but also for other co-official languages: The Unit has been entrusted with the mission to provide language technology and AI infrastructure for these languages, and is also responsible for hosting benchmarking platforms to define the state of the art.
- Promote and transfer technological advances to industry and society: The Unit is committed to transferring its expertise and knowledge to companies and institutions, helping them to adopt advanced tech tools and grow more productive and competitive in the global digital world.
- Streamline and accelerate the development of machine learning models: The Unit has a dedicated team of MLOps experts who work to improve the efficiency of the development process and tackle the unique challenges posed by HPC environments.
- Compile, design and store large corpora and datasets: The Unit also aims to gather the data needed to train modern, transformer-based language models and to fine-tune them for specific tasks.