NLP for Biomedical Information Analysis

The Natural Language Processing for Biomedical Information Analysis (NLP4BIA) research group, led by Dr. Martin Krallinger at BSC, is a multidisciplinary team of engineers, computational linguists, healthcare experts, and software developers dedicated to the development, application and evaluation of Text Mining, Natural Language Processing (NLP) and Language Technology systems for diverse health and biomedical user scenarios. The NLP4BIA team focuses on creating publicly accessible, high-quality biomedical NLP resources to unlock key information and improve the analysis of a variety of unstructured data sources, including clinical reports, biomedical literature, clinical trials, patents, and social media content written in different languages, mainly Spanish and English but also others such as Catalan.

The HPC-empowered AI and deep learning-based text mining resources developed by the group form the basis for more sophisticated semantic search and information retrieval technologies, as well as large-scale semantic annotation and document indexing strategies that generate structured data from clinical and biomedical texts. This in turn enables enhanced data mining and analytics approaches that can be exploited for predictive modeling of healthcare data.

Moreover, in order to address a key bottleneck influencing the development of robust biomedical NLP solutions, namely the lack of access to high quality annotated Gold Standard datasets (with well-defined data selection criteria, annotation guidelines and consistency/quality analysis), the NLP4BIA group is working on the creation and release of biomedical corpora and annotation protocols.

In this line, the NLP4BIA group has developed and released a range of corpora that have helped foster the development of new, cutting-edge deep learning, Transformer and language model-based solutions by the global NLP research community through high-impact open benchmark shared tasks (e.g. BioCreative, IberEVAL, IberLEF, BioASQ CLEF, eHealth CLEF, Biomedical WMT, BioNLP-OST or SMM4H).

Through international and national research collaborations, the NLP4BIA group's research aims to unlock information from unstructured health data. This information is critical to empower AI-based medical data analytics tools that benefit both research and public healthcare systems (professionals, patients, industry) by integrating technology into the healthcare value chain for clinical applications and use cases.

Among the biomedical NLP practical exploitation scenarios, the group is working on use-cases related to cardiology and cardiovascular diseases (e.g. heart failure), occupational health, biomaterial and chemical entity text mining, rheumatology and rare diseases, COVID-19, cancer (incl. comorbidities and tumor morphology), as well as generation of knowledge graphs from text (e.g. extraction of gene regulatory networks, and drug-target interactions).

The group participates in several international consortia, such as the BIOMATDB project, the DataTools4Heart project and AI4HF. It also participates in national projects such as AI4PROFHEALTH and BARITONE.

Much of the group's research output is openly available on Zenodo (medical-nlp), GitHub (PlanTL-GOB-ES, TeMU-BSC) or YouTube (Biomedical Text Mining), as well as on websites created by the group for specific shared tasks and resources (see DrugProt, PharmaCoNER, MESINESP2, CANTEMIST, DisTEMIST, MedProcNER, LivingNER, MEDDOCAN, MEDDOPROF, MEDDOPLACE or ClinSpEn). The group also offers online demonstrators of some of the developed systems at textmining.bsc.es.

R+D Lines

- Deep learning-based semantic annotation technologies: A key aspect of exploiting unstructured data is the automatic detection, extraction, and harmonization of relevant concepts from texts, such as mentions of diseases/disorders, symptoms/signs, procedures, genes/proteins/chemicals, medications, observable entities, professions/occupations, or places. Linking such mentions to structured vocabularies or terminologies is critical for advanced semantic search technologies, data analytics, predictive modeling, and knowledge discovery or knowledge graph generation from text. We develop, evaluate, adapt and apply Transformers, deep learning-based approaches and language models using BSC-HPC infrastructure for biomedical named entity recognition and entity linking.

- Applied Text Mining and unstructured content processing: One of our main aims is to implement real-world text mining applications that are directly aligned with practical use case scenarios in biomedical research or healthcare. The application scenarios we are working on include NLP applied to cardiovascular diseases (including heart failure), infectious disease text mining (COVID-19, respiratory infections), rare diseases and clinical phenotype mining (e.g. rheumatic diseases), and NLP and text mining for biomaterials.

- Multilingual clinical NLP: A large share of healthcare data is produced in English, especially research data and scientific literature. Nonetheless, most clinical content, in particular clinical records, is written in a wide variety of other languages. We are working on the development of NLP resources not only for English, but also for other languages such as Spanish and Catalan.

- Benchmarking, quality assessment and data annotation: A key issue for the development of NLP components in healthcare is to determine the quality of their results. Our group has been particularly active in constructing high quality Gold Standard corpora and using them for training and benchmarking purposes to enable quality evaluation of NLP tools through international evaluation campaigns and scientific shared tasks.

- Anonymization and privacy-preserving NLP: One of the main obstacles to the development and exploitation of clinical NLP and data mining applications is access to privacy-preserving data collections that comply with the existing legal framework governing sensitive data. We have been working on the implementation of technical solutions, anonymization protocols and strategies to detect sensitive data elements in clinical documents, as well as evaluation scenarios to account for potential leakage and the technical evaluation of anonymization results.
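
As a purely illustrative sketch of how the semantic annotation and benchmarking lines above fit together, the following Python example detects mentions, links them to vocabulary codes, and scores the result against a small Gold Standard. It is not the group's method: it swaps the Transformer-based models for a simple lexicon lookup, and the lexicon, the codes (C001, C002, C003), the example sentence and the exact-match scoring are all assumptions made up for this example. Actual shared tasks may define their metrics differently.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    label: str  # entity type, e.g. "DISEASE"
    code: str   # placeholder identifier in a target vocabulary


# Hypothetical toy lexicon: surface form -> (entity type, placeholder code).
LEXICON = {
    "heart failure": ("DISEASE", "C001"),
    "fever": ("SYMPTOM", "C002"),
}


def annotate(text: str) -> set[Annotation]:
    """Find lexicon terms by case-insensitive substring search and link them."""
    found = set()
    lowered = text.lower()
    for term, (label, code) in LEXICON.items():
        start = lowered.find(term)
        while start != -1:
            found.add(Annotation(start, start + len(term), label, code))
            start = lowered.find(term, start + 1)
    return found


def score(gold: set[Annotation], predicted: set[Annotation]) -> tuple[float, float, float]:
    """Exact-match precision, recall and F1 over (span, label, code) tuples."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


text = "Patient admitted with heart failure, fever and dyspnea."
# Invented gold annotations; "dyspnea" is deliberately missing from the lexicon.
gold = {
    Annotation(22, 35, "DISEASE", "C001"),
    Annotation(37, 42, "SYMPTOM", "C002"),
    Annotation(47, 54, "SYMPTOM", "C003"),
}
predicted = annotate(text)
precision, recall, f1 = score(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because the toy lexicon has no entry for "dyspnea", the system finds two of the three gold mentions, so precision is perfect while recall drops. This is the kind of gap that evaluation against a Gold Standard corpus is meant to expose.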

Objectives

- Generate NLP resources that go beyond the state of the art to automatically process, extract and exploit highly relevant information for biomedical and clinical application scenarios from heterogeneous content types (clinical records, trials, literature or social media), including semantic annotation, NER, entity linking and text similarity components.
- Foster transparent, community-wide quality evaluation scenarios, shared tasks and benchmark environments to promote the development of highly competitive technical NLP and text mining solutions aligned with practically relevant healthcare and biomedical research use cases.
- Promote the transfer, adaptation and training of clinical NLP and Text Mining applications tailored to the needs and characteristics of clinical professionals (hospitals), healthcare and biomedical research users, and the language technology and health-tech industry.
- Generate and share annotated datasets, corpora, lexical and terminological resources, content collections and annotation protocols to serve as a basic infrastructure for generating reproducible and extendable text mining strategies.
- Develop predictive models, hypothesis generation solutions, knowledge graphs and databases/data lakes to empower advanced health data analytics solutions.