The Natural Language Processing for Biomedical Information Analysis (NLP4BIA) research group, led by Dr. Martin Krallinger at BSC, is a multidisciplinary team of engineers, computational linguists, healthcare experts, and software developers dedicated to the development, application, and evaluation of Text Mining, Natural Language Processing (NLP), and Language Technology systems for a wide range of health and biomedical user scenarios. The NLP4BIA team focuses on creating publicly accessible, high-quality biomedical NLP resources that unlock key information and improve the analysis of a variety of unstructured data sources, including clinical reports, biomedical literature, clinical trials, patents, and social media content. These sources are written in different languages, mainly Spanish and English, but also others such as Catalan.
The HPC-powered AI and deep learning-based text mining resources developed by the group form the basis for more sophisticated semantic search and information retrieval technologies, as well as for large-scale semantic annotation and document indexing strategies that generate structured data from clinical and biomedical texts. This in turn enables enhanced data mining and analytics approaches that can be exploited for predictive modeling of healthcare data.
Moreover, a key bottleneck in the development of robust biomedical NLP solutions is the lack of access to high-quality annotated Gold Standard datasets (with well-defined data selection criteria, annotation guidelines, and consistency/quality analysis). To address it, the NLP4BIA group is working on the creation and release of biomedical corpora and annotation protocols.
In this line of work, the NLP4BIA group has developed and released a range of corpora that have helped foster the development of new, cutting-edge deep learning, Transformer, and language model-based solutions by the global NLP research community through high-impact open benchmark shared tasks (e.g. BioCreative, IberEVAL, IberLEF, BioASQ CLEF, eHealth CLEF, Biomedical WMT, BioNLP-OST or SMM4H).
Through international and national research collaborations, the NLP4BIA group's research output aims to unlock information from unstructured health data, which is critical to empower AI-based medical data analytics tools that benefit both research and public healthcare systems (professionals, patients, industry). To this end, the group integrates the technology it develops into the healthcare value chain for clinical applications and use cases.
Among the practical exploitation scenarios for biomedical NLP, the group is working on use cases related to cardiology and cardiovascular diseases (e.g. heart failure), occupational health, biomaterial and chemical entity text mining, rheumatology and rare diseases, COVID-19, and cancer (incl. comorbidities and tumor morphology), as well as the generation of knowledge graphs from text (e.g. extraction of gene regulatory networks and drug-target interactions).
The group participates in several international consortia, such as the BIOMATDB, DataTools4Heart, and AI4HF projects. It also participates in national projects such as AI4PROFHEALTH and BARITONE.
Much of the group's research output is openly available on Zenodo (medical-nlp), GitHub (PlanTL-GOB-ES, TeMU-BSC) or YouTube (Biomedical Text Mining), as well as on websites created by the group for specific Shared Tasks and resources (see DrugProt, PharmaCoNER, MESINESP2, CANTEMIST, DisTEMIST, MedProcNER, LivingNER, MEDDOCAN, MEDDOPROF, MEDDOPLACE or ClinSpEn). The group also offers online demonstrators of some of the developed systems at textmining.bsc.es.
R&D Lines
- Deep learning-based semantic annotation technologies: A key step in exploiting unstructured data is the automatic detection, extraction, and harmonization of relevant concepts from texts, such as mentions of diseases/disorders, symptoms/signs, procedures, genes/proteins/chemicals, medications, observable entities, professions/occupations, or places. Linking such mentions to structured vocabularies or terminologies is critical for advanced semantic search technologies, data analytics, predictive modeling, and knowledge discovery or knowledge graph generation from text. We develop, evaluate, adapt, and apply Transformers, deep learning-based approaches, and language models on BSC-HPC infrastructure for biomedical named entity recognition and entity linking.
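
To make the two steps named above concrete (detecting mentions and linking them to a vocabulary), the following is a minimal sketch and not the group's method. It swaps the Transformer-based models for a plain dictionary lookup so that it runs with the Python standard library alone. The vocabulary, the `TOY:` concept identifiers, and the names `Annotation`, `build_lexicon` and `annotate` are all hypothetical and belong to no real terminology or released system.

```python
import re
from dataclasses import dataclass

# Hypothetical toy vocabulary: concept ID -> (preferred term, synonyms).
# The IDs are invented for illustration and belong to no real terminology.
TOY_VOCABULARY = {
    "TOY:0001": ("heart failure", ["cardiac failure", "insuficiencia cardiaca"]),
    "TOY:0002": ("fever", ["pyrexia", "fiebre"]),
    "TOY:0003": ("nurse", ["enfermera", "enfermero"]),
}


@dataclass(frozen=True)
class Annotation:
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    text: str         # the mention exactly as it appears in the document
    concept_id: str   # the vocabulary concept the mention is linked to


def build_lexicon(vocabulary):
    """Map every lowercased term or synonym to its concept ID."""
    lexicon = {}
    for concept_id, (preferred, synonyms) in vocabulary.items():
        for term in [preferred, *synonyms]:
            lexicon[term.lower()] = concept_id
    return lexicon


def annotate(text, vocabulary=TOY_VOCABULARY):
    """Detect known terms in `text` and link each one to a concept ID."""
    lexicon = build_lexicon(vocabulary)
    # Try longer terms first so multi-word terms win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        Annotation(m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for ann in annotate("Paciente con fiebre e Insuficiencia Cardiaca."):
        print(ann)
```

A lookup like this only finds exact string matches, so it misses spelling variants, abbreviations, and context-dependent mentions. Those limits are one reason learned models are used for named entity recognition and entity linking in practice.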