"Computational Models for Semantic Textual Similarity" aims to advance on computational models of meaning and their evaluation.
Aitor Gonzalez-Agirre, a BSC researcher in the Text Mining group, received the 2017 Best Thesis Award at the SEPLN Conference (Sociedad Española para el Procesamiento del Lenguaje Natural), held in Sevilla from 19 to 21 September 2018. The SEPLN Conference offers a forum for debate and communication where the scientific community and industry can present their research and the most recent findings in the area of Natural Language Processing (NLP).
Gonzalez-Agirre’s thesis, "Computational Models for Semantic Textual Similarity", aims to advance computational models of meaning and their evaluation. To achieve this goal, he defines two tasks, Semantic Textual Similarity (STS) and Typed Similarity, and develops state-of-the-art systems that tackle both. STS aims to measure the degree of semantic equivalence between two sentences by assigning graded similarity values that capture the intermediate shades of similarity. He has collected a total of 15,436 sentence pairs to construct datasets for STS, by far the largest collection of data for the task. He has designed, constructed and evaluated a new approach that combines knowledge-based and corpus-based methods using a cube. Without using any Machine Learning (ML), this new STS system is on par with state-of-the-art approaches that rely on ML; ML can also be applied on top of the system, which improves the results.
The Typed Similarity task tries to identify the type of relation that holds between a pair of similar items in a digital library. Providing a reason why items are similar has applications in recommendation, personalization, and search. A range of similarity types were identified in the library's collection, and a set of 1,500 pairs of items from the collection was annotated using crowdsourcing. Finally, he presents systems capable of solving the Typed Similarity task. The best system resulted in a real-world application that recommends similar items to users in an online digital library.
About the Text Mining group at BSC
The Biological Text Mining Unit focuses on the application and development of biomedical text mining technologies, which are becoming a key tool for the efficient exploitation of information contained in unstructured data repositories, including the scientific literature, electronic health records (EHRs), patents, biobank metadata, clinical trials and social media. The unit has a particular interest in processing clinical documents on health-related topics written in Spanish and other co-official languages, and in integrating molecular and biological information derived from the literature. The unit is fully funded through the “Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (PITL)”, in the framework of an agreement (“encomienda”) between the Secretary of State of Telecommunications of the Spanish Ministry of Energy, Tourism and the Digital Agenda (MINETAD) and CNIO.