RSS-Feed http://example.com en-gb TYPO3 News Wed, 25 Apr 2018 16:12:14 +0000 Wed, 25 Apr 2018 16:12:14 +0000 TYPO3 EXT:news
news-2098 Tue, 17 Apr 2018 09:27:36 +0000 Paper accepted at IJCAI 2018 http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-ijcai-2018/ Together with our colleagues Paola, Irene and Stefano at Sapienza University in Rome, we have a paper accepted at the 27th International Joint Conference on Artificial Intelligence (IJCAI), the premier conference in the field of AI:

  • Stefano Faralli, Irene Finocchi, Simone Paolo Ponzetto and Paola Velardi: Efficient Pruning of Large Knowledge Graphs.
Publications Simone Research
news-2097 Tue, 17 Apr 2018 09:24:14 +0000 Paper accepted at JCDL 2018 http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-jcdl-2018/ We have a paper accepted at the 2018 Joint Conference on Digital Libraries (JCDL), the top conference in the field of digital libraries:

  • Federico Nanni, Simone Paolo Ponzetto and Laura Dietz: Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context.

The work presented in the paper is a collaboration between the DWS group and Prof. Laura Dietz at the University of New Hampshire in the context of an Elite Post-Doc grant of the Baden-Württemberg Stiftung recently awarded to Laura.
Research Publications Simone
news-2096 Tue, 17 Apr 2018 09:08:19 +0000 Paper accepted at SIGIR 2018 http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-sigir-2018/ Together with our colleague Ivan Vulic at the University of Cambridge, we have a paper accepted at the 41st International ACM Conference on Research and Development in Information Retrieval (SIGIR), the premier conference in the field of Information Retrieval:

  • Robert Litschko, Goran Glavas, Ivan Vulic and Simone Paolo Ponzetto: Unsupervised Cross-Lingual Information Retrieval using Monolingual Data Only.
Research Publications Simone
news-2084 Mon, 12 Mar 2018 11:57:47 +0000 Third Cohort of Students starts Part-time Master in Data Science http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/third-cohort-of-students-starts-part-time-master-in-data-science/ The third cohort consisting of 32 students has started their studies in the part-time master program in Data Science that professors of the DWS group offer together with the Hochschule Albstadt-Sigmaringen.

This weekend the students of the third cohort of the master program as well as students participating in the certificate program Data Science were in Mannheim for a data mining project weekend.

The students worked in teams on two case studies, one in the area of online marketing, the other in the area of text mining. The teams were coached by Prof. Christian Bizer, Dr. Robert Meusel, and Alexander Diete. We were very happy to see an exciting competition between the teams for the best F1 scores and the largest increases in sales.

Projects Chris
news-2075 Fri, 23 Feb 2018 14:41:28 +0000 Dmitry Ustalov has defended his PhD thesis http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dmitry-ustalov-has-defended-his-phd-thesis/ Dmitry Ustalov has successfully defended his Kandidat Nauk (PhD) thesis on “Models, Methods and Algorithms for Constructing a Word Sense Network for Natural Language Processing” («Модели, методы и алгоритмы построения семантической сети слов для задач обработки естественного языка» in Russian). The defense was held at the South Ural State University (Chelyabinsk, Russia) on February 21, 2018. Among many other contributions, the thesis proposes the Watset and Watlink methods for extracting, inducing, clustering, and linking word senses from unstructured data.

Abstract

The goal of the thesis is to develop models, methods, and algorithms for constructing a semantic network that establishes semantic links between individual word senses using weakly structured dictionaries, and to implement them as a software system for word sense network construction. To this end, Part I reviews the state of the art in the field of natural language processing and argues for the development of new efficient ontology induction algorithms for under-resourced languages.

Part II proposes two new algorithms, Watset and Watlink, that extract and structure the knowledge available in unstructured form. Watset is a meta-algorithm for fuzzy graph clustering. This algorithm creates an intermediate representation of the input graph that naturally reflects the “ambiguity” of its nodes. Then, it uses hard clustering to discover clusters in this intermediate graph. This makes it possible to discover synsets in a synonymy graph. Watlink is an algorithm for discovering the disambiguated hierarchical links between individual word senses. This algorithm uses the synsets obtained using Watset to contextualize the input asymmetric word links. To increase the recall of the linking, it optionally uses a regularized projection learning approach to predict additional relevant links.
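The two-stage idea described for Watset can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: the function names `components` and `fuzzy_cluster` are hypothetical, the graph is unweighted, and connected components serve as a stand-in for the hard clustering step, which the abstract does not specify.

```python
from collections import defaultdict


def components(nodes, adjacency):
    """Stand-in hard clustering: connected components restricted to `nodes`."""
    nodes = set(nodes)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            comp.add(x)
            stack.extend((adjacency[x] & nodes) - seen)
        result.append(comp)
    return result


def fuzzy_cluster(edges):
    """Cluster an undirected graph so that ambiguous nodes may land in several clusters."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Stage 1: split every node into "senses" by clustering its neighbourhood.
    contexts = {}
    for node in sorted(adj):
        for i, ctx in enumerate(components(adj[node], adj)):
            contexts[(node, i)] = ctx

    # Build the intermediate graph: link sense (u, i) to the sense of v
    # whose neighbourhood cluster contains u.
    sense_adj = defaultdict(set)
    for (u, i), ctx in contexts.items():
        sense_adj[(u, i)]  # make sure the sense exists as a node
        for v in ctx:
            for (w, j), other in contexts.items():
                if w == v and u in other:
                    sense_adj[(u, i)].add((v, j))
                    sense_adj[(v, j)].add((u, i))

    # Stage 2: hard-cluster the senses, then map senses back to words.
    clusters = components(sense_adj.keys(), sense_adj)
    return sorted(sorted({word for word, _ in c}) for c in clusters)


# Toy synonymy graph in which "bank" is ambiguous (illustrative data only).
toy_edges = [
    ("bank", "shore"), ("bank", "riverside"), ("shore", "riverside"),
    ("bank", "lender"), ("bank", "depository"), ("lender", "depository"),
]
synsets = fuzzy_cluster(toy_edges)
print(synsets)
# [['bank', 'depository', 'lender'], ['bank', 'riverside', 'shore']]
```

Because the hard clustering runs on senses rather than words, the word "bank" ends up in both clusters, which is what makes the overall result fuzzy.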

Part III describes the implementation of the proposed models, methods, and algorithms as a software system. The system is implemented in the Python, AWK, and Bash programming languages using the scikit-learn, TensorFlow, NetworkX, and Raptor libraries. Part III also defines the representation of the produced word sense network as Linked Data.

Part IV reports the results of the experiments conducted on the Russian language, an under-resourced natural language. Both Watset and Watlink show state-of-the-art performance on the synset induction and hypernymy detection tasks on the RuWordNet and Yet Another RussNet gold standards.

Research Group
news-2073 Tue, 20 Feb 2018 14:28:00 +0000 Paper accepted at AAAI: On Multi-Relational Link Prediction with Bilinear Models http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-aaai-on-multi-relational-link-prediction-with-bilinear-models/ The paper "On Multi-Relational Link Prediction with Bilinear Models" (pdf) by Y. Wang, R. Gemulla and H. Li has been accepted at the 2018 AAAI Conference on Artificial Intelligence (AAAI).

Abstract
We study bilinear embedding models for the task of multi-relational link prediction and knowledge graph completion. Bilinear models belong to the most basic models for this task; they are comparably efficient to train and use, and they can provide good prediction performance. The main goal of this paper is to explore the expressiveness of and the connections between various bilinear models proposed in the literature. In particular, a substantial number of models can be represented as bilinear models with certain additional constraints enforced on the embeddings. We explore whether or not these constraints lead to universal models, which can in principle represent every set of relations, and whether or not there are subsumption relationships between various models. We report results of an independent experimental study that evaluates recent bilinear models in a common experimental setup. Finally, we provide evidence that relation-level ensembles of multiple bilinear models can achieve state-of-the-art prediction performance.
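To make the notion of a bilinear model with an added constraint concrete, here is a minimal Python sketch. It assumes the usual bilinear form, in which a triple is scored as the subject vector times a relation matrix times the object vector, and it uses a diagonal relation matrix as one example of a constraint. The helper names and all numbers are illustrative and not taken from the paper.

```python
def bilinear_score(subject, relation, obj):
    """Score a triple as subject^T * relation * obj (plain lists, no NumPy)."""
    return sum(
        subject[a] * relation[a][b] * obj[b]
        for a in range(len(subject))
        for b in range(len(obj))
    )


def diagonal(weights):
    """Build a relation matrix constrained to be diagonal."""
    n = len(weights)
    return [[weights[a] if a == b else 0.0 for b in range(n)] for a in range(n)]


# Illustrative two-dimensional entity embeddings.
alice = [1.0, 2.0]
bob = [3.0, -1.0]

# An unconstrained relation matrix can score (alice, r, bob) and
# (bob, r, alice) differently, so it can model asymmetric relations.
full_relation = [[0.0, 1.0], [0.0, 0.0]]
print(bilinear_score(alice, full_relation, bob))  # -1.0
print(bilinear_score(bob, full_relation, alice))  # 6.0

# With the diagonal constraint, both directions always get the same score.
diag_relation = diagonal([0.5, 2.0])
print(bilinear_score(alice, diag_relation, bob))  # -2.5
print(bilinear_score(bob, diag_relation, alice))  # -2.5
```

The example shows why constraints matter for expressiveness: the diagonal variant needs fewer parameters per relation, but in this form it cannot distinguish a relation from its inverse.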

Publications Rainer
news-2060 Fri, 19 Jan 2018 13:07:59 +0000 Paper accepted for Digital Scholarship in the Humanities http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-digital-scholarship-in-the-humanities/ We have a paper accepted in Digital Scholarship in the Humanities, the premier journal in the field of Digital Humanities.

Federico Nanni, Laura Dietz and Simone Paolo Ponzetto. Toward a computational history of universities: Evaluating text mining methods for interdisciplinarity detection from PhD dissertation abstracts. To appear in Digital Scholarship in the Humanities. DOI: 10.1093/llc/fqx062 (available with a free-access article link here). 

The work presented in the paper is a collaboration between the DWS group and Prof. Laura Dietz at the University of New Hampshire.

Abstract

For the first time, historians of higher education have large data sets of primary sources that reflect the complete output of academic institutions at their disposal. To analyze this unprecedented abundance of digital materials, scholars have access to a large suite of computational methods developed in the field of Natural Language Processing. However, when the intention is to move beyond exploratory studies and use the results of such analyses as quantitative evidence, historians need to take into account the reliability of these techniques. The main goal of this article is to investigate the performance of different text mining methods for a specific task: the automatic identification of interdisciplinary works from a corpus of PhD dissertation abstracts. Based on the output of our study, we provide the research community with a new data set for analyzing recent changes in interdisciplinary practices in a large sample of European universities. We show the potential of this collection by tracking the growth in adoption of computational approaches across different research fields during the past 30 years.

Research Simone Publications
news-2059 Fri, 19 Jan 2018 12:56:24 +0000 Paper accepted for Knowledge-Based Systems http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-knowledge-based-systems/ Together with our colleagues of the Natural Language Engineering (NLE) Lab of the University of Valencia, we have a paper accepted for the Knowledge-Based Systems journal (2016 Impact Factor: 4.529).

Goran Glavaš, Marc Franco-Salvador, Simone P. Ponzetto and Paolo Rosso. A resource-light method for cross-lingual semantic textual similarity. To appear in Knowledge-Based Systems. DOI: 10.1016/j.knosys.2017.11.041. A pre-print version is available here.

Abstract

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via the linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a sufficiently large corpus, required to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks.
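The projection and alignment steps can be illustrated with a small Python sketch. It assumes that a projection matrix has already been learned from word translation pairs. The greedy best-match alignment and the mean-cosine score below are simplifying assumptions rather than the measures evaluated in the paper, and all names and vectors are made up for the example.

```python
import math


def project(vector, matrix):
    """Map a source-language vector into the target space: matrix * vector."""
    return [sum(row[k] * vector[k] for k in range(len(vector))) for row in matrix]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align(source_words, target_words, source_vecs, target_vecs, matrix):
    """Greedily align each source word to its most similar target word."""
    pairs = []
    for word in source_words:
        projected = project(source_vecs[word], matrix)
        best = max(target_words, key=lambda t: cosine(projected, target_vecs[t]))
        pairs.append((word, best, cosine(projected, target_vecs[best])))
    return pairs


def text_similarity(pairs):
    """Assumed similarity measure: mean cosine over the aligned word pairs."""
    return sum(score for _, _, score in pairs) / len(pairs)


# Toy embeddings; the "learned" projection here simply swaps the two axes.
german = {"hund": [0.0, 1.0], "katze": [1.0, 0.0]}
english = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
swap_axes = [[0.0, 1.0], [1.0, 0.0]]

alignment = align(["hund", "katze"], ["dog", "cat"], german, english, swap_axes)
print([(s, t) for s, t, _ in alignment])
print(text_similarity(alignment))
```

In a real setting the projection matrix would be estimated from the limited-size set of translation pairs mentioned in the abstract, and the embeddings would come from large monolingual corpora.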

Research Simone Publications
news-2058 Fri, 19 Jan 2018 12:44:46 +0000 Paper accepted for the Journal of Natural Language Engineering http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-the-journal-of-natural-language-engineering/ We have a new journal paper in the Natural Language Engineering journal summarizing the findings of the first part of our DFG JOIN-T (Joining Ontologies and semantics INduced from Text) project with our colleagues from the Language Technology Group of the University of Hamburg:

Chris Biemann, Stefano Faralli, Alexander Panchenko and Simone Paolo Ponzetto: A framework for enriching lexical semantic resources with distributional semantics. To appear in the Journal of Natural Language Engineering. DOI: 10.1017/S135132491700047X. A pre-print version is available here.

You can find the project homepage here.

Abstract

We present an approach to combining distributional semantic representations induced from text corpora with manually constructed lexical semantic networks. While both kinds of semantic resources are available with high lexical coverage, our aligned resource combines the domain specificity and availability of contextual information from distributional models with the conciseness and high quality of manually crafted lexical networks. We start with a distributional representation of induced senses of vocabulary terms, which are accompanied with rich context information given by related lexical items. We then automatically disambiguate such representations to obtain a full-fledged proto-conceptualization, i.e. a typed graph of induced word senses. In a final step, this proto-conceptualization is aligned to a lexical ontology, resulting in a hybrid aligned resource. Moreover, unmapped induced senses are associated with a semantic type in order to connect them to the core resource. Manual evaluations against ground-truth judgments for different stages of our method as well as an extrinsic evaluation on a knowledge-based Word Sense Disambiguation benchmark all indicate the high quality of the new hybrid resource. Additionally, we show the benefits of enriching top-down lexical knowledge resources with bottom-up distributional information from text for addressing high-end knowledge acquisition tasks such as cleaning hypernym graphs and learning taxonomies from scratch.

Simone Research Publications
news-2055 Mon, 15 Jan 2018 15:48:14 +0000 Petar Ristoski has defended his PhD thesis http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/petar-ristoski-has-defended-his-phd-thesis/ Petar Ristoski has successfully defended his PhD thesis on "Exploiting Web Knowledge Graphs in Data Mining" today. Among many other contributions, his thesis proposes the RDF2Vec method for generating vector space embeddings of RDF graphs.

Abstract

Data Mining and Knowledge Discovery in Databases (KDD) is a research field concerned with deriving higher-level insights from data. The tasks performed in that field are knowledge intensive and can often benefit from using additional knowledge from various sources. Therefore, many approaches have been proposed in this area that combine Semantic Web data with the data mining and knowledge discovery process. Semantic Web knowledge graphs are a backbone of many information systems that require access to structured knowledge. Such knowledge graphs contain factual knowledge about real-world entities and the relations between them, which can be utilized in various natural language processing, information retrieval, and data mining applications. Following the principles of the Semantic Web, Semantic Web knowledge graphs are publicly available as Linked Open Data. Linked Open Data is an open, interlinked collection of datasets in machine-interpretable form, covering most real-world domains.

In this thesis, we investigate the hypothesis that Semantic Web knowledge graphs can be exploited as background knowledge in different steps of the knowledge discovery process and in different data mining tasks. More precisely, we aim to show that Semantic Web knowledge graphs can be utilized for generating valuable data mining features that can be used in various data mining tasks.

Identifying, collecting and integrating useful background knowledge for a given data mining application can be a tedious and time-consuming task. Furthermore, most data mining tools require features in propositional form, i.e., binary, nominal or numerical features associated with an instance, while Linked Open Data sources are usually graphs by nature. Therefore, in Part I, we evaluate unsupervised feature generation strategies from types and relations in knowledge graphs, which are used in different data mining tasks, i.e., classification, regression, and outlier detection. As the number of generated features grows rapidly with the number of instances in the dataset, we provide a strategy for feature selection in hierarchical feature space, in order to select only the most informative and most representative features for a given dataset. Furthermore, we provide an end-to-end tool for mining the Web of Linked Data, which provides functionalities for each step of the knowledge discovery process, i.e., linking local data to a Semantic Web knowledge graph, integrating features from multiple knowledge graphs, feature generation and selection, and building machine learning models. However, we show that such feature generation strategies often lead to high-dimensional feature vectors even after dimensionality reduction, and also, the reusability of such feature vectors across different datasets is limited.
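The idea of turning graph data into propositional features can be sketched as follows. The snippet is a simplified illustration, not one of the strategies evaluated in the thesis: it derives one binary feature per type and one per outgoing relation, and the function name, the `rdf:type` predicate label, and the toy triples are assumptions made for the example.

```python
def propositionalize(triples, instances, type_predicate="rdf:type"):
    """Turn (subject, predicate, object) triples into binary feature vectors."""
    per_instance = {name: set() for name in instances}
    for subject, predicate, obj in triples:
        if subject in per_instance:
            if predicate == type_predicate:
                per_instance[subject].add(f"type:{obj}")       # type feature
            else:
                per_instance[subject].add(f"rel:{predicate}")  # relation-presence feature
    names = sorted(set().union(*per_instance.values()))
    rows = {
        inst: [1 if name in feats else 0 for name in names]
        for inst, feats in per_instance.items()
    }
    return names, rows


# Toy knowledge graph (illustrative triples only).
toy_triples = [
    ("Mannheim", "rdf:type", "City"),
    ("Mannheim", "locatedIn", "Germany"),
    ("Rhine", "rdf:type", "River"),
    ("Rhine", "flowsThrough", "Mannheim"),
]
feature_names, feature_rows = propositionalize(toy_triples, ["Mannheim", "Rhine"])
print(feature_names)  # ['rel:flowsThrough', 'rel:locatedIn', 'type:City', 'type:River']
print(feature_rows)   # {'Mannheim': [0, 1, 1, 0], 'Rhine': [1, 0, 0, 1]}
```

Even this tiny example hints at the dimensionality problem noted in the abstract: every new type or relation seen in the graph adds another column to the feature table.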

In Part II, we propose an approach that circumvents the shortcomings introduced with the approaches in Part I. More precisely, we develop an approach that is able to embed complete Semantic Web knowledge graphs in a low-dimensional feature space, where each entity and relation in the knowledge graph is represented as a numerical vector. Projecting such latent representations of entities into a lower-dimensional feature space shows that semantically similar entities appear closer to each other. We use several Semantic Web knowledge graphs to show that such latent representations of entities have high relevance for different data mining tasks. Furthermore, we show that such features can be easily reused for different datasets and different tasks.
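The abstract does not spell out how the embeddings are obtained. One common way to prepare a graph for embedding, shown here purely as an assumed illustration, is to extract random walks that alternate entities and relations and then feed these sequences to a word-embedding model. The Python sketch below covers only the walk extraction step; the function name, parameters, and triples are hypothetical.

```python
import random
from collections import defaultdict


def extract_walks(triples, start, depth, count, seed=0):
    """Sample `count` random walks of at most `depth` hops starting at `start`."""
    outgoing = defaultdict(list)
    for subject, predicate, obj in triples:
        outgoing[subject].append((predicate, obj))

    rng = random.Random(seed)  # seeded so that repeated runs give the same walks
    walks = []
    for _ in range(count):
        walk, node = [start], start
        for _ in range(depth):
            if not outgoing[node]:
                break  # dead end: the entity has no outgoing edges
            predicate, node = rng.choice(outgoing[node])
            walk += [predicate, node]
        walks.append(walk)
    return walks


# Toy knowledge graph (illustrative triples only).
graph_triples = [
    ("Mannheim", "locatedIn", "Germany"),
    ("Germany", "partOf", "Europe"),
    ("Mannheim", "type", "City"),
]
sample_walks = extract_walks(graph_triples, "Mannheim", depth=2, count=5, seed=1)
for walk in sample_walks:
    print(" -> ".join(walk))
```

Each walk reads like a short sentence over entities and relations, which is what allows sequence-based embedding models to be applied to graph data.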

In Part III, we describe a list of applications that exploit Semantic Web knowledge graphs, besides the standard data mining tasks, like classification and regression. We show that the approaches developed in Part I and Part II can be used in applications in various domains. More precisely, we show that Semantic Web graphs can be exploited for analyzing statistics, building recommender systems, entity and document modeling, and taxonomy induction.

Group Research