RSS-Feed en-gb TYPO3 News Mon, 23 Jul 2018 00:20:30 +0000 Mon, 23 Jul 2018 00:20:30 +0000 TYPO3 EXT:news
news-2131 Wed, 11 Jul 2018 16:36:52 +0000 Paper accepted at ISWC 2018: Fine-grained Evaluation of Rule- and Embedding-based Systems for Knowledge Graph Completion https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-iswc-2018-fine-grained-evaluation-of-rule-and-embedding-based-systems-for-knowle/ The paper "Fine-grained Evaluation of Rule- and Embedding-based Systems for Knowledge Graph Completion" by Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla, and Heiner Stuckenschmidt has been accepted at the 2018 International Semantic Web Conference (ISWC).

In recent years, embedding methods have attracted increasing attention as a means of knowledge graph completion. Similarly, rule-based systems have been studied for this task in the past. What is missing so far is a common evaluation that includes more than one type of method. We close this gap by comparing representatives of both types of systems in a frequently used evaluation protocol. Leveraging the explanatory qualities of rule-based systems, we present a fine-grained evaluation that gives insight into characteristics of the most popular datasets and points out the different strengths and shortcomings of the examined approaches. Our results show that models such as TransE, RESCAL or HolE have problems in solving certain types of completion tasks that can be solved by a rule-based approach with high precision. At the same time, there are other completion tasks that are difficult for rule-based systems. Motivated by these insights, we combine both families of approaches via ensemble learning. The results support our assumption that the two methods complement each other in a beneficial way.

Publications Rainer
news-2123 Fri, 22 Jun 2018 09:35:32 +0000 JCDL 2018 - Vannevar Bush Best Paper Award https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/jcdl-2018-vannevar-bush-best-paper-award/ Our paper "Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context" has recently won the Vannevar Bush best paper award at the 2018 Joint Conference on Digital Libraries (JCDL), the top conference in the field of digital libraries!

The work, coauthored by Federico Nanni, Simone Paolo Ponzetto and Laura Dietz, is part of a collaboration between the DWS group and the University of New Hampshire in the context of an Elite Post-Doc grant of the Baden-Württemberg Stiftung recently awarded to Laura.

Congratulations also to Myriam Traub, Thaer Samar, Jacco van Ossenbruggen and Lynda Hardman, who, with their work, share with us the 2018 best paper award!

Simone Research - Data Analytics Publications
news-2105 Fri, 27 Apr 2018 09:58:42 +0000 Data Science Conference LWDA 2018 in Mannheim https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/data-science-conference-lwda-2018-in-mannheim-1/ The Data and Web Science Group is hosting the Data Science Conference LWDA 2018 in Mannheim on August 22-24, 2018.

LWDA, which expands to „Lernen, Wissen, Daten, Analysen“ („Learning, Knowledge, Data, Analytics“), covers recent research in areas such as knowledge discovery, machine learning & data mining, knowledge management, database management & information systems, and information retrieval.

The LWDA conference is organized by, and brings together, the various special interest groups of the Gesellschaft für Informatik (German Computer Science Society) in this area. The program comprises joint research sessions and keynotes as well as workshops organized by each special interest group.

Further information can be found on the conference website:

Download the conference poster.

Other Topics - Künstliche Intelligenz I Topics - Data Mining Topics - Decision Support Topics - Web Search and IR Chris Heiner Rainer Simone
news-2073 Tue, 20 Feb 2018 14:28:00 +0000 Paper accepted at AAAI: On Multi-Relational Link Prediction with Bilinear Models https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-aaai-on-multi-relational-link-prediction-with-bilinear-models/ The paper "On Multi-Relational Link Prediction with Bilinear Models" (pdf) by Y. Wang, R. Gemulla and H. Li has been accepted at the 2018 AAAI Conference on Artificial Intelligence (AAAI).

We study bilinear embedding models for the task of multi-relational link prediction and knowledge graph completion. Bilinear models belong to the most basic models for this task, they are comparatively efficient to train and use, and they can provide good prediction performance. The main goal of this paper is to explore the expressiveness of and the connections between various bilinear models proposed in the literature. In particular, a substantial number of models can be represented as bilinear models with certain additional constraints enforced on the embeddings. We explore whether or not these constraints lead to universal models, which can in principle represent every set of relations, and whether or not there are subsumption relationships between various models. We report results of an independent experimental study that evaluates recent bilinear models in a common experimental setup. Finally, we provide evidence that relation-level ensembles of multiple bilinear models can achieve state-of-the-art prediction performance.
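As a rough illustration of what "a bilinear model with additional constraints" can mean, the sketch below scores a (subject, relation, object) triple as subject^T * R * object and then restricts R to a diagonal matrix, which is the DistMult-style special case. The vectors, weights and function names are made up for this example and are not taken from the paper or its code.

```python
# Toy illustration only: embeddings and relation weights are invented values.

def bilinear_score(subj, rel, obj):
    """Return subj^T * rel * obj for d-vectors subj, obj and a d x d matrix rel."""
    return sum(
        subj[i] * rel[i][j] * obj[j]
        for i in range(len(subj))
        for j in range(len(obj))
    )


def diagonal(weights):
    """Build a d x d matrix with `weights` on the diagonal and zeros elsewhere."""
    d = len(weights)
    return [[weights[i] if i == j else 0.0 for j in range(d)] for i in range(d)]


def distmult_score(subj, weights, obj):
    """Diagonal-constrained bilinear score: sum_i subj[i] * weights[i] * obj[i]."""
    return sum(s * w * o for s, w, o in zip(subj, weights, obj))


subject = [1.0, 2.0, 0.5]
obj = [0.5, -1.0, 2.0]
relation_weights = [2.0, 1.0, -1.0]

# Unconstrained bilinear form evaluated on a diagonal relation matrix ...
full = bilinear_score(subject, diagonal(relation_weights), obj)
# ... gives the same score as the constrained (diagonal) model.
constrained = distmult_score(subject, relation_weights, obj)
# Swapping subject and object does not change the diagonal model's score.
swapped = distmult_score(obj, relation_weights, subject)
```

Because the diagonal score is unchanged when subject and object are swapped, such a constrained model treats every relation as symmetric, which is one simple way a constraint can limit what a model is able to represent.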

Publications Rainer
news-1707 Thu, 14 Sep 2017 13:28:00 +0000 Paper accepted at EMNLP: MinIE: Minimizing Facts in Open Information Extraction https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-emnlp-minie-minimizing-facts-in-open-information-extraction/ The paper "MinIE: Minimizing Facts in Open Information Extraction" (pdf) by K. Gashteovski, L. del Corro, and R. Gemulla has been accepted for the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP).

The goal of Open Information Extraction (OIE) is to extract surface relations and their arguments from natural-language text in an unsupervised, domain-independent manner. In this paper, we propose MinIE, an OIE system that aims to provide useful, compact extractions with high precision and recall. MinIE approaches these goals by (1) representing information about polarity, modality, attribution, and quantities with semantic annotations instead of in the actual extraction, and (2) identifying and removing parts that are considered overly specific. We conducted an experimental study with several real-world datasets and found that MinIE achieves competitive or higher precision and recall than most prior systems, while at the same time producing shorter, semantically enriched extractions.

Publications Rainer
news-1981 Thu, 07 Sep 2017 09:48:14 +0000 Master thesis: Address management and geocoding (Gemulla, DHL) https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/master-thesis-address-management-and-geocoding-gemulla-dhl/ eCommerce is on the rise. Logistics companies like Deutsche Post DHL are expanding and building up new logistics networks in emerging markets around the globe. Delivering parcels to end-consumers quickly, reliably and efficiently with an outstanding service requires much planning effort. This is especially complex in emerging countries where the infrastructure poses additional challenges. One essential part of making the delivery more efficient is route optimization of the driven tours on the “last mile”. It is done by using optimization algorithms connecting the delivery locations considering all kinds of restrictions (capacity, working hours of courier, traffic etc.).

In emerging countries, the delivery location cannot be easily deduced from the address provided by the customer. Addresses can follow different local address logics or can take the form of something like "Slip road from Megenagna to Imperial Hotel, In front of Anbessa Garage, P.O. Box 184 Code 1110, Addis Ababa, Ethiopia". In order to use the address for geocoding, it has to be broken down into a structured format, analyzed, compared with other existing addresses in databases, possibly updated and then translated into a geocode first. There are many ways and possible methods to achieve this. In the course of this master thesis the student is asked to give a scientific overview of current methods and algorithms in that context and come up with a suitable solution incorporating latest developments in Machine Learning.

The student is expected to have excellent analytical skills, knowledge of how to create and describe an algorithm, and some previous Machine Learning experience. Knowledge of an object-oriented programming language is a plus.

The master thesis will be written in cooperation with Deutsche Post DHL. The student will work closely together with project teams in Singapore, Thailand, Malaysia, Vietnam and the global headquarters in Bonn. To apply for this master thesis, please send your CV to Gunnar Buchhold <gunnar.buchhold(at)> and briefly state your motivation and why you deem yourself suitable to cover this topic.

DHL eCommerce is the e-commerce logistics specialist of Deutsche Post DHL Group, the biggest logistics company worldwide. We offer choice, convenience, control and quality for both the merchant and the consumer. Our global team of e-commerce experts is dedicated to providing innovative solutions that create a great online shopping experience.

ma master data science theses Thesis Thesis - Bachelor Rainer
news-1939 Fri, 14 Jul 2017 07:55:09 +0000 DSGDpp library for parallel matrix factorization available as open source https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dsgdpp-library-for-parallel-matrix-factorization-available-as-open-source/ We have open-sourced the DSGDpp library, which contains implementations of various parallel algorithms for computing low-rank matrix factorizations. Both shared-memory and shared-nothing (via MPI) implementations are provided.

This library (finally!) makes our implementations of the DSGD++ algorithm of Teflioudi et al. (2012) and the CSGD algorithm of Makari et al. (2013) publicly available.

More information can be found at the GitHub page:



Research - Data Analytics Research Topics - Data Mining Rainer
news-1934 Thu, 06 Jul 2017 19:55:31 +0000 Four papers accepted at EMNLP 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/four-papers-accepted-at-emnlp-2017/ We have four papers accepted at the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), one of the top-tier conferences in the field of Natural Language Processing:

  • Kiril Gashteovski, Rainer Gemulla, Luciano del Corro: MinIE: Minimizing Facts in Open Information Extraction
  • Goran Glavaš and Simone Paolo Ponzetto: Dual Tensor Model for Detecting Asymmetric Lexico-Semantic Relations
  • Stefano Menini, Federico Nanni, Simone Paolo Ponzetto and Sara Tonelli: Topic-Based Agreement and Disagreement in US Electoral Manifestos
  • Alexander Panchenko, Fide Marten, Eugen Ruppert, Stefano Faralli, Dmitry Ustalov, Simone Paolo Ponzetto and Chris Biemann: Unsupervised & Knowledge-Free & Interpretable Word Sense Disambiguation
Rainer Simone Research Publications
news-1816 Fri, 17 Feb 2017 11:38:00 +0000 Master thesis: Text Mining for Cyber Threat Analysis (Gemulla, Schönhofer GmbH) https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/master-thesis-text-mining-for-cyber-threat-analysis-gemulla-schoenhofer-gmbh/ Since mid-September 2015, the threat from ransomware has grown considerably [1]. Against this background, comprehensive geographical and temporal mapping of cyber attacks and early detection of such attacks have become particularly important. Attacks on an organisation's own IT-infrastructure are typically analysed and defended against at network level. Outside an organisation's own infrastructure, other sources, e.g., news portals and social media, usually have to be used. Given the large volume and variety of this unstructured data as well as the speed with which it is generated, automated analytical procedures from the fields of text mining and machine learning to handle it are not only particularly promising but also the only practical approach. 


The thesis involves a review and evaluation of sources dealing with cyber-threat analysis, e.g.,

  • News websites, news portals and social media such as Facebook & Twitter
  • Pre-evaluation / Prediction / forecasting websites such as Google Trends or Europe Media Monitor
  • Reports from Computer Emergency Response Teams (CERTs)
  • Reports from anti-virus software companies, e.g., Kaspersky

The sources may contain previously evaluated and summarised results. The sources should be analysed and metadata extracted. The following directions are of particular interest:

  • Reports on new threats
  • Differentiation between duplicated confirmation and new reports
  • Sentiment analysis, classification of phishing, hoaxes & fake news
  • Regional / geographical and temporal distribution
  • Significant parties (parties issuing threats as well as those analysing / defending against threats)

Moreover, on the basis of configurable taxonomies the texts should be subjected to an entity analysis and, if possible, to relations analysis.
A data corpus, which has been created on the basis of relevant RSS feeds, is available to test the procedure and can be expanded during the work. In addition, the possibilities of adding further metadata while importing data should be investigated, e.g., designation of source / publisher, evaluation of source (reliability, trustworthiness etc.), which can then be considered when extracting the metadata.


Detailed knowledge of text analysis / text mining as well as programming skills in Java/Scala, Python or a comparable programming language are required. Knowledge of virtualisation and databases is an advantage. In-depth knowledge of cyber security is not required.


The Master thesis is supervised by the Chair for Data Analytics (Prof. Gemulla) as well as by the Schönhofer Sales & Engineering GmbH.

Schönhofer Sales & Engineering GmbH is an innovative systems and software company. The company, which is located in Siegburg, realises complex projects and products for complex event prediction, big-data analytics and metadata processing for public sector clients, banks, insurance companies and corporates.

If you are interested in this thesis topic, please contact

Holger Krispin
Schönhofer S&E GmbH, IT-Systems area
Tel. +49 (0)2241 3099 37


[1] Ransomware: Bedrohungslage, Prävention & Reaktion. BSI-Report. March 2016

Thesis Rainer ma master data science theses Thesis - Master partner companies
news-1786 Tue, 17 Jan 2017 14:47:30 +0000 44.2 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/442-billion-quads-microdata-embedded-json-ld-rdfa-and-microformat-data-published/ The DWS group is happy to announce a new release of the WebDataCommons Microdata, Embedded JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the October 2016 version of the CommonCrawl covering 3.2 billion HTML pages which originate from 34 million websites (pay-level domains).

Altogether we discovered structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.6 million different pay-level domains out of the 34 million pay-level domains covered by the crawl (16.5%).
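The two percentages can be sanity-checked from the rounded figures in this announcement, taking the crawl total of 34 million pay-level domains stated for the October 2016 CommonCrawl. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the published percentages were presumably computed from exact counts, so small rounding differences are expected.

```python
def share(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Rounded figures as quoted in the announcement.
pages_with_data = 1_200_000_000
pages_total = 3_200_000_000
domains_with_data = 5_600_000
domains_total = 34_000_000

page_share = share(pages_with_data, pages_total)        # about 37.5, reported as 38%
domain_share = share(domains_with_data, domains_total)  # about 16.47, reported as 16.5%

print(f"pages with structured data:   {page_share:.1f}%")
print(f"domains with structured data: {domain_share:.1f}%")
```

Both results agree with the reported 38% and 16.5% up to rounding, which is only the case when the domain total is read as 34 million.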

Approximately 2.5 million of these websites use Microdata, 2.1 million websites employ JSON-LD, and 938 thousand websites use RDFa. Microformats are used by over 1.6 million websites within the crawl.



More and more websites annotate structured data within their HTML pages using markup formats such as RDFa, Microdata, embedded JSON-LD and Microformats. The annotations cover topics such as products, reviews, people, organizations, places, events, and cooking recipes.

The WebDataCommons project extracts all Microdata, RDFa data, and Microformat data, and since 2015 also embedded JSON-LD data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. 

Besides the markup data, the WebDataCommons project also provides large web table corpora and web graphs for download. General information about the WebDataCommons project is found at 

Data Set Statistics: 

Basic statistics about the October 2016 Microdata, Embedded JSON-LD, RDFa and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

Comparing these statistics to those of the November 2015 release of the data sets, we see that the Microdata syntax remains the dominant annotation format. Although it is hard to compare the adoption of the syntax between the two years in absolute numbers, as the October 2016 crawl corpus is almost double the size of the November 2015 one, a relative increase can be observed: in the October 2016 corpus, over 44% of the pay-level domains containing markup data make use of the Microdata syntax, compared to 40% one year earlier. Even though the absolute numbers for RDFa adoption have risen, the relative increase does not keep pace with the growth of the corpus, indicating that RDFa is used by a smaller share of websites. As in the 2015 release, the adoption of embedded JSON-LD has increased considerably, even though the main focus of the annotations remains the search action offered by the websites (70%).

As already observed in the previous years, the vocabulary is most frequently used in the context of Microdata while the adoption of its predecessor, the data vocabulary, continues to decrease. In the context of RDFa, we still find the Open Graph Protocol recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue. We see that, besides navigational, blog and CMS-related meta-information, many websites annotate e-commerce-related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress). More concretely, the October 2016 corpus includes more than 682 million product records originating from 249 thousand websites which use the vocabulary. The new release contains postal address data for more than 291 million entities originating from 338 thousand websites. Furthermore, the content describing hotels has doubled in size in this release, with a total of 61 million hotel descriptions.

Visualizations of the main adoption trends concerning the different annotation formats, popular, as well as RDFa classes within the time span 2012 to 2016 are found at



The overall size of the October 2016 Microdata, RDFa, Embedded JSON-LD, and Microformat data sets is 44.2 billion RDF quads. For download, we split the data into 9,661 files with a total size of 987 GB.

In addition, we have created separate files for over 40 different classes, each including all quads from pages that deploy the specific class at least once.


Lots of thanks to: 

+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 
+ the Ministry of Economy, Research and Arts of Baden-Württemberg, which supported the extraction and analysis of the October 2016 corpus by means of the ViCe project.

Have fun with the new data set. 

Anna Primpeli, Robert Meusel and Chris Bizer

Research - Data Mining and Web Mining Research - Data Analytics Topics - Data Mining Topics - Linked Data Projects Chris