news-2175 Wed, 28 Nov 2018 08:34:00 +0000 Paper accepted at EDBT 2019 https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/paper-accepted-at-edbt-2019/ Our systems and applications paper

Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data (Yaser Oulabi, Christian Bizer)

got accepted at the 22nd International Conference on Extending Database Technology (EDBT 2019), one of the top-tier conferences in the data management field!

Abstract of the paper:

 Cross-domain knowledge bases such as YAGO, DBpedia, or the Google Knowledge Graph are being used as background knowledge within an increasing range of applications including web search, data integration, natural language understanding, and question answering. The usefulness of a knowledge base for these applications depends on its completeness. Relational HTML tables that are published on the Web cover a wide range of topics and describe very specific long tail entities, such as small villages, less-known football players, or obscure songs. This systems and applications paper explores the potential of web table data for the task of completing cross-domain knowledge bases with descriptions of formerly unknown entities. We present the first system that handles all steps that are necessary to complete this task: schema matching, row clustering, entity creation, and new detection. The evaluation of the system using a manually labeled gold standard shows that it can construct formerly unknown instances from table data with an F1 score of 0.82. In a second experiment, we apply the system to a large corpus of web tables that has been extracted from the Common Crawl. This experiment allows us to get an overall impression of the potential of web tables for augmenting knowledge bases with long tail entities. The experiment shows that we can augment the DBpedia knowledge base with descriptions of 15 thousand new football players as well as 173 thousand new songs. The accuracy of the facts describing these instances is 0.73.
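The last two pipeline steps named in the abstract, row clustering and new detection, can be illustrated with a toy sketch. Everything below is invented for illustration; the actual system relies on learned matchers rather than exact string comparison:

```python
# Toy sketch of two pipeline steps: grouping web table rows that describe
# the same entity (row clustering) and flagging entities missing from the
# knowledge base (new detection). All names and data are illustrative.
from collections import defaultdict

def cluster_rows(rows, key="label"):
    """Group rows from different web tables by a normalized key column."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row[key].strip().lower()].append(row)
    return clusters

def detect_new_entities(clusters, kb_labels):
    """Keep only clusters whose entity is unknown to the knowledge base."""
    known = {label.strip().lower() for label in kb_labels}
    return {label: rows for label, rows in clusters.items() if label not in known}

rows = [
    {"label": "Obscure Song A", "year": "1999"},
    {"label": "obscure song a", "genre": "folk"},
    {"label": "Famous Song B", "year": "1984"},
]
clusters = cluster_rows(rows)
new = detect_new_entities(clusters, kb_labels={"Famous Song B"})
# "obscure song a" is unknown to the KB, so its cluster becomes a new entity
```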




news-2166 Thu, 04 Oct 2018 08:52:00 +0000 WInte.r Web Data Integration Framework Version 1.3 released https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/winter-web-data-integration-framework-version-13-released/ We are happy to announce the release of Version 1.3 of the Web Data Integration Framework (WInte.r).

WInte.r is a Java framework for end-to-end data integration. The framework implements a wide variety of methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions.
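To illustrate how such exchangeable building blocks fit together, here is a minimal identity-resolution sketch. WInte.r itself is a Java framework, so none of the names below correspond to its actual API; this is only the concept:

```python
# Conceptual identity-resolution pipeline built from exchangeable blocks:
# a blocker that limits comparisons, a similarity function, and a
# threshold-based matching rule. Illustrative only, not WInte.r's API.
def blocker(a_records, b_records, key):
    """Standard blocking: only compare records sharing a blocking key."""
    pairs = []
    for a in a_records:
        for b in b_records:
            if a[key] == b[key]:
                pairs.append((a, b))
    return pairs

def jaccard(s1, s2):
    """Token-based similarity function over whitespace tokens."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def matching_rule(pair, threshold=0.5):
    """Linear matching rule over one attribute; real rules combine several."""
    a, b = pair
    return jaccard(a["title"], b["title"]) >= threshold

movies_a = [{"title": "The Matrix Reloaded", "year": "2003"}]
movies_b = [{"title": "Matrix Reloaded", "year": "2003"},
            {"title": "Blade Runner", "year": "1982"}]
candidates = blocker(movies_a, movies_b, key="year")  # only same-year pairs
matches = [p for p in candidates if matching_rule(p)]
```

Swapping the blocker, similarity function, or rule changes the behavior of the pipeline without touching the other blocks, which is the customization idea described above.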

The following features have been added to the framework for the new release:

  • Value Normalization: New ValueNormaliser class for normalizing quantifiers and units of measurement. New DataSetNormaliser class for detecting data types and transforming complete datasets into a normalised base format.
  • External Rule Learning: In addition to learning matching rules directly inside of WInte.r, the new release also supports learning matching rules using external tools such as Rapidminer and importing the learned rules back into WInte.r.
  • Debug Reporting: The new release features detailed reports about the application of matching rules, blockers, and data fusion methods which lay the foundation for fine-tuning the methods.
  • Step-by-Step Tutorial: In order to get users started with the framework, we have written a step-by-step tutorial on how to use WInte.r for identity resolution and data fusion and how to debug and fine-tune the different steps of the integration process.
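To give an idea of what value normalization does conceptually, here is a small sketch. The lookup tables and function name are invented; the actual ValueNormaliser covers far more quantifiers and units:

```python
# Sketch of value normalization: mapping quantifiers and units of
# measurement to a common base so that values become comparable.
# Illustrative Python, not the framework's actual implementation.
QUANTIFIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
UNIT_TO_METRES = {"km": 1000.0, "m": 1.0, "cm": 0.01}

def normalise_value(text):
    """Parse strings like '1.5 thousand km' into a value in base units."""
    tokens = text.lower().split()
    value = float(tokens[0])
    for token in tokens[1:]:
        if token in QUANTIFIERS:
            value *= QUANTIFIERS[token]
        elif token in UNIT_TO_METRES:
            value *= UNIT_TO_METRES[token]
    return value

assert normalise_value("1.5 thousand km") == 1_500_000.0
```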

The WInte.r framework forms a foundation for our research on large-scale web data integration. The framework is used by the T2K Match algorithm for matching millions of Web tables against a central knowledge base, as well as within our work on Web table stitching for improving matching quality. The framework is also used in the context of the DS4DM research project for matching tabular data for data search.

Besides being used for research, the WInte.r framework is also used for teaching. The students of our Web Data Integration course use the framework to solve case studies and implement their term projects.

Detailed information about the WInte.r framework is found at

The WInte.r framework can be downloaded from the same web site. The framework can be used under the terms of the Apache 2.0 License.

Lots of thanks to Alexander Brinkmann and Oliver Lehmberg for their work on the new release as well as on the tutorial and extended documentation in the WInte.r wiki.

news-2105 Fri, 10 Aug 2018 09:58:00 +0000 Data Science Conference LWDA 2018 in Mannheim https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/data-science-conference-lwda-2018-in-mannheim-1/ The Data and Web Science Group is hosting the Data Science Conference LWDA 2018 in Mannheim on August 22-24, 2018.

LWDA, which expands to „Lernen, Wissen, Daten, Analysen“ („Learning, Knowledge, Data, Analytics“), covers recent research in areas such as knowledge discovery, machine learning & data mining, knowledge management, database management & information systems, and information retrieval.

The LWDA conference is organized by and brings together the various special interest groups of the Gesellschaft für Informatik (German Computer Science Society) in this area. The program comprises joint research sessions and keynotes as well as workshops organized by each special interest group.

Further information can be found on the conference website:

Download the conference poster.

news-2084 Mon, 12 Mar 2018 11:57:47 +0000 Third Cohort of Students starts Part-time Master in Data Science https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/third-cohort-of-students-starts-part-time-master-in-data-science/ The third cohort, consisting of 32 students, has started its studies in the part-time master program in Data Science that professors of the DWS group offer together with the Hochschule Albstadt-Sigmaringen.

This weekend the students of the third cohort of the master program as well as students participating in the certificate program Data Science were in Mannheim for a data mining project weekend.

The students worked in teams on two case studies, one in the area of online marketing, the other in the area of text mining. The teams were coached by Prof. Christian Bizer, Dr. Robert Meusel, and Alexander Diete, and we were very happy to see an exciting competition between the teams for the best F1 scores as well as the highest increases in sales.

Additional Information:


news-2049 Thu, 11 Jan 2018 09:42:41 +0000 38.7 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data published https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/387-billion-quads-microdata-embedded-json-ld-rdfa-and-microformat-data-published/ The DWS group is happy to announce a new release of the WebDataCommons Microdata, Embedded JSON-LD, RDFa and Microformat data corpus. The data has been extracted from the November 2017 version of the Common Crawl covering 3.2 billion HTML pages which originate from 26 million websites (pay-level domains).

In summary, we found structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38.9%). These pages originate from 7.4 million different pay-level domains out of the 26 million pay-level-domains covered by the crawl (28.4%). Approximately 3.7 million of these websites use Microdata, 2.6 million websites use JSON-LD, and 1.2 million websites make use of RDFa. Microformats are used by more than 3.3 million websites within the crawl.


More and more websites annotate data describing, for instance, products, people, organizations, places, events, reviews, and cooking recipes within their HTML pages using markup formats such as Microdata, embedded JSON-LD, RDFa, and Microformats. The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and Microformat data from the Common Crawl web corpus, the largest web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. We have been running yearly extractions since 2012 and provide the dataset series as well as the related statistics at:
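As an illustration of the kind of markup that gets extracted, the following stdlib-only sketch pulls embedded JSON-LD out of an HTML page. WebDataCommons itself uses the Any23 parser suite; this is a simplified stand-in:

```python
# Minimal extractor for embedded JSON-LD annotations in an HTML page,
# i.e. <script type="application/ld+json"> blocks. Illustrative only;
# production extraction uses a full parser suite such as Any23.
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.objects = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld and data.strip():
            self.objects.append(json.loads(data))

page = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Product", "name": "Example"}
</script></head><body></body></html>"""
extractor = JsonLdExtractor()
extractor.feed(page)
# extractor.objects now holds one Product annotation
```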

Statistics about the November 2017 Release:

Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

Markup Format Adoption

The page below provides an overview of the increase in the adoption of the different markup formats as well as widely used classes from 2012 to 2017:

Comparing the statistics from the new 2017 release to the statistics about the October 2016 release of the data sets, we see that the adoption of structured data keeps on increasing, while Microdata remains the most dominant markup syntax. The different crawling strategy that was used makes it hard to compare absolute as well as certain relative numbers between the two releases. More concretely, we observe that the November 2017 Common Crawl corpus is much deeper for certain domains, while other domains are covered in a shallower way, with fewer URLs crawled in comparison to the October 2016 Common Crawl corpus. Nevertheless, it is clear that the growth rate of Microdata and Microformats is much higher than that of RDFa and embedded JSON-LD. Although the latter format is widely spread, it is mainly used to annotate metadata for search actions (80% of the domains using JSON-LD), while only a few domains use it for annotating content information such as Organizations (25% of the domains using JSON-LD), Persons (4%), or Offers (0.1%).

Vocabulary Adoption

Concerning vocabulary adoption, schema.org, the vocabulary recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be the most dominant in the context of Microdata, with 78% of the webmasters using it, in comparison to its predecessor, data-vocabulary.org, which is only used by 14% of the websites containing Microdata. In the context of RDFa, the Open Graph Protocol recommended by Facebook remains the most widely used vocabulary.

Parallel Usage of Multiple Formats

Analyzing topic-specific subsets, we discover some interesting trends. As observed in the previous extractions, content-related information is mostly described either with the Microdata format or, less frequently, with the JSON-LD format, in both cases using the schema.org vocabulary. However, we find that 30% of the websites that use JSON-LD annotations to describe product-related information make use of Microdata as well as JSON-LD to cover the same topic. This is not the case for other topics, such as Hotels or Job Postings, for which webmasters use only one format to annotate their content.

Richer Descriptions of Job Postings

Following the release of the “Google for Jobs” search vertical and the more detailed guidance by Google on how to annotate job postings, we see an increase in the number of websites annotating job postings (2017: 7,023, 2016: 6,352). In addition, the job posting annotations tend to become richer in comparison to the previous years, as the number of JobPosting-related properties adopted by at least 30% of the websites containing job offers has increased from 4 (2016) to 7 (2017). The newly adopted properties are JobPosting/url, JobPosting/datePosted, and JobPosting/employmentType. A more extended analysis concerning specific topics, such as Job Posting and Product, can be found here.
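For illustration, a JobPosting annotation carrying the three newly adopted properties could look as follows. The values are invented; the snippet builds the annotation as a Python dict and serializes it into an embedded JSON-LD script tag:

```python
# Build an illustrative schema.org JobPosting annotation and wrap it in
# the embedded JSON-LD script tag that webmasters place in their pages.
# All property values below are made up for this example.
import json

job_posting = {
    "@context": "http://schema.org",
    "@type": "JobPosting",
    "title": "Data Engineer",              # example value
    "url": "https://example.com/jobs/42",  # JobPosting/url
    "datePosted": "2017-11-01",            # JobPosting/datePosted
    "employmentType": "FULL_TIME",         # JobPosting/employmentType
}
markup = '<script type="application/ld+json">%s</script>' % json.dumps(job_posting)
```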


The overall size of the November 2017 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 38.7 billion RDF quads. For download, we split the data into 8,433 files with a total size of 858 GB.
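The released files use the N-Quads format, in which each statement carries the URL of the page it was extracted from as a fourth element. A simplified sketch of reading one quad line (real N-Quads parsing must also handle literals containing spaces and escapes, so a proper RDF parser should be used in practice):

```python
# Simplified reader for one RDF quad line: subject, predicate, object,
# and the graph (the URL of the page the statement was extracted from).
# Only works for objects without embedded spaces; illustrative only.
def parse_simple_quad(line):
    subject, predicate, obj, graph = line.rstrip(" .\n").split(" ", 3)
    return {"s": subject, "p": predicate, "o": obj, "g": graph}

quad = parse_simple_quad(
    "<http://example.com/#product> "
    "<http://schema.org/name> "
    '"Example" '
    "<http://example.com/page.html> ."
)
```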

In addition, we have created separate files for over 40 different classes, each containing all quads extracted from pages that use the specific class.

Lots of thanks to:

  • the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project.
  • the Any23 project for providing their great library of structured data parsers.
  • Amazon Web Services in Education Grant for supporting WebDataCommons.
  • the Ministry of Economy, Research and Arts of Baden-Württemberg, which supported the extraction and analysis of the November 2017 corpus through the ViCE project.

General Information about the WebDataCommons Project:

The WebDataCommons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the WebDataCommons project also provides large hyperlink graphs, the largest public corpus of WebTables, a corpus of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the WebDataCommons project is found at

Have fun with the new data set!


Anna Primpeli, Robert Meusel, and Christian Bizer



news-2021 Mon, 06 Nov 2017 14:11:00 +0000 Dominique Ritze defended her PhD Thesis https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/dominique-ritze-defended-her-phd-thesis/ On November 6th, Dominique Ritze successfully defended her PhD thesis Web-Scale Web Table to Knowledge Base Matching. Supervisor was Prof. Christian Bizer, second reader was Prof. Kai Eckert from Hochschule der Medien Stuttgart.

Abstract of the thesis:

Millions of relational HTML tables are found on the World Wide Web. In contrast to unstructured text, relational web tables provide a compact representation of entities described by attributes. The data within these tables covers a broad topical range. Web table data is used for question answering, augmentation of search results, and knowledge base completion. Until a few years ago, only search engine companies like Google and Microsoft owned large web crawls from which web tables are extracted. Thus, researchers outside these companies have not been able to work with web tables.

In this thesis, the first publicly available web table corpus containing millions of web tables is introduced. The corpus enables interested researchers to experiment with web tables. A profile of the corpus is created to give insights into its characteristics and topics. Further, the potential of web tables for augmenting cross-domain knowledge bases is investigated. For the use case of knowledge base augmentation, it is necessary to understand the web table content. For this reason, web tables are matched to a knowledge base. The matching comprises three matching tasks: instance, property, and class matching. Existing web table to knowledge base matching systems either focus on a subset of these matching tasks or are evaluated using gold standards which also only cover a subset of the challenges that arise when matching web tables to knowledge bases.

This thesis systematically evaluates the utility of a wide range of different features for the web table to knowledge base matching task using a single gold standard. The results of the evaluation are used afterwards to design a holistic matching method which covers all matching tasks and outperforms state-of-the-art web table to knowledge base matching systems. In order to achieve these goals, we first propose the T2K Match algorithm which addresses all three matching tasks in an integrated fashion. In addition, we introduce the T2D gold standard which covers a wide variety of challenges. By evaluating T2K Match against the T2D gold standard, we identify that only considering the table content is insufficient. Hence, we include features of three categories: features found in the table, in the table context like the page title, and features based on external resources like a synonym dictionary.

We analyze the utility of the features for each matching task. The analysis shows that certain problems cannot be overcome by matching each table in isolation to the knowledge base. In addition, relying on the features is not enough for the property matching task. Based on these findings, we extend T2K Match into T2K Match++ which exploits indirect matches to web tables about the same topic and uses knowledge derived from the knowledge base. We show that T2K Match++ outperforms all state-of-the-art web table to knowledge base matching approaches on the T2D and Limaye gold standards. Most systems show good results on one matching task, but T2K Match++ is the only system that achieves F-measure scores above 0.8 for all tasks. Compared to the results of the best-performing system, TableMiner+, the F-measure for the difficult property matching task is increased by 0.08, and for the class and instance matching tasks by 0.05 and 0.03, respectively.

Bibliographic meta-information and download of the thesis.



news-2014 Fri, 27 Oct 2017 08:41:55 +0000 SWSA Ten-Year Award won by DBpedia Paper https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/swsa-ten-year-award-won-by-dbpedia-paper/ We are happy to announce that Professor Christian Bizer has received the SWSA Ten-Year Award at the 16th International Semantic Web Conference (ISWC 2017) in Vienna for the paper “DBpedia: A Nucleus for a Web of Open Data” that he co-authored in 2007.

The SWSA Ten-Year Award recognizes the highest impact papers from the ISWC proceedings ten years prior (i.e., in 2017 the award honors a paper from 2007). The decision is based primarily, but not exclusively, on the number of citations to the papers from the proceedings in the intervening decade.

DBpedia is a large-scale cross-domain knowledge base which we extract from Wikipedia and make available on the Web under an open license. DBpedia allows users to ask sophisticated queries against Wikipedia knowledge and serves as an interlinking hub in the Web of Linked Data. In addition, DBpedia is widely used as background knowledge for applications such as search, natural language understanding, and data integration.

According to Google Scholar, the paper “DBpedia: A Nucleus for a Web of Open Data” has been cited 2,770 times as of October 2017.


news-1937 Tue, 11 Jul 2017 08:10:36 +0000 Paper accepted at VLDB 2017 https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/paper-accepted-at-vldb-2017/ We have a paper accepted at the 43rd International Conference on Very Large Data Bases (VLDB 2017), a premier conference in the field of databases and data management. The conference takes place in Munich at the end of August 2017.

Oliver Lehmberg, Christian Bizer 

Stitching Web Tables for Improving Matching Quality

HTML tables on web pages ("web tables") cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be matched to the respective knowledge base or base table. The challenges of web table matching are the high heterogeneity and the small size of the tables.
Though it is known that the majority of web tables are very small, the gold standards that are used to compare web table matching systems mostly consist of larger tables. In this experimental paper, we evaluate T2K Match, a web table to knowledge base matching system, and COMA, a standard schema matching tool, using a sample of web tables that is more realistic than the gold standards that were previously used. We find that both systems fail to produce correct results for many of the very small tables in the sample. As a remedy, we propose to stitch (combine) the tables from each web site into larger ones and match these enlarged tables to the knowledge base or base table afterwards. For this stitching process, we evaluate different schema matching methods in combination with holistic correspondence refinement. Limiting the stitching procedure to web tables from the same web site decreases the heterogeneity and allows us to stitch tables with very high precision. Our experiments show that applying table stitching before running the actual matching method improves the matching results by 0.38 in F1-measure for T2K Match and by 0.14 for COMA. Also, stitching the tables allows us to reduce the number of tables in our corpus from 5 million original web tables to as few as 100,000 stitched tables.
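The stitching idea can be sketched as follows. This is illustrative Python in which exact header equality stands in for the schema matching and holistic correspondence refinement used in the paper:

```python
# Conceptual sketch of table stitching: tables from the same web site
# whose schemas match are unioned into one larger table before the
# actual matching step. Exact normalized-header equality replaces the
# paper's schema matching methods here; data is invented.
def stitch_tables(tables):
    """Union tables whose (normalized) headers are identical."""
    stitched = {}
    for table in tables:
        schema = tuple(h.strip().lower() for h in table["header"])
        stitched.setdefault(schema, []).extend(table["rows"])
    return stitched

site_tables = [
    {"header": ["Player", "Club"], "rows": [["A. Example", "FC One"]]},
    {"header": ["player", "club"], "rows": [["B. Sample", "FC Two"]]},
    {"header": ["Song", "Year"], "rows": [["Tune C", "1999"]]},
]
stitched = stitch_tables(site_tables)
# The two player tables collapse into one stitched table with two rows
```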

news-1928 Fri, 23 Jun 2017 08:52:12 +0000 Web Data Integration Framework (WInte.r) released https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/web-data-integration-framework-winter-released/ We are happy to announce the release of the Web Data Integration Framework (WInte.r).

WInte.r is a Java framework for end-to-end data integration. The framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions. In addition, these pre-defined building blocks can be used as foundation for implementing advanced integration methods.

The WInte.r framework forms the foundation for our research on large-scale web data integration. The framework contains an implementation of the T2K Match algorithm for matching millions of Web tables against a central knowledge base. The framework is also used in the context of the DS4DM research project for matching tabular data for data search.

Besides being used for research, the WInte.r framework is also used for teaching. The students of our Web Data Integration course use the framework to solve the course case study. In addition, most students use the framework as the foundation for their term projects.

Detailed information about the WInte.r framework is found at

The WInte.r framework can be downloaded from the same web site. The framework can be used under the terms of the Apache 2.0 License.

news-1837 Mon, 13 Mar 2017 10:36:46 +0000 Robert Meusel defended his PhD Thesis https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/robert-meusel-defended-his-phd-thesis/ On March 10th, Robert Meusel successfully defended his PhD thesis Web-Scale Profiling of Semantic Annotations in HTML Pages. Supervisor was Prof. Christian Bizer, second reader was Prof. Wolfgang Nejdl from Leibniz Universität Hannover.

Abstract of the thesis:

The vision of the Semantic Web was coined by Tim Berners-Lee almost two decades ago. The idea describes an extension of the existing Web in which “information is given well-defined meaning, better enabling computers and people to work in cooperation” [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one realization of this vision which was adopted by large numbers of web sites in recent years. Semantic annotations are integrated into the code of HTML pages using one of the three markup languages Microformats, RDFa, or Microdata. Major consumers of semantic annotations are the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic annotations from crawled web pages to enrich the presentation of search results and to complement their knowledge bases. However, outside the large search engine companies, little is known about the deployment of semantic annotations: How many web sites deploy semantic annotations? What are the topics covered by semantic annotations? How detailed are the annotations? Do web sites use semantic annotations correctly? Are semantic annotations useful for others than the search engine companies? And how can semantic annotations be gathered from the Web in that case? The thesis answers these questions by profiling the web-wide deployment of semantic annotations. The topic is approached in three consecutive steps: In the first step, two approaches for extracting semantic annotations from the Web are discussed. The thesis first evaluates the technique of focused crawling for harvesting semantic annotations. Afterward, a framework to extract semantic annotations from existing web crawl corpora is described. The two extraction approaches are then compared for the purpose of analyzing the deployment of semantic annotations in the Web. In the second step, the thesis analyzes the overall and markup language-specific adoption of semantic annotations.
This empirical investigation is based on the largest web corpus that is available to the public. Further, the topics covered by deployed semantic annotations and their evolution over time are analyzed. Subsequent studies examine common errors within semantic annotations. In addition, the thesis analyzes the data overlap of the entities that are described by semantic annotations from the same and across different web sites. The third step narrows the focus of the analysis towards use case-specific issues. Based on the requirements of a marketplace, a news aggregator, and a travel portal, the thesis empirically examines the utility of semantic annotations for these use cases. Additional experiments analyze the capability of product-related semantic annotations to be integrated into an existing product categorization schema. In particular, the potential of exploiting the diverse category information given by the web sites providing semantic annotations is evaluated.


Dataspace Profiling, RDFa, Microformats, Microdata, Crawling


The full text of the thesis is available from the MADOC document server.


