RSS-Feed en-gb TYPO3 News Sat, 19 Jan 2019 16:50:15 +0000 Sat, 19 Jan 2019 16:50:15 +0000 TYPO3 EXT:news news-2186 Thu, 17 Jan 2019 08:28:16 +0000 31.5 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 9.6 million websites published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/315-billion-quads-microdata-embedded-json-ld-rdfa-and-microformat-data-originating-from-96-mill/ The DWS group is happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus. The data has been extracted from the November 2018 version of the Common Crawl covering 2.5 billion HTML pages which originate from 32 million websites (pay-level domains).

In summary, we found structured data within 900 million HTML pages out of the 2.5 billion pages contained in the crawl (37.1%). These pages originate from 9.6 million different pay-level domains out of the 32.8 million pay-level domains covered by the crawl (29.3%). Approximately 5.1 million of these websites use Microdata, 3.8 million websites use JSON-LD, and 1.3 million websites make use of RDFa. Microformats are used by more than 3.3 million websites within the crawl.


More and more websites annotate data describing, for instance, products, people, organizations, places, events, reviews, and cooking recipes within their HTML pages using markup formats such as Microdata, embedded JSON-LD, RDFa, and Microformats.

The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and Microformat data from the Common Crawl web corpus, the largest web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. We have run yearly extractions since 2012 and provide the dataset series as well as the related statistics at:

Statistics about the November 2018 Release

Basic statistics about the November 2018 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

Markup Format Adoption

The page below provides an overview of the increase in the adoption of the different markup formats as well as widely used classes from 2012 to 2018:

Comparing the statistics from the new 2018 release to the statistics about the November 2017 release of the data sets, we see that the adoption of structured data keeps increasing while Microdata remains the dominant markup syntax. Differences in the crawling strategies that were used for the two crawls make it difficult to directly compare absolute as well as certain relative numbers. More concretely, we observe that the November 2018 Common Crawl corpus is shallower but wider, as fewer URLs from more PLDs are crawled compared to the November 2017 Common Crawl corpus. Nevertheless, it is clear that the growth rates of Microdata and embedded JSON-LD are much higher than that of RDFa. Comparing the number of PLDs per markup format for certain classes, we observe a tendency to prefer specific annotation formats for some domains. For example, the JSON-LD format is more widely used for annotating data about organizations and persons, whereas the Microdata format is preferred for annotating product and event data.

Vocabulary Adoption

Concerning vocabulary adoption, schema.org, the vocabulary recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be the most dominant in the context of Microdata, with 75% of the webmasters using it, in comparison to its predecessor, the data-vocabulary, which is only used by 13% of the websites containing Microdata. In the context of RDFa, the Open Graph Protocol recommended by Facebook remains the most widely used vocabulary. The file below analyzes the adoption of terms that have been newly introduced in the last two years. The file also provides statistics on how many websites use specific classes together with the JSON-LD and Microdata syntax.


The overall size of the November 2018 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 31.5 billion RDF quads. For download, we split the data into 7,263 files with a total size of 728 GB.

In addition, we have created separate files for over 40 different classes, each including all quads extracted from pages that use a specific class.

Lots of thanks to 

+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 

Training Dataset and Gold Standard for Large-Scale Product Matching

As a side note on what else is happening in the Web Data Commons project around data: Using the November 2017 Product data corpus, we created a training dataset and gold standard for large-scale product matching. The training dataset consists of more than 26 million product offers originating from 79 thousand websites that use annotations. Using annotated identifiers such as MPN and GTINs, we grouped the offers into 16 million clusters with each cluster referring to the same real-world product. The gold standard consists of 2000 pairs of offers which were manually verified as matches or non-matches. We provide the training dataset and gold standard for public download thus hoping to contribute to improving the evaluation and comparison of different entity matching algorithms.

General Information about the WebDataCommons Project

The WebDataCommons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the WebDataCommons project also provides large hyperlink graphs, the largest public corpus of web tables, two corpora of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the WebDataCommons project is found at

Have fun with the new data set. 


Anna Primpeli, Robert Meusel and Chris Bizer

Other Research - Web-based Systems Projects Chris
news-2178 Thu, 20 Dec 2018 08:29:45 +0000 WDC Training Dataset and Gold Standard for Large-Scale Product Matching released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/wdc-training-dataset-and-gold-standard-for-large-scale-product-matching-released/ The research focus in the field of entity resolution (aka link discovery or duplicate detection) is moving from traditional symbolic matching methods to embeddings and deep neural network based matching. A problem with evaluating deep learning based matchers is that they are rather training data hungry and that the benchmark datasets that are traditionally used for comparing matching methods are often too small to properly evaluate this new family of methods.

By publishing the WDC Training Dataset and Gold Standard for Large-scale Product Matching, we hope to contribute to solving this problem. The training dataset consists of 26 million product offers (16 million English language offers) originating from 79 thousand different e-shops. For grouping the offers into clusters describing the same product, we rely on product identifiers such as GTINs or MPNs that are annotated with markup in the HTML pages of the e-shops. Using these identifiers and a specific cleansing workflow, we group the offers into 16 million clusters. Considering only clusters of English offers with a size larger than five, and excluding clusters larger than 80 offers, which may introduce noise, 20.7 million positive training examples (pairs of matching product offers) and a maximum of 2.6 trillion negative training examples can be derived from the dataset. The training dataset is thus several orders of magnitude larger than the largest training set for product matching that has been accessible to the public so far.
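To illustrate how training pairs follow from identifier-based clusters, here is a minimal Python sketch. It assumes that every pair of offers within one cluster counts as a positive example and that any pair of offers from two different clusters is a candidate negative example, which gives an upper bound. The cluster keys and offer names are hypothetical, and the cleansing workflow and size filters described above are not modelled.

```python
from itertools import combinations
from math import comb


def count_pairs(cluster_sizes):
    """Return (positives, max_negatives) for a list of cluster sizes."""
    total_offers = sum(cluster_sizes)
    # Every unordered pair inside a cluster is a matching pair.
    positives = sum(comb(n, 2) for n in cluster_sizes)
    # All remaining unordered pairs span two different clusters.
    max_negatives = comb(total_offers, 2) - positives
    return positives, max_negatives


def positive_pairs(clusters):
    """Yield every pair of offers that share a cluster."""
    for offers in clusters.values():
        yield from combinations(sorted(offers), 2)


# Hypothetical toy clusters keyed by a product identifier.
clusters = {
    "gtin-1": ["offer-a", "offer-b", "offer-c"],
    "gtin-2": ["offer-d", "offer-e"],
}

sizes = [len(offers) for offers in clusters.values()]
positives, max_negatives = count_pairs(sizes)
pairs = list(positive_pairs(clusters))
```

Because the number of cross-cluster pairs grows quadratically with the number of offers, the pool of possible negatives is far larger than the pool of positives, which is consistent with the gap between the two figures quoted above.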

In addition to the training dataset, we have also built a gold standard for evaluating matching methods by manually verifying whether 2000 pairs of offers refer to the same product. The gold standard covers the product categories computers, shoes, watches, and cameras. Using both artefacts to publicly verify the results that Mudgal et al. (SIGMOD 2018) recently achieved using private training data, we find that embeddings and deep learning based methods outperform traditional symbolic matching methods (SVMs and random forests) by 6% to 10% in F1 on our gold standard.
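For readers unfamiliar with the metric, the following Python sketch shows how F1 can be computed over a set of labeled offer pairs. The labels and predictions are made-up toy values, not taken from the gold standard, and the function name is an assumption for illustration.

```python
def f1_score(gold, predicted):
    """F1 over parallel lists of booleans (True means 'match')."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        # No correct matches: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: six manually labeled pairs and a matcher's predictions.
gold = [True, True, True, False, False, False]
predicted = [True, True, False, True, False, False]
score = f1_score(gold, predicted)
```

In this toy case the matcher finds two of the three true matches and raises one false alarm, so precision and recall are both 2/3, and so is F1.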

We think that the creation of the WDC Training Dataset nicely demonstrates the utility of the Semantic Web. Without the website owners putting semantic annotations into their HTML pages, it would have been much harder, if not impossible, to extract product offers from 79 thousand e-shops, and we would likely not have dared to approach this task.

More information about the WDC Training Dataset and Gold Standard for Large-scale Product Matching is found on the WDC website which also offers both artefacts for public download.

Lots of thanks to

  • Anna Primpeli for extracting the training dataset from the CommonCrawl and developing the cleansing workflow.
  • Ralph Peeters for creating the gold standard and performing the matching experiments.
Research - Web-based Systems Projects Chris
news-2175 Wed, 28 Nov 2018 08:34:00 +0000 Paper accepted at EDBT 2019 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-edbt-2019/ Our systems and applications paper

Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data (Yaser Oulabi, Christian Bizer)

got accepted at the 22nd International Conference on Extending Database Technology (EDBT 2019), one of the top-tier conferences in the data management field!

Abstract of the paper:

 Cross-domain knowledge bases such as YAGO, DBpedia, or the Google Knowledge Graph are being used as background knowledge within an increasing range of applications including web search, data integration, natural language understanding, and question answering. The usefulness of a knowledge base for these applications depends on its completeness. Relational HTML tables that are published on the Web cover a wide range of topics and describe very specific long tail entities, such as small villages, less-known football players, or obscure songs. This systems and applications paper explores the potential of web table data for the task of completing cross-domain knowledge bases with descriptions of formerly unknown entities. We present the first system that handles all steps that are necessary to complete this task: schema matching, row clustering, entity creation, and new detection. The evaluation of the system using a manually labeled gold standard shows that it can construct formerly unknown instances from table data with an F1 score of 0.82. In a second experiment, we apply the system to a large corpus of web tables that has been extracted from the Common Crawl. This experiment allows us to get an overall impression of the potential of web tables for augmenting knowledge bases with long tail entities. The experiment shows that we can augment the DBpedia knowledge base with descriptions of 15 thousand new football players as well as 173 thousand new songs. The accuracy of the facts describing these instances is 0.73.




Publications Research - Web-based Systems Chris
news-2164 Thu, 04 Oct 2018 20:27:13 +0000 New DFG Project on joining graph- and vector-based sense representations for semantic end-user information access https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/new-dfg-project-on-joining-graph-and-vector-based-sense-representations-for-semantic-end-user-infor/ We are happy to announce that the Deutsche Forschungsgemeinschaft accepted our proposal for extending a joint research project on hybrid semantic representations together with our friends and colleagues of the Language Technology Group of the University of Hamburg.

The project, titled "Joining graph- and vector-based sense representations for semantic end-user information access" (JOIN-T 2), builds upon our JOIN-T project (also funded by DFG) and aims at bringing it one step forward. Our vision for the next three years is to explore ways to produce semantic representations that combine the interpretability of manually crafted resources and sparse representations with the accuracy and high coverage of dense neural embeddings.

Stay tuned for forthcoming research papers and resources!

Research Topics - Artificial Intelligence (NLP) Simone
news-2166 Thu, 04 Oct 2018 08:52:00 +0000 WInte.r Web Data Integration Framework Version 1.3 released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/winter-web-data-integration-framework-version-13-released/ We are happy to announce the release of Version 1.3 of the Web Data Integration Framework (WInte.r).

WInte.r is a Java framework for end-to-end data integration. The framework implements a wide variety of different methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions.
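The idea of exchangeable building blocks can be sketched in a few lines. The Python example below is a generic illustration of identity resolution with a blocker, a similarity function, and a threshold-based matching rule. It is not the WInte.r API (WInte.r is a Java framework); all function names, record fields, and the threshold are assumptions made for this sketch.

```python
from collections import defaultdict
from itertools import product


def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def block_by(records, key):
    """Blocker: group records by a blocking key to limit comparisons."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    return blocks


def match(left, right, blocking_key, similarity, threshold):
    """Matching rule: compare records within shared blocks only."""
    left_blocks = block_by(left, blocking_key)
    right_blocks = block_by(right, blocking_key)
    matches = []
    for key in left_blocks.keys() & right_blocks.keys():
        for a, b in product(left_blocks[key], right_blocks[key]):
            if similarity(a["name"], b["name"]) >= threshold:
                matches.append((a["id"], b["id"]))
    return sorted(matches)


# Hypothetical toy records from two sources.
left = [
    {"id": "l1", "name": "Canon EOS 80D camera", "brand": "canon"},
    {"id": "l2", "name": "Nikon D3500 camera", "brand": "nikon"},
]
right = [
    {"id": "r1", "name": "canon eos 80d", "brand": "canon"},
    {"id": "r2", "name": "Nikon Coolpix B500", "brand": "nikon"},
]

result = match(left, right, lambda r: r["brand"], jaccard, threshold=0.5)
```

Swapping the blocking key, the similarity function, or the threshold changes the behaviour without touching the rest of the pipeline, which is the kind of customization the paragraph above describes.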

The following features have been added to the framework for the new release:

  • Value Normalization: New ValueNormaliser class for normalizing quantifiers and units of measurement. New DataSetNormaliser class for detecting data types and transforming complete datasets into a normalised base format.
  • External Rule Learning: In addition to learning matching rules directly inside of WInte.r, the new release also supports learning matching rules using external tools such as RapidMiner and importing the learned rules back into WInte.r.
  • Debug Reporting: The new release features detailed reports about the application of matching rules, blockers, and data fusion methods which lay the foundation for fine-tuning the methods.
  • Step-by-Step Tutorial: In order to get users started with the framework, we have written a step-by-step tutorial on how to use WInte.r for identity resolution and data fusion and how to debug and fine-tune the different steps of the integration process.

The WInte.r framework forms a foundation for our research on large-scale web data integration. The framework is used by the T2K Match algorithm for matching millions of Web tables against a central knowledge base, as well as within our work on Web table stitching for improving matching quality. The framework is also used in the context of the DS4DM research project for matching tabular data for data search.

Besides being used for research, the WInte.r framework is also used for teaching. The students of our Web Data Integration course use the framework to solve case studies and implement their term projects.

Detailed information about the WInte.r framework is found at

The WInte.r framework can be downloaded from the same web site. The framework can be used under the terms of the Apache 2.0 License.

Lots of thanks to Alexander Brinkmann and Oliver Lehmberg for their work on the new release as well as on the tutorial and extended documentation in the WInte.r wiki.

Research - Web-based Systems Chris Projects
news-2151 Tue, 28 Aug 2018 12:34:04 +0000 Paper accepted at EMNLP 2018 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-emnlp-2018/ Our long paper submission

"Investigating the Role of Argumentation in the Rhetorical Analysis of Scientific Publications with Neural Multi-Task Learning Models" (Anne Lauscher, Goran Glavaš, Kai Eckert, and Simone Paolo Ponzetto)

got accepted at the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), one of the top-tier conferences in natural language processing!

Simone Publications Kai Topics - Artificial Intelligence (NLP) Research - Data Analytics
news-2150 Fri, 24 Aug 2018 14:27:52 +0000 André Melo has defended his PhD thesis https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/andre-melo-has-defended-his-phd-thesis/ André Melo has defended his PhD thesis on "Automatic Refinement of Large-Scale Cross-Domain Knowledge Graphs", supervised by Prof. Heiko Paulheim.

In his thesis, André has developed different methods to improve large-scale, cross-domain knowledge graphs along various dimensions. His contributions include, among others, a benchmarking suite for knowledge graph completion and correction, an effective method for type prediction using hierarchical classification, and a machine-learning based method for detecting wrong relation assertions. Moreover, he has proposed methods for error correction in knowledge graphs and for distilling high-level tests from the individual errors identified.

As of September, André will start a new job as a knowledge engineer for Babylon Health in London. We wish him all the best!

Group Research
news-2131 Wed, 11 Jul 2018 16:36:52 +0000 Paper accepted at ISWC 2018: Fine-grained Evaluation of Rule- and Embedding-based Systems for Knowledge Graph Completion https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-iswc-2018-fine-grained-evaluation-of-rule-and-embedding-based-systems-for-knowle/ The paper "Fine-grained Evaluation of Rule- and Embedding-based Systems for Knowledge Graph Completion" by Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla, and Heiner Stuckenschmidt has been accepted at the 2018 International Semantic Web Conference (ISWC).

Over the recent years, embedding methods have attracted increasing focus as a means for knowledge graph completion. Similarly, rule-based systems have been studied for this task in the past. What is missing so far is a common evaluation that includes more than one type of method. We close this gap by comparing representatives of both types of systems in a frequently used evaluation protocol. Leveraging the explanatory qualities of rule-based systems, we present a fine-grained evaluation that gives insight into characteristics of the most popular datasets and points out the different strengths and shortcomings of the examined approaches. Our results show that models such as TransE, RESCAL or HolE have problems in solving certain types of completion tasks that can be solved by a rule-based approach with high precision. At the same time, there are other completion tasks that are difficult for rule-based systems. Motivated by these insights, we combine both families of approaches via ensemble learning. The results support our assumption that the two methods complement each other in a beneficial way.

Publications Rainer
news-2124 Wed, 27 Jun 2018 06:03:30 +0000 Mannheim Students Score Second Place at Data Mining Cup https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/mannheim-students-score-second-place-at-data-mining-cup/ The Data Mining Cup is an annual data mining competition for students from all over the world. Since 2014, students from Mannheim have taken part in the competition as an integral part of the Data Mining 2 lecture, held by Prof. Paulheim. In the course of the competition, the students have to solve a data mining task based on real e-commerce data.

This year, the data was provided by an online sports apparel retailer, and the task was to predict the sellout date for individual articles. Students had six weeks to develop their solution. In the course of the lecture, they worked in different teams and had regular discussions about solution approaches and results.

One of the student teams from Mannheim qualified for the final round of the 10 best teams in May and was invited to present their solution in Berlin at the prudsys personalization & pricing summit. In the final ranking, they scored second out of 197 solutions in total. Overall, teams from 148 universities from 47 countries took part in the 2018 Data Mining Cup.

The DWS group wants to congratulate the winning team:

  • Nele Ecker
  • Thilo Habrich
  • Andreea Iana
  • Adrian Kochsiek
  • Alexander Luetke
  • Laurien Theresa Lummer
  • Nils Richter
  • Fabian Oliver Schmitt

Picture: Members of the winning team in Berlin. Left to right: Nele Ecker, Laurien Lummer, Adrian Kochsiek, Alexander Lütke
Picture credits: Data Mining Cup/prudsys AG

Group Research
news-2123 Fri, 22 Jun 2018 09:35:32 +0000 JCDL 2018 - Vannevar Bush Best Paper Award https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/jcdl-2018-vannevar-bush-best-paper-award/ Our paper "Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context" has recently won the Vannevar Bush best paper award at the 2018 Joint Conference on Digital Libraries (JCDL), the top conference in the field of digital libraries!

The work, coauthored by Federico Nanni, Simone Paolo Ponzetto and Laura Dietz, is part of a collaboration between the DWS group and the University of New Hampshire in the context of an Elite Post-Doc grant of the Baden-Württemberg Stiftung recently awarded to Laura.

Congratulations also to Myriam Traub, Thaer Samar, Jacco van Ossenbruggen and Lynda Hardman, who, with their work, share with us the 2018 best paper award!

Simone Publications Research