The DWS group is happy to announce a new release of the WebDataCommons Microdata, Embedded JSON-LD, RDFa and Microformat data corpus.
The data has been extracted from the October 2016 version of the CommonCrawl covering 3.2 billion HTML pages which originate from 34 million websites (pay-level domains).
Altogether we discovered structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.6 million different pay-level domains out of the 34 million pay-level domains covered by the crawl (16.5%).
Approximately 2.5 million of these websites use Microdata, 2.1 million websites employ JSON-LD, and 938 thousand websites use RDFa. Microformats are used by over 1.6 million websites within the crawl.
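The percentages above can be rechecked from the rounded figures quoted in this announcement. The following is only a small sketch: the variable names are illustrative, and because the inputs are rounded the results can differ slightly from the published statistics.

```python
# Rounded figures as quoted in the announcement (PLD = pay-level domain).
pages_total = 3.2e9
pages_with_data = 1.2e9
plds_total = 34e6
plds_with_data = 5.6e6

# Websites (PLDs) per markup format, as quoted in the announcement.
plds_by_format = {
    "Microdata": 2.5e6,
    "JSON-LD": 2.1e6,
    "Microformats": 1.6e6,
    "RDFa": 938e3,
}


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


page_share = share(pages_with_data, pages_total)
pld_share = share(plds_with_data, plds_total)
# Share of each format among the PLDs that contain any markup data.
format_shares = {name: share(n, plds_with_data) for name, n in plds_by_format.items()}

print(f"pages with structured data: {page_share:.1f}%")
print(f"PLDs with structured data:  {pld_share:.1f}%")
for name, pct in format_shares.items():
    print(f"{name}: {pct:.1f}% of PLDs with markup")
```

The per-format shares add up to more than 100% because the per-format website counts sum to more than 5.6 million, so some websites must use more than one markup format.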
More and more websites annotate structured data within their HTML pages using markup formats such as RDFa, Microdata, embedded JSON-LD and Microformats. The annotations cover topics such as products, reviews, people, organizations, places, events, and cooking recipes.
The WebDataCommons project extracts all Microdata, RDFa data, and Microformat data, and since 2015 also embedded JSON-LD data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format.
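To illustrate what "embedded JSON-LD" means in practice, here is a minimal sketch that collects the contents of script elements of type application/ld+json from an HTML page using only the Python standard library. It is not the WebDataCommons extraction pipeline; the class name JsonLdExtractor and the sample page are hypothetical and exist only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None (e.g. a bare "type" attribute).
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON instead of failing the whole page


# Hypothetical sample page annotating a website search action.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "WebSite", "url": "http://example.com/",
 "potentialAction": {"@type": "SearchAction", "target": "http://example.com/?q={query}"}}
</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
extractor.close()
types = [item.get("@type") for item in extractor.items if isinstance(item, dict)]
print(types)
```

A real extractor also has to deal with encodings, broken markup, and the conversion of the JSON-LD into RDF, all of which this sketch leaves out.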
Besides the markup data, the WebDataCommons project also provides large web table corpora and web graphs for download. General information about the WebDataCommons project is found at
Data Set Statistics:
Basic statistics about the October 2016 Microdata, Embedded JSON-LD, RDFa
and Microformat data sets as well as the vocabularies that are used together with each
markup format are found at:
Compared with the November 2015 release of the data sets, the Microdata syntax remains the dominant annotation format. Absolute numbers are hard to compare between the two years because the October 2016 crawl corpus is almost double the size of the November 2015 one, but a relative increase can be observed: in the October 2016 corpus, over 44% of the pay-level domains containing markup data use the Microdata syntax, compared to 40% one year earlier. Although the absolute number of websites using the RDFa syntax has risen, the increase does not keep pace with the growth of the corpus, indicating that the relative adoption of RDFa is declining. As in the 2015 release, the adoption of embedded JSON-LD has increased considerably, even though the main focus of the annotations remains the search action offered by the websites (70%).
As already observed in the previous years, the schema.org vocabulary is most frequently used in the context of Microdata while the adoption of its predecessor, the data vocabulary, continues to decrease. In the context of RDFa, we still find the Open Graph Protocol recommended by Facebook to be the most widely used vocabulary.
In terms of topics, the trends identified in the former extractions continue. Besides navigational, blog, and CMS-related meta-information, many websites annotate e-commerce-related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress). More concretely, the October 2016 corpus includes more than 682 million product records originating from 249 thousand websites which use the schema.org vocabulary. The new release contains postal address data for more than 291 million entities originating from 338 thousand websites. Furthermore, the content describing hotels has doubled in size in this release, with a total of 61 million hotel descriptions.
Visualizations of the main adoption trends for the different annotation formats, as well as for popular schema.org and RDFa classes, within the time span 2012 to 2016 are found at
The overall size of the October 2016 Microdata, RDFa, Embedded JSON-LD, and Microformat data sets is 44.2 billion RDF quads. For download, we split the data into 9,661 files with a total size of 987 GB.
In addition, we have created separate files for over 40 different schema.org classes. Each file includes all quads from the pages that use the specific class at least once.
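The idea behind the class-specific files can be sketched as follows. This is an illustration only, not the code used to build the released files. It assumes that the data is line-based N-Quads with the page URL as the fourth element of each quad; the regular expression is a deliberately simple stand-in for a full N-Quads parser, and the function names and sample lines are hypothetical.

```python
import re

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Simplified pattern: subject, predicate, object (may contain spaces),
# and a final IRI that is taken to be the page URL (graph label).
QUAD_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s+(<[^>]*>)\s*\.\s*$")


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match."""
    match = QUAD_RE.match(line.strip())
    return match.groups() if match else None


def quads_for_class(lines, class_iri):
    """Keep all quads from pages that use class_iri at least once."""
    lines = list(lines)  # two passes are needed; fine for small inputs only
    target = f"<{class_iri}>"
    pages = set()
    # Pass 1: find the pages that contain an rdf:type statement for the class.
    for line in lines:
        quad = parse_quad(line)
        if quad and quad[1] == RDF_TYPE and quad[2] == target:
            pages.add(quad[3])
    # Pass 2: keep every quad whose page was found in pass 1.
    selected = []
    for line in lines:
        quad = parse_quad(line)
        if quad and quad[3] in pages:
            selected.append(line)
    return selected


# Hypothetical sample quads from two pages.
SAMPLE = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://shop.example/p1> .',
    '_:n1 <http://schema.org/name> "Blue <b>mug</b>" <http://shop.example/p1> .',
    '_:n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Hotel> <http://hotel.example/> .',
    '_:n2 <http://schema.org/name> "Sea View" <http://hotel.example/> .',
]

product_quads = quads_for_class(SAMPLE, "http://schema.org/Product")
for line in product_quads:
    print(line)
```

For files of the size described above, holding all lines in memory is not practical; one would read the file twice as a stream, or rely on the quads of one page being stored together if that turns out to hold for the data.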
Lots of thanks to:
+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project.
+ the Any23 project for providing their great library of structured data parsers.
+ Amazon Web Services in Education Grant for supporting WebDataCommons.
+ the Ministry of Economy, Research and Arts of Baden-Württemberg, which supported the extraction and analysis of the October 2016 corpus by means of the ViCe project.
Have fun with the new data set.
Anna Primpeli, Robert Meusel and Chris Bizer