24.4 billion quads of RDFa, Microdata and Microformat data published

The DWS group is happy to announce a new release of the Web Data Commons RDFa, Microdata, Embedded JSON-LD and Microformat data corpus.

The data corpus has been extracted from the November 2015 version of the Common Crawl, which covers 1.77 billion HTML pages originating from 14.4 million websites (pay-level domains).

Altogether, we discovered structured data within 541 million HTML pages out of the 1.77 billion pages contained in the crawl (30%). These pages originate from 2.7 million different pay-level domains out of the 14.4 million pay-level domains covered by the crawl (19%).

Approximately 521 thousand of these websites use RDFa, while 1.1 million websites use Microdata. Microformats are also used by over 1 million websites within the crawl. For the first time, we have also extracted embedded JSON-LD, which we can report to be used by more than 596 thousand websites.

Background

More and more websites embed structured data describing, for instance, products, people, organizations, places, events, reviews, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats.

The WebDataCommons project extracts all Microformat, Microdata and RDFa data, and since 2015 also embedded JSON-LD data, from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format.

Besides the data extracted from the named markup syntaxes, the WebDataCommons project also provides for download one of the largest publicly accessible corpora of WebTables extracted from web crawls, as well as a collection of hypernyms extracted from billions of web pages.

General information about the WebDataCommons project can be found at http://webdatacommons.org/

Data Set Statistics

Basic statistics about the November 2015 RDFa, Microdata, Embedded JSON-LD and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

http://webdatacommons.org/structureddata/2015-11/stats/stats.html

Comparing the statistics to the statistics about the December 2014 release of the data sets

http://webdatacommons.org/structureddata/2014-12/stats/stats.html

we see that the adoption of the Microdata markup syntax has again increased (1.1 million websites in 2015 compared to 819 thousand in 2014, where both crawls cover a comparable number of websites), whereas the deployment of RDFa and Microformats is more or less stable.

As already observed in the previous year, the schema.org vocabulary, recommended by Google, Microsoft, Yahoo!, and Yandex, is the one most frequently used by webmasters in the context of Microdata. We observe a decreasing deployment of its predecessor, the data vocabulary. In the context of RDFa, we still find the Open Graph Protocol, recommended by Facebook, to be the most widely used vocabulary.

Topic-wise, the trends identified in the previous extractions continue. We see that, besides navigational, blog and CMS-related meta-information, many websites annotate e-commerce-related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress).

For the first time, we have also extracted information marked up using embedded JSON-LD. Over 99% of all webmasters using this syntax use it to mark up search boxes on their webpages (http://schema.org/SearchAction). Only a small fraction of these websites also use embedded JSON-LD to annotate other information, e.g. about organizations (92 thousand websites) or persons (18 thousand websites).

Download 

The overall size of the November 2015 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 24.4 billion RDF quads. For download, we split the data into 3,961 files with a total size of 404 GB. 

http://webdatacommons.org/structureddata/2015-11/stats/how_to_get_the_data.html
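As a rough illustration of how the downloaded files might be processed, the following Python sketch reads quads line by line. It assumes that the files are gzip-compressed N-Quads with one quad per line; the helper names parse_quad and iter_quads are hypothetical, and the regular expression is a simplified tokenizer, not a full N-Quads parser.

```python
import gzip
import re

# Simplified tokenizer for N-Quads terms (an assumption for illustration,
# not a complete implementation of the N-Quads grammar).
TERM = re.compile(
    r'<[^>]*>'                                            # IRI
    r'|_:\S+'                                             # blank node
    r'|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'  # literal
)


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line
    is empty, a comment, or cannot be tokenized into 3 or 4 terms."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    terms = TERM.findall(stripped)
    if len(terms) == 3:
        # Plain triple without a graph label.
        return (terms[0], terms[1], terms[2], None)
    if len(terms) == 4:
        return tuple(terms)
    return None


def iter_quads(path):
    """Stream quads from one gzip-compressed file (assumed format)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            quad = parse_quad(line)
            if quad is not None:
                yield quad
```

Streaming the file this way keeps memory use constant, which matters given the overall corpus size. For a single file, sum(1 for _ in iter_quads(path)) would count its quads, where path is whichever file was downloaded.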

In addition, we have created separate files for over 50 different schema.org classes. Each file includes all quads from those pages that deploy the specific class at least once.

http://webdatacommons.org/structureddata/2015-11/stats/schema_org_subsets.html 
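The selection rule described above (keep every quad of a page as soon as the page uses the class at least once) can be sketched as follows. This is only an illustration of the idea, not the code used to build the subsets. It assumes that each quad is a (subject, predicate, object, page) tuple whose fourth element identifies the page the quad was extracted from, and that class membership is expressed via rdf:type. The function name class_subset is hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_subset(quads, class_iri):
    """Return all quads of every page that uses class_iri at least once.

    quads: iterable of (subject, predicate, object, page) tuples, where
    the fourth element is assumed to identify the source page.
    """
    quads_by_page = defaultdict(list)
    pages_with_class = set()
    for subject, predicate, obj, page in quads:
        quads_by_page[page].append((subject, predicate, obj, page))
        if predicate == RDF_TYPE and obj == class_iri:
            pages_with_class.add(page)
    # Keep the complete set of quads for each qualifying page,
    # not only the quads that mention the class itself.
    return [
        quad
        for page, page_quads in quads_by_page.items()
        if page in pages_with_class
        for quad in page_quads
    ]
```

This version holds all quads in memory and is therefore only suitable for small samples. At the scale of the full corpus, a two-pass or per-file approach would be needed.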

Many thanks to

+ the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project. 
+ the Any23 project for providing their great library of structured data parsers. 
+ the Amazon Web Services in Education Grant for supporting WebDataCommons.


Have fun with the new data set. 

Robert Meusel and Christian Bizer