We are happy to announce a new release of the WebDataCommons RDFa, Microdata, and Microformat data sets.
The data sets have been extracted from the November 2013 version of the Common Crawl covering 2.24 billion HTML pages which originate from 12.8 million websites (pay-level-domains).
Altogether we discovered structured data within 585 million HTML pages out of the 2.24 billion pages contained in the crawl (26%). These pages originate from 1.7 million different pay-level-domains out of the 12.8 million pay-level-domains covered by the crawl (13%).
Approximately 471 thousand of these websites use RDFa, while 463 thousand websites use Microdata. Microformats are used on 1 million websites within the crawl.
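The shares quoted above can be re-derived from the raw counts. The following is a minimal sketch that uses only the figures stated in this announcement; the variable and function names are illustrative and not part of any WebDataCommons tooling.

```python
# Figures quoted in this announcement (November 2013 Common Crawl).
TOTAL_PAGES = 2.24e9      # HTML pages in the crawl
PAGES_WITH_DATA = 585e6   # pages containing structured data
TOTAL_PLDS = 12.8e6       # pay-level-domains in the crawl
PLDS_WITH_DATA = 1.7e6    # pay-level-domains containing structured data

# Websites (pay-level-domains) per markup format, as quoted above.
PLDS_BY_FORMAT = {"RDFa": 471e3, "Microdata": 463e3, "Microformats": 1.0e6}


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


page_share = share(PAGES_WITH_DATA, TOTAL_PAGES)
pld_share = share(PLDS_WITH_DATA, TOTAL_PLDS)
format_shares = {name: share(count, TOTAL_PLDS)
                 for name, count in PLDS_BY_FORMAT.items()}

print(f"pages with structured data: {page_share:.1f}%")
print(f"websites with structured data: {pld_share:.1f}%")
for name, pct in format_shares.items():
    print(f"{name}: {pct:.1f}% of all websites in the crawl")
```

Note that the per-format counts add up to more than 1.7 million, which suggests that some websites use more than one markup format.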
More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats.
The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format.
General information about the WebDataCommons project is found at
Data Set Statistics
Basic statistics about the November 2013 RDFa, Microdata, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:
Comparing the statistics to the statistics about the August 2012 release of the data sets
we see that the adoption of the Microdata markup syntax has strongly increased (463 thousand websites in 2013 compared to 140 thousand in 2012, even though the 2013 version of the Common Crawl covers significantly fewer websites than the 2012 version).
Looking at the adoption of different vocabularies, we see that webmasters mostly follow the recommendation by Google, Microsoft, Yahoo, and Yandex to use the schema.org vocabularies as well as their predecessors in the context of Microdata. In the context of RDFa, the most widely used vocabulary is the Open Graph Protocol recommended by Facebook.
Looking at the most frequently used classes, we see that besides navigational, blog, and CMS-related meta-information, many websites mark up e-commerce-related data (products, offers, and reviews) as well as contact information (LocalBusiness, Organization, PostalAddress).
The overall size of the November 2013 RDFa, Microdata, and Microformat data sets is 17.2 billion RDF quads. For download, we split the data into 3,398 files with a total size of 332 GB.
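To get a feeling for what to expect when downloading, the totals above can be turned into rough per-file averages. This is only a back-of-the-envelope sketch: it assumes the files are of roughly equal size and that 1 GB means 10^9 bytes, neither of which is stated here, and the names are illustrative.

```python
# Totals quoted in this announcement.
TOTAL_QUADS = 17.2e9   # RDF quads across all three data sets
NUM_FILES = 3398       # number of download files
TOTAL_GB = 332         # total download size in GB

# Assumption: 1 GB = 10**9 bytes and 1 MB = 10**6 bytes (decimal units).
BYTES_PER_GB = 1e9
BYTES_PER_MB = 1e6

total_bytes = TOTAL_GB * BYTES_PER_GB

# Averages only; individual files may differ considerably.
quads_per_file = TOTAL_QUADS / NUM_FILES
mb_per_file = total_bytes / NUM_FILES / BYTES_PER_MB
bytes_per_quad = total_bytes / TOTAL_QUADS

print(f"average quads per file: {quads_per_file:,.0f}")
print(f"average file size: {mb_per_file:.1f} MB")
print(f"average on-disk bytes per quad: {bytes_per_quad:.1f}")
```

The low number of bytes per quad hints that the 332 GB refers to compressed files, but this announcement does not say so explicitly.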
Many thanks to
+ the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data parsers.
+ Amazon Web Services for supporting WebDataCommons.
Christian Bizer, Petar Petrovski and Robert Meusel, University of Mannheim