We are happy to announce WebDataCommons.org, a joint effort of the PlanetData and LOD2 projects to extract all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public.
WebDataCommons.org provides the extracted data for download in the form of RDF quads. In addition, we produce basic statistics about the extracted data.
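To give a feel for the download format, here is a minimal sketch of reading one line of RDF quads in Python. It assumes the files follow the W3C N-Quads line syntax (subject, predicate, object, graph, terminated by a period). The sample line and the function name `parse_quad` are made up for illustration and are not part of the project's tooling.

```python
import re

# One RDF term: an IRI in angle brackets, a blank node, or a quoted literal
# with an optional language tag or datatype IRI.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(r'^\s*' + r'\s+'.join([TERM] * 4) + r'\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


# Hypothetical sample line, not taken from the actual dataset.
sample = ('_:item1 <http://schema.org/name> "Example \\"Widget\\""@en '
          '<http://example.com/products/1> .')
print(parse_quad(sample))
```

For anything beyond a quick look at the data, a full RDF library is the safer choice, since this regular expression does not validate IRIs or escape sequences.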
So far, the project has extracted data from two Common Crawl web corpora: one corpus consisting of 2.5 billion HTML pages dating from 2009/2010 and a second corpus consisting of 1.4 billion HTML pages dating from February 2012.
The 2009/2010 extraction resulted in 5.1 billion RDF quads which describe 1.5 billion entities and originate from 19.1 million websites.
The February 2012 extraction resulted in 3.2 billion RDF quads which describe 1.2 billion entities and originate from 65.4 million websites.
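The figures above can be put into proportion with a few divisions. The following sketch uses only the numbers stated in this announcement; the dictionary layout and the function name `ratios` are illustrative choices. Note that the per-page average is taken over all crawled pages, including those without any structured data.

```python
# Figures as stated in the announcement.
extractions = {
    "2009/2010": {"pages": 2.5e9, "quads": 5.1e9, "entities": 1.5e9, "websites": 19.1e6},
    "February 2012": {"pages": 1.4e9, "quads": 3.2e9, "entities": 1.2e9, "websites": 65.4e6},
}


def ratios(stats):
    """Derive per-entity, per-page and per-website averages from the raw counts."""
    return {
        "quads_per_entity": stats["quads"] / stats["entities"],
        "quads_per_page": stats["quads"] / stats["pages"],
        "entities_per_website": stats["entities"] / stats["websites"],
    }


for name, stats in extractions.items():
    summary = ", ".join(f"{key}={value:.2f}" for key, value in ratios(stats).items())
    print(f"{name}: {summary}")
```

This works out to roughly 3.4 quads per entity for the 2009/2010 extraction and roughly 2.7 for the February 2012 extraction.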
More detailed statistics about the distribution of formats, entities and websites serving structured data, as well as the growth between 2009/2010 and 2012, are provided on the project website, WebDataCommons.org.
It is interesting to see from the statistics that RDFa and Microdata deployment has grown a lot over the last years, but that Microformat data still makes up the majority of the structured data that is embedded into HTML pages (when looking at the number of quads as well as the number of websites).
We hope that Web Data Commons will be useful to the community by:
- easing access to Microdata, Microformat and RDFa data, as you no longer need to crawl the Web yourself in order to get access to a fair portion of the structured data that is currently available on the Web.
- laying the foundation for the more detailed analysis of the deployment of the different technologies.
- providing seed URLs for focused Web crawls that dig deeper into the websites that offer a specific type of data.
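As an illustration of the last point, seed URLs for a focused crawl could be collected roughly as follows. This is a sketch under two assumptions that the announcement does not spell out: that the data is in N-Quads line syntax, and that the fourth element of each quad is the URL of the page the data was extracted from. The function name `seed_urls`, the sample lines and the schema.org type are hypothetical.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def seed_urls(lines, type_iri):
    """Collect page URLs whose quads declare an entity of the given type."""
    wanted = f"<{type_iri}>"
    urls = set()
    for line in lines:
        parts = line.split()
        # A type statement contains no literals, so it splits cleanly into
        # subject, predicate, object, graph and the closing period.
        if len(parts) == 5 and parts[1] == RDF_TYPE and parts[2] == wanted and parts[4] == ".":
            urls.add(parts[3].strip("<>"))
    return sorted(urls)


# Hypothetical sample lines, not taken from the actual dataset.
sample_lines = [
    f"_:a {RDF_TYPE} <http://schema.org/Product> <http://shop.example/item/1> .",
    '_:a <http://schema.org/name> "Widget" <http://shop.example/item/1> .',
    f"_:b {RDF_TYPE} <http://schema.org/Person> <http://blog.example/about> .",
]
print(seed_urls(sample_lines, "http://schema.org/Product"))
```

The resulting list contains each matching page once and can be handed to a crawler as a starting point.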
Web Data Commons is a joint effort of Christian Bizer and Hannes Mühleisen (Web-based Systems Group at Freie Universität Berlin) and Andreas Harth and Steffen Stadtmüller (Institute AIFB at the Karlsruhe Institute of Technology).
Lots of thanks to:
- the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project.
- the Any23 project for providing their great library of structured data parsers.
- the PlanetData and LOD2 EU research projects for supporting the development of the extraction framework as well as the running of the extraction on EC2.
In the future, we plan to update the extracted datasets on a regular basis as new Common Crawl corpora become available. We also plan to provide the extracted data in the form of CSV tables for common entity types (e.g. product, organization, location, …) in order to make it easier to mine the data.
Christian Bizer, Hannes Mühleisen, Andreas Harth and Steffen Stadtmüller