Scaling curation of public sector data: The Big Clean, November 3rd 2012, Prague

Public sector bodies record and store facts about themselves, their actions and the settings in which they operate. Unfortunately, the way these institutions collect such vast amounts of data often does not result in big data: it results in a big, unusable mess instead. The methods established in the public sector for maintaining information assets simply do not scale to the requirements of big data, and manual data curation cannot cope with the sheer volume of public sector data.

As the volume of data in the public sector grows, this issue is becoming more prominent and is getting more attention both from the public and from public bodies themselves. As numerous open data initiatives worldwide demonstrate, one way to approach this problem is to set up a roadmap for extracting raw data from documents held within the public sector, opening the data up and, ultimately, turning it into linked (open) data.

We need to recognize that big data needs big cleaning. The Big Clean conference is here to address that, focusing on three key topics:

  • Screen-scraping — structuring documents into data, inferring semantic descriptions from layout
  • Data refining — making raw data usable, improving data quality
  • Data-driven journalism — telling the stories hidden in data, getting the big picture from big data

The Big Clean’s programme reflects the choice of core topics, with talks on subjects ranging from Google’s BigQuery to the use of EU data in journalism and the missing steps between screen-scraping and data analysis. Tomáš Knap from the LOD2 project will give a presentation describing the ODCleanStore framework for cleaning and linking data, which is scheduled to be integrated into the LOD2 Stack.

PRACTICAL INFORMATION

The Big Clean is jointly organized by the National Technical Library in Prague, the Open Knowledge Foundation and the LOD2 project.
