From messy data to linked data: LOD-enabled Google Refine

Some of you are probably already familiar with Google Refine (GR), a simple yet very powerful tool for working with messy data. GR is a web application built on a modular web application framework, which makes it an interesting blend of versatility, performance, portability, simplicity and extensibility. GR has no database: all data lives in an in-memory data store, which is optimized for the operations used by faceted browsing, one of the most useful features when dealing with messy data. Another is its support for reconciliation services. However, GR's data cleansing, reconciliation abilities and user interface are currently mostly limited to working with subsets of Freebase entities. Wouldn't it be nice to be able to use DBpedia, too?

Now it is possible: we made Google Refine LOD-friendly with GR extensions provided by DERI and Zemanta. DERI's RDF extension takes care of reconciliation against any SPARQL endpoint or an RDF dump, and it also does the trick when data needs to be exported as RDF. Zemanta's DBpedia extension adds two more LOD-related features: the first is augmentation of reconciled data with additional data from DBpedia, and the second is extraction of entities from full text, by entity type, into new columns using the Zemanta API.
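To give a feel for what reconciliation against a SPARQL endpoint involves, here is a minimal sketch in Python. It is not the RDF extension's actual code, just an illustration of the idea: take a cell value, optionally restrict it to an entity type, and ask the endpoint for entities with a matching label. The function names, the exact-label matching strategy and the `format` request parameter (understood by DBpedia's public endpoint, but not part of the SPARQL protocol itself) are assumptions made for this example.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint; swap in any other endpoint you want to try.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_label_query(label, type_uri=None, lang="en", limit=5):
    """Build a SPARQL query that finds entities whose rdfs:label equals `label`.

    Hypothetical helper for illustration only: a real reconciliation service
    would use fuzzier matching and score the candidates.
    """
    # Escape backslashes and double quotes so the label is a valid SPARQL literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    # Optionally restrict candidates to one entity type.
    type_clause = f"  ?entity a <{type_uri}> .\n" if type_uri else ""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?entity WHERE {\n"
        f'  ?entity rdfs:label "{escaped}"@{lang} .\n'
        f"{type_clause}"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL that sends `query` to the endpoint and asks for JSON."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


query = build_label_query("Galway", type_uri="http://dbpedia.org/ontology/City")
url = build_request_url(query)
print(query)
print(url)
```

Fetching `url` with any HTTP client should return the candidate entities as JSON. The sketch stops short of the network call so that it runs offline.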

LOD-enabled Google Refine (LODGrefine) is also available as a package with the extensions mentioned above already integrated. The source code of the extensions and the package is available on GitHub under the BSD Licence: RDF extension, DBpedia extension and LODGrefine.

Stay tuned: we have more plans for LODGrefine. First, LODGrefine will be integrated into the LOD2 Stack, then we'll explore the possibility of integrating a crowd-sourcing solution like Amazon Mechanical Turk… but more about this when the time comes.

And some good news to finish: LODGrefine will be presented at the SemTech 2012 conference in San Francisco in June. Hope to see you there.
