Weblog

European Data Forum 2014 in Athens was a great success

About three weeks ago, on 19-20 March 2014, the European Data Forum 2014 (EDF2014) took place in Athens, Greece. LOD2 was again one of the main organisers of this third edition of the free community event to foster the European Data Economy.

More than 500 industry leaders, researchers, and policy makers met to discuss the challenges and opportunities of data-driven innovation. With expert speakers in two parallel tracks, an executive panel on Big Data, several networking sessions, a poster session, a two-day exhibition area, the awarding of the European Data Innovator Award 2014, and five specialised co-located events, EDF2014 was the largest assembly of data experts ever held in the EU.

Neelie Kroes, Vice-President of the European Commission in charge of the Digital Agenda, opened EDF2014 and extended a warm welcome to all participants: “It is an inspiring event today, it is important for Greece, it is important for Europe”. She pointed out that we are in the Big Data era, which means more data and more ways to collect, manage, manipulate and use it, making an impact and a difference in people’s lives: “Data to empower people”.

The Minister of Administrative Reform and eGovernance of Greece, Kyriakos Mitsotakis, addressed EDF2014 and reinforced the Greek Government’s focus on core data economy areas as a means to promote transparency, government efficiency and growth: “The open data revolution is a trend which my country cannot simply afford to ignore”.

Continue reading

New release of the WebDataCommons RDFa, Microdata and Microformat data sets

We are happy to announce a new release of the WebDataCommons RDFa, Microdata, and Microformat data sets.

The data sets have been extracted from the November 2013 version of the Common Crawl covering 2.24 billion HTML pages which originate from 12.8 million websites (pay-level-domains).

Altogether we discovered structured data within 585 million HTML pages out of the 2.24 billion pages contained in the crawl (26%). These pages originate from 1.7 million different pay-level-domains out of the 12.8 million pay-level-domains covered by the crawl (13%).

Approximately 471 thousand of these websites use RDFa, while 463 thousand websites use Microdata. Microformats are used on 1 million websites within the crawl.

Background

More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages, using markup formats such as RDFa, Microdata and Microformats.

The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format.
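To make the extraction step concrete, here is a minimal sketch of what pulling the three formats out of a single HTML page can look like in Python, using the extruct library. This is purely illustrative: the Web Data Commons pipeline itself runs the Any23 parsers over the Common Crawl corpus rather than crawling pages one by one.

```python
# Illustrative only: extract RDFa, Microdata, and Microformats from one page.
# The WebDataCommons extraction itself is based on Any23, not on this script.
import extruct
import requests

def extract_structured_data(url):
    html = requests.get(url, timeout=30).text
    # Pull the three markup syntaxes discussed above out of the page.
    return extruct.extract(
        html,
        base_url=url,
        syntaxes=["rdfa", "microdata", "microformat"],
    )

data = extract_structured_data("http://www.example.com/")
for syntax, items in data.items():
    print(syntax, len(items), "items")
```

The result is one list of extracted items per markup syntax, which roughly corresponds to the per-format statistics the project publishes.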

General information about the WebDataCommons project is found at
http://webdatacommons.org/

Data Set Statistics

Basic statistics about the November 2013 RDFa, Microdata, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:
http://webdatacommons.org/structureddata/2013-11/stats/stats.html

Comparing these statistics with those for the August 2012 release of the data sets
http://webdatacommons.org/structureddata/2012-08/stats/stats.html
we see that the adoption of the Microdata markup syntax has strongly increased (463 thousand websites in 2013 compared to 140 thousand in 2012, even though the 2013 version of the Common Crawl covers significantly fewer websites than the 2012 version).

Looking at the adoption of different vocabularies, we see that webmasters mostly follow the recommendation by Google, Microsoft, Yahoo, and Yandex to use the schema.org vocabularies as well as their predecessors in the context of Microdata. In the context of RDFa, the most widely used vocabulary is the Open Graph Protocol recommended by Facebook.

Looking at the most frequently used classes, we see that besides navigational, blog and CMS-related meta-information, many websites mark up e-commerce-related data (products, offers, and reviews) as well as contact information (LocalBusiness, Organization, PostalAddress).

Download

The overall size of the November 2013 RDFa, Microdata, and Microformat data sets is 17.2 billion RDF quads. For download, we split the data into 3,398 files with a total size of 332 GB.
http://webdatacommons.org/structureddata/2013-11/stats/how_to_get_the_data.html

Lots of thanks to
+ the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data parsers.
+ Amazon Web Services for supporting WebDataCommons.

Christian Bizer, Petar Petrovski and Robert Meusel, University of Mannheim

LOD2 Plenary and Open Data Meet-up in Mannheim

A LOD2 plenary meeting, hosted by Chris Bizer at the University of Mannheim, took place this week.

The plenary meeting was preceded by a Linked Open Data Meetup with talks from Springer, fluid Operations, and several LOD2 partners (Universität Leipzig, University of Mannheim, the Semantic Web Company, and Wolters Kluwer Deutschland GmbH (WKD)).

Below is a summary write-up of the events by Orri Erling from OpenLink Software.

Group photo at the LOD2 Plenary in Mannheim

Wolters Kluwer Deutschland GmbH (WKD) gave a presentation on the content production pipeline of their legal publications and their experiences in incorporating LOD2 technologies for content enrichment. This is a very successful LOD2 use case and demonstrates the value of linked data for the information industry.

Springer gave a talk about their interest in linked data for enriching the Lecture Notes in Computer Science product. Conference proceedings could also be enhanced with structured metadata in RDF. I asked about nanopublications. The comment was that content authors might perceive nanopublications as an extra imposition. On the other hand, in the life sciences field there is a lot of enthusiasm for the idea. We will see; anyway, biology will likely lead the way for nanopublications. I referred Aliaksandr Birukou of Springer to the companies Euretos and its parent S&T in Delft, Netherlands, and to Barend Mons, scientific director of NBIC, the Netherlands Bioinformatics Centre. These are among the founding fathers of the Nano Republic, as they themselves put it.

Sebastian Hellmann gave a talk on efforts to set up the DBpedia Foundation as a not-for-profit organization, hopefully in the next 10 days, to aid in the sustainability and growth of the DBpedia project. The Foundation would identify stakeholders, their interests, and ways to generate income to further improve DBpedia. Planned areas of improvement include the development of high-availability value-added DBpedia services with quality of service (QoS) agreements for enterprise users; additional tools in the DBpedia stack to support improved and cost-efficient data curation and internationalization; and improved documentation, tutorials, and support to speed uptake.

I had a word with Peter Haase of fluid Operations about the Optique project and their cloud management offerings. The claim is to do ontology-directed querying over thousands of terabytes of heterogeneous data. This turns out to be a full-force attempt at large-scale SQL federation with ontology-directed query rewriting covering OWL 2 QL semantics. With Ian Horrocks of Oxford leading the ontology side, the matter is in good hands. Still, the matter is not without its problems. Simple lookups can be directed to the data, but if there are terabytes of it, it is more likely that aggregations are what is desired. Federated aggregation tends to move a lot of data. So the problems are as they ever were. However, if the analytics are already done and stored in the relational space, finding these based on ontologies is a worthwhile thing for streamlining end-user access to information.

The LOD2 plenary itself was structured in the usual way, covering the work packages in two parallel tracks.

On the database side, the final victory will be won by going to an adaptive schema for RDF. We brought the RDF penalty against relational down to a factor of 2.5 for common analytics-style queries, e.g., the Star Schema Benchmark. This is a comparison against Virtuoso SQL, which offers very high performance in this workload, over 2x the speed of column-store pioneer MonetDB and 300x MySQL. So this is where matters stand. To move them significantly forward, exploitation of structure for guiding physical storage will be needed. Also, the project still has to deliver the 500 Gtriple results. The experiments around Christmas at CWI support the possibility, but they are not final. Putting triples into tables when the triples in fact form table-shaped structures, which is the case most of the time, may turn out to be necessary for this. At least, it will be a significant help.

Be that as it may, using a table schema for regularly shaped data, while preserving the RDF quad flexibility, would essentially abolish the RDF tax and bring the LOD2 project to a glorious conclusion in August.
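As a rough illustration of the idea (and emphatically not of Virtuoso's actual storage engine), the following Python sketch groups triples by their "characteristic set" of properties: subjects that share the same property set are stored as rows of a table, while irregular data stays in the generic quad representation. The data and thresholds are made up for the example.

```python
# Toy sketch of adaptive schema for RDF: table-shaped triples become table rows,
# the irregular remainder stays as generic triples/quads.
from collections import defaultdict

triples = [
    ("prod1", "rdf:type", "schema:Product"),
    ("prod1", "schema:name", "Widget"),
    ("prod1", "schema:price", "9.99"),
    ("prod2", "rdf:type", "schema:Product"),
    ("prod2", "schema:name", "Gadget"),
    ("prod2", "schema:price", "19.99"),
    ("misc1", "rdfs:comment", "an irregular, quad-shaped leftover"),
]

# Collect each subject's properties.
by_subject = defaultdict(dict)
for s, p, o in triples:
    by_subject[s][p] = o

# "Characteristic set" = the set of properties a subject has.
by_charset = defaultdict(list)
for s, props in by_subject.items():
    by_charset[tuple(sorted(props))].append((s, props))

tables, leftover_quads = {}, []
for charset, subjects in by_charset.items():
    if len(subjects) >= 2:          # a common shape: materialize it as a table
        tables[charset] = [{"subject": s, **props} for s, props in subjects]
    else:                           # irregular: keep the generic representation
        for s, props in subjects:
            leftover_quads += [(s, p, o) for p, o in props.items()]

for columns, rows in tables.items():
    print("table with columns", columns)
    for row in rows:
        print("  ", row)
print("kept as quads:", leftover_quads)
```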

I took the poetic license to compare the data’s journey into RDF and back to the Egyptian myth of Osiris: the data gets shut in a silo, then cut into 14 pieces and thrown into the Nile (i.e., the LOD cloud, or the CKAN catalog). Grief-stricken Isis sees what has become of her love: she patiently reassembles the pieces, reconstructing Osiris in fact so well that he sires her a child, hawk-headed Horus, who proceeds to reclaim his father’s honor. (See, Isis means Intelligent Structured Information Storage.)

I had many interesting conversations with Chris Bizer about his research in data integration, working with the 150M HTML tables in the Common Crawl. The idea is to resolve references and combine data from the tables. Interestingly enough, the data model in these situations is basically triples, yet these are generally not stored as RDF but in Lucene. This makes sense due to the string-matching nature of the task. There appears to be an opportunity in bringing together the state of the art in databases, meaning the very highly optimized column store and vectored execution in Virtuoso, with the search-style workload found in instance matching and other data integration tasks. The promise goes in the direction of very fast ETL and subsequent discovery of structural commonalities and enrichment possibilities. This is also not infinitely far from the schema discovery that one may do in order to adaptively optimize storage based on the data.

Volha Bryl gave a very good overview of the Mannheim work in the data integration domain. For example, learning data fusion rules from examples of successful conflict resolution seems very promising. Learning text extraction rules from examples is also interesting. The problem of data integration is that the tasks are very heterogeneous, and therefore data integration suites have very large numbers of distinct tools. This is labor-intensive, but there is progress in automation. An error-free, or near enough, data product remains a case-by-case effort involving human curation, but automatic methods seem, based on Volha’s and Chris’s presentations, to be in the ballpark for statistics.

Giovanni Tummarello of Insight/SindiceTech, always the life of the party, presented his Solr-based relational faceted browser. The idea is to show and drill down by facets over a set of related tables; in the demo, these were investments, investment targets, and investors. You can look at the data from any of these points and restrict the search based on attributes of any of them. Well, this is what a database does, right? That is so, but the Sindice tool sits on top of Solr and actually materializes the joins into a document, as sketched below. This blows up the data but keeps everything co-located, so it can even run from disk. We also talked about the Knowledge Graph package Sindice offers on the Google cloud, this time a Virtuoso application.
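A toy sketch of that materialization step, with made-up field names and data rather than anything from the SindiceTech product: each investment is flattened into a single document that carries the joined investor and target attributes, so Solr can facet on any of them without performing a join at query time.

```python
# Illustration of "materialize the join into one document": instead of joining
# at query time, each investment becomes a flat, self-contained Solr document.
# All names and values are invented for the example.

investors = {1: {"name": "Acme Ventures", "country": "IE"}}
targets = {7: {"name": "ExampleGraph Ltd", "sector": "Databases"}}
investments = [{"id": 42, "investor_id": 1, "target_id": 7, "amount_eur": 2_000_000}]

def denormalize(inv):
    investor = investors[inv["investor_id"]]
    target = targets[inv["target_id"]]
    return {
        "id": inv["id"],
        "amount_eur": inv["amount_eur"],
        # Redundant copies of the joined rows, but every facet
        # (investor country, target sector, ...) is now local to one document.
        "investor_name": investor["name"],
        "investor_country": investor["country"],
        "target_name": target["name"],
        "target_sector": target["sector"],
    }

docs = [denormalize(inv) for inv in investments]  # ready to index into Solr
print(docs)
```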

We hope that negotiations between SindiceTech and Insight (formerly DERI) around open-sourcing the SPARQL editor and other items come to a successful conclusion. The SPARQL editor especially would be of general interest to the RDF community. It is noteworthy that there is no SPARQL query builder in common use out there (even OpenLink‘s own open source iSPARQL has been largely (but not entirely!) overlooked and misunderstood, though it’s been available as part of the OpenLink Ajax Toolkit for several years). OK, a query builder is useful when there is a schema. But if the schema is an SQL one, as will be the case if RDF is adaptively stored, then any SQL query builder can be applied to the regular portion of the data. Forty years of calendar time and millennia of person-years have gone into making SQL front ends, and these will become applicable overnight; Virtuoso does speak SQL, as you may know.

I ran the breakout session about the database work in LOD2. What will be done is clear enough, the execution side is very good, and our coverage of the infinite space of query optimization continues to grow. One more revolution for storage may come about, as suggested above. There is not very much to discuss, just to execute. So I used the time to explain how you run

 

-- Total extended price of line items whose part name contains 'green'
SELECT  SUM ( l_extendedprice )
  FROM  lineitem
     ,  part
 WHERE  l_partkey = p_partkey
   AND  p_name LIKE '%green%'

 

Simple query, right? Sure, but application guys or sem-heads generally have no clue about how these in fact need to be done. I have the probably foolish belief that a little understanding of databases, especially in the RDF space, which does get hit by every query optimization problem, would be helpful. At least one would know what goes wrong. So I explained to Giovanni, who is in fact a good geek, that this is a hash join, and with only a little prompting he suggested that you should also put a Bloom filter in front of the hash. Good. So in the bar after dinner I was told I ought to teach. Maybe. But the students would have to be very fast and motivated. Anyway, the take-home message is that the DBMS must figure it out. In the SQL space this is easier, and of course, if most of RDF reduces to this, then RDF too will be more predictable in this department.
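For illustration only, here is a small Python sketch of the plan we discussed: build a hash table over the parts that survive the LIKE filter, put a Bloom filter in front of it, and probe with the line items. This shows the shape of the plan, not Virtuoso's implementation; the table layouts are assumptions for the example.

```python
# Hash join with a Bloom filter on the probe side, as discussed above.
from hashlib import sha1

class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""
    def __init__(self, size_bits=1 << 20, hashes=3):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            yield int(sha1(f"{i}:{key}".encode()).hexdigest(), 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

def sum_green_extended_price(part_rows, lineitem_rows):
    # Build side: hash only the parts that survive the LIKE '%green%' filter.
    build, bloom = {}, BloomFilter()
    for p_partkey, p_name in part_rows:
        if "green" in p_name:
            build[p_partkey] = p_name
            bloom.add(p_partkey)
    # Probe side: the cheap Bloom check eliminates most non-matching line items
    # before the more expensive hash-table lookup.
    total = 0.0
    for l_partkey, l_extendedprice in lineitem_rows:
        if bloom.might_contain(l_partkey) and l_partkey in build:
            total += l_extendedprice
    return total

parts = [(1, "forest green metallic"), (2, "midnight blue")]
lineitems = [(1, 100.0), (2, 250.0), (1, 40.0)]
print(sum_green_extended_price(parts, lineitems))  # -> 140.0
```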

I talked with Martin Kaltenböck of the Semantic Web Company about his brilliant networking accomplishments around organizing the European Data Forum and other activities. Martin is a great ambassador and lobbyist for linked data across Europe. Great work, also in generating visibility for LOD2.

The EU in general, thanks in great part to Stefano Bertolo’s long-term push in this direction, is putting increasing emphasis on measuring progress in the research it funds. This is one of the messages from the LOD2 review as well. Databases are the domain of the performance race par excellence; the matters on that side are well attended to by LDBC and, of course, the unimpeachably authoritative TPC, among others. In other domains, measurement is harder, as it involves a human-curated ground truth for any extraction, linking, or other integration. There is good work in both Mannheim and Leipzig in these areas, and I may at some point take a closer look, but for now it is appropriate to stick to core database work.


LOD2 at Mannheim Linked Open Data Meetup

Co-located with the last LOD2 project plenary in Mannheim, Germany, February 24-25, 2014, the Mannheim Linked Open Data Meetup will take place on Sunday, February 23. The meetup is organized and hosted by the Data and Web Science research group of the University of Mannheim.

The meetup is meant to bring together researchers and practitioners interested in Linked Open Data from Southern Germany, and is half-way between a workshop and an informal gathering, with drinks and snacks provided by the organizers.

The following people will give talks at the meetup:

  1. Sören Auer (University of Bonn), Overview of the LOD2 project
  2. Katja Eck (Wolters Kluwer), LOD in the publishing industry
  3. Martin Kaltenböck (Semantic Web Company, Vienna), Enterprise Semantics – Information Management with PoolParty
  4. Mirjam Kessler, Aliaksandr Birukou (Springer), Linked Data Initiatives at Springer Verlag
  5. Peter Haase (Fluid Operations, Heidelberg), Linked Data Applications with the Information Workbench
  6. Sebastian Hellmann (University of Leipzig), Dutch DBpedia for Searching and Visualising Dutch Library Data

The meetup will take place at the University of Mannheim in the building B6 26, Room A101 on Sunday evening, February 23, starting at 18:30.

Up-to-date information is also available at

http://www.meetup.com/OpenKnowledgeFoundation/Mannheim-DE/1092882/

Looking forward to meeting you there!

The LOD2 Team

Preview release of conTEXT for Linked-Data based text analytics

We are happy to announce the preview release of conTEXT — a platform for lightweight text analytics using Linked Data (soon to be part of the LOD2 Stack).
conTEXT enables social Web users to semantically analyze text corpora (such as blogs, RSS/Atom feeds, Facebook, G+, Twitter or SlideWiki.org decks) and provides novel ways for browsing and visualizing the results.

conTEXT workflow

The process of text analytics in conTEXT starts by collecting information from the web. conTEXT utilizes standard information access methods and protocols such as RSS/Atom feeds, SPARQL endpoints and REST APIs, as well as customized crawlers for WordPress and Blogger, to build a corpus of information relevant for a certain user.

The assembled text corpus is then processed by Natural Language Processing (NLP) services (currently FOX and DBpedia Spotlight), which link unstructured information sources to the Linked Open Data cloud through DBpedia. The processed corpus is then further enriched by de-referencing the DBpedia URIs as well as by matching with pre-defined natural-language patterns for DBpedia predicates (BOA patterns). The processed data can also be joined with other existing corpora in a text analytics mashup. The creation of analytics mashups requires dealing with the heterogeneity of different corpora as well as the heterogeneity of the different NLP services utilized for annotation; conTEXT employs NIF (NLP Interchange Format) to deal with this heterogeneity.

The processed, enriched and possibly mixed results are presented to users through different views for exploration and visualization of the data. Additionally, conTEXT provides an annotation refinement user interface based on the RDFa Content Editor (RDFaCE) to enable users to revise the annotated results. User-refined annotations are sent back to the NLP services as feedback for the purpose of learning in the system.
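As a small illustration of the annotation step, the sketch below sends a text to the public DBpedia Spotlight endpoint and returns the surface forms together with the DBpedia URIs they were linked to. The endpoint URL and confidence threshold are example values only, and conTEXT of course wires Spotlight (together with FOX) into its own pipeline rather than calling it like this.

```python
# Minimal, illustrative call to DBpedia Spotlight's annotate service.
import requests

def annotate(text, confidence=0.4):
    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    # Each resource links a surface form in the text to a DBpedia URI.
    return [
        (r["@surfaceForm"], r["@URI"])
        for r in response.json().get("Resources", [])
    ]

print(annotate("The European Data Forum 2014 took place in Athens, Greece."))
```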

For more information on conTEXT visit:

Publicdata.eu becoming more pan-European

As part of LOD2 we are developing the portal publicdata.eu. This is an open data portal which aggregates open data from other government portals throughout Europe. At the start of the project, most portals that were included in publicdata.eu were scraped from other sites, or were based on portals set up by members of the open data community, often volunteers. Over the years we have seen more and more portals being published by municipalities, regions and EU member states. This has substantially increased the number of datasets that are available on the portal.

About a year ago, only about 17,000 datasets were available on publicdata.eu. Today, more than 30,000 datasets can be found, many of them published by national governments. For example, we have included the German national open data portal and, more recently, the Romanian open data portal.

On 27 Nov 2013, the Open Knowledge Foundation attended a meeting of the public sector information sub-group on the pan-European open data portal. This was an excellent opportunity to show the work that has been going on to build publicdata.eu, as well as to talk with representatives of more than 18 EU member states who are working on open data portals at the national level. One of the exciting developments is the re-launch of the French portal, which is built on CKAN, software developed by the Open Knowledge Foundation and also used to run publicdata.eu. Because CKAN is open source and AGPL-licensed, any adaptations of CKAN made by the French team will be released as open source as well.
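One practical benefit of this shared code base is that every CKAN-based portal exposes the same Action API. As an illustration (the dataset figures above were not produced this way), the snippet below asks a CKAN instance how many datasets it currently holds:

```python
# Query a CKAN portal's dataset count via the standard Action API.
import requests

def dataset_count(portal_url):
    # package_search with rows=0 returns only the total number of datasets.
    response = requests.get(
        f"{portal_url}/api/3/action/package_search",
        params={"rows": 0},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["result"]["count"]

print(dataset_count("http://publicdata.eu"))
```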

Apart from the presentation of publicdata.eu, the meeting also highlighted some other developments from EU member states. For example, Ross Jones of the data.gov.uk team demonstrated the work his team has done around further developing the portal. He remarked that the focus will be more on data quality in the near future, a topic that is still relevant across many open data portals. Additionally, Jean-Charles Quertinmont of Belgium shared his experiences: in setting up two separate open data portals, interest in using the datasets has been higher among developers in civil society than among businesses. Finally, the Open Data Support project presented their work in harvesting datasets from portals and turning those into RDF. The publicdata.eu team and the Open Data Support team are looking into working together to ensure that the data harvested by the Open Data Support team is also available via publicdata.eu. Other work in the final year of publicdata.eu as part of LOD2 focuses mainly on support for multilingualism, support for the DCAT Application Profile and better categorisation of the data.

High visibility of LOD2 in the Publishing Industry

I frequently attend conferences and symposia, mainly within the publishing and media industry. My talks always touch on LOD2 and the impact it has on our progress from being a traditional publisher to becoming an information service provider. The positive feedback from the audience, especially on this point, is significant, and I suppose it is mainly based on the following factors:

The Semantic Web in general consists of three major parts, and all of them actually touch major pain points in the industry: the need for standards (addressed by W3C standards like RDF and SKOS), the need for information (addressed by the growing LOD cloud), and technology to get a grip on both (addressed to a remarkable degree by tools developed within LOD2). More precisely, the Linked Data Lifecycle is all about generating large amounts of information of good quality and making this information accessible to users. In fact, it supports the same process that publishing houses have been carrying out for centuries. That is the main reason why the toolset fits the needs of this particular industry so well. In addition, the industry in Europe is very fragmented, with many small and medium-sized companies. They depend on open source technology to be able to realize any technological progress, which is also absolutely in line with the LOD2 stack approach. And last but not least, they see that some of the tools are already in operational use within their industry, e.g. at Wolters Kluwer Germany.

So the feedback is positive and the interest is there. The main task for the consortium now is to make sure that the majority of the toolset finally evolves into sustainable industry applications.

 

Christian Dirschl (Wolters Kluwer)

 

Linked Data: The Future of Statistical Data

On December 12, the IMP team (Valentina Janev, Uroš Milošević and Vuk Mijović), together with the Statistical Office of the Republic of Serbia (Branko Jireček, Assistant Director, and Jelena Milojković, Head of the Internet technologies department) and the Statistical Society of the Republic of Serbia, organised an event on Linked Data and the future of statistical data. The event was an opportunity to present and discuss the latest results of the LOD2 consortium in the field of managing statistical Linked Data. Continue reading

Linked data in libraries: global standards and local practices – interview with Esther Guggenheim and Ido Ivri

Ido Ivri and Esther Guggenheim

Esther Guggenheim and Ido Ivri work for the National Library of Israel (NLI). During 2013 they collaborated with LOD2 members on a joint PUBLINK project that focused on converting the NLI’s extension of the Library of Congress Subject Headings to RDF. At the end of the year, as this effort was coming to a close, the LOD2 project’s Jindřich Mynarz had a chance to ask Esther and Ido about their experiences with the project and to discuss library linked open data in general.
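To give a flavour of what such a conversion produces (this is an illustrative sketch with invented URIs and labels, not the NLI's actual data or tooling), a single subject heading expressed in RDF with SKOS might look like the following, here built with Python's rdflib:

```python
# Illustrative only: one invented subject heading modelled as a SKOS concept.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDF, SKOS

g = Graph()
heading = URIRef("http://example.org/nli/subject-headings/example-heading")

g.add((heading, RDF.type, SKOS.Concept))
g.add((heading, SKOS.prefLabel, Literal("Example heading", lang="en")))
g.add((heading, SKOS.prefLabel, Literal("כותרת לדוגמה", lang="he")))
# Link back to the Library of Congress heading it extends (URI is made up).
g.add((heading, SKOS.broadMatch, URIRef("http://id.loc.gov/authorities/subjects/shEXAMPLE")))

print(g.serialize(format="turtle"))
```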

Continue reading

Cryptocurrencies, read-write web and all things linked data with Melvin Carvalho

Melvin Carvalho

During the last LOD2 project plenary meeting in Prague, the project’s Jindřich Mynarz had a chance to talk with Melvin Carvalho, who joined the meeting as an invited expert. Melvin was able to learn about the LOD2 project’s activities and provided the project members with feedback on the basis of his extensive experience with linked data, the read-write web, and online financial transactions. We delve into some of these topics in depth in the following interview.

Continue reading