Weblog

There is (no) money in Linked Data?

A few days ago Prateek Jain, Pascal Hitzler, Krzysztof Janowicz, and Chitra Venkatramani from Knoesis published a short writeup, “There is no money in Linked Data”, and started a corresponding discussion on the W3C Semantic Web mailing list. In this post I want to summarize the short discussion I had on this topic with Pascal Hitzler, since my impression is that we often overlook the historical analogy between the development of open-source software a few years ago and that of open data now.

Pascal and colleagues argue that “… using these (Linked Open) datasets in realistic settings is not always easy. Surprisingly, in many cases the underlying issues are not technical but legal barriers erected by the LD data publishers.” Pascal concluded in the mailing list thread:

Pascal: The notion of (Linked Open Data) simply is rather unclear. “Linked Open Data must have an open licence” is – in the light of the analysis in the paper – almost meaningless, as “openness” of licences is not a boolean. There are many shades to it, and most of these shades do not allow readily for commercialization.

I somewhat disagree with that statement: there is (and should be) a clear (boolean) definition of what open means: the Open Definition.

The Open Definition precisely defines the requirements a license must meet in order to be called open. Allowing remixing and republishing, availability of data in bulk, and non-discriminatory licensing that allows commercial reuse are core requirements of the Open Definition. Open Data is not fundamentally different from other “open” domains, e.g. open-source software, and for open-source software there is likewise a clear definition (overseen by the OSI), which is by now widely enforced. I’m a big fan of both — Linked Data as a data integration paradigm within and between organizations AND Linked Open Data as a way to share data and knowledge openly on the Web. With the Open Definition we have a clear way to distinguish between the two.
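
To make the boolean nature of the definition concrete, here is a minimal sketch in Python of such a check. The license list is an illustrative subset only; the authoritative register of conformant licenses is maintained at opendefinition.org.

```python
# Illustrative subset of Open Definition-conformant licence URIs;
# the authoritative list lives at opendefinition.org.
OPEN_CONFORMANT = {
    "http://creativecommons.org/publicdomain/zero/1.0/",  # CC0
    "http://creativecommons.org/licenses/by/3.0/",        # CC BY
    "http://creativecommons.org/licenses/by-sa/3.0/",     # CC BY-SA
    "http://opendatacommons.org/licenses/odbl/",          # ODbL
}

def is_open(licence_uri: str) -> bool:
    """Boolean test: a dataset is open iff its licence conforms."""
    return licence_uri in OPEN_CONFORMANT

# An NC clause forbids commercial reuse, so the licence is not open:
print(is_open("http://creativecommons.org/licenses/by-nc/3.0/"))  # False
```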

Pascal: Attribution or share-alike can already be showstoppers, and for some context can render LOD/LD non-reusable – in which case the term “open” appears to be rather misleading.

Continue reading

Virtuoso 7 Released …

The quest of OpenLink Software is to bring flexibility, efficiency, and expressive power to people working with data. For the past several years, this has been focused on making graph data models viable for the enterprise. Flexibility in schema evolution is a central aspect of this, as is the ability to share identifiers across different information systems, i.e., giving things URIs instead of synthetic keys that are not interpretable outside of a particular application.

With Virtuoso 7, we dramatically improve the efficiency of all this. With databases in the billions of relations (also known as triples, or 3-tuples), we can fit about 3x as many relations in the same space (disk and RAM) as with Virtuoso 6. Single-threaded query speed is up to 3x better, plus there is intra-query parallelization even in single-server configurations. Graph data workloads are all about random lookups. With these, having data in RAM is all-important. With 3x space efficiency, you can run with 3x more data in the same space before starting to go to disk. In some benchmarks, this can yield a 20x gain.

The Virtuoso scale-out support has also been fundamentally reworked, with much more parallelism and better deployment flexibility.

So, for graph data, Virtuoso 7 is a major step in the coming of age of the technology. Data keeps growing and time is getting scarcer, so we need more flexibility and more performance at the same time.

So, let’s talk about how we accomplish this. Column stores have been the trend in relational data warehousing for over a decade. With column stores comes vectored execution, i.e., running any operation on a large number of values at one time. Instead of running one operation on one value, then the next operation on the result, and so forth, you run the first operation on thousands or hundreds-of-thousands of values, then the next one on the results of this, and so on.
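
As an illustration of the difference (not Virtuoso's actual engine, just the execution pattern described above), compare a row-at-a-time pipeline with a vectored one; the two operators are arbitrary placeholders:

```python
import numpy as np

def scalar_pipeline(values):
    """One value runs through the whole operator chain at a time."""
    out = []
    for v in values:
        v = v * 2          # operator 1, applied to a single value
        v = v + 1          # operator 2, applied to a single value
        out.append(v)
    return out

def vectored_pipeline(values):
    """Each operator runs over the whole batch before the next starts."""
    batch = np.asarray(values)
    batch = batch * 2      # operator 1, applied to the whole vector
    batch = batch + 1      # operator 2, applied to the whole vector
    return batch.tolist()

values = list(range(100_000))
assert scalar_pipeline(values) == vectored_pipeline(values)
```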

Column-wise storage brings space efficiency, since values in one column of a table tend to be alike — whether repeating, sorted, within a specific range, or picked from a particular set of possible values. With graph data, where there are no columns as such, the situation is exactly the same — just substitute the word predicate for column. Space efficiency brings speed — first by keeping more of the data in memory; secondly by having less data travel between CPU and memory. Vectoring makes sure that data that are closely located get accessed in close temporal proximity, hence improving cache utilization. When there is no locality, there are a lot of operations pending at the same time, as things always get done on a set of values instead of on a single value. This is the crux of the science of columns and vectoring.
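
A toy example of why per-column (or, for graph data, per-predicate) storage compresses well: when the values of one column are stored together and sorted, long runs of identical values collapse under run-length encoding. This is only a sketch of the general idea, not Virtuoso's actual compression scheme:

```python
from itertools import groupby

# The "predicate" column of a sorted triple table: long runs of
# identical values, typical for column-wise storage.
column = ["foaf:knows"] * 5 + ["foaf:name"] * 3 + ["rdf:type"] * 4

def rle_encode(col):
    """Collapse each run of identical values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(col)]

def rle_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]

encoded = rle_encode(column)
print(encoded)  # [('foaf:knows', 5), ('foaf:name', 3), ('rdf:type', 4)]
assert rle_decode(encoded) == column
```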

Continue reading

Free Webinar: CubeViz – a faceted browser for statistical data


In this webinar Michael Martin presents CubeViz, a faceted browser for statistical data utilizing the RDF Data Cube vocabulary, which is the state of the art in representing statistical data in RDF. This vocabulary is compatible with SDMX and is increasingly being adopted. Based on the vocabulary and the encoded Data Cube, CubeViz generates a faceted browsing widget that can be used to interactively filter the observations to be visualized in charts. Based on the selected structure, CubeViz offers suitable chart types and options from which users can choose.
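
To give an idea of the data CubeViz operates on, here is a minimal sketch using Python and rdflib; the ex: dimension properties and the numbers are made up for illustration. A facet selection essentially corresponds to a dimension filter over the cube's observations:

```python
from rdflib import Graph

# A tiny, hypothetical Data Cube: two observations in one dataset.
data = """
@prefix qb:     <http://purl.org/linked-data/cube#> .
@prefix sdmx-m: <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix ex:     <http://example.org/> .

ex:obs1 a qb:Observation ; qb:dataSet ex:cube ;
    ex:refArea ex:Austria ; ex:refPeriod "2012" ; sdmx-m:obsValue 8.4 .
ex:obs2 a qb:Observation ; qb:dataSet ex:cube ;
    ex:refArea ex:Germany ; ex:refPeriod "2012" ; sdmx-m:obsValue 81.9 .
"""

g = Graph()
g.parse(data=data, format="turtle")

# A facet selection ("refArea = Austria") boils down to a filtered
# query over the observations, whose values then feed the chart.
results = g.query("""
    PREFIX qb:     <http://purl.org/linked-data/cube#>
    PREFIX sdmx-m: <http://purl.org/linked-data/sdmx/2009/measure#>
    PREFIX ex:     <http://example.org/>
    SELECT ?obs ?value WHERE {
        ?obs a qb:Observation ;
             ex:refArea ex:Austria ;
             sdmx-m:obsValue ?value .
    }
""")
for obs, value in results:
    print(obs, value)
```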

If you are interested in Linked (Open) Data principles and mechanisms, LOD tools & services, and concrete use cases that can be realised using LOD, then join us in the free LOD2 webinar series!

When: Tue, May 28, 2013 4:00 PM – 5:00 PM CEST
Presenter: Michael Martin, University of Leipzig, AKSW
Information & free Registration: https://www4.gotomeeting.com/register/687215863

The LOD2 team is looking forward to meeting you at the webinar!


The LOD2 webinar series is powered by LOD2 – Creating Knowledge out of Interlinked Data (http://lod2.eu), organised & produced by Semantic Web Company (http://www.semantic-web.at), Austria. The series provides a monthly webinar about Linked (Open) Data tools and services around the LOD2 project, the LOD2 Stack and the Linked Open Data Life Cycle, including 3rd-party tools. Please find continuously updated information here: http://lod2.eu/BlogPost/webinar-series


Big Data RDF Store Benchmarking Experiences

Recently we were able to present new BSBM results, testing the RDF triple stores Jena TDB, BigData, BIGOWLIM and Virtuoso on various data sizes. These results extend the state of the art in various dimensions:

  • scale: this is the first time that RDF store benchmark results at such a large scale have been published. The previously published BSBM results were on 200M triples; the 150B-triple experiments thus mark a 750x increase in scale.
  • workload: this is the first time that results on the Business Intelligence (BI) workload are published. In contrast to the Explore workload, which features short-running “transactional” queries, the BI workload consists of queries that go through possibly billions of triples, grouping and aggregating them (using the respective functionality, new in SPARQL 1.1; see the query sketch after this list).
  • architecture: this is the first time that RDF store technology with cluster functionality has been publicly benchmarked.
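
For a flavour of what the BI workload relies on, here is a sketch of a SPARQL 1.1 grouping/aggregation query in the BSBM vocabulary, wrapped in Python via SPARQLWrapper. It is not one of the actual benchmark queries, just the query shape, and the endpoint URL is an assumption (Virtuoso's default local address):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Not an actual BSBM BI query, just the SPARQL 1.1 GROUP BY and
# aggregation shape that the BI workload exercises at scale.
BI_STYLE_QUERY = """
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
SELECT ?producer (COUNT(?product) AS ?numProducts) (AVG(?price) AS ?avgPrice)
WHERE {
    ?product bsbm:producer ?producer .
    ?offer   bsbm:product  ?product ;
             bsbm:price    ?price .
}
GROUP BY ?producer
HAVING (COUNT(?product) > 10)
ORDER BY DESC(?avgPrice)
"""

# Assumed endpoint; point this at your own store.
sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery(BI_STYLE_QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["producer"]["value"], row["numProducts"]["value"])
```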

These results come more than 18 months after I released the call for participation in large-scale benchmarking of the latest v3.1 of the Berlin SPARQL Benchmark on the LISA cluster.

The LISA cluster of the Dutch national supercomputing center SARA: powerful but hard to use for cluster database benchmarks.

While the new results are nothing short of spectacular, in this blog I will try to answer the question: what took us so long?

Many hurdles stood in the way of running these experiments successfully; they can be divided into two kinds: hardware-related and software-related hurdles.

The hardware-related hurdles lie in the strictly time-shared nature of the LISA cluster, which is quite typical for centralized scientific computing environments used for e-science. To access such machinery, one has to create a fully scripted job that is put into a queue, and this job subsequently gets executed after a while (minutes, hours, or days). The final results all have to be saved somewhere (typically in data files), and of course one generally also collects all kinds of logs recording performance data and error messages. After the job has run, the scientist then needs to inspect these files to find out what happened and whether the job produced the intended effects, or errors.

Further, the LISA cluster has various classes of machines, but only the most modern type of node had the significant hard disk resources needed for our large-scale experiments. These nodes tended to be highly popular, and therefore only sparsely available. It would typically take 2-3 days before our job found a slot to run in the cluster. One may imagine that this non-interactive access is quite cumbersome when developing set-ups for multiple cluster database systems and tuning them. It is all too easy to make a small mistake, only to find out about the syntax error three days later, after the job ran, leading to a subsequent retry. Of course, such job scripts can be tested on a single machine beforehand, but testing the behavior of a cluster database system on a single machine is not the same as testing it on the real hardware. Further, performance tuning, by experimenting with configuration file settings for the various products, reacting to the observed performance differences, and homing in on ‘good’ parameter settings with every next try, is also very hard to do non-interactively.

For these reasons our efforts on the LISA cluster, which started during winter 2011 and continued over spring 2012, did not make enough progress, and the project slowly ground to a halt. This changed when CWI brought its new SCILENS cluster into production and we started using it in fall 2012.

Continue reading

Open Data Conference in Seoul

Sören visited LOD2 partner KAIST and gave a lecture on Linking Enterprise Data in KAIST’s distinguished lecture series. On the second day of the short trip to Korea, we participated in the Open Data Conference of the National Information Society Agency (NIA). NIA seems to be implementing a comprehensive Open Data strategy (also involving Linked Data); South Korea looks quite advanced in this regard already. In addition to Sören’s talk about Linked Open Government Data, there was also a talk by Haklae Kim about the Korean Open Knowledge Foundation chapter. Some industry representatives (Samsung, LG) in the audience were interested in applying Linked Data in enterprise environments.

You can find pictures from the event on Facebook.

LOD2 Stack usability survey started

In recent years the LOD2 Stack has established itself as a collection of applications developed in the context of the LOD2 project, presented as a unified environment. These applications are referred to as components, although they can also be installed independently. However, having all these components in a single environment eases access from one application to another and improves the user experience.


As the LOD2 Stack is now available in its second version, questions of usability and end-user experience have come more into the focus of the ongoing development. The LOD2 consortium has therefore set up a survey asking users of the LOD2 Stack (or the online Demonstrator) for feedback regarding their experiences with the LOD2 Stack and the separate components. The outcome will be used to fine-tune development and improve the user experience in each phase of the Linked Data life cycle.


The survey is open from April 15 to June 30 and will only demand 15 minutes of your time.

LOD2 Webinar 30.04.2013: Virtuoso Column Store


This webinar in the LOD2 webinar series will present Virtuoso 7: “Virtuoso Column Store, Adaptive Techniques for RDF Graph Databases”. In this webinar we shall discuss the application of column-store techniques to both graph (RDF) and relational data for mixed workloads ranging from lookup to analytics.

Virtuoso is an innovative enterprise-grade multi-model data server for agile enterprises & individuals. It delivers an unrivaled platform-agnostic solution for data management, access, and integration. The unique hybrid server architecture of Virtuoso enables it to offer traditionally distinct server functionality within a single product.


If you are interested in Linked (Open) Data principles and mechanisms, LOD tools & services, and concrete use cases that can be realised using LOD, then join us in the free LOD2 webinar series!

When: Tue, Apr 30, 2013 4:00 PM – 5:00 PM CEST
Presenter: OpenLink Software
Information & free Registration: https://www4.gotomeeting.com/register/506548519

The LOD2 team is looking forward to meeting you at the webinar!


The LOD2 webinar series is powered by LOD2 – Creating Knowledge out of Interlinked Data (http://lod2.eu), organised & produced by Semantic Web Company (http://www.semantic-web.at), Austria. The series provides a monthly webinar about Linked (Open) Data tools and services around the LOD2 project, the LOD2 Stack and the Linked Open Data Life Cycle, including 3rd-party tools. Please find continuously updated information here: http://lod2.eu/BlogPost/webinar-series

LOD2 Webinar: DBpedia Spotlight

DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight recognizes when names of concepts or entities are mentioned (e.g. “Michael Jordan”) and subsequently matches these names to unique identifiers (e.g. dbpedia:Michael_I._Jordan, the machine-learning professor, or dbpedia:Michael_Jordan, the basketball player). Besides common entity classes such as People, Locations and Organisations, DBpedia Spotlight also spots concepts from any of the 320 classes in the DBpedia Ontology.

DBpedia Spotlight is employed in the Extraction stage of the LOD Life Cycle, performing Entity Recognition and Linking. Although the tool currently specializes in the English language, support for other languages is being tested, and demos for German, Dutch and other languages are available or underway. The tool can be used to enable faceted browsing and semantic search, among other applications. In this webinar we will describe what DBpedia Spotlight is, how it works, and how you can benefit from it in your application.
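
A minimal sketch of calling a Spotlight annotation service from Python: the endpoint URL, the confidence value, and the exact response keys are assumptions that may differ per deployment and version, so check your installation's documentation.

```python
import requests

# Assumed endpoint; check your Spotlight installation or the public
# demo service for the actual URL.
SPOTLIGHT_URL = "http://spotlight.dbpedia.org/rest/annotate"

def annotate(text, confidence=0.4):
    """Return (surface form, DBpedia URI) pairs found in `text`."""
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    # Each resource carries the matched surface form and its DBpedia URI.
    return [(r["@surfaceForm"], r["@URI"])
            for r in response.json().get("Resources", [])]

print(annotate("Michael Jordan works on machine learning at Berkeley."))
```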

To live in the Cloud


The EU-funded lighthouse project LOD2 is entering its critical phase. Many core developments will be more or less finalized in 2013. Therefore, the plenary meeting at the end of March, which this time took place in Amsterdam, was full of intense and partly controversial discussions between the partners. It became obvious that everybody wants to achieve the best results in the project. But what is “best” is of course also a matter of preference and interpretation!

The location Amsterdam was a good fit for this plenary meeting in the current project phase!


First, when you look at the town map from a bird’s-eye view, with all its concentric town canals, it resembles the LOD cloud with all the different resources linked to each other. Of course, in Amsterdam the “resources” are mainly small artificial islands connected by around 1,500 bridges and more than 100 km of canals! But still, only the interplay of all these elements constitutes the town of Amsterdam!


Second, when we took our boat trip one evening, we passed the replica of a huge sailing boat, which was used for global trade in the 18th century. The Dutch were already at that time known to be open-minded and pragmatic, which are actually very useful characteristics for international projects as well.

Finally, our dinner took place in the restaurant of the “Tropenmuseum”, which covers centuries of history as well as cultures from all over the globe in its exhibitions. This was very well reflected in the buffet, where food from eight different countries on several continents was served. And yet another thing became apparent: everyone found something they enjoyed a lot in the end, and it is not a good idea to mix everything on one single plate!

So, apart from core project work and the socializing part, there was as usual a (Linked) Open Data meetup, organized by OKFN and the Waag Society, with an impressive number of 65 participants.

On Tuesday morning, Wolters Kluwer Corporate gave a presentation on the potential of semantic web technologies and standards for supporting the strategy of a global media company like Wolters Kluwer. It was also stressed that the requirements expressed in the media & publishing use case in LOD2 are in line with many activities that Wolters Kluwer is currently executing.

So we faced intense times with a perfect host, CWI, and I was lucky that on my third boat trip in Amsterdam the sun was shining for the first time!

All Commission’s data – Interview with Daniele Rizzi about the EC Open Data portal

Photo of Daniele Rizzi

The Open Data Hub of the European Union, which was launched at the end of 2012, sparked a considerable amount of interest among open data enthusiasts and the media alike. As a result of the European Commission’s open data strategy, the portal boasts almost 6,000 datasets, some of which are available as linked data through a SPARQL endpoint. The LOD2 project’s Jindřich Mynarz had a chance to ask Daniele Rizzi, an EC representative, about the current status and plans for the EC’s Open Data portal.
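
As an aside for readers who want to try the endpoint themselves, here is a minimal sketch in Python with SPARQLWrapper. The endpoint URL and the DCAT-based query shape are assumptions to be checked against the portal's own documentation:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL as assumed at the time of writing; see the portal's
# linked data page for the current address.
sparql = SPARQLWrapper("http://open-data.europa.eu/sparql")
sparql.setQuery("""
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX dct:  <http://purl.org/dc/terms/>
    SELECT ?dataset ?title WHERE {
        ?dataset a dcat:Dataset ;
                 dct:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```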

Jindřich Mynarz: Open data was explicitly mentioned in the original call for tenders for the EC Open Data portal from July 2011. What drove you to ask specifically for “open data”?

Daniele Rizzi: The central aim of the EU 2020 strategy is to put Europe’s economies onto a high and sustainable growth path. To this end, Europe should use its resources in the most positive way. Public sector information is an important source of economic growth through the development of innovative value-added products and services. By making public sector information available on transparent, effective and non-discriminatory terms, governments including the European institutions can help boost economic growth by up to several tens of billions of € per year.

In 2011 the Commission decided to put its own policy into practice and launch an Open Data portal where all Commission documents should be made available for reuse for commercial or non-commercial purposes, without charge and without the need to make an individual application.

This is also in line with the initiatives of most EU Member States, who have themselves developed or planned national Open Data portals to give easier access to their own public sector information.

Jindřich Mynarz: How do you plan to determine if the EC Open Data portal was a success or not? What measurable impacts would you consider to tell whether it succeeded?

Daniele Rizzi: Our initial measure of success is of course the number of datasets (currently around 6000), the number of publishers (currently 14 Commission DGs and the European Environment Agency), the number of users and downloaded datasets in the Open Data portal and how these increase over time.

Our objective is to progressively make available through the portal all the Commission’s data and possibly data of other EU institutions. On the basis of a pragmatic approach, datasets from Commission departments, other EU institutions and agencies will be constantly added to the initial content available when the portal opened in December 2012.

The impact on the information market, in terms of quantified re-use of datasets and number of applications building on those datasets, will only be possible to assess at a later stage, through dedicated studies and user feedback.

Jindřich Mynarz: Is the EC Open Data portal used inside the European Commission and other EU institutions to simplify data exchange? Do you plan to incorporate it in the EC’s internal workflows?

Daniele Rizzi: EC departments and EU institutions and agencies are not always aware of the information already available outside their own specific domain of activity. The EC Open Data portal, as a single entry point for discovery and access of the information generated within the EU institutions, will definitely contribute to better sharing and exploitation of information within the institutions themselves. Specific procedures should also be put in place in order to guarantee that new or updated information is immediately made available through the portal as well. This is already happening, e.g., for the Eurostat statistical tables, which are updated on the portal twice a day to be fully aligned with the data on the Eurostat web site.

Jindřich Mynarz: Linked data is prominently listed on the landing page of the portal. Why do you see it as a key ingredient of the portal?

Daniele Rizzi: The portal end-user interface allows easy query and access to its content in an interactive way. It easily serves users wishing to discover data, search and download datasets, get in touch with the publishing authorities, or use pre-defined applications, such as viewers, web interfaces, etc.

In order to allow the development of third party applications, however, it is also necessary to facilitate the exploitation of the portal content as a platform for data and information integration in addition to the interactive document search. For this purpose, linked data is currently the most pragmatic and efficient solution.

Jindřich Mynarz: As of now, the EC Open Data portal is in beta. Do you intend to leave it in “permanent beta” and work on fast and continuous improvement? Could you tell us what the next steps for the portal are?

Daniele Rizzi: At the moment the portal is labelled as being in “beta” version because its deployment phase will still take some time, and not all the foreseen features, both in terms of portal capabilities and content completeness, are available yet. Our goal is to open the portal as soon as possible and progressively improve and enrich it, rather than wait for “perfection”. Our approach is indeed to work on fast and continuous improvements, based on the implementation of an initial set of predefined features and on the feedback of users.

The next steps will be the complete redesign of the user interface, making it consistent with the interinstitutional graphic chart for web sites, and the implementation of a fully multilingual interface, both planned to become available in the coming weeks. Later this year a version 1.0 of the portal will include new functionalities and will rely on a new internal architecture which, while invisible to the user, will guarantee better performance and support an increased number of concurrent users.


Short bio: Daniele Rizzi has a degree in civil engineering from the Politecnico di Milano University. He has spent most of his professional life working on the development of information and communication systems and tools, in particular in the domain of spatial information, in both the private and the public sector. Daniele joined the European Commission in 1993, where from 2004 to 2012 he worked on the adoption and implementation of a European spatial data infrastructure (INSPIRE). Since December 2012 he has dealt with the Commission’s Open Data policies in DG Communications Networks, Content and Technology, working in particular on the deployment of Open Data portals.

LOD2 Plenary taking place in March 2013, including a LOD side event in Amsterdam

In about one week from today the next LOD2 project plenary (the first of two meetings in 2013) will take place, hosted by CWI in Amsterdam, the Netherlands. LOD2 plenaries are very special events, as a lot of Linked Open Data people from all 15 LOD2 partner organisations meet in one place to exchange their knowledge and expertise on the topic of LOD; the planning of the next project steps also takes place in these two-day meetings, along the work package structure of the LOD2 project.

Furthermore, additional team members from regional LOD2 partners can participate and get in touch with the topic of Linked Open Data in more detail; sometimes guests from the region where the meeting is located are also invited to participate. This time people from CWI as well as from Wolters Kluwer Netherlands will take the chance to get in touch with the LOD2 team…

The already established LOD2 tradition of getting in touch with the local and regional Open Data and Linked Open Data community in the form of a LOD2 plenary side event will continue at the Amsterdam plenary: together with OKFN Netherlands and the Waag Society, the LOD2 dissemination team organised the Linked Open Data MeetUp Amsterdam, taking place on Sunday, 24.3.2013, at the Waag Society Amsterdam, starting at 18:30 CET!

See all details on this event: http://www.meetup.com/OpenKnowledgeFoundation/Amsterdam-NL/884522/

As space is limited to 60-65 people for this event, please register asap if you are located in the Amsterdam area and interested in participating, and join the discussions and presentations on Linked Open Data with us!

Looking forward to meeting you there!
The LOD2 Team