“I like tables.” – Interview with Richard Cyganiak

Richard Cyganiak

On February 15, 2013, the LOD2 project’s Jindřich Mynarz sat down with Richard Cyganiak to talk about linked open data. Richard is a prominent voice in the domain of linked open data as well as an active semantic web software developer. He’s a linked data technologist and a member of the Linked Data Research Centre at DERI.

Jindřich Mynarz: You’ve been very active in W3C standardization processes, working on semantic web standards, such as in RDF Working Group or Government Linked Data Working Group. What results of the standardization processes you’ve been involved with are you most proud of?

Richard Cyganiak: I started getting involved with W3C standardization three to four years ago. Actually, standardization is a really slow process and many of the things I’ve been involved with are still not finished. There’s still some work left before there’s something to be really proud of.

One thing that I spent much time, thought and energy on, and that has been finished, is the R2RML specification, which defines a mapping language to map relational data from SQL databases to RDF. That was very hard work but we’ve managed to get it to recommendation last year and implementations are progressing very nicely, including my own, which I hope to finally release in the next couple of weeks.

Another piece of work that takes a lot of energy is the RDF Working Group, where I’m editing the RDF Concepts specification, the document that defines the data model for RDF and is the basis for many other specifications in the RDF world, including SPARQL and all RDF syntax specifications. This is RDF 1.1, an update of the existing specifications from 2004. The changes to the underlying technology in this revision will actually be small, but the amount of discussion that went into this is staggering. RDF is used for so many different things and there are so many different angles to it that it’s really difficult to find a common base for communicating among the members of the working group, because they range from database people and web developers to philosophers and mathematicians. Nevertheless, being a member of such a group is certainly a fascinating experience.

Jindřich Mynarz: Linked data constitutes an area you may have been involved with even longer than with W3C. As a proponent of linked data, can you think of an example showing the competitive advantage of linked data over other approaches for data management? They say that there’s a right tool for the right job. What is the right job for linked data?

Richard Cyganiak: In my mind it’s data integration. In situations where you deal with very heterogeneous data and where there is a benefit to coupling the datasets very loosely. For example, when you do it over the Web, where you don’t have central control over all the data sources. That’s the area where I can see it as being most useful.

Jindřich Mynarz: Looking at the ever-growing LOD Cloud diagram, which you help to make, it seems that plenty of data producers grasp these benefits of linked data. In fact, the number of linked datasets has grown to the point that creating the LOD Cloud may become unwieldy. What would need to happen in order for you to stop drawing it?

Richard Cyganiak: The last update of the LOD Cloud was in September 2011. We haven’t done an update in well over a year, which is unprecedented, and so according to a certain interpretation you could say that we’ve stopped doing the diagram for the moment. In the last couple of days I’ve been dusting off some things for the LOD Cloud here and fixing some things there, so I’m still committed to producing this diagram.

Before I answer the question, let me say what would need to happen for us to finish and release the next version of the LOD Cloud. The work on creating the diagram is mostly done by Anja Jentzsch from Hasso Plattner Institut and myself. We both have to find the time to do it and we really need to improve our tooling. Already the last version brought us pretty close to the limit of what can be done with manual work when producing the diagram. We’ve now done some work on creating it automatically, so that the graph layout we did manually for the last version is done automatically. We have running code for that but it doesn’t yet work completely. We used it to produce some smaller sub-clouds for specific topics, and that worked quite well.

The reason why we do the LOD Cloud is to show that there is lots of data out there that is published according to a common set of standards. If there was some better way of showing that, that might be a reason to stop doing the diagram. I can well imagine the possibilities here. There might be other ways of structuring the diagram. For example, you could start not with datasets but with vocabularies and show how the vocabularies hang together and how the datasets share the vocabularies. It’s not just the instance links, which we focus on in this diagram, but also the shared vocabularies, which are a really important part of the commonality among linked datasets. I would be quite interested in showing that in a better way.

Jindřich Mynarz: Unlike visualizing common data structures like tables or trees, graph visualizations, such as the LOD Cloud, are notoriously difficult to do well. What do you consider as the kind of data structure that is easiest for people to think of, to visualize and to input through user interfaces?

Richard Cyganiak: I like tables. Tables are easy to enter and interact with and there are many ways of visualizing their contents. A lot of our computing infrastructure revolves around tables. On the other hand, one reason why we do things with tables has somewhat gone away. That’s performance in storage, retrieval and processing. Now, just in terms of performance we wouldn’t need to stick everything into a strict schema. However, there is a reason why so much data is being managed in a tabular format. It’s just a really convenient form. Coming back to an earlier question, it’s when you try to integrate quite heterogeneous data from multiple sources that tables are really inconvenient. For everything else, I think, they are a good way of organizing data.

Jindřich Mynarz: Do you think there are some other data structures or data shapes that might replace tables for some of the purposes for which they are used today?

Richard Cyganiak: Broadly speaking, the data structures we deal with fall into three categories: tables, graphs and trees. Everything pretty much fits into one of these categories and I can’t think of many examples that wouldn’t fit quite naturally in these. These are the existing data structures, in which we find most data. Most software, algorithms or visualization methods expect data in one of these shapes. However, it is usually possible to convert between them, one way or the other. For example, the main thing SPARQL can be used for is extracting tables from graphs with the normal SELECT query, but it can also turn graphs into graphs with CONSTRUCT.

One of the things I’ve recently been playing with is to use SPARQL to turn tables into different tables or graphs. That’s the Tarql project. We are thinking about how we get trees into the picture there as well in the form of JSON and XML.
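The conversions between tables and graphs mentioned in this answer can be illustrated with a minimal sketch. It is plain Python, not SPARQL or Tarql, and everything in it is an assumption made for illustration: the sample rows, the function names `table_to_graph`, `select` and `construct`, and the simplification that each subject has at most one value per predicate.

```python
# Hypothetical toy table: each row is a dict keyed by column name.
rows = [
    {"id": "alice", "name": "Alice", "city": "Galway"},
    {"id": "bob", "name": "Bob", "city": "Prague"},
]


def table_to_graph(rows, key):
    """Table to graph: emit one (subject, predicate, object) triple per cell.

    The column named by `key` supplies the subject; every other column
    name becomes a predicate.
    """
    triples = set()
    for row in rows:
        subject = row[key]
        for column, value in row.items():
            if column != key:
                triples.add((subject, column, value))
    return triples


def select(graph, predicates):
    """Graph to table, loosely in the spirit of a SELECT query.

    Returns one tuple per subject that has a value for every requested
    predicate. Assumes at most one value per subject and predicate.
    """
    by_subject = {}
    for s, p, o in graph:
        by_subject.setdefault(s, {})[p] = o
    table = []
    for s in sorted(by_subject):
        props = by_subject[s]
        if all(p in props for p in predicates):
            table.append((s,) + tuple(props[p] for p in predicates))
    return table


def construct(graph, old_predicate, new_predicate):
    """Graph to graph, loosely in the spirit of a CONSTRUCT query.

    Keeps only triples that use `old_predicate` and re-emits them
    with `new_predicate`.
    """
    return {(s, new_predicate, o) for s, p, o in graph if p == old_predicate}


graph = table_to_graph(rows, "id")
print(select(graph, ["name", "city"]))
print(construct(graph, "city", "basedIn"))
```

The point of the sketch is only that the same data moves between shapes without loss: the rows become triples, and a pattern over the triples yields either a new table or a new graph.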

Jindřich Mynarz: Tarql is only one piece in your extensive developer’s portfolio. As a software maintainer, you’ve learnt how to package semantic web tools. Do you think packaging software as Debian packages, which is used for the LOD2 Stack, is a good way to go?

Richard Cyganiak: To be honest I don’t have much of an opinion on Debian packages because I’ve been living in the Mac world for probably ten years now. On the server I tend to use much Java software and the Java ecosystem has a couple of its own solutions to packaging issues. I’m not often installing Debian packages myself.

Jindřich Mynarz: How do you like semantic web software to be packaged yourself? What’s your personal preference?

Richard Cyganiak: As long as I can install it with a few commands and clicks I’m fine with it. Debian packages do that, but there are many other approaches that work for me as well, such as unzipping something somewhere. The truth is of course that a lot of semantic web software is research software. I’ve been working with lots of semantic web software over the years and my expectations regarding packaging of software are probably pretty low. As long as it somehow works I’ll take it in whatever form and I’ll also build it from source, or whatever is necessary. I think though that packaging it in better ways, integrating it better and providing it in a form where tools that can work together are already set up, so that you don’t need to do the duct-taping to connect them yourself, is definitely something valuable.

Jindřich Mynarz: LOD2 project demonstrates the application of linked data in three principal domains: media and publishing, enterprise and government. Do you think there are some domains for which linked open data is a natural fit and some domains where there are still great opportunities in applying it?

Richard Cyganiak: I’ll evade the question a little bit by saying that the technology is quite domain-neutral and can work in almost any domain. I think the decision whether to use linked data or not has more to do with the kinds of data management problems that you’re facing, and that’s more important than the domain.

One domain where I would like to see linked data being used more is the software development world. For example, in providing information about the available software, packages and libraries. I think it also might have a place in organizing support for developers in finding answers to various problems. For example, I’ve been toying with the idea of having a URI for every stack trace, because that would allow you to find related information in a better way. Pretty much every software developer every once in a while googles for a stack trace, exception names or error messages to find related information. I think there is an opportunity to provide more explicit links between the point in the code where the error occurs and the points on the Web where those who face the error can see what others have done who had the same problem.
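One way the idea of a URI for every stack trace might be prototyped is sketched below. This is only an illustration under stated assumptions, not a scheme proposed in the interview: the `http://example.org/stacktrace/` namespace, the function name `stack_trace_uri`, the choice of SHA-256, and the rule of dropping Java-style line numbers are all hypothetical.

```python
import hashlib
import re

# Hypothetical namespace; a real service would choose its own.
BASE = "http://example.org/stacktrace/"


def stack_trace_uri(trace: str) -> str:
    """Derive a stable URI from the text of a stack trace.

    Each line is trimmed, and Java-style line numbers such as ":42)"
    are removed, so that the same failure in slightly different
    versions of the code maps to the same URI.
    """
    frames = []
    for line in trace.strip().splitlines():
        line = line.strip()
        line = re.sub(r":\d+\)", ")", line)  # "(Foo.java:42)" becomes "(Foo.java)"
        frames.append(line)
    digest = hashlib.sha256("\n".join(frames).encode("utf-8")).hexdigest()
    # A shortened digest keeps the URI readable; the length is an arbitrary choice.
    return BASE + digest[:16]


trace = """java.lang.NullPointerException
    at com.example.Foo.bar(Foo.java:42)
    at com.example.Main.main(Main.java:7)"""
print(stack_trace_uri(trace))
```

With an identifier like this, a bug report, a forum answer and a commit could all point at the same URI, which is the kind of explicit link described above.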
