
Semantic CKAN – Revisited

A year or more ago the service http://semantic.ckan.net/ was
established as a side project to provide a Linked Data version of
the CKAN dataset registry. It was a very simple Python script
that used the CKAN API to generate the linked data as flat
files on disk, which were then served by a web server with
simple content negotiation support.
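
For flavour, that original approach can be sketched in a handful of
lines. This is illustrative rather than the actual script: the API
path, base URI and vocabulary choices below are assumptions.

    # Illustrative sketch of the original flat-file generator; the API
    # path, base URI and vocabulary are assumptions, not the real script.
    import json
    import urllib.request

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    API = "http://ckan.net/api/rest/package"    # assumed CKAN REST API path
    BASE = "http://semantic.ckan.net/package/"  # assumed base URI

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # One flat RDF file per package; the web server then negotiates
    # between the HTML and RDF representations.
    for name in fetch(API):
        pkg = fetch("%s/%s" % (API, name))
        g = Graph()
        ds = URIRef(BASE + name)
        g.add((ds, RDF.type, DCAT.Dataset))
        g.add((ds, DCTERMS.title, Literal(pkg.get("title", name))))
        g.serialize(destination=name + ".rdf", format="xml")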

In the meantime the number of data registries running the CKAN
software has grown quite impressively. The main http://ckan.net/
registry isn't even the biggest anymore, having been eclipsed by
http://data.gov.uk/, but it is still easily the most varied and
perhaps the most useful, depending on what sort of data you are looking
for. In any event, whilst it would still be possible to run the same
script for multiple registries and make the corresponding
configurations in the web server, the situation really wants an
approach that is designed from the start to work with a distributed
network of (meta)data sources. The current version of
http://semantic.ckan.net/ is the second iteration of work discussed in
a previous article.

There is parallel work happening within the core CKAN software. In the
context of the UKLP it has acquired the ability to harvest metadata
about georeferenced datasets for INSPIRE. In the context of LOD2
it is being extended to handle information about linked
data sources. In both of these cases the data model of CKAN is
severely strained. The INSPIRE data is only minimally modelled – this
is not a criticism, it is the right decision given the tight deadlines
for compliance with the EC directive – and for linked data the problem
is that dataset publishers typically publish a voiD description
of their data, and this canonical, rich description then needs to be
shoehorned into the CKAN data model. CKAN is production software,
however, used in some high-profile places, so a certain amount of
caution is required, which tends to favour incremental changes over a
radically different approach to the problem.

In this case, the radical approach was, instead of thinking of the
system as a fairly traditional website, to think of it as an
aggregation point where harvesting agents meet to dump the information
about datasets that they have found. In order not to constrain the
data model, and to easily support what might be called polymorphic
metadata records from different types of sources, the central storage
is an RDF triplestore. The first type of input source implemented was,
of course, CKAN, but two more are planned for the immediate future:
native RDF (DCat + voiD) and ISO19139/INSPIRE.
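
In outline the aggregation works something like the sketch below. The
class and function names are hypothetical; the idea is simply that
each harvester speaks one source protocol and dumps what it finds into
a named graph per source, so provenance is kept and re-harvesting is a
graph replacement rather than a diff.

    # Hypothetical sketch of the aggregation point; names are invented.
    from rdflib import Dataset, URIRef

    class CkanHarvester:
        """Speaks the CKAN protocol and emits RDF triples."""
        def __init__(self, api_url):
            self.api_url = api_url

        def harvest(self):
            # ... fetch packages and yield (subject, predicate, object)
            return iter(())

    store = Dataset()  # in practice a persistent, SPARQL-capable store

    def aggregate(source_uri, harvester):
        # One named graph per source: re-harvesting replaces the graph
        # wholesale instead of computing a diff.
        graph = store.graph(URIRef(source_uri))
        graph.remove((None, None, None))
        for triple in harvester.harvest():
            graph.add(triple)

    aggregate("http://ckan.net/", CkanHarvester("http://ckan.net/api"))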

Putting RDF at the core means that the schema is effectively
extensible by third-party data publishers to express whatever nuance
they think appropriate for describing their data – information that we
cannot possibly anticipate in a centralised way. A certain number of
special cases are needed, for example to handle the conventions
formalised by some curated dataset groups (e.g. lodcloud and lld),
taking information in CKAN's free-form tags and key-value pairs and
giving it explicit structure, but the code that does this is
well-circumscribed: it lives exclusively in the module responsible for
understanding the CKAN protocol and leaks neither into the core engine
nor into the user interface.
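
To make that special-casing concrete, assume the lodcloud convention
of recording interlinking in key-value pairs of the form
"links:<target>"; the CKAN module can then lift those into voiD
linksets. The sketch below is illustrative, not the actual module, and
the URIs and counts are invented.

    # Sketch: lifting "links:<target>" key-value pairs into voiD linksets.
    # The extras convention is assumed; URIs and counts are invented.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    VOID = Namespace("http://rdfs.org/ns/void#")
    BASE = "http://semantic.ckan.net/dataset/"  # assumed base URI

    def linksets_from_extras(name, extras):
        g = Graph()
        subject = URIRef(BASE + name)
        for key, value in extras.items():
            if not key.startswith("links:"):
                continue  # other extras stay as plain key-value metadata
            target = key.split(":", 1)[1]
            linkset = URIRef(BASE + name + "/links/" + target)
            g.add((linkset, RDF.type, VOID.Linkset))
            g.add((linkset, VOID.subjectsTarget, subject))
            g.add((linkset, VOID.objectsTarget, URIRef(BASE + target)))
            g.add((linkset, VOID.triples, Literal(int(value))))
        return g

    print(linksets_from_extras("dbpedia", {"links:geonames": "85000"})
          .serialize(format="turtle"))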

The user interface is done entirely in JavaScript and uses the
SPARQL endpoint as well as a subject-oriented JSON representation
of the various resources as appropriate (this JSON representation is
the one produced by the Raptor serialiser library). As such it is
completely decoupled from the back-end machinery. It understands the
standard DCat representation of data catalogue records but also has
some goodies for ones that have been augmented with voiD, particularly
the expression of interlinking relationships between datasets such as
are used to generate the LOD cloud diagram. If you look at the page
for a dataset that is described with voiD, for example the Gemeinsame
Normdatei, a navigable diagram is produced showing that dataset at
the centre together with its neighbours. In this way you can navigate
stepwise through the LOD cloud.
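
The interface itself is JavaScript, but the neighbourhood query behind
that diagram is easy to show from any language. Here is a Python
sketch against the SPARQL endpoint; the endpoint URL and the
result-format parameter are assumptions.

    # Sketch of the neighbourhood query the diagram relies on; the
    # endpoint URL and "format" parameter are assumptions.
    import json
    import urllib.parse
    import urllib.request

    ENDPOINT = "http://semantic.ckan.net/sparql"

    QUERY = """
    PREFIX void: <http://rdfs.org/ns/void#>
    SELECT DISTINCT ?neighbour WHERE {
      { ?ls void:subjectsTarget <%(ds)s> ; void:objectsTarget ?neighbour }
      UNION
      { ?ls void:objectsTarget <%(ds)s> ; void:subjectsTarget ?neighbour }
    }
    """

    def neighbours(dataset_uri):
        params = urllib.parse.urlencode({
            "query": QUERY % {"ds": dataset_uri},
            "format": "application/sparql-results+json",
        })
        with urllib.request.urlopen(ENDPOINT + "?" + params) as resp:
            bindings = json.load(resp)["results"]["bindings"]
        return [b["neighbour"]["value"] for b in bindings]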

Since CKAN has the notion of curated groups and group membership is
recorded, we can also visualise a group's members and their
interconnections, as well as interconnections to datasets outside the
group. A good example of this is the LLD group, where internal
datasets are coloured blue and external ones grey. Contrast this
visualisation with the non-linked-data group of dictionaries. There
is nothing new or earth-shattering in this, but the difference is
nevertheless quite striking.

Finally, because the CKAN API is quite convenient for many purposes
and a growing number of tools take advantage of it, it is quite easy,
given a store of rich data, to clone the relevant parts of that API.
In addition to making linked data available to clients that are not
linked-data aware, this makes it possible for a computer program to
search and retrieve across the whole network of dataset registries
instead of one registry at a time.
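
Assuming the clone mirrors the CKAN search API of the time (the path
and parameters below are illustrative), the effect for a client is one
query against the aggregator instead of one per registry:

    # Sketch of a client query against the API clone; the search path
    # and parameters are assumed to mirror the CKAN API of the time.
    import json
    import urllib.parse
    import urllib.request

    AGGREGATOR = "http://semantic.ckan.net/api/search/package"

    def search(query):
        url = AGGREGATOR + "?" + urllib.parse.urlencode({"q": query})
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # The same call an existing CKAN client would make against a single
    # registry now returns results from every harvested registry at once.
    print(search("bibliographic"))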

Despite this success, some shortcomings of the data are apparent. It
is not possible – or at least there is no convention for – linking
datasets between two different CKAN servers. This is of course
directly possible, and in fact easy, with voiD, but there is no way to
say that such and such a linked dataset recorded on the Czech CKAN
server links to DBpedia, whose metadata lives on the main server.
Instead the data must be duplicated, and in this duplication it gets
assigned a different package id, leaving us with no particular way to
know that it is indeed the same dataset. Similarly it is impossible to
have a curated group of datasets that spans instances. Again, this is
trivial to represent in the RDF storage, and if we had such data it
would automatically work with this system, but there is no direct way
to create such information at this time.
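
To show why this is trivial on the RDF side, here is what such a
cross-registry linkset could look like; the dataset URIs are invented
for the example.

    # Invented example of a cross-registry voiD linkset: the subject
    # dataset is registered on the Czech instance, the target on the
    # main one.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    VOID = Namespace("http://rdfs.org/ns/void#")

    g = Graph()
    ls = URIRef("http://semantic.ckan.net/linkset/cz-example-dbpedia")
    g.add((ls, RDF.type, VOID.Linkset))
    g.add((ls, VOID.subjectsTarget,
           URIRef("http://cz.ckan.net/dataset/example")))
    g.add((ls, VOID.objectsTarget,
           URIRef("http://ckan.net/dataset/dbpedia")))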

The heaviest users of CKAN, apart from the UK government, which has its
own metadata scheme and whose linked data initiative does not touch
its CKAN infrastructure, are the LOD community. They have developed a
well-specified set of conventions that are accordingly easy to work
with. Other users of CKAN are more haphazard with their use of tags
and such, which means it is very difficult to infer any particular
meaning from them. It has been suggested before to have wiki pages for
particular tags and keys from the key-value pairs, and in fact this is
the way they are modelled, with links to http://wiki.ckan.net/, so that
the community can collaboratively decide on the meaning of their
metadata elements. It would be quite useful were this to be followed
through, and more interesting still if the wiki in question were to
use OntoWiki or Semantic MediaWiki so that the information about
these metadata elements could be fed back into the system.

Roadmap

Concerning future work, the following are planned for the near future:

  • Pass-through write operations on the API. This will require that
    one has the appropriate credentials for the data source.
  • Native RDF data sources – consumption of the well-known void.ttl
    files and use of DCat catalogues as sources.
  • Consumption of data from INSPIRE CSW servers and transformation
    into RDF together with lightweight geodata support.
