The Science in LOD2

The Linked Data principles are a very simple set of best practices for publishing and interlinking data on the Web. You might now ask: so what is the LOD2 project all about? Does it perform any research? In fact, although the principles are so simple, publishing large amounts of data on the Web reveals a number of research challenges. At the same time, the Web of Linked Data is a perfect test bed for showcasing the practicability and efficiency of scientific approaches. In this post we briefly outline the main research objectives of LOD2:

1. Improving the performance of very large-scale RDF data management. Experience demonstrates that an RDF database can be an order of magnitude less efficient than a relational representation running on the same engine. This lack of efficiency is perceived as the main obstacle to large-scale deployment of semantic technologies in corporate applications and to expressive Data Web search. For RDF to be the lingua franca of data integration, which is its birthright, its use must not bring a significant performance penalty compared with the much less flexible best practices prevalent today. We therefore intend to close the performance gap between relational and RDF data management by developing adaptive, automatic data indexing technologies that create and exploit indexing structures as and when needed, based entirely on the query workload received. We will make sure this new RDF indexing technology finds its way into RDF processing systems, both directly through participation in the project and by publishing the technologies openly and making them available in open-source systems such as Virtuoso and MonetDB.
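
To make the idea of workload-driven indexing concrete, here is a minimal Python sketch. It is not the LOD2, Virtuoso or MonetDB implementation; the class name, the `threshold` parameter and the example triples are all invented for illustration. The toy store answers triple patterns by scanning until the same query shape (the same combination of bound positions) has been seen `threshold` times, and only then builds a hash index for that shape.

```python
from collections import Counter, defaultdict


class AdaptiveTripleStore:
    """Toy triple store (hypothetical) that indexes a query shape only once it recurs."""

    def __init__(self, triples, threshold=2):
        self.triples = list(triples)
        self.threshold = threshold  # assumed tuning knob, not taken from LOD2
        self.workload = Counter()   # how often each query shape has been seen
        self.indexes = {}           # shape -> {bound values -> matching triples}

    def match(self, s=None, p=None, o=None):
        pattern = (s, p, o)
        # The shape records which positions are bound, e.g. (1,) for "predicate only".
        shape = tuple(i for i, term in enumerate(pattern) if term is not None)
        if not shape:
            return list(self.triples)
        self.workload[shape] += 1
        if shape not in self.indexes and self.workload[shape] >= self.threshold:
            self._build_index(shape)
        if shape in self.indexes:
            key = tuple(pattern[i] for i in shape)
            return list(self.indexes[shape].get(key, []))
        # No index for this shape yet: fall back to a full scan.
        return [t for t in self.triples if all(t[i] == pattern[i] for i in shape)]

    def _build_index(self, shape):
        index = defaultdict(list)
        for t in self.triples:
            index[tuple(t[i] for i in shape)].append(t)
        self.indexes[shape] = index


store = AdaptiveTripleStore([
    ("ex:a", "rdf:type", "ex:City"),
    ("ex:b", "rdf:type", "ex:City"),
    ("ex:a", "ex:locatedIn", "ex:c"),
])
first = store.match(p="rdf:type")   # answered by a full scan
second = store.match(p="rdf:type")  # second occurrence triggers index creation
```

Both calls return the same two triples; the difference is that the second one is served from the newly built index. A real system would also have to weigh index maintenance cost and memory, which this sketch ignores.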

2. Increasing and easing the interlinking and fusion of information. While the amount of data published as Linked Data already runs into the billions and is growing steadily, the number of links between data sets is several orders of magnitude smaller, and the links are far more difficult to maintain. In LOD2 we address this problem by integrating schema mapping and data interlinking algorithms into a mutual refinement cycle, where results on either side (schema and data) help to improve mapping and interlinking on the other side. We will investigate both unsupervised and supervised machine learning techniques for this task, where the latter enable knowledge base maintainers to produce high-quality mappings. The project will contribute mappings for well-known published data sets. In addition, further research is needed in the area of data fusion, i.e. the process of integrating multiple data items that represent the same real-world object into a single, consistent, and clean representation. The main challenge in data fusion is the reliable resolution of data conflicts, i.e. choosing a value in situations where multiple sources provide different values for the same property of an object.
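
As a small illustration of what conflict resolution in data fusion can look like, the following Python sketch picks one value per (object, property) pair by trust-weighted voting. This is only one simple strategy among many and is not presented as the LOD2 approach; the function name `fuse`, the source names, the trust weights and the data are all made up.

```python
from collections import defaultdict


def fuse(claims, trust):
    """Pick one value per (entity, property) by trust-weighted voting.

    claims: iterable of (source, entity, property, value) tuples
    trust:  dict mapping source -> weight (hypothetical scores);
            sources without a score get a default weight of 1.0
    """
    votes = defaultdict(lambda: defaultdict(float))
    for source, entity, prop, value in claims:
        votes[(entity, prop)][value] += trust.get(source, 1.0)
    # On a tie, max() keeps the value that was claimed first.
    return {key: max(values, key=values.get) for key, values in votes.items()}


# Invented example: three sources disagree on one property of ex:CityX.
claims = [
    ("sourceA", "ex:CityX", "ex:population", 100),
    ("sourceB", "ex:CityX", "ex:population", 120),
    ("sourceC", "ex:CityX", "ex:population", 120),
    ("sourceA", "ex:CityX", "ex:country", "ex:CountryY"),
]
trust = {"sourceA": 0.9, "sourceB": 0.6, "sourceC": 0.5}
fused = fuse(claims, trust)
```

Here `sourceA` is the most trusted single source, but the two sources that agree on 120 outweigh it together (1.1 against 0.9), so 120 is chosen. Whether agreement should beat individual trust is exactly the kind of design question that makes conflict resolution hard.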

3. Improving the structure, semantic richness and quality of Linked Data. Many data sets on the current Data Web lack structure and rich knowledge representation, and contain defects and inconsistencies. Hence, methods for learning ontology concept definitions from instance data have to be investigated in order to facilitate the easy, incremental and self-organizing creation and maintenance of semantically rich LOD2 knowledge bases. Existing machine learning algorithms have to be extended from basic Description Logics such as ALC to expressive ones such as SROIQ(D), which serves as the basis of OWL 2. The algorithms also have to be optimized for processing very large-scale knowledge bases. In addition, we will pursue the development of tools and algorithms for user-friendly knowledge base maintenance and repair, which make it possible to detect and fix inconsistencies and modelling errors.

4. User interfaces and interaction paradigms. All the different Data Web aspects rely heavily on end-user interaction. We have to empower users to formulate expressive queries that exploit the rich structure of Linked Data. Users have to be engaged in authoring and maintaining knowledge derived from heterogeneous and dispersed sources on the Data Web. For interlinking and fusion, as well as for improvements to classification, structure and quality, end users have to be able to give feedback on the automatically obtained suggestions with little effort. Last but not least, user interaction has to preserve privacy, ensure provenance and, particularly in corporate environments, be regulated using access control.

In LOD2 we plan to tackle these challenges not in isolation, but by investigating methods which facilitate a mutual fertilization of approaches developed to solve these challenges. Examples include the following:

  • The detection of mappings on the schema level, for example, will directly affect instance level matching and vice versa.
  • Ontology schema mismatches between knowledge bases can be compensated for by learning which concepts of one are equivalent to which concepts of the other knowledge base.
  • Feedback and input from end users can be taken as training input (i.e. as positive or negative examples) for the machine learning techniques in order to perform inductive reasoning on larger knowledge bases, whose results can again be assessed by end users for iterative refinement.
  • Semantically enriched knowledge bases improve the detection of inconsistencies and modelling problems, which in turn results in benefits for interlinking, fusion, and classification.
  • The querying performance of the RDF data management directly affects all other components, and the nature of the queries issued by those components in turn affects the RDF data management.
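
The first example in the list, schema-level mappings and instance-level matches improving each other, can be sketched as a small fixed-point loop. The Python below is a deliberately naive illustration under strong assumptions (exact value equality, one-to-one property mappings, tiny in-memory dictionaries); the function names and the data are hypothetical and do not describe the LOD2 algorithms.

```python
def match_instances(kb_a, kb_b, prop_map):
    """Link instances that agree on every mapped property they both have (at least one)."""
    links = set()
    for a, props_a in kb_a.items():
        for b, props_b in kb_b.items():
            shared = [(pa, pb) for pa, pb in prop_map.items()
                      if pa in props_a and pb in props_b]
            if shared and all(props_a[pa] == props_b[pb] for pa, pb in shared):
                links.add((a, b))
    return links


def infer_prop_map(kb_a, kb_b, links, prop_map):
    """Add property mappings whose values agree on all currently linked pairs."""
    new_map = dict(prop_map)
    props_a = sorted({p for a, _ in links for p in kb_a[a]})
    props_b = sorted({p for _, b in links for p in kb_b[b]})
    for pa in props_a:
        if pa in new_map:
            continue
        for pb in props_b:
            if pb in new_map.values():
                continue
            if all(pa in kb_a[a] and kb_a[a][pa] == kb_b[b].get(pb) for a, b in links):
                new_map[pa] = pb
                break
    return new_map


def refine(kb_a, kb_b, seed_map):
    """Alternate instance matching and schema mapping until nothing changes."""
    prop_map, links = dict(seed_map), set()
    while True:
        new_links = match_instances(kb_a, kb_b, prop_map)
        new_map = infer_prop_map(kb_a, kb_b, new_links, prop_map)
        if new_links == links and new_map == prop_map:
            return prop_map, links
        prop_map, links = new_map, new_links


# Invented data: a3 and b3 carry no name/label, so the seed mapping cannot link them.
kb_a = {"a1": {"name": "X", "founded": 1000},
        "a2": {"name": "Y", "founded": 1200},
        "a3": {"founded": 1300}}
kb_b = {"b1": {"label": "X", "established": 1000},
        "b2": {"label": "Y", "established": 1200},
        "b3": {"established": 1300}}
prop_map, links = refine(kb_a, kb_b, {"name": "label"})
```

Starting from the single seed mapping `name` to `label`, the first round links `a1`/`b1` and `a2`/`b2`. Those links reveal that `founded` corresponds to `established`, and that new schema mapping in turn allows `a3` and `b3` to be linked in the next round.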

As a result of such interdependence, we intend to realize an improvement cycle for LOD2 knowledge bases (as depicted in the figure), in which an improvement of a knowledge base with regard to one aspect (e.g. a new alignment with another interlinking hub) triggers a number of possible further improvements (e.g. additional instance matches).
