Weblog

LOD2 book published

We are happy to present the book summarizing the results of the LOD2 project: Linked Open Data – Creating Knowledge Out of Interlinked Data. Results of the LOD2 Project, edited by Sören Auer, Volha Bryl and Sebastian Tramp. The book is published as a volume in Springer's Lecture Notes in Computer Science series and is open access, i.e., available for free in electronic form.

The book is a joint effort of the LOD2 consortium and consists of two parts. The first part, on technology, covers the advances in RDF data management, extraction, creation, enrichment, interlinking, data fusion, authoring, exploration and visualization, as well as the Linked Data Stack. The second part of the book presents use cases in publishing, linked enterprise data, and open government data. The book gives an overview of a diverse set of research, technology, and application advances and refers the reader to further detailed technical information in the project deliverables and original publications. In that regard, the book is targeted at IT professionals, practitioners, and researchers aiming to gain an overview of some key aspects of the emerging field of Linked Data.


Posted in Uncategorized

DBpedia Version 2014 released

The LOD2 project is happy to announce the release of DBpedia 2014.

Knowledge bases are playing an increasingly important role in enhancing the intelligence of Web and enterprise search and in supporting information integration as well as natural language processing. Today, most knowledge bases cover only specific domains, are created by relatively small groups of knowledge engineers, and are very cost intensive to keep up-to-date as domains change. At the same time, Wikipedia has grown into one of the central knowledge sources of mankind, maintained by thousands of contributors.

The DBpedia project leverages this gigantic source of knowledge by extracting structured information from Wikipedia and by making this information accessible on the Web as a large, multilingual, cross-domain knowledge base.

The most important improvements of the new release compared to DBpedia 3.9 are:

1. the new release is based on updated Wikipedia dumps dating from April/May 2014 (the 3.9 release was based on dumps from March/April 2013), leading to an overall increase in the number of things described in the English edition from 4.26 million to 4.58 million.

2. the DBpedia ontology has been enlarged, and the number of infobox-to-ontology mappings has risen, leading to richer and cleaner data.

The English version of the DBpedia knowledge base currently describes 4.58 million things, out of which 4.22 million are classified in a consistent ontology (http://wiki.dbpedia.org/Ontology2014), including 1,445,000 persons, 735,000 places (including 478,000 populated places), 411,000 creative works (including 123,000 music albums, 87,000 films and 19,000 video games), 241,000 organizations (including 58,000 companies and 49,000 educational institutions), 251,000 species and 6,000 diseases.
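
As a quick illustration (ours, not part of the original announcement), such counts can be checked against the public DBpedia SPARQL endpoint, for example with Python's SPARQLWrapper library. A minimal sketch, assuming the standard endpoint and ontology namespace; the live service returns current figures, not the 2014-release numbers above:

# Sketch: counting instances of dbo:Person on the public DBpedia endpoint.
# Requires SPARQLWrapper (pip install SPARQLWrapper).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT (COUNT(?s) AS ?persons) WHERE { ?s a dbo:Person }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
print(results["results"]["bindings"][0]["persons"]["value"])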


Posted in Uncategorized

LOD2 PiLOD cooperation summary

To gain more knowledge about linked data, the Dutch PiLOD project applied for LOD2 support in the PUBLINK Call 2013. With the help of Martin Kaltenböck (SWC), we arranged several knowledge exchange and sharing sessions. This showed that linked data is a truly international topic: we talked with, and held sessions with, people from the Netherlands, the UK, Belgium, Germany and Austria.

We had three webinar sessions and one session where the presenter took the effort to visit us from Germany. On average, ten interested people followed each session and gained quite a lot of knowledge. Even though most participants were quite knowledgeable about linked data, RDF and SPARQL, most presentations required full attention to follow. PiLOD participant Pieter van Everdingen joined all sessions and reported on them on the pilod.nl site.

The first session was on Virtuoso, where Hugh Williams and Patrick van Kleef answered a lot of questions about the Virtuoso triple store. Luckily, several people already had experience with Virtuoso, so they were able to ask interesting questions. For the questions and answers, see: http://www.pilod.nl/wiki/Virtuoso_questions

The second session was on OntoWiki. Sebastian Tramp took the plane to visit us in Nieuwegein, where he explained and demonstrated OntoWiki. PiLOD participant Richard Nagelmaeker arranged a meeting room at Ordina. The presentation was so interesting that we took more time than originally planned; people even skipped lunch to be able to absorb the knowledge. For the questions and answers, see: http://www.pilod.nl/wiki/OntoWiki_questions

The third session was on the linking tools Silk and LIMES. In this session, Robert Isele explained and demonstrated the workings of Silk. Due to technical difficulties with the webinar startup we started late, so Axel's presentation could not be given live. Luckily, there was a recorded webinar on LIMES, which we played instead. Again we gained a lot of knowledge. For details, see: http://www.pilod.nl/wiki/Details_Linking_Tools_Silk_and_Limes_presentation_on_june_20

The fourth session was on PoolParty. Product manager Andreas Blumauer explained PoolParty, giving several examples and demonstrations. For details, see: http://www.pilod.nl/wiki/Details_Poolparty_on_june_27

In addition to the four knowledge sessions, we had a separate session in which Bert Van Nuffelen helped us install the LOD2 software on our PiLOD platform server. This server lets all PiLOD participants try out the software and experiment with linked data. The installed software is listed here: http://www.pilod.nl/wiki/Pilod_installed_software

Our thanks go to:

  • Martin Kaltenböck (Semantic Web Company, LOD2 PUBLINK)
  • Hugh Williams (OpenLink Software)
  • Patrick van Kleef (OpenLink Software)
  • Sebastian Tramp (University of Leipzig)
  • Robert Isele (brox GmbH)
  • Axel Ngonga (University of Leipzig)
  • Thomas Thurner (Semantic Web Company)
  • Andreas Blumauer (Semantic Web Company)
  • Bert Van Nuffelen (TenForce)

More details on the sessions can be found here:

Gerard Persoon (mail@gpersoon.com) and Richard Nagelmaeker (richard.nagelmaeker@ordina.nl)

Posted in LOD Training, PUBLINK

An Award for LOD2 and Rozeta

We’re pleased to announce that Rozeta has won a technology innovation award at the recent 58th International Fair of Technics and Technical Achievements in Belgrade, Serbia!


The International Fair of Technics and Technical Achievements is a major fair and conference in the region, showcasing state-of-the-art industry achievements, with more than 600 exhibitors from 22 countries (over 15,000 m² of exhibition space) every year.

Posted in Uncategorized

European Data Forum 2014 in Athens was a great success

About three weeks ago, the European Data Forum 2014 (EDF2014) took place in Athens, Greece, on 19-20 March 2014. LOD2 was again one of the main organisers of this 3rd edition of the free community event to foster the European data economy.

More than 500 industry leaders, researchers, and policy makers met to discuss the challenges and opportunities of data-driven innovation. With expert speakers in two parallel tracks, an executive panel on Big Data, several networking sessions, a poster session, a two-day exhibition area, the awarding of the European Data Innovator Award 2014 and 5 specialised collocated events, EDF2014 was the largest assembly of data experts in the EU to date.

Neelie Kroes, the Vice President of the European Commission in charge of the Digital Agenda, opened EDF2014 and extended a warm welcome to all participants: “It is an inspiring event today, it is important for Greece, it is important for Europe”. She pointed out that we are in the Big Data era, which means more data and more ways to collect, manage, manipulate and use it, making an impact and a difference in people’s lives: “Data to empower people”.

The Minister of Administrative Reform and eGovernance of Greece, Kyriakos Mitsotakis, addressed EDF2014 and reinforced the Greek Government’s focus on core data economy areas as a means to promote transparency, government efficiency and growth: “The open data revolution is a trend which my country cannot simply afford to ignore”.


Posted in Uncategorized

New release of the WebDataCommons RDFa, Microdata and Microformat data sets

We are happy to announce a new release of the WebDataCommons RDFa, Microdata, and Microformat data sets.

The data sets have been extracted from the November 2013 version of the Common Crawl covering 2.24 billion HTML pages which originate from 12.8 million websites (pay-level-domains).

Altogether we discovered structured data within 585 million HTML pages out of the 2.24 billion pages contained in the crawl (26%). These pages originate from 1.7 million different pay-level-domains out of the 12.8 million pay-level-domains covered by the crawl (13%).

Approximately 471 thousand of these websites use RDFa, while 463 thousand websites use Microdata. Microformats are used on 1 million websites within the crawl.

Background

More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages, using markup formats such as RDFa, Microdata and Microformats.
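
To make the markup concrete (an illustration added here, with an invented HTML snippet), Microdata like this can be extracted in a few lines using the Python extruct library; the library choice is ours, not something the announcement mentions:

# Sketch: extracting schema.org Microdata from an HTML fragment.
# Requires extruct (pip install extruct); the snippet and values are toys.
import extruct

html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <span itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">9.99</span>
  </span>
</div>
"""

data = extruct.extract(html, syntaxes=["microdata"])
for item in data["microdata"]:
    print(item["type"], item["properties"])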

The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus publicly available, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies used with each format.

General information about the WebDataCommons project is found at
http://webdatacommons.org/

Data Set Statistics

Basic statistics about the November 2013 RDFa, Microdata, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:
http://webdatacommons.org/structureddata/2013-11/stats/stats.html

Comparing these statistics to those for the August 2012 release of the data sets
http://webdatacommons.org/structureddata/2012-08/stats/stats.html
we see that the adoption of the Microdata markup syntax has increased strongly (463 thousand websites in 2013 compared to 140 thousand in 2012, even though the 2013 version of the Common Crawl covers significantly fewer websites than the 2012 version).

Looking at the adoption of different vocabularies, we see that webmasters mostly follow the recommendation by Google, Microsoft, Yahoo, and Yandex to use the schema.org vocabularies as well as their predecessors in the context of Microdata. In the context of RDFa, the most widely used vocabulary is the Open Graph Protocol recommended by Facebook.

Looking at the most frequently used classes, we see that besides navigational, blog, and CMS-related meta-information, many websites mark up e-commerce-related data (products, offers, and reviews) as well as contact information (LocalBusiness, Organization, PostalAddress).

Download

The overall size of the November 2013 RDFa, Microdata, and Microformat data sets is 17.2 billion RDF quads. For download, we split the data into 3,398 files with a total size of 332 GB.
http://webdatacommons.org/structureddata/2013-11/stats/how_to_get_the_data.html
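
As an idea of what is in the files (a sketch we add here; the two quads are invented, and rdflib is our choice of parser): each line is an RDF quad whose fourth element names the page the triple was extracted from.

# Sketch: reading WebDataCommons-style N-Quads with rdflib.
# Requires rdflib (pip install rdflib).
from rdflib import ConjunctiveGraph

sample = """
<http://example.org/page#product> <http://schema.org/name> "Widget" <http://example.org/page> .
<http://example.org/page#product> <http://schema.org/price> "9.99" <http://example.org/page> .
"""

g = ConjunctiveGraph()
g.parse(data=sample, format="nquads")

for s, p, o, ctx in g.quads((None, None, None, None)):
    print(s, p, o, "from", ctx.identifier)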

Lots of thanks to
+ the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data parsers.
+ Amazon Web Services for supporting WebDataCommons.

Christian Bizer, Petar Petrovski and Robert Meusel, University of Mannheim

Posted in Uncategorized

LOD2 Plenary and Open Data Meet-up in Mannheim

There was a LOD2 plenary meeting hosted by Chris Bizer from the University of Mannheim this week.

The plenary meeting was preceded by a Linked Open Data Meetup with talks from Springer, fluid Operations, and several LOD2 partners (Universität Leipzig, University of Mannheim, the Semantic Web Company, and Wolters Kluwer Deutschland GmbH (WKD)).

Below is a summary write-up of the events by Orri Erling from OpenLink Software.

Group Photo at LOD2 Plenary Mannheim

Wolters Kluwer Deutschland GmbH (WKD) gave a presentation on the content production pipeline of their legal publications and their experiences in incorporating LOD2 technologies for content enrichment. This is a very successful LOD2 use case and demonstrates the value of linked data for the information industry.

Springer gave a talk about their interest in linked data for enriching the Lecture Notes in Computer Science product. Conference proceedings could also be enhanced with structured metadata in RDF. I asked about nanopublications. The comment was that content authors might perceive nanopublications as an extra imposition. On the other hand, in the life sciences field there is a lot of enthusiasm for the idea. We will see; anyway, biology will likely lead the way for nanopublications. I referred Aliaksandr Birukou of Springer to the companies Euretos and its parent S&T in Delft, Netherlands, and to Barend Mons, scientific director of NBIC, the Netherlands Bioinformatics Centre. These are among the founding fathers of the Nano Republic, as they themselves put it.

Sebastian Hellmann gave a talk on efforts to set up the DBpedia Foundation as a not-for-profit organization, hopefully in the next 10 days, to aid in the sustainability and growth of the DBpedia project. The Foundation would identify stakeholders, their interests, and ways to generate income to further improve DBpedia. Planned areas of improvement include the development of high-availability value-added DBpedia services with quality of service (QoS) agreements for enterprise users; additional tools in the DBpedia stack to support improved and cost-efficient data curation and internationalization; and improved documentation, tutorials, and support to speed uptake.

I had a word with Peter Haase of fluid Operations about the Optique project and their cloud management offerings. The claim is to do ontology-directed querying over thousands of terabytes of heterogeneous data. This turns out to be a full-force attempt at large-scale SQL federation with ontology-directed query rewriting covering OWL 2 QL semantics. With Ian Horrocks of Oxford leading the ontology side, the matter is in good hands. Still, the matter is not without its problems. Simple lookups can be directed to the data, but if there are terabytes of it, it is more likely that aggregations are what is desired. Federated aggregation tends to move a lot of data. So the problems are as they ever were. However, if the analytics are already done and stored in the relational space, finding these based on ontologies is a worthwhile thing for streamlining end-user access to information.
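
To give a feel for what ontology-directed rewriting means (our sketch of the general OWL 2 QL idea, with an invented class hierarchy, not Optique code): a query atom over a class is expanded into a union over its subclasses, so a plain query engine with no reasoner still returns complete answers.

# Sketch: expanding a class atom into a union over its subclasses,
# the core move of OWL 2 QL-style query rewriting. Names are invented.
subclass_of = {
    "Employee": "Person",
    "Manager": "Employee",
}

def subclasses_including(cls):
    # All classes whose instances are, by the hierarchy, also instances of cls.
    result = {cls}
    changed = True
    while changed:
        changed = False
        for sub, sup in subclass_of.items():
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result

union = " UNION ".join(
    "{ ?x a :%s }" % c for c in sorted(subclasses_including("Person"))
)
print("SELECT ?x WHERE { %s }" % union)
# SELECT ?x WHERE { { ?x a :Employee } UNION { ?x a :Manager } UNION { ?x a :Person } }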

The LOD2 plenary itself was structured in the usual way, covering the work packages in two parallel tracks.

On the database side, the final victory will be won by going to an adaptive schema for RDF. We have brought the RDF penalty against relational down to a factor of 2.5 for common analytics-style queries, e.g., the Star Schema Benchmark. This is in comparison to Virtuoso SQL, which offers very high performance on this workload: over 2x the speed of column-store pioneer MonetDB and 300x that of MySQL. So this is where matters stand. To move them significantly forward, exploitation of structure for guiding physical storage will be needed. The project also still has to deliver the 500 Gtriple results. The experiments around Christmas at CWI support the possibility, but they are not final. Putting triples into tables when the triples in fact form table-shaped structures, which is the case most of the time, may turn out to be necessary for this. At least, it will be a significant help.

Be that as it may, using a table schema for regularly shaped data, while preserving the RDF quad flexibility, would essentially abolish the RDF tax and bring the LOD2 project to a glorious conclusion in August.
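
A sketch of the intuition behind "triples that form table-shaped structures" (our illustration, invented data): group subjects by their exact property set, often called characteristic sets; subjects sharing a set are candidates for rows of one relational table.

# Sketch: finding table-shaped structure in triples via characteristic sets.
from collections import defaultdict

triples = [
    ("p1", "name", "Alice"), ("p1", "age", "30"),
    ("p2", "name", "Bob"),   ("p2", "age", "25"),
    ("c1", "label", "ACME"),
]

# Property set per subject.
props = defaultdict(set)
for s, p, o in triples:
    props[s].add(p)

# Subjects grouped by identical property sets.
tables = defaultdict(list)
for s, ps in props.items():
    tables[frozenset(ps)].append(s)

for schema, subjects in tables.items():
    print(sorted(schema), "->", subjects)
# ['age', 'name'] -> ['p1', 'p2']   (a candidate two-column table)
# ['label'] -> ['c1']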

I took the poetic license to compare the data journey into RDF and back to the Egyptian myth of Osiris: the data gets shut in a silo, then cut into 14 pieces and thrown into the Nile (i.e., the LOD cloud, or the CKAN catalog). Grief-stricken Isis sees what has become of her love: she patiently reassembles the pieces, reconstructing Osiris in fact so well that he sires her a child, hawk-headed Horus, who proceeds to reclaim his father’s honor. (See, Isis means Intelligent Structured Information Storage.)

I had many interesting conversations with Chris Bizer about his research in data integration, working with the 150M HTML tables in the Common Crawl. The idea is to resolve references and combine data from the tables. Interestingly enough, the data model in these situations is basically triples, yet these are generally not stored as RDF but in Lucene. This makes sense due to the string-matching nature of the task. There appears to be an opportunity in bringing together the state of the art in databases, meaning the very highly optimized column store and vectored execution in Virtuoso, with the search-style workload found in instance matching and other data integration tasks. The promise goes in the direction of very fast ETL and subsequent discovery of structural commonalities and enrichment possibilities. This is also not infinitely far from the schema discovery that one may do in order to adaptively optimize storage based on the data.

Volha Bryl gave a very good overview of the Mannheim work in the data integration domain. For example, learning data fusion rules from examples of successful conflict resolution seems very promising. Learning text extraction rules from examples is also interesting. The problem of data integration is that the tasks are very heterogeneous, and therefore data integration suites have very large numbers of distinct tools. This is labor intensive, but there is progress in automation. An error-free, or nearly so, data product remains a case-by-case matter involving human curation, but automatic methods seem, based on Volha's and Chris's presentations, to be in the ballpark for statistics.

Giovanni Tummarello of Insight/SindiceTech, always the life of the party, presented his Solr-based relational faceted browser. The idea is to show and drill down by facets over a set of related tables; in the demo, these were investments, investment targets, and investors. You can look at the data from any of these starting points and restrict the search based on attributes of any other. Well, this is what a database does, right? True, but the Sindice tool sits on top of Solr and actually materializes joins into a document. This blows up the data but keeps everything co-located, so it can even run from disk. We also talked about the Knowledge Graph package Sindice offers on the Google cloud, this time a Virtuoso application.
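
To illustrate materializing joins into a document (our sketch with invented data, not SindiceTech code): each investment row is flattened together with its investor and target, so every facet the browser needs is local to one document.

# Sketch: denormalizing related tables into flat, facetable documents,
# as a Solr-backed relational faceted browser might index them.
investors = {"i1": {"name": "Acme Capital", "country": "IE"}}
targets = {"t1": {"name": "WidgetCo", "sector": "IoT"}}
investments = [{"investor": "i1", "target": "t1", "amount": 2000000}]

docs = []
for inv in investments:
    docs.append({
        "amount": inv["amount"],
        # Joined fields are copied in: redundant, but join-free at query time.
        "investor_name": investors[inv["investor"]]["name"],
        "investor_country": investors[inv["investor"]]["country"],
        "target_name": targets[inv["target"]]["name"],
        "target_sector": targets[inv["target"]]["sector"],
    })

print(docs[0])  # one flat document; every facet can be answered locally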

We hope that negotiations between SindiceTech and Insight (formerly DERI) around open-sourcing the SPARQL editor and other items come to a successful conclusion. The SPARQL editor especially would be of general interest to the RDF community. It is noteworthy that there is no SPARQL query builder in common use out there (even OpenLink's own open-source iSPARQL has been largely (but not entirely!) overlooked and misunderstood, though it has been available as part of the OpenLink Ajax Toolkit for several years). OK, a query builder is useful when there is a schema. But if the schema is an SQL one, as will be the case if RDF is adaptively stored, then any SQL query builder can be applied to the regular portion of the data. Forty years of calendar time and millennia of person-years have gone into making SQL front ends, and these will become applicable overnight; Virtuoso does speak SQL, as you may know.

I had the breakout session about the database work in LOD2. What will be done is clear enough, the execution side is very good, and our coverage of the infinite space of query optimization continues to grow. One more revolution for storage may come about, as suggested above. There is not very much to discuss, just to execute. So I used the time to explain how you run:


SELECT  SUM ( l_extendedprice )
  FROM  lineitem
     ,  part
 WHERE  l_partkey = p_partkey
   AND  p_name LIKE '%green%'


Simple query, right? Sure, but application guys or sem-heads generally have no clue how these in fact need to be done. I have the probably foolish belief that a little understanding of databases, especially in the RDF space, which gets hit by every query optimization problem there is, would be helpful. At least one would know what went wrong. So I explained to Giovanni, who is in fact a good geek, that this is a hash join, and with only a little prompting he suggested that you should also put a Bloom filter in front of the hash table. Good. So in the bar after dinner I was told I ought to teach. Maybe. But the students would have to be very fast and motivated. Anyway, the take-home message is that the DBMS must figure this out. In the SQL space this is easier, and of course, if most of RDF reduces to this, then RDF too will become more predictable in this department.
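
For readers who want the concrete picture (our toy illustration of the textbook technique, not Virtuoso internals): build a hash table over the parts whose name matches, put a Bloom filter in front of it, and probe with each lineitem row.

# Sketch: hash join with a Bloom filter for the query above,
# on tiny invented stand-ins for TPC-H part and lineitem.
part = [(1, "forest green metal"), (2, "navy blue steel"), (3, "green tin")]
lineitem = [(1, 100.0), (2, 50.0), (3, 75.0), (3, 25.0)]  # (l_partkey, l_extendedprice)

# Build side: hash table of part keys whose name matches '%green%'.
build = {p_partkey for p_partkey, p_name in part if "green" in p_name}

# Bloom filter: a small bit array consulted before touching the hash table.
BITS = 64
bloom = 0
for key in build:
    bloom |= 1 << (hash(("a", key)) % BITS)
    bloom |= 1 << (hash(("b", key)) % BITS)

def maybe_in_build(key):
    # False means definitely not on the build side; True means worth probing.
    return (bloom >> (hash(("a", key)) % BITS)) & 1 and \
           (bloom >> (hash(("b", key)) % BITS)) & 1

# Probe side: scan lineitem, screen cheaply with the Bloom filter, then join.
total = sum(price for key, price in lineitem
            if maybe_in_build(key) and key in build)
print(total)  # 100.0 + 75.0 + 25.0 = 200.0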

I talked with Martin Kaltenböck of the Semantic Web Company about his brilliant networking accomplishments around organizing the European Data Forum and other activities. Martin is a great ambassador and lobbyist for linked data across Europe. Great work, also in generating visibility for LOD2.

The EU in general, thanks in great part to Stefano Bertolo's long-term push in this direction, is putting increasing emphasis on measuring progress in the research it funds. This is one of the messages from the LOD2 review as well. Database is the domain of the performance race par excellence; matters on that side are well attended to by LDBC and, of course, the unimpeachably authoritative TPC, among others. In other domains, measurement is harder, as it involves a human-curated ground truth for any extraction, linking, or other integration task. There is good work in both Mannheim and Leipzig in these areas, and I may at some point take a closer look, but for now it is appropriate to stick to core database.


Posted in Uncategorized

LOD2 at Mannheim Linked Open Data Meetup

Co-located with the last LOD2 project plenary in Mannheim, Germany, February 24-25, 2014, the Mannheim Linked Open Data Meetup will take place on Sunday, February 23. The meetup is organized and hosted by the Data and Web Science research group of the University of Mannheim.

The meetup is meant to bring together researchers and practitioners interested in Linked Open Data from Southern Germany, and is half-way between a workshop and an informal gathering, with drinks and snacks provided by the organizers.

The following people will give talks at the meetup:

  1. Sören Auer (University of Bonn), Overview of the LOD2 project
  2. Katja Eck (Wolters Kluwer), LOD in the publishing industry
  3. Martin Kaltenböck (Semantic Web Company, Vienna), Enterprise Semantics – Information Management with PoolParty
  4. Mirjam Kessler, Aliaksandr Birukou (Springer), Linked Data Initiatives at Springer Verlag
  5. Peter Haase (Fluid Operations, Heidelberg), Linked Data Applications with the Information Workbench
  6. Sebastian Hellmann (University of Leipzig), Dutch DBpedia for Searching and Visualising Dutch Library Data

The meetup will take place at the University of Mannheim in the building B6 26, Room A101 on Sunday evening, February 23, starting at 18:30.

Up-to-date information is also available at

http://www.meetup.com/OpenKnowledgeFoundation/Mannheim-DE/1092882/

Looking forward to meeting you there!

The LOD2 Team

Posted in Uncategorized

Preview release of conTEXT for Linked Data-based text analytics

We are happy to announce the preview release of conTEXT — a platform for lightweight text analytics using Linked Data (soon to be part of the LOD2 Stack).
conTEXT enables social Web users to semantically analyze text corpora (such as blogs, RSS/Atom feeds, Facebook, G+, Twitter or SlideWiki.org decks) and provides novel ways of browsing and visualizing the results.

conTEXT workflow

The process of text analytics in conTEXT starts by collecting information from the web. conTEXT utilizes standard information access methods and protocols such as RSS/Atom feeds, SPARQL endpoints and REST APIs, as well as customized crawlers for WordPress and Blogger, to build a corpus of information relevant to a certain user.

The assembled text corpus is then processed by Natural Language Processing (NLP) services (currently FOX and DBpedia Spotlight), which link unstructured information sources to the Linked Open Data cloud through DBpedia. The processed corpus is then further enriched by dereferencing the DBpedia URIs as well as matching against pre-defined natural-language patterns for DBpedia predicates (BOA patterns). The processed data can also be joined with other existing corpora in a text analytics mashup. The creation of analytics mashups requires dealing with the heterogeneity of different corpora as well as the heterogeneity of the different NLP services utilized for annotation; conTEXT employs NIF (NLP Interchange Format) to deal with this heterogeneity.

The processed, enriched and possibly mixed results are presented to users through different views for exploration and visualization of the data. Additionally, conTEXT provides an annotation refinement user interface based on the RDFa Content Editor (RDFaCE) that enables users to revise the annotated results. User-refined annotations are sent back to the NLP services as feedback, for the purpose of learning in the system.
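
As a flavor of the entity-linking step (a sketch added here; the endpoint URL and response fields follow the public DBpedia Spotlight web service as commonly deployed, which may differ from the exact service conTEXT calls):

# Sketch: annotating a sentence with DBpedia Spotlight over its REST API,
# linking textual mentions to DBpedia URIs much as conTEXT's NLP step does.
import requests

resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "Berlin is the capital of Germany.", "confidence": 0.5},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

for res in resp.json().get("Resources", []):
    print(res["@surfaceForm"], "->", res["@URI"])
# e.g. Berlin -> http://dbpedia.org/resource/Berlin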

For more information on conTEXT visit:

Posted in Uncategorized

Publicdata.eu becoming more pan-European

As part of LOD2, we are developing the portal publicdata.eu, an open data portal which aggregates open data from other government portals throughout Europe. At the start of the project, most portals included in publicdata.eu were scraped from other sites or based on portals set up by members of the open data community, often volunteers. Over the course of the years, we have seen more and more portals being published by municipalities, regions and EU member states. This has substantially increased the number of datasets available on the portal.

About a year ago, only about 17,000 datasets were available on publicdata.eu. Today, more than 30,000 datasets can be found, many of which are published by national governments. For example, we have included the German national open data portal and, more recently, the Romanian open data portal.
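
Since publicdata.eu runs on CKAN (see below), the dataset count can be checked programmatically through CKAN's standard Action API; a minimal sketch, assuming the portal exposes the API at the usual path:

# Sketch: asking a CKAN portal how many datasets it holds, via the
# standard CKAN Action API. Requires the requests library.
import requests

resp = requests.get(
    "http://publicdata.eu/api/3/action/package_search",
    params={"q": "", "rows": 0},  # rows=0: return only the total count
)
resp.raise_for_status()
print(resp.json()["result"]["count"], "datasets")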

On 27 November 2013, the Open Knowledge Foundation attended a meeting of the public sector information sub-group on the pan-European open data portal. This was an excellent opportunity to show the work that has gone into building publicdata.eu, as well as to talk with representatives of more than 18 EU member states who are working on open data portals at the national level. One of the exciting developments is the re-launch of the French portal, which is built on CKAN, software developed by the Open Knowledge Foundation and the same software that runs publicdata.eu. Because CKAN is open source and AGPL-licensed, any adaptations of CKAN made by the French team will be released as open source as well.

Apart from the presentation of publicdata.eu, the meeting also highlighted some other developments from EU member states. For example, Ross Jones of the data.gov.uk team demonstrated the work his team has done on further developing their portal. He remarked that the focus will be more on data quality in the near future, a topic that is still relevant across many open data portals. Additionally, Jean-Charles Quertinmont of Belgium shared their experiences: in setting up two separate open data portals, they found that interest in using the datasets is higher among developers in civil society than among businesses. Finally, the Open Data Support project presented their work in harvesting datasets from portals and turning them into RDF. The publicdata.eu team and the Open Data Support team are looking into working together to ensure that the data harvested by the Open Data Support team is also available via publicdata.eu. Other work in the final year of publicdata.eu as part of LOD2 focuses mainly on support for multilingualism, support for the DCAT application profile, and better categorisation of the data.

Posted in Uncategorized