Domain-Specific Multilingual Linked Data Extraction from Natural Language Documents

The ever-growing world of data is largely unstructured. It is estimated that information sources such as books, journals, documents, social media content and everyday news articles constitute as much as 90% of it.  Making sense of all this data and exposing the knowledge hidden beneath, while minimizing human effort, is a challenging task that often holds the key to new insights that can prove crucial to one’s research or business. Relying on domain-specific dictionaries, Named Entity Recognition, and link discovery mechanisms, our upcoming addition to the LOD2 Stack is a novel attempt at extracting the who, what, where and when from multilingual natural language documents, in the form of Linked Data.

Free Webinar: LOD2 stack – release 3


This webinar, part of the LOD2 webinar series, will present release 3.0 of the LOD2 Stack, which contains updates to

  • Virtuoso 7 [OpenLink]: the original row store of the Virtuoso 6 universal server has now been replaced by a column store, which significantly increases the performance of SPARQL queries; the store is now up to three times as fast as the previous major version.
  • Linked Open Data Manager Suite [SWC]: the ‘lodms’ application allows the user to quickly set up pipelines for transforming linked data through the use of its many extensions. It also supports extracting RDF from other types of data.
  • dbpedia-spotlight-ui [ULEI]: a graphical user interface component that allows the user to use a remote DBpedia spotlight instance to annotate a text with DBpedia concepts.
  • sparqlify [ULEI]: a scalable SPARQL-SQL rewriter, allowing you to query an SQL database as if it were a triple store.
  • SIREn [DERI]: a Lucene plugin that allows you to efficiently index and query RDF, as well as any textual document with an arbitrary amount of metadata fields.
  • CubeViz [ULEI]: CubeViz allows visualization of the Data Cube linked data representation of statistical data. It has support for the more advanced DataCube features, such as slices. It also allows the selection of a remote SPARQL endpoint and export of a modified cube.
  • R2R [UMA]: the R2R mapping API is now included directly in the LOD2 demonstrator application, allowing users to experience the full effect of the R2R semantic mapping language through a graphical user interface.
  • ontowiki-csvimport [ULEI]: an OntoWiki extension that transforms CSV files to RDF. The extension can create Data Cubes that can be visualized by CubeViz.
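
To give a rough idea of what a CSV-to-Data-Cube transformation like the one in the last item involves, here is a minimal sketch in Python. It is not the ontowiki-csvimport implementation: the `http://example.org/cube/` namespace, the function name and the sample data are assumptions made for illustration, and only the `qb:Observation` and `qb:dataSet` terms come from the RDF Data Cube vocabulary. A complete cube would also need a data structure definition declaring its dimensions and measures, which is omitted here.

```python
import csv
import io
import urllib.parse

QB = "http://purl.org/linked-data/cube#"  # RDF Data Cube vocabulary
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
BASE = "http://example.org/cube/"  # hypothetical namespace, for illustration only


def literal(value, datatype=None):
    """Format a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{datatype}>' if datatype else f'"{escaped}"'


def csv_to_observations(csv_text, dataset="dataset1", measure_column="value"):
    """Return N-Triples lines describing one qb:Observation per CSV row."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        obs = f"<{BASE}obs/{index}>"
        triples.append(f"{obs} <{RDF_TYPE}> <{QB}Observation> .")
        triples.append(f"{obs} <{QB}dataSet> <{BASE}{dataset}> .")
        for column, cell in row.items():
            # Each column becomes a (hypothetical) property in the example namespace.
            prop = f"<{BASE}prop/{urllib.parse.quote(column)}>"
            datatype = XSD_DECIMAL if column == measure_column else None
            triples.append(f"{obs} {prop} {literal(cell, datatype)} .")
    return triples


SAMPLE = "country,year,value\nAT,2012,8.4\nCZ,2012,10.5\n"  # made-up sample data
triples = csv_to_observations(SAMPLE)
for line in triples:
    print(line)
```

Each row yields two triples that type the observation and attach it to its data set, plus one triple per column; the column named by `measure_column` is typed as a decimal so that a visualization tool could treat it as the measured value.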

If you are interested in Linked (Open) Data principles and mechanisms, LOD tools & services and concrete use cases that can be realised using LOD then join us in the free LOD2 webinar series!


Title: LOD2 Webinar Series: LOD2 stack release 3.0
Date: Wednesday, October 23, 2013, 4:00 PM – 5:00 PM CEST
Presenter: Bert van Nuffelen (Tenforce, Belgium)
Information & free Registration: https://www4.gotomeeting.com/register/932378567

The LOD2 webinar series is powered by LOD2 – Creating Knowledge out of Interlinked Data (http://lod2.eu), organised & produced by Semantic Web Company (http://www.semantic-web.at), Austria. This series will provide a monthly webinar about Linked (Open) Data tools and services around the LOD2 project, the LOD2 Stack and the Linked Open Data Life Cycle, also in the form of 3rd party tools. Please find continuously updated information here: http://lod2.eu/BlogPost/webinar-series


LOD2 project meets in Prague to prepare for its last year


On October 7 and 8 members of the LOD2 project gathered for yet another plenary meeting with a goal to plan and discuss what shall be done in the upcoming half-year. This time, the meeting was hosted by the University of Economics in Prague, Czech Republic.

As the project entered its last year the main focus shifted to use case work, which should demonstrate how the software developed by the project’s partners can be applied in real-world settings. In particular, the use cases should show that the LOD2 Stack is built to outlive the LOD2 project.
The practical demonstrations of the LOD2 tools include four domains: media and publishing, data in enterprises, government data and data in public procurement. The work done in these areas should serve as a source of feedback to LOD2 Stack’s component owners in order to improve their quality and robustness. Besides LOD2-driven work, there are already a number of deployments of the LOD2 Stack. The widespread uptake of the LOD2 Stack by third parties provides hope that the results of the project will continue to be useful even after its end.

Linked data excited interest in the Data Mining community

The first Workshop on Data Mining from Linked Data, DMoLD’13, was held on September 23, 2013, in Prague, co-located with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), a prime scientific event in the data mining field. The workshop was co-organized by LOD2 researchers from UEP, Prague and I2G, Poznań (plus Claudia d’Amato from the University of Bari).

Despite the modest format of a half-day (afternoon) workshop with only six technical presentations, the interest of ECML/PKDD participants surpassed all expectations. The number of pre-registered participants was the highest of all twelve workshops, and during the sessions the large room was constantly half-full with over 40 listeners and discussants.

A highlight of the whole afternoon was the invited talk on Exploiting Linked Open Data as Background Knowledge in Data Mining, by Heiko Paulheim from the University of Mannheim. He presented the recently released RapidMiner LOD Extension and showed how such technology makes it possible to exploit rich background knowledge (for both horizontal and vertical enrichment of the original data) while outsourcing its maintenance to the LOD infrastructure, for example, the regular DBpedia updates via extraction from Wikipedia. The audience was interested, among other things, in applying the presented approach in bioinformatics and in comparing or combining it with more traditional relational data mining such as Inductive Logic Programming.

Nearly half of the workshop was devoted to the outcomes of the Linked Data Mining Challenge. Its participants faced data mining tasks on linked data describing public contracts (the LOD2 project has a whole work package devoted to managing this kind of data), such as predicting the number of bidders.

A clear lesson from the workshop was that the data mining community is eager to apply its tools to novel, richly structured types of data such as linked data, although the syntactical peculiarities of RDF are not yet sufficiently addressed by the pre-processing components of those tools, which currently limits the community’s active engagement. Effort from the semantic web side, like the RapidMiner extension project mentioned above, could help remove this obstacle and bring the two communities even closer together.

DBpedia Knowledge Base Version 3.9 released

The LOD2 project is happy to announce the release of the DBpedia Knowledge Base Version 3.9.

Knowledge bases are playing an increasingly important role in enhancing the intelligence of Web and enterprise search and in supporting information integration as well as natural language processing. Today, most knowledge bases cover only specific domains, are created by relatively small groups of knowledge engineers, and are very cost intensive to keep up-to-date as domains change. At the same time, Wikipedia has grown into one of the central knowledge sources of mankind, maintained by thousands of contributors.

The DBpedia project leverages this gigantic source of knowledge by extracting structured information from Wikipedia and by making this information accessible on the Web as a large, multilingual, cross-domain knowledge base.

The English version of the DBpedia knowledge base version 3.9  describes 4.0 million things, out of which 3.22 million are classified in a consistent Ontology, including 832,000 persons, 639,000 places (including 427,000 populated places), 372,000 creative works (including 116,000 music albums, 78,000 films and 18,500 video games), 209,000 organizations (including 49,000 companies and 45,000 educational institutions), 226,000 species and 5,600 diseases.
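
Class counts of this kind can in principle be reproduced with a SPARQL aggregate query. The sketch below, in Python, only builds such a request for the public endpoint at `http://dbpedia.org/sparql`; the function names are made up for illustration, and the network call itself is defined but not executed. Note that the live endpoint serves whichever release is currently loaded, so its numbers should not be expected to match the 3.9 figures above.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # public DBpedia SPARQL endpoint


def count_query(class_name):
    """Build a SPARQL query counting the instances of a DBpedia ontology class."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        f"SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE {{ ?s a dbo:{class_name} }}"
    )


def request_for(class_name, endpoint=ENDPOINT):
    """Prepare an HTTP GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": count_query(class_name)})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def count_instances(class_name):
    """Run the query against the endpoint (needs network access; not called here)."""
    with urllib.request.urlopen(request_for(class_name), timeout=30) as response:
        data = json.load(response)
    return int(data["results"]["bindings"][0]["n"]["value"])


# Show the query text without contacting the endpoint.
print(count_query("Person"))
```

Calling `count_instances("Person")` would send the query and return the count as an integer, subject to the endpoint’s availability and time limits.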

We provide localized versions of DBpedia in 119 languages. All these versions together describe 24.9 million things, out of which 16.8 million overlap (are interlinked) with the concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 12.6 million unique things in 119 different languages; 24.6 million links to images and 27.6 million links to external web pages; 45.0 million external links into other RDF datasets, 67.0 million links to Wikipedia categories, and 41.2 million YAGO categories.

Altogether the DBpedia 3.9 release consists of 2.46 billion pieces of information (RDF triples) out of which 470 million were extracted from the English edition of Wikipedia, 1.98 billion were extracted from other language editions, and about 45 million are links to external data sets.

Detailed statistics about the DBpedia data sets in 24 popular languages are provided at Dataset Statistics.

The most important improvements of the new DBpedia release compared to DBpedia 3.8 are:

1. the new release is based on updated Wikipedia dumps dating from March / April 2013 (the 3.8 release was based on dumps from June 2012), leading to an overall increase in the number of concepts in the English edition from 3.7 to 4.0 million things.

2. the DBpedia ontology is enlarged and the number of infobox to ontology mappings has risen, leading to richer and cleaner concept descriptions.

3. we extended the DBpedia type system to also cover Wikipedia articles that do not contain an infobox.

4. we provide links pointing from DBpedia concepts to Wikidata concepts and updated the links pointing at YAGO concepts and classes, making it easier to integrate knowledge from these sources.

More information about DBpedia is found at http://dbpedia.org/About as well as in the new overview article about the project.

Lots of thanks to

  • Jona Christopher Sahnwaldt (Freelancer funded by the University of Mannheim, Germany) for improving the DBpedia extraction framework, for extracting the DBpedia 3.9 data sets for all 119 languages, and for generating the updated RDF links to external data sets.
  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • Heiko Paulheim (University of Mannheim, Germany) for inventing and implementing the algorithm to generate additional type statements for formerly untyped resources.
  • The whole Internationalization Committee for pushing the DBpedia internationalization forward.
  • Dimitris Kontokostas (University of Leipzig) for improving the DBpedia extraction framework and loading the new release onto the DBpedia download server in Leipzig.
  • Volha Bryl (University of Mannheim, Germany) for generating the statistics about the new release.
  • Petar Ristoski (University of Mannheim, Germany) for generating the updated links pointing at the GADM database of Global Administrative Areas.
  • Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink Software) for loading the new data set into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.
  • OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.
  • Julien Cojan, Andrea Di Menna, Ahmed Ktob, Julien Plu, Jim Regan and others who contributed improvements to the DBpedia extraction framework via the source code repository on GitHub.


How Linked Open Data (LOD) supports Sustainable Development & Climate Change Development (Workshop at OKCon)

A workshop organised by REEEP, GBPN and the Semantic Web Company on “How Linked Open Data (LOD) supports Sustainable Development & Climate Change Development” will take place on 18 September 2013, 11:45 to 13:15, in Geneva, Switzerland, as part of the “Open Development and Sustainability” programme of the Open Knowledge Conference 2013 (OKCon2013). This workshop is also supported by the LOD2 project.




Panel members & workshop organisers

Florian Bauer, Operations & IT Director, Renewable Energy and Energy Efficiency Partnership
Jens Laustsen, Technical Director, Global Buildings Performance Network
Martin Kaltenböck, Managing Partner & CFO, Semantic Web Company

This session gives an introduction to and overview of how Linked Open Data (LOD) can push and support sustainable development and climate change development. Using real-world examples from the Renewable Energy and Energy Efficiency Partnership (REEEP, with data.reegle.info or api.reegle.info) and the Global Buildings Performance Network (GBPN, with www.gbpn.org), it will showcase how these organisations (both NGOs) already use Linked Open Data (LOD) principles and technologies to reach their objectives in sustainable development, as well as how this approach supports their vision of knowledge sharing and transparency. In addition, the Semantic Web Company (SWC), which works on many LOD projects in the field of sustainable development, will provide a short introduction to Linked Open Data (LOD) principles, benefits and best practices.

This session should help interested organisations and individuals to learn more about the benefits and potentials of Linked Open Data (LOD) for sustainable development and climate change development and invites people to discuss use cases & ideas, principles and first steps, bottlenecks and/or pitfalls in the use of Linked Open Data (LOD) to support the respective goals together with the panelists.

This workshop is NOT a purely technical workshop. Target groups are: decision makers, project developers and project managers, people interested in or working in the fields of sustainable development and/or climate change development, and all types of data workers.

See more details at: http://okcon.org/open-development-and-sustainability/session-6/

LOD2 at the Prague Open Data Meetup #7: Linked Open Cities

The LOD2 project will be represented at the next in the series of Prague open data meetups. This time, the meetup, prepared together with the Otakar Motejl Fund, will focus on opening and linking city data. Members of the LOD2 project will share their expertise and experience from collaborations with various municipalities in past PUBLINK projects. Local perspectives on the topic will be provided by Czech innovators promoting transparent city councils.

The event will be held in Prague on Sunday October 6, 2013, starting at 7 PM. The event’s venue will be provided by Node5 (map).

LOD2 presentation at the Publications Office of the European Union

In June 2013, I visited the Publications Office of the European Union and gave a presentation there about the results we have achieved in the area of public procurement in the LOD2 project. The visit was organized by LOD2 project officer Stefano Bertolo and Luca Martinelli, Assistant to the Director General of the Publications Office of the European Union. The Publications Office maintains the Tenders Electronic Daily (TED) portal, which contains public procurement notices from all EU member countries. It currently publishes its data in HTML format. Data in XML format is also available, but only under a paid licence. However, the Publications Office plans to change its publication policy so that the data from TED is published as open data (machine-readable formats, easily accessible, etc.). RDF is one of the formats under consideration, and it would help the Publications Office to expose the TED data as linked open data. My presentation was about what linked open data is and what benefits it could bring to EU citizens, to public authorities in the EU and to the EU as a whole.

ESWC 2013 Panel – Semantic Technologies for Big Data Analytics: Opportunities and Challenges

I was invited to the ESWC 2013 “Semantic Technologies for Big Data Analytics: Opportunities and Challenges” panel on 29th May 2013 in Montpellier, France. The panel was moderated by Marko Grobelnik (JSI), with panelists Enrico Motta (KMi), Manfred Hauswirth (NUIG), David Karger (MIT), John Davies (British Telecom), José Manuel Gómez Pérez (ISOCO) and Orri Erling (myself).

Marko opened the panel by looking at the Google Trends search statistics for big data, semantics, business intelligence, data mining, and other such terms. Big data keeps climbing its hype-cycle hill, now above semantics and most of the other terms. But what do these in fact mean? In the leading books about big data, the word semantics does not occur.

I will first recap my 5 minute intro, and then summarize some questions and answers. This is from memory and is in no sense a full transcript.

Open Data session at St. Petersburg International Economic Forum

Last week I attended the St. Petersburg International Economic Forum. I was invited to represent LOD2 and speak in a session about open government data titled “Open Data: Transparency with a purpose”. Other participants included two ministers of the Russian Federation (Mikhail Abyzov, Minister for Open Government, and Nikolai Nikiforov, Minister for Telecom and Mass Communications), NASA’s Chief Knowledge Architect Jeanne Holm, and UK Transparency Board member Andrew Stott as moderator. Each of us gave a short presentation before Andrew put a few questions to some front-row participants from NGOs. You can watch the full video of the session (in English and Russian) here. It’s great to see that Open Government (Data) is becoming an important topic in Russia. Mikhail Abyzov presented a number of legislative and operational projects that have already been carried out. Russia seems to have started late (Open Data became a more official strategy only last year), but things are moving very quickly: a large amount of data has already been published, and it seems to me that some government officials may perceive Open Data as a way to improve the efficiency and effectiveness of public administrations by bypassing cumbersome bureaucratic hierarchies. Ivan Begtin, one of Russia’s most prominent Open Data activists, told me after the session that most public tenders in Russia are by now published as Open Data, and that the situation regarding corruption in public procurement seems to have improved a lot in recent years thanks to this measure. Ivan also mentioned that certain people still try to thwart this increased transparency using homograph attacks (i.e. using different Unicode characters with the same glyphs, or injecting special invisible Unicode characters). Still, thanks to Open Data it is now much easier to identify such attempts.
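
As an illustration of how such homograph tricks can be spotted once the data is openly available, here is a small heuristic sketch in Python. It is an assumption for illustration only, not the method used by Ivan or anyone else mentioned here: it flags words that mix letters from more than one script (guessed from the first word of each letter’s Unicode name) and text containing invisible format characters (Unicode category `Cf`, which includes the zero-width space).

```python
import unicodedata


def scripts_used(text):
    """Rough script detection: the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. "LATIN" or "CYRILLIC"
    return scripts


def invisible_characters(text):
    """Characters in the Unicode 'Cf' (format) category, e.g. zero-width spaces."""
    return [ch for ch in text if unicodedata.category(ch) == "Cf"]


def looks_suspicious(text):
    """Flag text with a mixed-script word or any invisible format character."""
    if invisible_characters(text):
        return True
    return any(len(scripts_used(word)) > 1 for word in text.split())


# "t\u0435nder" contains CYRILLIC SMALL LETTER IE in place of the Latin "e".
for sample in ["tender", "t\u0435nder", "ten\u200bder"]:
    print(ascii(sample), looks_suspicious(sample))
```

Checking scripts per word rather than per document keeps legitimate text that uses, say, Cyrillic words next to Latin-script names from being flagged; a production check would need a proper confusables table rather than this name-based guess.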

Jeanne Holm talked about experiences with Open Data and data.gov in the US and about the relationship between Open Data and the recent big data buzz. Similarly, Andrew reported on developments in the UK. The talks confirmed that although the importance of Open Data is by now widely understood, the individual pain points and success factors differ considerably between countries and regions. In Russia, transparent and due process in public administration is probably one of the most pressing issues, while it is crime prevention or public transportation efficiency in the US and UK, or public planning in Germany. In my talk, I tried to make the case that real value can only be generated from Open Data if we are able to link and integrate different sources (Linked Open Data). Our session at SPIEF was not the best attended one; it looked as though the majority of the attendees were organizers, contributors or accompanying people. This might be related to the hefty entrance barrier (if you were not invited by the organizers, you had to purchase a 5,000 euro conference pass). Also, I think that to make Open Data a real success you need to develop the whole ecosystem of administrations, companies, NGOs and community initiatives engaging with the topic, and at least with regard to supporting the last two stakeholder groups Russia still seems to be quite far behind.

It was a very interesting experience to attend such an international economic forum. The St. Petersburg one is probably not as exclusive as the one in Davos, but a large number of world-class politicians (Vladimir Putin of course, Angela Merkel, China’s vice premier) and business/finance leaders (Deutsche Bank boss Jürgen Fitschen, IMF’s Christine Lagarde) still attended. The whole of St. Petersburg seemed to be in SPIEF fever, with thousands of policemen dressed up in their Sunday uniforms and a bumped-up Mercedes S-Class density, and quite a few locals complained about the nuisance of blocked roads, booked-out theater performances, etc. Due to the skyrocketing hotel rates I preferred to stay with Avicom/HSE/W3C Russia’s Victor Klintsov in a student dorm of St. Petersburg’s University of Information Technology, Mechanics and Optics (IFMO) – thanks Dmitry for organizing this. BTW: If you want to visit St. Petersburg for some Open Data Web related event, I can recommend the 4th Conference on Knowledge Engineering and Semantic Web 2013, taking place in October at IFMO.