Domain-Specific Multilingual Linked Data Extraction from Natural Language Documents

The ever-growing world of data is largely unstructured. It is estimated that information sources such as books, journals, documents, social media content and everyday news articles make up as much as 90% of it. Making sense of all this data and exposing the knowledge hidden in it, while minimizing human effort, is a challenging task, and one that often holds the key to insights that can prove crucial to one’s research or business. Relying on domain-specific dictionaries, Named Entity Recognition and link discovery mechanisms, our upcoming addition to the LOD2 Stack is a novel attempt at extracting the who, what, where and when from multilingual natural language documents in the form of Linked Data.

Rozeta is a multilingual NLP and Linked Data tool built around STRUTEX, a structured text knowledge representation technique that represents natural language documents in structured form and extracts words and phrases from them. Rozeta provides automatic extraction of STRUTEX dictionaries in RDF form, semantic enrichment through link discovery services, a manual revision and authoring component, a document similarity search tool and an automatic document classifier.

