Thursday, April 24, 2014

Using Semantic Technologies to crunch Big Data

The interest in Big Data (nice post by a colleague on this subject) has sparked a new interest in Semantic Technologies. It is clear that the Volume and Variety of Big Data require technologies that can structure and segment Big Data into useful and usable structures. For this, Semantic Technologies are used.

However, there are different kinds of Semantic Technologies around, so I will start off with an introduction on Semantics and the Semantic Web. Next I will cover two key Semantic Technologies to arrive at the goal of this introduction: how do Semantic Technologies help us to crunch Big Data?

Semantics and Semantic Web

In 2001 Tim Berners-Lee and others published an article in Scientific American: “The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities.”

This title provides a first definition of Semantic Data:
"Content that is meaningful to computers"
Tim Berners-Lee understood that HTML web pages were useful to humans. However, because they were (and often still are) encoded to store visual information rather than the meaning of that information, automated systems could not understand them.

To be meaningful for computers, content has to be encoded in such a way that the meaning is clear, and can be processed automatically. XML is the first step in this. The progress so far is:

HTML: <p>Tim Berners-Lee</p>
XML: <author>Tim Berners-Lee</author>

The computer can now apply a formatting style on all authors, and can index them separately, but still cannot use the meaning of the concept “Author” or distinguish this Tim Berners-Lee from any other Tim Berners-Lee (If you think this is a silly example, please visit the Michael Jackson disambiguation page on Wikipedia: http://en.wikipedia.org/wiki/Michael_Jackson_(disambiguation)).
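
To make this concrete, here is a minimal Python sketch of what the XML encoding allows: building an index of documents per author. The <library> and <article> elements and the second title are assumptions made up for this illustration; they are not taken from any real data set.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: element names and the second article are illustrative only.
DOCUMENT = """
<library>
  <article><title>The Semantic Web</title><author>Tim Berners-Lee</author></article>
  <article><title>Some other text</title><author>Tim Berners-Lee</author></article>
</library>
"""

def index_by_author(xml_text):
    """Map each author name (a plain string) to the titles carrying that name."""
    root = ET.fromstring(xml_text)
    index = {}
    for article in root.iter("article"):
        author = article.findtext("author").strip()
        title = article.findtext("title").strip()
        index.setdefault(author, []).append(title)
    return index

index = index_by_author(DOCUMENT)
print(index)
```

The index groups titles by the name string alone. Nothing in the data says whether both entries refer to the same person, which is exactly the limitation described above.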

Wikipedia defines Semantic Technology:
"Using Semantic Technologies, meaning is stored separately from data and content files, and separately from application code"
So in our example, the author role is matched to a central definition for the creation of documents, preferably using a standard such as the Dublin Core standard “DC.creator”.

XML: <author>Tim Berners-Lee</author>
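RDF expressed in XML:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description dc:title="The Semantic Web">
    <dc:creator>Tim Berners-Lee</dc:creator>
  </rdf:Description>
</rdf:RDF>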
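
As a rough sketch of why the central definition matters, the Python snippet below reads such an RDF/XML fragment and reports each property by its full Dublin Core URI instead of by a local tag name. It uses only the standard library and handles just the two forms shown here (a property as attribute and a property as child element); it is not a complete RDF/XML parser, and the function name read_properties is an assumption for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

RDF_XML = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description dc:title="The Semantic Web">
    <dc:creator>Tim Berners-Lee</dc:creator>
  </rdf:Description>
</rdf:RDF>
"""

def qname_to_uri(qualified_name):
    """Turn ElementTree's '{namespace}local' notation into 'namespacelocal'."""
    return qualified_name[1:].replace("}", "")

def read_properties(rdf_xml):
    """Collect (property URI, literal value) pairs from every rdf:Description."""
    root = ET.fromstring(rdf_xml)
    pairs = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        # Properties written as attributes, skipping RDF's own attributes.
        for name, value in description.attrib.items():
            if not name.startswith(f"{{{RDF_NS}}}"):
                pairs.append((qname_to_uri(name), value))
        # Properties written as child elements with plain text content.
        for child in description:
            pairs.append((qname_to_uri(child.tag), (child.text or "").strip()))
    return pairs

properties = read_properties(RDF_XML)
for uri, value in properties:
    print(uri, "=", value)
```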

In the next step we can replace “The Semantic Web” and “Tim Berners-Lee” with Uniform Resource Identifiers (URIs). For ease of understanding, the URI for Tim Berners-Lee could be: http://en.wikipedia.org/wiki/Tim_Berners-Lee and the article could be referenced as: http://dx.doi.org/10.1038/scientificamerican0501-34.

So from a formatted piece of text we arrive at a well-defined relation between two specific URIs. A computer can now apply logic, based on understandable definitions and relationships.



Such a relationship is called a “Triple”. It consists of three pieces of information, in this order: a Subject, a Predicate and an Object. Together they describe a piece of knowledge.
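
The example can be written down as one such triple. The Python sketch below is only an illustration: the Triple class and the to_ntriples helper are names invented here, and the output follows the N-Triples notation for a statement whose three parts are all URIs.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# The Dublin Core "creator" property, identified by its full URI.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

triple = Triple(
    subject="http://dx.doi.org/10.1038/scientificamerican0501-34",
    predicate=DC_CREATOR,
    obj="http://en.wikipedia.org/wiki/Tim_Berners-Lee",
)

def to_ntriples(t):
    """Serialise a triple whose three parts are all URIs as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."

line = to_ntriples(triple)
print(line)
```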

The de facto standard for expressing Semantic information is the W3C's Resource Description Framework (RDF).

So what do we need to make the Semantic Web work?

  1. Well-defined relations, like the Dublin Core relations, expressed for instance in RDF Schemas
  2. A way to store a multitude of triples: A Triple Store
  3. Vocabularies: the concepts and relationships between them that describe a certain domain.
  4. Semantic Enrichment to create triples from unstructured data

This article is about Semantic Technologies, so let’s look at how Triple Stores and Semantic Enrichment will help us to get to our goal: Linked Big Data.

Triple Stores

Triples are a specific way to store information. To use Triples effectively, for querying with SPARQL and for reasoning, a special database is needed to store these Graph structures. These databases are called Triple Stores.

From the given example, it is easy to understand that a vocabulary + dataset can expand into millions or billions of triples. Performance, for both ingestion and querying, is therefore an important consideration.
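
To give a feel for what querying triples means, here is a toy in-memory store in Python. It is a sketch under heavy assumptions: the class TinyTripleStore, the "ex:" identifiers and the second article are invented for illustration, and a real Triple Store adds indexing, SPARQL support and reasoning on top of this idea.

```python
class TinyTripleStore:
    """Toy in-memory store that answers simple triple patterns."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the matching triples, sorted; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self._triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )

store = TinyTripleStore()
store.add("ex:semantic-web-article", "dc:creator", "ex:Tim_Berners-Lee")
store.add("ex:semantic-web-article", "dc:title", "The Semantic Web")
store.add("ex:another-article", "dc:creator", "ex:Someone_Else")

# Roughly the pattern a SPARQL query would express as:
#   SELECT ?doc WHERE { ?doc dc:creator ex:Tim_Berners-Lee }
docs = [s for s, _, _ in store.match(predicate="dc:creator", obj="ex:Tim_Berners-Lee")]
print(docs)
```

This version scans every triple for every pattern, which is fine for three triples and hopeless for billions.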

Some of the better known Triple Stores are Sesame and Jena for smaller implementations and OWLIM, MarkLogic and Virtuoso for large implementations.

Semantic Enrichment Technologies

To use the Big, we have to understand the Data. In an ideal world, data is created according to a well organised ontology.

Alas, in most cases Big Data is created with no ontology present. To create structure from unstructured data (or structured with a different goal in mind) we need automatic recognition of meaning from our data.

This usually starts with recognising types of information using Semantic Enrichment Technologies. Semantic Enrichment Technologies are a collection of linguistic tools and techniques, such as Natural Language Processing (NLP) and artificial intelligence (AI), that analyse unstructured natural language or data and try to classify and relate it.

By identifying the grammatical parts of a sentence (subject, predicate, etc.), algorithms can recognise categories, concepts (people, places, organisations, events, etc.), and topics. Once analysed, text can be further enriched with vocabularies, dictionaries, taxonomies, and ontologies, so that concepts are matched regardless of which literal is used (for example: KLM = Koninklijke Luchtvaart Maatschappij = Royal Dutch Airlines).

This layer of linked metadata over our data creates Linked Data.
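
The vocabulary step can be sketched in a few lines of Python. This is a deliberately naive dictionary lookup, not how any particular enrichment product works; the example.org URI, the VOCABULARY table and the enrich function are assumptions for illustration.

```python
import re

# Hypothetical vocabulary: the URI and the alias table are illustrative only.
KLM_URI = "http://example.org/organisation/KLM"
VOCABULARY = {
    "KLM": KLM_URI,
    "Koninklijke Luchtvaart Maatschappij": KLM_URI,
    "Royal Dutch Airlines": KLM_URI,
}

def enrich(text, vocabulary=VOCABULARY):
    """Return the set of concept URIs whose literals occur in the text.

    Whole-word, case-sensitive matching only; real tools use NLP to deal
    with inflection, ambiguity and names missing from the vocabulary.
    """
    return {
        uri
        for literal, uri in vocabulary.items()
        if re.search(rf"\b{re.escape(literal)}\b", text)
    }

english = enrich("Royal Dutch Airlines lands at Schiphol.")
dutch = enrich("De Koninklijke Luchtvaart Maatschappij landt op Schiphol.")
short = enrich("KLM lands at Schiphol.")
print(english == dutch == short)
```

All three sentences end up tagged with the same concept, whichever literal the writer happened to use.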

The quality of enrichment will range from (nearly) 100% for literally translated content to 90% or less, depending on the amount of training that is available.

Linked Big Data

So Semantic Enrichment Technologies give us the opportunity to turn Big Data into Linked Big Data.
Tim Berners-Lee defined Linked Open Data as data that complies with the following 5 rules:

  1. Available on the web (in whatever format), but with an open licence, to be Open Data
  2. Available as machine-readable structured data (e.g. Excel instead of an image scan of a table)
  3. As (2), plus in a non-proprietary format (e.g. CSV instead of Excel)
  4. All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
  5. All the above, plus: link your data to other people’s data to provide context
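
Because each rule builds on the previous ones, the level a data set reaches is the number of rules it satisfies before the first one it fails. A small Python sketch of that reading follows; the Dataset fields and the stars function are names assumed for this illustration, not part of any standard.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    open_licence: bool         # rule 1
    machine_readable: bool     # rule 2
    open_format: bool          # rule 3
    uses_w3c_standards: bool   # rule 4
    links_to_other_data: bool  # rule 5

def stars(dataset):
    """Count the rules met in order, stopping at the first rule that fails."""
    rules = [
        dataset.open_licence,
        dataset.machine_readable,
        dataset.open_format,
        dataset.uses_w3c_standards,
        dataset.links_to_other_data,
    ]
    count = 0
    for satisfied in rules:
        if not satisfied:
            break
        count += 1
    return count

excel_sheet = Dataset(True, True, False, False, False)
csv_file = Dataset(True, True, True, False, False)
linked_rdf = Dataset(True, True, True, True, True)
print(stars(excel_sheet), stars(csv_file), stars(linked_rdf))
```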

Governments and other public organisations are putting much effort into providing Linked Open Data for citizens and organisations to use.

Commercial organisations are unlikely to publish their data openly, but they will use the same standards as Linked Open Data (such as HTTP, URIs and RDF) and will therefore have similar implementations for Linked Big Data.



Some examples of Big Linked Data and Big Open Data initiatives:

  1. Linked Open Data in the Netherlands, UK and USA 
  2. Linked Open Data sources from DBpedia, which essentially makes the content of Wikipedia available in RDF and also links to GeoNames for geographical locations and to Freebase, a community-curated database of well-known people, places, and things
  3. A browse interface for triple stores 
  4. Enriched Dutch Newspaper articles via Newz 
  5. Dutch Laws in RDF
  6. Europeana opens up European cultural history collections 

So what’s in it for me?

Does your organisation create or own lots of unstructured data? There is probably a wealth of knowledge hidden in there, which you can access as follows:

  1. Find out what structure (ontology) fits your needs
  2. Use Semantic Enrichment Technologies to create structure from your unstructured data
  3. Store your data in a Triple Store
  4. Start exploring, learn & earn

I will post more on Triple Stores and Semantic Enrichment in future blog posts.
