Monday, May 5, 2014

Big Content challenges

At Dayon, we are used to working with Big Data. Coming from a publishing background, we have provided content solutions to publishers since 1997.

I read some stories about Big Content, and was intrigued that Gartner saw Big Content as the unstructured part of Big Data. To me, Big Content is the structured version of Big Data.

Let me explain, and address some Big Content challenges and technologies.

Planned Variety

In terms of the three Big Data V’s (Volume, Velocity and Variety), publishers’ content is odd. Since the goal of publishers is to make a profit from providing content, that content must be publishable to a vast array of channels. To enable this, content must be structured (preferably in XML) and enriched with metadata. Any Variety is planned, because unplanned Variety leads to unplanned structures and/or unplanned publications.

Data is generated, whereas Content is handcrafted. Tweets and Facebook posts are only lightly structured, but blog posts are already quite structured. Some numbers by Chartbeat can be found here, and Fast Company offers a useful insight into the rise of “Big Content” as a marketing tool.

Publishers’ Content is usually completely structured: XML plus Meta Data, sometimes already stored as RDF Triples (read my earlier blog post on Semantic Technologies).

So to me, Content is structured Data. Big Content problems differ from other Big Data problems, where handling the Variety in order to understand your data is a big issue. Therefore, I would like to label the publishers’ challenge a Big Content challenge.

So how big is Big Content?

A quick scan of some of our publishing clients provided these numbers (XML only!):

  1. Publisher 1: 10 million files, 25 GB
  2. Publisher 2: 750,000 files, 15 GB
  3. Publisher 3: 150 million files, 15 GB
  4. Publisher 4: 1 million files, 15,000 new files per day (max)
  5. Publisher 5: 45 million files, 20,000 new files per day (max)
  6. Publisher 6: 500,000 files

Challenges of Big Content

With these numbers in mind, what are the challenges for Big Content?
  1. Volume - XML: Are 30 million XML files a challenge? Or is 25 GB of XML? It really should not be, but in reality I have met quite a few technologies struggling with these amounts. An XML system should be true XML to handle this amount of data. XML isn't hard. Doing XML right is hard. If you don’t do XML right, 100,000 files or 1 GB of XML can get you plenty of headaches (see the XML sketch after this list).
  2. Volume - Other file types: Alas, not all Content is XML. Many Publishers still manage huge amounts of HTML, PDF or other file formats. With PDF, huge numbers often also turn into huge volumes, because multi-channel publishing means print-quality (and hence large) PDFs are stored. If you have to index lots of other file types, run a proper intake process per file and weed out the corrupt and the largest files (see the intake sketch after this list).
  3. Volume - Subscriptions: At various clients I encountered the problem that Big Content is offered through a large number of different Subscriptions. While a large number of Subscriptions is not a problem in itself, the combination of Big Content and a Big number of Subscriptions often is. So if you offer lots of data, be smart about the number of Subscriptions.
  4. Volume - Triples: Nearly all Publishers storing Big Content are looking into Triples as a way to store and link the Meta Data from their XML files. Storing your Meta Data in a Triple Store and linking it to the Linked Open Data cloud can be a very good idea, but it calls for a Big Triple Store: a set of 1 billion Triples isn't exceptional, and it too requires Big Content Technology (see the triples sketch after this list).
  5. Velocity - Real Time Indexing: Failing at real time indexing is usually the first sign that you are becoming a Big Content publisher. Many technologies struggle with incremental updates and need complete re-indexing, which in turn leads to strange solutions such as overnight indexing, flip-flopping or indexes that are out of sync with the rest of the front-end (see the indexing sketch after this list).
  6. Velocity - Real Time Alerting: The value of Content depends on its relevance, and timeliness is a huge factor in relevance. Real Time Alerting will offer a competitive edge to content users. To provide Real Time Alerting, the XML store needs to handle alerting efficiently (using minimal resources) at load time (see the alerting sketch after this list).
  7. Variety - Presentation: A Big Content challenge can be how to present all of this Content. If a simple “What’s New” view results in 20,000 hits, what are you going to show the customer? The most used solutions are:
    1. Provide a Search Only interface
    2. Provide as much structure from Meta Data as possible to assist the user in drilling down to the most useful Content (see the facet sketch after this list)
  8. Variety - Enrichment: If the Meta Data you need to offer useful segmentation of your Big Content to your end users just isn't there, additional Enrichment is needed. Because of the costs involved, Big Content calls for automated enrichment using Natural Language Processing (see the enrichment sketch after this list).
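
To make some of this concrete, a few minimal Python sketches follow. They are purely illustrative: none of them comes from a client project, and every path, field name and endpoint in them is a made-up assumption.

For challenge 1 (Volume - XML): walk a large XML collection with a real XML parser, so malformed files are caught and quarantined instead of silently mangled.

```python
# Minimal sketch: parse every file with a true XML parser, no regex hacks.
# The collection path and the <title> element are hypothetical.
import pathlib
import xml.etree.ElementTree as ET

def scan_collection(root_dir):
    titles, broken = [], []
    for path in pathlib.Path(root_dir).rglob("*.xml"):
        try:
            tree = ET.parse(path)                     # real XML parsing
            titles.append(tree.findtext(".//title"))  # hypothetical metadata field
        except ET.ParseError:
            broken.append(path)                       # quarantine instead of crashing
    return titles, broken

print(scan_collection("/data/publisher1"))            # hypothetical collection root
```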
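
For challenge 2 (Volume - Other file types): a minimal intake filter that weeds out oversized and probably-corrupt files before they hit the index.

```python
# Minimal intake sketch: reject files over a size budget and files whose magic
# bytes contradict their extension. The 50 MB limit and the magic-byte table
# are illustrative assumptions, not recommendations.
import pathlib

MAX_BYTES = 50 * 1024 * 1024
MAGIC = {".pdf": b"%PDF", ".zip": b"PK\x03\x04"}

def accept(path: pathlib.Path) -> bool:
    if path.stat().st_size > MAX_BYTES:
        return False                          # too large: handle out of band
    expected = MAGIC.get(path.suffix.lower())
    if expected:
        with path.open("rb") as fh:
            if fh.read(len(expected)) != expected:
                return False                  # likely corrupt or mislabelled
    return True
```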
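
For challenge 4 (Volume - Triples): a minimal rdflib sketch that stores article Meta Data as Triples and links one value out to Linked Open Data (DBpedia).

```python
# Minimal rdflib sketch: article metadata as triples, with one link out to
# Linked Open Data. The article URI is a hypothetical identifier.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
article = URIRef("http://example.org/articles/123")   # hypothetical identifier
g.add((article, DCTERMS.title, Literal("Big Content challenges")))
g.add((article, DCTERMS.subject, URIRef("http://dbpedia.org/resource/Big_data")))

print(g.serialize(format="turtle"))
```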
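
For challenge 5 (Velocity - Real Time Indexing): a minimal incremental-indexing sketch with pysolr, pushing only the changed documents and using Solr's commitWithin so they become searchable within seconds rather than after an overnight rebuild.

```python
# Minimal incremental-indexing sketch: the Solr core name, document fields and
# the 5-second commit window are all assumptions.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/content", timeout=10)

changed_docs = [
    {"id": "doc-42", "title": "Updated ruling", "modified": "2014-05-05T09:00:00Z"},
]
# Push only the changed documents; commitWithin (milliseconds) lets Solr make
# them visible shortly, without a full re-index or an explicit hard commit.
solr.add(changed_docs, commit=False, commitWithin=5000)
```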
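
For challenge 6 (Velocity - Real Time Alerting): a minimal "reverse search" sketch in plain Python, matching each incoming document once against stored alert profiles at load time; a real store would do this with a proper reverse index.

```python
# Toy reverse-search sketch: alert profiles are term sets, and every incoming
# document is matched against them once at load time. The profiles and the
# matching rule are illustrative assumptions.
ALERTS = {
    "alert-1": {"tax", "ruling"},
    "alert-2": {"privacy"},
}

def match_alerts(doc_text):
    tokens = set(doc_text.lower().split())
    return [alert_id for alert_id, terms in ALERTS.items() if terms <= tokens]

print(match_alerts("New tax ruling published today"))   # -> ['alert-1']
```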
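
For challenge 7 (Variety - Presentation): a minimal faceting sketch with pysolr, asking the index for counts per Meta Data field instead of dumping 20,000 hits on the user.

```python
# Minimal faceting sketch: the "published" date field, the facet fields and
# the core name are assumptions.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/content", timeout=10)
results = solr.search("published:[NOW-1DAY TO NOW]", **{
    "rows": 0,                                   # only facet counts, no documents
    "facet": "true",
    "facet.field": ["practice_area", "jurisdiction"],
})
print(results.facets["facet_fields"])            # counts per field to drive drill-down
```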
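
For challenge 8 (Variety - Enrichment): a minimal named-entity extraction sketch, using spaCy as just one example of an NLP toolkit; a real pipeline would map the extracted entities onto the publisher's own taxonomy.

```python
# Minimal enrichment sketch with spaCy (requires:
#   python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The European Commission fined Acme Corp in Brussels on Monday.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('The European Commission', 'ORG'), ('Acme Corp', 'ORG'),
#       ('Brussels', 'GPE'), ('Monday', 'DATE')]
```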

Big Content Technology

At Dayon / HintTech we strongly believe that Big Content challenges require specialized Big Content Technology. Here are some of the Big Content Technologies we have implemented:
  1. MarkLogic
    Several of our Big Content clients have selected MarkLogic as their content platform. I believe that MarkLogic is the best XML store and indexer available at this moment.
    As a big bonus, MarkLogic comes with all kinds of useful features such as XQuery, an Application Server, and now even a Triple Store.
    Find out more about MarkLogic at MarkLogic World Amsterdam and meet us there!
  2. OWLIM
    In our project at Newz we needed a Big Content Triple Store. We found OWLIM by Ontotext to provide an excellent Big Triple Store, as did the BBC and the Press Association (see the SPARQL sketch below).
    W3C maintains a list of Big Triple Stores, with BigOWLIM as one of the top products.
    We also selected Ontotext as our partner for their Semantic Tagging capabilities.
  3. SOLR
    We have also implemented SOLR for Big Content collections. SOLR will not meet all of the Big Content challenges, but it is a great Open Source search engine.
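
To illustrate the Triple Store option mentioned above, here is a minimal SPARQL query sketch using SPARQLWrapper; the repository URL and the triples it assumes are hypothetical.

```python
# Minimal SPARQL sketch against a Sesame-style repository endpoint, such as an
# OWLIM repository might expose; URL and data are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8080/openrdf-sesame/repositories/newz")  # hypothetical
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?article ?title WHERE {
        ?article dcterms:subject <http://dbpedia.org/resource/Big_data> ;
                 dcterms:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["article"]["value"], row["title"]["value"])
```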

PS: After writing this blog, I feel like renaming Meta Data to Meta Content. Probably better if I don’t…
