Unstructured? You must be joking!

(Originally Posted on InformationZen by George Parapadakis on August 11, 2008)

While I was listening to an analyst friend talk about the future of ECM, I had one of those “eureka” moments that is also an “oh ****!” moment. One of those moments that make you realise that something has been staring you in the face for a long time, and you have just happily ignored it…

Our whole industry is guilty of propagating a “universal truth” which, I’m starting to think, may have stifled a lot of creativity in the ECM space: the notion of “unstructured” content vs. “structured” data. We often use this concept to distinguish between electronic information that lives in an ECM repository and information that lives in a relational database.

I could argue (and I will!) that there is no such thing as unstructured content!

Take a Word document: it has a hierarchy of chapters, with headings that describe them; it has a table of contents with hyperlinks to the sections; it has tables and footnotes and properties and embedded objects, internal cross-references and external links. It can even contain meta-content that describes what was changed in the document, by whom and when. Is it unstructured?
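To make the point concrete, here is a minimal sketch of how a document's heading hierarchy can be read straight out of a .docx file using nothing but the Python standard library. A .docx is a zip package whose main part, word/document.xml, marks each paragraph with a style. The sketch assumes the default English style IDs ("Heading1", "Heading2", …), and `make_sample_docx` is a hypothetical helper that builds a stripped-down package purely for illustration; it is not a complete file that Word would open.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def make_sample_docx() -> bytes:
    """Build a stripped-down .docx-like package in memory (illustration only)."""
    body = (
        f'<w:document xmlns:w="{W}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        '<w:r><w:t>Proposal</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Some body text.</w:t></w:r></w:p>'
        '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>'
        '<w:r><w:t>Pricing</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr("word/document.xml", body)
    return buf.getvalue()


def outline(docx_bytes: bytes) -> list[tuple[int, str]]:
    """Return (heading level, heading text) pairs in document order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    headings = []
    for paragraph in root.iter(f"{{{W}}}p"):
        style = paragraph.find("w:pPr/w:pStyle", NS)
        if style is None:
            continue  # plain body text carries no paragraph style here
        # Assumption: headings use the default style IDs Heading1..Heading9
        match = re.fullmatch(r"Heading(\d)", style.get(f"{{{W}}}val", ""))
        if match:
            text = "".join(t.text or "" for t in paragraph.iter(f"{{{W}}}t"))
            headings.append((int(match.group(1)), text))
    return headings


# Print the outline, indented by heading level
for level, title in outline(make_sample_docx()):
    print("  " * (level - 1) + title)
```

The outline was sitting inside the file all along; nobody had to type it into a metadata form.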

Take an Excel spreadsheet: not only does it contain data like a database, it also contains text that provides context to that data. It has formulas that provide calculations and validations, embodying business rules and multiple views of the same data. Even code. Is it unstructured?

I can carry on… PowerPoint presentations have sequencing and timings that define the order of, and the relationships between, the slides. Photos (JPEG files) contain highly structured information about the camera and the circumstances in which the picture was taken. TIFF, XML, software code – highly structured content, by definition!

And that’s without going into philosophical arguments about the fact that any text written in a language has, by definition, more complex and universally understood grammatical and syntactical structures than numbers in a database can ever portray.

Now, look outside the documents themselves and there’s even more structure: relationships between documents (the RFP, the proposal document, the presentation and the spreadsheet with the pricing are implicitly connected through their structures – you know that specific slides will typically relate to specific parts of the proposal). There is also spatial and domain-specific structure, in terms of where and to what the content applies.

And what do we ECM practitioners do? We take all of this highly structured information and obscure it by shoving it in a repository (or collaborative space!) as an “unstructured” blob, hoping that a user might define three or four attributes – title, author, date, description – and hoping that a search engine may one day find it again.

And all that richness lies hidden until the next time someone opens the document and goes “oh! I see!”. What a waste!

It is true that ECM has been designed to deal with unstructured content – because we designed it that way! A typical ECM system has no notion of understanding or exploiting the structure within its content objects. Most of the time it can’t even define and exploit relationships between these objects (other than the fact that someone happened to put them both in the same folder).
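As a rough sketch of what explicit relationships between content objects could look like, consider the toy in-memory model below. It does not describe any real ECM product: the `Repository` class, the relation names and the file names are all hypothetical, chosen to mirror the RFP example above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """A typed, directed relationship between two documents."""
    source: str
    relation: str
    target: str


@dataclass
class Repository:
    """Toy repository that stores relationships, not just folders."""
    links: set = field(default_factory=set)

    def relate(self, source: str, relation: str, target: str) -> None:
        self.links.add(Link(source, relation, target))

    def neighbours(self, doc: str) -> set:
        """Documents directly linked to doc, in either direction."""
        found = set()
        for link in self.links:
            if link.source == doc:
                found.add(link.target)
            elif link.target == doc:
                found.add(link.source)
        return found

    def bundle(self, doc: str) -> set:
        """Every document reachable from doc by following links."""
        seen, todo = {doc}, [doc]
        while todo:
            for other in self.neighbours(todo.pop()):
                if other not in seen:
                    seen.add(other)
                    todo.append(other)
        return seen


# Hypothetical bid documents and relation names
repo = Repository()
repo.relate("proposal.docx", "responds_to", "rfp.pdf")
repo.relate("deck.pptx", "summarises", "proposal.docx")
repo.relate("pricing.xlsx", "prices", "proposal.docx")

print(sorted(repo.bundle("rfp.pdf")))
```

Starting from the RFP alone, the whole bid can be pulled together without relying on anyone having filed the four documents in the same folder.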

I would like to stop calling content “unstructured information”. I think the sooner we educate people on how structured content is and the value that structure adds, the more likely it is that we will increase the value we can extract from it!

  1. December 2, 2011 at 12:39 pm

    Hi George,

    I discovered this article today and I must say that I agree with your point of view. I had never actually looked at a document like this. It is structured when isolated.
    After reflecting on this point of view, I make a distinction between an unstructured document and unstructured content. The document itself can be very well structured, but when you put all these documents in a repository, they are just a bunch of unstructured documents that have no (visual) relationship between them.

    PS. thanks for giving me the Eureka moment that a document is actually structured 🙂

    Patrick
