Unstructured? You must be joking!
While I was listening to an analyst friend talk about the future of ECM, I had one of those “eureka” moments that is also an “oh ****!” moment at the same time. One of those moments that make you realise that something has been staring you in the face for a long time, and you just happily ignored it…
Our whole industry is guilty of propagating a “universal truth” which, I’m starting to think, may have stifled a lot of creativity in the ECM space: the notion of “unstructured” content vs. “structured” data. We often use this concept to distinguish between electronic information that lives in an ECM repository and information that lives in a relational database.
I could argue (and I will!) that there is no such thing as unstructured content!
Take a Word document: it has a hierarchy of chapters with headings that describe them; a table of contents with hyperlinks to the sections; tables, footnotes, properties and embedded objects; internal cross-references and external links. It can even contain meta-content that describes what was changed in the document, by whom and when. Is it unstructured?
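To make that concrete, here is a rough sketch (in Python, standard library only) of how a document's heading hierarchy could be read straight out of its markup. It assumes the modern .docx format, where the body is WordprocessingML XML, and it assumes headings use the default English style IDs "Heading1", "Heading2" and so on; localised or custom templates may name them differently. The function name `heading_outline` and the `SAMPLE` fragment are made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used in a .docx document body.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def heading_outline(document_xml: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for paragraphs styled as headings.

    Assumption: headings carry style IDs of the form "Heading<N>".
    """
    root = ET.fromstring(document_xml)
    outline = []
    for para in root.iter(f"{{{W_NS}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        if style is None:
            continue  # plain body paragraph, no explicit style
        style_id = style.get(f"{{{W_NS}}}val", "")
        suffix = style_id[len("Heading"):]
        if not style_id.startswith("Heading") or not suffix.isdigit():
            continue
        # A paragraph's text may be split over several runs; join them.
        title = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        outline.append((int(suffix), title))
    return outline


# Hypothetical, heavily trimmed document body used only for this sketch.
SAMPLE = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Proposal</w:t></w:r></w:p>
    <w:p><w:r><w:t>Some body text.</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Pricing</w:t></w:r></w:p>
  </w:body>
</w:document>"""

if __name__ == "__main__":
    for level, title in heading_outline(SAMPLE):
        print("  " * (level - 1) + title)
```

With a real file, the same XML could be pulled out with `zipfile.ZipFile(path).read("word/document.xml").decode("utf-8")`, since a .docx is a zip archive. The point is not this particular script, just that the outline is already sitting in the file waiting to be used.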
Take an Excel spreadsheet: not only does it contain data like a database, it also contains text that provides context to that data. It has formulas that provide calculations and validations, embodying business rules and multiple views of the same data. Even code. Is it unstructured?
I can carry on… PowerPoint presentations have sequencing and timings that define the order of, and relationships between, the slides. JPEG photos contain highly structured information about the camera and the circumstances in which the picture was taken. TIFF, XML, software code: highly structured content, by definition!
And that’s without going into philosophical arguments about the fact that any text written in a language has, by definition, more complex and universally understood grammatical and syntactical structures than numbers in a database can ever portray.
Now, look outside the documents themselves and there’s even more structure: relationships between documents (the RFP, the proposal document, the presentation and the spreadsheet with the pricing are implicitly connected through their structures: you know that specific slides will typically relate to specific parts of the proposal). There is also spatial and domain-specific structure that determines where the content applies, and so on.
And what do we ECM practitioners do? We take all of this highly structured information and obscure it by shoving it into a repository (or collaborative space!) as an “unstructured” blob, hoping that a user might define three or four attributes (title, author, date, description) and hoping that a search engine may one day find it again.
And all that richness lies hidden until the next time someone opens the document and goes “oh! I see!”. What a waste!
It is true that ECM has been designed to deal with unstructured content – because we designed it that way! A typical ECM system has no notion of understanding or exploiting the structure within its content objects. Most of the time it can’t even define and exploit relationships between these objects (other than the fact that someone happened to put them both in the same folder).
I would like to stop calling content “unstructured information”. I think the sooner we educate people on how structured content is and the value that structure adds, the more likely it is that we will increase the value we can extract from it!