Archive for the ‘taxonomy’ Category

Do you organise your fridge like your information?

September 24, 2013

It’s not often that I describe a refrigerator as a taxonomy, so bear with me here… So, you loaded up the car with your grocery shopping, you brought it all into the kitchen from the car, and you are about to load up the fridge. Do you organise your fridge layout based on the “Use By” date of the products? No, nobody does. You put the vegetables in the vegetable drawer, you put the raw meats on a shelf of their own, and the yoghurts and the dessert puddings on a separate shelf. The eggs go in the door. You may consider the use-by date as you stack things of the same category – e.g. the fresh chicken will have to be eaten before the sausages, which will still last until next week – but that’s incidental, not the primary organisational structure. Your fridge has a taxonomy, a classification scheme, and it is organised functionally, by product class, not by date.

Where am I going with this? Records and retention management (where else?). It’s over four years ago that I wrote an article called “Is it a record? Who cares!”, which created quite a bit of animosity in the RM community, and I quickly had to follow it up with a Part 2 to explain that my original title was quite literal, not sarcastic.

Four years later, I find myself still having very similar conversations with clients and colleagues. The more we move into an era of Information Governance, the more the distinction between records and non-records becomes irrelevant. And the more we move from the world of paper documents to the multi-faceted world of electronic content, the more we need to move away from the “traditional” records management organisational models of retention-based fileplans: The physical management of paper records necessitated their organisation in clusters of documents with similar retention requirements in order to dispose of them, so classification taxonomies (fileplans) were organised around that requirement.

In the digital world, this is no longer a requirement. Retention period is just another logical attribute (metadata) applied to each individual piece of content, not an organisational structure. With the right tools in place, a retention model can be associated with each piece of content individually, and collections of content with the same retention – and, more importantly, disposition – periods can be assembled dynamically as and when required.
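To make the idea concrete, here is a minimal sketch (in Python, with entirely hypothetical names and retention periods – no particular RM product is implied) of retention as a per-item metadata attribute, with the disposition collection assembled dynamically rather than read off a fileplan:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical, minimal model: retention lives on each item as metadata,
# not in the folder structure the item happens to sit in.
@dataclass
class ContentItem:
    title: str
    content_class: str        # "What is it?"
    created: date
    retention_years: int      # retention period as a plain attribute

    @property
    def disposition_date(self) -> date:
        return self.created.replace(year=self.created.year + self.retention_years)

def due_for_disposition(items, as_of: date):
    """Assemble a disposition collection dynamically, on demand."""
    return [i for i in items if i.disposition_date <= as_of]

repo = [
    ContentItem("account application", "application-form", date(2006, 5, 1), 7),
    ContentItem("price quotation", "quotation", date(2012, 3, 15), 3),
]
print([i.title for i in due_for_disposition(repo, date(2013, 9, 24))])
# → ['account application']
```

The point of the sketch is that the two items can sit anywhere – different folders, different systems – and the disposition list still falls out of a simple query over their metadata.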

For me, there are only two logical questions that drive the classification of digital content: “What is it?” (the type of content, or class) and “What is it for?” (the context under which it has been, or will be, used). To use an example: an application form for opening a new account is a certain type of content, which will determine its initial retention period while it’s being processed. Whether that application is approved or rejected is context that will further affect its retention period. If the client raises a dispute about his new account, that may further extend the retention period of that application form. This context-driven variance cannot be supported in a traditional fileplan-based records management system, which permanently fixes the record – fileplan – retention relationship.
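As a sketch of that context-driven variance (the rules and retention periods below are hypothetical, purely for illustration), the retention period can be computed from the content class plus the events that have accumulated around the item, and re-evaluated whenever a new event arrives, rather than fixed once at filing time:

```python
# Hypothetical sketch: retention derived from class *and* accumulated
# context, re-evaluated as events arrive.
BASE_RETENTION_YEARS = {"application-form": 1}   # while being processed

def retention_years(content_class: str, events: list[str]) -> int:
    years = BASE_RETENTION_YEARS.get(content_class, 1)
    if "approved" in events:
        years = max(years, 7)    # approved account: keep with the account record
    if "rejected" in events:
        years = max(years, 2)
    if "dispute-raised" in events:
        years = max(years, 10)   # a dispute extends the retention again
    return years

print(retention_years("application-form", []))                              # → 1
print(retention_years("application-form", ["approved"]))                    # → 7
print(retention_years("application-form", ["approved", "dispute-raised"]))  # → 10
```

A fileplan-based system has nowhere to put this rule: the retention was fixed the moment the form was filed, so the dispute cannot change it without refiling the record.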

The classification (organisation, taxonomy, use any term you like…) of that content is not even relevant to this fileplan/retention discussion. The application form in the previous example will need to be associated with the customer, the account type, and the approval process or the dispute process. That is the context under which the organisation will need to organise and find that particular application form. You will not look for it by its retention period, unless you are specifically looking to dispose of it.

To go back to my original fridge metaphor: you will not start cooking dinner by picking up the item in the fridge that will expire first – that’s probably the pudding. You will look on the relevant shelf for the food you are trying to cook: meat or vegetables or eggs. Only after that may you double-check the date, to see if it is still valid or expired.

So… I remain convinced that:
(a) there is no point in distinguishing between records and non-records any more, non-records are just records with zero shelf-life
(b) the concept of a “fileplan” as a classification structure is outdated and unnecessary for digital records, and
(c) it’s time we start managing content “in context”, based on its usage history and not as an isolated self-defining entity.

As always, I’m keen to hear your thoughts on this.

P.S. I read some blogs to learn, some for their amusing content, and some because (even if their content sometimes irritates me) they force me to re-think. I read Chris Walker’s blog because it generally makes me nod my head in violent agreement 🙂 . He often expresses very similar views to mine, and I find his approach to Information Governance (which he is now consolidating into a book) extremely down to earth. The reason for this shameless plug for his blog is that, as I was writing the thoughts expressed above, I caught up with his article from last week, Big Buckets of Stuff, which covers very similar ground… Well worth a read.

A clouded view of Records and Auto-Classification

When you see Lawrence Hart (@piewords), Christian Walker (@chris_p_walker) and Cheryl McKinnon (@CherylMcKinnon) involved in a debate on Records Management, you know it’s time to pay attention! 🙂

This morning, I was reading Lawrence’s blog titled “Does Records Management Give Content Management a Bad Name?”, which picks up on one of the points in Cheryl’s article “It’s a Digital-First World: Five Trends Reshaping Records Management As You Know It”, with some very insightful comments added by Christian. I started leaving a comment under Lawrence’s blog (which I will still do, pointing back to this) but there were too many points I wanted to add to the debate and it was becoming too long…

So, here is my take:

First of all, I want to move away from the myth that RM is a single requirement. Organisations look to RM tools as the digital equivalent of a Swiss Army knife, to address multiple requirements:

  • Classification – Often, the RM repository is the only definitive Information Management taxonomy managed by the organisation. Ironically, it mostly reflects the taxonomy needed by retention management, not by the operational side of the business. Trying to design a taxonomy that serves both masters leads to the huge granularity issues that Lawrence refers to.
  • Declaration – A conscious decision to determine what is a business record and what is not. This is where both the workflow integration and the auto-classification have a role to play, and where in an ideal world we should try to remove the onus of that decision from the hands of the end-user. More on that point later…
  • Retention management – This is the information governance side of the house. The need to preserve the records for the duration that they must legally be retained, move them to the most cost-effective storage medium based on their business value, and actively dispose of them when there is no regulatory or legal reason to retain them any longer.
  • Security & auditability – RM systems are expected to be a “safe pair of hands”. In the old world of paper records management, once you entrusted your important and valuable documents to the records department, you knew that they were safe. They would be preserved and looked after until you asked for them. Digital RM is no different: it needs to provide a safe haven for important information, guaranteeing its integrity, security, authenticity and availability, supported by a full audit trail that can withstand legal scrutiny.

Auto-categorisation, or auto-classification, relates to both the first and the second of these requirements: Classification (using linguistic, lexical and semantic analysis to identify what type of document it is, and where it should fit into the taxonomy) and Declaration (deciding if this is a business document worthy of declaration as a record). Auto-classification is not new; it’s been available both as a standalone product and integrated within email and records capture systems for several years. But its adoption has been slow, not for technological reasons, but because culturally both compliance and legal departments are reluctant to accept that a machine can be good enough to be allowed to make this type of decision. And even though numerous studies have shown that machine-based classification can be far more accurate and consistent than a room full of paralegals reading each document, it will take a while before the cultural barriers are lifted. Ironically, much of the recent resurgence and acceptance of auto-classification is coming from the legal field itself, where the “assisted review” or “predictive coding” (just a form of auto-classification to you and me) wars between eDiscovery vendors have brought the technology to the fore, with judges finally endorsing its credibility [Magistrate Judge Peck in Moore v. Publicis Groupe & MSL Group, 287 F.R.D. 182 (S.D.N.Y. 2012), approving use of predictive coding in a case involving over 3 million e-mails].

The point that Christian Walker is making in his comments, however, is very important: auto-classification can help, but it is not the only, or even the primary, mechanism available for auto-declaration. They are not the same thing. Taking the records declaration process away from the end-user requires more than understanding the type of document and its place in a hierarchical taxonomy. It needs the business context around the document, and that comes from the process. A simple example to illustrate this would be a document containing a pricing quotation. Auto-classification can identify what it is, but not whether it has been sent to a client or formed part of a contract negotiation. It’s that latter contextual fact that makes it a business record. Auto-declaration from within a line-of-business application or a process management system is easy: you already know what the document is (whether it has been received externally or created as part of the process), you know who it relates to (client id, case, process) and you know what stage of its lifecycle it is at (draft, approved, negotiated, signed, etc.). These give enough definitive context to accurately identify and declare a record, without the need to involve the users or resort to auto-classification or any other heuristic decision. That’s assuming, of course, that there is an integration between the LoB/process system and the RM system, to allow that declaration to take place automatically.
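A minimal sketch of that kind of process-driven rule (the document types, states and flags below are all hypothetical) might look like the following – note that nothing here inspects the document’s text; the declaration decision comes entirely from its context in the process:

```python
# Hypothetical sketch: declaration driven by process context, not by
# linguistic analysis of the document itself.
def should_declare_as_record(doc_type: str, lifecycle_state: str,
                             sent_to_client: bool) -> bool:
    if doc_type == "quotation":
        # A draft quotation is just work in progress; once issued to a
        # client it becomes evidence of a business transaction.
        return sent_to_client or lifecycle_state == "signed"
    # Fallback rule for other document types in this toy example.
    return lifecycle_state in {"approved", "signed"}

print(should_declare_as_record("quotation", "draft", sent_to_client=False))  # → False
print(should_declare_as_record("quotation", "draft", sent_to_client=True))   # → True
```

The same document flips from non-record to record purely because of a contextual event (it was sent out), which is exactly the information a line-of-business or process system already holds.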

The next point I want to pick up is the issue of cloud. I think cloud is a red herring in this conversation. Cloud should be an architecture/infrastructure and procurement/licensing decision, not a functional one. Most large ECM/RM vendors can offer similar functionality hosted on- and off-premises, and offer SaaS payment terms rather than perpetual licensing. The cloud conversation around RM, however, becomes its own sticky mess when you start looking at guaranteeing location-specific storage (a critical issue for a lot of European data protection and privacy regulation) and at the integration between on-premises and off-premises systems (as in the examples of auto-declaration above). I don’t believe that auto-classification is a significant factor in the cloud decision-making process.

Finally, I wanted to bring another element into this discussion. There is another disruptive RM trend that is not explicit in Cheryl’s article (but it fits under point #1), and it addresses the third RM requirement above: “in-place” retention management. If you extract the retention schedule management from the RM tool and architect it at a higher logical level, then retention and disposition can be orchestrated across multiple RM repositories, applications, collaboration environments and even file systems, without the need to relocate the content into a dedicated traditional RM environment. It’s early days (and probably a step too far, culturally, for most RM practitioners), but the huge volumes of currently unmanaged information are becoming a key driver for this approach. We had some interesting discussions at the IRMS conference this year (triggered partly by IBM’s recent acquisition of StoredIQ into their Information Lifecycle Governance portfolio) and James Lappin (@JamesLappin) covered the concept in his recent blog here: The Mechanics of Manage-In-Place Records Management Tools. Well worth a read…
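As a rough sketch of the “in-place” idea (all names are hypothetical; real manage-in-place products such as StoredIQ work through their own connectors), a single disposition run can be orchestrated across several systems through thin adapters, without relocating any content into a central RM repository:

```python
from datetime import date

# Hypothetical sketch: one retention schedule, applied across several
# systems via per-system adapters, so content is disposed of "in place".
class InMemoryAdapter:
    """Stand-in for a repository connector (file share, SharePoint, etc.)."""
    def __init__(self, name, items):   # items: {item_id: disposition_date}
        self.name, self.items = name, items

    def find_expired(self, as_of):
        return [i for i, d in self.items.items() if d <= as_of]

    def dispose(self, item_id):
        del self.items[item_id]

def run_disposition(adapters, as_of):
    """Orchestrate one disposition sweep across all connected systems."""
    for adapter in adapters:
        for item_id in adapter.find_expired(as_of):
            adapter.dispose(item_id)
            print(f"disposed {item_id} in {adapter.name}")

collab = InMemoryAdapter("collaboration", {"doc-1": date(2013, 1, 1)})
fileshare = InMemoryAdapter("fileshare", {"f-9": date(2020, 1, 1)})
run_disposition([collab, fileshare], date(2013, 9, 24))
# → disposed doc-1 in collaboration
```

The orchestration layer owns the schedule; each repository only has to answer “what has expired?” and “delete this” – which is what makes the approach attractive for large volumes of otherwise unmanaged content.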

So to summarise my points: RM is a composite requirement; auto-categorisation is useful and is starting to become legitimate, but even though it can participate, it should not be confused with auto-declaration of records; “Cloud” is not a functional decision, it’s an architectural and commercial one.
