Welcome to the graveyard of my words

Or “How to file a million documents”

My hard drive has a folder structure that only I understand, mixing as it does the day-to-day personal, work-related and ephemeral aspects of my life. What, you might ask, lives in the folder called “Cognite”? I had to look in it myself to recall what I had put there.

If I can’t remember what is in a set of folders that I alone have given semantic names to, what hope is there for an enterprise that employs thousands of me? A simple search for .doc and .docx files on my computer tells me that I have 2,673 such items; my brain can just about cope with 10. I need a better filing system!
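For anyone who wants to run the same count on their own machine, here is a minimal sketch. It assumes Python 3 and a starting folder of your choosing; the function name `count_word_documents` is made up for illustration, and it is not necessarily how I arrived at my own number.

```python
from pathlib import Path

# Extensions treated as "Word documents" for the purposes of this count.
WORD_EXTENSIONS = {".doc", ".docx"}


def count_word_documents(root):
    """Count files under `root` (recursively) ending in .doc or .docx.

    The comparison is case-insensitive, so REPORT.DOCX is counted too.
    """
    return sum(
        1
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in WORD_EXTENSIONS
    )


if __name__ == "__main__":
    # Path.home() is only an example starting point; pick any folder.
    print(count_word_documents(Path.home()))
```

Whatever number comes back, it only tells you how many documents there are, not what is in any of them.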

Today we create and name folders according to a theme or context relevant to the files they will contain. It is self-evident what a folder called “Accounts” will contain. It’s where a hacker looking for bank account details would start. However, if Accounts accumulates so many files that cognitive dissonance sets in, we might create sub-folders called “Accounts 2014” and “Accounts 2015” and file accounts by year.

But what if instead we’d created folders called “Accounts Payable”, “Accounts Receivable” and “General Ledger”, to denote three commonly separate areas of accounting? Would each area be broken down by year, or would folders by year be broken down by area? Who makes that decision? Who decides to embark on a potentially endless proliferation of directory organization? The answer in almost every case is “one mind”. Therein lies the problem that we face when we need to collaborate: it doesn’t matter whose mind it is, no two minds think alike.

I cannot think of a logical way to ask another person to find a file in my Cognite folder without telling them exactly where to look. If they searched for the file, what would they make of the fact that there is a file in that folder with exactly the same name as a file in another folder? It would undoubtedly lead to confusion. In most organizations confusion reigns supreme: dozens of people create folders that nobody else understands.
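To see how often the same file name turns up in more than one folder, a rough sketch like the following would do. It assumes Python 3; `find_duplicate_names` is a hypothetical helper written for this post, and it compares names only, not contents.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_names(root):
    """Map each file name that occurs more than once under `root` to its paths.

    Names are compared case-insensitively, so Report.docx and report.docx
    are treated as the same name.
    """
    by_name = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_name[path.name.lower()].append(path)
    # Keep only the names that appear in more than one place.
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}
```

Each entry in the result is a name that a searcher could not resolve without being told which folder to look in.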

Search — a simple solution

There’s only one way to file documents consistently and that is to not care too much about where they are stored. Filing simply deals with the cognitive dissonance I mentioned earlier; it provides the minimum level of order to information that our brains need, without which we would just walk away. Show a human being a million files, or a thousand folders, and the brain just shuts down and says ‘someone else can deal with this’. Someone has to create the folder structure; just hope they are smart enough to do a good job, but don’t rely on it!

Search is the only way to handle a million documents, but it’s no good searching for file names, not even if your document collection is no larger than my laptop’s hard drive, because that assumes that a) the searcher knows what the document is called; b) the document is named to match the search term; and c) the document does indeed contain what the searcher is looking for.

The simple reality of documents is that it is their contents we value, so we need to look inside them. Otherwise we are like rock collectors, looking at a bucket of rocks and hoping that one of them is a gold nugget. It’s too slow and inefficient to start looking inside documents when we are searching, which implies that the documents should be parsed in the filing process and ‘prepared’ for search. We want to be able to ‘Google’ documents, like we can web pages, and if we are going to parse them for this purpose then we might as well do some smart stuff along the way, such as standardising the formatting, inferring structure and converting them to a ubiquitous format so they can be read on any device.

It sounds like the first step in filing a million documents is to parse them and turn them into HTML. That step in the process is called Documization.
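To make “parsed in the filing process and prepared for search” a little more concrete, here is a minimal sketch of an inverted index that is built when a document is filed, so that a later search never has to open a file. It assumes the text has already been extracted from the document; `ContentIndex`, `file_document` and `search` are names invented for this illustration, and this is not a description of how Documization itself works.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ContentIndex:
    """Toy inverted index: each word maps to the documents that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def file_document(self, doc_id, text):
        """Do the parsing once, at filing time, rather than at search time."""
        for word in tokenize(text):
            self._postings[word].add(doc_id)

    def search(self, query):
        """Return the ids of documents containing every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return result
```

With an index like this, a file called “untitled1.doc” is found by what it says rather than by what it is called or where it was put.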

Workflow and Collaboration

Nobody files a million documents (or even a hundred) and doesn’t need to collaborate on some of them. It is the need to collaborate on documents that is usually the precursor to the purchase of some clunky document management system (DMS) that replaces the network drive as the document graveyard. No matter how sophisticated the DMS, the actual act of collaboration invariably breaks its primary purpose: control.

Imagine you have a file in a DMS (SharePoint, OpenText or Documentum, take your pick). You need to share it with an external entity (a customer, a lawyer or a supplier; again, take your pick). To give them access to your DMS, a) you would have to pay for it, and b) they would have to know how to use it and be willing to do so. In the real world this doesn’t happen. Let me tell you what does…

Person A attaches a document to an email and sends it to Person B. Now there are 2 versions of the document. Person B downloads the document to their network or hard drive; now there are 3 versions (one at source, one in email, and one at destination). Person B changes the document (hopefully remembering to Track Changes), uploads it and sends it back (4 versions). Person A downloads the document, hunts through the changes… I’ll stop now, this is painful.

So far we only have two people and we’ve not even factored in that one of them might have Box or Dropbox synching away in the background, spreading various versions of the document to different machines (and the cloud) like a virus! Find the CIO of an organization with a million documents who tells you that a ton of confidential documents don’t fly in and out of the org via email and you will have found a liar.

Once again, the solution is simple:

Put the document in a single location that all parties can access, and track every edit to it, with the ability to review and roll back revisions.

This is what the DMS is supposed to do, except it’s too expensive and clunky, so most people bypass the system. At the risk of stating the obvious, having only one version of a document in a single, inexpensive, controlled and accessible location is what a Documized document gives you.
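The “single location, every edit tracked” idea can be sketched in a few lines. This is only an illustration under my own assumptions: `TrackedDocument` and `Revision` are hypothetical names, everything is held in memory, and rolling back is modelled as a new revision so that nothing is ever lost from the history. It is not a description of any particular product’s internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    number: int   # 1-based position in the history
    author: str   # who made this edit
    content: str  # full text of the document at this revision


class TrackedDocument:
    """One document, one location, with every edit kept as a revision."""

    def __init__(self, author, content):
        self._revisions = [Revision(1, author, content)]

    @property
    def current(self):
        """The latest revision, which is the only 'live' version."""
        return self._revisions[-1]

    def edit(self, author, content):
        """Record a new revision instead of creating another copy."""
        revision = Revision(len(self._revisions) + 1, author, content)
        self._revisions.append(revision)
        return revision

    def history(self):
        """Return all revisions, oldest first, for review."""
        return list(self._revisions)

    def roll_back(self, author, to_number):
        """Restore an earlier revision by recording it as a new edit."""
        if not 1 <= to_number <= len(self._revisions):
            raise ValueError(f"no revision numbered {to_number}")
        return self.edit(author, self._revisions[to_number - 1].content)
```

Person A and Person B from the email example would both call `edit` on the same object; there is no attachment, no download and no fourth copy to hunt through.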