I hit a major milestone today: 10,000 unread messages in my inbox. Actually, 10,001, since one more came in between the time I took that screenshot and the time I started writing this article. People tend to notice big round numbers, so when I logged in and saw that 10K sitting there I had a moment of crisis. Why did it get that bad? How am I ever going to clean up that pile of junk? Am I just a disorganized mess? Should somebody who lets their inbox get into that state ever be trusted with anything again? It felt like failure.
Is it failure? Or has the shift from categorization to search as the dominant way to find things slowly become complete enough that I no longer really care (or, more to the point, no longer need to care) how many items sit in the bucket? I think it is the latter.
Think back to when you first started using email. Maybe take a look at your corporate mail client, which likely lags the state of the art in terms of functionality (or, if you are saddled with Lotus Notes, sits far, FAR behind it). Remember setting up mail rules to route stuff to the right folders, or regularly snipping off older messages at some date milestone and stuffing them into a rarely (never?) touched archive, just in case? Now think about how you handle your personal mail, assuming you are using Gmail or another modern incarnation. Is that level of categorization and routing and archiving still necessary? No? Didn’t think so. Email, being an area of fast-moving improvement and early SaaS colonization, crossed this search threshold quite some time ago. Systems that deal with more complex content in enterprise contexts took (and are still taking) a bit longer. Bruce Schneier touches on this toward the beginning of his book Data and Goliath, where he writes, “for me, saving and searching became easier than sorting and deleting”. By the by, Data and Goliath is a fantastic book, and I highly recommend giving it a read if you want to be terrified by what is possible with a hoard of data.
So, what does this have to do with content management systems? A lot, actually.
One of my guiding principles for implementing content management systems is to look for the natural structure of the content. Are there common elements that suggest a structure that minimizes complexity? Are there groupings of related content that need to stay together? How are those things related? Is there a rigid taxonomy at work, or is it more ad hoc? Are there groups of metadata properties that cut across multiple types of content? What constructs does your content platform support that align with the natural structure of the content itself? From there you can start to flesh out the other concerns, such as how other applications will access the content and what types of things those applications expect to get back. The takeaway is to strike a balance between the intrinsic structure of what you have (if it even has any structure at all) and how people will use it.
I’ve written previously about Alfresco’s best practices, and one of the items that has always been on that list is paying attention to the depth and degree of your graph. Every node (a file, a folder, etc.) in Alfresco has to have a parent (except for the root node), and it was considered bad practice to simply drop a million objects as children of a single parent. The better practice was to categorize the content and create subcontainers so that no single parent ended up with millions of children. For some use cases this makes perfect sense: grouping documents related to an insurance claim in a claim folder, HR documents in a folder for each person you employ, documents grouped by geographical region, per-project folders, and so on.
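That subcontainer practice usually boils down to a bucketing function: derive a stable subfolder path from an object’s identifier so that no parent accumulates millions of direct children. Here is a minimal sketch; the two-level, hash-based scheme (and the claim number format) is one hypothetical choice among many — date ranges, name prefixes, or regions would work the same way:

```python
import hashlib

def bucket_path(object_id: str, levels: int = 2, width: int = 2) -> str:
    """Derive a stable subfolder path from an object id so that no
    single folder accumulates millions of direct children.

    With the defaults, a 1M-object collection spreads across
    256 * 256 = 65,536 buckets, roughly 15 objects each.
    """
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [object_id])

# The same object always lands in the same bucket.
path = bucket_path("CLM-2015-000123")
```

The trade-off, of course, is that the resulting folder tree means nothing to a human browsing it — which is exactly the point the next section picks at.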
Recently, though, I have seen more use cases from customers where that model feels like artificially imposing structure on content where none exists. Take, for example, policy documents. These are likely to be consistently structured, often standalone documents with no other content associated with them. They have a set of metadata used as common lookup fields: names, policy numbers, dates in force, and so on. Does this set of content require a hierarchical structure? You could easily impose one by grouping policy docs by date range, or by the first character or two of the policy holder’s last name, but does that structure bring any value whatsoever that the metadata doesn’t? Does adding it bring any value to the repository or the applications that use it? I don’t think it does. In fact, creating structures that are misaligned with the content and the way it is used can create unnecessary complexity on the development side. Even for the claim folder example above, it might make the most sense to just drop all claim folders under a common parent and avoid creating artificial structure where no intrinsic structure exists. As with the inbox, save and search can be more efficient than sort and file.
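When the metadata carries all the lookup value, retrieval becomes a query rather than a path. Here is a sketch of building a CMIS-style query over those lookup fields; the acme:policy type and its property names are hypothetical stand-ins for a real content model, and production code should bind query parameters through your CMIS client rather than interpolate strings:

```python
def policy_query(policy_number=None, holder_name=None, in_force_on=None):
    """Build a CMIS query that finds policy documents by metadata
    alone, with no reliance on folder structure. The acme:policy
    type and its properties are hypothetical."""
    clauses = []
    if policy_number:
        clauses.append(f"acme:policyNumber = '{policy_number}'")
    if holder_name:
        clauses.append(f"acme:holderName = '{holder_name}'")
    if in_force_on:
        # CMIS timestamp literals take ISO 8601 form.
        clauses.append(
            f"acme:inForceFrom <= TIMESTAMP '{in_force_on}' AND "
            f"acme:inForceTo >= TIMESTAMP '{in_force_on}'"
        )
    if not clauses:
        raise ValueError("at least one lookup field is required")
    return "SELECT * FROM acme:policy WHERE " + " AND ".join(clauses)

print(policy_query(policy_number="POL-100234"))
# → SELECT * FROM acme:policy WHERE acme:policyNumber = 'POL-100234'
```

Notice there is no folder path anywhere in the lookup: whether the documents sit a million deep under one parent or spread across date-range folders makes no difference to the caller.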
Can you do this with Alfresco? We have done some investigation and the answer appears to be “yes”, with some caveats. We have several customers successfully using very large collections of objects, and as long as they stay within some reasonable guardrails it works. First, make sure that you aren’t creating a massive transaction when moving content under a single parent. This is usually a concern during migrations of large collections. One nice side effect of migration tools that break content down into smaller containers is that they usually also help you avoid creating the massive single transactions that can manifest as indexing delays later. Second, make sure you are accessing these large collections in a reasonable way. If you request all the children of a parent and those children number in the millions, you’re going to have a bad time. Use pagination to limit the number of results to something you can reasonably handle; most of Alfresco’s APIs, including CMIS, make this easy. Even better, only retrieve the objects you need by taking advantage of metadata search. Finally, don’t try to browse folders with extremely large numbers of children. Share can fetch more than it needs when rendering folder content in the document library, which may cause a problem. Really, though, what value is there in browsing a collection that large? Nobody is going to look past the first page or two of results anyway.
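The pagination advice amounts to never asking for the whole child list at once. Here is a sketch of a paging iterator, assuming a fetch_page callable that wraps whatever listing API you use (Alfresco’s REST API, for instance, accepts skipCount and maxItems parameters on children listings, and CMIS supports the same idea on getChildren); the fake in-memory repository below just stands in for those calls:

```python
def iter_children(fetch_page, parent_id, page_size=100):
    """Lazily walk a huge folder without ever requesting all of its
    children at once. fetch_page(parent_id, skip_count, max_items)
    stands in for a paged repository call."""
    skip = 0
    while True:
        page = fetch_page(parent_id, skip, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return          # short page means we hit the end
        skip += page_size

# Hypothetical stand-in for a repository holding 250 children.
items = [f"doc-{i}" for i in range(250)]

def fake_fetch(parent_id, skip, max_items):
    return items[skip:skip + max_items]

children = list(iter_children(fake_fetch, "folder-1", page_size=100))
print(len(children))  # → 250
```

The caller never holds more than one page in flight, so the same loop works whether the folder has 250 children or 25 million — though, as noted above, metadata search that avoids enumerating the folder at all is better still.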
So there you (mostly) have it. Listen to your content, listen to your users, listen to your developers, and don’t try to build structure where it doesn’t exist already. Search is your friend.
Footnote: When I posted the size of my unread inbox to Facebook, people had one of two reactions. The first was “Heh, amateur, I have 30K unread in mine”. The second was a reaction of abject horror that anybody has an inbox in that state. Seems the “sort and delete” method still has its followers!