Content Services is in Alfresco’s DNA

I’m spending this week at Alfresco’s Sales Kickoff in Chicago, and having a blast.  There’s a lot of energy across the company about our new Digital Business Platform, and it’s great to see how many people instantly and intuitively get how Alfresco’s platform fits into a customer’s digital transformation strategy.  When content and process converge, and when we provide a world class platform for managing and exposing both as a service, it’s a pretty easy case to make.  We have some great customer stories to drive the point home too.  It’s one thing to talk about a digital strategy and how we can play there, but it’s another thing entirely to see it happen.

Content management is undergoing a shift in thinking.  Analysts have declared that ECM is dead, and content services is a better way to describe the market.  For my part, I think they are right.  Companies ahead of the curve have been using Alfresco as a content services platform for a long time.  I decided to do a little digging and see when Alfresco first added a web API to our content platform.  A quick look through some of our internal systems shows that Alfresco had working web services for content all the way back in 2006.  It was probably there earlier than that, but that’s one of the earliest references I could easily find in our systems.  That’s over a decade of delivering open source content services.  Here’s a quick view of the history of content services delivery channels in the product.

API History

I don’t think any other company in the market space formerly known as ECM can say that they have been as consistently service enabled for as long as Alfresco.  It’s great to see the market going to where we have been all along.

My Favorite New Things in the Alfresco Digital Business Platform


Everybody inside Alfresco has been busy getting ready for today’s launch of our new version, new branding, new web site, updated services and everything that comes along with it.  Today was a huge day for the company, with the release of Alfresco Content Services 5.2, a shiny new version of Alfresco Governance Services, our desktop sync client, the Alfresco Content Connector for Salesforce, a limited availability release of the Alfresco App Dev Framework and refreshes of other products such as our analytics solution, media management and AWS AMIs / Quickstarts.  Here are a few of my favorite bits from today’s releases (in no particular order).

The new REST API

Alfresco has always had a great web API, both the core REST API that was useful for interacting with Alfresco objects, and the open standards CMIS API for interacting with content.  Alfresco Content Services 5.2 takes this to the next level with a brand new set of APIs for working directly with nodes, versions, renditions and running search queries.  Not only is there a new API, but it is easier than ever to explore what the API has to offer via the API Explorer.  We also host a version of the API explorer so you can take a look without having to set up an Alfresco instance.  The new REST API is versioned, so you can build applications against it without worry that something will change in the future and break your code.  This new REST API was first released in the 5.2 Community version and is now available to Alfresco Enterprise customers.  The API is also a key component of the Alfresco App Development Framework, or ADF.  Like all previous releases, you can still extend the API to suit your needs via web scripts.
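To make the versioning point concrete, here is a minimal sketch of how a client might build the URL for listing a node's children. The path prefix, the `-root-` alias and the `skipCount` / `maxItems` parameters reflect my understanding of version 1 of the public API, so check them against the API Explorer for your release. The host name and the `children_url` helper are placeholders of mine, not part of the product.

```python
from urllib.parse import quote, urlencode

# Assumed path prefix for version 1 of the public REST API on the default tenant.
# The version number is part of the path, which is what lets clients pin to it.
API_ROOT = "/alfresco/api/-default-/public/alfresco/versions/1"


def children_url(base_url: str, node_id: str = "-root-",
                 skip_count: int = 0, max_items: int = 100) -> str:
    """Build the URL that lists one page of a node's children.

    children_url is a hypothetical helper for illustration only.
    """
    query = urlencode({"skipCount": skip_count, "maxItems": max_items})
    node = quote(node_id, safe="")
    return f"{base_url.rstrip('/')}{API_ROOT}/nodes/{node}/children?{query}"


if __name__ == "__main__":
    # http://localhost:8080 is a placeholder host, not a real deployment.
    print(children_url("http://localhost:8080", max_items=25))
```

The sketch only builds the URL.  Sending the request, authenticating and parsing the response are left out on purpose.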

Alfresco Search Services

Alfresco Content Services 5.2 comes with a whole new search implementation called Alfresco Search Services.  This service is based on Solr 6, and brings a huge number of search improvements to the Alfresco platform.  Search term highlighting, indexing of multiple versions of a document, category faceting and multi-select facets and document fingerprinting are all now part of the Alfresco platform.  Sharding also gets some improvements and you can now shard your index by DBID, ACL, date, or any string property.  This is a big one for customers supporting multiple large, distinct user communities that may each have different search requirements.  Unlike previous releases of Alfresco, search is no longer bundled as a WAR file.  It is now its own standalone service.

The Alfresco App Dev Framework

Over the years there have been a number of ways to build what your users need on top of the Alfresco platform.  In the early days this was the Alfresco Explorer (now deprecated), built with JSF.  The Share UI was added to the mix later, allowing a more configurable UI with extension points based on Surf and YUI.  Both of these approaches required you to start with a UI that Alfresco created and modify it to suit your needs.  This works well for use cases that are somewhat close to what the OOTB UI was built for, or for problems that require minimal change to solve.  For example, both Explorer and Share made it pretty easy to add custom actions, forms, or to change what metadata was displayed.  However, the further you get from what Share was designed to do, the more difficult the customizations become.

What about those cases where you need something completely different?  What if you want to build your own user experience on top of Alfresco content and processes?  Many customers have done this by building out their own UI in any number of different technologies.  These customers asked us to make it easier, and we listened.  Enter the Alfresco App Dev Framework, or ADF.  The ADF is a set of Angular2 components that make it easier to build your own application on top of Alfresco services.  There’s much more to it than that, including dev tooling, test tooling and other things that accelerate your projects.  The ADF is big enough to really need its own series of articles, so may I suggest you hop over to the Alfresco Community site and take a look!  Note that the ADF is still in a limited availability release, but we have many customers that are already building incredible things with it.

Admin Improvements

A ton of people put in a tremendous amount of work to get Alfresco Content Services 5.2 out the door.  Two new features that I’ve been waiting for are included, courtesy of the Alfresco Community and Alfresco Support.  The first is the trashcan cleaner, which can automate the task of cleaning out the Alfresco deleted items collection.  This is based on the community extension that many of our customers have relied on for years.  The second is the Alfresco Support Tools component.  Support Tools gives you a whole new set of tools to help manage and troubleshoot your Alfresco deployment, including thread dumps, profiling and sampling, scheduled job and active session monitoring, and access to both viewing logs and changing log settings, all from the browser.  This is especially handy for those cases where admins might not have shell access to the box on which Alfresco is running or have JMX ports blocked.  There’s more as well, check out the 5.2 release notes for the full story.

The Name Change

Ok, so we changed the name of the product.  Big deal?  Maybe not to some people, but it is to me.  Alfresco One is now Alfresco Content Services.  Why does this matter?  For one, it more accurately reflects what we are, and what we want to be.  Alfresco has a great UI in Share, but it’s pretty narrowly focused on collaboration and records management use cases.  This represents a pretty big slice of the content management world, but it’s not what everybody needs.  Many of our largest and most successful customers use Alfresco primarily as a content services platform.  They already have their own front end applications that are tailor made for their business, either built in-house or bought from a vendor.  These customers need a powerful engine for creating, finding, transforming and managing content, and they have found it in Alfresco.  The name change also signals a shift in mindset at Alfresco.  We’re thinking bigger by thinking smaller.  This new release breaks down the platform into smaller, more manageable pieces.  Search Services, the Share UI, Content Services and Governance Services are all separate components that can be installed or not based on what you need.  This lets you build the platform you want, and lets our engineering teams iterate more quickly on each individual component.  Watch for this trend to continue.

I’m excited to be a part of such a vibrant community and company, and can’t wait to see what our customers, partners and others create with the new tools they have at their disposal.  The technology itself is cool, but what you all do with it is what really matters.

Digital Transformation and the Role of Content and Process

I recently had the opportunity to go to NYC and speak on architecting systems for digital transformation.  It was a terrific day with our customers and prospects, as well as the extended Alfresco crew and many of our outstanding partners.  This was the first time I’ve delivered this particular talk so I probably stumbled over my words a few times as I was watching the audience reaction as much as the speaker notes.  One of the things I always note when speaking is which slides prompt people to whip out their phones and take a picture.  That’s an obvious signal that you are hitting on something that matters to them in a way that caught their attention, so they want to capture it.  In my last talk, one of the slides that got the most attention and set the cameras snapping pictures was this one:

road-to-transformation

Digitization

Digitization of content is the whole reason ECM as a discipline exists in the first place.  In the early days of ECM a lot of use cases centered around scanning paper as images and attaching metadata so it could be easily found later.  The drivers for this are obvious.  Paper is expensive, it takes up space (also expensive), it is hard to search, deliver and copy and you can’t analyze it en masse to extract insights.  As ECM matured, we started handling more advanced digital content such as PDFs and Office documents, videos, audio and other data, and we started to manipulate them in new ways.  Transforming between formats, for example, or using form fields and field extraction.  OCR also played a role, taking those old document image files and breathing into them a new life as first class digital citizens.  We layered search and analytics on top of this digital content to make sense of it all and find what we needed as the size of repositories grew ever larger.  Digitization accelerated with the adoption of the web.  That paper form that your business relied on was replaced by a web form or app, infinitely malleable.  That legacy physical media library was transformed into a searchable, streamable platform.

What all of these things have in common is that they are centered around content.

Digitalization

Simply digitizing content confers huge advantages on an organization.  It drives down costs, makes information easier to find and allows aggregation and analysis in ways that legacy analog formats never could.  While all of those things are important, they only begin to unlock the potential of digital technologies.  Digitized content allows us to take the next step:  Digitalization.  When the cost of storing and moving content drops to near zero, processes can evolve free from many of the previous constraints.  BPM and process management systems can leverage the digitized content and allow us to build more efficient and friendlier ways to interact with our customers and colleagues.  We can surface the right content to the right people at the right time, as they do their job.  Just as importantly, content that falls out of a process can be captured with context, managed as a record if needed.  Now instead of just having digitized content, we have digitalized processes that were not possible in the past.  We can see, via our process platform, exactly what is happening across our business processes.  Processes become searchable and can be analyzed in depth.  Processes can have their state automatically affected by signals in the business, and can signal others.  Our business state is now represented in a digital form, just like our content.

If digitization is about content, digitalization is about process.

Digital Transformation

The union of digitized content and digital processes is a powerful combination, and that helps create the conditions for digital transformation.  How?

In my opinion (and many others) the single most important thing to happen to technology in recent memory is the rise of the open standards web API.  APIs have been around on the web for decades.  In fact, my first company was founded to build out solutions on top of the public XML based API provided by DHL to generate shipments and track packages.  That was all the way back in the very early 2000s, a lifetime ago by tech standards.  Even then though, people were thinking about how to expose parts of their business as an API.

One of the watershed moments in the story of the API happened way back in the early 2000s as well, at Amazon.  By this point everybody has read about Amazon’s “Big Mandate”.  If you haven’t read about it yet, go read it now.  I’ll wait.  Ok, ready?  Great.  So now we know that the seeds for Amazon’s dominance in the cloud were planted over a decade and a half ago by the realization that everything needs to be a service, that the collection of services that you use to run your business can effectively become a platform, and that platform (made accessible) changes the rules.  By treating every area of their business as part of the platform, all the way down to the infrastructure that ran them, Amazon changed the rules for everybody.

How does this tie into the first two sections?  What about content and process?  I’m glad you asked.  Content and process management systems take two critical parts of a business (the content it generates and the processes that it executes) and surface them as an API.  Want to find a signed copy of an insurance policy document?  API.  Want to see how many customers are in your sign up pipeline and at what stage?  API.  Want to see all of the compliance related records from that last round of lab testing?  API.  Want to release an order to ship?  API.  Want to add completed training to somebody’s personnel record?  API.  You get the idea.  By applying best practices around content and process management systems, you can quickly expose large chunks of your business as a service, and begin to think of your company as a platform.

This is transformative.  When your business is a platform, you can do things you couldn’t do before.  You can build out new and better user experiences much more quickly by creating thin UI layers on top of your services.  Your internal teams or vendors that are responsible for software that provides the services can innovate independently (within the bounds of the API contract) which immediately makes them more agile.  You can selectively expose those services to your customer or partners, enabling them to innovate on top of the services you provide them.  You can apply monitoring and analytics at the service layer to see how the services are being used, and by whom (This is one, in fact, I would argue that you MUST do if you plan to orient yourselves toward services in any meaningful way, but that’s a separate article) which adds a new dimension to BI.  This is the promise of digital transformation.

There’s certainly more to come on this in the near future as my team continues our own journey toward digital transformation.  We are already well down the path and will share our insights and challenges as we go.

 

Do’s and Don’ts of Alfresco Development

Recently I had the pleasure of going up to NYC to speak with Alfresco customers and prospects about architecting Alfresco solutions and how to align that with efforts around digital transformation.  Part of that talk was a slide that discussed a few “do’s and don’ts” of designing Alfresco extensions.  Somebody suggested that I write down my thoughts on that slide and share it with the community.  So…  Here you go!

Do:

Stick with documented public APIs and extension points.

In early versions of Alfresco it wasn’t exactly clear what parts of the API were public (intended for use in extensions) or private (intended for use by Alfresco engineering).  This was mostly a problem for people building things out on the Alfresco Java API, and it was mostly fixed in the 4.x versions of the product.  Today, the Java API is fully annotated so it’s clear what is and is not public.  The full list is also available in the documentation.  Of course Alfresco’s server side Javascript API is public, as is the REST API (both Alfresco core and CMIS).  Alfresco Activiti has similar API sets.

Leverage the Alfresco SDK to build your deployment artifacts

During my time in the field I saw some customers and Alfresco developers roll their own toolchain for building Alfresco extensions using Ant, Gradle, Maven or in some cases a series of shell scripts.  This isn’t generally recommended these days, as the Alfresco SDK covers almost all of these cases.  Take advantage of all of the work Alfresco has done on the SDK and use it to build your AMPs / JARs and create your Alfresco / Share WAR file.  The SDK has some powerful features, and can create a complete skeleton project in a few commands using Alfresco archetypes.  Starting Alfresco with the extension included is just as simple.

Use module JARs where possible, AMPs where not

For most of Alfresco’s history, the proper way to build an extension was to package it as an AMP (Alfresco Module Package).  AMPs (mostly) act like overlays for the Alfresco WAR, putting components in the right places so that Alfresco knows where to find them.  Module JARs were first added in Alfresco 4.2 and have undergone significant improvements since their introduction.  They are now referred to as “Simple Modules” in the documentation.  Generally, if your extension does not require a third party JAR, using a Simple Module is a much cleaner, easier way to package and deploy an extension to Alfresco.

Unit test!

This should go without saying, but it’s worth saying anyway.  In the long run unit testing will save time.  The Alfresco SDK has pretty good support for unit tests, and projects built using the All-In-One Archetype have an example unit test class included.

Be aware of your tree: depth and degree matter

At its core, the Alfresco repository is a big node tree.  Files and folders are represented by nodes, and the relationships between these things are modeled as parent-child or peer relationships between nodes.  The depth of this tree can affect application design and performance, as can the degree, or number of child nodes.  In short, it’s not usually a best practice to have a single folder that contains a huge number of children.  Not only can this make navigation difficult if using the Alfresco API, but it can also create performance troubles if an API call tries to list all of the children in a gigantic folder.  In the new REST APIs the results are automatically paged which mitigates this problem.
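Here is a rough sketch of what consuming a paged child listing might look like from the client side.  It assumes the response is shaped like `{"list": {"pagination": {...}, "entries": [{"entry": {...}}]}}` with a `hasMoreItems` flag, which is my understanding of the new REST API's list format.  The `fetch_page` callable and the in-memory `fake_fetch` stand-in are hypothetical, there only so the loop can run without a server.

```python
from typing import Callable, Dict, Iterator


def iter_children(fetch_page: Callable[[int, int], Dict],
                  page_size: int = 100) -> Iterator[Dict]:
    """Yield every child entry, requesting one page at a time.

    fetch_page(skip_count, max_items) is a hypothetical callable that returns
    one parsed JSON page in the assumed {"list": {...}} shape.
    """
    skip = 0
    while True:
        page = fetch_page(skip, page_size)["list"]
        entries = page["entries"]
        for item in entries:
            yield item["entry"]
        # Stop on an empty page or when the server reports no more items.
        if not entries or not page["pagination"].get("hasMoreItems", False):
            break
        skip += len(entries)


def fake_fetch(skip_count: int, max_items: int) -> Dict:
    """Stand-in for a real HTTP call: serves 250 fake nodes from memory."""
    names = [f"doc-{i}.txt" for i in range(250)]
    chunk = names[skip_count:skip_count + max_items]
    return {
        "list": {
            "pagination": {"hasMoreItems": skip_count + len(chunk) < len(names)},
            "entries": [{"entry": {"name": n}} for n in chunk],
        }
    }


if __name__ == "__main__":
    print(sum(1 for _ in iter_children(fake_fetch)))
```

Because the loop is a generator, the caller never holds more than one page in memory at a time, even for a very wide folder.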

Use the Alfresco ServiceRegistry instead of injecting individual services

When using the Alfresco Java API and adding new Spring beans there are two ways to inject Alfresco services such as NodeService, PermissionService, etc.  You can inject the individual services that you need, or you can inject the ServiceRegistry that allows access to all services.  On the one hand, injecting individual services makes it easy to see from the bean definition exactly what services are used without going to the code.  On the other hand, if you need another service you need to explicitly inject it.  My preference is to simply inject the registry.  A second reason to inject the registry is that you’ll always get the right version of the service.  Alfresco has two versions of each service.  The first is the “raw” service, which has a bean name that starts with a lower case letter.  An example is nodeService.  These services aren’t recommended for extensions.  Instead, Alfresco provides an AOP proxy whose bean name starts with an upper case letter, such as NodeService.  This is the service that is recommended and is the one that will be returned by the service registry.

Don’t:

Modify core Alfresco classes, web scripts, or Spring beans

Alfresco is an open source product.  This means that you can, in theory, pull down the source, modify it as much as you like and build your own version to deploy.  Don’t do that.  That’s effectively a fork, and at that point you won’t be able to upgrade.  Stick with documented extension points and don’t modify any of the core classes, Freemarker templates, Javascript, etc.  If you find you cannot accomplish a task using the defined extension points, contact Alfresco support or open a discussion on the community forum and get an enhancement request in so it can be addressed.  Speaking of extension points, if you want a great overview of what those are and when to use them, the documentation has you covered.

Directly access the Alfresco / Activiti database from other applications

Alfresco and Activiti both have an underlying database.  In general, it is not a best practice to access this database directly.  Doing so can exhaust connection pools on the DB side or cause unexpected locking behavior.  This typically manifests as performance problems but other types of issues are possible, especially if you are modifying the data.  If you need to update things in Alfresco or Activiti, do so through the APIs.

Perform operations that will create extremely large transactions

When Alfresco is building its index, it pulls transactions from the API to determine what has changed, and what needs to be indexed.  This works well in almost all cases.  The one case where it may become an issue is if an extension or import process has created a single transaction which contains an enormous change set.  The Alfresco Bulk Filesystem Import Tool breaks down imports into chunks, and each chunk is treated as a transaction.  Older Alfresco import / export functionality such as ACP did not do this, so importing a very large ACP file may create a large transaction.  When Alfresco attempts to index this transaction, it can take a long time to complete and in some cases it can time out.  This is something of an edge case, but if you are writing an extension that may update a large number of nodes, take care to break those updates down into smaller parts.
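The general shape of "break it into smaller parts" can be sketched in a few lines.  This is only an illustration of the chunking idea, not Alfresco code: `batches`, `update_in_batches` and the `apply_batch` callback are hypothetical names, and the batch size of 500 is an arbitrary placeholder rather than a recommendation.  In a real extension, each call to `apply_batch` would do its work inside its own transaction.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Split any iterable into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def update_in_batches(node_ids: Iterable[T],
                      apply_batch: Callable[[List[T]], None],
                      size: int = 500) -> int:
    """Hand node ids to apply_batch one chunk at a time.

    apply_batch is a hypothetical callback; the assumption is that it commits
    each chunk as its own transaction.  Returns the number of chunks processed.
    """
    count = 0
    for batch in batches(node_ids, size):
        apply_batch(batch)
        count += 1
    return count


if __name__ == "__main__":
    seen: List[List[int]] = []
    print(update_in_batches(range(1200), seen.append, size=500))
    print([len(b) for b in seen])
```

The right batch size depends on how heavy each update is, so treat it as something to test and tune rather than a constant.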

Code yourself into a corner with extensions that can’t upgrade

Any time you are building an extension for Alfresco or Activiti, pay close attention to how it will be upgraded.  Alfresco makes certain commitments to API stability, what will and will not change between versions.  Be aware of these things and design with that in mind.  If you stick with the public APIs in all of their forms, use the Alfresco SDK and package / deploy your extensions in a supported manner, most of the potential upgrade hurdles will already be dealt with.

Muck around with the exploded WAR file

This one should go without saying, but it is a bad practice to modify an exploded Java WAR file.  Depending on which application server you use and how it is configured, the changes you make in the exploded WAR may be overwritten by a WAR redeployment at the most inopportune time.  Instead of modifying the exploded WAR, create a proper extension and deploy that.  The SDK makes this quick and easy, and you’ll save yourself a lot of pain down the road.

This list is by no means comprehensive, just a few things that I’ve jotted down over many years of developing Alfresco extensions and helping my peers and customers manage their Alfresco platform.  Do you have other best practices?  Share them in the comments!

Alfresco and Solr – Search, Reindexing and Index Cluster Size

A question came up from a colleague recently, driven by a customer question.  When is it appropriate to increase the number of Alfresco Index Servers running Solr?  The right direction depends on several factors, and what exactly you are trying to achieve.  Like many questions related to Alfresco architecture, sizing and scalability the answers can be found in Alfresco’s excellent whitepapers on the subject (full disclosure and shameless self promotion: I wrote one of them).  Not only are there multiple reasons why you may need to scale up the search tier, there are a couple different ways to go about doing it.  Hopefully this article will help lend a little clarity to a sometimes confusing topic.

A place to start

A typical customer configuration starts with a number of index servers that roughly matches the number of repository cluster nodes.  The index servers sit behind a load balancer and provide search services to the repository tier.  Each index server maintains its own copy of the index, providing full failover.  It’s common to see a small to medium enterprise deployment running on a 2X2 configuration.  That is, two repository tier servers and two index servers, the minimum for high availability.  As the repository grows, user patterns change or the system is prepared for upgrade, this can prove insufficient from a search and indexing point of view.

Large repositories

When Alfresco first ran our 1B document benchmark we set a target of about 50M documents indexed per index server.  So, for our 1B document environment, we had 20 index shards each containing ~50M docs.  Our testing shows that the system gives solid, predictable performance at this level.  For large repositories, this is a good starting point for planning purposes.  If you have a lighter indexing requirement per document (for example, a small metadata set or no full text indexing) it is likely possible to go higher.  Conversely, if your requirements call for full text indexing of large documents and you have a large, complex metadata set, a smaller number of documents per shard is more appropriate.  Either way, as a repository grows larger at some point you will need to consider adding additional index servers.  As with all things related to scale, take vendor recommendations as a starting point, then test and monitor.
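The arithmetic behind that benchmark layout is just a ceiling division, sketched below for planning purposes.  The 50M default comes from the benchmark target above; `shards_needed` is a hypothetical helper name, and the result is a starting estimate only, since the per-shard figure should move up or down with your indexing requirements.

```python
def shards_needed(total_docs: int, docs_per_shard: int = 50_000_000) -> int:
    """Estimate how many index shards a repository of total_docs calls for.

    docs_per_shard defaults to the ~50M planning target; lower it for heavy
    full text indexing, raise it for light metadata-only indexing.
    """
    if total_docs < 0:
        raise ValueError("total_docs must not be negative")
    if docs_per_shard < 1:
        raise ValueError("docs_per_shard must be at least 1")
    # Integer ceiling division, with a floor of one shard.
    return max(1, -(-total_docs // docs_per_shard))


if __name__ == "__main__":
    print(shards_needed(1_000_000_000))  # the 1B document benchmark case
    print(shards_needed(120_000_000))
```

For the 1B document case this gives the 20 shards used in the benchmark, and a 120M document repository rounds up to 3.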

Heavy search

Some ECM use cases lean heavily on search services.  For these cases it makes sense to deploy additional index servers to handle the load.  Spreading search requests across a larger number of servers does not improve the single transaction performance, but does allow more concurrent searches to complete quickly.  If your use case relies heavily on search, then you may need to consider adding additional index servers to satisfy those requests.  For this specific case, both sharding and replication can be appropriate.  Both sharding and replication allow you to spread your search load across multiple systems.  So how do you choose?  In most cases sharding is the better option.  It is more flexible and has additional benefits as we will outline in the next section.

If your repository is relatively small (fewer than 50M documents or so) and you are primarily concerned with search performance, replication can be a good option.  Replication sets up your index servers so that only one is actually tracking the repository and building the index.  This master node then replicates its index out to one or more slaves that are used to service search requests.  The advantage of this configuration is that DB pressure is reduced by only having one index server tracking the repository, and you now have multiple servers with a copy of that index to service search requests.  The downside is that it has a relatively low upper limit of scalability, and introduces a single point of failure for index tracking.  That is not such a huge problem though: if the tracking server stops working you can always spin up another and re-seed it with a copy of the index from the slaves.  A replicated scenario may also increase the index lag time (time between adding a document and it appearing in the index) slightly since it must first be written to the master index and then replicated out to the slaves.  Real world testing shows that this delay is minimal, but it is present.

Reindexing and upgrades

There is another case where you may want to consider adding additional index servers, and that is when reindexing the repository or upgrading to a new version of Alfresco.  Alfresco has supported multiple versions of Solr over the years.  Alfresco 4.x used Solr 1.4, 5.0/5.1 use Solr 4, and the upcoming 5.2 release can use Solr 6.  Newer versions of Solr bring great new features and performance improvements, so customers are always eager to upgrade!  There is one thing to look out for though:  reindexing times.  Switching from one version of Solr to another does require that the repository be reindexed.  For very large repositories this can take some time.  This is where sharding is especially helpful.  By breaking the index into pieces (shards) we can parallelize the reindexing process and allow it to complete more quickly.  The fewer documents an individual shard reindexes, the faster it will finish (within reason, 10 docs per shard or something would be ridiculous).  So if you are considering an Alfresco upgrade and are worried about reindexing times, consider additional index servers to speed things along.  Note that most Alfresco upgrades do not require you to switch versions of Solr immediately.  You can continue to run your server on the old index while the new index builds, but during this time you cannot take advantage of Alfresco features that depend on the new index.
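As a back-of-the-envelope illustration of why more shards shorten a reindex, the sketch below assumes idealized conditions: every shard indexes in parallel at the same steady rate, and nothing else (the database, the repository tier, content transformation) becomes the bottleneck.  Real systems will not scale this cleanly.  The `reindex_hours` helper and the per-shard rate are hypothetical; the rate has to come from measuring your own environment.

```python
def reindex_hours(total_docs: int, shards: int,
                  docs_per_hour_per_shard: float) -> float:
    """Idealized reindex time when all shards work in parallel.

    docs_per_hour_per_shard is a placeholder you must measure yourself;
    this model ignores every bottleneck outside the index servers.
    """
    if shards < 1:
        raise ValueError("shards must be at least 1")
    if docs_per_hour_per_shard <= 0:
        raise ValueError("docs_per_hour_per_shard must be positive")
    # The slowest shard holds the ceiling of an even split.
    docs_on_largest_shard = -(-total_docs // shards)
    return docs_on_largest_shard / docs_per_hour_per_shard


if __name__ == "__main__":
    # 1_000_000 docs per hour is an invented rate used only to show the shape.
    for n in (1, 4, 8):
        print(n, reindex_hours(200_000_000, n, 1_000_000))
```

Under these assumptions the estimate falls in proportion to the shard count, which is the effect described above.  Measure a small test reindex before trusting any number that comes out of a model like this.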

Conclusion

This list is by no means comprehensive, but it does outline the three most common reasons I have seen customers add additional index servers.  Have you seen others?  Comment below, I’d love to hear about it!