(Possibly) Enhancing Alfresco Search with Stanford CoreNLP

corenlp + alfresco

Laurence Hart recently published an article on CMSWiRE about AI and enterprise search that I found interesting.  In it, he lays out some good arguments about why the expectations for AI and enterprise search are a bit overinflated.  This is probably a natural part of they hype cycle that AI is currently traversing.  While AI probably won’t revolutionize enterprise search overnight, it definitely has the potential to offer meaningful improvements in the short term.  One of the areas where I think we can get some easy improvements is by using natural language processing to extract things that might be relevant to search, along with some context around those things.  For example, it is handy to be able to search for documents that contain references to people, places, organizations or specific dates using something more than a simple keyword search.  It’s useful for your search to know the difference between the china you set on your dinner table and China the country, or Alfresco the company vs eating outside.  Expanding on this work, it might also be useful to do some sentiment analysis on a document, or extract specific parts of it for automatic classification.

Stanford offers a set of tools to help with common natural language processing (NLP) tasks.  The Stanford CoreNLP project consists of a framework and variety of annotators that handle tasks such as sentiment analysis, part of speech tagging, lemmatization, named entity extraction, etc.  My favorite thing about this particular project is how they have simply dropped the barriers to trying it out to zero.  If you want to give the project a spin and see how it would annotate some text with the base models, Stanford helpfully hosts a version you can test out.  I spent an afternoon throwing text at it, both bits I wrote, and bits that come from some of my test document pool.  At first glance it seems to do a pretty good job, even with nothing more than the base models loaded.

I’d like to prove out some of these concepts and explore them further, so I’ve started a simple project to connect Stanford CoreNLP with the Alfresco Content Services platform.  The initial goals are simple:  Take text from an document stored in Alfresco, run it through a few CoreNLP annotators, extract data from the generated annotations, and store that data as Alfresco metadata.  This will make annotation data such as named entities (dates, places, people, organizations) directly searchable via Alfresco Search Services.  I’m starting with an Alfresco Repository Action that calls CoreNLP since that will be easy to test on individual documents.  It would be pretty straightforward to take this component and run it as a metadata extractor, which might make more sense in the long run.  Like most of my Alfresco extension or integration projects, this roughly follows the Service Action Pattern.

Stanford CoreNLP makes the integration bits pretty easy.  You can run CoreNLP as a standalone server, and the project helpfully provides a Java client (StandfordCoreNLPClient) that somewhat closely mirrors the annotation pipeline so if you already know how to use CoreNLP locally, you can easily get it working from an Alfresco integration.  This will also help with scalability since CoreNLP can be memory hungry and running the NLP engine in a separate JVM or server from Alfresco definitely makes sense.  It also makes sense to be judicious about what annotators you run, so that should be configurable in Alfresco.  It also make sense to limit the size of the text that gets sent to CoreNLP, so long term some pagination will probably be necessary to break down large files into more manageable pieces.  The CoreNLP project itself provides some great guidance on getting the best performance out of the tool.

A couple of notes about using CoreNLP programmatically from other applications.  First, if you just provide a host name (like localhost) then CoreNLP assumes that you will be connecting via HTTPS.   This will cause the StanfordCoreNLPClient to not respond if your server isn’t set up for it.  Oddly, it also doesn’t seem to throw any kind of useful exception, it just sort of, well, stops.  If you don’t want to use HTTPS, make sure to specify the protocol in the host name.  Second, Stanford makes it pretty easy to use CoreNLP in your application by publishing on Maven Central, but the model jars aren’t there.  You’ll need to download those separately.  Third, CoreNLP can use a lot of memory for processing large amounts of text.  If you plan to do this kind of thing at any kind of scale, you’ll need to run the CoreNLP bits on a separate JVM, and possibly a separate server.  I can’t imagine that Alfresco under load and CoreNLP in the same JVM would yield good results.  Fourth, the client also has hefty memory requirements.  In my testing, running CoreNLP client in an Alfresco action with less than 2GB of memory caused out of memory errors when processing 5-6 pages of dense text.  Finally, the pipeline that you feed CoreNLP is ordered.  If you don’t have the correct annotators in there in the right order, you won’t get the results you expect.  Some annotators have dependencies, which aren’t always clear until you try to process some text and it fails.  Thankfully the error message will tell you what other annotators you need in the pipeline for it to work.

After some experimentation I’m not sure that CoreNLP is really well suited for integration with a content services platform.  I had hoped that most of the processing using StanfordCoreNLPClient to connect to a server would take place on the server, and only results would be returned but that doesn’t appear to be the case.  I still think that using NLP tools to enhance search has merit though.  If you want to play around with this idea yourself you can find my PoC code on Github.  It’s a toy at this point, but might help others understand Alfresco, some intricacies of CoreNLP, or both.  As a next step I’m going to look at OpenNLP and a few other tools to better understand both the concepts and the space.

 

A Simple Pattern for Alfresco Extensions

Over the years I have worked with and for Alfresco, I have written a ton of Alfresco extensions.  Some of these are for customers, some are for my own education, some for R&D spikes, etc.  I’d like to share a common pattern that comes in handy.  If you are a super experienced Alfresco developer, this article probably isn’t for you.  You know this stuff already!

There are a lot of ways to build Alfresco extensions, and a lot of ways to integrate your own code or connect Alfresco to another product.  There are also a lot of ways you might want to call your own code or an integration, whether that is from an Action, a Behavior, a Web Script, a scheduled job, or via the Alfresco Javascript API.  One way to make your extension as flexible as possible is to use what could informally be called the “Service Action Pattern”.

The Service Action Pattern

service_action_pattern_sequence (1)

Let’s start by describing the Service Action Pattern.  In this pattern, we take the functionality that we want to make available to Alfresco and we wrap it in a service object.  This is a well established pattern in the Alfresco world, used extensively in Alfresco’s own public API.  Things like the NodeService, ActionService, ContentService, etc all take core functionality found in the Alfresco platform and wrap it in a well defined service interface consisting of a set of public APIs that return Alfresco objects like NodeRefs, Actions, Paths, etc, or Java primitives.  The service object is where all of our custom logic lives, and it provides a well defined interface for other objects to use.  In many ways the service object serves as a sort of adapter pattern in that we are using the service object to translate back and forth between the types of domain specific objects that your extension requires and Alfresco objects.  When designing a new service in Alfresco, I find it is a best practice to limit the types of objects returned by the service layer to those things that Alfresco natively understands.  If your service object method creates a new node, return a NodeRef, for example.

A custom service object on its own isn’t terribly useful, since Alfresco doesn’t know what to do with it.  This is where an Alfresco Action comes in handy.  We can use one or more Alfresco Actions to call the services that our service object exposes.  Creating an action to call the service object has several advantages.  First, once you have an Action you can easily call that Action (and thus the underlying service object) from the Javascript API (more on this in a moment).  Second, it is easy to take an Action and surface it in Alfresco Share for testing or so your users can call it directly.  Actions can also be triggered by folder rules, which can be useful if you need to call some code when a document is created or updated.  Finally, Actions are registered with Alfresco, which makes them easy to find and call from other Java or server side Javascript code via the ActionService.  If you want to do something to a file or folder in Alfresco there is a pretty good chance that an Action is the right approach.

Using the Service Action Pattern also makes it simple to expose your service object as a REST API.  Remember that Alfresco Actions can be located and called easily from the Javascript API.  The Javascript API also happens to be (IMHO) the simplest way to build a new Alfresco Web Script.  If you need to call your Action from another system (a very common requirement) you can simply create a web script that exposes your action as a URL and call away.  This does require a bit of boilerplate code to grab request parameters and pass them to the Action, which in turn will call your service object.  It isn’t too much and there are lots of great examples in the Alfresco documentation and out in the community.

So why not just bake the code into the Action itself?  Good question!  First, any project of some complexity is likely to have a group of related functionality.  A good example can be found in the AWS Glacier Archive for Alfresco project we built a couple years ago at an Alfresco hack-a-thon.  This project required us to have Actions for archiving content, initiating a retrieval, and retrieving content.  All of these Actions are logically and functionally related, so it makes sense to group them together in a single service.  If you want the details of how Alfresco integrates with AWS Glacier, you just have to look at the service implementation class, the Action classes themselves are just sanity checks and wiring.  Another good reason to put your logic into a service class is for reuse outside of Actions.  Actions carry some overhead, and depending on how you plan to use it you may want to make your logic available directly to a behavior or expose it to the Alfresco Javascript API via a root scope object.  Both of these are straightforward if you have a well defined service object.

I hope this helps you build your next awesome Alfresco platform extension, I have found it a useful way to implement and organize my Alfresco projects.

Do’s and Don’ts of Alfresco Development

Recently I had the pleasure of going up to NYC to speak with Alfresco customers and prospects about architecting Alfresco solutions and how to align that with efforts around digital transformation.  Part of that talk was a slide that discussed a few “do’s and don’ts” of designing Alfresco extensions.  Somebody suggested that I write down my thoughts on that slide and share it with the community.  So…  Here you go!

Do:

Stick with documented public APIs and extension points.

In early versions of Alfresco it wasn’t exactly clear what parts of the API were public (intended for use in extensions) or private (intended for use by Alfresco engineering).  This was mostly fixed in the 4.x versions of the product.  This was mostly a problem for people building things out on the Alfresco Java API.  Today, the Java API is fully annotated so it’s clear what is and is not public.  The full list is also available in the documentation.  Of course Alfresco’s server side Javascript API is public, as is the REST API (both Alfresco core and CMIS).  Alfresco Activiti has similar API sets.

Leverage the Alfresco SDK to build your deployment artifacts

During my time in the field I saw some customers and Alfresco developers roll their own toolchain for building Alfresco extensions using Ant, Gradle, Maven or in some cases a series of shell scripts.  This isn’t generally recommended these days, as the Alfresco SDK covers almost all of these cases.  Take advantage of all of the work Alfresco has done on the SDK and use it to build your AMPs / JARs and create your Alfresco / Share WAR file.  The SDK has some powerful features, and can create a complete skeleton project in a few commands using Alfresco archetypes.  Starting Alfresco with the extension included is just as simple.

Use module JARs where possible, AMPs where not

For most of Alfresco’s history, the proper way to build an extension was to package it as an AMP (Alfresco Module Package).  AMPs (mostly) act like overlays for the Alfresco WAR, putting components in the right places so that Alfresco knows where to find them.  Module JARs were first added in Alfresco 4.2 and have undergone significant improvements since introduction and are now referred to as “Simple Modules” in the documentation.  Generally, if your extension does not require a third party JAR, using a Simple Module is a much cleaner, easier way to package and deploy an extension to Alfresco.

Unit test!

This should go without saying, but it’s worth saying anyway.  In the long run unit testing will save time.  The Alfresco SDK has pretty good support for unit tests, and projects built using the All-In-One Archetype have an example unit test class included.

Be aware of your tree, depth and degree matter

At its core, the Alfresco repository is a big node tree.  Files and folders are represented by nodes, and the relationships between these things are modeled as parent-child or peer relationships between nodes.  The depth of this tree can affect application design and performance, as can the degree, or number of child nodes.  In short, it’s not usually a best practice to have a single folder that contains a huge number of children.  Not only can this make navigation difficult if using the Alfresco API, but it can also create performance troubles if an API call tries to list all of the children in a gigantic folder.  In the new REST APIs the results are automatically paged which mitigates this problem.

Use the Alfresco ServiceRegistry instead of injecting individual services

When using the Alfresco Java API and adding new Spring beans there are two ways to inject Alfresco services such as NodeService, PermissionService, etc.  You can inject the individual services that you need, or you can inject the ServiceRegistry that allows access to all services.  On the one hand, injecting individual services makes it easy to see from the bean definition exactly what services are used without going to the code.  On the other hand, if you need another service you need to explicitly inject it.  My preference is to simply inject the registry.  A second reason to inject the registry is that you’ll always get the right version of the service.  Alfresco has two versions of each service.  The first is the “raw” service, which has a naming convention that starts with a lower case letter.  An example is nodeService.  These services aren’t recommended for extensions.  Instead, Alfresco provides an AoP proxy that follows a naming convention that starts with an upper case letter, such as NodeService.  This is the service that is recommended and is the one that will be returned by the service registry.

Don’t:

Modify core Alfresco classes, web scripts, or Spring beans

Alfresco is an open source product.  This means that you can, in theory, pull down the source, modify it as much as you like and build your own version to deploy.  Don’t do that.  That’s effectively a fork, and at that point you won’t be able to upgrade.  Stick with documented extension points and don’t modify any of the core classes, Freemarker templates, Javascript, etc.  If you find you cannot accomplish a task using the defined extension points, contact Alfresco support or open a discussion on the community forum and get an enhancement request in so it can be addressed.  Speaking of extension points, if you want an great overview of what those are and when to use them, the documentation has you covered.

Directly access the Alfresco / Activiti database from other applications

Alfresco and Activiti both have an underlying database.  In general, it is not a best practice to access this database directly.  Doing so can exhaust connection pools on the DB side or cause unexpected locking behavior.  This typically manifests as performance problems but other types of issues are possible, especially if you are modifying the data.  If you need to update things in Alfresco or Activiti, do so through the APIs.

Perform operations that will create extremely large transactions

When Alfresco is building its index, it pulls transaction from the API to determine what has changed, and what needs to be indexed.  This works well in almost all cases.  The one case where it may become an issue is if an extension or import process has created a single transaction which contains an enormous change set.  The Alfresco Bulk Filesystem Import Tool breaks down imports into chunks, and each chunk is treated as a transaction.  Older Alfresco import / export functionality such as ACP did not do this, so importing a very large ACP file may create a large transaction.  When Alfresco attempts to index this transaction, it can take a long time to complete and in some cases it can time out.  This is something of an edge case, but if you are writing an extension that may update a large number of nodes, take care to break those updates down into smaller parts.

Code yourself into a corner with extensions that can’t upgrade

Any time you are building an extension for Alfresco or Activiti, pay close attention to how it will be upgraded.  Alfresco makes certain commitments to API stability, what will and will not change between versions.  Be aware of these things and design with that in mind.  If you stick with the public APIs in all of their forms, use the Alfresco SDK and package / deploy your extensions in a supported manner, most of the potential upgrade hurdles will already be dealt with.

Muck around with the exploded WAR file.  

This one should go without saying, but it is a bad practice to modify an exploded Java WAR file.  Depending on which application server you use and how it is configured, the changes you make in the exploded WAR may be overwritten by a WAR redeployment at the most inopportune time.  Instead of modifying the exploded WAR, create a proper extension and deploy that.  The SDK makes this quick and easy, and you’ll save yourself a lot of pain down the road.

This list is by no means comprehensive, just a few things that I’ve jotted down over many years of developing Alfresco extensions and helping my peers and customers manage their Alfresco platform.  Do you have other best practices?  Share them in the comments!