(Possibly) Enhancing Alfresco Search Part 2 – Google Cloud’s Natural Language API


In the first article in this series, we took a look at using Stanford’s CoreNLP library to enrich Alfresco Content Services metadata with some natural language processing tools.  In particular, we looked at using named entity extraction and sentiment analysis to add some value to enterprise search.  As soon as I posted that article, several people got in touch to see if I was working on testing out any other NLP tools.  In part 2 of this series, we’ll take a look at Google Cloud’s Natural Language API to see if it is any easier to integrate and scale, and do a brief comparison of the results.

One little thing I discovered during testing that may be of note if anybody picks up the GitHub code and tries to do anything useful with it:  Alfresco and Google Cloud's Natural Language API client library can't play nice together due to conflicting dependencies on some of the Google components.  In particular, Guava is a blocker.  Alfresco ships with, and depends on, an older version.  Complicating matters further, the Guava APIs changed between the version Alfresco ships and the version the Google Cloud Natural Language API library requires, so it isn't as straightforward as grabbing the newer Guava library and swapping it out.  I have already had a quick chat with Alfresco Engineering and it looks like this is on the list to be solved soon.  In the meantime, I'm using Apache HttpClient to access the relevant services directly.  It's not quite as nice as the idiomatic approach that the Google Cloud SDK takes, but it will do for now.
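For anyone curious what that looks like, here is a minimal sketch of calling the v1 documents:analyzeEntities endpoint with Apache HttpClient.  Only the endpoint URL and request shape come from the Natural Language REST documentation; the class name, helper method and API-key handling are mine, invented for illustration.

```java
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class GoogleNlClient {

    // Hypothetical helper: calls the v1 analyzeEntities endpoint directly with HttpClient
    // to avoid pulling the Google Cloud SDK (and its newer Guava) into Alfresco.
    public static String analyzeEntities(String text, String apiKey) throws Exception {
        String url = "https://language.googleapis.com/v1/documents:analyzeEntities?key=" + apiKey;
        // Request body per the Natural Language REST API: a plain text document plus encoding type
        String body = "{ \"document\": { \"type\": \"PLAIN_TEXT\", \"content\": "
                + toJsonString(text) + " }, \"encodingType\": \"UTF8\" }";

        try (CloseableHttpClient http = HttpClients.createDefault()) {
            HttpPost post = new HttpPost(url);
            post.setEntity(new StringEntity(body, ContentType.APPLICATION_JSON));
            return EntityUtils.toString(http.execute(post).getEntity());
        }
    }

    // Naive JSON string escaping, good enough for a sketch; real code would use a JSON library
    private static String toJsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }
}
```

The same pattern works for the other endpoints this integration needs (analyzeSentiment, analyzeSyntax), so losing the SDK isn't a huge hardship.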

Metadata Enrichment and Extraction

The main purpose of these little experiments has been to assess how suitable each tool may be for using NLP to improve search.  This is where, I think, Google's Natural Language product could really shine.  Google is, after all, a search company (and yes, more than that too).  Google's entity analyzer not only plucks out all of the named entities, but it also returns a salience score for each.  The higher the score, the more important or central that entity is to the entire text.  The API also returns the number of proper noun mentions for that entity.  This seems to work quite well, and the salience score isn't based on just the number of mentions; during my testing I found several instances where the most salient entity was not the one mentioned most often.  Sorting by salience and making only the most relevant entities searchable metadata in Alfresco could be quite useful.  Say, for example, we are looking for documents about XYZ Corporation.  A simple keyword search would return every document that mentions that company, even if the document wasn't actually about it.  Searching only those documents where XYZ Corporation is the most salient entity (even if not the most frequently mentioned) would give us much more relevant results.
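As a rough illustration of that idea, the sketch below filters an analyzeEntities response by salience before anything gets written to Alfresco properties.  It assumes org.json is on the classpath, the threshold is something you would have to tune, and the class and method names are mine.

```java
import org.json.JSONArray;
import org.json.JSONObject;
import java.util.ArrayList;
import java.util.List;

public class SalienceFilter {

    // Hypothetical helper: pulls entity names from an analyzeEntities response and keeps
    // only those whose salience clears a threshold, so only the "about-ness" entities
    // become searchable Alfresco metadata.
    public static List<String> salientEntities(String analyzeEntitiesJson, double minSalience) {
        List<String> keep = new ArrayList<>();
        JSONArray entities = new JSONObject(analyzeEntitiesJson).optJSONArray("entities");
        if (entities == null) {
            return keep;
        }
        for (int i = 0; i < entities.length(); i++) {
            JSONObject entity = entities.getJSONObject(i);
            // salience runs from 0.0 to 1.0; higher means more central to the document
            if (entity.optDouble("salience", 0.0) >= minSalience) {
                keep.add(entity.getString("name"));
            }
        }
        return keep;
    }
}
```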

Sentiment analysis is another common feature in many natural language processing suites that may be useful in a content services search context.  For example, if you are using your content services platform to store customer survey results, chat transcripts or other documents that capture an interaction, you might want to find those that were strongly negative or positive to serve as training examples.  Another great use case exists in the process services world, where processes are likely to capture interactions in a more direct fashion.  Sentiment analysis is an area where Google's and CoreNLP's approaches differ significantly.  The Google Natural Language API provides two ways to handle sentiment analysis:  the first analyzes the overall sentiment of the provided text, and the second provides sentiment analysis related to identified entities within the text.  These are fairly simplistic compared with the full sentiment graph that CoreNLP generates.  Google scores sentiment along a scale of -1 to 1, with -1 being the most negative and 1 the most positive.
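For the document-level variant, the response carries a documentSentiment object with that score, which maps naturally onto a simple searchable property.  A small sketch follows (org.json again; the bucket cutoffs are arbitrary values of my own choosing).

```java
import org.json.JSONObject;

public class SentimentHelper {

    // Hypothetical helper: reads the document-level score from an analyzeSentiment response.
    // The score runs from -1.0 (very negative) to 1.0 (very positive).
    public static double documentScore(String analyzeSentimentJson) {
        JSONObject sentiment = new JSONObject(analyzeSentimentJson)
                .getJSONObject("documentSentiment");
        return sentiment.getDouble("score");
    }

    // Bucket the score into a coarse label that can be stored as a searchable
    // Alfresco property; the cutoffs here are arbitrary and would need tuning.
    public static String label(double score) {
        if (score <= -0.25) return "negative";
        if (score >= 0.25) return "positive";
        return "neutral";
    }
}
```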

Lower Level Features

At the core of any NLP tool are the basics of language parsing and processing such as tokenization, sentence splitting, part of speech tagging, lemmatization, dependency parsing, etc.  The Google Cloud NL API exposes all of these features through its syntax analysis API and the token object.  The object syntax is clear and easy to understand.  There are some important differences in the way these are implemented across CoreNLP and Google Cloud NL, which I may explore further in a future article.
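For example, each token in the analyzeSyntax response carries its text, part of speech tag and lemma, which a small sketch like the following can walk through (org.json again; the class and method names are mine).

```java
import org.json.JSONArray;
import org.json.JSONObject;

public class SyntaxHelper {

    // Hypothetical helper: walks the tokens array of an analyzeSyntax response and prints
    // each token's surface text, part-of-speech tag and lemma.
    public static void printTokens(String analyzeSyntaxJson) {
        JSONArray tokens = new JSONObject(analyzeSyntaxJson).getJSONArray("tokens");
        for (int i = 0; i < tokens.length(); i++) {
            JSONObject token = tokens.getJSONObject(i);
            String text = token.getJSONObject("text").getString("content");
            String tag = token.getJSONObject("partOfSpeech").getString("tag");
            String lemma = token.getString("lemma");
            System.out.printf("%s\t%s\t%s%n", text, tag, lemma);
        }
    }
}
```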

Different Needs, Different Tools

Google Cloud’s Natural Language product differs from CoreNLP in some important ways.  The biggest is simply the fact that one is a cloud service and one is traditionally released software.  This has its pros and cons, of course.  If you roll your own NLP infrastructure with CoreNLP (whether you do it on-premises or in the cloud) you’ll certainly have more control, but you’ll also be responsible for managing the thing.  For some use cases this might be the critical difference.  As best I can tell, Google doesn’t allow for custom models or annotators (yet).  If you need to train your own system or build custom stuff into the annotation pipeline, Google’s NLP offering may not work for you.  This is likely to be a shortcoming of many of the cloud-based NLP services.

Another key difference is language support.  CoreNLP ships with models for English, Arabic, Chinese, French, German and Spanish, but not all annotators work for all languages.  CoreNLP also has contributed models in other languages of varying completeness and quality.  Google Cloud’s NLP API has full-fledged support for English, Japanese and Spanish, with beta support for Chinese (simplified and traditional), French, German, Italian, Korean and Portuguese.  Depending on where you are and what you need to analyze, language support alone may drive your choice.

On the feature front there are also some key differences when you compare “out of the box” CoreNLP with the Google Cloud NL API.  The first thing I tested was entity recognition.  I have been doing a little testing with a collection of short stories from American writers, and so far both seem to do a fair job of recognizing basic named entities like people, places, organizations, etc.  Google’s API goes further though, and will recognize and tag things like the names of consumer goods, works of art, and events.  CoreNLP would take more work to do that sort of thing, since it isn’t handled by the models that ship with the code.  On sentiment analysis, CoreNLP is much more comprehensive (at least in my admittedly limited evaluation).

Scalability and ergonomics are also concerns.  If you plan to analyze a large amount of content there’s no getting around scale.  Without question, Google wins here, but at a cost.  The Cloud Natural Language API uses a typical utilization cost model:  the more you analyze, the more you pay.  Ergonomics is another area where Google Cloud NL has a clear advantage.  CoreNLP is a more feature-rich experience, and that shows in the model it returns.  The Google Cloud NL API just returns a logically structured JSON object, making it much easier to read and interpret the results right away.  There’s also the issue of interface.  CoreNLP relies on a client library.  The Google Cloud NL API is just a set of REST calls that follow the usual Google conventions and authentication schemes.  There has been some work to put a REST API on top of CoreNLP, but I have not tried that out.

The more I explore this space the more convinced I am that natural language processing has the potential to provide some significant improvements to enterprise content search, as well as to content and process analytics.


(Possibly) Enhancing Alfresco Search with Stanford CoreNLP


Laurence Hart recently published an article on CMSWire about AI and enterprise search that I found interesting.  In it, he lays out some good arguments about why the expectations for AI and enterprise search are a bit overinflated.  This is probably a natural part of the hype cycle that AI is currently traversing.  While AI probably won’t revolutionize enterprise search overnight, it definitely has the potential to offer meaningful improvements in the short term.  One of the areas where I think we can get some easy improvements is by using natural language processing to extract things that might be relevant to search, along with some context around those things.  For example, it is handy to be able to search for documents that contain references to people, places, organizations or specific dates using something more than a simple keyword search.  It’s useful for your search to know the difference between the china you set on your dinner table and China the country, or Alfresco the company vs eating outside.  Expanding on this work, it might also be useful to do some sentiment analysis on a document, or extract specific parts of it for automatic classification.

Stanford offers a set of tools to help with common natural language processing (NLP) tasks.  The Stanford CoreNLP project consists of a framework and a variety of annotators that handle tasks such as sentiment analysis, part of speech tagging, lemmatization, named entity extraction, etc.  My favorite thing about this particular project is how they have dropped the barrier to trying it out to practically zero.  If you want to give the project a spin and see how it would annotate some text with the base models, Stanford helpfully hosts a version you can test out.  I spent an afternoon throwing text at it, both bits I wrote and bits that came from my test document pool.  At first glance it seems to do a pretty good job, even with nothing more than the base models loaded.
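Running the same base models locally takes only a few lines.  A minimal sketch, assuming the CoreNLP jar and the English models jar are on the classpath (the example sentence is mine):

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class CoreNlpLocalExample {

    public static void main(String[] args) {
        // Annotator order matters: ner depends on tokenize, ssplit, pos and lemma.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Alfresco was founded by John Newton and John Powell in 2005.");
        pipeline.annotate(document);

        // Print each token with the named entity tag the base models assigned to it
        for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + " / " + token.ner());
        }
    }
}
```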

I’d like to prove out some of these concepts and explore them further, so I’ve started a simple project to connect Stanford CoreNLP with the Alfresco Content Services platform.  The initial goals are simple:  take text from a document stored in Alfresco, run it through a few CoreNLP annotators, extract data from the generated annotations, and store that data as Alfresco metadata.  This will make annotation data such as named entities (dates, places, people, organizations) directly searchable via Alfresco Search Services.  I’m starting with an Alfresco Repository Action that calls CoreNLP since that will be easy to test on individual documents.  It would be pretty straightforward to take this component and run it as a metadata extractor, which might make more sense in the long run.  Like most of my Alfresco extension or integration projects, this roughly follows the Service Action Pattern.
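The skeleton of that action looks roughly like the sketch below.  The property QName, the namespace and the extractPersons helper are all placeholders of my own; a real module would define the properties in a custom content model and delegate the extraction to CoreNLP.

```java
import org.alfresco.model.ContentModel;
import org.alfresco.repo.action.executer.ActionExecuterAbstractBase;
import org.alfresco.service.cmr.action.Action;
import org.alfresco.service.cmr.action.ParameterDefinition;
import org.alfresco.service.cmr.repository.ContentReader;
import org.alfresco.service.cmr.repository.ContentService;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.NodeService;
import org.alfresco.service.namespace.QName;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NlpEnrichmentAction extends ActionExecuterAbstractBase {

    // Hypothetical property; a real module would declare this in a custom content model
    private static final QName PROP_PERSONS =
            QName.createQName("http://www.example.com/model/nlp/1.0", "persons");

    private ContentService contentService;
    private NodeService nodeService;

    @Override
    protected void executeImpl(Action action, NodeRef nodeRef) {
        ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
        if (reader == null || !reader.exists()) {
            return;
        }
        String text = reader.getContentString();

        // Hand the text to the NLP engine (CoreNLP server, Google NL, ...) and store
        // the extracted entities as a searchable multi-valued property.
        List<String> persons = extractPersons(text);
        nodeService.setProperty(nodeRef, PROP_PERSONS, new ArrayList<>(persons));
    }

    @Override
    protected void addParameterDefinitions(List<ParameterDefinition> paramList) {
        // No parameters for this sketch; annotator selection could be exposed here.
    }

    // Placeholder for the actual NLP call
    private List<String> extractPersons(String text) {
        return Collections.emptyList();
    }

    public void setContentService(ContentService contentService) { this.contentService = contentService; }
    public void setNodeService(NodeService nodeService) { this.nodeService = nodeService; }
}
```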

Stanford CoreNLP makes the integration bits pretty easy.  You can run CoreNLP as a standalone server, and the project helpfully provides a Java client (StanfordCoreNLPClient) that somewhat closely mirrors the annotation pipeline, so if you already know how to use CoreNLP locally, you can easily get it working from an Alfresco integration.  This will also help with scalability, since CoreNLP can be memory hungry and running the NLP engine in a separate JVM or server from Alfresco definitely makes sense.  It also makes sense to be judicious about which annotators you run, so that should be configurable in Alfresco.  Similarly, it makes sense to limit the size of the text that gets sent to CoreNLP, so long term some pagination will probably be necessary to break large files into more manageable pieces.  The CoreNLP project itself provides some great guidance on getting the best performance out of the tool.

A couple of notes about using CoreNLP programmatically from other applications.  First, if you just provide a host name (like localhost), CoreNLP assumes that you will be connecting via HTTPS.  This will cause the StanfordCoreNLPClient to not respond if your server isn’t set up for it.  Oddly, it doesn’t seem to throw any kind of useful exception; it just sort of, well, stops.  If you don’t want to use HTTPS, make sure to specify the protocol in the host name.  Second, Stanford makes it pretty easy to use CoreNLP in your application by publishing it on Maven Central, but the model jars aren’t there.  You’ll need to download those separately.  Third, CoreNLP can use a lot of memory when processing large amounts of text.  If you plan to do this kind of thing at any kind of scale, you’ll need to run the CoreNLP bits in a separate JVM, and possibly on a separate server.  I can’t imagine that Alfresco under load and CoreNLP in the same JVM would yield good results.  Fourth, the client also has hefty memory requirements.  In my testing, running the CoreNLP client in an Alfresco action with less than 2GB of memory caused out of memory errors when processing 5-6 pages of dense text.  Finally, the pipeline that you feed CoreNLP is ordered.  If you don’t have the correct annotators in there in the right order, you won’t get the results you expect.  Some annotators have dependencies, which aren’t always clear until you try to process some text and it fails.  Thankfully the error message will tell you which other annotators you need in the pipeline for it to work.
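The first and last of those points are easy to show in code.  A minimal sketch of the client setup, with the host, port and thread count as example values only:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLPClient;
import java.util.Properties;

public class CoreNlpClientExample {

    public static void main(String[] args) {
        // Pipeline order matters: ner needs tokenize, ssplit, pos and lemma ahead of it.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");

        // Note the explicit http:// -- with a bare host name the client assumes HTTPS
        // and will quietly hang if the server isn't configured for it.
        StanfordCoreNLPClient pipeline =
                new StanfordCoreNLPClient(props, "http://localhost", 9000, 2);

        Annotation document = new Annotation("Alfresco Content Services stores the documents.");
        pipeline.annotate(document);

        for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + " / " + token.ner());
        }
    }
}
```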

After some experimentation I’m not sure that CoreNLP is really well suited for integration with a content services platform.  I had hoped that when using StanfordCoreNLPClient to connect to a server, most of the processing would take place on the server and only the results would be returned, but that doesn’t appear to be the case.  I still think that using NLP tools to enhance search has merit though.  If you want to play around with this idea yourself you can find my PoC code on Github.  It’s a toy at this point, but might help others understand Alfresco, some intricacies of CoreNLP, or both.  As a next step I’m going to look at OpenNLP and a few other tools to better understand both the concepts and the space.


Alfresco and Solr – Search, Reindexing and Index Cluster Size

A question came up from a colleague recently, driven by a customer question:  when is it appropriate to increase the number of Alfresco Index Servers running Solr?  The right direction depends on several factors and what exactly you are trying to achieve.  Like many questions related to Alfresco architecture, sizing and scalability, the answers can be found in Alfresco’s excellent whitepapers on the subject (full disclosure and shameless self-promotion: I wrote one of them).  Not only are there multiple reasons why you may need to scale up the search tier, there are a couple of different ways to go about doing it.  Hopefully this article will help lend a little clarity to a sometimes confusing topic.

A place to start

A typical customer configuration starts with a number of index servers that roughly matches the number of repository cluster nodes.  The index servers sit behind a load balancer and provide search services to the repository tier.  Each index server maintains its own copy of the index, providing full failover.  It’s common to see a small to medium enterprise deployment running on a 2X2 configuration.  That is, two repository tier servers and two index servers, the minimum for high availability.  As the repository grows, user patterns change or the system is prepared for upgrade, this can prove insufficient from a search and indexing point of view.

Large repositories

When Alfresco first ran our 1B document benchmark we set a target of about 50M documents indexed per index server.  So, for our 1B document environment, we had 20 index shards, each containing ~50M docs.  Our testing shows that the system gives solid, predictable performance at this level.  For large repositories, this is a good starting point for planning purposes.  If you have a lighter indexing requirement per document (for example, a small metadata set or no full text indexing) it is likely possible to go higher.  Conversely, if your requirements call for full text indexing of large documents and you have a large, complex metadata set, a smaller number of documents per shard is more appropriate.  Either way, as a repository grows larger, at some point you will need to consider adding additional index servers.  As with all things related to scale, take vendor recommendations as a starting point, then test and monitor.
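To make that starting point concrete with a rough, made-up example:  a 200M document repository at ~50M documents per shard suggests four shards, and keeping two copies of each shard for failover means eight shard instances spread across the index tier.  A light, metadata-only indexing requirement might let you push well past 50M per shard, while heavy full text indexing of large documents pushes you the other way.  Treat the arithmetic as a first guess to validate with testing, not as a sizing rule.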

Heavy search

Some ECM use cases lean heavily on search services.  For these cases it makes sense to deploy additional index servers to handle the load.  Spreading search requests across a larger number of servers does not improve single transaction performance, but it does allow more concurrent searches to complete quickly.  If your use case relies heavily on search, then you may need to consider adding additional index servers to satisfy those requests.  For this specific case, both sharding and replication can be appropriate, since both allow you to spread your search load across multiple systems.  So how do you choose?  In most cases sharding is the better option.  It is more flexible and has additional benefits, as we will outline in the next section.

If your repository is relatively small (less than 50M documents or so) and you are primarily concerned with search performance, replication can be a good option.  Replication sets up your index servers so that only one is actually tracking the repository and building the index.  This master node then replicates its index out to one or more slaves that are used to service search requests.  The advantage of this configuration is that DB pressure is reduced by having only one index server tracking the repository, and you now have multiple servers with a copy of that index to service search requests.  The downside is that it has a relatively low upper limit of scalability, and it introduces a single point of failure for index tracking.  That’s not such a huge problem though; if the tracking server stops working you can always spin up another and re-seed it with a copy of the index from the slaves.  A replicated scenario may also increase the index lag time (the time between adding a document and it appearing in the index) slightly, since changes must first be written to the master index and then replicated out to the slaves.  Real world testing shows that this delay is minimal, but it is present.

Reindexing and upgrades

There is another case where you may want to consider adding additional index servers, and that is when reindexing the repository or upgrading to a new version of Alfresco.  Alfresco has supported multiple versions of Solr over the years.  Alfresco 4.x used Solr 1.4, 5.0/5.1 use Solr 4, and the upcoming 5.2 release can use Solr 6.  Newer versions of Solr bring great new features and performance improvements, so customers are always eager to upgrade!  There is one thing to look out for though:  reindexing times.  Switching from one version of Solr to another does require that the repository be reindexed.  For very large repositories this can take some time.  This is where sharding is especially helpful.  By breaking the index into pieces (shards) we can parallelize the reindexing process and allow it to complete more quickly.  The fewer documents an individual shard has to reindex, the faster it will finish (within reason; 10 docs per shard or something would be ridiculous).  So if you are considering an Alfresco upgrade and are worried about reindexing times, consider additional index servers to speed things along.  Note that most Alfresco upgrades do not require you to switch versions of Solr immediately.  You can continue to run your server on the old index while the new index builds, but during this time you cannot take advantage of Alfresco features that depend on the new index.

Conclusion

This list is by no means comprehensive, but it does outline the three most common reasons I have seen customers add additional index servers.  Have you seen others?  Comment below, I’d love to hear about it!