In the first article in this series, we took a look at using Stanford’s CoreNLP library to enrich Alfresco Content Services metadata with some natural language processing tools. In particular, we looked at using named entity extraction and sentiment analysis to add some value to enterprise search. As soon as I posted that article, several people got in touch to see if I was working on testing out any other NLP tools. In part 2 of this series, we’ll take a look at Google Cloud’s Natural Language API to see if it is any easier to integrate and scale, and do a brief comparison of the results.
One little thing I discovered during testing may be of note if anybody picks up the Github code to try to do anything useful with it: Alfresco and Google Cloud's Natural Language API client library can't play nice together due to conflicting dependencies on some of the Google components. In particular, Guava is a blocker: Alfresco ships with and depends on an older version. Complicating matters further, the Guava APIs changed between the version Alfresco ships and the version the Google Cloud Natural Language client library requires, so it isn't as straightforward as grabbing the newer Guava library and swapping it out. I have already had a quick chat with Alfresco Engineering, and it looks like this is on the list to be solved soon. In the meantime, I'm using Apache HttpClient to access the relevant services directly. It's not quite as nice as the idiomatic approach the Google Cloud SDK takes, but it will do for now.
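The direct-call approach is simple enough in any HTTP client. As a rough sketch (in Python rather than the Java/HttpClient code in the repo), here is what building and sending a request to the documented `documents:analyzeEntities` endpoint looks like; the API key handling is an assumption for illustration, and a real deployment would use a valid key or OAuth token:

```python
import json
from urllib import request

ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeEntities"

def build_request(text):
    """Build the JSON body documented for documents:analyzeEntities."""
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }

def analyze_entities(text, api_key):
    """POST the document to the Natural Language API and return the parsed JSON."""
    body = json.dumps(build_request(text)).encode("utf-8")
    req = request.Request(
        f"{ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

The other analysis methods (`analyzeSentiment`, `analyzeSyntax`, and so on) take the same document payload against their own endpoints, which keeps the direct-REST workaround small.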
Metadata Enrichment and Extraction
The main purpose of these little experiments has been to assess how suitable each tool may be for using NLP to improve search. This is where, I think, Google's Natural Language product could really shine. Google is, after all, a search company (and yes, more than that too). Google's entity analyzer not only plucks out all of the named entities, but it also returns a salience score for each. The higher the score, the more important or central that entity is to the entire text. The API also returns the number of proper noun mentions for each entity. This seems to work quite well, and the salience score isn't driven by mention count alone: during my testing I found several instances where the most salient entity was not the one mentioned most often. Sorting by salience and making only the most relevant entities searchable metadata in Alfresco would be useful. Say, for example, we are looking for documents about XYZ Corporation. A simple keyword search would return every document that mentions that company, even if the document wasn't actually about it. Searching only those documents where XYZ Corporation is the most salient entity (even if not the most frequently mentioned) would give us much more relevant results.
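To make that concrete, here is a minimal sketch of filtering an `analyzeEntities` response by salience before indexing. The response shape follows the documented API, but the entity names, salience values, and the `min_salience` threshold are made-up illustrations; note that the second entity has more proper-noun mentions yet lower salience:

```python
# Illustrative response shape from documents:analyzeEntities;
# names and salience values here are invented for the example.
response = {
    "entities": [
        {"name": "XYZ Corporation", "type": "ORGANIZATION", "salience": 0.62,
         "mentions": [{"type": "PROPER"}] * 2},
        {"name": "John Smith", "type": "PERSON", "salience": 0.21,
         "mentions": [{"type": "PROPER"}] * 5},
    ]
}

def top_entities(resp, min_salience=0.3):
    """Keep only entities salient enough to index as searchable metadata."""
    ranked = sorted(resp["entities"], key=lambda e: e["salience"], reverse=True)
    return [e["name"] for e in ranked if e["salience"] >= min_salience]

print(top_entities(response))  # ['XYZ Corporation']
```

With a cutoff like this, a document that merely name-drops XYZ Corporation never gets tagged with it, while a document genuinely about the company does.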
Sentiment analysis is another common feature in many natural language processing suites that may be useful in a content services search context. For example, if you are using your content services platform to store customer survey results, transcripts of chats, or other documents that capture an interaction, you might want to find those that were strongly negative or positive to serve as training examples. Another great use case exists in the process services world, where processes are likely to capture interactions in a more direct fashion. Sentiment analysis is an area where Google's and CoreNLP's approaches differ significantly. The Google Natural Language API provides two ways to handle sentiment analysis: the first analyzes the overall sentiment of the provided text, and the second provides sentiment analysis for each identified entity within the text. These are fairly simplistic compared with the full sentiment graph that CoreNLP generates. Google scores sentiment along a scale of -1 to 1, with -1 being the most negative and 1 the most positive.
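Flagging the strongly negative or positive documents is then a simple threshold over that score. A small sketch, using the documented `documentSentiment` field of an `analyzeSentiment` response; the score value and the 0.6 cutoff are illustrative assumptions:

```python
# Sample shape of a documents:analyzeSentiment response; score runs
# from -1.0 (most negative) to 1.0 (most positive).
survey = {"documentSentiment": {"score": -0.8, "magnitude": 3.4}}

def label_sentiment(resp, threshold=0.6):
    """Bucket a document by its overall sentiment score."""
    score = resp["documentSentiment"]["score"]
    if score <= -threshold:
        return "strongly negative"
    if score >= threshold:
        return "strongly positive"
    return "neutral/mixed"

print(label_sentiment(survey))  # strongly negative
```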
Lower Level Features
At the core of any NLP tool are the basics of language parsing and processing such as tokenization, sentence splitting, part of speech tagging, lemmatization, dependency parsing, etc. The Google Cloud NL API exposes all of these features through its syntax analysis API and the token object. The object syntax is clear and easy to understand. There are some important differences in the way these are implemented across CoreNLP and Google Cloud NL, which I may explore further in a future article.
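For a sense of how readable the token object is, here is a trimmed-down pair of tokens in the shape `analyzeSyntax` returns (field names follow the documented API; the sentence and values are my own example), with the lemma and part-of-speech tag pulled straight off each token:

```python
# Trimmed-down token objects as returned by documents:analyzeSyntax
# for the sentence "dogs bark"; only a few documented fields shown.
tokens = [
    {"text": {"content": "dogs"}, "lemma": "dog",
     "partOfSpeech": {"tag": "NOUN"},
     "dependencyEdge": {"headTokenIndex": 1, "label": "NSUBJ"}},
    {"text": {"content": "bark"}, "lemma": "bark",
     "partOfSpeech": {"tag": "VERB"},
     "dependencyEdge": {"headTokenIndex": 1, "label": "ROOT"}},
]

# Tokenization, lemmatization, and POS tagging in one pass over the list.
lemmas = [(t["text"]["content"], t["lemma"], t["partOfSpeech"]["tag"])
          for t in tokens]
print(lemmas)
```

The `dependencyEdge` field carries the dependency parse as well, so one syntax call covers most of the lower-level features in a single flat structure.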
Different Needs, Different Tools
Google Cloud’s Natural Language product differs from CoreNLP in some important ways. The biggest is simply the fact that one is a cloud service and one is traditionally released software. This has its pros and cons, of course. If you roll your own NLP infrastructure with CoreNLP (whether you do it on-premises or in the cloud) you’ll certainly have more control, but you’ll also be responsible for managing the thing. For some use cases this might be the critical difference. As best I can tell, Google doesn’t allow for custom models or annotators (yet). If you need to train your own system or build custom components into the annotation pipeline, Google’s NLP offering may not work for you. This is likely to be a shortcoming of many of the cloud-based NLP services.
Another key difference is language support. CoreNLP ships models for English, Arabic, Chinese, French, German and Spanish, but not all annotators work for all languages. CoreNLP also has contributed models in other languages of varying completeness and quality. Google Cloud’s NLP API has full-fledged support for English, Japanese and Spanish, with beta support for Chinese (simplified and traditional), French, German, Italian, Korean and Portuguese. Depending on where you are and what you need to analyze, language support alone may drive your choice.
On the feature front there are also some key differences when you compare “out of the box” CoreNLP with the Google Cloud NL API. The first thing I tested was entity recognition. I have been doing a little testing with a collection of short stories from American writers, and so far both seem to do a fair job of recognizing basic named entities like people, places, organizations, etc. Google’s API goes further, though, and will recognize and tag things like the names of consumer goods, works of art, and events. CoreNLP would take more work to do that sort of thing; it isn’t handled by the models that ship with the code. On sentiment analysis, CoreNLP is much more comprehensive (at least in my admittedly limited evaluation).
Scalability and ergonomics are also concerns. If you plan to analyze a large amount of content there’s no getting around scale. Without question, Google wins, but at a cost: the Cloud Natural Language API uses a typical usage-based cost model, so the more you analyze, the more you pay. Ergonomics is another area where Google Cloud NL has a clear advantage. CoreNLP is a more feature-rich experience, and that shows in the model it returns. The Google Cloud NL API just returns a logically structured JSON object, making it much easier to read and interpret the results right away. There’s also the issue of interface. CoreNLP relies on a client library. The Google Cloud NL API is just a set of REST calls that follow the usual Google conventions and authentication schemes. There has been some work to put a REST API on top of CoreNLP, but I have not tried that out.
The more I explore this space the more convinced I am that natural language processing has the potential to provide some significant improvements to enterprise content search, as well as to content and process analytics.