Laurence Hart recently published an article on CMSWiRE about AI and enterprise search that I found interesting. In it, he lays out some good arguments about why the expectations for AI and enterprise search are a bit overinflated. This is probably a natural part of the hype cycle that AI is currently traversing. While AI probably won’t revolutionize enterprise search overnight, it definitely has the potential to offer meaningful improvements in the short term. One area where I think we can get some easy improvements is using natural language processing to extract things that might be relevant to search, along with some context around those things. For example, it is handy to be able to search for documents that contain references to people, places, organizations, or specific dates using something more than a simple keyword search. It’s useful for your search to know the difference between the china you set on your dinner table and China the country, or between Alfresco the company and dining al fresco. Expanding on this work, it might also be useful to run sentiment analysis on a document, or to extract specific parts of it for automatic classification.
Stanford offers a set of tools to help with common natural language processing (NLP) tasks. The Stanford CoreNLP project consists of a framework and a variety of annotators that handle tasks such as sentiment analysis, part-of-speech tagging, lemmatization, named entity extraction, etc. My favorite thing about this particular project is how they have dropped the barrier to trying it out to zero. If you want to give the project a spin and see how it would annotate some text with the base models, Stanford helpfully hosts a version you can test out. I spent an afternoon throwing text at it, both bits I wrote and bits from my test document pool. At first glance it seems to do a pretty good job, even with nothing more than the base models loaded.
I’d like to prove out some of these concepts and explore them further, so I’ve started a simple project to connect Stanford CoreNLP with the Alfresco Content Services platform. The initial goals are simple: take text from a document stored in Alfresco, run it through a few CoreNLP annotators, extract data from the generated annotations, and store that data as Alfresco metadata. This will make annotation data such as named entities (dates, places, people, organizations) directly searchable via Alfresco Search Services. I’m starting with an Alfresco Repository Action that calls CoreNLP, since that will be easy to test on individual documents. It would be pretty straightforward to take this component and run it as a metadata extractor instead, which might make more sense in the long run. Like most of my Alfresco extension or integration projects, this roughly follows the Service Action Pattern.
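To make those goals concrete, the action could be shaped roughly like this. This is a hedged sketch rather than the actual PoC code: the class name, the `NlpService` wrapper interface, and the `namedEntities` property QName are all hypothetical stand-ins for whatever the real project uses.

```java
import java.io.Serializable;
import java.util.List;

import org.alfresco.model.ContentModel;
import org.alfresco.repo.action.executer.ActionExecuterAbstractBase;
import org.alfresco.service.cmr.action.Action;
import org.alfresco.service.cmr.action.ParameterDefinition;
import org.alfresco.service.cmr.repository.ContentReader;
import org.alfresco.service.cmr.repository.ContentService;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.NodeService;
import org.alfresco.service.namespace.QName;

// Hypothetical repository action: read a document's text content, send it
// through a CoreNLP-backed service, and store extracted entities as metadata.
public class NlpExtractAction extends ActionExecuterAbstractBase {

    // Hypothetical wrapper around the CoreNLP client.
    public interface NlpService {
        List<String> extractNamedEntities(String text);
    }

    // Hypothetical property from a custom content model.
    private static final QName PROP_NAMED_ENTITIES =
            QName.createQName("http://www.example.com/model/nlp/1.0", "namedEntities");

    private ContentService contentService;
    private NodeService nodeService;
    private NlpService nlpService;

    @Override
    protected void executeImpl(Action action, NodeRef actionedUponNodeRef) {
        ContentReader reader =
                contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
        if (reader == null || !reader.exists()) {
            return; // nothing to annotate
        }
        String text = reader.getContentString();
        // Assumes the service returns a Serializable list (e.g. an ArrayList).
        List<String> entities = nlpService.extractNamedEntities(text);
        nodeService.setProperty(actionedUponNodeRef, PROP_NAMED_ENTITIES,
                (Serializable) entities);
    }

    @Override
    protected void addParameterDefinitions(List<ParameterDefinition> paramList) {
        // No parameters in this sketch; annotator selection could be exposed here.
    }

    // Setters for Spring injection.
    public void setContentService(ContentService cs) { this.contentService = cs; }
    public void setNodeService(NodeService ns) { this.nodeService = ns; }
    public void setNlpService(NlpService nlp) { this.nlpService = nlp; }
}
```

Running this as an action means it can be wired to a folder rule or triggered manually on a single document, which keeps the feedback loop short while experimenting.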
Stanford CoreNLP makes the integration bits pretty easy. You can run CoreNLP as a standalone server, and the project helpfully provides a Java client (StanfordCoreNLPClient) that closely mirrors the annotation pipeline, so if you already know how to use CoreNLP locally, you can easily get it working from an Alfresco integration. This also helps with scalability: CoreNLP can be memory hungry, and running the NLP engine in a separate JVM or on a separate server from Alfresco definitely makes sense. It also makes sense to be judicious about which annotators you run, so that should be configurable in Alfresco. Likewise, it makes sense to limit the size of the text that gets sent to CoreNLP, so in the long term some pagination will probably be necessary to break large files into more manageable pieces. The CoreNLP project itself provides some great guidance on getting the best performance out of the tool.
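To illustrate, here’s a minimal sketch of the client-side usage. It assumes a CoreNLP server is already running at localhost:9000; the sample sentence, the two-thread setting, and the annotator list are placeholder choices for illustration, not the project’s actual configuration.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLPClient;
import edu.stanford.nlp.util.CoreMap;

public class CoreNlpClientSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Annotator order matters; each annotator's dependencies must come earlier.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");

        // Note the explicit http:// prefix; a bare host name is assumed to be HTTPS.
        StanfordCoreNLPClient pipeline =
                new StanfordCoreNLPClient(props, "http://localhost", 9000, 2);

        Annotation document = new Annotation("John Smith visited Paris in 2017.");
        pipeline.annotate(document);

        // Walk the annotated sentences and print each token with its NER tag.
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + "\t"
                        + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
            }
        }
    }
}
```

The nice part is that the annotation-reading code is identical whether you annotate with a local StanfordCoreNLP pipeline or the remote client, which is what makes the server deployment an easy swap.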
A few notes about using CoreNLP programmatically from other applications. First, if you provide just a host name (like localhost), CoreNLP assumes that you will be connecting via HTTPS. This will cause the StanfordCoreNLPClient to not respond if your server isn’t set up for it. Oddly, it also doesn’t seem to throw any kind of useful exception; it just sort of, well, stops. If you don’t want to use HTTPS, make sure to specify the protocol in the host name. Second, Stanford makes it pretty easy to use CoreNLP in your application by publishing it on Maven Central, but the model jars aren’t there. You’ll need to download those separately. Third, CoreNLP can use a lot of memory when processing large amounts of text. If you plan to do this kind of thing at any kind of scale, you’ll need to run the CoreNLP bits in a separate JVM, and possibly on a separate server. I can’t imagine that running Alfresco under load and CoreNLP in the same JVM would yield good results. Fourth, the client also has hefty memory requirements. In my testing, running the CoreNLP client in an Alfresco action with less than 2GB of memory caused out-of-memory errors when processing 5-6 pages of dense text. Finally, the pipeline that you feed CoreNLP is ordered. If you don’t have the right annotators in the right order, you won’t get the results you expect. Some annotators have dependencies, which aren’t always clear until you try to process some text and it fails. Thankfully, the error message will tell you which other annotators you need in the pipeline for it to work.
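Since that HTTPS-by-default behavior cost me some debugging time, a tiny guard in the integration code can make the intent explicit. This `withProtocol` helper is my own illustration, not part of the CoreNLP API; it assumes you want plain HTTP whenever no protocol is given.

```java
public class HostProtocol {
    // StanfordCoreNLPClient treats a bare host name (e.g. "localhost") as
    // HTTPS, so make the protocol explicit before handing the host over.
    public static String withProtocol(String host) {
        if (host.startsWith("http://") || host.startsWith("https://")) {
            return host;
        }
        return "http://" + host; // assumption: plain-HTTP CoreNLP server
    }

    public static void main(String[] args) {
        System.out.println(withProtocol("localhost")); // prints "http://localhost"
    }
}
```

With something like this in place, a misconfigured host fails loudly in your own code instead of silently stalling inside the client.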
After some experimentation I’m not sure that CoreNLP is really well suited for integration with a content services platform. I had hoped that when using StanfordCoreNLPClient to connect to a server, most of the processing would take place on the server and only the results would be returned, but that doesn’t appear to be the case. I still think that using NLP tools to enhance search has merit, though. If you want to play around with this idea yourself, you can find my PoC code on GitHub. It’s a toy at this point, but it might help others understand Alfresco, some intricacies of CoreNLP, or both. As a next step I’m going to look at OpenNLP and a few other tools to better understand both the concepts and the space.