Alfresco Javascript API and AWS Lambda Functions. Part 1, Lambda and Cloud9 101


I’ve written before about several ways to make use of AWS Lambda functions within your Alfresco Content and Process Services deployment.  In preparation for my DevCon talk, I’m diving deeper into this and building out some demos for the session.  I figured this is a good time to do a quick writeup on one way to approach the problem.

What is a Lambda function?

Briefly, AWS Lambda is a “serverless” compute service.  Instead of the old model of provisioning an EC2 instance, loading software and services, etc, Lambda allows you to simply write code against a specific runtime (Java, Node, Python, or .NET) that is executed when an event occurs.  This event can come from a HUGE number of sources within the AWS suite of services.  When your function is called, it is spun up in a lightweight, configurable container and run.  That container may or may not be used again, depending on several factors.  AWS provides some information about what happens under the hood, but the idea is that most of the time you don’t need to sweat it.  In summary, a Lambda function is just a bit of code that runs in response to a triggering event.

Preparing the Lambda package

Creating a Lambda function through the AWS UI is trivial.  A few clicks, a couple form fields, and you’re done.  This is fine for simple functions, but what about when you need to use an external library?  The bad news is that this takes a couple extra steps.  The good news is that once you have done it, you can move on to a more productive workflow.  The sticking point with doing it all through the AWS console is the addition of libraries and dependencies.  We can get around that by using a Zip package to start our project.  The Zip package format is pretty simple.  Let’s create one; we’ll call it AlfrescoAPICall.

Start by creating an empty directory for your project, and changing into that directory:

mkdir AlfrescoAPICall

cd AlfrescoAPICall

Next, create a handler for your Lambda function.  The default name for the handler is index.js, but you can change it so long as you configure your Lambda function appropriately.

touch index.js
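
For reference, here is a minimal sketch of what index.js could contain to start with.  It assumes the callback-style Node.js handler signature, and the `name` field on the event is a made-up example rather than anything AWS sends you.

```javascript
// index.js: minimal Lambda handler sketch (callback style).
// The `name` event field is hypothetical, used only for illustration.
const handler = (event, context, callback) => {
  const name = (event && event.name) || 'world';
  // First argument is an error (or null); second is the result handed back to the caller.
  callback(null, { message: `Hello, ${name}!` });
};

// Lambda looks up the exported function named in the function's handler setting.
exports.handler = handler;
```

With the default configuration the handler setting would be `index.handler`, meaning the `handler` export of index.js.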

Now, use npm to install the modules you need into your project directory.  For this example, we’ll use alfresco-js-api.

npm install alfresco-js-api

A real project probably wouldn’t just install the needed modules piecemeal; it makes more sense to define all of the dependencies in package.json instead.  Regardless, at the end of this you should have a project root folder that contains your Lambda function handler, and a node_modules directory that contains all of your dependencies.  Next, we need to Zip this up into a Lambda deployment package.  Don’t Zip up the project folder itself; we need to Zip up the folder contents instead.

zip -r .

And that’s it!  The resulting Zip archive is the Lambda package that we’ll upload to AWS so we can get to work.  We don’t need to do this again unless the dependencies or versions change.

Getting it into AWS

There are a few ways to get our newly created deployment package up to AWS.  It can be done using the AWS CLI, or with the AWS console.  If you do it via the CLI, it will look something like this:

aws lambda update-function-code --function-name AlfrescoAPICall --zip-file

If you do it via the AWS console, you can simply choose the file to upload.


Regardless of how you update your Lambda function, once you have uploaded your Zip archive you can see the entire project in the “edit code inline” view in the Lambda page.


Woo hoo!  Now we have a skeleton Lambda project that includes the Alfresco Javascript Client API and we can start building!

Starting development in earnest

This process is simple, but it isn’t exactly the easiest way to build in AWS.  With the relaunch of Cloud9 at re:Invent, AWS has a pretty good web-based IDE that we can use for this project.  I’m not going to go through all the steps of creating a Cloud9 environment, but once you have it created you should see your newly created Lambda function in the right-hand pane under the AWS Resources tab, listed under Remote Functions.  If you don’t, make sure the IAM account you are using with Cloud9 (not the root account!!) has access to your function.  Here’s where it gets really cool.  Right click on the remote function, and you can import the whole thing right into your development environment.



Neat, right?  After you import it you’ll see it show up in the project explorer view on the left.  From here it is basically like any other IDE.  Tabbed code editor, tree views, syntax highlighting and completion, etc, etc.


One cool feature of Cloud9 is the ability to test run your Lambda function locally (in the EC2 instance Cloud9 is connected to) or on the AWS Lambda service itself by picking the option from the menu in the run panel.  As one would expect, you can also set breakpoints in your Lambda function for debugging.


Finally, once you are done with your edits and have tested your function to your satisfaction, getting the changes back up to the Lambda service is trivial.  Save your changes, right click on the local function, and select deploy.


Simple, right?  Now we have a working Lambda function with the Alfresco Javascript Client API available, and we can start developing something useful!  In part two, we’ll continue by getting Alfresco Process Services up and running in AWS and extend this newly created function to interact with a process.


Alfresco DevCon is Back in 2018


Alfresco DevCon has always been my favorite of the many events that we host around the world.  Unfortunately it has been on a bit of a hiatus lately as we explored other ways to connect with the many personas that we work with every day.  Good news though, it’s back in 2018!  Alfresco brings to market an awesome platform for accelerating digital business, and you can’t be a platform company without giving your developers a big friendly hug (and tools, and best practices, and a whole host of other things).  This is THE best opportunity to come and interact with all of the extended Alfresco developer family.  I’ve had a sneak peek at the presenter list and it’s an incredibly diverse group pulled from many of our stellar customers, partners, community members, and of course, a healthy dose of Alfrescans from across the technical parts of the company.

In what turned out to be a happy coincidence, I received the notice that my talks were accepted on my birthday.  Talk about a great present!  I’ve signed up to do two talks.  The first is a lightning talk on using natural language processing techniques to improve the quality of enterprise search.  The lightning talk got a shot in the arm with the recent announcement of AWS Comprehend.  Originally this talk was just going to talk about some on-premises offerings as well as Google Cloud NLP.  I’m excited to play around with the new AWS service and see how it stacks up.

The second talk I’m going to do in Portugal is a full length presentation on using Alfresco Process Services with AWS IoT.


Originally the IoT talk was going to use a custom-built device, probably based on a Raspberry Pi or an Arduino.  However, when one of these little guys showed up in the mail I decided to change it up a bit and use Amazon hardware.


If these topics sound familiar, they should, because I’ve blogged about both of them in the past.  Alfresco DevCon provides a great opportunity to build on and refine those ideas and vet them with a highly technical audience.  Hopefully the talks will spark some interesting conversations afterward.

I can’t overstate how happy I am that DevCon is back, and I can’t wait to see all of you in Lisbon in January!



Custom Angel’s Envy Bourbon Preamp Build

Bourbon whiskey has been surging in popularity in recent years, resulting in shortages of some popular brands, long lines, and camping for special releases.  Crazy.  Along with the newfound popularity of bourbon has come new bottlers and distilleries experimenting with cask finishes in much the same way the Scots have been doing for ages.  One notable brand doing that sort of experiment is Angel’s Envy.  Their flagship product is a port barrel finished bourbon, offered both as a regular bottling and a once a year cask strength release.  Several years ago a friend got me a bottle of the cask strength version which was thoroughly enjoyed.  With the whiskey gone, what to do with the awesome rough cut wooden box it came in?  Seems a shame to just throw it away.  Let’s upcycle it into something cool instead!

I’m a bit of a hobbyist maker, and have done a lot of woodworking and electronics projects over the years.  Languishing in my parts bin were most of the components I bought for a tube preamp project I started but never finished for lack of a good enclosure.  I bought the transformer and power supply from eBay as a kit, and sourced the tubes, tube shields, sockets, connectors, and passive components from a number of other places.  A little test fitting confirmed that there was room in the Angel’s Envy box to house all of the bits and pieces if I mounted the tubes and transformer on the outside.

Cooling came up as a possible problem, but with the tubes on the outside and proper ventilation to get some airflow going around the power supply it should be OK.  I’ll stick a digital thermometer in there the first time it runs just to be sure.  If it is too hot, I can always route out the bottom of the case and install a quiet little fan.  A set of conical feet turned from cocobolo wood raise the box up to allow that airflow configuration to work.  Given that the enclosure is wood it can’t be used for grounding like a metal enclosure, but that’s easy enough to solve.  Another problem might be interference since a wood case does not provide any shielding.  A layer of adhesive metal foil applied to the inside should take care of that.

With parts and design in hand, I headed down to Red Mountain Makers to lay out all of the components and get the holes drilled.  One downside to condo life is that we don’t have room for a drill press and the holes needed to be more precise than I can do with a hand drill.  Thankfully our local makerspace has everything I needed and more.  As an aside, if you have a local makerspace, find it, join it and use it.  Not only will you get access to the tools you need, but you’ll find friendly people with boatloads of expertise too.  Many of them are entirely volunteer run and they depend on regular memberships to stay up and running.  After a quick trip I got the transformer, tube sockets, shields and tubes all test fit and mounted.  It will have to come apart later as everything gets wired up, but better to try the fit first before I get to soldering it all together.

There are a few parts still on order, such as a fancy chrome cover for the transformer, some internal bits and pieces and such, but it is coming along nicely.  I’ll post an update with more pictures once it is all completed, but I’m too excited about the progress to not share a bit now!

What My Inbox Says About the State of Content Management

Screen Shot 2017-09-15 at 5.44.20 PM

I hit a major milestone today.  10000 unread messages in my inbox.  Actually, 10001 since one more came in between the time I took that screenshot and the time I started writing this article.  People tend to notice big round numbers, so when I logged in and saw that 10k sitting there I had a moment of crisis.  Why did it get that bad?  How am I ever going to clean up that pile of junk?  Am I just a disorganized mess?  Should somebody that lets their inbox get into that state ever be trusted with anything again?  It felt like failure.

Is it failure, or does this indicate that the shift from categorization to search as the dominant way to find things has slowly become complete enough that I no longer really care (or more to the point, that I don’t need to care) how many items sit in the bucket?  I think it is the latter.

Think back to when you first started using email.  Maybe take a look at your corporate mail client, which likely lags the state of the art in terms of functionality (or in case you are saddled with Lotus Notes, far, FAR behind the state of the art).  Remember setting up mail rules to route stuff to the right folders, or regularly snipping off older messages at some date milestone and stuffing them into a rarely (never?) touched archive just in case?  Now think about the way you handle your personal mail, assuming that you are using something like Gmail or another modern incarnation.  Is that level of categorization and routing and archiving still necessary?  No?  Didn’t think so.  Email, being an area of fast moving improvement and early SaaS colonization, crossed this search threshold quite some time ago.  Systems that deal with more complex content in enterprise contexts took (and are still taking) a bit longer.  Bruce Schneier talks a bit about this toward the beginning of his book Data and Goliath where he states “for me, saving and searching became easier than sorting and deleting”.  By the by, Data and Goliath is a fantastic book, and I highly recommend you give it a read if you want to find yourself terrified by what is possible with a hoard of data.

So, what does this have to do with content management systems?  A lot, actually.

One of my guiding principles for implementing content management systems is to look for the natural structure of the content.  Are there common elements that suggest a structure that minimizes complexity?  Are there groupings of related content that need to stay together?  How are those things related?  Is there a rigid taxonomy at work or is it more ad-hoc?  Are there groups of metadata properties that cut across multiple types of content?  What constructs does your content platform support that align with the natural structure of the content itself?  From there you can start to flesh out the other concerns such as how other applications will access it and what type of things those applications expect to get back.  The takeaway here is to strike a balance between the intrinsic structure of what you have (if it even has any structure at all), and how people will use it.

I’ve written previously about Alfresco’s best practices, and one of the things that has always been considered to be part of that list is paying attention to the depth and degree of your graph.  Every node (a file, a folder, etc) in Alfresco has to have a parent (except for the root node), and it was considered a bad practice to simply drop a million objects as children to a single parent.  A better practice was to categorize these and create subcontainers so that no single object ended up with millions of children.  For some use cases this makes perfect sense, such as grouping documents related to an insurance claim in a claim folder, or HR documents in a folder for each person you employ, or grouping documents by geographical region, or per-project folders, etc.

Recently though, I have seen more use cases from customers where that model feels like artificially imposing a structure on content where no such structure exists.  Take, for example, policy documents.  These are likely to be consistent, perhaps singular documents with no other content associated with them.  They have a set of metadata used as common lookup fields like names, policy numbers, dates in force, etc.  Does this set of content require a hierarchical structure?  You could easily impose one by grouping policy docs by date range, or by the first character or two of the policy holder’s last name, but does that structure bring any value whatsoever that the metadata doesn’t?  Does adding that structure bring any value to the repository or the applications that use it? I don’t think it does.  In fact, creating structures that are misaligned to the content and the way it is used can create unnecessary complexity on the development side.  Even for the claim folder example above, it might make the most sense to just drop all claim folders under a common parent and avoid creating artificial structure where no intrinsic structure exists.  Similar to the inbox model, save and search can be more efficient.

Can you do this with Alfresco?  We have done some investigation and the answer appears to be “yes”, with some caveats.  We have several customers successfully using large collections of objects, and as long as they stay between some reasonable guardrails it works.  First, make sure that you aren’t creating a massive transaction when moving content under a single parent.  This is usually a concern during migrations of large collections.  One nice side-effect of breaking content down into smaller containers is that the same tools that do that usually help you to avoid creating massive single transactions that can manifest as indexing delays later.  Second, make sure you are accessing these large collections in a reasonable way.  If you request all the children of a parent and those children number in the millions, you’re going to have a bad time.  Use pagination to limit the number of results to something that you can reasonably handle.  You can do this easily with most of Alfresco’s APIs, including CMIS.  Even better, only retrieve those objects that you need by taking advantage of metadata search.  Finally, don’t try to browse those folders with extremely large numbers of children.  Share can load more than it needs when loading up folder content in the document library, which may cause a problem.  Really though, what value is there in trying to browse a collection that large?  Nobody is going to look past the first page or two of results anyway.
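
To make the pagination point concrete, here is a rough sketch of walking a huge folder one page at a time.  The `skipCount` and `maxItems` names mirror the paging parameters Alfresco’s APIs use, but `fetchPage` is a hypothetical function that you would implement yourself with whichever API you prefer, and the in-memory folder is only a stand-in.

```javascript
// Walk a large folder page by page instead of requesting every child at once.
// `fetchPage(skipCount, maxItems)` is hypothetical: implement it with your API of
// choice so that it resolves to { items: [...], hasMoreItems: boolean }.
async function forEachChild(fetchPage, maxItems, onItem) {
  let skipCount = 0;
  let hasMoreItems = true;
  while (hasMoreItems) {
    const page = await fetchPage(skipCount, maxItems);
    for (const item of page.items) {
      await onItem(item);
    }
    skipCount += page.items.length;
    // Stop on an empty page as well, so a misbehaving source cannot loop forever.
    hasMoreItems = page.hasMoreItems && page.items.length > 0;
  }
  return skipCount; // total number of children visited
}

// In-memory stand-in for a repository folder with 25 children.
const fakeChildren = Array.from({ length: 25 }, (_, i) => `doc-${i}`);
const fakeFetchPage = async (skipCount, maxItems) => ({
  items: fakeChildren.slice(skipCount, skipCount + maxItems),
  hasMoreItems: skipCount + maxItems < fakeChildren.length,
});
```

Calling `forEachChild(fakeFetchPage, 10, console.log)` visits the children in three requests of at most ten items each, so memory use stays bounded no matter how big the folder gets.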

So there you (mostly) have it.  Listen to your content, listen to your users, listen to your developers, and don’t try to build structure where it doesn’t exist already.  Search is your friend.

Footnote:  When I posted the size of my unread inbox to Facebook, people had one of two reactions.  The first was “Heh, amateur, I have 30K unread in mine”.  The second was a reaction of abject horror that anybody has an inbox in that state.  Seems the “sort and delete” method still has its followers!

Spinning up the SmarterBham project at Code for Birmingham


There are few things that get my inner geek as excited as the intersection of technology and the public sphere.  We have only begun to scratch the surface of the ways that technology can improve governance, and the ways we can put the power to transform the places people live in the hands of those people themselves.  This sort of civic hacking has been promoted by groups like Code for America for some time.  Code for America is loosely organized into “brigades” that service a particular city.  These independent units operate all over the US, and have gone worldwide.  Like any town worth its salt, Birmingham has its own brigade.  I first became aware of it back in 2015, attended a few meetings and then it fell off my radar.  The group produced a lot of valuable work, including an app for spotting and reporting potholes, contributions to open data policies, and traffic accident analysis.

For about a year now I’ve grown increasingly interested in building IoT devices for monitoring various aspects of city life.  My first project was an air quality monitor (which is still up and running!).  At the same time I got interested in The Things Network and other ways citizens can participate and own the rollout of IoT projects at scale.  The price of technology has dropped so far and connectivity has become so ubiquitous that it is entirely feasible for a group of dedicated people to roll out their own IoT solutions with minimal monetary investment.

When these two things collided, something started happening.  Some of the folks at Code for Birmingham got excited.  I got excited.  Community partners got excited.  We made a plan.  We designed some things.  We ordered parts.  We started coding.  We made a pitch deck (because of course you need a pitch deck).  We applied for grants.  We built a team.  A couple months down the road we’re making serious progress.  One of our team members has made huge strides in building a prototype.  Another has started on our AWS templates.  We’re getting there.

Take a look at what we’re building and if you want to be a part of something awesome, get in touch.  We need designers, coders, CAD gurus, testers, writers, data wizards, and of course, some dreamers.  All are welcome.


(Possibly) Enhancing Alfresco Search Part 2 – Google Cloud’s Natural Language API


In the first article in this series, we took a look at using Stanford’s CoreNLP library to enrich Alfresco Content Services metadata with some natural language processing tools.  In particular, we looked at using named entity extraction and sentiment analysis to add some value to enterprise search.  As soon as I posted that article, several people got in touch to see if I was working on testing out any other NLP tools.  In part 2 of this series, we’ll take a look at Google Cloud’s Natural Language API to see if it is any easier to integrate and scale, and do a brief comparison of the results.

One little thing I discovered during testing that may be of note if anybody picks up the Github code to try to do anything useful with it:  Alfresco and Google Cloud’s Natural Language API library can’t play nice together due to conflicting dependencies on some of the Google components.  In particular, Guava is a blocker.  Alfresco ships with and depends on an older version.  Complicating matters further, the Guava APIs changed between the version Alfresco ships and the version that the Google Cloud Natural Language API library requires so it isn’t as straightforward as grabbing the newer Guava library and swapping it out.  I have already had a quick chat with Alfresco Engineering and it looks like this is on the list to be solved soon.  In the meantime, I’m using Apache HttpClient to access the relevant services directly.  It’s not quite as nice as the idiomatic approach that the Google Cloud SDK takes, but it will do for now.

Metadata Enrichment and Extraction

The main purpose of these little experiments has been to assess how suitable each tool may be for using NLP to improve search.  This is where, I think, Google’s Natural Language product could really shine.  Google is, after all, a search company (and yes, more than that too).  Google’s entity analyzer not only plucks out all of the named entities, but it also returns a salience score for each.  The higher the score, the more important or central that entity is to the entire text.  The API also returns the number of proper noun mentions for that entity.  This seems to work quite well, and the salience score isn’t looking at just the number of mentions.  During my testing I found several instances where the most salient result was not that which was mentioned the most.  Sorting by salience and only making those most relevant mentions searchable metadata in Alfresco would be useful.  Say, for example, we are looking for documents about XYZ Corporation.  A simple keyword search would return every document that mentions that company, even if the document wasn’t actually about it.  Searching only those documents where XYZ Corporation is the most salient entity (even if not the most frequently mentioned) in the document would give us much more relevant results.
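
As a sketch of the “sort by salience” idea, something like the following could trim an entity analysis response down to the entities worth storing as metadata.  The `entities`, `name`, `type` and `salience` fields follow the shape of the API’s response, while the 0.1 cutoff and the sample data are assumptions I made up for illustration.

```javascript
// Keep only the most salient entities from an analyzeEntities-style response.
// The default cutoff of 0.1 is an arbitrary assumption, not a recommended value.
function mostSalientEntities(response, minSalience = 0.1) {
  return (response.entities || [])
    .filter((entity) => entity.salience >= minSalience)
    .sort((a, b) => b.salience - a.salience)
    .map((entity) => ({ name: entity.name, type: entity.type, salience: entity.salience }));
}

// Hand-written sample shaped like the API's response; this is not real output.
const sampleResponse = {
  entities: [
    { name: 'Acme Inc', type: 'ORGANIZATION', salience: 0.12, mentions: [{}, {}, {}] },
    { name: 'XYZ Corporation', type: 'ORGANIZATION', salience: 0.71, mentions: [{}] },
    { name: 'Tuesday', type: 'OTHER', salience: 0.02, mentions: [{}] },
  ],
};
```

In the sample, XYZ Corporation sorts first even though Acme Inc has more mentions, which is exactly the behavior that makes salience more useful than a simple mention count.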

Sentiment analysis is another common feature in many natural language processing suites that may be useful in a content services search context.  For example, if you are using your content services platform to store customer survey results, transcripts of chats or other documents that capture an interaction you might want to find those that were strongly negative or positive to serve as training examples.  Another great use case exists in the process services world, where processes are likely to capture interactions in a more direct fashion.  Sentiment analysis is an area where Google’s and CoreNLP’s approaches differ significantly.  The Google Natural Language API provides two ways to handle sentiment analysis.  The first analyzes the overall sentiment of the provided text, the second provides sentiment analysis related to identified entities within the text.  These are fairly simplistic compared with the full sentiment graph that CoreNLP generates.  Google ranks sentiment along a scale of -1 to 1, with -1 being the most negative, and 1 the most positive.
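
A quick sketch of how that -1 to 1 score might be turned into a searchable label for finding the strongly positive or negative documents.  The `score` field matches the API’s document sentiment, but the 0.5 threshold and the label names are my own assumptions and would need tuning against real content.

```javascript
// Map a document sentiment score (-1 to 1) onto a coarse label for search.
// The 0.5 threshold and the label names are illustrative assumptions.
function sentimentLabel(documentSentiment, threshold = 0.5) {
  const { score } = documentSentiment;
  if (score >= threshold) return 'strongly-positive';
  if (score <= -threshold) return 'strongly-negative';
  return 'neutral-or-mixed';
}
```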

Lower Level Features

At the core of any NLP tool are the basics of language parsing and processing such as tokenization, sentence splitting, part of speech tagging, lemmatization, dependency parsing, etc.  The Google Cloud NL API exposes all of these features through its syntax analysis API and the token object.  The object syntax is clear and easy to understand.  There are some important differences in the way these are implemented across CoreNLP and Google Cloud NL, which I may explore further in a future article.

Different Needs, Different Tools

Google Cloud’s Natural Language product differs from CoreNLP in some important ways.  The biggest is simply the fact that one is a cloud service and one is traditionally released software.  This has its pros and cons, of course.  If you roll your own NLP infrastructure with CoreNLP (whether you do it on-premises or in the cloud) you’ll certainly have more control but you’ll also be responsible for managing the thing.  For some use cases this might be the critical difference.  Best I can tell, Google doesn’t allow for custom models or annotators (yet).  If you need to train your own system or build custom stuff into the annotation pipeline, Google’s NLP offering may not work for you.  This is likely to be a shortcoming of many of the cloud based NLP services.

Another key difference is language support.  CoreNLP ships models for English, Arabic, Chinese, French, German and Spanish, but not all annotators work for all languages.  CoreNLP also has contributed models in other languages of varying completeness and quality.  Google Cloud’s NLP API has full fledged support for English, Japanese and Spanish, with beta support for Chinese (simplified and traditional), French, German, Italian, Korean and Portuguese.  Depending on where you are and what you need to analyze, language support alone may drive your choice.

On the feature front there are also some key differences when you compare “out of the box” CoreNLP with the Google Cloud NL API.  The first thing I tested was entity recognition.  I have been doing a little testing with a collection of short stories from American writers, and so far both seem to do a fair job of recognizing basic named entities like people, places, organizations, etc.  Google’s API goes further though and will recognize and tag things like the names of consumer goods, works of art, and events.  CoreNLP would take more work to do that sort of thing, since it isn’t handled by the models that ship with the code.  On sentiment analysis, CoreNLP is much more comprehensive (at least in my admittedly limited evaluation).

Scalability and ergonomics are also concerns. If you plan to analyze a large amount of content there’s no getting around scale.  Without question, Google wins, but at a cost.  The Cloud Natural Language API uses a typical utilization cost model.  The more you analyze, the more you pay.  Ergonomics is another area where Google Cloud NL has a clear advantage.  CoreNLP is a more feature rich experience, and that shows in the model it returns.  Google Cloud NL API just returns a logically structured JSON object, making it much easier to read and interpret the results right away.  There’s also the issue of interface.  CoreNLP relies on a client library.  Google Cloud NL API is just a set of REST calls that follow the usual Google conventions and authentication schemes.  There has been some work to put a REST API on top of CoreNLP, but I have not tried that out.
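
To show how little is involved in the REST approach, here is a sketch that builds (but does not send) a request for the v1 `documents:analyzeEntities` or `documents:analyzeSentiment` methods.  It is JavaScript rather than the Apache HttpClient code I actually used, it assumes API key authentication, and `buildAnalyzeRequest` is a helper name I made up.

```javascript
// Build a request for the Cloud Natural Language REST API without sending it.
// Assumes API key auth; `buildAnalyzeRequest` is a hypothetical helper name.
const NL_API_BASE = 'https://language.googleapis.com/v1/documents';

function buildAnalyzeRequest(method, text, apiKey) {
  return {
    url: `${NL_API_BASE}:${method}?key=${encodeURIComponent(apiKey)}`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        document: { type: 'PLAIN_TEXT', content: text },
        encodingType: 'UTF8',
      }),
    },
  };
}
```

The returned `url` and `options` pair can be handed to whatever HTTP client you like, for example `fetch(request.url, request.options)`.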

The more I explore this space the more convinced I am that natural language processing has the potential to provide some significant improvements to enterprise content search, as well as to content and process analytics.


Branching The Blog Process


I work at Alfresco.  I also participate in the Alfresco community and build my own side projects / experiments / etc.  Some of these are Alfresco product related, some are not.  Sometimes this seems to introduce confusion around what is “official” Alfresco stuff related to my role and what is a science project or spike to explore an idea.  To avoid this confusion, I’m making a small change to the way I blog.  Going forward anything related to supported Alfresco platforms or functionality, troubleshooting, tuning, performance, etc will be hosted at the Alfresco Premier Services blog.  Other stuff related to experimentation, thoughts around content and process services management as a whole, embedded systems, etc will continue to get posted right here.  Hopefully this split will help to clarify which articles are related to the product as it is, and separate out the more exploratory stuff.

If you have not popped over to the Premier Team blog yet, check it out!

(Possibly) Enhancing Alfresco Search with Stanford CoreNLP


Laurence Hart recently published an article on CMSWiRE about AI and enterprise search that I found interesting.  In it, he lays out some good arguments about why the expectations for AI and enterprise search are a bit overinflated.  This is probably a natural part of the hype cycle that AI is currently traversing.  While AI probably won’t revolutionize enterprise search overnight, it definitely has the potential to offer meaningful improvements in the short term.  One of the areas where I think we can get some easy improvements is by using natural language processing to extract things that might be relevant to search, along with some context around those things.  For example, it is handy to be able to search for documents that contain references to people, places, organizations or specific dates using something more than a simple keyword search.  It’s useful for your search to know the difference between the china you set on your dinner table and China the country, or Alfresco the company vs eating outside.  Expanding on this work, it might also be useful to do some sentiment analysis on a document, or extract specific parts of it for automatic classification.

Stanford offers a set of tools to help with common natural language processing (NLP) tasks.  The Stanford CoreNLP project consists of a framework and variety of annotators that handle tasks such as sentiment analysis, part of speech tagging, lemmatization, named entity extraction, etc.  My favorite thing about this particular project is how they have simply dropped the barriers to trying it out to zero.  If you want to give the project a spin and see how it would annotate some text with the base models, Stanford helpfully hosts a version you can test out.  I spent an afternoon throwing text at it, both bits I wrote, and bits that come from some of my test document pool.  At first glance it seems to do a pretty good job, even with nothing more than the base models loaded.

I’d like to prove out some of these concepts and explore them further, so I’ve started a simple project to connect Stanford CoreNLP with the Alfresco Content Services platform.  The initial goals are simple:  Take text from a document stored in Alfresco, run it through a few CoreNLP annotators, extract data from the generated annotations, and store that data as Alfresco metadata.  This will make annotation data such as named entities (dates, places, people, organizations) directly searchable via Alfresco Search Services.  I’m starting with an Alfresco Repository Action that calls CoreNLP since that will be easy to test on individual documents.  It would be pretty straightforward to take this component and run it as a metadata extractor, which might make more sense in the long run.  Like most of my Alfresco extension or integration projects, this roughly follows the Service Action Pattern.
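
The “extract data from the generated annotations” step can be sketched roughly like this.  It is JavaScript working on the JSON that a CoreNLP server returns, not the Java code in my project, and it assumes each token carries `word` and `ner` fields with `O` marking non-entities.  The sample sentence is hand-written rather than real CoreNLP output.

```javascript
// Group consecutive tokens that share an NER tag into entities, keyed by type.
// Assumes CoreNLP-server-style JSON: sentences[].tokens[] with `word` and `ner`.
function entitiesByType(coreNlpJson) {
  const result = {};
  for (const sentence of coreNlpJson.sentences || []) {
    let current = null; // entity being assembled: { type, words }
    const flush = () => {
      if (!current) return;
      const text = current.words.join(' ');
      const values = result[current.type] || (result[current.type] = []);
      if (!values.includes(text)) values.push(text); // de-duplicate repeats
      current = null;
    };
    for (const token of sentence.tokens || []) {
      const type = token.ner;
      if (!type || type === 'O') {
        flush();
      } else if (current && current.type === type) {
        current.words.push(token.word);
      } else {
        flush();
        current = { type, words: [token.word] };
      }
    }
    flush();
  }
  return result;
}

// Hand-written sample, not real CoreNLP output.
const sampleAnnotation = {
  sentences: [{
    tokens: [
      { word: 'Ada', ner: 'PERSON' },
      { word: 'Lovelace', ner: 'PERSON' },
      { word: 'visited', ner: 'O' },
      { word: 'London', ner: 'LOCATION' },
    ],
  }],
};
```

Each key in the result (PERSON, LOCATION and so on) would then map naturally onto a multi-valued metadata property in Alfresco.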

Stanford CoreNLP makes the integration bits pretty easy.  You can run CoreNLP as a standalone server, and the project helpfully provides a Java client (StanfordCoreNLPClient) that somewhat closely mirrors the annotation pipeline, so if you already know how to use CoreNLP locally, you can easily get it working from an Alfresco integration.  This will also help with scalability since CoreNLP can be memory hungry and running the NLP engine in a separate JVM or server from Alfresco definitely makes sense.  It also makes sense to be judicious about what annotators you run, so that should be configurable in Alfresco.  The same goes for limiting the size of the text that gets sent to CoreNLP, so long term some pagination will probably be necessary to break down large files into more manageable pieces.  The CoreNLP project itself provides some great guidance on getting the best performance out of the tool.
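The pagination idea could look something like the sketch below.  It is only one possible approach: the maxChars limit and the "break after sentence-ending punctuation, else at a space, else hard cut" heuristic are my own assumptions, not anything CoreNLP prescribes.

```javascript
// Split text into chunks of at most maxChars characters, preferring to break
// at the end of a sentence so each chunk stays meaningful to the annotators.
function chunkText(text, maxChars) {
  const chunks = [];
  let rest = text.trim();
  while (rest.length > maxChars) {
    const head = rest.slice(0, maxChars);
    // Find the last sentence-ending punctuation followed by whitespace.
    let cut = -1;
    const boundary = /[.!?]\s/g;
    let match;
    while ((match = boundary.exec(head)) !== null) cut = match.index + 1;
    // Fall back to the last space, then to a hard cut at maxChars.
    if (cut <= 0) cut = head.lastIndexOf(' ');
    if (cut <= 0) cut = maxChars;
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

Each chunk can then be annotated on its own and the results merged, at the cost of losing any context that spans a chunk boundary.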

A few notes about using CoreNLP programmatically from other applications.  First, if you just provide a host name (like localhost) then CoreNLP assumes that you will be connecting via HTTPS.   This will cause the StanfordCoreNLPClient to not respond if your server isn’t set up for it.  Oddly, it also doesn’t seem to throw any kind of useful exception; it just sort of, well, stops.  If you don’t want to use HTTPS, make sure to specify the protocol in the host name.  Second, Stanford makes it pretty easy to use CoreNLP in your application by publishing on Maven Central, but the model jars aren’t there.  You’ll need to download those separately.  Third, CoreNLP can use a lot of memory for processing large amounts of text.  If you plan to do this kind of thing at any kind of scale, you’ll need to run the CoreNLP bits on a separate JVM, and possibly a separate server.  I can’t imagine that Alfresco under load and CoreNLP in the same JVM would yield good results.  Fourth, the client also has hefty memory requirements.  In my testing, running the CoreNLP client in an Alfresco action with less than 2GB of memory caused out of memory errors when processing 5-6 pages of dense text.  Finally, the pipeline that you feed CoreNLP is ordered.  If you don’t have the correct annotators in there in the right order, you won’t get the results you expect.  Some annotators have dependencies, which aren’t always clear until you try to process some text and it fails.  Thankfully, the error message will tell you what other annotators you need in the pipeline for it to work.
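Two of these gotchas (the missing protocol and the annotator ordering) are easy to guard against before any text is sent.  The sketch below is JavaScript talking to the CoreNLP server's HTTP interface rather than the Java client, and the dependency table is a small illustrative subset based on my reading of the CoreNLP docs; treat it as an assumption and check it against the version you run.

```javascript
// Illustrative subset of annotator prerequisites (an assumption to verify
// against the CoreNLP documentation for the version in use).
const REQUIRES = {
  tokenize: [],
  ssplit: ['tokenize'],
  pos: ['tokenize', 'ssplit'],
  lemma: ['tokenize', 'ssplit', 'pos'],
  ner: ['tokenize', 'ssplit', 'pos', 'lemma'],
};

// Return a list of problems: each annotator must come after its prerequisites.
function checkPipeline(annotators) {
  const problems = [];
  const seen = new Set();
  for (const name of annotators) {
    for (const dep of REQUIRES[name] || []) {
      if (!seen.has(dep)) {
        problems.push(`${name} needs ${dep} earlier in the pipeline`);
      }
    }
    seen.add(name);
  }
  return problems;
}

// Build the URL for a POST to a CoreNLP server, insisting on an explicit
// protocol so nothing silently defaults to HTTPS.
function buildServerUrl(host, port, annotators) {
  if (!/^https?:\/\//.test(host)) {
    throw new Error(`Host "${host}" has no protocol; use http:// or https://`);
  }
  const properties = { annotators: annotators.join(','), outputFormat: 'json' };
  const encoded = encodeURIComponent(JSON.stringify(properties));
  return `${host}:${port}/?properties=${encoded}`;
}
```

Failing fast here is a lot friendlier than a client that just stops responding.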

After some experimentation I’m not sure that CoreNLP is really well suited for integration with a content services platform.  I had hoped that most of the processing using StanfordCoreNLPClient to connect to a server would take place on the server, and only results would be returned, but that doesn’t appear to be the case.  I still think that using NLP tools to enhance search has merit though.  If you want to play around with this idea yourself you can find my PoC code on GitHub.  It’s a toy at this point, but might help others understand Alfresco, some intricacies of CoreNLP, or both.  As a next step I’m going to look at OpenNLP and a few other tools to better understand both the concepts and the space.


AWS Lambda and Alfresco – Connecting Serverless to Content and Process


Let’s start with a rant.  I don’t like the term “Serverless” to describe Lambda or other function as a service platforms.  Yeah, OK, so you don’t need to spin up servers, or worry about EC2 instances, or any of that stuff.  Great.  But it still runs on a server of some sort.  Even nascent efforts to extend “Serverless” to edge devices still have something that could be called a server, the device itself.  If it provides a service, it’s a server.  It’s like that Salesforce “No Software” campaign.  What?  It’s freaking software, no matter what some marketing campaign says.  It looks like the name is going to stick, so I’ll use it, but if you wonder why I look like I’ve just bit into a garden slug every time I say “Serverless”, that’s why.

Naming aside, there’s no doubt this is a powerful paradigm for writing, running and managing code.  For one, it’s simple.  It takes away all the pain of the lower levels of the stack and gives devs a superbly clean and easy environment.  It is (or should be) scalable.  It is (or should be) performant.  The appeal is easy to see.  Like most areas that AWS colonizes, Lambda seems to be the front runner in this space.

You know what else runs well in AWS?  Alfresco Content and Process Services.

Lambda -> Alfresco Content / Process Services

It should be fairly simple to call Alfresco Content or Process Services from AWS Lambda.  Lambda supports several execution environments, all of which support calling an external URL.  If you have an Alfresco instance running on or otherwise reachable from AWS, you can call it from Lambda.  This does, however, require you to write all of the supporting code to make the calls.  One Lambda execution environment is Node.js, which probably presents us with the easiest way to get Lambda talking to Alfresco.  Alfresco has a recently released JavaScript client API which supports connections to both Alfresco Content Services and Alfresco Process Services.  This client API requires at least Node.js 5.x.  Lambda supports 6.10 at the time this article was written, so no problem there!

Alfresco Content / Process Services -> Lambda

While it’s incredibly useful to be able to call Alfresco services from Lambda, what about triggering Lambda functions from the Alfresco Digital Business Platform?  That part is also possible; exactly how to do it depends on what you want to do.  Lambda supports many ways to invoke a function, some of which may be helpful to us.

S3 bucket events

AWS Lambda functions can be triggered in response to S3 events such as object creation or deletion.  The case that AWS outlines on their web site is a situation where you might want to generate a thumbnail, which Alfresco already handles quite nicely, but it’s not hard to come up with others.  We might want to do something when a node is deleted from the S3 bucket by Alfresco.  For example, this could be used to trigger a notification that content cleanup was successful or to write out an additional audit entry to another system.  Since most Alfresco DBP deployments in AWS use S3 as a backing store, this is an option available to most AWS Alfresco Content or Process Services instances.

Simple Email Service

Another way to trigger a Lambda function is via the AWS Simple Email Service.  SES is probably more commonly used to send emails, but it can also receive them.  SES can invoke your Lambda function and pass it the email it received.  Sending email can already easily be done from both an Alfresco Process Services BPMN task and from an Alfresco Content Services Action, so this could be an easy way to trigger a Lambda function using existing functionality in response to a workflow event or something occurring in the content repository.

Scheduled Events

AWS CloudWatch provides a scheduled event capability for CloudWatch Events.  These are configured using either a fixed rate or a cron expression, and use a rule target definition to define which Lambda function to call.  A scheduled event isn’t really a way to call Lambda functions from ACS or APS, but it could prove to be very useful for regular cleanup events, archiving or other recurring tasks you wish to run against your Alfresco Content Services instances in AWS.  It also gives you a way to trigger things to happen in your APS workflows on a schedule, but that case is probably better handled in the process definition itself.

API Gateway

Our last two options would require a little work, but may turn out to be the best for most use cases.  Using an API Gateway you can define URLs that can be used to directly trigger your Lambda functions.  Triggering these from Alfresco Process Services is simple:  just use a REST call task to make the call.  Doing so from Alfresco Content Services is a bit trickier, requiring either a custom action or a behavior that makes the call out to the API Gateway and passes it the info your Lambda function needs to do its job.  Still fairly straightforward, and there are lots of good examples of making HTTP calls from Alfresco Content Services extensions out there in the community.
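On the Lambda side, a function behind API Gateway (assuming the Lambda proxy integration) gets the request body as a string and returns an object with a status code, headers and a string body.  The sketch below assumes the caller posts JSON with a nodeId field; that field name is a made-up contract for this example.

```javascript
// Build a response in the shape the Lambda proxy integration expects.
function respond(statusCode, payload) {
  return {
    statusCode,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  };
}

// Validate the incoming request and decide what to send back.
function handleRequest(event) {
  let request;
  try {
    request = JSON.parse(event.body || '{}');
  } catch (err) {
    return respond(400, { error: 'Body must be valid JSON' });
  }
  if (!request.nodeId) {
    return respond(400, { error: 'nodeId is required' });
  }
  // Real work on the node would happen here.
  return respond(200, { accepted: request.nodeId });
}

exports.handler = async (event) => handleRequest(event);
```

A REST call task in Alfresco Process Services, or a custom action in Alfresco Content Services, would then just POST that small JSON document to the URL the gateway exposes.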


Simple Notification Service

AWS Simple Notification Service provides another scalable option for calling Lambda functions.  Like the API Gateway option, you could use a REST call task in Alfresco Process Services, or a bit of custom code to make the call from Alfresco Content Services.  AWS SNS supports a simple API for publishing messages to a topic, which can then be used to trigger your Lambda function.

There are quite a few ways both to use Alfresco Process and Content Services from Lambda functions and to use Lambda functions to enhance your investment in Alfresco technologies.  I plan to do a little spike to explore this further; stay tuned for findings and code samples!


13 Essentials for Building Great Remote Teams


It’s been a while since I wrote a listicle, and this stuff has been on my mind a lot lately.  About two years ago I assumed my current role as Alfresco’s Global Director of Premier Support Services.  The Premier Services team is scattered across the world, with team members from Sydney to Paris and just about everywhere in between.  This has been my first time running a large distributed team.  Here are some things I’ve found essential to making it work.  Some are things you need to have, some are things you need to do:

  1. Find and use a good chat tool.  When your team is spread around the world you can’t live without a good tool for informal asynchronous communications.
  2. But not too many chat tools.  Seriously, this is a problem.  Ask your team what they like, settle on one and stick with it, otherwise you end up with silos, missed messages and a confused group of people.
  3. Use video, even if it feels weird.  Voice chat is great, but there’s no substitute for seeing who you are talking with.  In his book “Silent Messages”, Dr. Albert Mehrabian attributes up to 55% of the impact of a message to the body language of the person presenting the message.  You can’t get that from voice chat alone.
  4. Take advantage of document sharing and collaboration.  A big percentage of our work results in unstructured content in the form of spreadsheets, reports, etc.  We need easy ways to find, collaborate on and share that stuff.  We use Alfresco, naturally.
  5. Have regular face-to-face meetings.  These can be expensive and time consuming, but there is no substitute for meeting in person, shaking hands and sharing a cup of coffee or lunch.  This is especially true for new additions to the crew; during that honeymoon period you need to meet.
  6. Make smart development investments.  When most of your team is remote it is easy to start to feel disconnected from your organization.  Over 5 years of working remotely both as an individual contributor and a leader I know I have felt that way from time to time.  Investing in your team’s professional development is a great way to help them reconnect.  It’s even better if you can use this as an opportunity to get some face time, for example by sending a couple of people from your team to the same conference so they can get to know each other.
  7. Celebrate success, no matter how small.  When everybody works together under one roof it’s easy to congratulate somebody on your team for a job well done.  It’s easy to pull the team together to celebrate a release, or a project milestone, or whatever.  When everybody is remote that becomes simultaneously harder and more important.  Don’t be shy about calling somebody out in your chat, via email or in a team call when they score a win.  Think to yourself “Is this the kind of thing I would walk over and thank somebody for in person?”.  If the answer is yes, then mention it in public.
  8. Raise your team’s profile.  It has been said that a boss takes credit, a leader gives it.  When your entire team is spread around the globe you, as their lead, serve to a certain extent as their interface to upper management and to leaders in other areas of the company.  Use this to your team’s benefit by raising the profile of your top performers to your leadership and to your peers.  When you bring a good idea from your staff to your leadership, your team or anybody else in your company, make sure you let them know exactly where it came from.
  9. Build lots of bridges.  A lot of these essentials come back to the risk of a remote team member becoming disconnected or disengaged.  One way to prevent this is to help your team get and stay engaged in areas other than your own.  Every company I have ever worked for has cross functional teams and initiatives.  Find the places where your teams’ skills and bandwidth align with those cross functional needs and get them connected.  They’ll learn something new, share what they know and contribute to the success of the team.
  10. Shamelessly signal boost.  Many people on my team are active on social media, or with our team blog, or on other channels for knowledge sharing and engagement.  I absolutely encourage this; sharing our knowledge with peers, customers, partners and community members helps everybody.  It takes effort though, an effort that often goes beyond somebody’s core job role.  It’s also a bit scary at times, putting yourself and your ideas out there for everybody to see (and potentially criticize).  If somebody on your team takes the time and the risk, help boost them a bit.  Retweet their post, share it on LinkedIn, post it to internal chats, etc.  Not only will you be helping them spread the knowledge around, but you’re also lending your credibility to their message.
  11. Have defined development paths within (and out of!) your team.  A lot of promotions come from networking, cross functional experiences, word of mouth and other things that are harder to achieve when you work remotely.  As a leader of a remote team, it’s your responsibility to help your people understand the roles within your organization, what is required to move into those roles, and how to get there.  It’s also your job to make sure they know about great opportunities outside your team.  I want my people to be successful, however they define success.  That might be in my team or it might be elsewhere in the company.
  12. Be clear about your goals and how you’ll measure them.  If you have the sort of job that lets you work from home, odds are you aren’t punching a clock.  If you do come into an office every day but your boss is elsewhere 95% of the time, nobody is hovering around making sure you’re there.  Work should be a thing you do, not necessarily a place you go.  The only way this works is if everybody is clear about what we are all trying to achieve together, who’s responsible for what, and how we’ll measure the outcome.  If we all agree on that, you can work from the moon if your internet connection is fast enough.
  13. Trust.  I put this one last because it is easily the most important.  It’s important from the moment you hire somebody that you won’t see in person every day.  Put simply, you cannot possibly have a successful remote organization if you don’t trust the people you work with.  Full stop.  You have to nurture a culture of trust where people aren’t afraid to speak up, where transparency is a core value.

Is this list comprehensive?  Of course not.  Do I still struggle to do this stuff?  Every day, but I keep trying.