(Possibly) Enhancing Alfresco Search Part 2 – Google Cloud’s Natural Language API


In the first article in this series, we took a look at using Stanford’s CoreNLP library to enrich Alfresco Content Services metadata with some natural language processing tools.  In particular, we looked at using named entity extraction and sentiment analysis to add some value to enterprise search.  As soon as I posted that article, several people got in touch to see if I was working on testing out any other NLP tools.  In part 2 of this series, we’ll take a look at Google Cloud’s Natural Language API to see if it is any easier to integrate and scale, and do a brief comparison of the results.

One little thing I discovered during testing may be of note if anybody picks up the Github code to try to do anything useful with it:  Alfresco and Google Cloud’s Natural Language API library can’t play nice together due to conflicting dependencies on some of the Google components.  In particular, Guava is a blocker.  Alfresco ships with and depends on an older version.  Complicating matters further, the Guava APIs changed between the version Alfresco ships with and the version the Google Cloud Natural Language API library requires, so it isn’t as straightforward as swapping in the newer Guava library.  I have already had a quick chat with Alfresco Engineering and it looks like this is on the list to be solved soon.  In the meantime, I’m using Apache HttpClient to access the relevant services directly.  It’s not quite as nice as the idiomatic approach that the Google Cloud SDK takes, but it will do for now.
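Until the Guava conflict is sorted out, the direct REST call is simple enough.  Here’s a rough sketch of what it looks like; I’m showing it with the JDK’s built-in HttpURLConnection so the snippet stands alone (the project itself uses Apache HttpClient), and the GOOGLE_NL_API_KEY environment variable is just my placeholder for wherever you keep your key:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class GoogleNlEntitySketch {

    // The v1 entity analysis endpoint; the API key is passed as a query parameter.
    private static final String ENTITIES_URL =
            "https://language.googleapis.com/v1/documents:analyzeEntities?key=";

    // Build the JSON request body the analyzeEntities call expects.
    static String buildRequestBody(String text) {
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"document\": {\"type\": \"PLAIN_TEXT\", \"content\": \""
                + escaped + "\"}, \"encodingType\": \"UTF8\"}";
    }

    public static void main(String[] args) throws IOException {
        String body = buildRequestBody("Alfresco Software was founded in 2005.");
        String apiKey = System.getenv("GOOGLE_NL_API_KEY");
        if (apiKey == null) {
            // No credentials available; just show the request we would send.
            System.out.println(body);
            return;
        }
        HttpURLConnection conn = (HttpURLConnection)
                new URL(ENTITIES_URL + apiKey).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        // The response body is JSON containing entities, mentions and salience scores.
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```

Not pretty compared with the SDK, but it keeps the classpath clean until the dependency clash is fixed.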

Metadata Enrichment and Extraction

The main purpose of these little experiments has been to assess how suitable each tool may be for using NLP to improve search.  This is where, I think, Google’s Natural Language product could really shine.  Google is, after all, a search company (and yes, more than that too).  Google’s entity analyzer not only plucks out all of the named entities, but it also returns a salience score for each.  The higher the score, the more important or central that entity is to the entire text.  The API also returns the number of proper noun mentions for that entity.  This seems to work quite well, and the salience score looks at more than just the number of mentions.  During my testing I found several instances where the most salient entity was not the one mentioned most often.  Sorting by salience and making only the most relevant entities searchable metadata in Alfresco could be quite useful.  Say, for example, we are looking for documents about XYZ Corporation.  A simple keyword search would return every document that mentions that company, even if the document wasn’t actually about it.  Searching only those documents where XYZ Corporation is the most salient entity (even if not the most frequently mentioned) would give us much more relevant results.
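To make that concrete, here’s a toy sketch of the “index only the most salient entity” idea.  The Entity class below is just my stand-in for the JSON objects the API returns, not a real Google type:

```java
import java.util.Comparator;
import java.util.List;

public class SalienceFilter {

    // Minimal stand-in for an entity from the analyzeEntities response.
    static class Entity {
        final String name;
        final double salience;
        Entity(String name, double salience) {
            this.name = name;
            this.salience = salience;
        }
    }

    // Return the entity the document is most "about" -- the one we would
    // write to a searchable Alfresco property instead of indexing everything.
    static Entity mostSalient(List<Entity> entities) {
        return entities.stream()
                .max(Comparator.comparingDouble(e -> e.salience))
                .orElse(null);
    }

    public static void main(String[] args) {
        // London is mentioned more often in this imaginary document, but the
        // API has scored XYZ Corporation as more central to the text.
        List<Entity> entities = List.of(
                new Entity("XYZ Corporation", 0.62),
                new Entity("London", 0.25),
                new Entity("ACME Ltd", 0.13));
        System.out.println(mostSalient(entities).name); // prints "XYZ Corporation"
    }
}
```

A real implementation would probably keep the top few entities above some salience threshold rather than exactly one, but the idea is the same.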

Sentiment analysis is another common feature in many natural language processing suites that may be useful in a content services search context.  For example, if you are using your content services platform to store customer survey results, transcripts of chats or other documents that capture an interaction, you might want to find those that were strongly negative or positive to serve as training examples.  Another great use case exists in the process services world, where processes are likely to capture interactions in a more direct fashion.  Sentiment analysis is an area where Google’s and CoreNLP’s approaches differ significantly.  The Google Natural Language API provides two ways to handle sentiment analysis:  the first analyzes the overall sentiment of the provided text, the second provides sentiment analysis related to identified entities within the text.  These are fairly simplistic compared with the full sentiment graph that CoreNLP generates.  Google scores sentiment along a scale of -1 to 1, with -1 being the most negative and 1 the most positive.
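For search purposes, a raw score in that range probably wants to be bucketed into a coarse, searchable property before it’s written to Alfresco metadata.  The 0.25 cutoffs below are an arbitrary choice on my part, purely for illustration:

```java
public class SentimentLabel {

    // Bucket Google's [-1, 1] sentiment score into a coarse label that is
    // easy to facet and filter on in search; thresholds are arbitrary.
    static String label(double score) {
        if (score <= -0.25) return "negative";
        if (score >= 0.25) return "positive";
        return "neutral";
    }

    public static void main(String[] args) {
        System.out.println(label(-0.8)); // negative
        System.out.println(label(0.1));  // neutral
        System.out.println(label(0.9));  // positive
    }
}
```

Finding all strongly negative survey responses then becomes a simple metadata query rather than a text-mining exercise.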

Lower Level Features

At the core of any NLP tool are the basics of language parsing and processing such as tokenization, sentence splitting, part of speech tagging, lemmatization, dependency parsing, etc.  The Google Cloud NL API exposes all of these features through its syntax analysis API and the token object.  The object syntax is clear and easy to understand.  There are some important differences in the way these are implemented across CoreNLP and Google Cloud NL, which I may explore further in a future article.

Different Needs, Different Tools

Google Cloud’s Natural Language product differs from CoreNLP in some important ways.  The biggest is simply the fact that one is a cloud service and one is traditionally released software.  This has its pros and cons, of course.  If you roll your own NLP infrastructure with CoreNLP (whether you do it on-premises or in the cloud) you’ll certainly have more control but you’ll also be responsible for managing the thing.  For some use cases this might be the critical difference.  Best I can tell, Google doesn’t allow for custom models or annotators (yet).  If you need to train your own system or build custom stuff into the annotation pipeline, Google’s NLP offering may not work for you.  This is likely to be a shortcoming of many of the cloud based NLP services.

Another key difference is language support.  CoreNLP ships models for English, Arabic, Chinese, French, German and Spanish, but not all annotators work for all languages.  CoreNLP also has contributed models in other languages of varying completeness and quality.  Google Cloud’s NLP API has full-fledged support for English, Japanese and Spanish, with beta support for Chinese (simplified and traditional), French, German, Italian, Korean and Portuguese.  Depending on where you are and what you need to analyze, language support alone may drive your choice.

On the feature front there are also some key differences when you compare “out of the box” CoreNLP with the Google Cloud NL API.  The first thing I tested was entity recognition.  I have been doing a little testing with a collection of short stories from American writers, and so far both seem to do a fair job of recognizing basic named entities like people, places, organizations, etc.  Google’s API goes further though and will recognize and tag things like the names of consumer goods, works of art, and events.  CoreNLP would take more work to do that sort of thing; it isn’t handled by the models that ship with the code.  On sentiment analysis, CoreNLP is much more comprehensive (at least in my admittedly limited evaluation).

Scalability and ergonomics are also concerns.  If you plan to analyze a large amount of content there’s no getting around scale.  Without question, Google wins here, but at a cost.  The Cloud Natural Language API uses a typical usage-based cost model:  the more you analyze, the more you pay.  Ergonomics is another area where Google Cloud NL has a clear advantage.  CoreNLP is a more feature-rich experience, and that shows in the complexity of the model it returns.  The Google Cloud NL API just returns a logically structured JSON object, making it much easier to read and interpret the results right away.  There’s also the issue of interface.  CoreNLP relies on a client library.  The Google Cloud NL API is just a set of REST calls that follow the usual Google conventions and authentication schemes.  There has been some work to put a REST API on top of CoreNLP, but I have not tried that out.

The more I explore this space the more convinced I am that natural language processing has the potential to provide some significant improvements to enterprise content search, as well as to content and process analytics.

 

(Possibly) Enhancing Alfresco Search with Stanford CoreNLP


Laurence Hart recently published an article on CMSWire about AI and enterprise search that I found interesting.  In it, he lays out some good arguments about why the expectations for AI and enterprise search are a bit overinflated.  This is probably a natural part of the hype cycle that AI is currently traversing.  While AI probably won’t revolutionize enterprise search overnight, it definitely has the potential to offer meaningful improvements in the short term.  One of the areas where I think we can get some easy improvements is by using natural language processing to extract things that might be relevant to search, along with some context around those things.  For example, it is handy to be able to search for documents that contain references to people, places, organizations or specific dates using something more than a simple keyword search.  It’s useful for your search to know the difference between the china you set on your dinner table and China the country, or Alfresco the company and dining al fresco.  Expanding on this work, it might also be useful to do some sentiment analysis on a document, or extract specific parts of it for automatic classification.

Stanford offers a set of tools to help with common natural language processing (NLP) tasks.  The Stanford CoreNLP project consists of a framework and a variety of annotators that handle tasks such as sentiment analysis, part of speech tagging, lemmatization, named entity extraction, etc.  My favorite thing about this particular project is how they have dropped the barrier to trying it out to essentially zero.  If you want to give the project a spin and see how it would annotate some text with the base models, Stanford helpfully hosts a version you can test out.  I spent an afternoon throwing text at it, both bits I wrote, and bits that come from some of my test document pool.  At first glance it seems to do a pretty good job, even with nothing more than the base models loaded.

I’d like to prove out some of these concepts and explore them further, so I’ve started a simple project to connect Stanford CoreNLP with the Alfresco Content Services platform.  The initial goals are simple:  Take text from a document stored in Alfresco, run it through a few CoreNLP annotators, extract data from the generated annotations, and store that data as Alfresco metadata.  This will make annotation data such as named entities (dates, places, people, organizations) directly searchable via Alfresco Search Services.  I’m starting with an Alfresco Repository Action that calls CoreNLP since that will be easy to test on individual documents.  It would be pretty straightforward to take this component and run it as a metadata extractor, which might make more sense in the long run.  Like most of my Alfresco extension or integration projects, this roughly follows the Service Action Pattern.

Stanford CoreNLP makes the integration bits pretty easy.  You can run CoreNLP as a standalone server, and the project helpfully provides a Java client (StanfordCoreNLPClient) that closely mirrors the annotation pipeline, so if you already know how to use CoreNLP locally, you can easily get it working from an Alfresco integration.  This will also help with scalability, since CoreNLP can be memory hungry and running the NLP engine in a separate JVM or server from Alfresco definitely makes sense.  It also makes sense to be judicious about which annotators you run, so that should be configurable in Alfresco.  Limiting the size of the text that gets sent to CoreNLP makes sense too, so long term some pagination will probably be necessary to break down large files into more manageable pieces.  The CoreNLP project itself provides some great guidance on getting the best performance out of the tool.

A few notes about using CoreNLP programmatically from other applications.  First, if you just provide a host name (like localhost), CoreNLP assumes that you will be connecting via HTTPS.  This will cause the StanfordCoreNLPClient to not respond if your server isn’t set up for it.  Oddly, it also doesn’t seem to throw any kind of useful exception; it just sort of, well, stops.  If you don’t want to use HTTPS, make sure to specify the protocol in the host name.  Second, Stanford makes it pretty easy to use CoreNLP in your application by publishing it on Maven Central, but the model jars aren’t there.  You’ll need to download those separately.  Third, CoreNLP can use a lot of memory when processing large amounts of text.  If you plan to do this kind of thing at any kind of scale, you’ll need to run the CoreNLP bits in a separate JVM, and possibly on a separate server.  I can’t imagine that Alfresco under load and CoreNLP in the same JVM would yield good results.  Fourth, the client also has hefty memory requirements.  In my testing, running the CoreNLP client in an Alfresco action with less than 2GB of memory caused out of memory errors when processing 5-6 pages of dense text.  Finally, the pipeline that you feed CoreNLP is ordered.  If you don’t have the correct annotators in there in the right order, you won’t get the results you expect.  Some annotators have dependencies, which aren’t always clear until you try to process some text and it fails.  Thankfully the error message will tell you which other annotators you need in the pipeline for it to work.
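To make the first and last points concrete, here’s a small stdlib-only sketch of the configuration side.  The commented-out line shows where the StanfordCoreNLPClient itself would be constructed (note the explicit http:// in the host name); the helper methods are my own invention for illustration:

```java
import java.util.Properties;

public class CoreNlpClientConfig {

    // Build the pipeline properties CoreNLP expects.  Annotator order
    // matters: "ner" needs tokenize, ssplit, pos and lemma to run first.
    static Properties pipelineProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        return props;
    }

    // Force an explicit protocol so the client doesn't silently assume HTTPS
    // and hang against a plain-HTTP CoreNLP server.
    static String serverUrl(String host, int port, boolean https) {
        return (https ? "https://" : "http://") + host + ":" + port;
    }

    public static void main(String[] args) {
        Properties props = pipelineProps();
        // In the real integration these are handed to the CoreNLP client:
        //   new StanfordCoreNLPClient(props, "http://localhost", 9000, 2);
        System.out.println(serverUrl("localhost", 9000, false));
        System.out.println(props.getProperty("annotators"));
    }
}
```

Keeping the annotator list and server URL in configuration like this also makes it easy to satisfy the “annotators should be configurable in Alfresco” goal mentioned above.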

After some experimentation I’m not sure that CoreNLP is really well suited for integration with a content services platform.  I had hoped that most of the processing using StanfordCoreNLPClient to connect to a server would take place on the server, and only results would be returned but that doesn’t appear to be the case.  I still think that using NLP tools to enhance search has merit though.  If you want to play around with this idea yourself you can find my PoC code on Github.  It’s a toy at this point, but might help others understand Alfresco, some intricacies of CoreNLP, or both.  As a next step I’m going to look at OpenNLP and a few other tools to better understand both the concepts and the space.

 

AWS Lambda and Alfresco – Connecting Serverless to Content and Process


Let’s start with a rant.  I don’t like the term “Serverless” to describe Lambda or other function as a service platforms.  Yeah, OK, so you don’t need to spin up servers, or worry about EC2 instances, or any of that stuff.  Great.  But it still runs on a server of some sort.  Even nascent efforts to extend “Serverless” to edge devices still have something that could be called a server, the device itself.  If it provides a service, it’s a server.  It’s like that Salesforce “No Software” campaign.  What?  It’s freaking software, no matter what some marketing campaign says.  It looks like the name is going to stick, so I’ll use it, but if you wonder why I look like I’ve just bitten into a garden slug every time I say “Serverless”, that’s why.

Naming aside, there’s no doubt this is a powerful paradigm for writing, running and managing code.  For one, it’s simple.  It takes away all the pain of the lower levels of the stack and gives devs a superbly clean and easy environment.  It (should be) scalable.  It (should be) performant.  The appeal is easy to see.  Like most areas that AWS colonizes, Lambda seems to be the front runner in this space.

You know what else runs well in AWS?  Alfresco Content and Process Services.

Lambda -> Alfresco Content / Process Services

It should be fairly simple to call Alfresco Content or Process Services from AWS Lambda.  Lambda supports several execution environments, all of which support calling an external URL.  If you have an Alfresco instance running on or otherwise reachable from AWS, you can call it from Lambda.  This does, however, require you to write all of the supporting code to make the calls.  One Lambda execution environment is Node.js, which probably presents us with the easiest way to get Lambda talking to Alfresco.  Alfresco has a recently released Javascript client API which supports connections to both Alfresco Content Services and Alfresco Process Services.  This client API requires at least Node.js 5.x.  Lambda supports 6.10 at the time this article was written, so no problem there!

Alfresco Content / Process Services -> Lambda

While it’s incredibly useful to be able to call Alfresco services from Lambda, what about triggering Lambda functions from the Alfresco Digital Business Platform?  That part is also possible; exactly how to do it depends on what you want to do.  Lambda supports many ways to invoke a function, some of which may be helpful to us.

S3 bucket events

AWS Lambda functions can be triggered in response to S3 events such as object creation or deletion.  The case that AWS outlines on their web site is a situation where you might want to generate a thumbnail, which Alfresco already handles quite nicely, but it’s not hard to come up with others.  We might want to do something when a node is deleted from the S3 bucket by Alfresco.  For example, this could be used to trigger a notification that content cleanup was successful or to write out an additional audit entry to another system.  Since most Alfresco DBP deployments in AWS use S3 as a backing store, this is an option available to most AWS Alfresco Content or Process Services instances.

Simple Email Service

Another way to trigger a Lambda function is via the AWS Simple Email Service.  SES is probably more commonly used to send emails, but it can also receive them.  SES can invoke your Lambda function and pass it the email it received.  Sending email can already easily be done from both an Alfresco Process Services BPMN task and from an Alfresco Content Services Action, so this could be an easy way to trigger a Lambda function using existing functionality in response to a workflow event or something occurring in the content repository.

Scheduled Events

AWS CloudWatch Events provides a scheduled event capability.  Schedules are configured using either a fixed rate or a cron expression, and use a rule target definition to define which Lambda function to call.  A scheduled event isn’t really a way to call Lambda functions from ACS or APS, but it could prove to be very useful for regular cleanup events, archiving or other recurring tasks you wish to run against your Alfresco Content Services instances in AWS.  It also gives you a way to trigger things to happen in your APS workflows on a schedule, but that case is probably better handled in the process definition itself.

API Gateway

Our last two options would require a little work, but may turn out to be the best for most use cases.  Using an API Gateway you can define URLs that can be used to directly trigger your Lambda functions.  Triggering these from Alfresco Process Services is simple:  just use a REST call task to make the call.  Doing so from Alfresco Content Services is a bit trickier, requiring either a custom action or a behavior that makes the call out to the API gateway and passes it the info your Lambda function needs to do its job.  Still fairly straightforward, and there are lots of good examples of making HTTP calls from Alfresco Content Services extensions out there in the community.

SNS

AWS Simple Notification Service provides another scalable option for calling Lambda functions.  Like the API gateway option, you could use a REST call task in Alfresco Process Services, or a bit of custom code to make the call from Alfresco Content Services.  AWS SNS supports a simple API for publishing messages to a topic, which can then be used to trigger your Lambda function.

There are quite a few ways to both use Alfresco Process and Content services from Lambda functions, as well use Lambda functions to enhance your investment in Alfresco technologies.  I plan to do a little spike to explore this further, stay tuned for findings and code samples!

 

13 Essentials for Building Great Remote Teams


It’s been a while since I wrote a listicle, and this stuff has been on my mind a lot lately.  About two years ago I assumed my current role as Alfresco’s Global Director of Premier Support Services.  The Premier Services team is scattered across the world, with team members from Sydney to Paris and just about everywhere in between.  This has been my first time running a large distributed team; here are some things I’ve found essential to making it work.  Some are things you need to have, some are things you need to do:

  1. Find and use a good chat tool.  When your team is spread around the world you can’t live without a good tool for informal asynchronous communications.
  2. But not too many chat tools.  Seriously, this is a problem.  Ask your team what they like, settle on one and stick with it; otherwise you end up with silos, missed messages and a confused group of people.
  3. Use video, even if it feels weird.  Voice chat is great, but there’s no substitute for seeing who you are talking with.  In his book “Silent Messages”, Dr. Albert Mehrabian attributes up to 55% of the impact of a message to the body language of the person presenting the message.  You can’t get that from voice chat alone.
  4. Take advantage of document sharing and collaboration.  A big percentage of our work results in unstructured content in the form of spreadsheets, reports, etc.  We need easy ways to find, collaborate on and share that stuff.  We use Alfresco, naturally.
  5. Have regular face-to-face meetings.  These can be expensive and time consuming, but there is no substitute for meeting in person, shaking hands and sharing a cup of coffee or lunch.  This is especially true for new additions to the crew, during that honeymoon period you need to meet.
  6. Make smart development investments.  When most of your team is remote it is easy to start to feel disconnected from your organization.  Over 5 years of working remotely both as an individual contributor and a leader I know I have felt that way from time to time.  Investing in your team’s professional development is a great way to help them reconnect.  It’s even better if you can use this as an opportunity to get some face time, for example by sending a couple of people from your team to the same conference so they can get to know each other.
  7. Celebrate success, no matter how small.  When everybody works together under one roof it’s easy to congratulate somebody on your team for a job well done.  It’s easy to pull the team together to celebrate a release, or a project milestone, or whatever.  When everybody is remote that becomes simultaneously harder and more important.  Don’t be shy about calling somebody out in your chat, via email or in a team call when they score a win.  Think to yourself “Is this the kind of thing I would walk over and thank somebody for in person?”.  If the answer is yes, then mention it in public.
  8. Raise your team’s profile.  It has been said that a boss takes credit, a leader gives it.  When your entire team is spread around the globe you, as their lead, serve to a certain extent as their interface to upper management and to leaders in other areas of the company.  Use this to your team’s benefit by raising the profile of your top performers to your leadership and to your peers.  When you bring a good idea from your staff to your leadership, your team or anybody else in your company, make sure you let them know exactly where it came from.
  9. Build lots of bridges.  A lot of these essentials come back to the risk of a remote team member becoming disconnected or disengaged.  One way to prevent this is to help your team get and stay engaged in areas other than your own.  Every company I have ever worked for has cross functional teams and initiatives.  Find the places where your team’s skills and bandwidth align with those cross functional needs and get them connected.  They’ll learn something new, share what they know and contribute to the success of the team.
  10. Shamelessly signal boost.  Many people on my team are active on social media, or with our team blog, or on other channels for knowledge sharing and engagement.  I absolutely encourage this, sharing our knowledge with peers, customers, partners and community members helps everybody.  It takes effort though, an effort that often goes beyond somebody’s core job role.  It’s also a bit scary at times, putting yourself and your ideas out there for everybody to see (and potentially criticize).  If somebody on your team takes the time and the risk, help boost them a bit.  Retweet their post, share it on LinkedIn, post it to internal chats, etc.  Not only will you be helping them spread the knowledge around, but you’re also lending your credibility to their message.
  11. Have defined development paths within (and out of!) your team.  A lot of promotions come from networking, cross functional experiences, word of mouth and other things that are harder to achieve when you work remotely.  As a leader of a remote team, it’s your responsibility to help your people understand the roles within your organization, what is required to move into those roles, and how to get there.  It’s also your job to make sure they know about great opportunities outside your team.  I want my people to be successful, however they define success.  That might be in my team or it might be elsewhere in the company.
  12. Be clear about your goals and how you’ll measure them.  If you have the sort of job that lets you work from home, odds are you aren’t punching a clock.  If you do come into an office every day but your boss is elsewhere 95% of the time, nobody is hovering around making sure you’re there.  Work should be a thing you do, not necessarily a place you go.  The only way this works is if everybody is clear about what we are all trying to achieve together, who’s responsible for what, and how we’ll measure the outcome.  If we all agree on that, you can work from the moon if your internet connection is fast enough.
  13. Trust.  I put this one last because it is easily the most important.  It’s important from the moment you hire somebody that you won’t see in person every day.  Put simply, you cannot possibly have a successful remote organization if you don’t trust the people you work with.  Full stop.  You have to nurture a culture of trust where people aren’t afraid to speak up, where transparency is a core value.

Is this list comprehensive?  Of course not.  Do I still struggle to do this stuff?  Every day, but I keep trying.

A Simple Pattern for Alfresco Extensions

Over the years I have worked with and for Alfresco, I have written a ton of Alfresco extensions.  Some of these are for customers, some are for my own education, some for R&D spikes, etc.  I’d like to share a common pattern that comes in handy.  If you are a super experienced Alfresco developer, this article probably isn’t for you.  You know this stuff already!

There are a lot of ways to build Alfresco extensions, and a lot of ways to integrate your own code or connect Alfresco to another product.  There are also a lot of ways you might want to call your own code or an integration, whether that is from an Action, a Behavior, a Web Script, a scheduled job, or via the Alfresco Javascript API.  One way to make your extension as flexible as possible is to use what could informally be called the “Service Action Pattern”.

The Service Action Pattern


Let’s start by describing the Service Action Pattern.  In this pattern, we take the functionality that we want to make available to Alfresco and we wrap it in a service object.  This is a well established pattern in the Alfresco world, used extensively in Alfresco’s own public API.  Things like the NodeService, ActionService, ContentService, etc. all take core functionality found in the Alfresco platform and wrap it in a well defined service interface consisting of a set of public APIs that return Alfresco objects like NodeRefs, Actions, Paths, etc., or Java primitives.  The service object is where all of our custom logic lives, and it provides a well defined interface for other objects to use.  In many ways the service object acts as an adapter, translating back and forth between the domain specific objects that your extension requires and Alfresco objects.  When designing a new service in Alfresco, I find it is a best practice to limit the types of objects returned by the service layer to those things that Alfresco natively understands.  If your service object method creates a new node, return a NodeRef, for example.

A custom service object on its own isn’t terribly useful, since Alfresco doesn’t know what to do with it.  This is where an Alfresco Action comes in handy.  We can use one or more Alfresco Actions to call the services that our service object exposes.  Creating an action to call the service object has several advantages.  First, once you have an Action you can easily call that Action (and thus the underlying service object) from the Javascript API (more on this in a moment).  Second, it is easy to take an Action and surface it in Alfresco Share for testing or so your users can call it directly.  Actions can also be triggered by folder rules, which can be useful if you need to call some code when a document is created or updated.  Finally, Actions are registered with Alfresco, which makes them easy to find and call from other Java or server side Javascript code via the ActionService.  If you want to do something to a file or folder in Alfresco there is a pretty good chance that an Action is the right approach.
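Stripped of the Alfresco plumbing, the shape of the pattern looks something like this.  The class and method names are invented for illustration, NodeRefs are stubbed as plain strings, and a real Action would extend Alfresco’s ActionExecuterAbstractBase:

```java
import java.util.List;
import java.util.Map;

public class ServiceActionPatternSketch {

    // The service object: all of the real logic lives behind this small,
    // well defined interface.
    interface EntityExtractionService {
        List<String> extractEntities(String nodeRef);
    }

    // A stand-in for an Alfresco Action executer.  Notice that it only
    // validates input and delegates -- that is the whole point of the pattern.
    static class ExtractEntitiesAction {
        private final EntityExtractionService service;

        ExtractEntitiesAction(EntityExtractionService service) {
            this.service = service;
        }

        Map<String, Object> execute(String nodeRef) {
            if (nodeRef == null || nodeRef.isEmpty()) {
                throw new IllegalArgumentException("nodeRef is required");
            }
            return Map.of("entities", service.extractEntities(nodeRef));
        }
    }

    public static void main(String[] args) {
        // A fake service implementation; a real one would read content via
        // Alfresco's ContentService and call out to an NLP engine.
        EntityExtractionService fake = nodeRef -> List.of("XYZ Corporation", "London");
        ExtractEntitiesAction action = new ExtractEntitiesAction(fake);
        System.out.println(action.execute("workspace://SpacesStore/123"));
    }
}
```

Because the Action is just wiring, the same service object can later be handed to a behavior, a web script or a root scope object without touching the logic.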

Using the Service Action Pattern also makes it simple to expose your service object as a REST API.  Remember that Alfresco Actions can be located and called easily from the Javascript API.  The Javascript API also happens to be (IMHO) the simplest way to build a new Alfresco Web Script.  If you need to call your Action from another system (a very common requirement) you can simply create a web script that exposes your action as a URL and call away.  This does require a bit of boilerplate code to grab request parameters and pass them to the Action, which in turn will call your service object.  It isn’t too much and there are lots of great examples in the Alfresco documentation and out in the community.

So why not just bake the code into the Action itself?  Good question!  First, any project of some complexity is likely to have a group of related functionality.  A good example can be found in the AWS Glacier Archive for Alfresco project we built a couple of years ago at an Alfresco hack-a-thon.  This project required us to have Actions for archiving content, initiating a retrieval, and retrieving content.  All of these Actions are logically and functionally related, so it makes sense to group them together in a single service.  If you want the details of how Alfresco integrates with AWS Glacier, you just have to look at the service implementation class; the Action classes themselves are just sanity checks and wiring.  Another good reason to put your logic into a service class is for reuse outside of Actions.  Actions carry some overhead, and depending on how you plan to use it you may want to make your logic available directly to a behavior or expose it to the Alfresco Javascript API via a root scope object.  Both of these are straightforward if you have a well defined service object.

I hope this helps you build your next awesome Alfresco platform extension.  I have found it a useful way to implement and organize my Alfresco projects.

Alfresco Premier Services – New Blog


I’m not shy about saying the best thing about my job is my team.  Never in my career have I worked with such a dedicated, skilled and fun group of people.  Whether you are talking about our management team or our individual contributors, the breadth and depth of experience across the Alfresco Premier Services team is impressive.  Members of our team have developed deep expertise across our product line, from auditing to RM, from workflow to search.  Some folks on the team have even branched out and started their own open source projects around Alfresco.  We have decided to take the next step in sharing our knowledge and launch a team blog.

The inaugural post details some recent changes to the Alfresco Premier Services offerings that coincide with the digital business platform launch.  We will follow that up in short order with new articles covering both content and process services, including guidance on FTP load balancing and on pulling custom property files into a process services workflow.

Lots of exciting stuff to come from the Premier Services Team, stay tuned!

Content Services is in Alfresco’s DNA

I’m spending this week at Alfresco’s Sales Kickoff in Chicago, and having a blast.  There’s a lot of energy across the company about our new Digital Business Platform, and it’s great to see how many people instantly and intuitively get how Alfresco’s platform fits into a customer’s digital transformation strategy.  When content and process converge, and when we provide a world-class platform for managing and exposing both as a service, it’s a pretty easy case to make.  We have some great customer stories to drive the point home too.  It’s one thing to talk about a digital strategy and how we can play there, but it’s another thing entirely to see it happen.

Content management is undergoing a shift in thinking.  Analysts have declared that ECM is dead, and content services is a better way to describe the market.  For my part, I think they are right.  Companies ahead of the curve have been using Alfresco as a content services platform for a long time.  I decided to do a little digging and see when Alfresco first added a web API to our content platform.  A quick look through some of our internal systems shows that Alfresco had working web services for content all the way back in 2006.  It was probably there earlier than that, but that’s one of the earliest references I could easily find.  That’s over a decade of delivering open source content services.  Here’s a quick view of the history of content services delivery channels in the product.

API History

I don’t think any other company in the market space formerly known as ECM can say that they have been as consistently service-enabled for as long as Alfresco.  It’s great to see the market going to where we have been all along.

My Favorite New Things in the Alfresco Digital Business Platform

alfresco-dbp

Everybody inside Alfresco has been busy getting ready for today’s launch of our new version, new branding, new web site, updated services and everything that comes along with it.  Today was a huge day for the company, with the release of Alfresco Content Services 5.2, a shiny new version of Alfresco Governance Services, our desktop sync client, the Alfresco Content Connector for Salesforce, a limited availability release of the Alfresco App Dev Framework and refreshes of other products such as our analytics solution, media management and AWS AMIs / Quickstarts.  Here are a few of my favorite bits from today’s releases (in no particular order).

The new REST API

Alfresco has always had a great web API, both the core REST API for interacting with Alfresco objects and the open-standards CMIS API for interacting with content.  Alfresco Content Services 5.2 takes this to the next level with a brand new set of APIs for working directly with nodes, versions, renditions and running search queries.  Not only is there a new API, but it is easier than ever to explore what the API has to offer via the API Explorer.  We also host a version of the API Explorer so you can take a look without having to set up an Alfresco instance.  The new REST API is versioned, so you can build applications against it without worrying that something will change in the future and break your code.  This new REST API was first released in the 5.2 Community version and is now available to Alfresco Enterprise customers.  The API is also a key component of the Alfresco App Development Framework, or ADF.  Like all previous releases, you can still extend the API to suit your needs via web scripts.
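To make the versioning concrete, here is a minimal sketch of what a client-side call shape looks like.  The host, credentials and node id are assumptions for illustration; the `/api/-default-/public/alfresco/versions/1` path is the versioned v1 root you will see in the API Explorer:

```java
import java.util.Base64;

// Sketch: building a request against the versioned v1 REST API.
// Host, credentials, and node id below are illustrative assumptions.
public class V1ApiUrlDemo {

    // Versioned URL for fetching a node's metadata; "1" in the path is the
    // API version, so a future v2 can coexist without breaking this client.
    static String nodeUrl(String baseUrl, String nodeId) {
        return baseUrl + "/alfresco/api/-default-/public/alfresco/versions/1/nodes/" + nodeId;
    }

    // Basic auth header value for the given credentials
    static String basicAuth(String user, String password) {
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes());
        return "Basic " + token;
    }

    public static void main(String[] args) {
        System.out.println(nodeUrl("http://localhost:8080", "-root-"));
        System.out.println(basicAuth("admin", "admin"));
    }
}
```

Any HTTP client can then issue a GET against that URL with the `Authorization` header set.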

Alfresco Search Services

Alfresco Content Services 5.2 comes with a whole new search implementation called Alfresco Search Services.  This service is based on Solr 6, and brings a huge number of search improvements to the Alfresco platform.  Search term highlighting, indexing of multiple versions of a document, category faceting, multi-select facets and document fingerprinting are all now part of the Alfresco platform.  Sharding also gets some improvements, and you can now shard your index by DBID, ACL, date, or any string property.  This is a big one for customers supporting multiple large, distinct user communities that may each have different search requirements.  Unlike previous releases of Alfresco, search is no longer bundled as a WAR file.  It is now its own standalone service.
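For a feel of what the sharding choice looks like in practice, a shard's configuration lives in its `solrcore.properties`.  The fragment below is a sketch from memory of the Search Services sharding settings; treat the exact keys and values (shard count, date grouping, the property used as the key) as illustrative rather than authoritative:

```properties
# Illustrative solrcore.properties fragment for one shard of a date-sharded index
shard.method=DATE
shard.instance=0
shard.count=4
shard.key=cm:created
shard.date.grouping=3
```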

The Alfresco App Dev Framework

Over the years there have been a number of ways to build what your users need on top of the Alfresco platform.  In the early days this was the Alfresco Explorer (now deprecated), built with JSF.  The Share UI was added to the mix later, allowing a more configurable UI with extension points based on Surf and YUI.  Both of these approaches required you to start with a UI that Alfresco created and modify it to suit your needs.  This works well for use cases that are somewhat close to what the OOTB UI was built for, or for problems that require minimal change to solve.  For example, both Explorer and Share made it pretty easy to add custom actions, forms, or to change what metadata was displayed.  However, the further you get from what Share was designed to do, the more difficult the customizations become.

What about those cases where you need something completely different?  What if you want to build your own user experience on top of Alfresco content and processes?  Many customers have done this by building out their own UI in any number of different technologies.  These customers asked us to make it easier, and we listened.  Enter the Alfresco App Dev Framework, or ADF.  The ADF is a set of Angular2 components that make it easier to build your own application on top of Alfresco services.  There’s much more to it than that, including dev tooling, test tooling and other things that accelerate your projects.  The ADF is big enough to really need its own series of articles, so may I suggest you hop over to the Alfresco Community site and take a look!  Note that the ADF is still in a limited availability release, but we have many customers that are already building incredible things with it.

Admin Improvements

A ton of people put in a tremendous amount of work to get Alfresco Content Services 5.2 out the door.  Two new features that I’ve been waiting for are included, courtesy of the Alfresco Community and Alfresco Support.  The first is the trashcan cleaner, which can automate the task of cleaning out the Alfresco deleted items collection.  This is based on the community extension that many of our customers have relied on for years.  The second is the Alfresco Support Tools component.  Support Tools gives you a whole new set of tools to help manage and troubleshoot your Alfresco deployment, including thread dumps, profiling and sampling, scheduled job and active session monitoring, and the ability to both view logs and change log settings, all from the browser.  This is especially handy for those cases where admins might not have shell access to the box on which Alfresco is running or have JMX ports blocked.  There’s more as well; check out the 5.2 release notes for the full story.

The Name Change

Ok, so we changed the name of the product.  Big deal?  Maybe not to some people, but it is to me.  Alfresco One is now Alfresco Content Services.  Why does this matter?  For one, it more accurately reflects what we are, and what we want to be.  Alfresco has a great UI in Share, but it’s pretty narrowly focused on collaboration and records management use cases.  This represents a pretty big slice of the content management world, but it’s not what everybody needs.  Many of our largest and most successful customers use Alfresco primarily as a content services platform.  They already have their own front end applications that are tailor made for their business, either built in-house or bought from a vendor.  These customers need a powerful engine for creating, finding, transforming and managing content, and they have found it in Alfresco.  The name change also signals a shift in mindset at Alfresco.  We’re thinking bigger by thinking smaller.  This new release breaks down the platform into smaller, more manageable pieces.  Search Services, the Share UI, Content Services and Governance Services are all separate components that can be installed or not based on what you need.  This lets you build the platform you want, and lets our engineering teams iterate more quickly on each individual component.  Watch for this trend to continue.

I’m excited to be a part of such a vibrant community and company, and can’t wait to see what our customers, partners and others create with the new tools they have at their disposal.  The technology itself is cool, but what you all do with it is what really matters.

Digital Transformation and the Role of Content and Process

I recently had the opportunity to go to NYC and speak on architecting systems for digital transformation.  It was a terrific day with our customers and prospects, as well as the extended Alfresco crew and many of our outstanding partners.  This was the first time I’ve delivered this particular talk so I probably stumbled over my words a few times as I was watching the audience reaction as much as the speaker notes.  One of the things I always note when speaking is which slides prompt people to whip out their phones and take a picture.  That’s an obvious signal that you are hitting on something that matters to them in a way that caught their attention, so they want to capture it.  In my last talk, one of the slides that got the most attention and set the cameras snapping pictures was this one:

road-to-transformation

Digitization

Digitization of content is the whole reason ECM as a discipline exists in the first place.  In the early days of ECM a lot of use cases centered around scanning paper as images and attaching metadata so it could be easily found later.  The drivers for this are obvious.  Paper is expensive, it takes up space (also expensive), it is hard to search, deliver and copy, and you can’t analyze it en masse to extract insights.  As ECM matured, we started handling more advanced digital content such as PDFs and Office documents, videos, audio and other data, and we started to manipulate them in new ways.  Transforming between formats, for example, or using form fields and field extraction.  OCR also played a role, taking those old document image files and breathing new life into them as first class digital citizens.  We layered search and analytics on top of this digital content to make sense of it all and find what we needed as the size of repositories grew ever larger.  Digitization accelerated with the adoption of the web.  That paper form that your business relied on was replaced by a web form or app, infinitely malleable.  That legacy physical media library was transformed into a searchable, streamable platform.

What all of these things have in common is that they are centered around content.

Digitalization

Simply digitizing content confers huge advantages on an organization.  It drives down costs, makes information easier to find and allows aggregation and analysis in ways that legacy analog formats never could.  While all of those things are important, they only begin to unlock the potential of digital technologies.  Digitized content allows us to take the next step:  Digitalization.  When the cost of storing and moving content drops to near zero, processes can evolve free from many of the previous constraints.  BPM and process management systems can leverage the digitized content and allow us to build more efficient and friendlier ways to interact with our customers and colleagues.  We can surface the right content to the right people at the right time, as they do their job.  Just as importantly, content that falls out of a process can be captured with context and managed as a record if needed.  Now, instead of just having digitized content, we have digitalized processes that were not possible in the past.  We can see, via our process platform, exactly what is happening across our business processes.  Processes become searchable and can be analyzed in depth.  Processes can have their state automatically affected by signals in the business, and can signal others.  Our business state is now represented in a digital form, just like our content.

If digitization is about content, digitalization is about process.

Digital Transformation

The union of digitized content and digital processes is a powerful combination, one that helps create the conditions for digital transformation.  How?

In my opinion (and many others’) the single most important thing to happen to technology in recent memory is the rise of the open-standards web API.  APIs have been around on the web for decades.  In fact, my first company was founded to build out solutions on top of the public XML-based API provided by DHL to generate shipments and track packages.  That was all the way back in the very early 2000s, a lifetime ago by tech standards.  Even then though, people were thinking about how to expose parts of their business as an API.

One of the watershed moments in the story of the API happened way back in the early 2000s as well, at Amazon.  By this point everybody has read about Amazon’s “Big Mandate”.  If you haven’t read about it yet, go read it now.  I’ll wait.  Ok, ready?  Great.  So now we know that the seeds for Amazon’s dominance in the cloud were planted over a decade and a half ago by the realization that everything needs to be a service, that the collection of services that you use to run your business can effectively become a platform, and that platform (made accessible) changes the rules.  By treating every area of their business as part of the platform, all the way down to the infrastructure that ran them, Amazon changed the rules for everybody.

How does this tie into the previous two sections?  What about content and process?  I’m glad you asked.  Content and process management systems take two critical parts of a business (the content it generates and the processes that it executes) and surface them as an API.  Want to find a signed copy of an insurance policy document?  API.  Want to see how many customers are in your sign up pipeline and at what stage?  API.  Want to see all of the compliance related records from that last round of lab testing?  API.  Want to release an order to ship?  API.  Want to add completed training to somebody’s personnel record?  API.  You get the idea.  By applying best practices around content and process management systems, you can quickly expose large chunks of your business as a service, and begin to think of your company as a platform.

This is transformative.  When your business is a platform, you can do things you couldn’t do before.  You can build out new and better user experiences much more quickly by creating thin UI layers on top of your services.  Your internal teams or vendors that are responsible for software that provides the services can innovate independently (within the bounds of the API contract), which immediately makes them more agile.  You can selectively expose those services to your customers or partners, enabling them to innovate on top of the services you provide them.  You can apply monitoring and analytics at the service layer to see how the services are being used, and by whom (this, in fact, is one thing I would argue you MUST do if you plan to orient yourselves toward services in any meaningful way, but that’s a separate article), which adds a new dimension to BI.  This is the promise of digital transformation.

There’s certainly more to come on this in the near future as my team continues our own journey toward digital transformation.  We are already well down the path and will share our insights and challenges as we go.

Do’s and Don’ts of Alfresco Development

Recently I had the pleasure of going up to NYC to speak with Alfresco customers and prospects about architecting Alfresco solutions and how to align that with efforts around digital transformation.  Part of that talk was a slide that discussed a few “do’s and don’ts” of designing Alfresco extensions.  Somebody suggested that I write down my thoughts on that slide and share it with the community.  So…  Here you go!

Do:

Stick with documented public APIs and extension points.

In early versions of Alfresco it wasn’t exactly clear which parts of the API were public (intended for use in extensions) or private (intended for use by Alfresco engineering).  This was largely fixed in the 4.x versions of the product, and it was mostly a problem for people building things out on the Alfresco Java API.  Today, the Java API is fully annotated so it’s clear what is and is not public.  The full list is also available in the documentation.  Of course Alfresco’s server side Javascript API is public, as is the REST API (both Alfresco core and CMIS).  Alfresco Activiti has similar API sets.

Leverage the Alfresco SDK to build your deployment artifacts

During my time in the field I saw some customers and Alfresco developers roll their own toolchain for building Alfresco extensions using Ant, Gradle, Maven or in some cases a series of shell scripts.  This isn’t generally recommended these days, as the Alfresco SDK covers almost all of these cases.  Take advantage of all of the work Alfresco has done on the SDK and use it to build your AMPs / JARs and create your Alfresco / Share WAR file.  The SDK has some powerful features, and can create a complete skeleton project in a few commands using Alfresco archetypes.  Starting Alfresco with the extension included is just as simple.

Use module JARs where possible, AMPs where not

For most of Alfresco’s history, the proper way to build an extension was to package it as an AMP (Alfresco Module Package).  AMPs (mostly) act like overlays for the Alfresco WAR, putting components in the right places so that Alfresco knows where to find them.  Module JARs were first added in Alfresco 4.2, have undergone significant improvements since their introduction, and are now referred to as “Simple Modules” in the documentation.  Generally, if your extension does not require a third-party JAR, using a Simple Module is a much cleaner, easier way to package and deploy an extension to Alfresco.
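A Simple Module JAR is mostly just your classes and resources plus a small module descriptor inside the JAR.  As a sketch, the descriptor looks something like the fragment below; the module id, title and version values are illustrative, and the file lives under a path that incorporates your module id:

```properties
# alfresco/module/com.example.my-module/module.properties (illustrative values)
module.id=com.example.my-module
module.title=My Example Module
module.description=An extension packaged as a Simple Module JAR
module.version=1.0.0
```

Drop the JAR into the appropriate modules directory and Alfresco picks it up at startup, with no WAR overlay step required.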

Unit test!

This should go without saying, but it’s worth saying anyway.  In the long run unit testing will save time.  The Alfresco SDK has pretty good support for unit tests, and projects built using the All-In-One Archetype have an example unit test class included.

Be aware of your tree: depth and degree matter

At its core, the Alfresco repository is a big node tree.  Files and folders are represented by nodes, and the relationships between these things are modeled as parent-child or peer relationships between nodes.  The depth of this tree can affect application design and performance, as can the degree, or number of child nodes.  In short, it’s not usually a best practice to have a single folder that contains a huge number of children.  Not only can this make navigation difficult if using the Alfresco API, but it can also create performance troubles if an API call tries to list all of the children in a gigantic folder.  In the new REST APIs the results are automatically paged, which mitigates this problem.
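The v1 REST API's paging is driven by `skipCount` and `maxItems` query parameters, so a client can walk a large folder a page at a time rather than listing everything at once.  A small sketch (the host, node id and page size are illustrative assumptions):

```java
// Sketch: paging through a large folder with the v1 "list children" endpoint
// instead of requesting every child in one call.
public class PagedChildrenDemo {

    // skipCount/maxItems are the v1 REST API paging query parameters
    static String childrenUrl(String baseUrl, String nodeId, int skipCount, int maxItems) {
        return baseUrl + "/alfresco/api/-default-/public/alfresco/versions/1/nodes/"
                + nodeId + "/children?skipCount=" + skipCount + "&maxItems=" + maxItems;
    }

    public static void main(String[] args) {
        // Walk a folder 100 children at a time; a real client would stop when
        // the pagination block in the response indicates no more items remain.
        for (int skip = 0; skip < 300; skip += 100) {
            System.out.println(childrenUrl("http://localhost:8080", "abc-123", skip, 100));
        }
    }
}
```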

Use the Alfresco ServiceRegistry instead of injecting individual services

When using the Alfresco Java API and adding new Spring beans there are two ways to inject Alfresco services such as NodeService, PermissionService, etc.  You can inject the individual services that you need, or you can inject the ServiceRegistry that allows access to all services.  On the one hand, injecting individual services makes it easy to see from the bean definition exactly what services are used without going to the code.  On the other hand, every time you need another service you have to explicitly inject it.  My preference is to simply inject the registry.  A second reason to inject the registry is that you’ll always get the right version of the service.  Alfresco has two versions of each service.  The first is the “raw” service, which follows a naming convention that starts with a lowercase letter.  An example is nodeService.  These services aren’t recommended for extensions.  Instead, Alfresco provides an AOP proxy that follows a naming convention starting with an uppercase letter, such as NodeService.  This is the service that is recommended and is the one that will be returned by the service registry.
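In Spring terms the two approaches look like this (the bean ids and class name below are hypothetical; `NodeService`, `PermissionService` and `ServiceRegistry` are the capitalized, proxied beans the convention above describes):

```xml
<!-- Option 1: inject individual services (note the capitalized bean ids) -->
<bean id="myServiceExplicit" class="com.example.MyServiceImpl">
    <property name="nodeService" ref="NodeService"/>
    <property name="permissionService" ref="PermissionService"/>
</bean>

<!-- Option 2: inject the registry; one property, every service available -->
<bean id="myServiceViaRegistry" class="com.example.MyServiceImpl">
    <property name="serviceRegistry" ref="ServiceRegistry"/>
</bean>
```

With option 2, code calls something like `serviceRegistry.getNodeService()` at runtime and always receives the proxied service.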

Don’t:

Modify core Alfresco classes, web scripts, or Spring beans

Alfresco is an open source product.  This means that you can, in theory, pull down the source, modify it as much as you like and build your own version to deploy.  Don’t do that.  That’s effectively a fork, and at that point you won’t be able to upgrade.  Stick with documented extension points and don’t modify any of the core classes, Freemarker templates, Javascript, etc.  If you find you cannot accomplish a task using the defined extension points, contact Alfresco support or open a discussion on the community forum and get an enhancement request in so it can be addressed.  Speaking of extension points, if you want a great overview of what those are and when to use them, the documentation has you covered.

Directly access the Alfresco / Activiti database from other applications

Alfresco and Activiti both have an underlying database.  In general, it is not a best practice to access this database directly.  Doing so can exhaust connection pools on the DB side or cause unexpected locking behavior.  This typically manifests as performance problems but other types of issues are possible, especially if you are modifying the data.  If you need to update things in Alfresco or Activiti, do so through the APIs.

Perform operations that will create extremely large transactions

When Alfresco is building its index, it pulls transactions from the API to determine what has changed, and what needs to be indexed.  This works well in almost all cases.  The one case where it may become an issue is if an extension or import process has created a single transaction which contains an enormous change set.  The Alfresco Bulk Filesystem Import Tool breaks down imports into chunks, and each chunk is treated as a transaction.  Older Alfresco import / export functionality such as ACP did not do this, so importing a very large ACP file may create a large transaction.  When Alfresco attempts to index this transaction, it can take a long time to complete and in some cases it can time out.  This is something of an edge case, but if you are writing an extension that may update a large number of nodes, take care to break those updates down into smaller parts.
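The chunking idea itself is simple, and a minimal sketch in plain Java looks like the following.  The batch size and helper names are mine; in a real Alfresco extension each batch would typically run in its own transaction (for example via Alfresco's RetryingTransactionHelper) so that no single transaction carries the whole change set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: split a large set of node updates into small batches so that no
// single transaction contains an enormous change set.
public class BatchedUpdateDemo {

    // Split a list into batches of at most batchSize elements
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }

    // Apply the callback once per batch; in Alfresco, each call would be
    // wrapped in its own retrying transaction callback.
    static <T> void processInBatches(List<T> items, int batchSize, Consumer<List<T>> perBatch) {
        for (List<T> batch : partition(items, batchSize)) {
            perBatch.accept(batch);
        }
    }

    public static void main(String[] args) {
        List<Integer> nodeIds = new ArrayList<>();
        for (int i = 0; i < 10; i++) nodeIds.add(i);
        processInBatches(nodeIds, 4,
                batch -> System.out.println("updating " + batch.size() + " nodes"));
    }
}
```

Each batch then shows up to the indexer as its own modest transaction instead of one giant one.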

Code yourself into a corner with extensions that can’t upgrade

Any time you are building an extension for Alfresco or Activiti, pay close attention to how it will be upgraded.  Alfresco makes certain commitments to API stability, what will and will not change between versions.  Be aware of these things and design with that in mind.  If you stick with the public APIs in all of their forms, use the Alfresco SDK and package / deploy your extensions in a supported manner, most of the potential upgrade hurdles will already be dealt with.

Muck around with the exploded WAR file

This one should go without saying, but it is a bad practice to modify an exploded Java WAR file.  Depending on which application server you use and how it is configured, the changes you make in the exploded WAR may be overwritten by a WAR redeployment at the most inopportune time.  Instead of modifying the exploded WAR, create a proper extension and deploy that.  The SDK makes this quick and easy, and you’ll save yourself a lot of pain down the road.

This list is by no means comprehensive, just a few things that I’ve jotted down over many years of developing Alfresco extensions and helping my peers and customers manage their Alfresco platform.  Do you have other best practices?  Share them in the comments!