Rethinking the Migration Pipeline with AWS

Migration

Traditionally, migrating content to an Alfresco instance has usually looked very much like an ETL process.  First, content, metadata, and structure are extracted from a source system (be it a shared drive, a legacy ECM system, etc.), then that data may be transformed in some way, and finally it is imported into Alfresco.  Using the OOTB tools, the last step is typically accomplished via the BFSIT (Bulk Filesystem Import Tool).  This approach has a lot of advantages.  It’s well understood, growing from an incredibly common model for database migrations and BI activities.  An ETL-like migration approach provides plenty of opportunities to fix data quality issues while the data is in a “staged” state.  Properly optimized, this approach can be very fast.  In the best cases it only requires re-linking existing content objects into the repository and creating the associated metadata, with no in-flight copying of large content files required.  For many migrations, this approach makes perfect sense.  In the on-premises world, even more so.

The ETL approach does have some downsides.  It’s not great at dealing with a repository that is actively being used, requiring delta migrations later.  It can be hard to scale up (faster per load) and out (more loads running in parallel), especially on-premises, where companies are understandably reluctant to add more hardware for what should be a short-term need.  It’s typically batch driven, and an import failure of one or more nodes in the batch can really slow things down.

A Better Way?

Alfresco on AWS opens up some new ways of looking at migrations.  Recently I was catching up on some reading and had the occasion to revisit a pair of articles by Gavin Cornwell.  In these two articles Gavin lays out some potential options for using AWS AI services such as Comprehend and Rekognition with Alfresco.  Specifically, he takes a look at AWS Lambda and Step Functions.  Reading those articles got me thinking about ingestion pipelines, which in turn got me thinking about system migrations.  We do a LOT of these in Alfresco consulting, and we rely on a toolbox of Alfresco products, our own internal tools, and excellent solutions that are brought to the party by Alfresco partners.

The question that I kept circling around is this:  Does a move to AWS change the patterns that apply to migration?  I think it does.  Full disclosure: this is still very much a work in progress, probably deviates from best practices in a near-infinite number of ways, and is a personal science project at this point.

AWS offers some opportunities to rethink how to approach a migration.  Perhaps the biggest comes from the ability to treat the migration as a scalable, event-driven process.  Instead of pushing all of the content up to S3 and then importing or linking it into the repository using an in-process import tool (or external tooling that processes the list of content in S3), we can use the events emitted by writes to S3 itself to take care of validating, importing, and post-processing the content.  Content items are written to S3, which in turn triggers a per-item process that performs the steps required to get the content into the repository.  This is a perfect fit for the serverless compute paradigm provided by AWS.
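As a rough sketch of the wiring (all names and ARNs below are hypothetical, and note that object-level S3 events reach CloudWatch Events by way of CloudTrail), the rule connecting bucket writes to a state machine could be created with the AWS SDK for Node.js along these lines:

var AWS = require('aws-sdk');
var events = new AWS.CloudWatchEvents();

// Match PutObject calls against the (hypothetical) staging bucket
events.putRule({
    Name: 'migration-item-written',
    EventPattern: JSON.stringify({
        source: ['aws.s3'],
        'detail-type': ['AWS API Call via CloudTrail'],
        detail: {
            eventSource: ['s3.amazonaws.com'],
            eventName: ['PutObject'],
            requestParameters: { bucketName: ['migration-staging-bucket'] }
        }
    })
}, function (err) {
    if (err) return console.error(err);
    // Point the rule at the (hypothetical) migration state machine
    events.putTargets({
        Rule: 'migration-item-written',
        Targets: [{
            Id: 'migration-state-machine',
            Arn: 'arn:aws:states:us-east-1:123456789012:stateMachine:migrate-item',
            RoleArn: 'arn:aws:iam::123456789012:role/events-invoke-states'  // lets Events start executions
        }]
    }, function (err2) {
        if (err2) console.error(err2);
    });
});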

Consider the following Step Functions State Machine diagram:

[Diagram: Step Functions state machine for the migration pipeline]

For each item written to S3 (presumably, but not necessarily, a content item and a paired object that represents the metadata), S3 can emit a CloudWatch event.  This event matches a rule, which in turn starts a Step Functions State Machine.  The state machine neatly encapsulates all of the steps required to validate the content and metadata, import the content into the repository and apply the validated metadata, and then optionally perform post-processing such as image recognition, natural language processing, generation of thumbnails, etc.  Each step in the process maps to an AWS Lambda function which encapsulates one granular part of the validation and import process.  Most of the heavy lifting is done outside of Alfresco, which reduces the need to scale up the cluster during the import.  It should be possible to run a huge number of these step functions in parallel, speeding up the overall migration process while only paying for the compute consumed by the migration itself.  If anything goes wrong up to the point where the import of this specific item is complete, it can be rolled back.
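As an example of what one of those granular steps might look like, here is a minimal sketch of a validation function.  It assumes the state machine passes the bucket and key of the newly written item as input, and that metadata lives in a paired <key>.metadata.json object; both conventions are assumptions for illustration, not fixed parts of the design:

var AWS = require('aws-sdk');
var s3 = new AWS.S3();

exports.handler = function (event, context, callback) {
    // Assumed convention: content lives at <key>, metadata at <key>.metadata.json
    var metadataKey = event.key + '.metadata.json';
    s3.headObject({ Bucket: event.bucket, Key: metadataKey }, function (err) {
        if (err) {
            // Failing this step lets the state machine route to a rollback state
            callback(new Error('No metadata found for ' + event.key));
        } else {
            callback(null, event);  // pass the input through to the next state
        }
    });
};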

Perhaps the best thing about this approach is that it is so easy to adapt.  No two migrations are identical, even when the source and target platforms are common, so flexibility is key.  Flexibility here not only means flexibility during initial design, but also the ability to adjust the migration to fix issues discovered in flight, or to support more than one import pipeline.  Need to change the way validation is done?  Modify the validation Lambda function.  Don’t need image or text classification, or want to use a different engine?  Remove or change the relevant functions.  Need to decorate imported docs with metadata pulled from another LOB system?  Insert a new step into the process.  It’s easy to change.

Over the next few articles we’ll dive deeper into exactly how this could work in detail, and build out some example Lambda functions to fill in the process and make it do real work.

Alfresco Javascript API and AWS Lambda Functions. Part 1: Lambda and Cloud9 101


I’ve written before about several ways to make use of AWS Lambda functions within your Alfresco Content and Process Services deployment.  In preparation for my DevCon talk, I’m diving deeper into this and building out some demos for the session.  I figured this is a good time to do a quick writeup on one way to approach the problem.

What is a Lambda function?

Briefly, AWS Lambda is a “serverless” compute service.  Instead of the old model of provisioning an EC2 instance, loading software and services, etc., Lambda allows you to simply write code against a specific runtime (Java, Node.js, Python, or .NET) that is executed when an event occurs.  This event can come from a HUGE number of sources within the AWS suite of services.  When your function is called, it is spun up in a lightweight, configurable container and run.  That container may or may not be used again, depending on several factors.  AWS provides some information about what happens under the hood, but the idea is that most of the time you don’t need to sweat it.  In summary, a Lambda function is just a bit of code that runs in response to a triggering event.
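For the Node.js runtime, the shape of that bit of code is simple: a module that exports a handler.  A trivial example:

exports.handler = function (event, context, callback) {
    // 'event' carries the payload from whatever triggered us
    console.log('Received event:', JSON.stringify(event));
    callback(null, 'done');  // signal success (or pass an Error to fail)
};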

Preparing the Lambda package

Creating a Lambda function through the AWS UI is trivial.  A few clicks, a couple of form fields, and you’re done.  This is fine for simple functions, but what about when you need to use an external library?  The bad news is that this takes a couple of extra steps.  The good news is that once you have done it, you can move on to a more productive workflow.  The sticking point with doing it all through the AWS console is the addition of libraries and dependencies.  We can get around that by using a Zip package to start our project.  The Zip package format is pretty simple.  Let’s create one; we’ll call it AlfrescoAPICall.

Start by creating an empty directory for your project, and changing into that directory:

mkdir AlfrescoAPICall

cd AlfrescoAPICall

Next, create a handler for your Lambda function.  The default file name for the handler is index.js, but you can change it so long as you configure your Lambda function appropriately (the function’s handler setting takes the form file.export, so the default is index.handler).

touch index.js

Now, use npm to install the modules you need into your project directory.  For this example, we’ll use alfresco-js-api.

npm install alfresco-js-api

A real project probably wouldn’t install the needed modules piecemeal; it makes more sense to define all of the dependencies in package.json instead.  Regardless, at the end of this you should have a project root folder that contains your Lambda function handler, and a node_modules directory that contains all of your dependencies.  Next, we need to Zip this up into a Lambda deployment package.  Don’t Zip up the project folder itself; we need to Zip up the folder contents instead.

zip -r AlfrescoAPICall.zip .

And that’s it!  AlfrescoAPICall.zip is the Lambda package that we’ll upload to AWS so we can get to work.  We don’t need to do this again unless the dependencies or versions change.

Getting it into AWS

There are a few ways to get our newly created deployment package up to AWS.  It can be done using the AWS CLI, or with the AWS console.  (This assumes the function already exists; the very first time around you’d create it with aws lambda create-function.)  If you do it via the CLI, it will look something like this:

aws lambda update-function-code --function-name AlfrescoAPICall --zip-file fileb://AlfrescoAPICall.zip

(Note the fileb:// prefix; the CLI needs it to read the Zip as a binary file.)

If you do it via the AWS console, you can simply choose the file to upload:

[Screenshot: uploading the function package in the AWS Lambda console]

Regardless of how you update your Lambda function, once you have uploaded your zip archive you can see the entire project in the “edit code inline” view in the Lambda page:

[Screenshot: the project in the Lambda inline code editor]

Woo hoo!  Now we have a skeleton Lambda project that includes the Alfresco Javascript Client API and we can start building!

Starting development in earnest

This process is simple, but it isn’t exactly the easiest way to build in AWS.  With the relaunch of Cloud9 at re:Invent, AWS has a pretty good web based IDE that we can use for this project.  I’m not going to go through all the steps of creating a Cloud9 environment, but once you have it created you should see your newly created Lambda function in the right-hand pane under the AWS Resources tab.  If you don’t, make sure the IAM account you are using with Cloud9 (not the root account!!) has access to your function.  It will be listed under Remote Functions.  Here’s where it gets really cool.  Right click on the remote function, and you can import the whole thing right into your development environment:

[Screenshot: importing the remote Lambda function into Cloud9]


Neat, right?  After you import it you’ll see it show up in the project explorer view on the left.  From here it is basically like any other IDE.  Tabbed code editor, tree views, syntax highlighting and completion, etc, etc.

[Screenshot: the imported function in the Cloud9 editor]

One cool feature of Cloud9 is the ability to test run your Lambda function locally (in the EC2 instance Cloud9 is connected to) or on the AWS Lambda service itself by picking the option from the menu in the run panel. As one would expect, you can also set breakpoints in your Lambda function for debugging:

[Screenshot: running and debugging the Lambda function from Cloud9]

Finally, once you are done with your edits and have tested your function to your satisfaction, getting the changes back up to the Lambda service is trivial.  Save your changes, right click on the local function, and select deploy:

[Screenshot: deploying the local function back to the Lambda service]

Simple, right?  Now we have a working Lambda function with the Alfresco Javascript Client API available, and we can start developing something useful!  In part two, we’ll continue by getting Alfresco Process Services up and running in AWS and extend this newly created function to interact with a process.
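To give a taste of where this is headed, here’s a minimal sketch of what the handler might grow into: logging in to Alfresco Content Services and fetching the repository root node.  The host and credentials are placeholders, and I’m assuming the Node.js entry point of the library, alfresco-js-api-node:

var AlfrescoApi = require('alfresco-js-api-node');  // Node.js entry point of alfresco-js-api

exports.handler = function (event, context, callback) {
    var alfresco = new AlfrescoApi({
        provider: 'ECM',
        hostEcm: 'http://alfresco.example.com:8080'  // placeholder ACS host
    });

    alfresco.login('admin', 'admin').then(function () {
        // Prove the connection works by fetching the repository root node
        alfresco.nodes.getNodeInfo('-root-').then(function (node) {
            callback(null, 'Connected, root node is ' + node.name);
        }, function (err) {
            callback(err);
        });
    }, function (err) {
        callback(err);
    });
};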


AWS Lambda and Alfresco – Connecting Serverless to Content and Process


Let’s start with a rant.  I don’t like the term “Serverless” to describe Lambda or other function-as-a-service platforms.  Yeah, OK, so you don’t need to spin up servers, or worry about EC2 instances, or any of that stuff.  Great.  But it still runs on a server of some sort.  Even nascent efforts to extend “Serverless” to edge devices still have something that could be called a server, the device itself.  If it provides a service, it’s a server.  It’s like that Salesforce “No Software” campaign.  What?  It’s freaking software, no matter what some marketing campaign says.  It looks like the name is going to stick, so I’ll use it, but if you wonder why I look like I’ve just bitten into a garden slug every time I say “Serverless”, that’s why.

Naming aside, there’s no doubt this is a powerful paradigm for writing, running and managing code.  For one, it’s simple.  It takes away all the pain of the lower levels of the stack and gives devs a superbly clean and easy environment.  It is (or should be) scalable.  It is (or should be) performant.  The appeal is easy to see.  Like most areas that AWS colonizes, Lambda seems to be the front runner in this space.

You know what else runs well in AWS?  Alfresco Content and Process Services.

Lambda -> Alfresco Content / Process Services

It should be fairly simple to call Alfresco Content or Process Services from AWS Lambda.  Lambda supports several execution environments, all of which support calling an external URL.  If you have an Alfresco instance running on or otherwise reachable from AWS, you can call it from Lambda.  This does, however, require you to write all of the supporting code to make the calls.  One Lambda execution environment is Node.js, which probably presents us with the easiest way to get Lambda talking to Alfresco.  Alfresco recently released a Javascript client API which supports connections to both Alfresco Content Services and Alfresco Process Services.  This client API requires at least Node.js 5.x; Lambda supported 6.10 at the time this article was written, so no problem there!
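As a quick sketch (placeholder hosts and credentials, and assuming the library’s Node.js entry point, alfresco-js-api-node), a Lambda function that opens a session against both Content and Process Services might look like this:

var AlfrescoApi = require('alfresco-js-api-node');

exports.handler = function (event, context, callback) {
    var alfresco = new AlfrescoApi({
        provider: 'ALL',                              // one session for ECM and BPM
        hostEcm: 'http://alfresco.example.com:8080',  // placeholder ACS host
        hostBpm: 'http://aps.example.com:8080'        // placeholder APS host
    });

    alfresco.login('admin', 'admin').then(function () {
        callback(null, 'Logged in to both Content and Process Services');
    }, function (err) {
        callback(err);
    });
};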

Alfresco Content / Process Services -> Lambda

While it’s incredibly useful to be able to call Alfresco services from Lambda, what about triggering Lambda functions from the Alfresco Digital Business Platform?  That part is also possible; exactly how to do it depends on what you want to do.  Lambda supports many ways to invoke a function, some of which may be helpful to us.

S3 bucket events

AWS Lambda functions can be triggered in response to S3 events such as object creation or deletion.  The case that AWS outlines on their web site is a situation where you might want to generate a thumbnail, which Alfresco already handles quite nicely, but it’s not hard to come up with others.  We might want to do something when a node is deleted from the S3 bucket by Alfresco.  For example, this could be used to trigger a notification that content cleanup was successful or to write out an additional audit entry to another system.  Since most Alfresco DBP deployments in AWS use S3 as a backing store, this is an option available to most AWS Alfresco Content or Process Services instances.
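For instance, here’s a sketch of a function that reacts to deletions, assuming the standard S3 event shape (the audit follow-up is left hypothetical):

exports.handler = function (event, context, callback) {
    event.Records.forEach(function (record) {
        if (record.eventName.indexOf('ObjectRemoved') === 0) {
            console.log('Content removed from store:',
                record.s3.bucket.name + '/' + record.s3.object.key);
            // Hypothetical follow-up: write an audit entry to another system here
        }
    });
    callback(null, event.Records.length + ' record(s) processed');
};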

Simple Email Service

Another way to trigger a Lambda function is via the AWS Simple Email Service.  SES is probably more commonly used to send emails, but it can also receive them.  SES can invoke your Lambda function and pass it the email it received.  Sending email can already easily be done from both an Alfresco Process Services BPMN task and from an Alfresco Content Services Action, so this could be an easy way to trigger a Lambda function using existing functionality in response to a workflow event or something occurring in the content repository.
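Sketched out (again with the follow-up left hypothetical), the receiving function just digs the message details out of the SES event:

exports.handler = function (event, context, callback) {
    var mail = event.Records[0].ses.mail;
    console.log('Email from ' + mail.source + ', subject: ' + mail.commonHeaders.subject);
    // Hypothetical: switch on the subject line to decide what job to kick off
    callback(null, 'processed');
};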

Scheduled Events

AWS CloudWatch Events provides a scheduled event capability.  Schedules are configured using either a fixed rate or a cron expression, and use a rule target definition to define which Lambda function to call.  A scheduled event isn’t really a way to call Lambda functions from ACS or APS, but it could prove to be very useful for regular cleanup events, archiving or other recurring tasks you wish to run against your Alfresco Content Services instances in AWS.  It also gives you a way to trigger things to happen in your APS workflows on a schedule, but that case is probably better handled in the process definition itself.
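For example, a rule’s schedule expression can use a fixed rate or CloudWatch’s six-field cron syntax; these are illustrative values, the second of which fires at 02:00 UTC every night:

rate(1 day)

cron(0 2 * * ? *)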

API Gateway

Our last two options would require a little work, but may turn out to be the best for most use cases.  Using an API Gateway you can define URLs that can be used to directly trigger your Lambda functions.  Triggering these from Alfresco Process Services is simple: just use a REST call task to make the call.  Doing so from Alfresco Content Services is a bit trickier, requiring either a custom action or a behavior that makes the call out to the API gateway and passes it the info your Lambda function needs to do its job.  Still fairly straightforward, and there are lots of good examples of making HTTP calls from Alfresco Content Services extensions out there in the community.
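On the Lambda side, a proxy-integration handler is all that’s needed.  This sketch assumes APS POSTs a JSON body to the gateway URL via a REST call task:

exports.handler = function (event, context, callback) {
    var payload = JSON.parse(event.body || '{}');  // proxy integration wraps the raw request
    console.log('Received from Alfresco:', payload);
    callback(null, {
        statusCode: 200,  // API Gateway proxy responses need this shape
        body: JSON.stringify({ ok: true })
    });
};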

SNS

AWS Simple Notification Service provides another scalable option for calling Lambda functions.  Like the API gateway option, you could use a REST call task in Alfresco Process Services, or a bit of custom code to make the call from Alfresco Content Services.  AWS SNS supports a simple API for publishing messages to a topic, which can then be used to trigger your Lambda function.
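The publish itself is nearly a one-liner with the AWS SDK; this sketch uses a placeholder topic ARN and an invented message payload:

var AWS = require('aws-sdk');
var sns = new AWS.SNS();

sns.publish({
    TopicArn: 'arn:aws:sns:us-east-1:123456789012:alfresco-events',  // placeholder topic
    Message: JSON.stringify({ nodeId: 'workspace://SpacesStore/abc123', action: 'archive' })
}, function (err, data) {
    if (err) console.error(err);
    else console.log('Published message ' + data.MessageId);
});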

There are quite a few ways to both use Alfresco Process and Content services from Lambda functions, as well as to use Lambda functions to enhance your investment in Alfresco technologies.  I plan to do a little spike to explore this further; stay tuned for findings and code samples!