Rethinking the Migration Pipeline with AWS

Migration

Traditionally, migrating content to an Alfresco instance has looked very much like an ETL process.  First, content, metadata, and structure are extracted from a source system (be it a shared drive, some legacy ECM system, etc.).  That data may then be transformed in some way, and finally it gets imported into Alfresco.  Using the OOTB tools, the last step is typically accomplished via the BFSIT (Bulk Filesystem Import Tool).  This approach has a lot of advantages.  It’s well understood, growing from an incredibly common model for database migrations and BI activities.  An ETL-like migration approach provides plenty of opportunities to fix data quality issues while the data is in a “staged” state.  Properly optimized, this approach can be very fast.  In the best cases it only requires re-linking existing content objects into the repository and creating the associated metadata, with no in-flight copying of large content files required.  For many migrations, this approach makes perfect sense.  In the on-premises world, even more so.

The ETL approach does have some downsides.  It’s not great at dealing with a repository that is actively being used, requiring delta migrations later.  It can be a hard thing to scale up (faster per load) and out (more loads running in parallel), especially on-premises where companies are understandably reluctant to add more hardware for what should be a short-term need.  It’s typically batch-driven, and an import failure of one or more nodes in the batch can really slow things down.

A Better Way?

Alfresco on AWS opens up some new ways of looking at migrations.  Recently I was catching up on some reading and had the occasion to revisit a pair of articles by Gavin Cornwell.  In these two articles Gavin lays out some potential options for using AWS AI services such as Comprehend and Rekognition with Alfresco.  Specifically, he takes a look at AWS Lambda and Step Functions.  Reading those articles got me thinking about ingestion pipelines, which in turn got me thinking about system migrations.  We do a LOT of these in Alfresco consulting, and we rely on a toolbox of Alfresco products, our own internal tools, and excellent solutions that are brought to the party by Alfresco partners.

The question that I kept circling around is this:  Does a move to AWS change the patterns that apply to migration?  I think it does.  Full disclosure:  this is still very much a work in progress, probably deviates from best practices in a near-infinite number of ways, and is a personal science project at this point.

AWS offers some opportunities to rethink how to approach a migration.  Perhaps the biggest opportunity comes from the ability to think of the migration as a scalable, event-driven process.  Instead of pushing all of the content up to S3 and then importing or linking it into the repository using an in-process import tool or some external tooling that processes the list of content that exists in S3, it is possible to use the events emitted by writing to S3 itself to take care of validating, importing and performing post-processing on the content.  Content items are written to S3, which in turn triggers a per-item process that performs the steps required to get the content into the repository.  This is a perfect fit for the serverless compute paradigm provided by AWS.
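To make the per-item idea concrete, here is a minimal Python sketch of a dispatcher that starts one Step Functions execution for each object named in an S3 event.  It assumes the event arrives as a standard S3 notification delivered to a small Lambda function, which is only one of several ways to wire this up.  The function names and the payload fields are my own illustrative choices, and `sfn_client` is expected to behave like `boto3.client("stepfunctions")`.

```python
import json
from urllib.parse import unquote_plus


def build_execution_input(record):
    """Turn one S3 event record into the input for a per-item import run."""
    s3 = record["s3"]
    return {
        "bucket": s3["bucket"]["name"],
        # Object keys arrive URL-encoded in S3 event notifications.
        "key": unquote_plus(s3["object"]["key"]),
        "size": s3["object"].get("size", 0),
    }


def start_imports(event, sfn_client, state_machine_arn):
    """Start one state machine execution per object in the S3 event.

    sfn_client is assumed to behave like boto3.client("stepfunctions");
    it is passed in so the function can be exercised without AWS access.
    """
    started = []
    for record in event.get("Records", []):
        payload = build_execution_input(record)
        response = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(payload),
        )
        started.append(response["executionArn"])
    return started
```

Passing the client in rather than creating it inside the function keeps the sketch testable with a stub, and keeps the choice of region and credentials out of the import logic.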

Consider the following Step Functions State Machine diagram:

[Figure: Step Functions state machine diagram for the per-item import process]

For each item written to S3 (presumably, but not necessarily, a content item and a paired object that represents the metadata), S3 can emit a CloudWatch event.  This event matches a rule, which in turn starts a Step Functions state machine.  The state machine neatly encapsulates all of the steps required to validate the content and metadata, import the content into the repository and apply the validated metadata, and then optionally perform post-processing such as image recognition, natural language processing, generation of thumbnails, etc.  Each step in the process maps to an AWS Lambda function which encapsulates one granular part of the validation and import process.  Most of the heavy lifting is done outside of Alfresco, which reduces the need to scale up the cluster during the import.  It should be possible to run a huge number of these state machine executions in parallel, speeding up the overall migration process while only paying for the compute consumed by the migration itself.  If anything goes wrong up to the point where the import of this specific item is complete, it can be rolled back.
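As a rough illustration of the shape such a state machine could take, the Python sketch below builds an Amazon States Language definition with validate, import and post-processing tasks, plus a rollback path for failures that happen before the import completes.  The state names, the `Rollback` step and the Lambda ARNs are all hypothetical.  This is not the definition behind the diagram, just one plausible reading of it.

```python
import json


def build_import_state_machine(validate_arn, import_arn, post_process_arn, rollback_arn):
    """Return an Amazon States Language definition for a per-item import.

    All state names and ARNs are illustrative assumptions.
    """
    # Any error in a guarded step is recorded under $.error and routed to Rollback.
    catch_all = [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "Rollback"}]
    return {
        "Comment": "Per-item migration pipeline (illustrative sketch)",
        "StartAt": "Validate",
        "States": {
            "Validate": {"Type": "Task", "Resource": validate_arn,
                         "Catch": catch_all, "Next": "Import"},
            "Import": {"Type": "Task", "Resource": import_arn,
                       "Catch": catch_all, "Next": "PostProcess"},
            # Post-processing runs after the import is complete, so it is not rolled back.
            "PostProcess": {"Type": "Task", "Resource": post_process_arn, "End": True},
            "Rollback": {"Type": "Task", "Resource": rollback_arn, "Next": "ImportFailed"},
            "ImportFailed": {"Type": "Fail", "Error": "ImportRolledBack",
                             "Cause": "Validation or import failed and the item was rolled back"},
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; substitute the ARNs of real Lambda functions.
    prefix = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"
    definition = build_import_state_machine(
        prefix + "validate", prefix + "import", prefix + "post-process", prefix + "rollback")
    print(json.dumps(definition, indent=2))
```

Keeping the catch rules on the validate and import steps only mirrors the idea that an item can be rolled back up to the point where its import is complete.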

Perhaps the best thing about this approach is that it is so easy to adapt.  No two migrations are identical, even when the source and target platforms are common, so flexibility is key.  Flexibility here not only means flexibility during initial design, but also adjustments to the migration to fix issues discovered in flight or support for more than one import pipeline.  Need to change the way validation is done?  Modify the validation Lambda function.  Don’t need image or text classification, or want to use a different engine?  Remove or change the relevant functions.  Need to decorate imported docs with metadata pulled from another LOB system?  Insert a new step into the process.  It’s easy to change.
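To show how small such a change can be, here is a hedged Python sketch of a helper that splices a new task into an existing state machine definition.  The helper name `insert_step_after`, the `EnrichFromLOB` state and the ARNs are hypothetical, and the definition is assumed to be a plain dictionary in Amazon States Language form.

```python
import copy


def insert_step_after(definition, existing, new_name, new_state):
    """Return a copy of the definition with new_state spliced in after `existing`."""
    updated = copy.deepcopy(definition)
    states = updated["States"]
    previous = states[existing]
    step = dict(new_state)
    if "Next" in previous:
        # The new step takes over the old transition.
        step["Next"] = previous["Next"]
    else:
        # The existing state was terminal, so the new one becomes terminal instead.
        previous.pop("End", None)
        step["End"] = True
    previous["Next"] = new_name
    states[new_name] = step
    return updated


# Illustrative two-step pipeline with placeholder ARNs.
pipeline = {
    "StartAt": "Validate",
    "States": {
        "Validate": {"Type": "Task", "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:validate",
                     "Next": "Import"},
        "Import": {"Type": "Task", "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:import",
                   "End": True},
    },
}

# Decorate documents with metadata from another system before importing them.
enriched = insert_step_after(
    pipeline, "Validate", "EnrichFromLOB",
    {"Type": "Task", "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:enrich"})
```

The original definition is left untouched, so both pipelines can exist side by side if a migration needs more than one.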

Over the next few articles we’ll dive deeper into exactly how this could work in detail, and build out some example Lambda functions to fill in the process and make it do real work.

AWS Lambda and Alfresco – Connecting Serverless to Content and Process


Let’s start with a rant.  I don’t like the term “Serverless” to describe Lambda or other function-as-a-service platforms.  Yeah, OK, so you don’t need to spin up servers, or worry about EC2 instances, or any of that stuff.  Great.  But it still runs on a server of some sort.  Even nascent efforts to extend “Serverless” to edge devices still have something that could be called a server, the device itself.  If it provides a service, it’s a server.  It’s like that Salesforce “No Software” campaign.  What?  It’s freaking software, no matter what some marketing campaign says.  It looks like the name is going to stick, so I’ll use it, but if you wonder why I look like I’ve just bitten into a garden slug every time I say “Serverless”, that’s why.

Naming aside, there’s no doubt this is a powerful paradigm for writing, running and managing code.  For one, it’s simple.  It takes away all the pain of the lower levels of the stack and gives devs a superbly clean and easy environment.  It is (or should be) scalable.  It is (or should be) performant.  The appeal is easy to see.  Like most areas that AWS colonizes, Lambda seems to be the front runner in this space.

You know what else runs well in AWS?  Alfresco Content and Process Services.

Lambda -> Alfresco Content / Process Services

It should be fairly simple to call Alfresco Content or Process Services from AWS Lambda.  Lambda supports several execution environments, all of which support calling an external URL.  If you have an Alfresco instance running on or otherwise reachable from AWS, you can call it from Lambda.  This does, however, require you to write all of the supporting code to make the calls.  One Lambda execution environment is Node.js, which probably presents us with the easiest way to get Lambda talking to Alfresco.  Alfresco has a recently released JavaScript client API which supports connections to both Alfresco Content Services and Alfresco Process Services.  This client API requires at least Node.js 5.x.  Lambda supported 6.10 at the time this article was written, so no problem there!
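For a sense of what the "write all of the supporting code" route looks like, here is a minimal sketch in Python rather than Node.js.  It calls the Alfresco Content Services public REST API directly with the standard library instead of using the JavaScript client.  The environment variable names, the `nodeId` event field and the use of basic authentication are assumptions for illustration only.

```python
import base64
import json
import os
import urllib.request


def build_node_request(base_url, node_id, username, password):
    """Build a GET request for one node's metadata via the ACS public REST API."""
    url = "%s/alfresco/api/-default-/public/alfresco/versions/1/nodes/%s" % (
        base_url.rstrip("/"), node_id)
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    request.add_header("Accept", "application/json")
    return request


def handler(event, context):
    """Lambda entry point: look up the node named in the event and return its name."""
    # ALFRESCO_URL, ALFRESCO_USER and ALFRESCO_PASSWORD are hypothetical settings.
    request = build_node_request(
        os.environ["ALFRESCO_URL"],
        event["nodeId"],
        os.environ["ALFRESCO_USER"],
        os.environ["ALFRESCO_PASSWORD"],
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        entry = json.load(response)["entry"]
    return {"id": entry["id"], "name": entry["name"]}
```

In anything beyond a spike, the credentials would belong in a secrets store rather than plain environment variables.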

Alfresco Content / Process Services -> Lambda

While it’s incredibly useful to be able to call Alfresco services from Lambda, what about triggering Lambda functions from the Alfresco Digital Business Platform?  That part is also possible; exactly how to do it depends on what you want to do.  Lambda supports many ways to invoke a function, some of which may be helpful to us.

S3 bucket events

AWS Lambda functions can be triggered in response to S3 events such as object creation or deletion.  The case that AWS outlines on their web site is a situation where you might want to generate a thumbnail, which Alfresco already handles quite nicely, but it’s not hard to come up with others.  We might want to do something when a node is deleted from the S3 bucket by Alfresco.  For example, this could be used to trigger a notification that content cleanup was successful or to write out an additional audit entry to another system.  Since most Alfresco DBP deployments in AWS use S3 as a backing store, this is an option available to most AWS Alfresco Content or Process Services instances.
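A minimal Python sketch of the deletion case might look like the following.  It assumes the bucket is configured to send object-removal notifications to the function.  The `print` call stands in for whatever notification or audit write you actually need, and the function names are my own.

```python
from urllib.parse import unquote_plus


def deleted_objects(event):
    """Return (bucket, key) pairs for every removal record in an S3 event."""
    removed = []
    for record in event.get("Records", []):
        # Removal events are named like "ObjectRemoved:Delete".
        if record.get("eventName", "").startswith("ObjectRemoved:"):
            s3 = record["s3"]
            # Object keys arrive URL-encoded in the notification.
            removed.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return removed


def handler(event, context):
    """Lambda entry point: report each object removed from the content store."""
    removed = deleted_objects(event)
    for bucket, key in removed:
        # Placeholder side effect; a real function might write an audit entry here.
        print("content removed: s3://%s/%s" % (bucket, key))
    return {"removed": len(removed)}
```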

Simple Email Service

Another way to trigger a Lambda function is via the AWS Simple Email Service.  SES is probably more commonly used to send emails, but it can also receive them.  SES can invoke your Lambda function and pass it the email it received.  Sending email can already easily be done from both an Alfresco Process Services BPMN task and from an Alfresco Content Services Action, so this could be an easy way to trigger a Lambda function using existing functionality in response to a workflow event or something occurring in the content repository.
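Here is a hedged Python sketch of the receiving side.  As far as I know, the event SES hands to Lambda carries the message headers and receipt metadata rather than the full body, so this example keys off the subject line.  The `alfresco-event:` subject convention and the field names in the returned dictionaries are invented for illustration.

```python
# Hypothetical convention: Alfresco sends mail with subjects such as
# "alfresco-event:archive:<nodeId>".
SUBJECT_PREFIX = "alfresco-event:"


def parse_triggers(event):
    """Extract action/target pairs from the subject lines in an SES event."""
    triggers = []
    for record in event.get("Records", []):
        mail = record["ses"]["mail"]
        subject = mail.get("commonHeaders", {}).get("subject", "")
        if subject.startswith(SUBJECT_PREFIX):
            action, _, target = subject[len(SUBJECT_PREFIX):].partition(":")
            triggers.append({"action": action, "target": target, "sender": mail.get("source")})
    return triggers


def handler(event, context):
    """Lambda entry point: act on each recognized trigger email."""
    triggers = parse_triggers(event)
    for trigger in triggers:
        # Placeholder for the real work that the email is meant to kick off.
        print("received %(action)s for %(target)s from %(sender)s" % trigger)
    return {"handled": len(triggers)}
```

Checking the sender against an allow list would be a sensible addition, since anything that can email the address can otherwise trigger the function.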

Scheduled Events

AWS CloudWatch Events provides a scheduled event capability.  Scheduled events are configured using either a fixed rate or a cron expression, and use a rule target definition to define which Lambda function to call.  A scheduled event isn’t really a way to call Lambda functions from ACS or APS, but it could prove to be very useful for regular cleanup events, archiving or other recurring tasks you wish to run against your Alfresco Content Services instances in AWS.  It also gives you a way to trigger things to happen in your APS workflows on a schedule, but that case is probably better handled in the process definition itself.
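As a small illustration, the Python sketch below handles a scheduled event and works out a cutoff date for a recurring cleanup job.  It assumes a rule with a schedule expression such as `rate(1 day)` or `cron(0 2 * * ? *)` targeting the function.  The 30-day retention period and the function names are hypothetical, and the actual cleanup call against Alfresco is left as a placeholder.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention period for the cleanup job.
RETENTION_DAYS = 30


def cleanup_cutoff(event, retention_days=RETENTION_DAYS):
    """Compute the cutoff timestamp relative to when the schedule fired."""
    fired_at = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    return fired_at.replace(tzinfo=timezone.utc) - timedelta(days=retention_days)


def handler(event, context):
    """Lambda entry point for a scheduled rule."""
    if event.get("detail-type") != "Scheduled Event":
        raise ValueError("expected a scheduled event")
    cutoff = cleanup_cutoff(event)
    # Placeholder: a real function would now ask Alfresco for content older
    # than the cutoff and archive or remove it.
    return {"cutoff": cutoff.isoformat()}
```

Deriving the cutoff from the event's own timestamp, rather than the wall clock, makes a retried or delayed invocation behave the same as the original one.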

API Gateway

Our last two options would require a little work, but may turn out to be the best for most use cases.  Using an API Gateway you can define URLs that can be used to directly trigger your Lambda functions.  Triggering these from Alfresco Process Services is simple:  just use a REST call task to make the call.  Doing so from Alfresco Content Services is a bit trickier, requiring either a custom action or a behavior that makes the call out to the API Gateway and passes it the info your Lambda function needs to do its job.  Still fairly straightforward, and there are lots of good examples of making HTTP calls from Alfresco Content Services extensions out there in the community.
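On the Lambda side, a function behind API Gateway could be sketched in Python roughly as follows.  This assumes the Lambda proxy integration, where the request body arrives as a string and the function returns a status code and body.  The `nodeId` field in the payload is a hypothetical contract between the Alfresco caller and the function.

```python
import json


def _response(status, payload):
    """Build a response in the shape the Lambda proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context):
    """Lambda entry point for a POST from an APS REST call task or an ACS action."""
    try:
        body = json.loads(event.get("body") or "{}")
    except ValueError:
        return _response(400, {"error": "request body must be JSON"})
    if not isinstance(body, dict) or not body.get("nodeId"):
        return _response(400, {"error": "nodeId is required"})
    # Placeholder: do the real work for this node here.
    return _response(202, {"accepted": body["nodeId"]})
```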

SNS

AWS Simple Notification Service provides another scalable option for calling Lambda functions.  Like the API Gateway option, you could use a REST call task in Alfresco Process Services, or a bit of custom code to make the call from Alfresco Content Services.  AWS SNS supports a simple API for publishing messages to a topic, which can then be used to trigger your Lambda function.
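The receiving end of the SNS route could look something like this Python sketch, which unpacks the messages from an SNS-triggered invocation.  It assumes publishers send JSON message bodies, falling back to the raw text otherwise.  The output field names are illustrative.

```python
import json


def messages_from_sns(event):
    """Decode the messages carried by an SNS-triggered Lambda event."""
    messages = []
    for record in event.get("Records", []):
        sns = record["Sns"]
        try:
            payload = json.loads(sns["Message"])
        except ValueError:
            # Not JSON, so keep the original text instead of failing.
            payload = {"raw": sns["Message"]}
        messages.append({
            "topic": sns.get("TopicArn"),
            "subject": sns.get("Subject"),
            "payload": payload,
        })
    return messages


def handler(event, context):
    """Lambda entry point: process each message published to the topic."""
    messages = messages_from_sns(event)
    for message in messages:
        # Placeholder for the real work triggered by the message.
        print("message on %s: %s" % (message["topic"], message["payload"]))
    return {"processed": len(messages)}
```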

There are quite a few ways to use Alfresco Process and Content Services from Lambda functions, as well as to use Lambda functions to enhance your investment in Alfresco technologies.  I plan to do a little spike to explore this further.  Stay tuned for findings and code samples!