Rethinking the Migration Pipeline with AWS


Traditionally, migrating content to an Alfresco instance has usually looked very much like an ETL process.  First content, metadata, and structure is extracted from a source system (be it a shared drive, some legacy ECM system, etc), that data may then be transformed in some way, and then it gets imported into Alfresco.  Using the OOTB tools the last step is typically accomplished via the BFSIT (Bulk Filesystem Import Tool).  This approach has a lot of advantages.  It’s well understood, growing from an incredibly common model for database migrations and BI activities.  An ETL-like migration approach provides plenty of opportunities to fix data quality issues while the data is in a “staged” state.  Properly optimized, this approach can be very fast.  In the best cases it only requires re-linking existing content objects into the repository and creating the associated metadata, no in-flight copying of large content files required.  For many migrations, this approach makes perfect sense.  In the on-premises world, even more so.

The ETL approach does have some downsides.  It’s not great at dealing with a repository that is actively being used, requiring delta migrations later.  It can be a hard thing to scale up (faster per load) and out (more loads running in parallel), especially on-premises where companies are understandably reluctant to add more hardware for what should be a short term need.  It’s typically batch driven, and an import failure of one or more nodes in the batch can really slow things down.

A Better Way?

Alfresco on AWS opens up some new ways of looking at migrations.  Recently I was catching up on some reading and had the occasion to revisit a pair of articles by Gavin Cornwell.  In these two articles Gavin lays out some potential options for using AWS AI services such as Comprehend and Rekognition with Alfresco.  Specifically, he takes a look at AWS Lambda and Step Functions.  Reading that article got me thinking about ingestion pipelines, which in turn got me thinking about system migrations.  We do a LOT of these in Alfresco consulting, and we rely on a toolbox of Alfresco products, our own internal tools, and excellent solutions that are brought to the party by Alfresco partners.

The question that I kept circling around is this:  Does a move to AWS change the patterns that apply to migration?  I think it does.  Full disclosure, this is still very much a work in progress, probably deviates from best practices in a near infinite number of ways and is a personal science project at this point.

AWS offers some opportunities to rethink how to approach a migration.  Perhaps the biggest opportunity comes from the ability to think of the migration as a scalable, event-driven process.  Instead of pushing all of the content up to S3 and then importing or linking it into the repository using an in-process import tool or some external tooling that processes the list of content that exists in S3, it is possible to use the events emitted by writing to S3 itself to take care of validating, importing and performing post-processing on the content.  Content items are written to S3, which in turn triggers a per-item process that performs the steps required to get the content into the repository.  This is a perfect fit for the serverless compute paradigm provided by AWS.

Consider the following Step Functions State Machine diagram:

Screen Shot 2018-05-09 at 11.52.03 AM

For each item written to S3, (presumably, but not necessarily, a content item and a paired object that represents the metadata) S3 can emit a CloudWatch event.  This event matches a rule, which in turn starts a Step Function State Machine.  The state machine neatly encapsulates all of the steps required to validate the content and metadata, import the content into the repository and apply the validated metadata, and then optionally perform post processing such as image recognition, natural language processing, generation of thumbnails, etc.  Each step in the process maps to an AWS Lambda function which encapsulates one granular part of the validation and import process.  Most of the heavy lifting is done outside of Alfresco, which reduces the need to scale up the cluster during the import.  It should be possible to run a huge number of these step functions in parallel, speeding up the overall migration process while only paying for the compute consumed by the migration itself.  If anything goes wrong up to the point where the import of this specific item is complete it can be rolled back.

Perhaps the best thing about this approach is that it is so easy to adapt.  No two migrations are identical, even when the source and target platforms are common, so flexibility is key.  Flexibility here not only means flexibility during initial design, but also adjustments to the migration to fix issues discovered in flight or support for more than one import pipeline.  Need to change the way validation is done?  Modify the validation Lambda function.  Don’t need image or text classification, or want to use a different engine?  Remove or change the relevant functions.  Need to decorate imported docs with metadata pulled from another LOB system?  Insert a new step into the process.  It’s easy to change.

Over the next few articles we’ll dive deeper into exactly how this could work in detail, and build out some example Lambda functions to fill in the process and make it do real work.