Importing Legacy CSV Data into Elasticsearch


I use Salesforce at work quite a bit, and one of the things I find endlessly frustrating about it is the lack of good reporting functions.  Often I end up just dumping all the data I need into a CSV file and opening it up in Excel to build the reports I need.  Recently I was trying to run some analysis on cases and case events over time.  As usual, Salesforce “reports” (which are really little more than filtered lists with some limited predefined object joins) were falling well short of what I needed.  I’ve been playing around with Elasticsearch for some other purposes, such as graphing and analyzing air quality measurements taken over time.  I’ve also seen people on my team at Alfresco use Elasticsearch for some content analytics work with Logstash.  Elasticsearch and Kibana lend themselves well to analyzing the kind of time-series data that I was working with in Salesforce.

The Data

Salesforce reporting makes it simple to export your data as a CSV file.  It’s a bit “lowest common denominator”, but it will work for my purposes.  What I’m dumping into that CSV is a series of support case related events, such as state transitions, comments, etc.  What I want out of it is the ability to slice and dice that data in a number of ways to analyze how customers and support reps are interacting with each other.  How to get that CSV data into Elasticsearch?  Logstash.  Logstash has a filter plugin that simplifies sucking CSV data into the index (because of course it does).

The Configuration

Importing CSV data to Elasticsearch using Logstash is pretty straightforward.  To do this, we need a configuration file for Logstash that defines where the input data comes from, how to filter it, and where to send it.  The example below is a version of the config file that I used.  It assumes that the output will be an Elasticsearch instance running locally on port 9200, and will stash the data in an index named “supportdata”.  It will also output the data to stdout for debugging purposes.  Not recommended for production if you have a huge volume, but for my case it’s handy to see.  The filter section contains the list of columns that will be imported.  Using filter options you can get some fine-grained control over this behavior.

input {
  file {
    path => ["/path/to/my/file.csv"]
    start_position => "beginning"
  }
}

filter {
  csv {
    # List your CSV column names here, in order; these are example names only:
    columns => ["CaseNumber", "Status", "CreatedDate"]
  }
}

output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "supportdata"
  }
  # Echo each event to stdout for debugging; drop this for large volumes
  stdout {}
}

No project goes perfectly right the first time, and this was no exception.  I use a Mac for work, and when I first tried to get Logstash to import the data it would run, but nothing would show up in the index.  I turned on debugging to see what was happening, and saw the following output:

[DEBUG][logstash.inputs.file ] each: file grew:/path/to/my/file.csv: old size 0, new size 4674844
[DEBUG][logstash.inputs.file ] each: file grew:/path/to/my/file.csv: old size 0, new size 4674844
[DEBUG][logstash.inputs.file ] each: file grew:/path/to/my/file.csv: old size 0, new size 4674844
[DEBUG][logstash.inputs.file ] each: file grew:/path/to/my/file.csv: old size 0, new size 4674844
[DEBUG][logstash.pipeline ] Pushing flush onto pipeline

This block just repeated over and over again.  So what’s going wrong?  Obviously Logstash can see the file and can read it.  It is properly picking up the fact that the file has changed, but it isn’t picking up the CSV entries and moving them into Elasticsearch.  Turns out that Logstash is sensitive to line ending characters: the file input splits events on Unix-style newlines, and my export evidently used legacy Mac-style carriage returns, so Logstash never saw a complete line.  Simply opening the CSV in TextWrangler and saving it with Unix line endings fixed the problem.
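Rather than round-tripping through a text editor every time, the line-ending fix can be scripted.  Here is a minimal Python sketch of the idea; the function names are my own and the path is a placeholder:

```python
def normalize_line_endings(text: str) -> str:
    """Convert Windows (CRLF) and classic Mac (CR) line endings to Unix (LF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def normalize_file(path: str) -> None:
    """Rewrite a file in place with Unix line endings."""
    # newline="" stops Python from translating endings on read/write
    with open(path, newline="") as f:
        content = f.read()
    with open(path, "w", newline="") as f:
        f.write(normalize_line_endings(content))
```

Running something like normalize_file("/path/to/my/file.csv") before starting Logstash avoids the silent stall entirely.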

Now that I can easily get my CSV formatted event data into Elasticsearch, the next step is to automate all of this so that I can just run my analysis without having to deal with manually exporting the report.  It looks like this is possible via the Salesforce Reports and Dashboards REST API.  I’m just getting my head around this particular Salesforce API, and at first glance it looks like there is a better way to do this than with CSV data.  I’m also looking into TreasureData as an option, since it appears to support pulling data from Salesforce and pushing it into Elasticsearch.  As that work progresses I’ll be sure to share whatever shakes out of it!


Open Source is the Surest and Shortest Path to Digital Transformation

Back in 2013, Mike Olson, a co-founder of Cloudera, famously stated that “No dominant platform-level software infrastructure has emerged in the last 10 years in closed-source, proprietary form.”  He’s absolutely right about that.  John Newton underscored this theme at a recent Alfresco Day event in NYC.  He shared this slide as a part of his presentation, which I think does a great job showing how much our modern platforms depend on the open source ecosystem:


Platforms are more open today than they have ever been, with a few exceptions (I’m glaring at my iPhone in annoyance as I write this).  Quite a few companies seem to have figured out the secret sauce of blending open platforms with proprietary value-adds to create robust, open ecosystems AND be able to pay the bills in the process.  This is very good news for you if you are pursuing a digital transformation strategy.

Why open source and open standards?

The advantages of open source are pretty well established at this point.  Open projects are more developer friendly.  They innovate faster.  They can fork and merge and rapidly change direction if the community needs that to happen (although there are good and bad forks).  Open has become the de facto way that the digital business works today.  I’d challenge you to find any team within your organization that isn’t using at least one open source project or library.  Open has won.  That’s the first big advantage of open source in digital transformation today:  It’s ubiquitous.  You can find a platform or component to fill just about any need you have.

Open is also faster to try, and removes a lot of friction when testing out a new idea.  Effective digital transformation relies on speed and agility.  It’s a huge advantage to simply pull down a build of an open source technology you want to try out, stand it up and get to work.  That allows bad ideas to fail fast, and good ideas to flourish immediately.  Since testing ideas out is effectively free in terms of dollar cost, and cheap in terms of time and cognitive investment, I think we also tend to be more willing to throw it out and start over if we don’t get the results we want.  That’s a good thing as it ultimately leads to less time spent trying to find a bigger hammer to slam that square peg into a round hole.  If you decide to go forward with a particular technology, you’ll find commercial organizations standing behind it with support and value-added components that can take an open source core to the next level.

If digital transformation relies on speed of innovation, then open technologies look even more appealing.  Why do open source projects tend to out-innovate their purely proprietary competitors?  There are probably a lot of reasons.  An open project isn’t limited to contributors from only one company.  Great ideas can come from anywhere and often do.  At their best, large open source projects function as meritocracies.  This is especially true of foundational platform technologies that may have originated at, or receive contributions from, tech leaders.  These are the same technologies that can power your own digital transformation.

Open source projects also make the pace of innovation easier to manage, since you get full transparency into what has changed from version to version and visibility into the future direction of the project.  Looking at pending pull requests or commits between releases gives you a view into what is evolving in the project so that you can plan accordingly.  In a very real sense, pursuing a digital transformation strategy using open technologies forces you to adopt a modular, swappable, services-driven approach.  Replacing a monolithic application stack every cycle is not possible, but replacing or upgrading individual service components in a larger architecture is, and open source makes that easier.

Software eats the world, and is a voracious cannibal

There is a downside to this pace of change, however.  Because open source projects innovate so quickly, and because the bar to creating one is so low, we often see exciting projects disrupted before they can really deliver on their promise.  Just when the people responsible for architecture and strategy start to fully understand how to exploit a new technology, the hype cycle begins on something that will supersede it.  Nowhere is this as bad as it is in the world of JavaScript frameworks where every week something new and shiny and loud is vying for developers’ limited time and attention.  Big data suffers from the same problem.  A few years ago I would have lumped NoSQL (I hate that term, but it has stuck) databases into that category as well, but the sector seems to have settled down a little bit.

There is also a risk that an open source technology will lose its way, lose its user and developer base and become abandonware.  Look for those projects that have staying power.  Broad user adoption, frequent commits and active discussions are all good signs.  Projects that are part of a well established organization like the Apache Software Foundation are also usually a good bet.  Apache has a rigorous process that a project must follow to become a full-blown project, and this drives a level of discipline that independent projects may lack.  A healthy company standing behind the project is another good sign, as that means there are people out there with financial interest in the project’s success.

Simply using open source projects to build your platform for transformation is no guarantee of success, but I would argue that carefully selecting the right open components does tilt the odds in your favor.

You Cannot Succeed at Digital Transformation Without Planning for Scale

TL;DR:  Digital transformation == scale, just by its nature.

Digital transformation affects all areas of a business, from the way leadership thinks of opportunities to the way developers build applications, and it carries challenges throughout that chain.  One of the biggest challenges for IT will come from achieving scale, often in unexpected places.

Why does digital transformation automatically mean scale?

Looking back at my last post on the journey to digital transformation, there are a few points where it should be obvious that you should be prepared to scale up.  In the digitization phase, for example, it makes sense to plan for managing a large amount of content and metadata.  Whether you are migrating from a legacy system, consolidating multiple repositories or ingesting a bunch of paper, your target repository will need to be ready to handle not only what you are bringing in today but what you plan to create and manage for the next several years.  Deploying in the cloud eases this burden significantly, freeing you from having to provision a ton of storage or DB capacity that will sit idle until it is used, or from adding capacity to an on-premises solution.

Digitalization also drives the need to scale up.  Processes that were once done completely manually now get done via software.  Along the way a ton of useful information is captured.  Not only the information required to complete the process, such as attached documents, form data and assigned users, but also metadata about the process itself.  How long did it take for a specific step to be completed?  Was it reassigned?  To whom and how often?  What is the current active task?  How many instances of each process are in flight?  All of this data is collected in a process management system.  The more processes move from manual to automated or managed, the bigger this pool becomes.  The net effect is an explosion in the amount of data that needs to be handled.

If digitization of content and digitalization of process lead to the need to scale, achieving digital transformation takes the problem and dials it up to 11.  Digital transformation will flip things around and turn more people that were previously consumers of information into producers, whether they realize it or not.  An employee working with a digital process may be handling information that looks much like it did before transformation, but behind the scenes far more data is being created.

Alfresco’s platform is built for this kind of scale in both content and process.  It is built on proven, scalable and performant open source technologies, and has been deployed by thousands of customers around the globe in support of large, business critical applications.  Alfresco provides guidance in several areas to help you size your deployment, build in the cloud, and make smart decisions about how and when to scale.

What about those unexpected areas of scale?

Scaling your content and process platform as a part of a digital transformation strategy is expected from day one, and should be part of the roll out and maintenance plan built before the first application goes live.  It may start with scaling content and process technology, but it does not end there.  Let’s look at some common drivers of digital transformation.  A few days spent reading a lot of articles, literature and opinion on digital transformation yields a wealth of reasons why companies might pursue it:

  • Improve the customer experience and become more customer centric
  • Get leaner, meaner and more efficient
  • Make better business decisions
  • Respond to an increased pace of technological change
  • Address new competitive threats or market opportunities
  • Meet the demand for real-time information and insights

Take a look at that list.  Achieving any of these things will require more information to be captured and analyzed.  Getting more customer centric means understanding what your customers need, where they are dissatisfied with the current experiences you offer them and what you can do to improve them.  Becoming leaner and more efficient requires detailed metrics about processes so waste or delays can be identified and trimmed.  Better business decisions and real time information mean drinking from a firehose of data from across the business.

Achieving digital transformation requires you to plan for scale, and not only where you might expect it.  It doesn’t just require you to plan for more content and more processes, but also for how to handle the data about those things that you will need to capture and analyze.

The non-technical side of scale

Ultimately digital transformation is not a technology problem, it’s a business problem.  It is unsurprising, then, that we’ll be faced with unexpected challenges as we scale up that have nothing to do with technology.  Take, for example, support.  If your digital transformation rests on open APIs provided by a stack of homegrown, cloud and vendor provided services, how do you route support tickets?  If a user reports a problem, how can you narrow down the source and make sure that it gets handled by the right team?  A support team can be quickly overwhelmed if they need to sift through a dozen irrelevant error reports to find the ones that they can actually address.  The more services you rely on, the harder this problem becomes.  This is where detailed monitoring of the service layer becomes important.  Guess what that creates?  More data.  If you are using Alfresco technologies for content, process and governance, you have several options for keeping tabs on your services.

There are other non-technical areas that will be affected by scale as well.  Documentation and discovery, for example.  As the number of services rolled out in support of transformation increases, developers and business users alike need easy ways to find these services, and to understand how they work.  This in itself becomes another service.  Change management is another area that a business needs to be prepared to scale up.  Digital transformation increases the pace of change in an organization by enabling more rapid response to changing business conditions or new opportunities.  Without a solid framework in place to evaluate, decide on, execute and communicate change, digital transformation will have a hard time getting the traction it needs.

If you take away one thing from this and other articles about digital transformation, it’s this:  Achieving transformation requires scale from your systems, processes and people and not just those that deal directly with technology.  Don’t underestimate it and find yourself with a plan that cracks under the weight of change.

Digital Transformation and the Role of Content and Process

I recently had the opportunity to go to NYC and speak on architecting systems for digital transformation.  It was a terrific day with our customers and prospects, as well as the extended Alfresco crew and many of our outstanding partners.  This was the first time I’ve delivered this particular talk so I probably stumbled over my words a few times as I was watching the audience reaction as much as the speaker notes.  One of the things I always note when speaking is which slides prompt people to whip out their phones and take a picture.  That’s an obvious signal that you are hitting on something that matters to them in a way that caught their attention, so they want to capture it.  In my last talk, one of the slides that got the most attention and set the cameras snapping pictures was this one:



Digitization of content is the whole reason ECM as a discipline exists in the first place.  In the early days of ECM a lot of use cases centered around scanning paper as images and attaching metadata so it could be easily found later.  The drivers for this are obvious.  Paper is expensive, it takes up space (also expensive), it is hard to search, deliver and copy and you can’t analyze it en masse to extract insights.  As ECM matured, we started handling more advanced digital content such as PDFs and Office documents, videos, audio and other data, and we started to manipulate them in new ways.  Transforming between formats, for example, or using form fields and field extraction.  OCR also played a role, taking those old document image files and breathing new life into them as first-class digital citizens.  We layered search and analytics on top of this digital content to make sense of it all and find what we needed as the size of repositories grew ever larger.  Digitization accelerated with the adoption of the web.  That paper form that your business relied on was replaced by a web form or app, infinitely malleable.  That legacy physical media library was transformed into a searchable, streamable platform.

What all of these things have in common is that they are centered around content.


Simply digitizing content confers huge advantages on an organization.  It drives down costs, makes information easier to find and allows aggregation and analysis in ways that legacy analog formats never could.  While all of those things are important, they only begin to unlock the potential of digital technologies.  Digitized content allows us to take the next step:  Digitalization.  When the cost of storing and moving content drops to near zero, processes can evolve free from many of the previous constraints.  BPM and process management systems can leverage the digitized content and allow us to build more efficient and friendlier ways to interact with our customers and colleagues.  We can surface the right content to the right people at the right time, as they do their job.  Just as importantly, content that falls out of a process can be captured with context, managed as a record if needed.  Now instead of just having digitized content, we have digitalized processes that were not possible in the past.  We can see, via our process platform, exactly what is happening across our business processes.  Processes become searchable and can be analyzed in depth.  Processes can have their state automatically affected by signals in the business, and can signal others.  Our business state is now represented in a digital form, just like our content.

If digitization is about content, digitalization is about process.

Digital Transformation

The union of digitized content and digital processes is a powerful combination, and that helps create the conditions for digital transformation.  How?

In my opinion (and many others’) the single most important thing to happen to technology in recent memory is the rise of the open, standards-based web API.  APIs have been around on the web for decades.  In fact, my first company was founded to build out solutions on top of the public XML-based API provided by DHL to generate shipments and track packages.  That was all the way back in the very early 2000s, a lifetime ago by tech standards.  Even then, people were thinking about how to expose parts of their business as an API.

One of the watershed moments in the story of the API happened way back in the early 2000s as well, at Amazon.  By this point everybody has read about Amazon’s “Big Mandate”.  If you haven’t read about it yet, go read it now.  I’ll wait.  Ok, ready?  Great.  So now we know that the seeds for Amazon’s dominance in the cloud were planted over a decade and a half ago by the realization that everything needs to be a service, that the collection of services that you use to run your business can effectively become a platform, and that platform (made accessible) changes the rules.  By treating every area of their business as part of the platform, all the way down to the infrastructure that ran them, Amazon changed the rules for everybody.

How does this tie into the first two paragraphs?  What about content and process?  I’m glad you asked.  Content and process management systems take two critical parts of a business (the content it generates and the processes that it executes) and surface them as an API.  Want to find a signed copy of an insurance policy document?  API.  Want to see how many customers are in your sign up pipeline and at what stage?  API.  Want to see all of the compliance related records from that last round of lab testing?  API.  Want to release an order to ship?  API.  Want to add completed training to somebody’s personnel record?  API.  You get the idea.  By applying best practices around content and process management systems, you can quickly expose large chunks of your business as a service, and begin to think of your company as a platform.

This is transformative.  When your business is a platform, you can do things you couldn’t do before.  You can build out new and better user experiences much more quickly by creating thin UI layers on top of your services.  Your internal teams or vendors that are responsible for software that provides the services can innovate independently (within the bounds of the API contract) which immediately makes them more agile.  You can selectively expose those services to your customers or partners, enabling them to innovate on top of the services you provide them.  You can apply monitoring and analytics at the service layer to see how the services are being used, and by whom (this, in fact, is something I would argue you MUST do if you plan to orient yourselves toward services in any meaningful way, but that’s a separate article), which adds a new dimension to BI.  This is the promise of digital transformation.

There’s certainly more to come on this in the near future as my team continues our own journey toward digital transformation.  We are already well down the path and will share our insights and challenges as we go.


Do’s and Don’ts of Alfresco Development

Recently I had the pleasure of going up to NYC to speak with Alfresco customers and prospects about architecting Alfresco solutions and how to align that with efforts around digital transformation.  Part of that talk was a slide that discussed a few “do’s and don’ts” of designing Alfresco extensions.  Somebody suggested that I write down my thoughts on that slide and share it with the community.  So…  Here you go!


Stick with documented public APIs and extension points

In early versions of Alfresco it wasn’t exactly clear which parts of the API were public (intended for use in extensions) and which were private (intended for use by Alfresco engineering).  This was mostly a problem for people building things out on the Alfresco Java API, and it was largely fixed in the 4.x versions of the product.  Today, the Java API is fully annotated so it’s clear what is and is not public.  The full list is also available in the documentation.  Of course Alfresco’s server side Javascript API is public, as is the REST API (both Alfresco core and CMIS).  Alfresco Activiti has similar API sets.

Leverage the Alfresco SDK to build your deployment artifacts

During my time in the field I saw some customers and Alfresco developers roll their own toolchain for building Alfresco extensions using Ant, Gradle, Maven or in some cases a series of shell scripts.  This isn’t generally recommended these days, as the Alfresco SDK covers almost all of these cases.  Take advantage of all of the work Alfresco has done on the SDK and use it to build your AMPs / JARs and create your Alfresco / Share WAR file.  The SDK has some powerful features, and can create a complete skeleton project in a few commands using Alfresco archetypes.  Starting Alfresco with the extension included is just as simple.

Use module JARs where possible, AMPs where not

For most of Alfresco’s history, the proper way to build an extension was to package it as an AMP (Alfresco Module Package).  AMPs (mostly) act like overlays for the Alfresco WAR, putting components in the right places so that Alfresco knows where to find them.  Module JARs were first added in Alfresco 4.2, have undergone significant improvements since their introduction, and are now referred to as “Simple Modules” in the documentation.  Generally, if your extension does not require a third party JAR, using a Simple Module is a much cleaner, easier way to package and deploy an extension to Alfresco.

Unit test!

This should go without saying, but it’s worth saying anyway.  In the long run unit testing will save time.  The Alfresco SDK has pretty good support for unit tests, and projects built using the All-In-One Archetype have an example unit test class included.

Be aware of your tree, depth and degree matter

At its core, the Alfresco repository is a big node tree.  Files and folders are represented by nodes, and the relationships between these things are modeled as parent-child or peer relationships between nodes.  The depth of this tree can affect application design and performance, as can the degree, or number of child nodes.  In short, it’s not usually a best practice to have a single folder that contains a huge number of children.  Not only can this make navigation difficult if using the Alfresco API, but it can also create performance troubles if an API call tries to list all of the children in a gigantic folder.  In the new REST APIs the results are automatically paged, which mitigates this problem.
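From the client side, that paging contract is easy to consume: request a window of children and keep going until the server says there is nothing left.  Here is a sketch of that loop in Python, with the HTTP call abstracted behind a fetch_page callable and the response reduced to a simplified {entries, hasMoreItems} shape — the real v1 API wraps these fields in a list envelope, so treat the details here as illustrative:

```python
def iter_children(fetch_page, page_size=100):
    """Yield every child entry by walking skipCount/maxItems windows.

    fetch_page(skip_count, max_items) should return a dict shaped like
    {"entries": [...], "hasMoreItems": bool} (a simplified stand-in for
    the repository's paged list response).
    """
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        for entry in page["entries"]:
            yield entry
        if not page.get("hasMoreItems"):
            break
        skip += page_size
```

In practice fetch_page would issue something like GET .../nodes/{nodeId}/children?skipCount={skip}&maxItems={page_size} against the repository and unpack the response; bounding page_size keeps any single call cheap even for gigantic folders.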

Use the Alfresco ServiceRegistry instead of injecting individual services

When using the Alfresco Java API and adding new Spring beans there are two ways to inject Alfresco services such as NodeService, PermissionService, etc.  You can inject the individual services that you need, or you can inject the ServiceRegistry that allows access to all services.  On the one hand, injecting individual services makes it easy to see from the bean definition exactly what services are used without going to the code.  On the other hand, if you need another service you need to explicitly inject it.  My preference is to simply inject the registry.  Another reason to inject the registry is that you’ll always get the right version of the service.  Alfresco has two versions of each service.  The first is the “raw” service, which has a naming convention that starts with a lower case letter.  An example is nodeService.  These services aren’t recommended for extensions.  Instead, Alfresco provides an AoP proxy that follows a naming convention that starts with an upper case letter, such as NodeService.  This is the service that is recommended and is the one that will be returned by the service registry.


Don’t modify core Alfresco classes, web scripts, or Spring beans

Alfresco is an open source product.  This means that you can, in theory, pull down the source, modify it as much as you like and build your own version to deploy.  Don’t do that.  That’s effectively a fork, and at that point you won’t be able to upgrade.  Stick with documented extension points and don’t modify any of the core classes, Freemarker templates, Javascript, etc.  If you find you cannot accomplish a task using the defined extension points, contact Alfresco support or open a discussion on the community forum and get an enhancement request in so it can be addressed.  Speaking of extension points, if you want a great overview of what those are and when to use them, the documentation has you covered.

Don’t directly access the Alfresco / Activiti database from other applications

Alfresco and Activiti both have an underlying database.  In general, it is not a best practice to access this database directly.  Doing so can exhaust connection pools on the DB side or cause unexpected locking behavior.  This typically manifests as performance problems but other types of issues are possible, especially if you are modifying the data.  If you need to update things in Alfresco or Activiti, do so through the APIs.

Don’t perform operations that will create extremely large transactions

When Alfresco is building its index, it pulls transactions from the API to determine what has changed, and what needs to be indexed.  This works well in almost all cases.  The one case where it may become an issue is if an extension or import process has created a single transaction which contains an enormous change set.  The Alfresco Bulk Filesystem Import Tool breaks down imports into chunks, and each chunk is treated as a transaction.  Older Alfresco import / export functionality such as ACP did not do this, so importing a very large ACP file may create a large transaction.  When Alfresco attempts to index this transaction, it can take a long time to complete and in some cases it can time out.  This is something of an edge case, but if you are writing an extension that may update a large number of nodes, take care to break those updates down into smaller parts.
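The batching idea itself is language-agnostic.  Here is a small Python sketch of the pattern; apply_batch stands in for whatever per-transaction update your extension actually performs, and the names are my own:

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def update_in_batches(node_ids, apply_batch, batch_size=500):
    """Run a bulk update one bounded batch at a time.

    Each apply_batch call maps to a single, bounded transaction, so no
    individual change set grows large enough to stall the indexer.
    """
    for batch in chunked(node_ids, batch_size):
        apply_batch(batch)
```

Keeping each transaction to a bounded number of nodes keeps the corresponding index change sets small enough to process without timeouts.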

Don’t code yourself into a corner with extensions that can’t be upgraded

Any time you are building an extension for Alfresco or Activiti, pay close attention to how it will be upgraded.  Alfresco makes certain commitments to API stability, what will and will not change between versions.  Be aware of these things and design with that in mind.  If you stick with the public APIs in all of their forms, use the Alfresco SDK and package / deploy your extensions in a supported manner, most of the potential upgrade hurdles will already be dealt with.

Don’t muck around with the exploded WAR file

This one should go without saying, but it is a bad practice to modify an exploded Java WAR file.  Depending on which application server you use and how it is configured, the changes you make in the exploded WAR may be overwritten by a WAR redeployment at the most inopportune time.  Instead of modifying the exploded WAR, create a proper extension and deploy that.  The SDK makes this quick and easy, and you’ll save yourself a lot of pain down the road.

This list is by no means comprehensive, just a few things that I’ve jotted down over many years of developing Alfresco extensions and helping my peers and customers manage their Alfresco platform.  Do you have other best practices?  Share them in the comments!

Building an Open IoT Network in Birmingham. By the Users, for the Users

One of the big challenges in any IoT project is connectivity.  In a few proofs of concept and prototype projects I have worked on, the choices have basically come down to either Wifi or 3G/4G connections.  Both are ubiquitous and have their place, but both also have significant drawbacks that hinder deployment.  Wifi usually requires access codes, has crap range, chews up battery and has FAR more bandwidth than most IoT projects really need.  3G/4G means a subscription or some kind of data plan, and most carriers aren’t exactly easy to work with.  While platforms like Particle make this easier, it is still relatively expensive to send data and I’d like more choice in which embedded platform to use.  Are there any good alternatives?

Turns out there are, and one alternative in particular is appealing for the kind of open IoT projects that will drive us toward the future.  LoRaWAN is a Low Power Wide Area Network (LPWAN) specification governed by an open, non-profit organization that aims to drive adoption and guarantee interoperability.  With members such as Cisco, IBM and Semtech, and an experienced board consisting of senior leaders from many of these same companies and others, the LoRa Alliance is well positioned to make this happen.  So that’s one possible standard, but how does this enable an open IoT network?  How does it solve the problems laid out earlier and make some kinds of IoT projects easier (or possible at all)?

Enter The Things Network (TTN).  The mission of The Things Network is to create a crowdsourced global LoRaWAN network to foster innovation in much the same way as the early days of the Internet.  By deploying a free, open LPWAN, The Things Network hopes to enable innovators to build and deploy new IoT technologies that can change our communities.  That’s a mission I can get behind!  Check out their manifesto if you want to read about the full scope of their vision.

Our team seeks to build a Things Network community in the Birmingham, Alabama area.  We have already started reaching out to people across our metro in analytics, RF engineering, embedded systems, software development, entrepreneurship, community engagement / advocacy and government with the goal of building a consortium of local organizations to support a free and open IoT network.  Our vision is to build the open and transparent infrastructure required to support the future of smart cities.  Birmingham is a great place to do this.  The city center is relatively small so establishing full coverage should be achievable.  We have other smart cities initiatives in the works, including some things funded by an IBM Smarter Cities Challenge grant.  We have an active and growing technology community anchored by such institutions as the Innovation Depot, local groups like TechBirmingham and maker spaces like Red Mountain Makers.  We have active civic organizations with goals across the public sphere from economic development to air quality.  We have a can-do spirit and our eyes aimed firmly toward the future while being well aware of our past.

Assuming we can get a larger team assembled and this network launched, what do we plan to do with it?  A lot of that will come down to the people that join this effort and bring their own ideas to the table.  Initially the first few gateways will be launched in support of an air quality monitoring program using a series of low cost monitors deployed within the city.  Ideally this will expand quickly to other uses, even if those are just proofs of concept.  I, for one, plan to install a simple sensor system to tell me when the parking spaces in front of my condo building are available.  I hope others adopt this platform to explore their own awesome ideas and those ideas go on to inspire our city to become a leader in digital transformation.

I hope you will join us at the Birmingham Things Network Community and help us build the future one node at a time.

Adding a GPS to the Air Quality Monitor

I’ve gotten a little obsessed with this air quality monitoring project I’ve been working on, which tends to happen once I’ve been through the trough of sorrow and a project starts to look like it might actually work.  The last piece that is missing is the location of the readings.  It’s going to be hard to build up the kind of maps that this project calls for without it!  Once again, it’s Adafruit to the rescue with their awesome Ultimate GPS Breakout.  It was (and remains, as of the time this was written) out of stock on Adafruit’s website, but Robotshop and others have it on hand.  I’m growing very fond of Adafruit’s products; they all work as designed, have great docs, etc.

Before adding a new component into an existing project, especially something as complex as a GPS, it always pays to get it working in isolation.  In this case that means wiring up the GPS module to a spare Arduino Uno.  A tip for people just getting started and switching between boards:  make sure you change the port and board type to match what you are actually using!  When I switched from the Mega used to actually run the monitor to the Uno for testing I forgot this step and momentarily thought the Uno was dead.  That’s the kind of thing that happens during late night coding sessions when your brain isn’t quite working right.  After fixing this little issue the GPS powered up quite happily and quickly got a fix.  It only takes four wires to get the Ultimate GPS Breakout up and running.  Just add power, ground and a serial connection.  Simple, right?  The Adafruit product tutorial for Arduino covers all this, and how to use the provided sample sketches to prove it works.
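Under the hood, those sample sketches are just reading NMEA sentences off the serial line.  As an illustration of what "getting a fix" means at the protocol level (this is a sketch of the GGA sentence format, not Adafruit's library code), the fix quality lives in the seventh comma-separated field:

```python
def gga_has_fix(sentence):
    """Check the fix-quality field of an NMEA GGA sentence.

    Layout: $GPGGA,time,lat,N/S,lon,E/W,fix,satellites,...
    fix: 0 = no fix, 1 = GPS fix, 2 = differential GPS fix.
    """
    fields = sentence.split(",")
    if not fields[0].endswith("GGA"):
        raise ValueError("not a GGA sentence")
    return fields[6] not in ("", "0")
```

Until the module has a fix, it emits GGA sentences with empty position fields and a fix quality of 0, which is also when the onboard LED blinks at 1Hz.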

The next step was to get this module integrated into the existing project.  A little easier said than done due to a lack of real estate on the perfboard.  It took some rework and creative wire routing, but it’s in there.  When the board is installed in its enclosure the antenna will not be correctly oriented, but it is good enough for testing outside the enclosure.  The final version will use an external antenna mounted on top of the enclosure anyway so that’s not a big problem.  After wiring it up and updating the monitor’s sketch to use the GPS and send data to our collection server, it could not get a fix.  Moving it outside, reorienting the antenna and changing the serial port had no effect.  After several hours the GPS would just sit there and blink at 1Hz (indicating no fix) and no location data was coming to the collection server.  What’s the problem?

A little digging around in the docs and on the internet indicates that this module is particularly sensitive to noise in its power supply.  Without an oscilloscope it’s impossible to say for sure how noisy the supply is.  A little filtering couldn’t hurt anything, so I added a few decoupling capacitors to the supply lines.  Did it work?  No.  No it didn’t.  After over an hour the GPS had still failed to obtain a fix.  Back to the drawing board.  I tested the module with the Mega directly and got it working nicely by switching from a software serial port to a hardware port on Serial3.  Great, except for the fact that the same sketch wouldn’t work in the monitor itself.  It only worked if I isolated the Arduino Mega and GPS module on a breadboard.  This result indicated that something else on the air quality monitor board was the cause.

I began to wonder if another component on the board was interfering with the GPS.  Due to space constraints, the GPS module was right next to a WINC1500 Wifi module and a 5v regulator that powers the heaters in the gas sensors.  Could one of these be the cause?  The easiest way to test this is to move the components further apart.  I used a few jumper wires to move the GPS off of the board (about 10 inches away) and the GPS acquired a fix almost immediately.  Progress!  Now it was just a matter of narrowing down exactly which component was the culprit.  Simply for convenience I removed the WINC1500 module first since it wasn’t screwed down.  Again, the GPS acquired a fix almost immediately.  Looks like that is the cause.  The GPS must be getting some kind of interference from the Wifi module.  Lesson learned, these components apparently need some distance if they are to be used in the same project.  This is likely due to the patch antenna built into the Ultimate GPS breakout being right next to the WINC1500 breakout.  Hopefully moving to the external antenna that will ultimately be needed anyway will fix the problem.  If not, remote mounting the GPS module will.

Alfresco and Solr – Search, Reindexing and Index Cluster Size

A question came up from a colleague recently, driven by a customer question.  When is it appropriate to increase the number of Alfresco Index Servers running Solr?  The right direction depends on several factors, and what exactly you are trying to achieve.  Like many questions related to Alfresco architecture, sizing and scalability the answers can be found in Alfresco’s excellent whitepapers on the subject (full disclosure and shameless self promotion: I wrote one of them).  Not only are there multiple reasons why you may need to scale up the search tier, there are a couple different ways to go about doing it.  Hopefully this article will help lend a little clarity to a sometimes confusing topic.

A place to start

A typical customer configuration starts with a number of index servers that roughly matches the number of repository cluster nodes.  The index servers sit behind a load balancer and provide search services to the repository tier.  Each index server maintains its own copy of the index, providing full failover.  It’s common to see a small to medium enterprise deployment running on a 2X2 configuration.  That is, two repository tier servers and two index servers, the minimum for high availability.  As the repository grows, user patterns change or the system is prepared for upgrade, this can prove insufficient from a search and indexing point of view.

Large repositories

When Alfresco first ran our 1B document benchmark we set a target of about 50M documents indexed per index server.  So, for our 1B document environment, we had 20 index shards, each containing ~50M docs.  Our testing shows that the system gives solid, predictable performance at this level.  For large repositories, this is a good starting point for planning purposes.  If you have a lighter indexing requirement per document (for example, a small metadata set or no full text indexing) it is likely possible to go higher.  Conversely, if your requirements call for full text indexing of large documents and you have a large, complex metadata set, a smaller number of documents per shard is more appropriate.  Either way, as a repository grows larger, at some point you will need to consider adding additional index servers.  As with all things related to scale, take vendor recommendations as a starting point, then test and monitor.
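Treated as a planning rule of thumb, that benchmark figure turns initial sizing into simple arithmetic (50M per shard is a starting point, not a hard limit; adjust it up or down for your indexing requirements):

```python
import math

def shards_needed(total_docs, docs_per_shard=50_000_000):
    """Estimate how many index shards a repository needs,
    given a target document count per shard."""
    return math.ceil(total_docs / docs_per_shard)
```

For the 1B document benchmark environment this yields the 20 shards described above.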

Heavy search

Some ECM use cases lean heavily on search services.  For these cases it makes sense to deploy additional index servers to handle the load.  Spreading search requests across a larger number of servers does not improve the performance of any single search, but it does allow more concurrent searches to complete quickly.  If your use case relies heavily on search, you may need to add index servers to satisfy those requests.  For this specific case, both sharding and replication can be appropriate; both allow you to spread your search load across multiple systems.  So how do you choose?  In most cases sharding is the better option.  It is more flexible and has additional benefits, as we will outline in the next section.

If your repository is relatively small (less than 50M documents or so) and you are primarily concerned with search performance, replication can be a good option.  Replication sets up your index servers so that only one actually tracks the repository and builds the index.  This master node then replicates its index out to one or more slaves that are used to service search requests.  The advantage of this configuration is that DB pressure is reduced because only one index server is tracking the repository, and you now have multiple servers with a copy of that index to service search requests.  The downside is that it has a relatively low upper limit of scalability and introduces a single point of failure for index tracking.  Not such a huge problem, though: if the tracking server stops working you can always spin up another and re-seed it with a copy of the index from the slaves.  A replicated scenario may also slightly increase the index lag time (the time between adding a document and it appearing in the index), since changes must first be written to the master index and then replicated out to the slaves.  Real world testing shows that this delay is minimal, but it is present.
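Mechanically, replication is configured in each core's solrconfig.xml via Solr's ReplicationHandler.  A sketch of what the master and slave sides look like (the host name, port and poll interval are illustrative assumptions, and core names vary by Alfresco version):

```xml
<!-- On the tracking (master) index server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,solrconfig.xml</str>
  </lst>
</requestHandler>

<!-- On each search (slave) index server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://index-master:8080/solr/alfresco</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

The slaves poll the master for new index generations after each commit, which is where the small additional index lag comes from.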

Reindexing and upgrades

There is another case where you may want to consider adding index servers: when reindexing the repository or upgrading to a new version of Alfresco.  Alfresco has supported multiple versions of Solr over the years.  Alfresco 4.x used Solr 1.4, 5.0/5.1 use Solr 4, and the upcoming 5.2 release can use Solr 6.  Newer versions of Solr bring great new features and performance improvements, so customers are always eager to upgrade!  There is one thing to look out for, though: reindexing times.  Switching from one version of Solr to another requires that the repository be reindexed, and for very large repositories this can take some time.  This is where sharding is especially helpful.  By breaking the index into pieces (shards) we can parallelize the reindexing process and allow it to complete more quickly.  The fewer documents an individual shard must reindex, the faster it will finish (within reason; 10 docs per shard would be ridiculous).  So if you are considering an Alfresco upgrade and are worried about reindexing times, consider additional index servers to speed things along.  Note that most Alfresco upgrades do not require you to switch versions of Solr immediately.  You can continue to run your server on the old index while the new index builds, but during this time you cannot take advantage of Alfresco features that depend on the new index.
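Because shards rebuild in parallel, the reindexing win is roughly linear in shard count.  A back-of-the-envelope sketch (the per-shard throughput figure in the test values is a made-up placeholder; measure your own before planning an upgrade window):

```python
def reindex_hours(total_docs, shard_count, docs_per_hour_per_shard):
    """Rough wall-clock reindex estimate: elapsed time is driven by the
    documents a single shard must rebuild, since shards work in parallel."""
    return (total_docs / shard_count) / docs_per_hour_per_shard
```

Doubling the shard count roughly halves the wall-clock reindex time, up to the point where DB tracking or I/O becomes the bottleneck.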


This list is by no means comprehensive, but it does outline the three most common reasons I have seen customers add additional index servers.  Have you seen others?  Comment below, I’d love to hear about it!

Air Quality Monitoring, Phase II: What To Do With The Data?

In a few recent blog posts I’ve laid out an air quality monitoring project that has materialized in my spare time.  If you want to start from the beginning the first and second posts about the project lay out the inspiration and goals, and the hardware that takes the measurements.  A few people have asked “OK, so you have this sensor package capturing decent quality data, now what are you going to do with it?”.  Good question, and it has several answers.

Personal questions

The genesis of this project was the desire to test some assumptions that we have had for a long time about air quality in Birmingham, where the pollution comes from and how that has affected where people choose to live.  Hopefully the data captured will be sufficient to do relative comparisons of different parts of the Birmingham metro area.  If so, it should be possible to test these assumptions and determine if the pollution in the Jones Valley is actually worse than it is over the mountain, if elevation up in the mountain area itself has an effect, and how far out from the city you have to go before the effects of industry aren’t as apparent.  I’d like to know before we move!

Community awareness

One of the most obvious uses for a system like this and the data it generates is community awareness.  How bad is the problem?  What can we do to fix it?  What communities are the most impacted?  Do the pollution measurements correlate with other data such as demographics or proximity to specific types of industry?  There is a lot of geo data out there that shows income, education, economic activity, etc.  It will be interesting to overlay the air quality measurements with this other data to see if there are any correlations.  Information is empowering.

Incentivizing change

Citizen science projects are great for engaging with the community, and can help drive change.  One particularly inspiring example is a project in the Netherlands that is providing free Wifi when air quality goals are met.  In phase II, this project will adopt a similar direction with a few tweaks.  The current plan is to tie the sensor array to a system that will provide free Wifi for people in range when the air quality is good, or at least use a captive portal system to show people the current readings as a part of a free Wifi system that is running at all times.  That same captive portal page will also contain links to make strategic donations or renewable energy credit purchases so users can take direct action.


STEM education

In recent years there has been a huge emphasis on early STEM education.  Whether or not we have a shortage of skilled technical workers is up for debate, but regardless, a lot of attention is being paid these days to giving kids the foundational skills.  This little project could be a great introductory program for basic electronics and air chemistry.  Since it is all being developed in the open, using open software and open hardware, there are few barriers to using a project like this in an educational setting.  It could be especially fun to use this sort of project in a cross disciplinary educational approach.  For example, running the ozone sensor over a long period of time alongside plants that are known to be sensitive to ozone could be an interesting way to tie together electronics / software and life sciences in a way that is easy for students to understand.


While these uses for the project and its data are interesting, what will be the most fun is watching what else shakes out as we move ahead.  Discussing interesting ideas in public is probably the #1 reason why I rebooted my blog, and I can’t wait to hear what else people come up with!

Building the DIY Air Quality Monitor – BOM and Pictures

A few days back I posted a bit about a DIY air quality monitoring project I’ve been working on.  That post just outlined why the project started, what we hoped to achieve, the high level design, component selection and software stack.  In the last few weeks the project has moved past the breadboard stage into a real physical prototype.  It’s a little ugly, built on generic perfboard and full of design compromises, but it works like a charm.  Now that it’s working, it’s time to share what went into the build, the bill of materials, and a few pics!

Top of board

On the top of the board I have mounted the Wifi module, the MQ series sensor array, a 6 circuit Molex connector for the particle sensor, a 5V regulator and the Arduino.  The Arduino mates to the perfboard using a bunch of 0.1″ male pin headers.  You can think of the perfboard as just a ridiculously, comically oversized Arduino shield.  The WINC1500 mounts the same way.  The MQ sensor breakouts use a right angle female pin connector.  Why so many connectors?  I like to reuse stuff.  With the way this is put together it is easy to pop modules on / off of the board to be reused in other projects or replaced if they stop working.  One thing to note here is that the Arduino uses wacky spacing for one of its sets of headers.  The spacing between pins 7 and 8 is not the standard 0.1″.  Why?  Who knows, seems silly to me.  I didn’t need one of those sets of headers, so I just left the problematic one with the weird spacing off of the board.

A word on power in this design.  The MQ series sensors all have internal heaters that are required to keep them at the right operating temperature.  These heaters need regulated 5V and consume more power than the Arduino has available from its 5V output.  So, 7.5V is fed into the regulator, which provides regulated 5V output for the MQ sensors and the DHT-22.  The same 7.5V is fed into the Arduino’s vin pin, which powers the Arduino itself.  The other low power items (the WINC1500 and Sharp particle sensor) are driven off of the Arduino’s regulated 5V output.  While it is possible to run the 7805 regulator without a heat sink for low current loads, my total load was high enough that it needed one.
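The heat sink question comes straight from linear regulator arithmetic: everything above the output voltage is dropped across the 7805 as heat.  The load figures below are illustrative assumptions (MQ-series heaters typically draw on the order of 150 mA each), not measurements from this exact build:

```python
def regulator_dissipation_w(v_in, v_out, load_a):
    """Heat dissipated by a linear regulator such as the 7805:
    the voltage dropped across it times the current through it."""
    return (v_in - v_out) * load_a

# e.g. three MQ heaters at ~150 mA each plus a few mA for the DHT-22:
load_a = 3 * 0.150 + 0.005
watts = regulator_dissipation_w(7.5, 5.0, load_a)
```

At roughly a watt of dissipation, a bare TO-220 package runs uncomfortably hot, which is why this build got a heat sink.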


Bottom of board

On the back of the board we find the Sharp GP2Y1010AU0F sensor, the DHT-22 temperature and humidity sensor, and a bunch of ugly globs of solder.  The one thing I don’t like about the Sharp sensor is the lack of real mounting holes.  It does have some little rails, so I used some short plastic standoffs and nuts to sandwich that rail and provide a secure mount.  The Sharp sensor’s data sheet prescribes a specific orientation for mounting.  Once this board is slid into its housing and the housing is stood up, the sensor will be oriented correctly.  The DHT-22 sensor is also mounted on this side of the board.  Why is that when it looks like there is plenty of room to mount it next to the MQ sensor array?  Recall that the MQ sensors have heaters in them.  The first iteration of this board had the DHT-22 right next to the MQ sensors on the other side.  When the MQ sensors came up to temp, the DHT-22 was consistently reading 10-15 degrees higher than it should have been.  Moving the sensor to the other side of the board seems to have corrected that.


Enclosure bottom

This type of enclosure is a pretty standard thing for sensors that live outside.  A louvered radiation shield keeps the rain and sun off of the bits and pieces.  Strong driving rain would probably still find its way in and I don’t want all these parts getting wet, so it will be mounted up under a covered area where it will stay nice and dry.  This particular shield enclosure was designed for an Ambient Weather temperature sensor but it works great for this project.  Luckily at its widest it was just a few mm narrower than the perfboard I used for the project.  Cutting a few little notches in the sides of the enclosure allows the board to easily slide in and out like it was designed to be there.  The oval shape of the enclosure cavity made part layout a little tricky.  The taller parts like the MQ sensors and dust sensor needed to be kept toward the middle so they would have enough clearance.


Bottom plate installed

The bottom plate has a standard power jack that is used to supply the system with power, and three status LEDs that show the state of the system.  Red means that there has been a fault detected and the system has halted.  When that occurs it will restart in 8 seconds or so when the watchdog timer kicks in.  The yellow light signals that the system is starting up, connecting to the network and taking test readings.  Green means that everything has started up and the monitor is successfully connected to the network.  It’s a bit crude, but does a good enough job to indicate the system status without having to hook up a USB cable and look at the serial monitor output.
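The indicator logic amounts to a simple three-state mapping.  Sketched in Python rather than the actual Arduino firmware (the state names here are my own labels, not identifiers from the real sketch):

```python
def status_led(state):
    """Map monitor state to the bottom-plate indicator LED."""
    return {
        "fault":    "red",     # halted; watchdog restarts it in ~8 s
        "starting": "yellow",  # connecting to Wifi, taking test readings
        "running":  "green",   # connected and reporting normally
    }[state]
```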


All buttoned up!

The perfboard slides right in, and then the bottom two louvers are attached to the threaded posts with wing nuts.  Looks nice and clean once it’s all put together and you can’t even see that ugly board.


The posts on top of the enclosure are used to attach it to an L bracket included with the enclosure.  This bracket also comes with some U bolts that make it easy to mount the whole assembly to a pole.


Here’s what was used to build the monitor.  If you have been hacking around on electronics for a while you probably already have some of this stuff just laying around.  If you haven’t and you don’t have a good stock of bits and pieces, now’s a good time to order extras of stuff you know you’ll use a lot.  Passives, wire, that ubiquitous 0.1″ male pin header strip and of course you can never have too many LEDs!

I’ve put the wire in the BOM as well, even though it’s not strictly necessary.  I like the pretinned 24AWG stuff for laying out power and ground busses on one side of the board because it’s easy to solder it down to the pads on the board as you are routing it around. 22AWG stranded wire is good for connecting the board to stuff mounted on the enclosure (like the power jack) where some flexibility is needed.  The 30AWG insulated wire wrap wire is good for signal connections.  The 30AWG wire wrap wire is fragile though, so after I have it all working I like to tack it down with a dab of hot glue.  If you have to do some rework, the hot glue peels off easily enough.



  • 1x 10μF 25V electrolytic capacitor
  • 1x 1μF 25V electrolytic capacitor
  • 1x 220μF 25V electrolytic capacitor
  • 1x 150Ω resistor
  • 1x 10kΩ resistor
  • 1x each green, yellow and red LED




Moving from the breadboard to a real prototype was pretty straightforward.  With a good schematic and lots of pictures of the working breadboarded design, translating it to the perfboard was mostly a matter of laying out the components in such a way that they fit neatly in the enclosure.  The schematic and Fritzing diagram will be added to the Github project shortly; they just need a little cleaning up.  If you are going to try a project like this, make sure you have your enclosure and bare board in hand at the same time, or you may find that things don’t fit like you expected them to.  I was expecting the enclosure I ordered to have a larger cavity, for example.  Now that the monitor can be moved around without wires popping out everywhere we can move on to some real world testing, data collection and analysis.