A colleague recently passed along a customer question: when is it appropriate to increase the number of Alfresco Index Servers running Solr? The right direction depends on several factors and on what exactly you are trying to achieve. Like many questions related to Alfresco architecture, sizing and scalability, the answers can be found in Alfresco’s excellent whitepapers on the subject (full disclosure and shameless self-promotion: I wrote one of them). Not only are there multiple reasons why you may need to scale up the search tier, there are a couple of different ways to go about doing it. Hopefully this article will lend a little clarity to a sometimes confusing topic.
A place to start
A typical customer configuration starts with a number of index servers that roughly matches the number of repository cluster nodes. The index servers sit behind a load balancer and provide search services to the repository tier. Each index server maintains its own copy of the index, providing full failover. It’s common to see a small to medium enterprise deployment running on a 2x2 configuration: two repository tier servers and two index servers, the minimum for high availability. As the repository grows, user patterns change or the system is prepared for an upgrade, this can prove insufficient from a search and indexing point of view.
When Alfresco first ran our 1B document benchmark we set a target of about 50M documents indexed per index server. So, for our 1B document environment, we had 20 index shards each containing ~50M docs. Our testing shows that the system gives solid, predictable performance at this level. For large repositories, this is a good starting point for planning purposes. If you have a lighter indexing requirement per document (for example, a small metadata set or no full text indexing), it is likely possible to go higher. Conversely, if your requirements call for full text indexing of large documents and you have a large, complex metadata set, a smaller number of documents per shard is more appropriate. Either way, as a repository grows larger, at some point you will need to consider adding index servers. As with all things related to scale, take vendor recommendations as a starting point, then test and monitor.
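As a rough planning aid, the arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the default of 50M documents per shard are assumptions taken from the benchmark target described here, not an official Alfresco sizing formula, and the right per-shard figure for your repository may be higher or lower.

```python
def estimate_shard_count(total_docs: int, docs_per_shard: int = 50_000_000) -> int:
    """Estimate how many index shards keep each shard at or below docs_per_shard.

    docs_per_shard defaults to the ~50M planning target mentioned above;
    treat it as a starting point and adjust it after testing and monitoring.
    """
    if total_docs < 0 or docs_per_shard <= 0:
        raise ValueError("total_docs must be >= 0 and docs_per_shard must be > 0")
    # Integer ceiling division, with a floor of one shard for small repositories.
    return max(1, (total_docs + docs_per_shard - 1) // docs_per_shard)


if __name__ == "__main__":
    # The 1B document benchmark: 1,000,000,000 / 50,000,000 = 20 shards.
    print(estimate_shard_count(1_000_000_000))
    # A heavier full text workload might call for a smaller per-shard target.
    print(estimate_shard_count(1_000_000_000, docs_per_shard=25_000_000))
```

Lowering `docs_per_shard` for heavy full text indexing, or raising it for metadata-only indexing, is the knob that corresponds to the trade-off described in the paragraph above.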
Some ECM use cases lean heavily on search services. For these cases it makes sense to deploy additional index servers to handle the load. Spreading search requests across a larger number of servers does not improve the performance of a single transaction, but it does allow more concurrent searches to complete quickly. For this specific case, both sharding and replication can be appropriate, since both allow you to spread your search load across multiple systems. So how do you choose? In most cases sharding is the better option. It is more flexible and has additional benefits, as outlined below.
If your repository is relatively small (less than 50M documents or so) and you are primarily concerned with search performance, replication can be a good option. Replication sets up your index servers so that only one is actually tracking the repository and building the index. This master node then replicates its index out to one or more slaves that are used to service search requests. The advantage of this configuration is that DB pressure is reduced by only having one index server tracking the repository, and you now have multiple servers with a copy of that index to service search requests. The downside is that it has a relatively low upper limit of scalability, and it introduces a single point of failure for index tracking. This is not such a huge problem though: if the tracking server stops working you can always spin up another and re-seed it with a copy of the index from the slaves. A replicated scenario may also slightly increase the index lag time (the time between adding a document and it appearing in the index), since the document must first be written to the master index and then replicated out to the slaves. Real world testing shows that this delay is minimal, but it is present.
Reindexing and upgrades
There is another case where you may want to consider adding index servers, and that is when reindexing the repository or upgrading to a new version of Alfresco. Alfresco has supported multiple versions of Solr over the years. Alfresco 4.x used Solr 1.4, 5.0/5.1 use Solr 4, and the upcoming 5.2 release can use Solr 6. Newer versions of Solr bring great new features and performance improvements, so customers are always eager to upgrade! There is one thing to look out for though: reindexing times. Switching from one version of Solr to another does require that the repository be reindexed. For very large repositories this can take some time. This is where sharding is especially helpful. By breaking the index into pieces (shards) we can parallelize the reindexing process and allow it to complete more quickly. The fewer documents an individual shard reindexes, the faster it will finish (within reason; 10 docs per shard or something would be ridiculous). So if you are considering an Alfresco upgrade and are worried about reindexing times, consider additional index servers to speed things along. Note that most Alfresco upgrades do not require you to switch versions of Solr immediately. You can continue to run your server on the old index while the new index builds, but during this time you cannot take advantage of Alfresco features that depend on the new index.
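To get a feel for how shard count affects reindexing time, here is a back-of-the-envelope sketch. It assumes documents are split evenly across shards, that all shards reindex in parallel, and that each shard sustains a fixed indexing rate. The function name and the rate used in the example are hypothetical placeholders, not measured Alfresco figures, and in practice the database or repository tier can become the bottleneck well before the model below suggests.

```python
def estimate_reindex_hours(total_docs: int, shard_count: int,
                           docs_per_hour_per_shard: float) -> float:
    """Rough wall-clock reindex time, assuming evenly split, fully parallel shards.

    docs_per_hour_per_shard is a hypothetical rate; measure your own environment.
    """
    if shard_count <= 0 or docs_per_hour_per_shard <= 0:
        raise ValueError("shard_count and docs_per_hour_per_shard must be > 0")
    # The largest shard determines when the whole reindex finishes.
    docs_in_largest_shard = -(-total_docs // shard_count)  # ceiling division
    return docs_in_largest_shard / docs_per_hour_per_shard


if __name__ == "__main__":
    total = 200_000_000   # illustrative repository size
    rate = 1_000_000      # illustrative rate, documents per hour per shard
    for shards in (2, 4, 8):
        hours = estimate_reindex_hours(total, shards, rate)
        print(f"{shards} shards: ~{hours:.0f} hours")
```

The model is deliberately simple: it captures only the point made above, that the time to finish tracks the number of documents each shard has to process, so it should be validated against a test reindex before being used for planning.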
This list is by no means comprehensive, but it does outline the three most common reasons I have seen customers add index servers. Have you seen others? Comment below. I’d love to hear about them!