A Little SolrCloud History — Making Solr Distributed First

The first time we started talking about SolrCloud that I can remember is in 2009, at ApacheCon Oakland. At the time, the Apache Solr search engine had some distributed support, but it was mostly predicated on it DIY. Solr provided the building blocks, but it was up to the user to provide any automatic failover or fault tolerance and to distribute updates across servers. At the time, a huge percentage of search engine use cases involved one to a handful of servers. This was starting to change though and we needed to plan out a truly distributed path for Solr.

So I sat down with Yonik Seeley, the creator of Solr, and Brian Pinkerton, now the CTO at the Chan Zuckerberg Initiative, to talk about making a first class distributed mode for Solr. For most of this kick off meeting, Brian and Yonik shot back and forth high level ideas on overall architecture based on their extensive experience. Mostly I listened, commenting when some Lucene / Solr knowledge I had might be relevant, but otherwise thinking, “I don’t know until I start trying to practically build it.” That meeting started the ball rolling.

Over months, Yonik and I took a look at some other distributed systems, particularly Amazon Dynamo, and discussed what Yonik coined as ‘SolrCloud’ might be. Over time, Yonik settled in on an architecture that he thought fit the current direction and design of Lucene and Solr. We did not model it after Dynamo. In the end we favored a design that made some NoSql features much easier to implement and that meshed well with some of the practical challenges a distributed search engine has over a distributed database. If you know your CAP theorem, we chose a CP design.

Initially, not many of us were working on SolrCloud code. First just Yonik and I, with some others joining and leaving here and there. We also had plenty of other things to work on. That led to initially splitting up SolrCloud development into two phases — the easy part and the hard part or the read side and the write side.

So we tackled the read side first. Yonik and I split work much as we ended up doing on the update side. The majority of the time, Yonik took critical low level pieces and I did most of the surrounding work. We spent a lot of time talking out all of the various paths and possibilities and that guided how I put that surrounding work together. This is also when we chose to rely on Apache ZooKeeper for many of the harder distributed problems we faced. That has proven to be a very solid decision.

In 2010, I talked about this read side work at the very first Lucene Eurocon (Lucene-Solr revolution with a different name). It was still up to the user to distribute updates, but read side failover and fault tolerance was handled automatically by Solr. This version had a limited number of users, you had to be pretty sophisticated, but it had users, and some of them scaled quite happily with this solution for a long time.

After some time focusing on other things, we came back for the write side. This was much more of a challenge, much harder to get right, and much more difficult to harden and scale. Building a really large scale distributed system with a lot of large and varying users and environments is very, very, very difficult. It is easy to have a very smart and talented team with an incredible design and fail at actually getting your system into the hands of users and truly hardened against the scary, real world. We did have an advantage though. Apache Solr already had a very large user base. We just had to get them to come over to the new world.

Because Solr had such a large user base, when we added this new set of distributed features, we only enabled them if you ran in what we called SolrCloud mode. Existing users would still get the same Solr from previous releases, failover and fault tolerance on the user. This meant that even after the first versions of the full SolrCloud solution were released, I would regularly search the user list for SolrCloud related discussion and it was very rare compared to standard Solr. There were also only a few of us that could answer any questions. We had put something out there, but there was really no guarantee the users would flip the switch.

We chose to release early so that real world use was able to influence our progress early. We liked the idea of release early, release often. In open source, this is even more important because a lot of time users also become developers and we had a very small number of developers working on SolrCloud intermittently. The Lucene-Solr project had its first alpha releases for the 4.0 release, and we had an early system in place just in time to take advantage of this.

It’s one thing to build a distributed system though, it’s another to get users, keep users, and allow for successful solutions that actually drive value. Even though Apache Solr had been proven, we had not proven SolrCloud and the path to a truly hardened distributed search system is long and hard and requires a lot of resources.

The early days were a bit lonely in SolrCloud world. There were a few hardcore users, some of them successful, and many of them helped contribute ideas, bug reports, fixes, and discussion in the early days. As the project releases rolled on, the user count slowly rolled up, and our initial design was filled in where we had chosen to straw man. We had started the ball rolling, while still small and slowly. We plowed in bug fixes and improvements.

At some point, the use case demands of users and the quality of the SolrCloud code got to the point where the number of users started to grow much more rapidly. It was no longer uncommon to see SolrCloud related discussion on the user list. Some very large companies started investing in SolrCloud and pushing it’s limits, while filing bug reports and contributing back fixes and ideas. This is when the rock starting rolling faster and growing. The virtuous circle phase that is so hard to get to. When an Apache project starts driving value at some of the largest companies in the world, a lot of value is returned. That leads to even more adoption.

At this point in time, most discussion about Solr on the mailing lists is related to using SolrCloud. SolrCloud is now the first class mode of Solr. Some of the largest companies in the world have extensive SolrCloud clusters and are regularly hiring SolrCloud developers. I’ve had to abandon answering Solr related LinkedIn messages because I get too many. In the early days of few users, and little discussion about SolrCloud, I never thought this stage was a guarantee. Even with the already successful Apache Solr project, a fully distributed mode with no users was not a slam dunk. It’s humbling that it worked out though. It’s also amazing to have witnessed everything coming together. While only a couple of us put up the original base SolrCloud code, it took an army of volunteers and paid programmers and tons of users years to construct this ever hardening and more heavily scaled system that you see today. A few of us never could have built this, even with the existing Solr codebase to build on. The brilliance of Open Source is that it only took a few of us to launch it though.