From ElasticSearch back to SQL Server
Sharing with the community why we decided to move back from ElasticSearch to SQL Server
Sometimes it happens: you move to something that is considered cool and bleeding edge back to something that has dust over it.
Well, at least this is how SQL Server has been perceived lately by the majority of developers and software architects. Microsoft must have heard this sort-of disappointment against his RDBMS and decided to recover all the lost time, and try to make it cool again.
If you still think that SQL Server (and his twin, SQL Azure) is still the “old” RDBMS with all its “rigid” limitations, my advice, as Software and Data Architect, is to go back and check it again.
This is what we did in Sensoria in the last months, and now have moved everything from ElasticSearch back to SQL Server.
I think that what we discovered and the experience we did is worth sharing: I believe that a lot of companies will find themselves (or are already) in the same situation. Maybe it won’t be ElasticSearch but another product, but the story will be the same. Here it goes.
The Story So Far
Something like five years ago, if you were looking for a highly scalable database solution that could also provide a lot of flexibility in terms of object stored and could also offer a good analytical solution, one of the option was ElasticSearch. It can ingest and index JSON documents, using a schema-on-read approach (instead of a schema-on-write, typical of RDMBS), it allows for really easy scale-out, it offers a very nice monitoring tool (based on Kibana), it uses REST as communication protocol so it is compatible with everything (which make it perfect for the web), and, last but not least, it is based on the well-known Lucene technology that is solid and mature.
So ElasticSearch it was. It did great for several year, millions and millions of documents has been stored into it, both coming from end users using our products or from internal machine learning algorithm that crunches numbers and tries to turn all the raw data into something more useful for the athlete, making sense of the data we gather from garment sensors.
Requirements and possible options
We I joined Sensoria, we were in the middle of an architectural revision. Of course one of the system being reviewed was the database. We had to decide if to stay with ElasticSearch or move to something different. I was in charge for this decision, given my strong data architect background, and I could have chosen any database. We could have stayed on ElasticSearch or switched.
ElasticSearch has lately clearly stated that its focus is on search and analytics and that ruled it out immediately, since we need a database where to store all our data in first place.
We started to evaluate the options. One of our main objective is to go 100% PaaS so, using Azure, one of the first options was DocumentDB (now CosmosDB).
Beside storing and querying JSON we also looked for something that would allow us to quickly aggregate data using any possible combination of filters and aggregations logic. For example someone may want to see the heart rate aggregated on weekly basis for the last two years, while someone else may want to get the pace aggregated by day in the last month. This must be done in real-time, while, at the same time, new sessions are loaded and processed.
We need to ingest data coming from our IoT devices, Apps and SDKs, but we also need to store “classic” data like user account, preferences, device info and so on.
Of course there is no a one-size-fit-all solution. Or, better, there is no such solution at a decent costs.
We’re also implementing a Lambda architecture and the final data store that is used the serve data to the end user it only contain “squeezed” data, where the schema is actually well-known, otherwise developers wouldn’t know how to interact with stored data in order to efficiently generates statistics and charts. In the other stages data is stored in other more suitable places. For example we’ll use a streaming query engine to query data flowing into the system in real-time, and we’ll be using a data lake to store IoT raw data.
Such architecture allows us to focus only, at least for now, on the master database (I like to call it a online transactional data warehouse, since what we want to do is basically real time analytics), so the search scope is now narrowed down and well defined.
Truth is that almost all enterprise databases we evaluated, relational or not, provides, albeit with different capability levels, the features we were looking for our master database.
So, from the decision was not only technical one, but more a strategic one. What is the database that, in our context, will give us the highest natively supported features set, the lowest amount of maintenance and the simplest platform so that our developers could use it efficiently? All at competitive costs, of course.
As obvious from the title of this article, the final decision was to use Azure SQL. You may think my choice are biased since I’m a Microsoft Data Platform MVP, and my background is almost SQL Server. “If all you know is a hammer, everything will look like a nail”, right? Sure, but this could have been applied to SQL Server several years ago, the engine has improved a lot in the last version, both in term of raw performances and features offered. Here’s the feature that motivated my choice:
- Columnar storage support: perfect for real-time aggregations
- Columnar and Row storage at the same time: optimum to perform aggregate and singleton queries
- In-Memory (Lock-Free structures): great to support high-concurrency workloads
- Partitions: perfect for distributing workload
- Temporal Tables: great to simplify development
- JSON support with Non-Clustered & Generalized INverted Indexes: perfect to support migration from ElasticSearch and also to make life easier to developers
- Row-Level Security: optimum for securing data at the source
And these are the feature that are available now. In the near future SQL Azure will also provide native support for Graph Queries, all within the same engine.
So technically it’s great but that alone was not sufficient to justify it. Costs played a huge role in the final decision. Initial tests showed that we could handle the load managed by several ElasticSearch servers with just a entry level Azure SQL Premium database, thanks the columnar indexes. The costs showed a reduction ration of 3:1, which was immediately loved by the business guys.
Integration is a key factor
For me, having all the aforementioned features in just one product is really a big thing. Of course SQL Server / Azure SQL may not excel in all of them, so you don’t have the best of the best, but, except for rare edge cases, I rather prefer to have just one tool that provides you with hundred of feature at near perfect level that hundred of tools each one just doing just one thing but perfectly.
The fact that everything is integrated means that we don’t have to move data around, manually figuring out how to integrate different data stores while keeping logical consistency and integrity of data.
Simplicity is a big plus. By making our solution simpler, much simpler, means that we have less development and maintenance costs.
Having to learn only one query language is also a big plus. One of the downside of using different databases, is that NoSQL database have the tendency to have their own query language, which make everything a little more challenging: developers need to master n languages, which make them way less productive, and bugs and performance problems are just more frequent.
Mastering well a database is really important because it avoid developers to go on a caching spree. Caching is a fantastic technology if used correctly (we use the well know Redis database for that). If used just to hide performance problems, it will quickly turn your solution into something worse that the worst spaghetti-code you have ever seen. With the addition of concurrency management and stale data issues to the problem. Hell is a nice place in comparison.
So far, for us, the feature set, the integration of everything, the SQL language and the existence of very good tools (this is another really important point to keep in mind: good tools allows productivity to increase a lot) was pointing us to Azure SQL. But what about scalability? It will scale enough for us?
Scalability is in the fabric
One of the argument that was true when SQL Server was only offering an on-premises options is the scalability factor. What happens if we need to scale out to handle a lot more data and concurrent access we have right now?
Beside the fact the our architecture uses a queue in order to handle spikes of workload, especially for data ingestion, Azure offers the solution. Azure SQL allows the database to scale up to 4 TB of data and if this should be not enough, Elastic Pools will help even more. We comfortably fit in that space for now (we’re not Twitter or Facebook just yet…), so we’re fine within that limit (even because all the raw data stays in a data lake, and there is were things get huge, usually).
Being a PaaS solution also means that we don’t have to manage anymore clusters, nodes, updates, reboots and all the well known things that you normally have to deal with when you have to maintain your servers.
Python and R are also in the mix
A lot of the machine learning and advanced analytics code is based on R and Python. With the latest release of SQL Server, beside native support to R, also Python has been added as a language that can be used in-database to crunch data. This means that we can reuse all our assets based on scikit-learn. While this is not yet available on Azure SQL, I expect it to come soon, as also mentioned at //build 2017 event.
One key factor that allowed use to use SQL Azure is the native support to JSON that it now finally has. While is still a v1 feature, it is enough, for us, to implement all the desired features we needed to replace ElasticSearch. I’ll write a more technical post on the subject in future, so stay tuned if you’re interested in the implementation details but, as you can imagine, the trick was to fool the code to make sure it felt like being still talking with ElasticSearch, so overall changes to our code have been minimal.
ElasticSearch does offer a lot of nice feature, but, on Azure, is not offered as PaaS. Since it’s focused on analytics and search mostly, if we would had chosen it, would would have still needed a database to store OLTP data. CosmosDB is PaaS and also offers a lot of flexibility and also support SQL but….it’s another database that would have required us to create an additional integration solution to keep the existing Azure SQL and the newly created CosmosDB data in sync. We also have a quite well defined schema, so Azure SQL was the perfect choice for us in terms of ratio between costs, performances and features. It provides a very good range of features that ranges from OLTP to (real time) OLAP which is great for us, until we reach to 4 TB limit. Just like my beloved Python language, I like to think to Azure SQL as a solution “batteries included”.
What I’d like to underline here is that we’re now living in an era where, technology wise, technology is usually not the limiting factor. Humans and our limited knowledge is. So, what is really important now, is the balance between costs, features and simplicity of the overall solution. Technology should help us create solution that are as simple (both for developers and users) as possible. But no simpler.
And this clarifies the role of Solution/Data/Software Architectes especially well: find the best technical solution with the best benefits/costs ratio, now and in the near future. And make the solution modular so that in the distant future we can only change what needs to be changed in order to keep up with the new challenges.
Luckily we live in an era where the cloud is making it possible. We just have to evaluate and re-evaluate technologies without any preconceptions.