IoT considerations — storage and database, SQL, NoSQL, historical data
So, you’ve prototyped some hardware, designed a topology, got a rough data model, and laid out your backend infrastructure. But how are you going to store the vast amounts of data your IoT product generates?
How do you keep the current data? How do you deal with the large volumes of previous data? How do you query it efficiently? And if you’re storing files, where do you put them?
You need to start thinking about databases carefully early on. Let’s take a look at some of the different options.
Roll back ten or fifteen years and your database options were typically a little more straightforward. SQL was the de-facto standard, and generally speaking, databases had tables, tables had columns, and there were relationships between them. This was easy to develop in, although scaling and replicating data was tougher.
Microsoft SQL Server, Oracle, PostgreSQL or MySQL were the go to solutions of choice. SQL queries, joins and wheres made it quite straightforward to pull out or update any subset of data you might want.
And SQL databases are still a good option for many applications. But — database technology has changed a little, and there are alternatives which may fit your project better. Technologies which have different approaches, are easier to query, are less rigid, or are better suited to different types of data.
It would be remiss not to at least explore them!
“NoSQL” databases have existed for decades, but the moniker is relatively recent. “NoSQL” can variously be described as “Non SQL”, “Non relational” or “Not Only SQL”. They have surged in popularity with the Web 2.0 growth of modern technology companies liks Google, Facebook, Amazon.
Many of these big tech companies had needs in the last decade which were not well suited to existing relational databases, and their use of other types of database spurred the growth of NoSQL popularity.
NoSQL databases take several forms and have different advantages and disadvantages. Let’s take a look at some of them.
Key/value stores work very differently to relational databases — instead think of them more like a dictionary. A dictionary has entries for each word, and a piece of text (the explanation) associated with it. There are no duplicate entries, and you’re only storing a small amount of text for each entry.
Well key/value stores do just this — in them, you store an entry (E.g. store myName = Des). When you query back the myName entry, you get Des. Some allow you to store just strings (which you can get around by hashing or encoding JSON or XML before storage) while others let you store objects (but these are more like object DBs, see more below.)
Key/value stores are quick to query, but don’t support complex queries. It’s best to think of them as being useful as a fast place to store small snippets of data and get it back quickly.
In fact, some of the most popular key/value stores (e.g. Redis, MemCached) are created primarily for caching purposes. Others such as Amazon DynamoDB or Oracle NoSQL are designed more for persisting data.
Object or Document DBs
An object or document DB is a slightly more advanced form of key/value store, but with the ability to store complex data (objects). Popular document DBs include MongoDB, CouchDB and Couchbase.
These types of DB typically let you store JSON style data against a unique ID. For modern web applications, this is a big gain as it’s possible to pretty much push your in memory objects into a db and then pull them out again later.
They typically support nested JSON which is really useful, and some support complex relationships where other documents can be referenced or joined together when querying out.
There seems to be a lot of reluctance to use document DBs from some quarters on the web. It’s not clear where the foundation of this comes from — some point to years old (and now squished) bugs in MongoDB causing data loss, while others are simply too entrenched in the SQL type DB to see the benefits. Others complain about lack of structure (documents are typically schemaless) but this can be worked around by using good schema validation in backend/frontend code.
We’ve made a lot of use of document DBs in Lattice, and they are a great fit for representing complex real world objects in a JSON-like way. Others such as Bosch SI and the Eclipse Vorto project take a similar approach to representing the real world — so if you have the use case Document DBs can be an absolutely excellent option.
Graph databases take an alternative approach. There are distinct nodes in the database (like entities), and then you model the “edges” or relationships between them. Think of how social type data or relationships between things might be modeled and you’ll be along the right lines. Des like Cheese, Cheese is Yellow, Bananas are Yellow, Des eats Bananas, etc
In this case the nature of the relationships is where the true value lies. By analysing the relationships in a graph DB, you can begin to see bigger patterns more easily than would be possible in a standard “relational” database. For example, people who like cheese and eat bananas often buy a knife to cut them, so we should see who likes these and offer to sell them a knife.
For an IoT application — if you’re just connecting a simple sensor to the cloud and logging the data, a graph database is not well suited. But if you’re trying to analyse the relationships between many different sensors, the environment they are in, how people use them etc, then a graph DB is worth a look.
Popular Graph DBs include Neo4j, Titan and OrientDB.
One of the issues of a traditional relational DB is that while it works great for small workloads, it is not as well suited to data warehousing — that is, storing bulk amounts of data for analysis. This is because it quickly becomes computationally expensive and slow to query.
An alternative, which has been popularised by Google’s BigTable, Amazon RedShift and Postgre Columnar, is to store the data in columns instead of rows. This means adding a new piece of data is done by inserting a new column instead of a new row.
How is this different? Well, it’s possible to grab a row and immediately have all data of the same type in a format which is CSV like. Eg. the query for the sales value of all users would return something like 10,20,100,1000,30, etc — and this data is easy to pipe into something else for analysis. If you did this by querying data from a relational database, you’d have to transpose the data into an array or stream of CSV style data before working on it.
So — Columnar stores have some big speed advantages for analysis. They also often compress better as rows are composed of the same type of data.
There are many different types of NoSQL database available. They all have different strengths and weaknesses and all have different use cases. It’s worth looking at them all and see how they fit in your project.
Another consideration when trying to decide on data storage is the long term effects of storing the large volumes of data IoT can generate.
We could simply keep it all forever? True, but the more data you have the harder it will be to query in a fast way. And more data = more storage costs.
How about if we trim out the unnecessary things and keep a subset of the data? Yes — this would help ease the storage and query burden a little.
But what if you want to aggregate large volumes of data into smaller amounts (think average temperatures, min/max values, means etc)? This will still be computationally expensive. If you do it server side at time of use it means a bigger workload and more cost. If you do it client side it means transmitting a lot of raw data and then it is slow as you still need to do the aggregation on the client.
So — think of two options. 1. Pre-aggregate the data — but can you aggregate it every which way you need in advance? or 2. transform the data you need into something which is quicker to query — think columnar, or some way of storing time series data.
This one will have a big effect on how easy it is to scale your product. So make sure you think it through as it will have a major effect on how you approach the design of your storage and historical data.
I’m not going to delve deeply into file storage, only to point out that there are plenty of services for doing this.
For storing bulk files, services like Amazon S3 offer simple APIs, and charge by the amount of data stored and how frequently and quickly you need to access it.
For files which are regularly used, for example assets for your IoT dashboard, Documents or Videos, take a look at a CDN such as Cloudflare or MaxCDN. They can replicate your files to many different servers globally, and serve the user from a server which is close to them, reducing load time and improving performance.
Without a doubt, SQL databases are still the go to DB of choice for many applications. But the landscape is changing — and relational databases are not always the best fit.
Many cloud applications use more than one technology for storing data and NoSQL databases offer a better fit for many use cases — so it’s worth taking your time and thinking it through before you kick off your next IoT project!
Next time round, I’ll delve into the frontend.
Update — Part 8 is now available: Frontend and apps
Des Flynn is CTO of Lattice Research, who help companies to design, build, deploy, operate and service innovative and cost-effective IoT control systems to meet their customer’s needs. More information at www.lattice.ie