When setting up an analytics system for a company or project, there is often the question of where data should live. There is no one-size-fits-all solution here, as your budget, the amount of data you have, and what performance you want will determine the feasible candidates. This post will go over the categories of databases available, and hopefully give you a rough map of the choices.
Your application database
The simplest option by far is to just use whatever is currently storing your application data.
Common Examples:
- PostgreSQL
- MySQL
- SQL Server
- MongoDB
Pros:
- It’s already up
- You don’t need to manage another database server
- You don’t need to deal with moving data around
- You don’t need to transform the data
Cons:
- Analytics workloads might slow down your application
- Scaling will become more difficult with two very different modes of use
- The data schema is often difficult to use for analytics
This is usually an intermediate stage for “real” applications, but for small internal applications, pre-launch products, or prototypes, it is a viable choice. Once you get ready to launch a consumer application, you’ll typically want to migrate to one of the more scalable options below.
A read-replica of your application database
If your main database supports read replicas, the next laziest thing you can do is create a read replica of your main database, perhaps set up another namespace with your third-party data or events, and call it a win.
Pros:
- You don’t need to manage another kind of database
- You don’t need to explicitly move data around
- You can scale analytics and transactional load independently
- You don’t need to transform the data
Cons:
- Database software that is optimized for transactional loads is usually suboptimal for analytics purposes
- The data schema is often difficult to use for analytics
- You need to manage another database server
Typically, once you start getting serious about analytics and your scale (both in data volume and complexity of analytical queries) increases, there are significant wins in performance that drive the move to a dedicated analytics database or “data warehouse” as it is often called.
A separate data warehouse running your “normal database”
If your scale doesn’t require running a database across many machines, you can get away with using the same database software you use for your application as a dedicated analytics data warehouse. This warehouse will often have different settings, be tuned differently, and involve actively reshaping the way your data is laid out in tables to make analytics queries faster and easier to write.
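As a toy illustration of that reshaping, here’s a sketch using Python’s built-in sqlite3 module; the users/orders schema is invented, standing in for your application’s normalized tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT, plan TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL, created_at TEXT);
INSERT INTO users  VALUES (1, 'Ada', 'pro'), (2, 'Grace', 'free');
INSERT INTO orders VALUES (1, 1, 9.99, '2016-05-01'), (2, 1, 19.99, '2016-05-15'),
                          (3, 2, 4.99, '2016-05-20');

-- Reshape the normalized app schema into one wide, analyst-friendly table,
-- so common questions don't require rediscovering the joins every time.
CREATE TABLE warehouse_orders AS
SELECT o.id AS order_id, o.created_at, o.total, u.name AS user_name, u.plan
FROM orders o JOIN users u ON u.id = o.user_id;
""")

# Analysts now query one flat table instead of a web of app tables.
print(conn.execute(
    "SELECT plan, SUM(total) FROM warehouse_orders GROUP BY plan").fetchall())
```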
Pros:
- You don’t need to manage another kind of database
- You can scale analytics and transactional load independently
- You can optimize the data model/schema to the analytical work you wish to perform
Cons:
- You need to manage another database server
- Database software that is optimized for transactional loads is usually suboptimal for analytics purposes
- You need to move data around
- You will typically need to transform data into a more useful format
- These databases are typically limited to a single node, which limits scalability
Typically, you can get away with this until around 10–100M rows depending on your data and desired analytics workloads. Once you reach a point where common queries are taking minutes or longer, you should evaluate options with more horsepower.
SQL based analytics databases
The main distinctions between “normal” database software and databases intended for heavy analytics workloads are parallelization and data format.
Transactional workloads typically consist of many small reads, writes, and updates, and for a given company they can usually live on a single machine for much longer than analytics workloads can. Analytics workloads, on the other hand, consist of less frequent read operations that touch much larger amounts of data. For example, a common transactional query is checking a given user’s last login time in order to display it to them; a common analytical query is counting all user logins over the last 3 months to create a graph.
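As a sketch of the two shapes, here are both queries run against a throwaway SQLite database; the logins table and its columns are invented for illustration:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, logged_in_at TEXT)")

# Seed a few months of fake logins for three users.
now = datetime(2016, 6, 1)
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [(uid, (now - timedelta(days=d)).isoformat())
     for uid in (1, 2, 3) for d in range(0, 120, 7)],
)

# Transactional shape: fetch one small answer about one user.
last_login = conn.execute(
    "SELECT MAX(logged_in_at) FROM logins WHERE user_id = ?", (1,)
).fetchone()[0]

# Analytical shape: scan a large slice of the table and aggregate it.
cutoff = (now - timedelta(days=90)).isoformat()
per_week = conn.execute(
    "SELECT strftime('%Y-%W', logged_in_at), COUNT(*) "
    "FROM logins WHERE logged_in_at >= ? GROUP BY 1 ORDER BY 1",
    (cutoff,),
).fetchall()

print(last_login, per_week[:3])
```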
The other main difference is that transactional database software typically stores data in row format. For example, let’s say we have a table with user records, where each record includes the user’s name, address, last login time, and date of birth. A typical transactional database will store all four of those fields in one unit, which lets it retrieve (or update) that record very quickly. Conversely, a typical analytical database will store all of the names together, all of the last login times together, and so on. This lets an operation like “what is the average age of our userbase?” ignore all of the data in the database other than the users’ dates of birth. By reducing the amount of data the database needs to scan, the columnar layout dramatically improves performance. The tradeoff is that analytics databases are typically much worse at transactional-style queries than databases specialized for that purpose.
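Here’s a minimal Python sketch of the two layouts. Real databases store bytes on disk, but a list of dicts versus a dict of lists captures why the columnar layout touches less data:

```python
# Row-oriented: each record stored together (good for point lookups/updates).
rows = [
    {"name": "Ada",   "address": "1 Elm St", "last_login": "2016-05-30", "dob_year": 1985},
    {"name": "Grace", "address": "2 Oak St", "last_login": "2016-05-29", "dob_year": 1991},
]

# Column-oriented: each field stored together (good for scans over one field).
columns = {
    "name":       ["Ada", "Grace"],
    "address":    ["1 Elm St", "2 Oak St"],
    "last_login": ["2016-05-30", "2016-05-29"],
    "dob_year":   [1985, 1991],
}

# "Average age of our userbase" only needs the date-of-birth column;
# a columnar store reads just that slice of the data.
avg_year_columnar = sum(columns["dob_year"]) / len(columns["dob_year"])

# A row store answering the same question must touch every whole record.
avg_year_rowwise = sum(r["dob_year"] for r in rows) / len(rows)

print(avg_year_columnar, avg_year_rowwise)
```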
Open source SQL based analytics databases
There are a number of open source options for analytics databases. These are typically based on PostgreSQL, a popular general-purpose database. In both of the examples below, the database started out closed source, built by a company, and was later open sourced.
Common Examples:
- Greenplum
- Citus
Pros:
- Free + Open Source
- Uses a PostgreSQL dialect, which is widely understood
- Scalable
Cons:
- You need to manage another database server(s)
- You need to host it yourself
- More complicated to tune than a single server
- Need to think about data partitioning
Hosted SQL based analytics database options
Two of the main Infrastructure-as-a-Service providers (Amazon and Microsoft) offer fully managed data warehouses. Google offers BigQuery, which technically isn’t SQL but has a very similar query language. These are often a great deal if you don’t have much database administration expertise in-house. Typically, companies will use the data warehouse offering from their main IaaS vendor, though there are increasingly cases of companies using a cloud data warehouse without moving any of their other computing to that provider.
Redshift
Given Amazon’s dominant position in the cloud/IaaS space, it’s not surprising that many people use its hosted data warehouse. It’s not as performant as some alternatives, and getting data into it is generally more of a pain than with other options (a minimal load sketch follows the pros and cons below). However, if you’re already on AWS, it’s generally the cheapest and easiest option overall.
Pros:
- Fully managed
- Pay as you go
- Uses a PostgreSQL dialect, which is widely understood
- Scalable
Cons:
- Getting data in is fairly complicated
- Need to think about data partitioning
- You need to manually scale up
- Network I/O cost if you don’t use AWS
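To illustrate the loading workflow the cons above mention, here’s a hedged sketch using psycopg2 and Redshift’s COPY command; the cluster endpoint, credentials, table, S3 bucket, and IAM role are all placeholders:

```python
import psycopg2

# All connection details below are placeholders for a real cluster.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="analyst", password="...",
)
with conn, conn.cursor() as cur:
    # Redshift ingests efficiently via COPY from S3, not row-by-row INSERTs.
    # The events table must already exist; bucket and role ARN are made up.
    cur.execute("""
        COPY events
        FROM 's3://example-bucket/events/2016-06-01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV
        GZIP;
    """)
```

In practice, this means your pipeline first has to stage data as files in S3, which is a large part of why getting data in is more complicated than a plain INSERT.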
Azure SQL Data Warehouse
Pros:
- Fully Managed
- Pay as you go
- Uses a SQL Server dialect of SQL
- Scalable
Cons:
- You need to manually scale up
- Network I/O cost if you don’t use Azure
Proprietary analytics databases
There are a variety of sophisticated (and expensive) database servers optimized for analytical workloads. If you’re reading this guide, chances are you’re not in the market for a 6–7 figure engagement with a database vendor.
Pros:
- Strong services component if you need help (and can pay)
- Long operational histories
- Experience with complicated deployments
Cons:
- You need to manage another database server(s)
- Expensive
- Typically very complicated to setup and administer
- Did we mention EXPENSIVE?
Common Examples:
- Teradata
- Vertica
- IBM Netezza
- Oracle Exadata
BigQuery
For a while, BigQuery (known internally and in the research literature as Dremel) was one of Google’s semi-secret weapons. Externally, it powered Google Analytics. These days, if you pay up for Google Analytics Premium, you can see all the raw data via BigQuery. It uses its own SQL-like dialect, and is absurdly fast. Rather than tying you to paying for hosts running software, it runs your queries on a fleet of machines you don’t need to think about, and charges you for the data you store and the data your queries scan.
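As a sketch of what that pricing model feels like, here’s a query issued through Google’s Python client library; the project and table names are invented, and the dry run reports how many bytes the query would scan before you pay anything:

```python
from google.cloud import bigquery

# Project and table names here are hypothetical.
client = bigquery.Client(project="example-project")

sql = """
    SELECT DATE(created_at) AS day, COUNT(*) AS logins
    FROM `example-project.app.logins`
    WHERE created_at >= TIMESTAMP('2016-03-01')
    GROUP BY day
    ORDER BY day
"""

# A dry run estimates bytes scanned, which is what BigQuery bills queries by.
dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
print(f"Would scan {dry.total_bytes_processed} bytes")

# Run it for real and iterate over the result rows.
for row in client.query(sql).result():
    print(row.day, row.logins)
```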
Pros:
- FAST
- Scales transparently
- Pay for storage and the queries you run rather than for hardware-hours
Cons:
- Getting data in is fairly complicated
- Less predictable pricing
Hadoop
In many ways, Hadoop is responsible for the buzz around Big Data. Originally built to power an open source web crawler, it was heavily adopted by Yahoo after they hired its author, Doug Cutting, in 2006. While many big database companies can claim to have run large database clusters before Hadoop, it nevertheless caused a sea change in how companies thought about large amounts of data. Being free, relatively low level, and flexible, it let many companies that were not ready to slap down tens of millions of dollars experiment with their growing datasets. Consumer internet companies, home to both huge data volumes and an affinity for open source solutions, dove in head first.
These days Hadoop has spawned an entire ecosystem around itself. While originally billed as NoSQL, the majority of applications now center around the various SQL-on-Hadoop projects.
Low level options
While initially used for data warehousing workloads, MapReduce and its probable successor, Spark, have mostly been phased out of end-user analytics. They still play large parts in data transformation, machine learning, and other data infrastructure and engineering work.
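To show how low level this gets, here’s a PySpark sketch answering “logins per user” as explicit map and reduce steps; the HDFS path and file layout are assumptions:

```python
from pyspark import SparkContext

sc = SparkContext("local", "login-counts")

# The one-line SQL "SELECT user_id, COUNT(*) ... GROUP BY user_id"
# becomes hand-written map and reduce steps over raw lines.
counts = (
    sc.textFile("hdfs:///logs/logins.csv")        # lines like "user_id,timestamp"
      .map(lambda line: (line.split(",")[0], 1))  # emit (user_id, 1) pairs
      .reduceByKey(lambda a, b: a + b)            # sum the 1s per user_id
)

for user_id, n in counts.take(10):
    print(user_id, n)
```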
Pros:
- Can scale to massive data sets
- Very flexible
Cons:
- Typically slow
- Can be complicated to operate
- Languages are very low level
- Generally not much tool support
SQL-on-Hadoop
Starting with the use of Hive at Facebook, SQL has become the primary way to analyze data on Hadoop clusters for analytics or business intelligence workloads. It allows companies to use standard tools and a lingua franca of analytics on massive amounts of data.
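As an illustration, here’s a sketch sending ordinary SQL from Python via the PyHive client; the library choice, host, port, and table names are all assumptions, and the cluster handles the distributed execution behind the scenes:

```python
from pyhive import hive  # one of several client options; an assumption here

# Host, port, and table are placeholders for an existing Hadoop cluster.
conn = hive.Connection(host="hadoop-master.example.com", port=10000)
cursor = conn.cursor()

# Plain SQL, even though the work fans out across the cluster.
cursor.execute("""
    SELECT plan, COUNT(*) AS users
    FROM warehouse.users
    GROUP BY plan
""")
print(cursor.fetchall())
```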
Common Projects:
Pros:
- Can scale to massive data sets
- Use common SQL dialects
- Decent tool support
- Can be fast
Cons:
- Requires running a Hadoop cluster
While Hive was the original SQL-on-Hadoop project, it has been eclipsed by the others in this list. In 2016, there’s little to no reason to start using Hive, but it is still in widespread use in some organizations.
Elasticsearch
Elasticsearch is more commonly used to power site search or log archival. However, its query language has also been used to power analytics applications. It shines at scale, and offers exceptional performance at the cost of a very significant management and administration footprint.
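For a taste of that query language, here’s a hedged sketch using the elasticsearch Python client; the cluster address, index, and field names are invented. Counting logins per day means composing a nested JSON aggregation rather than writing a GROUP BY:

```python
from elasticsearch import Elasticsearch

# Cluster address and index name are placeholders.
es = Elasticsearch(["http://localhost:9200"])

result = es.search(index="logins", body={
    "size": 0,  # we only want the aggregation, not matching documents
    "aggs": {
        "logins_per_day": {
            "date_histogram": {"field": "logged_in_at", "interval": "day"}
        }
    },
})

for bucket in result["aggregations"]["logins_per_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```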
Pros:
- FAST
- Strong ability to search your data
- If you use Elasticsearch for your application, you can also use it to power analytics
Cons:
- Slow ingestion
- Not very efficient in terms of disk storage
- Difficult query language that is optimized for search, not analytics
Crate.io
Crate offers a SQL layer on top of Elasticsearch. This combines the performance of Elasticsearch with the more widespread understanding of SQL. The actual SQL dialect is similar to MySQL or PostgreSQL but will require a bit of study by your analysts.
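As a rough sketch (the host, table, and exact client usage are assumptions), the same per-day login count from the Elasticsearch example above becomes plain SQL:

```python
from crate import client  # Crate's Python client; usage here is an assumption

conn = client.connect("http://crate-node.example.com:4200")  # placeholder host
cursor = conn.cursor()

# The JSON aggregation from the Elasticsearch example, as ordinary SQL
# over Elasticsearch-backed storage.
cursor.execute("""
    SELECT date_trunc('day', logged_in_at) AS day, COUNT(*)
    FROM logins
    GROUP BY day
    ORDER BY day
""")
print(cursor.fetchall())
```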
In-memory databases
There’s a long-running saying that “you don’t have a big data problem if your dataset fits into RAM.” If you can fit your database into RAM, and are willing to accept a less flexible set of queries than a full SQL system offers, performance with these database servers is lightning fast. RAM gets cheaper every year, and more and more companies are taking this route. An in-memory database is often used in combination with a big SQL or Hadoop cluster for batch or more complicated computation.
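A quick back-of-the-envelope calculation makes the “fits into RAM” question concrete; every number below is made up for illustration:

```python
# Rough sizing check: does the dataset plausibly fit in memory?
rows = 200_000_000     # e.g. event records
bytes_per_row = 100    # a few ints/timestamps plus a short string
overhead = 2           # indexes, fragmentation, working memory

needed_gb = rows * bytes_per_row * overhead / 1e9
print(f"~{needed_gb:.0f} GB of RAM needed")  # ~40 GB: one beefy box, not a cluster
```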
Pros:
- FAST
- FAST
- FAST
Cons:
- You need enough RAM to fit all your data
- Have their own query languages you’ll need to learn
- Can be tricky to administer
- Typically don’t offer good ways to query across multiple tables
Open source in-memory databases
Common examples:
SAP HANA
SAP offers an in-memory database, HANA, which has a fair bit of traction in larger enterprises. Like everything else from SAP (or really any other big Enterprise Database company), it’s expensive and probably going to require a small army of consultants to get going.
Originally published on Metabase