OLTP vs OLAP in Database Design
The deciding factor when choosing how to store our data is to ask ourselves a motivating question: "How should we organize and manage our data?" Some of the key aspects to define when organizing our data are the following:
- Schemas: how is the data organized?
- Normalization: how do we handle data dependencies and redundancies? (see the sketch after this list)
- Views: how often do users access certain information that requires joining tables?
- Access: how do we define user-level access to the data?
- DBMS: should we store the data in a SQL or NoSQL database?
- Read- or write-heavy: do we deal with high-throughput read or write transactions?
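To make the normalization point concrete, here is a minimal sketch using Python's built-in sqlite3 module; every table and column name is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: the customer's name is repeated on every order row,
    -- so renaming a customer means updating many rows (an update anomaly).
    CREATE TABLE orders_flat (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT,
        amount        REAL
    );

    -- Normalized: each fact is stored once and linked by a foreign key.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
""")
```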
OLTP and OLAP are approaches to processing our data; they define how the data flows and how it is structured and stored. It is often best to decide how to design our database based on a specific business case or requirement.
Let's look at some real-life example scenarios based on an e-commerce application:
OLTP scenarios:
- Find the price of an item
- Update the latest customer transaction
- Keep track of employee hours
OLAP scenarios:
- Calculate the items with the best profit margin
- Find the most loyal customers
- Decide employee of the month
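To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module; the tables and columns are invented stand-ins for a real e-commerce schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE transactions (item_id INTEGER, sale_price REAL, cost_price REAL);
    INSERT INTO items VALUES (1, 10.0), (2, 5.0);
    INSERT INTO transactions VALUES (1, 10.0, 6.0), (1, 10.0, 6.0), (2, 5.0, 4.5);
""")

# OLTP-style: a short, targeted read that touches a handful of rows.
price = conn.execute("SELECT price FROM items WHERE item_id = ?", (1,)).fetchone()

# OLAP-style: an aggregate that scans the data for decision making,
# e.g. finding the items with the best profit margin.
margins = conn.execute("""
    SELECT item_id, SUM(sale_price - cost_price) AS profit
    FROM transactions
    GROUP BY item_id
    ORDER BY profit DESC
""").fetchall()
print(price, margins)  # (10.0,) [(1, 8.0), (2, 0.5)]
```

Note how the OLTP query filters down to a single row, while the OLAP query reads the whole table; that difference in access pattern is exactly what each system is optimized for.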
As you can see, OLTP handles daily transactions and operations, while OLAP focuses more on business decision-making. It is important to set up your database correctly, because it affects how effectively you can use it in the future.
Working with different types of data structures
Most likely we will work with some or all of these three different data structures:
- Structured Data
- Semi-Structured Data
- Unstructured Data
OLTP is used to store real-time relational structured data, while OLAP is used to analyze structured data. The challenge arises when we have to store data beyond these traditional databases. If we have to deal with all kinds of data structures, we introduce data lakes, which can store data of any structure. This has significant benefits in terms of flexibility and scalability. The term "big data" gained popularity many years ago, especially for its capability to process large amounts of data in parallel.
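As a toy illustration (a local directory standing in for object storage; every path and payload is invented), the sketch below shows how a lake can hold all three structures side by side:

```python
import json
from pathlib import Path

# A local directory stands in for an object store addressed by key/path.
lake = Path("datalake")

# Structured: a CSV extract from a relational source.
orders = lake / "sales/2024/orders.csv"
orders.parent.mkdir(parents=True, exist_ok=True)
orders.write_text("order_id,amount\n1,9.99\n")

# Semi-structured: JSON events from an application log.
clicks = lake / "events/2024/clicks.json"
clicks.parent.mkdir(parents=True, exist_ok=True)
clicks.write_text(json.dumps({"user": 7, "page": "/home"}))

# Unstructured: raw bytes (images, audio, free text) stored as-is.
banner = lake / "images/banner.png"
banner.parent.mkdir(parents=True, exist_ok=True)
banner.write_bytes(b"\x89PNG\r\n")
```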
However, if we look at traditional data warehouse technology, it also has its benefits:
- it is optimized for analytics purposes
- data is organized for reading and aggregating
- it is usually read-only
- it combines data from multiple sources
- it uses Massively Parallel Processing (MPP)
- it leverages denormalized schemas and dimensional modelling
So the question arises: when do we need data lakes if we could also store unstructured data in traditional databases and warehouses? The answer is pretty simple: storing unstructured data in a traditional database is expensive, while data lake storage is much cheaper because it uses object storage as opposed to block or file storage.
Schema-on-read, schema-on-write
Data lakes are schema-on-read, which means the schema is applied as the data is being read. Warehouses are schema-on-write, because the schema is predefined before the data is written. We must also take into consideration that when building a data lake, a data catalog is crucial; otherwise the data lake can become a data swamp, because a data lake stores all types of data, from raw operational databases, IoT data, logs, and real-time streams to relational and non-relational data, and can grow to petabytes.
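Here is a small sketch of the difference using only Python's standard library; the sales table and the CSV payload are invented:

```python
import csv
import io
import sqlite3

# Schema-on-write: the table's schema is declared before any data arrives,
# and every insert is shaped to fit it.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
warehouse.execute("INSERT INTO sales VALUES (?, ?)", (1, 9.99))

# Schema-on-read: the lake keeps the raw bytes untouched; structure is
# imposed only at read time, and different readers may impose different ones.
raw = "order_id,amount\n1,9.99\n2,4.50\n"      # raw file as landed in the lake
rows = list(csv.DictReader(io.StringIO(raw)))  # schema applied while reading
print(rows)  # [{'order_id': '1', 'amount': '9.99'}, {'order_id': '2', 'amount': '4.50'}]
```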
Traditionally, we use warehouses if we want to perform any analysis on our dataset, but with big data gaining popularity, many teams run big data analytics using Spark and Hadoop. If you or your data science team want to perform deep learning or data discovery, this can be very beneficial, since those processes require a lot of data and sometimes the data doesn't need to be cleaned first. This is the main concept of ELT (Extract, Load, Transform), where the data is first loaded into the data lake and then transformed for different purposes (see the sketch after this list), such as:
- Building a data warehouse
- Visualization using BI tools
- Further analytics
- Exporting to another system, e.g. reporting tools
- Deep learning
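And here is the minimal ELT sketch referenced above: the raw extract is landed untouched, and cleaning happens later, per consumer. The file contents and table names are invented:

```python
import csv
import io
import sqlite3

# Extract + Load: the raw extract lands in the lake without any cleaning.
raw_extract = "order_id,amount\n1,9.99\n2,not_a_number\n3,4.50\n"

# Transform (later, per use case): clean and shape only what this consumer
# needs, e.g. loading the valid rows into a warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
for row in csv.DictReader(io.StringIO(raw_extract)):
    try:
        warehouse.execute(
            "INSERT INTO orders VALUES (?, ?)",
            (int(row["order_id"]), float(row["amount"])),
        )
    except ValueError:
        pass  # skip rows that fail this consumer's cleaning rules

print(warehouse.execute("SELECT * FROM orders").fetchall())  # [(1, 9.99), (3, 4.5)]
```

The key point is the ordering: in ETL the bad row would have to be handled before anything is stored, while in ELT the raw extract stays available for other consumers with different cleaning rules.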
Key deciding factors in designing a database
When designing a database, it is crucial to consider the following:
- define how to store the data logically, and whether the data will mostly be read or frequently updated
- define the database model: relational, NoSQL, or object-oriented
- define the schema, which is the foundation of the database blueprint (tables, fields, relationships, indexes, and views) and must be treated as a first-class citizen when inserting data into the database (see the sketch after this list)
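As a sketch of such a blueprint, the hypothetical schema below includes each element named above: tables, fields, a relationship, an index, and a view:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Tables and fields
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        price      REAL NOT NULL
    );
    CREATE TABLE order_lines (
        order_id   INTEGER,
        product_id INTEGER REFERENCES products(product_id),  -- relationship
        quantity   INTEGER NOT NULL
    );

    -- Index to speed up lookups on the foreign key
    CREATE INDEX idx_order_lines_product ON order_lines(product_id);

    -- View exposing a commonly joined result
    CREATE VIEW order_revenue AS
    SELECT ol.order_id, SUM(ol.quantity * p.price) AS revenue
    FROM order_lines ol JOIN products p USING (product_id)
    GROUP BY ol.order_id;
""")
```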
Dimensional modelling, which is the adaptation of the relational model for data warehouse design, has several characteristics relevant when deciding between OLAP and OLTP:
- it is built purposely for OLAP queries, which aggregate data rather than update it as in OLTP
- it is based on the star schema
Let’s discuss the background a bit.
Dimensional modelling consists of two types of tables: fact tables and dimension tables.
Fact tables:
- are defined by use cases based on business requirements
- store metrics
- are updated regularly
- are connected to dimension tables via foreign keys
Dimension tables:
- consist of attributes
- don't change often
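A minimal star-schema sketch, with invented table and column names, ties the two together:

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
    -- Dimension tables: descriptive attributes that change rarely.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        name         TEXT,
        city         TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        day      INTEGER,
        month    INTEGER,
        year     INTEGER
    );

    -- Fact table: metrics, linked to the dimensions by foreign keys.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        quantity     INTEGER,  -- metric
        amount       REAL      -- metric
    );
""")

# A typical OLAP query over the star: aggregate the metrics in the fact
# table, sliced by attributes from a dimension table.
monthly_revenue = dw.execute("""
    SELECT d.year, d.month, SUM(f.amount) AS revenue
    FROM fact_sales f JOIN dim_date d USING (date_key)
    GROUP BY d.year, d.month
""").fetchall()
```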
How we design and store data in fact and dimension tables, and the relationships between them, is, as previously stated, based on your business requirements and use case. Whether your business requires a database to hold transaction data or a data warehouse for the sales team to generate monthly sales reports, this factor defines whether you need OLTP or OLAP. Choosing correctly at the beginning will save you time and headaches in the future.