45+ best data layer perspectives

Nitin Khaitan
Towards Polyglot Architecture
18 min read · Oct 24, 2022


Designing the data layer from the schema, design, query, performance, scalability, load, and security perspectives
When you design the data layer from various perspectives, such as schema, query, performance, scalability, load, and security, you enhance your application's efficiency and strengthen its security posture.

The data layer, a pivotal backend application component, is crucial to its functionality and performance.

Implementing changes related to the database design, schema structure, or any other aspect of the application's data layer can be a significant investment, especially if required later in the application lifecycle. By adhering to best practices, we can ensure these changes are effective, cost-efficient, and will have minimal impact on the downstream systems, thereby optimising the use of resources.

The data layer should be designed with both current and future business needs in mind. It should follow the design principles, and changes should be introduced with caution, as they impact the applications built on top of it and the downstream systems that consume the data it produces.

Below is a list of best practices, from different perspectives, that we should follow to ensure that this layer is well designed.

Design perspective

In the design phase, we consider what database to use, how to use it, and how to design the database schema.

We consider the business need and the load it might have to cater to in the current and future scenarios and devise a solution accordingly.

Listed below are a few essential decision pointers:

SQL vs NoSQL

Deciding when to use SQL or NoSQL as a persistent data store can be tricky. Both have their respective pros and cons.

Listed below are a few critical factors on which to base the decision:

  • The nature and complexity of the domain object
  • The kind of data it is expected to hold
  • The read and the write actions we need to perform on the data
  • Whether it would be a read-heavy or write-heavy application
  • Do we require a structured and predefined schema or want it to be flexible?

With the introduction of multi-document transactions in MongoDB 4.0, MongoDB can also be considered for transaction-heavy applications.

MongoDB is not a good option if we need to write complex read queries spanning multiple documents; such queries can be written, but MongoDB would not be the optimal choice for that scenario.

e.g., MongoDB can be a good choice for logging, catalogue, configuration, etc., whereas MySQL can be a better choice for transactions, cashback, etc.
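
For illustration, below is a minimal sketch of a multi-document transaction using the Python pymongo driver; the connection string, database, and collection names are assumptions, and transactions require MongoDB to run as a replica set.

    from pymongo import MongoClient

    # Transactions require a replica set (or sharded cluster), even a single-node one.
    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    db = client["shop"]

    # Debit a cashback wallet and record the ledger entry atomically (hypothetical collections).
    with client.start_session() as session:
        with session.start_transaction():
            db.wallet.update_one(
                {"customer_id": 42}, {"$inc": {"balance": -100}}, session=session
            )
            db.wallet_ledger.insert_one(
                {"customer_id": 42, "amount": -100, "reason": "order-debit"}, session=session
            )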

The use case for ElasticSearch, Redis, GraphDB & BigQuery

We should consider using different database types, such as text-search (ElasticSearch), key-value (Redis), graph (Neo4j), and columnar (BigQuery) stores, as complementary databases in the application stack that empower the application to serve specific use cases optimally. However, they should not be considered replacements for the persistent store.

We should have the required supporting flows (event-based or batch-based) in the application to publish data to these databases, after massaging it as per the business/technical use case for which we plan to use them, and then let the different flows in the application stack use them.

e.g., the actual catalogue data should reside in MongoDB, and all transactional operations related to the catalogue should be performed against it. The resolved, denormalised document structure on which the text-based search should be performed should then be published to ElasticSearch following its semantics, and the user search should happen on ElasticSearch.

Redis or Aerospike can be used as key or session stores, though the user data should permanently reside in a transactional database.

Neo4j can create a network of users with relationships so that we can query the network as per the use case, but the actual user data should still permanently reside in the persistent store like SQL/NoSQL.

Similarly, all the data from the persistent store and external data source should be published to BigQuery, and the analytics workload can be performed on that for most of the use cases.

In all the use cases mentioned above, we should have a system to keep the data in sync between the persistent store and the supporting database. This can be done via events, cron jobs, data pipelines, manual API invocation, or a combination of these, as per the technical architecture in place.
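
As a rough sketch of the catalogue example above (Python, with hypothetical index, collection, and field names; the Elasticsearch client call may differ slightly between client versions), the event consumer that keeps the search store in sync could look like this:

    from pymongo import MongoClient
    from elasticsearch import Elasticsearch

    mongo = MongoClient("mongodb://localhost:27017")
    es = Elasticsearch("http://localhost:9200")

    def on_catalogue_updated(product_id):
        """Consume a 'catalogue updated' event and re-index the resolved document."""
        product = mongo["shop"]["catalogue"].find_one({"_id": product_id})
        # Massage the persistent document into a flattened, search-friendly shape.
        search_doc = {
            "name": product["name"],
            "brand": product.get("brand"),
            "categories": [c["name"] for c in product.get("categories", [])],
            "price": product["price"],
        }
        es.index(index="catalogue", id=str(product_id), document=search_doc)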

Changes in schema structure are expensive.

We should consider and design schema by evaluating current and future needs.

  • What columns should be part of one table versus what should be split into a new table?
  • How do we avoid unnecessary joins?
  • When should we consider data redundancy, and what will the system overhead be to maintain it? Is it required at all? What extra will we have to handle if we do so?
  • The column's data type should be specific to the business use case and not be provided as a placeholder, e.g., an integer value should not be saved in a column with a string data type.
  • The table and column names should be well thought out and aligned with the domain naming convention.
  • When creating a document in MongoDB, we should consider what data each document should own. For example, should the order be part of the customer document or an independent document?
  • We should consider what snapshot data (data that does not change once created) will be versus what data keeps changing its state. Schema design may vary based on this as well.

Changes in the schema design at a later stage are expensive and impact the application that uses it directly and the downstream system to which the data might flow. So, this should be designed with caution.

e.g., we might publish the data to BigQuery via a data pipeline for analytical use cases.

Security Perspective

Who can access what?

Each environment should have its own database. If we want to save cost while remaining security compliant, we should have the following:

  • one database for prod
  • another database with an environment-specific schema for non-prod
  • another database for load testing; the load-testing database has a configuration equivalent to prod and should be brought up only at the time of load testing. This way, any impact on the database due to load testing does not affect the non-prod workload, i.e., development or testing work.

We should not have a generic user ID or a common password to access the database. User access per database, schema, table, and permitted action should be governed at the following levels:

  • User level: A user should be associated with one or more groups, and the group should define what the users belonging to it can access and with what access rights.
  • Application level: One application/microservice might require access to one or more databases. The database user and its credentials per application should be different. A database user should be created per application level concerning the database/schema it must access.

We should think about accessibility proactively instead of reactively. We can also start with a constrained approach where its default is to deny access and provide access only wherever it is required.

Password rotation

We should periodically change the database password used by the application/microservice to prevent it from being compromised.

We should do it so that the application stack keeps running without failure and does not experience downtime due to this changeover.
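
One hedged sketch of the application side of this (Python, assuming AWS Secrets Manager via boto3; the secret name and JSON keys are hypothetical): cache the credentials, and on an authentication failure re-fetch the secret and reconnect, so rotation does not cause downtime.

    import json
    import boto3
    import pymysql

    secrets = boto3.client("secretsmanager")

    def get_db_credentials():
        # Secret name and JSON keys are hypothetical.
        value = secrets.get_secret_value(SecretId="prod/orders-service/mysql")
        return json.loads(value["SecretString"])

    def connect():
        creds = get_db_credentials()
        return pymysql.connect(host=creds["host"], user=creds["username"],
                               password=creds["password"], database=creds["dbname"])

    conn = connect()

    def query_with_retry(sql, params=()):
        global conn
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except pymysql.err.OperationalError:
            # Credentials may have rotated; fetch the new secret and reconnect once.
            conn = connect()
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()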

Constraints in accessing the database

The database should not be exposed on the public internet; this holds true for both production and non-production databases.

All the database instances should be in the private subnet. The only way a user can access it should be via VPN, whether the user is working from home or the office.

Database instance connectivity should be role-based for different AWS services like EC2, lambda, etc., and it should be configured so that the connection is established within AWS VPC without going to the public internet.

Data at rest

Data residing in the physical store should be encrypted so that even if someone gets access to the physical store, they should not be able to read any data from the file system.

This adds overhead while reading and writing data, but that overhead is minimal, and it is worth it to achieve data security.

Data in Motion

A client accessing data on the public internet

The database connection should always be secure, and we should use the SSL certificate and signed URL to connect to the database.

This helps us secure data transfer between the database and the client and avoid man-in-the-middle attacks on data while it is in motion on the public internet.
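
A minimal sketch of such a connection (Python with mysql-connector-python; the host and the CA bundle path are environment-specific assumptions):

    import mysql.connector

    # Encrypt the connection and verify the server certificate against the CA bundle.
    conn = mysql.connector.connect(
        host="db.example.com",
        user="orders_service",
        password="********",
        database="orders",
        ssl_ca="/etc/ssl/certs/rds-ca-bundle.pem",  # hypothetical path to the CA certificate
        ssl_verify_cert=True,
    )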

A client accessing data on the public internet via VPN

The VPN connection ensures data security in motion and blocks the database connectivity outside the VPN, but we can also have SSL-enabled database connectivity if we want.

Schema Perspective

Should we use the existing schema or create a new schema?

In a microservice architecture, we are expected to divide a large application into smaller microservices. While doing so, best practice suggests splitting the database and its schema as well.

For a new feature in development, we need to decide whether its tables should reside in an existing schema or in a new one.

If we decide to create a new schema, we should also consider if we would need a new database instance for this schema or if it should be part of an existing one.

This will depend upon the microservice design pattern and the expected load on the feature.

Design schema on paper

While designing the table(s)/document(s) for a feature, we should consider it iteratively from the business, data, and application perspectives. Consider the current use case and the future changes or enhancements that we expect to come.

The best way to do that is on a whiteboard or a piece of paper, and when it starts to take shape and looks clean, we should move it into the system by writing DDL scripts.

Generate UML

After the DDL scripts are executed and checked on the local database instance, we should generate the UML diagram for the tables to see how they are related. This will help us visualise the structure and the relationship between the tables/documents.

A refinement in the schema structure comes from doing this as well.

Test data to validate the structure.

We should always prepare test data for the tables/documents we are designing and run it through different business scenarios to validate that the structure handles all possible use cases as desired.

The best way to do this is to create the test data in Excel and manually insert them into the database.

Review session for DDL

The DDL and the test data created in the above steps should go through a round of review sessions by the business and the technical owner for the feature.

This step helps us catch corner cases and refine any concerns related to the structure.

Publish DDL changes via Automation tools

Database definition changes, or changes to configuration records in config tables, should be published via tools like:

  • for MySQL: Flyway or Liquibase
  • for MongoDB: Mongeez

We should never manually interact with the database to make changes. We should write the DDL scripts and the configuration changes in the tools mentioned above and then execute them per environment.

Metadata for the data

We should make a practice of creating metadata for the schema when creating it, and keep updating it as the schema definition changes.

We can keep that information along with the table/document/column so that the schema structure is self-documented.

This should have details about the purpose of the schema and its column, as well as related constraints.

Table Perspective

Normalisation of tables

The tables should be normalised and follow the design principles. If we introduce redundancy, it should be well thought out: introducing redundancy is easy, but it comes with the overhead of keeping the redundant data in sync.

A few scenarios for redundant data:

  • Snapshot tables (tables where we don't perform updates) can have redundant data, as the data state at the time of snapshot creation is what matters for that record. We can store extra redundant information to preserve the instance data and make read queries performant, e.g., the order table can store the customer's name along with it to avoid an extra call to fetch it.
  • In a microservice architecture, one schema might require a subset of specific data from another data source, so we can keep a local copy. However, it should be kept in sync with the primary data only if necessary, e.g., the customer's schema stores the customer's last payment and its status from the payment schema locally so that we can use it at the time of customer registration to see if they are a paid customer.

Audit Fields

We should always have audit fields in the table. A few audit fields are:

  • created_on
  • created_by
  • modified_on
  • modified_by

These fields record who created or last modified the record and when.

These fields don't capture the history of all changes, only who created the record and who last modified it. If a full history is required, we should create a <…>_history table and keep a log of all the changes, but the fields mentioned above should still be maintained in the main table.
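
A hedged illustration of the audit columns (MySQL DDL executed from Python; the table and the other columns are hypothetical):

    import pymysql

    DDL = """
    CREATE TABLE customer_order (
        id          CHAR(36)    NOT NULL PRIMARY KEY,
        status      TINYINT     NOT NULL,
        created_on  DATETIME    NOT NULL,
        created_by  VARCHAR(64) NOT NULL,
        modified_on DATETIME    NOT NULL,
        modified_by VARCHAR(64) NOT NULL
    )
    """

    conn = pymysql.connect(host="localhost", user="app", password="***", database="shop")
    with conn.cursor() as cur:
        cur.execute(DDL)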

Cascade delete

We should refrain from using this as it increases the chances of accidentally deleting rows that we would not have wanted to be deleted.

Instead of setting cascade delete in the table, we should handle the deletes via the code logic we would write in the application's data layer.

Soft delete

There are two ways to delete a row:

  • Hard delete: When we hard delete a row from a table or document, the data for that row is removed from the physical store and is no longer visible in the table view. We cannot retrieve these records later if we need to, e.g., we can use hard delete in master tables.
  • Soft delete: This is done by adding a column such as is_deleted to the table, which can hold a true or false value. The row remains in the physical store and is still visible in the table view. The default value for this field is false, and we set it to true only when, as per the code logic, we want to delete the record, e.g., we can use soft delete in transactional tables so that we can recover the data if a deletion is triggered by mistake.

For most use cases, we prefer soft delete, as we retain the data even when it is logically deleted. A soft delete flag in the table induces overhead but reduces the risk of data loss in case of an issue.
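
A minimal sketch of soft delete in the data layer (Python with pymysql; the table and column names are hypothetical):

    def soft_delete_order(conn, order_id, user):
        # Mark the row as deleted instead of removing it from the physical store.
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE customer_order SET is_deleted = TRUE, modified_on = NOW(), "
                "modified_by = %s WHERE id = %s",
                (user, order_id),
            )
        conn.commit()

    def list_orders(conn, customer_id):
        # Every read query must filter out soft-deleted rows.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, status FROM customer_order "
                "WHERE customer_id = %s AND is_deleted = FALSE",
                (customer_id,),
            )
            return cur.fetchall()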

Save the date in the UTC timezone.

MongoDB, by default, saves the date in the UTC timezone. Keeping the DateTime in UTC avoids problems related to timezone conversion and issues arising from daylight saving time.

We can achieve this for MySQL by setting the server time to UTC for the server on which the database is deployed.

The best approach is for the server to store the time in UTC and the client to display the date time after converting it to the client's local timezone.
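
A small sketch of this convention in Python (the client timezone Asia/Kolkata is just an example):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # Store: always capture the timestamp in UTC.
    created_on = datetime.now(timezone.utc)

    # Display: convert to the client's local timezone only at the presentation layer.
    local_time = created_on.astimezone(ZoneInfo("Asia/Kolkata"))
    print(created_on.isoformat(), "->", local_time.isoformat())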

Integer enumeration

Every database supports enumeration, and as application developers or database designers, we sometimes require fields that store a state, e.g., true/false, ToDo/In Progress/Done, etc.

Though we can define these columns to capture values from a predefined list of strings, the preferred way is to use integer values for the enumeration; the database engine handles numeric enums more efficiently than string-based enums.
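
For example, a small sketch of an integer-backed status in Python (the enum and its values are hypothetical):

    from enum import IntEnum

    class TaskStatus(IntEnum):
        TODO = 0
        IN_PROGRESS = 1
        DONE = 2

    # The column stores the integer value; the application works with the readable name.
    row_value = int(TaskStatus.IN_PROGRESS)   # 1, written to a TINYINT column
    status = TaskStatus(row_value)            # read back as TaskStatus.IN_PROGRESS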

Column names should be small and readable.

We should keep the column name short and understandable so anyone can understand its meaning. At the same time, keeping it short is essential as it will reduce the overhead of network bytes transfer.

We should avoid using special characters in the table/column name. Underscore should be used as a word separator.

The name of the table and the column should be singular and not plural.

Collation

A collation is a set of rules that defines how to compare, store and sort a string. It is dependent upon the charset and the locale.

We can apply a collation at different levels like:

  • table/document
  • column
  • query level

We should use collation cautiously as it also induces overhead on the database engine while executing any query.

Currency

We should always store currency values as integer or decimal fields per the requirement and never keep them as floating-point values.

Floating-point values can lead to a loss of precision.
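
A quick illustration in Python of the precision loss mentioned above:

    from decimal import Decimal

    # Floating point: three 10-paise amounts do not sum to exactly 0.30.
    print(0.1 + 0.1 + 0.1)                                       # 0.30000000000000004

    # Decimal, or an integer amount in the smallest currency unit, stays exact.
    print(Decimal("0.10") + Decimal("0.10") + Decimal("0.10"))   # 0.30
    print(10 + 10 + 10)                                          # 30 (paise stored as an integer)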

Constraints

We can apply different constraints to the table, such as a foreign key, unique key, data type, and data length.

These constraints are good, and we should always have them at the table and column level so that no direct entry into the database can go wrong. Still, the application layer should validate these constraints at the data entry level rather than relying on the repository layer to throw a database exception.

Single primary key

We should try to have a single primary key as compared to a composite primary key:

  • A composite primary key is slower, as a hash is created from multiple columns and then used as the primary key.
  • It requires more disk space.

Primary key

Should we keep the primary key as an auto-increment number or use a UUID?

Downsides of using an auto-increment number as the primary key:

  • Looking at the number, anyone can infer the number of transactions that have happened in the database (a security concern)
  • An auto-increment number becomes a bottleneck under high concurrency, as the database acquires a table-level lock to perform the increment and avoid two records getting the same primary key.
  • We can't set the value of the primary key in the query

If we use a UUID as the primary key value instead of an auto-increment number, all three problems mentioned above are addressed.

MongoDB implicitly generates a unique, non-sequential ObjectId as the document's primary key.
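
A minimal sketch of generating the key in the application layer (Python; the table shape is hypothetical):

    import uuid

    def new_order_row(customer_id):
        # The application generates the key, so it is known before the insert
        # and no sequential counter (and its lock) is involved.
        return {
            "id": str(uuid.uuid4()),   # stored, for example, in a CHAR(36) column
            "customer_id": customer_id,
            "status": 0,
        }

    print(new_order_row(42)["id"])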

Denormalised column

We can use the redundant column in specific tables to simplify queries and improve performance.

A few use cases for the redundant column:

  • Snapshot tables, where we only insert and select and don't expect to execute updates: in such tables, it is safe to use redundant columns, as they are expected to showcase the value at insert time, and they avoid unnecessary joins across tables.
  • In any other use case, if we introduce a redundant column, it brings the overhead of keeping that column updated whenever we update the value in the primary table.

Query Perspective

Which instance to Query

Event-based systems are designed to work in an async fashion. A few common problems with event-based systems are:

  • Events may fail and can be received later by the consumer after a few retries.
  • Events are received out of sequence. The order in which they were published and the order in which they were received may differ.
  • The same event is received more than once

Requirements around real-time reads should be served from the master instance, while near-real-time reads and downstream systems should rely upon either the read replica or the master instance.

The queries in the downstream system should be written considering the possibility that the data received is out of sync with the central system or not up to date regarding a transaction.

Index

Indexes are used to optimise select query performance and add overhead to insert/update queries. We should create indexes based on the query requirements.

We can have multiple indexes on the same table, and indexes can also be composite.

The direction of the column in the index plays an important role and helps make the query fast.

We should revisit the indexes on the participating table(s) whenever we alter existing queries or add a new query for the table.

We should run EXPLAIN on the query to see how it uses the indexes and whether it does so as expected. This insight will help us optimise the query further for better performance.
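
A hedged sketch of checking the plan (Python with pymysql; the table, columns, and filter values are hypothetical):

    import pymysql

    conn = pymysql.connect(host="localhost", user="app", password="***", database="shop")

    with conn.cursor() as cur:
        cur.execute(
            "EXPLAIN SELECT id, status FROM customer_order "
            "WHERE customer_id = %s AND created_on >= %s",
            (42, "2022-01-01"),
        )
        for row in cur.fetchall():
            # Inspect the 'key' and 'rows' columns to confirm the intended index is used.
            print(row)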

Paginate select queries

A query should always return a defined record set as a paginated result set.

There may be a few tables in the application where the client requires all the records to be returned at once, but such tables should be limited in size and few in number.

A limited result set leads to less network transfer and fewer scans in the physical store, which means better query performance and less load on the database.
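
A minimal sketch of a paginated read (Python with pymysql, using keyset pagination as one common option; the table and page size are assumptions):

    PAGE_SIZE = 50

    def fetch_orders_page(conn, customer_id, last_seen_id=None):
        # Keyset pagination: filter on the last id seen instead of using a large OFFSET.
        sql = (
            "SELECT id, status, created_on FROM customer_order "
            "WHERE customer_id = %s AND (%s IS NULL OR id > %s) "
            "ORDER BY id LIMIT %s"
        )
        with conn.cursor() as cur:
            cur.execute(sql, (customer_id, last_seen_id, last_seen_id, PAGE_SIZE))
            return cur.fetchall()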

Apply projection in select queries

A query should fetch limited columns in the result set.

This leads to less data transfer on the network and fewer scans in the physical store, which improves query performance and reduces the load on the database.

Use Upsert if possible.

While performing an upsert, the record is inserted if it does not exist and updated if it does. Wherever the business flow permits, we should use upsert.

It helps us save one extra hit on the database, reducing the load.

The downside of using upsert is that it hides whether an insert or an update has occurred. So, if the business layer's requirement is to know when we insert versus update, we should not use upsert.
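
Two hedged sketches of an upsert (Python; a MySQL statement via pymysql and a MongoDB call via pymongo, with hypothetical table/collection names):

    import pymysql
    from pymongo import MongoClient

    conn = pymysql.connect(host="localhost", user="app", password="***", database="shop")
    db = MongoClient("mongodb://localhost:27017")["shop"]

    # MySQL: insert the row, or update it if the unique key already exists.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO app_config (config_key, config_value) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE config_value = VALUES(config_value)",
            ("max_cart_items", "25"),
        )
    conn.commit()

    # MongoDB: update the document, creating it if it does not exist.
    db.app_config.update_one(
        {"config_key": "max_cart_items"},
        {"$set": {"config_value": "25"}},
        upsert=True,
    )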

Procedures

A procedure resides in the database layer.

As per design principles, the database layer should not know business logic; on the contrary, the procedure is where we write logical code.

A few use cases I would think of when we should use procedures are:

  • As part of refactoring work, when we must transform the existing data structure into a new one.
  • When changes in business logic require enhancing the current structure into a new one as a one-time activity.
  • When we migrate from an existing application to a new application and must run both in parallel, with both applications using different tables (old versus new tables for the same functionality) to serve the same flow on the client.

Load Perspective

Concurrent write

High concurrency occurs in scenarios like:

  • Increased load on the application due to user load
  • Multiple batch processes doing bulk writes on the same table.
  • a single batch process doing bulk writes via multiple threads

In case of high concurrency, the database can become a bottleneck and fall into table-level locks, as it has to generate the IDs in sequence.

In the case of MySQL, we should architect the system so that parallel inserts are avoided, as the ID (primary key) is usually expected to increment sequentially.

In MongoDB, the default IDs (ObjectIds) are not sequential, so it can support write concurrency at a much larger scale.

Query performance on load

A query's complexity, execution frequency, and concurrency impact database performance. At times, this impact may not be visible upfront but might become visible as the number of records in the table increases.

We should always plan for load testing around the business flow with 5x data and 5x load to keep this in control.

Load testing around the business flow will implicitly invoke all the underlying queries on the multiplied magnitude. This will help us generate insight into query performance and its relative impact on the overall flow.

Schema Instance Perspective

Read replica instance(s)

We should always have at least one read replica for each master in the production environment.

The read replica in MySQL and the replica set in MongoDB are the replica instances on which the data replicates from the master.

A master can have zero or many read replicas associated with it. Whenever we perform a write operation, such as an insert, update, or delete, the change is served by the master and then automatically replicated to the replica instances.

We can experience some replication delay before the data is reflected in the replica. The replication delay may grow under high concurrency, which can be controlled by tuning the machine's IOPS and compute together.

The application layer should route read calls to the read replica for scenarios where the client can fetch data in a near-real-time fashion. This way, we optimise the load on the master instance and keep it free for insert/update operations.
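
A rough sketch of this routing in the application's data layer (Python with pymysql; hostnames are hypothetical):

    import pymysql

    master = pymysql.connect(host="db-master.internal", user="app", password="***", database="shop")
    replica = pymysql.connect(host="db-replica.internal", user="app", password="***", database="shop")

    def run_read_query(sql, params=(), needs_realtime=False):
        # Strictly real-time reads go to the master; near-real-time reads go to the replica,
        # keeping the master free for insert/update operations.
        conn = master if needs_realtime else replica
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()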

Sharding

When the data volume starts to increase, we should create shards to distribute big tables across different instances from a write perspective, so that the write load is distributed evenly across multiple instances.

In this scenario, every shard should have at least two replicas to handle the read load and manage failover.

So, for example, if we are sharding a database that currently has:

one master and two replicas

and we would like to introduce three shards based on the growing data needs, the proposed structure would be:

three shards

two replicas per shard

3 * (1 master + 2 replicas) = 9 instances

Archival Database

An archive database is a database to which we move the records from the central database by purging them.

We do this to reduce the number of records in the central database so that queries perform better, as they will not have to scan unused data.

The tables in the database can be classified into three types based on the record usage patterns:

Tables that hold configuration data

The configuration-type tables usually hold the domain object's current value, based on which the application/feature works. These tables are generally exempt from archival, e.g., catalogue, campaign, etc.

Tables that hold transaction data

We should define the purge criteria per table as per the business requirement, e.g., order, payment, etc.

Tables that hold logs or historical records

We should define the purge criteria per log table as per the business requirement, e.g., order_log, action_log, etc.

We should have a script in place that should be configured to execute every day and should:

  • Purge the records from the primary database daily.
  • Take a backup of the purged records (in case a purge revert is required)
  • append the purged records to the archival database

This activity should be done in a way that should not introduce replication delay on the master and the replica sets.
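
A hedged sketch of such a daily job (Python with pymysql; the table, retention period, and batch size are assumptions):

    import pymysql

    RETENTION_DAYS = 180
    BATCH_SIZE = 5000   # small batches keep replication delay on the master under control

    main = pymysql.connect(host="db-master.internal", user="archiver", password="***", database="shop")
    archive = pymysql.connect(host="db-archive.internal", user="archiver", password="***", database="shop_archive")

    def purge_order_log_batch():
        with main.cursor() as src, archive.cursor() as dst:
            src.execute(
                "SELECT id, order_id, action, created_on FROM order_log "
                "WHERE created_on < NOW() - INTERVAL %s DAY LIMIT %s",
                (RETENTION_DAYS, BATCH_SIZE),
            )
            rows = src.fetchall()
            if not rows:
                return 0
            # Append to the archival database first, then delete from the primary.
            dst.executemany(
                "INSERT INTO order_log (id, order_id, action, created_on) VALUES (%s, %s, %s, %s)",
                rows,
            )
            archive.commit()
            src.executemany("DELETE FROM order_log WHERE id = %s", [(r[0],) for r in rows])
            main.commit()
            return len(rows)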

Entity Perspective

One entity per table

We should have one entity per table.

There should be one microservice for inserting and updating a table, and all other flows required to insert or update a particular table should rely upon this flow.

The patch calls should be part of this library as well.

Bean for reading can be different.

The same entity should be used if we are reading from a single table.

If we read from multiple tables via a complex query, it should be via the read database powered by CQRS.

The data layer should abstract the final data structure

The data layer in the application should always return the domain object or a subset of it. By doing this, we can reduce the impact of the structural changes induced as part of feature enhancement that occurs in the lifespan of the feature.

Default Value

Should we define the default value at the table or entity levels?

I prefer configuring the default value at the entity level as part of the business logic; it should not reside in the database.

Testing Perspective

Mock Data from production

We should have a process to create a copy of production data and then mask it: change PII data and remove/update critical key attributes so that no one can trace who the real user was or what their data was.

This activity intends to copy this data snapshot to different lower environments for testing and development-related work.
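
A small, hedged sketch of such a masking step (Python with pymysql; the columns and masking rules are assumptions):

    import hashlib
    import pymysql

    conn = pymysql.connect(host="staging-db.internal", user="app", password="***", database="shop_masked")

    def mask_email(email):
        # Deterministic but irreversible: the same user keeps the same fake address across tables.
        digest = hashlib.sha256(email.lower().encode()).hexdigest()[:12]
        return "user_" + digest + "@example.com"

    with conn.cursor() as cur:
        cur.execute("SELECT id, email FROM customer")
        for customer_id, email in cur.fetchall():
            cur.execute(
                "UPDATE customer SET email = %s, phone = NULL, full_name = %s WHERE id = %s",
                (mask_email(email), "Customer " + str(customer_id), customer_id),
            )
    conn.commit()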
