The Database Modeling Dilemma

Arik Gortsunian
Intel Granulate

--

Structured vs Unstructured

Fast-growing companies navigating the early decisions of database architecture with Postgres RDS face the following pivotal choice:

Embrace the speed and flexibility of unstructured JSON fields for rapid development

or

Invest in the long-term benefits of structured database modeling for scalability and performance

As a Data Engineer at Intel Granulate, I’m often tasked with modeling data structures that can withstand rapid scaling without compromising performance.

The Appeal of JSON for Rapid Development

While JSON fields accelerate initial product launches with their schema flexibility, they can complicate future scalability with increased query complexity and potential performance bottlenecks.

Structured modeling, by contrast, demands an upfront time investment, but it lays a solid foundation for efficient data handling, ensuring the application’s scalability and maintainability. This approach minimizes technical debt and future refactoring costs, making it a strategic choice for companies aiming for sustainable growth in a competitive landscape.

Moreover, using JSON fields reduces the complexity of database design, with fewer tables and relationships to manage, thereby accelerating the development process. This approach allows teams to focus on innovation and product iteration, pushing projects forward at a pace that structured schemas might otherwise hinder.
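As a sketch of that flexibility (field names and values are illustrative), an entire order can be captured in a single row of a single table, with no upfront schema design for its nested details:

```sql
-- One table, no joins: the whole order lands in a single row,
-- with its details stored as JSONB documents.
INSERT INTO orders (order_details, sales_person_details, pricing_details)
VALUES (
    '{"product": "License A", "quantity": 3}',
    '{"name": "Dana", "email": "dana@example.com"}',
    '{"base_price": 99.90, "discount": 10, "final_price": 89.91}'
);
```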

The Long-Term Impact of Unstructured Data

The extensive use of unstructured JSON fields in Postgres RDS can lead to performance bottlenecks over time. The need for real-time JSON parsing increases CPU usage, impacting the overall efficiency of the database system. Additionally, this approach can strain PostgreSQL’s I/O capabilities, as data retrieval becomes more complex with scale.

The lack of structured modeling complicates query optimization, resulting in slower response times. As the application grows, these scalability issues become more pronounced, making it challenging to adapt and maintain the database efficiently.
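To make that cost concrete, a filter like the following (the key name is hypothetical) must parse the JSONB value of every row it touches, since no ordinary column index covers data buried inside the document:

```sql
-- Without a targeted index, this filter forces a sequential scan
-- plus a JSONB lookup per row.
SELECT order_id
FROM orders
WHERE order_details ->> 'product' = 'License A';
```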

Proper Modeling: The 80/20 Rule

Investing early in structured database modeling, guided by principles from Kimball or Inmon, acts like applying the Pareto principle to the data maturity journey, ensuring minimal initial efforts yield maximum long-term scalability and efficiency in Postgres RDS.

This foundational work not only streamlines current data management but also accelerates an organization’s progression through data maturity levels, from basic storage to advanced analytics, setting a solid groundwork for future data-driven innovations.

Real-World Examples

In a scenario with an orders table structured as follows:

erDiagram
Orders {
bigint order_id PK
timestamp timestamp
jsonb order_details
jsonb sales_person_details
jsonb pricing_details
}
CREATE TABLE orders (
order_id bigserial PRIMARY KEY,
timestamp timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
order_details jsonb,
sales_person_details jsonb,
pricing_details jsonb
);

The complexity escalates as the data grows.

Without structured indexing, queries become increasingly slow, necessitating full table scans to access order details, salesperson information, or pricing data. This design leads to significant performance issues, including slow query response times and increased load on the database server due to the heavy processing required for parsing large JSON blobs on each query.
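For example, a routine report such as revenue per salesperson (the key names are assumed for illustration) has to unpack and cast multiple JSONB columns for every row in the table:

```sql
-- Every row's JSONB is parsed and cast at query time;
-- none of these expressions can use a plain column index.
SELECT sales_person_details ->> 'email' AS sales_person,
       SUM((pricing_details ->> 'final_price')::numeric) AS revenue
FROM orders
GROUP BY sales_person_details ->> 'email';
```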

A more efficient design would normalize the data by splitting order_details, sales_person_details, and pricing_details into separate tables, each linked to the main orders table via foreign keys.

This structure allows for the use of indexes on key columns, significantly improving query performance by enabling quick lookups and reducing the need for full table scans.

CREATE TABLE departments (
id bigserial PRIMARY KEY,
name text NOT NULL,
manager text NOT NULL
);

CREATE TABLE sales_person_details (
sales_person_id bigserial PRIMARY KEY,
department bigint REFERENCES departments(id),
email text NOT NULL,
private_name text NOT NULL,
surname text NOT NULL
);

CREATE TYPE category_types AS ENUM ('Group sale', 'VIP', 'Promotion2048', 'Occasional');

CREATE TABLE orders (
order_id bigserial PRIMARY KEY,
name text NOT NULL,
category category_types NOT NULL,
timestamp timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL
);

CREATE TABLE products (
id bigserial PRIMARY KEY,
product_name text NOT NULL,
product_description text,
product_category text NOT NULL
);

CREATE TABLE pricing_details (
pricing_id bigserial PRIMARY KEY,
product_id bigint REFERENCES products(id),
base_price decimal(10, 2),
discount decimal(10, 2),
final_price decimal(10, 2),
details jsonb
);

CREATE TABLE order_details (
detail_id bigserial PRIMARY KEY,
order_id bigint NOT NULL REFERENCES orders(order_id),
product_id bigint NOT NULL REFERENCES products(id),
pricing_id bigint NOT NULL REFERENCES pricing_details(pricing_id),
sales_person_id bigint NOT NULL REFERENCES sales_person_details(sales_person_id),
quantity bigint NOT NULL,
order_discount bigint,
timestamp timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL
);

This approach not only simplifies data retrieval and updates but also lays a solid foundation for scaling the database as the application grows, ensuring that the system remains responsive and efficient.
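Against the normalized schema above, the salesperson-revenue report becomes a set of indexable joins. A sketch, assuming the foreign-key columns are also indexed (Postgres does not create those indexes automatically):

```sql
-- Helpful indexes on the join columns (not created automatically):
CREATE INDEX ON order_details (sales_person_id);
CREATE INDEX ON order_details (pricing_id);

-- The same report, now served by index scans and plain column reads.
SELECT sp.email,
       SUM(pd.final_price * od.quantity) AS revenue
FROM order_details od
JOIN sales_person_details sp ON sp.sales_person_id = od.sales_person_id
JOIN pricing_details pd ON pd.pricing_id = od.pricing_id
GROUP BY sp.email;
```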

From an analyst’s perspective, the goal is to design a primary table that supports 95% of queries efficiently, containing all relevant data for fast access. Incorporating structured and controlled JSON fields within this table offers a balanced approach, combining the speed and simplicity of flat data structures with the flexibility of JSON for dynamic content.

This strategy ensures that the majority of analytical queries remain performant and data-rich, while still accommodating the evolving needs of the business with minimal impact on query performance.
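One way to sketch that balance (the table and column names here are illustrative, not from a real schema): a flat reporting table carries the columns that serve the vast majority of queries, with a single constrained JSONB column reserved for dynamic attributes:

```sql
CREATE TABLE order_analytics (
    order_id bigint PRIMARY KEY,
    category text NOT NULL,
    sales_person_email text NOT NULL,
    final_price numeric(10, 2) NOT NULL,
    ordered_at timestamp NOT NULL,
    -- Controlled, well-defined extras only; not a dumping ground.
    extra_attributes jsonb DEFAULT '{}'::jsonb
);
```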

JSON Fields: They Won’t Bite When Used Correctly

Data normalization in Postgres significantly boosts system performance and scalability. While discussions often center around JSON fields for their flexibility, it’s the transition to JSONB data types that marks a strategic pivot for storing configuration or semi-structured data efficiently.

JSONB, with its binary format, not only retains the advantages of JSON fields but also enhances them by supporting GIN indexes. This allows for efficient searching and querying of nested data, as described in PostgreSQL’s documentation.

Emphasizing wise use, JSONB data should be kept to well-defined structures with predictable usage patterns to fully exploit PostgreSQL’s robust indexing and fast data retrieval features, ensuring optimal performance.

-- For bitmap index scans over the whole JSONB column
CREATE INDEX ON pricing_details USING gin (details);

-- For B-tree index scans on a specific key inside the JSONB column
CREATE INDEX ON pricing_details USING btree ((details ->> 'marketing_strategy_cat'));
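Queries shaped like the following can then use those indexes (the key and value are placeholders): a containment match with @> is a candidate for the GIN index, while equality on the extracted key is a candidate for the expression B-tree index:

```sql
-- Containment query: candidate for the GIN index (bitmap index scan).
SELECT pricing_id
FROM pricing_details
WHERE details @> '{"marketing_strategy_cat": "seasonal"}';

-- Equality on the extracted key: candidate for the B-tree expression index.
SELECT pricing_id
FROM pricing_details
WHERE details ->> 'marketing_strategy_cat' = 'seasonal';
```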

The Path to Scalable, Maintainable Databases

Embracing thoughtful structured modeling and judicious use of JSONB fields is fundamental to crafting scalable, maintainable databases in Postgres RDS. Proper modeling, diligent management, and strategic partitioning are the cornerstones of future-proofing your database, minimizing the need for expensive overhauls.

Remember, while the allure of rapid development is strong, the longevity and health of your database shouldn’t be compromised.

Defining a well-structured JSON that includes an array of unknown values is like expecting a cat to walk on a leash — it’s technically possible, but prepare for some unpredictable outcomes!

#intel.granulate_engineering
