Incorta: Bringing Agility To The Star-Schema World

Mostafa Elganainy
Engineering@Incorta
16 min read · Jun 25, 2023

Incorta eliminates layers of complexity in the data warehousing industry. In this post, let's understand what it does and why it works. We will also shed light on how it eliminates the friction caused by static Star-Schemas and opens the door to an implicit, on-demand Star-Schema. Imagine a world where business users can maintain their own Star-Schemas in less than a day. Incorta enables the business user to innovate and operate an order of magnitude faster.

Let's start with the long-story-short summary, then dive into the details.

Here is a look at the overall time taken by each approach, both during initial setup and later during maintenance.

Initial Setup:

  • Incorta (3 weeks)
  • Traditional (22 weeks)
Initial Setup (Weeks): Incorta vs. Traditional

Maintenance / Change-Request:

  • Incorta (3 days)
  • Traditional (60 days)
Maintenance (Days): Incorta vs. Traditional

Now let's understand what it takes to work on each side, and how we came up with these estimates.

The Ask, Enterprise Transactions Analysis

Meet Michael, the Business Analyst, and John, the IT expert specializing in Data Warehousing. They’re working in a bustling enterprise, where transaction volumes reach the millions annually. It’s not uncommon for such enterprises to employ Oracle’s E-Business Suite (EBS), with its hundreds of operational tables. To navigate this labyrinth of data and understand the relationships between tables requires no small measure of dedication and a touch of artistry.

Compounding the complexity, each enterprise often tweaks the EBS schema to meet their unique needs, resulting in countless versions and making it impossible to create a one-size-fits-all model.

Returning to our heroes, Michael and John, the story often starts with Michael needing to answer some critical business questions. For instance, he might need to calculate Days Sales Outstanding (DSO), broken down by department and region.

Despite its apparent simplicity, this task isn't straightforward. We've chosen it as a simple example, but even so, the necessary data is scattered across multiple tables:

  • RA_CUSTOMER_TRX_ALL: This table stores information about invoices, debit memos, chargebacks, deposits, and guarantees.
  • RA_CUST_TRX_LINE_GL_DIST_ALL: This table holds the accounting distributions for transaction lines, tax lines, freight lines, and finance charges.
  • HZ_PARTIES and HZ_CUST_ACCOUNTS: These tables store information about customers.
  • FND_LOOKUP_VALUES: This table is used for lookup codes, which can include information like department and region, among other things.

For this very simple example, the query would look like:

SELECT l.lookup_code AS Department,
       c.customer_name AS Customer,
       t.trx_number AS InvoiceNumber,
       t.trx_date AS InvoiceDate,
       p.amount_applied AS PaymentAmount,
       p.apply_date AS PaymentDate,
       AVG(p.apply_date - t.trx_date) OVER (PARTITION BY l.lookup_code) AS DSO
FROM ar_payment_schedules_all p
JOIN ra_customer_trx_all t ON t.customer_trx_id = p.customer_trx_id
JOIN hz_cust_accounts c ON t.bill_to_customer_id = c.cust_account_id
JOIN fnd_lookup_values l ON c.attribute5 = l.lookup_code -- assuming department is stored in attribute5
WHERE l.lookup_type = 'YOUR_LOOKUP_TYPE' -- replace with your lookup type for department
  AND t.trx_date BETWEEN TO_DATE('2023-01-01', 'YYYY-MM-DD') AND TO_DATE('2023-12-31', 'YYYY-MM-DD');

Although this is the simplest possible example, it is extremely expensive in terms of resources if run directly against the operational database. For a typical medium-sized enterprise, we might have the following number of records for one year in each of the tables above (assuming 1M transactions per day):

  1. RA_CUSTOMER_TRX_ALL - 365 Million records
  2. RA_CUST_TRX_LINE_GL_DIST_ALL - About 1 Billion records (Assuming 3 lines per invoice).
  3. HZ_PARTIES and HZ_CUST_ACCOUNTS - Let's say you have 10,000 unique customers; these tables would only have around 10,000 records, maybe more if you count different versions of customer data.
  4. FND_LOOKUP_VALUES - This table would also not grow with transactions, but it's tied to the number of unique lookup values, and it's typically not a large table.
  5. AR_PAYMENT_SCHEDULES_ALL - 365 Million records per year.

This makes the analytical query catastrophically expensive, and it can easily bring the database to its knees.

And bear in mind that this was actually the simplest query. A medium one, with just a few more dimensions added to the previous KPI, grows quickly; a sketch follows below.
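As a rough illustration only (the join path through the customer-site and geography tables mirrors the one used later in this article, but the exact join conditions and the flexfield column are assumptions, not a verified EBS query), adding a Region dimension already looks like this:

-- Illustrative sketch only: join conditions and flexfield usage are assumptions
SELECT l.lookup_code AS Department,
       hzg.geography_name AS Region,
       AVG(p.apply_date - t.trx_date) AS DSO
FROM ar_payment_schedules_all p
JOIN ra_customer_trx_all t ON t.customer_trx_id = p.customer_trx_id
JOIN hz_cust_accounts c ON t.bill_to_customer_id = c.cust_account_id
JOIN hz_party_sites_all hzps ON c.party_id = hzps.party_id -- assumed path to the customer's site
JOIN hz_locations hzl ON hzps.location_id = hzl.location_id
JOIN hz_geographies hzg ON hzl.country = hzg.country
JOIN fnd_lookup_values l ON c.attribute5 = l.lookup_code -- assuming department is stored in attribute5
WHERE l.lookup_type = 'YOUR_LOOKUP_TYPE'
  AND t.trx_date BETWEEN TO_DATE('2023-01-01', 'YYYY-MM-DD') AND TO_DATE('2023-12-31', 'YYYY-MM-DD')
GROUP BY l.lookup_code, hzg.geography_name;

Three more joins for one extra dimension, and every additional breakdown keeps compounding.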

It’s clear how quickly this situation can become untenable. An average query from a tool like Tableau can result in multiple pages of code.

Given these challenges, it is clear why directly querying the operational database is not feasible. This brings us to the next logical stop in our journey: the Star Schema.

The Star-Schema

Our Data Warehouse specialist, John, is proficient in data modeling for analytical queries. His ability to craft effective data models, known in the industry as building a Star-Schema, is key to his data warehousing strategies.

So, what is this Star-Schema, and how does it solve the performance issues associated with complex queries on the operational database?

In essence, the Star-Schema is a data model designed for efficiency. It is specifically tailored to respond to certain business queries in the fastest and most efficient way possible.

To better understand, let's revisit our earlier example and delve deeper. Consider the diagram below, which depicts about 20 tables of interest and the interrelationships between them. It's worth noting that, while this is a simplified example, EBS tables in real scenarios often number in the hundreds.

High Level ERD, for part of the EBS schema

The business analysis phase marked the commencement of deliberate conversations and brainstorming between John and Michael. John, being well-versed in Data Warehousing, emphasized to Michael the importance of thoughtful planning as changes post-finalization would incur significant costs. After two weeks of in-depth discussions, they finalized the following KPIs as core requirements:

  • Days Sales Outstanding (DSO)
  • Number of Sales Transactions
  • Average Transaction Value
  • Sales By Region/Country
  • Sales By Sales Representative
  • Outstanding Receivables by Customer

Subsequent to these comprehensive discussions over a fortnight, John, along with his team, was able to discern the optimal set of required tables. These are highlighted in green in the subsequent diagram:

Selected Tables for the initial KPIs

After 6 weeks of intensive analysis, John successfully constructed a star schema. The duration may seem long; however, it's quite common in such circumstances. Simplifying the process, as demonstrated in this example, often veils the intricacies involved. Typically, the diagram above would comprise hundreds of tables, thousands of columns, and hundreds of joins. John's task was to delineate the minimal set of tables and select the best join paths with respect to performance when devising his Star-Schema, whether building a new one or modifying an existing one.

If you are already familiar with the Star-Schema, you can anticipate what John accomplished. He identified the following Fact and Dimension tables:

If you're not familiar with the Star-Schema, think of this as a simplified version of the originally selected 7 tables, optimized for queries. Compare the picture above to the original selection of tables:
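Since the diagrams aren't reproduced here, the following is a minimal DDL sketch of what such a star schema could look like. The table names (Fact_Sales, Dim_Customer, Dim_SalesRep, Dim_Location, Dim_Payment) come from John's design described later in the ETL section; the columns are illustrative assumptions.

-- Illustrative sketch only; column names are assumptions
CREATE TABLE Dim_Customer (
    customer_key   NUMBER PRIMARY KEY,   -- e.g. the source cust_account_id
    customer_name  VARCHAR2(240),
    department     VARCHAR2(80)
);

CREATE TABLE Dim_Location (
    location_key   NUMBER PRIMARY KEY,
    country        VARCHAR2(80),
    region         VARCHAR2(80)
);

CREATE TABLE Dim_SalesRep (
    sales_rep_key  NUMBER PRIMARY KEY,
    sales_rep_name VARCHAR2(240)
);

CREATE TABLE Dim_Payment (
    payment_key    NUMBER PRIMARY KEY,
    payment_status VARCHAR2(30),
    apply_date     DATE
);

CREATE TABLE Fact_Sales (
    invoice_number  VARCHAR2(30),
    invoice_date    DATE,
    amount          NUMBER,
    days_to_payment NUMBER,   -- pre-computed, so DSO becomes a simple AVG
    customer_key    NUMBER REFERENCES Dim_Customer (customer_key),
    location_key    NUMBER REFERENCES Dim_Location (location_key),
    sales_rep_key   NUMBER REFERENCES Dim_SalesRep (sales_rep_key),
    payment_key     NUMBER REFERENCES Dim_Payment (payment_key)
);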

Notice how the join path from RA_CUSTOMER_TRX_ALL -> HZ_GEOGRAPHIES, which had 4 join hops, got compressed to only one join between FACT_SALES -> DIM_LOCATIONS.

This simple join-path compression has a massive effect on the performance of the queries required for our KPIs.

Let's get back to our hero John. Now 10 weeks into the process, he presented this story and its results to the CIO, and it looked like a viable selection of joins and paths.

Note: Another alternative is called the Snowflake Schema; we will ignore it for the sake of simplicity, but in principle it's not far from the Star-Schema.

You might be wondering about the journey of data from the Raw Tables to the new Conceptual View (the Star-Schema). This is a topic worth billions in the ETL and Data pipelines industry. We will delve into it briefly in the next section.

The ETL Pipeline

Now, let’s focus on how John and his team can migrate the data stored in the Oracle EBS database to its new location. This is where the concept of a data pipeline comes into play. For the purpose of this article, we’ll assume that John is proficient with ODI (Oracle Data Integrator), which he uses to devise his solution. His solution will involve the well-known three stages of data management, Extract, Transform, and Load:

  1. Extract: During this phase, the relevant data is extracted from the source system (the raw Oracle EBS tables). It’s important to ensure that you’re pulling the correct data fields needed to populate your star schema. In ODI, this is typically done by creating a data store for each source table and a mapping to specify what data to extract.
  2. Transform: This step involves cleaning the data (handling missing values, inconsistencies, etc.), validating it (ensuring it meets certain criteria or rules), and transforming it into the format required by your star schema. This might involve aggregating data, converting data types, or creating new derived fields. In ODI, you would define these transformations in your mapping.
  3. Load: Finally, the cleaned and transformed data is loaded into the target system, which is your star schema (Fact_Sales, Dim_Customer, Dim_SalesRep, Dim_Location, Dim_Payment). In ODI, you would define a data store for each of your star schema tables and specify in your mapping how the source data maps to these target tables.
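To make the Load step concrete, here is a hedged sketch of the kind of SQL an ODI mapping would ultimately produce for the fact table. It reuses the star-schema sketch from earlier, and the dimension-key lookups are assumptions for illustration, not actual ODI output.

-- Illustrative only: a simplified load of Fact_Sales from the source tables.
-- In practice ODI generates and manages this; the dimension-key joins are assumed.
INSERT INTO Fact_Sales (invoice_number, invoice_date, amount,
                        days_to_payment, customer_key, location_key)
SELECT t.trx_number,
       t.trx_date,
       p.amount_applied,
       p.apply_date - t.trx_date,
       dc.customer_key,
       dl.location_key
FROM ra_customer_trx_all t
JOIN ar_payment_schedules_all p ON p.customer_trx_id = t.customer_trx_id
JOIN Dim_Customer dc ON dc.customer_key = t.bill_to_customer_id   -- assuming the key is the source account id
JOIN Dim_Location dl ON dl.location_key = t.ship_to_site_use_id   -- assumed mapping to a location key
WHERE t.trx_date >= TO_DATE('2023-01-01', 'YYYY-MM-DD');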

ODI may automate these steps, but it still requires careful planning and management. It’s vital to correctly define your mappings, transformations, and loading procedures to achieve accurate and reliable results in your star schema. Not only does John need to carefully consider the new requirements, but he also needs to make the most of existing pipelines and data stores to avoid unnecessary resource consumption.

Bear in mind that the schema of your data may change over time, and your ETL process should be designed with a certain degree of flexibility to adapt to these changes. Therefore, it’s neither a quick nor simple task. A rough estimate for this phase would be about 8 weeks.

At this point, 18 weeks have passed since John started on this project. It’s now garnering a lot of attention, with the business team frequently checking progress, perplexed as to why a seemingly straightforward process is taking so long.

Time for celebration? Well, not quite. The data pipeline is now in place and the new conceptual model is loaded with the eagerly anticipated data. However, despite John’s successful demonstration of some sample KPIs and insights from Tableau to the CFO and CIO — which was met with celebration — the business team was still not given access to the system. The reason is simple: the system is not yet designed for public use. Without a well-thought-out design, the data warehouse could literally crash under the weight of demands.

Let’s delve into the ‘why’ and the solution in the next section, the Data Mart…

The Data Mart

Unrestricted access to the Data Warehouse by the business team could potentially strain its resources. Remember, the Data Warehouse doesn’t serve just one team but thousands of users across the organization. Providing every business user with unhindered access to the data can have catastrophic consequences due to two primary reasons:

  • Performance Impact: If a large number of users are allowed to directly query the data warehouse, it can put a significant load on the system. Should a user execute a poorly optimized or overly complex query, it could slow down the system for all. This risk becomes particularly pronounced during peak business hours when the demand for the data warehouse is at its highest. In the worst-case scenarios, running multiple heavy-duty queries at once could even result in system crashes, causing downtime and possible data loss.
  • Data Governance and Security: Allowing every user access to all data in the data warehouse raises serious data governance and security issues. Certain data might be sensitive or confidential, and should only be accessible by specific individuals or teams. Moreover, without appropriate data governance, users may misinterpret or misuse the data, leading to incorrect conclusions or decisions.

To mitigate these concerns, organizations often establish a “Data Mart”. A Data Mart is essentially a smaller, specialized version of a data warehouse, specifically designed and optimized for a particular team or business function. By giving business teams access to their own Data Marts instead of the entire data warehouse, the performance of the data warehouse can be protected, and better control over data can be maintained.

Some of the key performance-related features of a data mart include:

  • Subsetting: Typically, only a subset of the data warehouse data is available.
  • Views: The overall load on the data warehouse can be reduced by offloading query activities to data marts.
  • Increased Concurrency: The needs of each user and query paths are separated.
  • Pre-Aggregated Data: Some data marts pre-compute and store commonly used calculations and aggregations, further easing the load on the data warehouse (see the sketch just after this list).
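As a hedged illustration of the pre-aggregation point, a finance data mart might expose a summary like the following; the view name and columns are assumptions for illustration, built on the star-schema sketch from earlier.

-- Illustrative only: a pre-aggregated summary the finance data mart could expose,
-- so dashboards never hit the detailed fact table directly.
CREATE MATERIALIZED VIEW finance_mart_dso_by_region_month AS
SELECT dl.region,
       TRUNC(f.invoice_date, 'MM') AS invoice_month,
       AVG(f.days_to_payment)      AS dso,
       COUNT(*)                    AS transaction_count,
       SUM(f.amount)               AS total_sales
FROM Fact_Sales f
JOIN Dim_Location dl ON dl.location_key = f.location_key
GROUP BY dl.region, TRUNC(f.invoice_date, 'MM');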

20 weeks into the process, it’s finally time to celebrate! The business users have been given access, and they start creating dashboards to see the long-awaited KPIs. They managed to complete the Dashboards within 2 weeks, and a department-wide launch and celebration is held. Everyone is thrilled at the business insights and progress achieved.

However, the inevitable moment of regret follows…

Lost in translation!

Just a few months into the process, the business team made a rather unsettling discovery.

Recall the KPI that the business team requested: Days Sales Outstanding (DSO). This financial metric calculates the average number of days it takes a company to collect payment after a sale has been made.
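In formula terms, and matching how the earlier SQL computed it over the transactions in a period (this is the operational definition used in this article, not the only accounting definition of DSO):

\mathrm{DSO} = \frac{1}{N}\sum_{i=1}^{N}\bigl(\mathrm{payment\_date}_i - \mathrm{invoice\_date}_i\bigr)

where N is the number of paid transactions in the period.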

When the business team initially requested this KPI, they thought it was enough to have the KPI itself. They didn't realize at the time the numerous ways this KPI could be broken down by associating it with different dimensions. They had requested DSO by Region and DSO by Department.

However, soon they identified a key KPI that was missing: DSO by Product. This seemed to them like a simple and straightforward request. But to Michael’s astonishment, when he broached the topic with John, he was met with an unexpected response. John merely smiled slightly and estimated that the task would take about 2–3 months. Taken aback, Michael asked why. In response, John took out a piece of paper and showed Michael the additional tables that needed to be added to the warehouse (marked in red):
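The red-marked diagram isn't reproduced here, but the gist is that product lives at the invoice-line level, which the original star schema never touched. Here is a hedged sketch of the extra join path the warehouse would now need; the line and item tables and their columns (for example mtl_system_items_b and segment1) are assumptions for illustration.

-- Illustrative only: product comes from invoice lines joined to an item master,
-- none of which existed in the original star schema.
SELECT t.trx_number,
       itm.segment1       AS product_code,   -- assumed item-number column
       tl.extended_amount AS line_amount
FROM ra_customer_trx_all t
JOIN ra_customer_trx_lines_all tl ON tl.customer_trx_id = t.customer_trx_id
JOIN mtl_system_items_b itm ON itm.inventory_item_id = tl.inventory_item_id
                           AND itm.organization_id = tl.warehouse_id;  -- assumed organization join

Every downstream artifact then has to change to carry the new product dimension, which is what the timeline below reflects.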

Then John wrote out the following rough timeline:

  • Check priorities with other teams, and IT director. And schedule this CR for execution [2–3 days]
  • Start Analysis [1 week]
  • Analysis and Design [1 month] (Check below for justification)
  • Star Schema Amendments [2–3 days] (Apply the needed changes to the star schema, adding the missing dimensions)
  • Data Extraction Change [2–3 days] (If the newly required tables are not already extracted, then we would need to add them, which includes writing and testing the new extraction logic changes)
  • ETL ODI Changes [1 week] (Adding the necessary indices, preparing for downtime, etc.)
  • Data Mart Changes [2–3 days] (Making the new changes available in the Data Marts, adding the new dimensions, etc.)
  • Reporting/ Dashboard Changes [2–3 days]
  • Testing/ Validation/ Documentation [1 week]

For those involved in data warehousing, the aforementioned story is far from fantasy. It’s a typical, everyday struggle. While it’s challenging for the IT team, it often proves even more exasperating and frustrating for the business team.

But there is hope. Let's see how the HR team in the same organization was able to move an order of magnitude faster than the finance team in the story above.

Unlocking agility with Incorta

Within the same organization, Mark, a business analyst for the People Organization, had a different narrative to share. When he met his friend Michael and heard his story, Mark couldn’t help but smile and shared his own tale about an efficient and speedy alternative. He recounted his experience with Incorta, explaining that the 2–3 month-long maintenance processes were reduced to merely 1–2 days.

Mark outlined the story as follows, (comparing each phase that Michael experienced with the Oracle stack, and describing how the IT team leveraged Incorta):

  • Business Analysis: At the outset of the project, the IT team didn’t seem overly insistent on getting every single detail of our requirements. They didn’t press us to remember every minor detail as they usually did, saying we would figure things out later whenever we mentioned seemingly insignificant points.
  • Data Extraction: Then, they disappeared for a week, only to re-emerge with data presented within Incorta’s UI!
  • Insights/Dashboards: Two weeks later, they presented us with dashboards that displayed our data.

In just over two weeks, the main tasks were completed. After that, they told us it was time to go through the details. Whenever we needed a model change, we would follow a process similar to this:

  • Business Analysis: Just a quick meeting where we would describe the new KPI or dimension we needed.
  • Logical Model Changes: At times, the change would be so quick that our IT counterpart would make the adjustments during the business analysis meeting itself. At most, he would request a couple more days to discuss with his team.
  • Deployment: Typically, the development phase would be completed in just a day or two, with the data then appearing on our test environment dashboards. After that, we would adhere to our standard internal process for deployment on the production environment.

In essence, change requests were fulfilled in just 2–3 days — a speed we had never seen before.

“But there must be a catch,” Michael questioned, unable to believe what he was hearing. In response, Mark showed him a video featuring Awny, Incorta's Chief Architect, explaining how Incorta operates and why it can deliver such unparalleled speed. (You can watch it here.)

Now, if you, like Michael, are skeptical about how Incorta can unleash the claimed productivity, keep going to understand why it works.

Dynamic Star-Schema, A Galaxy of a Thousand Stars

Before delving into why Incorta is effective, it’s essential to understand the mechanics and efficacy of the Star Schema.

The primary reason why a Data Warehouse is efficient lies in its analytics-friendly data model. This model primarily relies on compressing the join path, which significantly reduces the time and resources required to execute queries, thereby unlocking a myriad of possible optimizations.

This is because, traditionally, query processing traverses the join paths, loading the relevant data from the disk, starting from the lowest point. It then applies the necessary data joining, filtering, and processing, moving to the next join upward, until it reaches the final result. Therefore, the fewer joins and hops it must navigate, the quicker the process. To illustrate this, consider the following two examples:

SELECT ra.customer_id, hzca.account_name, hzps.party_id,
       hzl.location_id, hzg.geography_name, rct.trx_number,
       ril.line_type, ril.amount
FROM ra_customer_trx_all AS ra
JOIN hz_cust_accounts AS hzca ON ra.customer_id = hzca.customer_id
JOIN hz_party_sites_all AS hzps ON hzca.party_id = hzps.party_id
JOIN hz_locations AS hzl ON hzps.location_id = hzl.location_id
JOIN hz_geographies AS hzg ON hzl.country = hzg.country
JOIN ra_customer_trx_lines_all AS rct ON ra.customer_trx_id = rct.customer_trx_id
JOIN ra_invoice_lines_all AS ril ON rct.customer_trx_line_id = ril.customer_trx_line_id;

The query processor would do something like the following to execute this:

Query Plan

Note how many joins are executed here; we can't skip ahead or optimize easily without performing the joins first.

Now, with the star schema though, we would have a query like:
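(The original query screenshot isn't reproduced here; the following is a hedged reconstruction against the Fact_Sales and Dim_* tables sketched earlier, with illustrative column names.)

SELECT dc.customer_name,
       dl.region,
       f.invoice_number,
       f.amount
FROM Fact_Sales f
JOIN Dim_Customer dc ON dc.customer_key = f.customer_key
JOIN Dim_Location dl ON dl.location_key = f.location_key;

Two joins against narrow dimension tables instead of six hops across operational tables.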

The major distinguishing factor between executing queries in both cases lies in the star schema’s capability to join a smaller amount of data, given its potential for pre-aggregation, pruning, and optimization. This significantly reduces processing time.

While the star schema does offer massive advantages, the downside, as we’ve been discussing since the beginning of this article, is the considerable effort required to get it right. It necessitates time and skill to implement correctly, and maintaining it can be even more challenging.

This is precisely what Incorta eliminates. Incorta has developed a proprietary algorithm that automatically constructs a flexible join path that is always stored in memory. As a result, the developer doesn't need to deliberate over join paths. In fact, they don't have to consider them at all. Schema developers simply need to tell Incorta which table joins to which, and Incorta handles the rest at query time.

During a query, Incorta generates something similar to a star schema on-the-fly behind the scenes, depending on the required tables. This is made possible because it generates a composable index structure at data loading time. So, regardless of the join path demanded by the query, it can already handle it as a single compressed/composed join.

The outcome is incredibly fast queries without any need for manual intervention.

What is Incorta?

It's a modern & powerful system that significantly streamlines operations. Here's how it works:

  1. No Star-Schema: Incorta’s Direct Data Mapping eliminates the need for a star schema, which was traditionally a prerequisite in data warehouse design. Incorta’s potent ETL Engine does all the heavy lifting, working out the mapping between Oracle EBS tables and the business schema. It automatically generates the required mappings and presents the data in a readily queryable format, pulled directly from source tables.
  2. Open Standards: Incorta stores the data in Parquet, the most widely adopted big data format, and is a pioneer in adopting the DeltaLake standard. This makes it open for integration with other tools, and it even provides more direct integrations with selected tools such as BigQuery and Synapse. Parquet and DeltaLake are fast becoming the de facto standard storage layer for big data and AI technical stacks.
  3. Self-service: With Incorta, business users can create reports interactively on massive source data, a first in the world of Business Intelligence. Incorta’s customers are querying terabytes of data in sub-second times, a performance unmatched by any other tool in the market.
  4. Data-fidelity: Incorta hides no data from business users, making all data readily available with minimal impact on the operational database. There are no complex queries against the source: data is extracted and written efficiently to Parquet, and the speed of Incorta's proprietary engine delivers quick responses to business queries.
  5. Near-Realtime updates: Incorta can achieve a refresh rate as low as 5 minutes from data source to business dashboards, even with hundreds of millions of records. This is far from the several hours required by the leading alternative data pipeline solutions, marking a significant acceleration in processing speed.
  6. Cloud Friendly: Incorta supports native deployment with GCP and Azure, as well as GCS/ADLS support. Although it is a cloud-native solution, Incorta powers a significant base of on-premise customers. Notably, Incorta was the first to support ADLS gen2 datalake.

What’s next

Understand why & how Incorta works: https://incorta-2.wistia.com/medias/zhkh5fn7iv

Read more customer stories here: https://www.incorta.com/customers

Or even better, just sign up for a trial: cloud.incorta.com/signup


Mostafa Elganainy is a Senior Software Architect at Incorta; a former full-stack developer, now focused on performance optimization and machine learning.