<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Leah Nguyen on Medium]]></title>
        <description><![CDATA[Stories by Leah Nguyen on Medium]]></description>
        <link>https://medium.com/@ndleah?source=rss-7ee083e5e515------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*7AlVKpuAYuOy7wD9LLwYTg.png</url>
            <title>Stories by Leah Nguyen on Medium</title>
            <link>https://medium.com/@ndleah?source=rss-7ee083e5e515------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 22:57:28 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ndleah/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[How We Built the AWS Data & Analytics Platform (Part 1)]]></title>
            <link>https://blog.dataengineerthings.org/how-we-built-the-aws-data-analytics-platform-part-1-b4c798d17094?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/b4c798d17094</guid>
            <category><![CDATA[sap]]></category>
            <category><![CDATA[data-engineer]]></category>
            <category><![CDATA[amazon-web-services]]></category>
            <category><![CDATA[data-lake]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Fri, 11 Jul 2025 11:14:04 GMT</pubDate>
            <atom:updated>2025-07-11T23:51:52.928Z</atom:updated>
            <content:encoded><![CDATA[<h4><strong>The story of our first data platform capability</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ddQPWTi9PtM-5jGq" /><figcaption>Me and my super duper awesome aka smart data team at AWS Summit Sydney 2025</figcaption></figure><p>At my current company, one of Australia’s leading beverage companies, data is at the heart of nearly everything we do. In fact, about 80% of all the data we handle comes directly from SAP systems. These SAP sources aren’t limited to just one system; we’re talking about multiple interconnected subsystems like SAP S4/HANA for core transactions, Cloud for Customer (C4C) for CRM and sales data, Integrated Business Planning (IBP) for forecasting and supply chain, SAP Concur for travel and expense management, and <strong>more</strong>. Each subsystem generates critical insights, but they also present unique integration challenges.</p><p>Here’s the story of how we built a scalable, AWS-based Data &amp; Analytics (D&amp;A) platform from scratch, the architectural decisions we made, and how it all comes together behind the scenes.</p><h3>Why is a unified data platform important for us?</h3><p>A few years back, our analytics setup relied primarily on a business warehouse hosted within the SAP environment, which was excellent for structured, SAP-only data but struggled significantly when faced with broader integration requirements. The limitations became clear as we started receiving increasing volumes of data from non-SAP sources:</p><ul><li>Third-party vendor market research data,</li><li>Static files via SharePoint,</li><li>And more</li></ul><p>We needed a solution flexible enough to seamlessly handle structured, semi-structured, and unstructured data alike, whether sourced from SAP or elsewhere.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cIIV3ynHbZgKh3jr1SYpFw.png" /></figure><h3>Designing the right architecture</h3><p>Our initial considerations revolved around two central challenges:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/0*sJAN7FFHlIy6XWbv" /><figcaption>credit @<a href="https://dataedo.com/@piotr-kononow">Piotr Kononow</a></figcaption></figure><h4>#1 - How could we reliably ingest data from diverse sources into a centralized repository?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*w-Sv2y-eVakhmSCz" /><figcaption>credit @<a href="https://dataedo.com/@piotr-kononow">Piotr Kononow</a></figcaption></figure><p>As I mentioned from the get-go, our team deals with SAP and non-SAP data alike, and every source comes with its own ingestion mechanism.</p><blockquote>For instance, within the SAP landscape alone, we pull transactional data from SAP <strong>S4/HANA</strong> and CRM data from SAP Cloud for Customer (<strong>C4C</strong>) using OData services, SAP’s API-flavoured mechanism for connecting securely to other systems. But that’s just part of the story!</blockquote><blockquote>SAP Concur, another key SAP subsystem we rely on for travel and expense management data, doesn’t send data via API at all. Instead, we pull encrypted files directly from an SFTP server.</blockquote><blockquote>On top of that, external vendors push their market data directly to our platform. Other teams might share data with us through static files uploaded to SharePoint, while yet others store their data in simple CSV files hosted on third-party SFTP servers or send them to us via email.</blockquote>
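<p><em>To make the pull-based pattern above a little more concrete, here is a minimal Python sketch of an SFTP-style ingestion step. It is purely illustrative: the paramiko/boto3 pairing, host, paths, and bucket names are placeholder assumptions of mine, not a description of our actual production jobs.</em></p><pre>
# Minimal sketch of a pull-based ingestion path (SFTP -> S3 data lake).
# Host, credentials, paths, and bucket names are hypothetical placeholders.
import boto3
import paramiko

SFTP_HOST = "sftp.example-vendor.com"
SFTP_USER = "ingest_user"
SFTP_KEY_PATH = "/tmp/ingest_key.pem"
REMOTE_PATH = "/outbound/expenses_export.pgp"
BUCKET = "example-data-lake"
S3_KEY = "ingested/concur/expenses_export.pgp"


def pull_file_to_s3() -> None:
    """Download one encrypted file from the vendor SFTP server and land it in S3."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(SFTP_HOST, username=SFTP_USER, key_filename=SFTP_KEY_PATH)
    sftp = ssh.open_sftp()
    local_path = "/tmp/" + REMOTE_PATH.rsplit("/", 1)[-1]
    sftp.get(REMOTE_PATH, local_path)  # pull the encrypted payload as-is
    sftp.close()
    ssh.close()

    # Land the raw (still encrypted) file in the ingested/ area of the data lake;
    # decryption and parsing happen further downstream.
    boto3.client("s3").upload_file(local_path, BUCKET, S3_KEY)


if __name__ == "__main__":
    pull_file_to_s3()
</pre>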
<p>Now, if the variety in ingestion mechanisms wasn’t challenging enough, the types of files we receive are just as diverse. We handle everything from strictly structured data (csv, xlsx) to semi-structured data (think json or xml).</p><blockquote>One project I’m personally working on receives data in an encrypted, unstructured format called .jsonl.pgp — which, in layman’s terms, is a JSON Lines file encrypted with <strong>P</strong>retty <strong>G</strong>ood <strong>P</strong>rivacy.</blockquote><blockquote>Each unique scenario forced us to thoughtfully design our ingestion workflows. Whether we were pushing or pulling data, structured or unstructured, encrypted or plain text, we had to ensure every ingestion pipeline reliably moved data into a central <strong>Data Lake</strong>.</blockquote><h4>#2 - Once we landed data into AWS, how would we transform and prepare it efficiently for business users?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*cWHxCv5woECKuuAg" /><figcaption>credit @<a href="https://dataedo.com/@piotr-kononow">Piotr Kononow</a></figcaption></figure><p>Now, once we successfully landed all this data into the Data Lake, another big question remained: How do we turn this raw, often messy data into something genuinely useful and understandable for the business?</p><p>Let’s take SAP S4/HANA data as an example. If you’ve ever peeked at raw SAP tables, you probably know exactly what I mean. Technical field names can feel cryptic and confusing at best. Instead of friendly terms like “Customer ID” or “Material ID,” SAP tends to spit out table headers like kunnr for customer ID or matnr for material number.</p><p>While these abbreviations might be perfectly logical for SAP consultants or anyone fluent in German (because yes, they’re mostly acronyms based on German terminology), they can leave business users scratching their heads. I mean, no offense to our German friends, but MATNR doesn’t exactly scream “material number” if you’re not familiar with the term, right?</p><p>So part of our data processing challenge involved transforming these obscure, technical column names into clear, business-friendly language. It wasn’t enough just to land the data safely in the Data Lake; we had to actively interpret and reshape it.</p><p>AWS emerged as our top choice precisely because it addressed these critical pain points. However, creating an effective architecture wasn’t as simple as plugging in a few AWS services. To handle these challenges, our platform architecture was divided into clear stages: <strong>Data Ingestion</strong>, <strong>Data Processing/Loading</strong>, and <strong>Data Warehousing</strong>.</p><p>And the story begins here.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YMs6Fsp5wuzoXlx8x_BFeQ.png" /></figure><h3>Data Ingestion: Gathering Data from Everywhere</h3><p>The raw-data pipeline is organised around three independent state machines: <strong>Ingestion</strong>, <strong>Processing</strong>, and <strong>Loading</strong>. Each state machine runs on AWS Step Functions, which gives us deterministic task ordering, built-in retries, and centralised error handling while keeping the JSON definition of every workflow in version control.</p>
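<p><em>As a rough illustration of what one of those version-controlled definitions can look like, here is a heavily trimmed Amazon States Language skeleton, written as a Python dictionary so it can be serialised and checked in. The state names, ARNs, account IDs, and job names are hypothetical placeholders; the real workflows carry retries, timeouts, and more states.</em></p><pre>
# Trimmed-down skeleton of an ingestion state machine (Amazon States Language).
# All ARNs, account IDs, and job names are hypothetical placeholders.
import json

INGESTION_DEFINITION = {
    "Comment": "Ingestion: metadata Lambda -> Glue extract -> validation -> alert or notify",
    "StartAt": "CollectMetadata",
    "States": {
        "CollectMetadata": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-southeast-2:111111111111:function:collect-metadata",
            "Next": "RunGlueExtract",
        },
        "RunGlueExtract": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "extract-to-ingested"},
            "Next": "ValidateLoad",
        },
        "ValidateLoad": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-southeast-2:111111111111:function:validate-load",
            "Next": "CheckResult",
        },
        "CheckResult": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.validation_passed",
                    "BooleanEquals": True,
                    "Next": "EmitSuccessEvent",
                }
            ],
            "Default": "PublishFailureAlert",
        },
        "PublishFailureAlert": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:ap-southeast-2:111111111111:ingestion-alerts",
                "Message.$": "$.error",
            },
            "End": True,
        },
        "EmitSuccessEvent": {
            "Type": "Task",
            "Resource": "arn:aws:states:::events:putEvents",
            "Parameters": {
                "Entries": [
                    {
                        "Source": "data-platform.ingestion",
                        "DetailType": "IngestionSucceeded",
                        "Detail": {"status": "SUCCEEDED"},
                    }
                ]
            },
            "End": True,
        },
    },
}

if __name__ == "__main__":
    # The serialised JSON is what actually lives in version control.
    print(json.dumps(INGESTION_DEFINITION, indent=2))
</pre>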
<figure><img alt="" src="https://cdn-images-1.medium.com/max/494/1*RTVxQBSCRiMSyHHDXTcviQ.png" /></figure><p>We manage every ingestion workflow with <a href="https://aws.amazon.com/step-functions/"><strong>AWS Step Functions</strong></a>. Each workflow is defined as a state machine that lists the exact sequence of tasks, the success criteria for each task, and the error-handling branch if any step fails.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/250/1*khLHhsCt6Bldf2no4RwCpg.png" /></figure><p>When a new load is triggered, the state machine starts by invoking an <a href="https://docs.aws.amazon.com/lambda/latest/dg/welcome.html"><strong>AWS Lambda function</strong></a> that collects connection metadata: endpoint URLs, authentication tokens, and the specific objects or tables requested for that run. The Lambda function captures the run-time metadata and passes a reference to the next state.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/271/1*AdeF5z8o4zFy36oXECuhng.png" /></figure><p>The next task in the state machine launches an <a href="https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html"><strong>AWS Glue</strong></a> job. Glue connects to the source system using the metadata provided, extracts the data, and writes the raw payload to an S3 prefix named ingested. All payloads are stored in JSON to preserve schema details exactly as received.</p><p>After Glue finishes, Step Functions routes execution to another Lambda function that validates the job result.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/250/1*BPrWSM61ZaduWNoywO6inQ.png" /></figure><p>The validator checks row counts, file size, and schema conformance, then updates an execution log in DynamoDB.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/266/1*E9kTGXrlpHh7wtS4tVZ7dw.png" /></figure><p>If validation fails or the Glue job errors or times out, the state machine switches to a failure branch that publishes an alert through <a href="https://docs.aws.amazon.com/sns/latest/dg/welcome.html"><strong>Amazon SNS</strong></a>. The alert contains the run ID and error details so support engineers can diagnose the issue quickly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/692/1*3MVM7uiAPj5ySL-06us46A.png" /></figure><p>If validation succeeds, Step Functions marks the ingestion workflow as complete and emits a success event on <a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html"><strong>Amazon EventBridge</strong></a>. Downstream processing pipelines subscribe to this event and begin the processing stage.</p><h4>Processing: same state machine, new purpose</h4><p>The Processing pipeline re-uses the exact Step Functions template we built for Ingestion — metadata Lambda, Glue task, completion Lambda, then a split to SNS (failure) or EventBridge (success). Nothing changes in the orchestration layer; the distinction lives entirely inside the Glue script.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*8kpO2EReVYBbr6KFv-1heA.png" /></figure><p>The script does three things:</p><ol><li>First, it scans every JSON object we just landed under the ingested/ prefix.</li><li>Second, it appends a single column, ingested_time, stamped at UTC so we can trace every record back to a specific run without touching the source system.</li><li>Third, it writes the result to the formatted/ prefix as compressed Parquet, preserving column types and order while cutting query latency on Redshift by more than half.</li></ol>
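<p><em>For illustration only, here is a minimal PySpark sketch of a processing script along those lines. The bucket and prefix names are hypothetical placeholders, and the real Glue job is parameterised per source rather than hard-coded to a single path.</em></p><pre>
# Minimal sketch of the processing step: read raw JSON from ingested/,
# stamp an ingestion timestamp, and write compressed Parquet to formatted/.
# Bucket and prefix names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

RAW_PATH = "s3://example-data-lake/ingested/example_source/"
FORMATTED_PATH = "s3://example-data-lake/formatted/example_source/"

spark = SparkSession.builder.appName("format-ingested-payloads").getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")  # so the stamp below is in UTC

# 1. Scan every JSON object landed under the ingested/ prefix.
raw_df = spark.read.json(RAW_PATH)

# 2. Append a single ingested_time column so each record traces back to a run.
formatted_df = raw_df.withColumn("ingested_time", F.current_timestamp())

# 3. Write the result to the formatted/ prefix as compressed Parquet.
(
    formatted_df.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet(FORMATTED_PATH)
)
</pre>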
<p>All retry policies, timeouts, and alert routes inherit from the shared state-machine definition, which keeps operational behaviour consistent across every stage of the raw-data pipeline.</p><h4>Loading: finishing the trip in Redshift</h4><p>The third state machine, <strong>Loading</strong>, mirrors the same Step Functions skeleton we used for Ingestion and Processing: initial Lambda for run-time metadata, a single Glue task, completion Lambda, then a fan-out to SNS or EventBridge. Again, orchestration is unchanged; only the Glue logic shifts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/757/1*Bey8ynNDeSzYTG0xsa-T0Q.png" /></figure><p>The loader starts by picking up the Parquet files we just wrote to the formatted/ prefix. The cleaned dataset is streamed directly into a Redshift staging table, staging.&lt;table_name&gt;, using Amazon Redshift’s native COPY command behind the scenes. All temp operations land in an S3 scratch directory so we avoid cluttering the warehouse with intermediate files.</p><p>Operational behaviour stays consistent with the earlier pillars. Any error triggers SNS with a run ID and the stack trace so the on-call engineer can jump straight to the failing task.</p><p>With this final load step in place, the raw-data pipeline closes its loop: JSON lands in ingested/, gets normalised to Parquet in formatted/, and arrives in Redshift staging ready for type casts, SCD logic, and business modelling. End-to-end latency for a typical source drop is now measured in minutes, not hours, and alignment across the three state machines lets us operate the stack with one shared runbook.</p><h3>Lessons Learned from Building the Raw-Data Pipeline</h3><h4>1. One state machine per stage keeps troubleshooting narrow</h4><p>Splitting the flow into three independent Step Functions — Ingestion, Processing, Loading — gave us clean blast-radius boundaries. If Loading fails, we re-run only that state machine; Ingestion and Processing stay green. Sainsbury’s Data &amp; Analytics adopted the same approach after discovering that chaining ten Lambdas in one giant workflow made it “very difficult to find single points of failure” and polluted their S3 buckets with half-finished files.</p><h4>2. Treat metadata capture as a first-class task</h4><p>Our first Lambda in every state machine writes a snapshot of connection details, source object names, and run IDs to S3. That decision paid off during audits: we can reproduce any historical load without digging through CloudWatch logs. AWS Prescriptive Guidance flags the same pattern as a prerequisite for reconstructing failed runs and for partitioning large data sets later.</p><h4>3. Visual orchestration beats log digging</h4><p>Step Functions’ runtime graph immediately shows which state failed (green vs red).
That visibility slashed mean-time-to-diagnose during the first month of go-live. Sainsbury’s team called the visual trace “the lovely graphic” that pinpoints failing states faster than combing through Lambda logs.</p><h3>What’s Next</h3><p>Part 2 will cover how we take the raw Redshift staging tables you’ve just read about and turn them into fully modelled, Slowly-Changing-Dimension-aware business layers — with dbt. Stay tuned!</p><h3><em>Gratitude corner</em> 💙</h3><p><em>A huge thank-you to </em><strong><em>Harold</em></strong><em> and </em><strong><em>Jamie</em></strong><em>, if you guys are reading this :) hands-down the best managers a data nerd could hope for. Your support and mentorship made staying up late to finish this blog post on a Friday night feel more like a passion project.</em></p><p><em>I’m equally grateful to </em><strong><em>Gurpreet, Michael, Praveen, </em></strong><em>and </em><strong><em>Kazuma</em></strong><em> for answering every “Is this an AWS thing or am I just tired?” question on my zero-to-hero cloud journey. Couldn’t have done it or kept my sanity without all of you.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b4c798d17094" width="1" height="1" alt=""><hr><p><a href="https://blog.dataengineerthings.org/how-we-built-the-aws-data-analytics-platform-part-1-b4c798d17094">How We Built the AWS Data &amp; Analytics Platform (Part 1)</a> was originally published in <a href="https://blog.dataengineerthings.org">Data Engineer Things</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Databases, Warehouses & Lakes: A Kitchen Tour]]></title>
            <link>https://blog.dataengineerthings.org/databases-warehouses-lakes-a-kitchen-tour-2a99f175baf1?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/2a99f175baf1</guid>
            <category><![CDATA[etl]]></category>
            <category><![CDATA[data-engineer]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[data-warehouse]]></category>
            <category><![CDATA[data-lake]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Tue, 27 May 2025 11:13:01 GMT</pubDate>
            <atom:updated>2025-05-27T12:57:33.165Z</atom:updated>
            <content:encoded><![CDATA[<h4>If you know your way around a fridge, you’re five minutes from grasping modern data storage — let me show you how.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HY0_u-dA7kWSUu5G" /></figure><blockquote>I was pushing my trolley through Woolies (short for <strong>Woolworths</strong>, Australia’s everyday supermarket) when a thought popped up between the bananas and the bread: Whenever I’m coaching newcomers to data, three terms reliably trip them up — <strong>database</strong>, <strong>data warehouse</strong>, and <strong>data lake</strong>. They appear in every blog post and architecture diagram, yet most explanations assume you were born fluent in cloud-native jargon.</blockquote><blockquote>I used to get lost, too, until I realised all three ideas match the way I store food. If you can picture the rhythm of a fridge, a pantry, and a walk-in freezer, you can picture the rhythm of modern data storage.</blockquote><blockquote>No doctorate required.</blockquote><h3>Why a kitchen in the first place?</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7GFJXAWgojQ8Ak7Z" /></figure><p>Every household sorts food into three destinations:</p><ol><li>Milk and leftovers live in the refrigerator because breakfast happens on autopilot.</li><li>Bulk goods line orderly pantry shelves so you can see, at a glance, whether you have enough rice for the next fortnight.</li><li>The “maybe one day” supplies — frozen berries, mystery dumplings, that discounted brisket — hibernate in the deep-freeze until inspiration strikes.</li></ol><p><strong>Data travels the same route.</strong></p><p>Some pieces must be lightning-fast, some must be neatly organised for long-range analysis, and the rest simply need a cheap berth until you decide what to cook. Once you grasp those storage instincts, most LinkedIn buzzwords start translating themselves.</p><h3>The fridge: databases and the speed of breakfast</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MnDup1weSvcQImKS" /></figure><p>A <strong>database</strong> — PostgreSQL, MySQL, SQL Server, take your pick — is the digital “refrigerator”. It handles thousands of mini-transactions every second: checking a bank balance, updating a rideshare driver’s location, or logging the precise moment you tap “Skip intro” on Netflix. It achieves that speed by enforcing a rigorous <strong>schema-on-write</strong> rule; every new row must slide neatly into a fixed table blueprint, the same way every carton in your fridge claims a specific shelf.</p><p>That structure unlocks the famous <strong>ACID guarantees:</strong> (<strong>a)</strong>tomic, (<strong>c)</strong>onsistent, (<strong>i)</strong>solated, and <strong>(d)</strong>urable changes, which is a grand way of promising your money won’t evaporate if the power cuts out mid-transfer. In short, a database keeps the essentials cool, structured, and instantly reachable, because nobody will wait five seconds for a splash of milk.</p><h3>The pantry: warehouses and the luxury of hindsight</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7unVIinEw92dzn7y" /></figure><p>A fridge collapses the moment you stuff six months of groceries inside. Million-dollar questions — “How did lager sales trend across Australia last year?” — need the breathing room of a <strong>data warehouse</strong>.
Platforms such as Snowflake, Redshift, BigQuery, and Synapse each organise historical facts into columnar files and spin up armies of query workers through <strong>massively parallel processing</strong>. Those soldiers race across millions of rows, returning answers in the time it takes to sip a coffee.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GffGooqPD7LUpCHN" /></figure><p>Warehouses prefer <strong>star</strong> and <strong>snowflake</strong> schemas: a central fact table surrounded by dimensions for product, customer, date — like pantry bins labelled grains ➜ rice ➜ basmati.</p><p>Most teams restock overnight via ETL, though near-real-time streams are creeping in wherever dashboards must be as fresh as cold brew. Either way, the pantry balances cost and order so analysts can slice trends without hogging the fridge every time they run a monthly report.</p><h3>The walk-in freezer: data lakes and the luxury of raw potential</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2__9wwT-_OsjHKvy" /></figure><p>Past the pantry door lies the <strong>data lake</strong>, a warehouse-sized freezer where anything remotely promising can be tossed — server logs, sensor feeds, high-res images, unfathomably wide CSVs. Cloud object stores such as S3 or ADLS let you dump everything first and worry about structure later, a philosophy called <strong>schema-on-read</strong>. This freedom is catnip for machine-learning teams that crave petabytes of untouched signal and for compliance officers who must archive transactions just in case regulators come knocking.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tLfW5hVtZSdqr7bs" /></figure><p>Of course, freedom courts chaos. A lake without metadata quickly degrades into the dreaded <strong>data swamp</strong>, that murky realm of files named final_FINAL_v2.csv. Unless you tag and catalogue aggressively, you’ll spend Friday night spelunking for a single fragment you swear existed. “Lakehouse” tools—Databricks, Iceberg, Redshift Spectrum—attempt to bolt warehouse-style query engines onto those cheap ice blocks, essentially fitting the freezer with LED shelving so you can both store and cook in the same room. Clever, but still evolving.</p><h3>Choosing the right shelf for the job</h3><p>Choosing a storage layer is easier than assembling flat-pack furniture:</p><ul><li>If an interaction demands instant feedback — authorising a credit-card payment — reach for the fridge.</li><li>If leadership wants a Monday-morning KPI deck, load the pantry and let its parallel engines chew through history.</li><li>Training a recommendation model on three years of clickstream chaos? Shovel everything into the freezer first, then carve smaller, curated slices back into the pantry when analysis time arrives.</li></ul><p>Trying to cram all three workloads into a single layer is like balancing ice-cream tubs on a hot stovetop — messy, stressful, and destined for tears.</p><h3>The quick reality check</h3><p>Yes, technically everything could live in the lake! But storing tonight’s butter in a chest freezer three floors down is a lifestyle choice, not a best practice. Each storage layer maximises a different balance of cost, speed, and structure.
Use the right shelf for the right need, and both dinner and data arrive on schedule.</p><h3>One-minute recap</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VVz1GElj-b7HABQf" /></figure><ul><li><strong>Database ≅ Fridge</strong> — ultra-fast, strictly organised, and designed for high-volume transactions.</li><li><strong>Warehouse ≅ Pantry</strong> — neatly modelled history that turns sprawling data into trend-friendly bites.</li><li><strong>Lake ≅ Freezer</strong> — dirt-cheap vault for raw information; label it diligently or risk a swamp.</li></ul><p>Master these three storage instincts and the rest of data-engineering lingo starts reading like everyday recipes.</p><p>Curious about the eternal ETL vs. ELT debate? That deep-dive is coming soon — tap <em>Follow</em> so it pops into your feed the minute it’s ready.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2a99f175baf1" width="1" height="1" alt=""><hr><p><a href="https://blog.dataengineerthings.org/databases-warehouses-lakes-a-kitchen-tour-2a99f175baf1">Databases, Warehouses &amp; Lakes: A Kitchen Tour</a> was originally published in <a href="https://blog.dataengineerthings.org">Data Engineer Things</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[dbt Incremental — The Right Way]]></title>
            <link>https://medium.com/data-science/dbt-incremental-the-right-way-63f931263f4a?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/63f931263f4a</guid>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[analytics-engineering]]></category>
            <category><![CDATA[dbt]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Fri, 21 Jul 2023 13:37:02 GMT</pubDate>
            <atom:updated>2023-07-24T13:42:18.830Z</atom:updated>
            <content:encoded><![CDATA[<h3>dbt Incremental — The Right Way</h3><h4>From Full-Load Pain to Incremental Gain (and a Few Mistakes Along the Way)</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OiMdjXNJZYtqdQAG" /><figcaption>Photo by <a href="https://unsplash.com/es/@luk10?utm_source=medium&amp;utm_medium=referral">Lukas Tennie</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>When my team at GlamCorner began transitioning from traditional MySQL databases to ELT on a Postgres database with dbt as the transformation and modeling layer, we were overjoyed. We set up dbt projects and profiles, dedicated macros for our models, and built more data marts to serve downstream needs. We thought we were done — I thought we were done until we hit our first bump: model run time. In this article, I explain how I overcame one of the toughest performance challenges at the time by adopting dbt incremental, making mistakes (like who doesn’t?), and learning valuable lessons along the way.</p><h3>The Evolving Monster</h3><p>At GlamCorner, we’re in the circular fashion game. Our “back-end” team plays with RFID scanners in the warehouse, scanning items in and out like pros. We also use fancy platforms like Zendesk and Google Analytics to make our customers feel extra special. And to top it off, we’ve got our own in-house inventory system — thanks to our brilliant software engineers — that links all our front-end and back-end systems together. It’s like a match made in heaven. But as we grow and add more years of operation, our database is getting bigger and bigger. And let’s just say the traditional full-table load is starting to feel a bit like a pain in the you-know-what.</p><h3>The Pain</h3><p>You either understand the pain of “I want the data to be ready by 9am” or you don’t.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/520/1*M3MPbyz7eH3cEWqx7vfYDw.png" /><figcaption>Image by the author</figcaption></figure><p>The team put in the effort to create a flawless <strong>(E)</strong>xtract and <strong>(L)</strong>oad; we gathered and toasted. Then one day, the <strong>(T)</strong>ransformation went “Nah, that’s not how it works around here” and spun the total runtime up from 10 minutes to 90 minutes. I may exaggerate on the 10 to 90 minutes part because yes, everything has its own reason, but the fear of the business team knocking on your door at 8.55 am, when you haven’t even started your first cup of coffee, just to ask “Where is the newest data?” is a hell of a ride to work every day. It feels like dumping all the hard work in the trash, and I, myself, cannot accept that reality.</p><p>Let’s go back to the thing I said: everything has its own reason, and here is why the fairytale that once took 10 minutes of my time has become a red-horned demon of 90 minutes. To illustrate this, let’s take the example of the <strong>fct_booking</strong> data shown in the figure below. This table contains all booking information taken from the website each day. Each <strong>booking_id</strong> represents one order that was booked on the website.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TgpHehXWK_zf2RljzdrlyA.png" /><figcaption>Image by the author</figcaption></figure><p>Every day, around 4 orders are added to the booking table, which already contains 80 orders.
When this model is run using dbt, it deletes the entire table from the last run and replaces it with 84 records, covering both the old and new orders (80 orders from historical cumulative data + 4 new orders added for the latest day). To add to the list, for every 4 new records added, the query time increases by around 0.5 seconds.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D7l86rco3ZrkmyW93lHOrw.png" /><figcaption>Image by the author</figcaption></figure><blockquote>Now, imagine that 4 orders are equivalent to 4000 per day and 80 orders actually represent 800,000 records. Can you guess how much time it will take to transform the fct_bookings table, and where we will be in, for example, 3 months?</blockquote><blockquote><strong>Well,</strong> <strong>I’ll leave the math for you.</strong></blockquote><h3>The Golden Egg</h3><p>So, after aimlessly wandering through dbt Community threads and halfheartedly skimming through dbt documentation (I mean, who hasn’t done that?), I stumbled upon the holy grail of dbt Incremental. It’s like finding a needle in a haystack, except the needle is golden and the haystack is made of code.</p><p>In layman’s terms, dbt Incremental means that you don’t have to go through the hassle of processing all data from the beginning. You just process the new and modified data, saving you time and resources. It’s like a shortcut that actually works and won’t get you in trouble with your boss.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kJ-xZpol9bb8Sw7yYbHb9A.png" /><figcaption>Image by the author</figcaption></figure><p><em>If you want to know more about the nitty-gritty details of dbt Incremental, then check out this blog and document:</em></p><ul><li><a href="https://towardsdatascience.com/the-power-of-dbt-incremental-models-for-big-data-c8ba821eb078">The power of dbt incremental models for Big Data</a></li><li><a href="https://docs.getdbt.com/docs/build/incremental-models">Configure incremental models | dbt Developer Hub</a></li></ul><p>To set this up in your dbt model, you need to add a config block at the beginning of your model script, keeping these two components in mind:</p><ul><li><strong>Materialized:</strong> By default, a dbt model’s materialization is ‘view’ when there is no configuration. To enable incremental mode, set materialized to ‘incremental’. For more information on other dbt materializations, please visit:</li></ul><p><a href="https://docs.getdbt.com/docs/build/materializations">Materializations | dbt Developer Hub</a></p><ul><li><strong>Unique_key:</strong> Although setting up a unique key is optional according to the dbt documentation, it is extremely important to rationally consider how you want to set this up. Essentially, the unique key will be the main driver that lets dbt know if the record should be added or changed.
Some questions to keep in mind are:</li><li>Is the unique key really unique?</li><li>Is it a combination of two or more columns?</li></ul><p>Failing to set up a unique key can lead to missing data and ambiguous values, <strong>so be careful!</strong></p><p>Here is an example of how the config block is set up for a single unique key:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/bddf772657924ee8f67f2255aa5808ba/href">https://medium.com/media/bddf772657924ee8f67f2255aa5808ba/href</a></iframe><p>In case the unique key is a combination of several columns, you can tweak the config to be:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/78e1233db19f1d1592e0598bfed3b744/href">https://medium.com/media/78e1233db19f1d1592e0598bfed3b744/href</a></iframe><blockquote><strong>Note:</strong> if you’re using BigQuery or Snowflake to store your data, you might have the option of tuning extra configs like sync_mode. But since my company’s database is built on Redshift (which is Postgres-based), we don’t have those fancy gears.</blockquote><p>Once that’s taken care of, there’s just one more important step we need to add to our dbt incremental models&#39; script: a conditional block for the <strong>is_incremental()</strong> macro.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fa8f148149ae8512dc991a81f142e6f0/href">https://medium.com/media/fa8f148149ae8512dc991a81f142e6f0/href</a></iframe><p>The <strong>is_incremental()</strong> macro returns <em>True</em> if the following conditions are met:</p><ul><li>The destination table already exists in the database.</li><li>dbt is not running in <strong>full-refresh</strong> mode.</li><li>The running model is configured with <strong>materialized=’incremental’</strong></li></ul><p>Note that the SQL in your model needs to be valid regardless of whether <strong>is_incremental()</strong> evaluates to <strong><em>True</em></strong> or <strong><em>False</em></strong>.</p><p>Returning to the example of fct_booking, here is the original query:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/43b4f8ecb2cfb85c0839c192b8962401/href">https://medium.com/media/43b4f8ecb2cfb85c0839c192b8962401/href</a></iframe><p>After applying the incremental setup described above, we have a model that includes the unique key, a tag for the model, and a conditional block for the <strong>is_incremental()</strong> macro as follows:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6d3ab9640caff848b6c9820a8cae8366/href">https://medium.com/media/6d3ab9640caff848b6c9820a8cae8366/href</a></iframe><p>As seen in the code, the unique_key has been set to the <strong>booking_id</strong>, as one booking_id corresponds to one order.</p><p>To make it fancier, I have also added an <strong>incremental_model</strong> tag to every model that I materialize incrementally. The main driver is that, when things go wrong with dbt incremental models, they often go wrong ‘in bulk’.
Thus, to refresh them without affecting other models, and without having to remember every single model with incremental mode enabled, I can run the command below instead of specifying each incremental model by name.</p><pre>dbt run --select tag:incremental_model --full-refresh</pre><p>Also note that if the incremental model is set up incorrectly and updates incorrect data in the production table, I would need to run the model again using the --full-refresh command. However, you should keep in mind that running it as a full refresh instead of in incremental mode will be slower, so remember to pick the right time to do it (<em>tip: don’t do it at 9 am</em>).</p><h3>The Slap Back</h3><p>Up until this point, life was good again! I set up the table flawlessly, and the query performance significantly improved. Finally, I can sleep at night. I can finally touch grass; dbt incremental granted little Leah a dream come true. However, not long after, a guy from the Finance team rushed to my desk with a report in his hand and aggressively claimed, “You gave me the wrong data!”</p><p>It turned out that the incremental models accidentally skipped many orders in a day and then went to the next day. “How on earth could this happen? I followed the expert tutorial — this can’t be wrong!” I whispered in my head. Except there was something going on upstream that I had missed. After some digging, the issue came to light.</p><p>Every day, a data extraction and load process takes place at midnight to synchronize all the data up until that moment. This synchronization typically occurs at midnight, but its timing can be influenced by factors such as start spinup time and package cache. It’s important to note that the Extract part of the process may begin slightly after midnight.</p><p>Consider a scenario where the extract starts at 12:02 am and someone decides to make a booking around 12:01 am. In this situation, the synced data will also include a small portion of the orders from the new day, which is referred to as “late-arriving data” in more technical terms.</p><p>However, there’s a drawback with the current logic of the WHERE filter. The filter’s efficiency is compromised because it only appends new records from the latest date value of <strong>created_at</strong>. This means that it won’t capture all the data for the entire day.</p><p>In order to fix this, we will twist this logic a little bit:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/604c06ed0ac9146dcf198be2ba338139/href">https://medium.com/media/604c06ed0ac9146dcf198be2ba338139/href</a></iframe><p>The new filter involves syncing all data from the past 7 days. Any new data will be added to the existing dataset, while any old data with updated field values will be replaced.</p><h3>The Tradeoff</h3><p>If you’ve been following along, you might be wondering, “How many days should I go back using the is_incremental filter? And why did I choose 7 days for my case? What if I need data for the last 30 days?” Well, the answer is not straightforward — it depends on your specific scenario.</p><p>In my situation, I can rely on each day having at least one order. Since there could be internal changes in the data during the last 7 days, I set my filter to append new and update existing data within that timeframe. However, if you feel confident about your query performance and want to go back further, say the last 365 days, you are free to do so!
Just be mindful that there are tradeoffs to consider.</p><p>The primary reason for using an incremental model is to reduce costs in terms of model run performance. However, scanning through a larger dataset for the last 7 days could slow down performance, depending on the size of your data and your company’s specific use case. It’s essential to strike the right balance based on your needs.</p><p>For a more general approach, I recommend using 7 days as a standard rule. You can set up data update schedules on a weekly or annual basis for full-refreshes of the dbt incremental models. This approach allows you to account for unexpected issues, as no matter how well your setup is, there may still be occasional downtimes.</p><p>In my use case, I typically schedule the incremental run on a full-refresh during the weekend when there are fewer operational tasks. However, this schedule can be customized according to your team’s requirements.</p><p>Remember, the key is to find the right tradeoff between data freshness and query performance, ensuring that your data remains accurate and up-to-date while optimizing your model’s efficiency.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=63f931263f4a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/dbt-incremental-the-right-way-63f931263f4a">dbt Incremental — The Right Way</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Yet Another Article about Star and Snowflake Schema]]></title>
            <link>https://medium.com/@ndleah/yet-another-article-about-star-and-snowflake-schema-c909892be45c?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/c909892be45c</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-modeling]]></category>
            <category><![CDATA[data-engineer]]></category>
            <category><![CDATA[database]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Wed, 19 Jul 2023 14:21:31 GMT</pubDate>
            <atom:updated>2023-07-19T14:21:31.427Z</atom:updated>
            <content:encoded><![CDATA[<h4>Unraveling the Mysteries of Star Schema and Snowflake Schema</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*Xhjouse1ZfIwaRu931zgSg.png" /></figure><h3>Introduction</h3><p>In the ever-evolving landscape of data management, two prominent database modeling techniques, Star Schema and Snowflake Schema, have captured the attention of data enthusiasts and organizations alike. Despite being well-established concepts, they continue to bewilder and confuse many individuals due to their nuanced differences. In this article, we will explore the essence of Star and Snowflake Schema, delve into their practical applications through a fictional company called “DataCo,” and shed light on why some organizations choose Star Schema while others opt for Snowflake Schema.</p><h3>Defining Star Schema and Snowflake Schema</h3><p><strong>Star Schema</strong> and <strong>Snowflake Schema</strong> are both data warehousing designs aimed at organizing data efficiently.</p><p>In a <strong>Star Schema</strong>, data is structured around a central fact table, while multiple dimension tables surround it, forming a star-like shape. Each dimension table represents a specific attribute or category related to the central fact table, facilitating easier query performance and analysis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/340/1*jLBQo7YwoqAJfuDz_JtT7w.png" /><figcaption>Star Schema Design</figcaption></figure><p>On the other hand, <strong>Snowflake Schema</strong> is an <strong><em>extension</em></strong> of the Star Schema, characterized by breaking down dimension tables into more normalized sub-dimensions. This normalization reduces data redundancy and optimizes storage.</p><p>For further reading about normalization and denormalization, you can visit my previous blog post on this topic:</p><p><a href="https://medium.com/@ndleah/data-modeling-love-breakups-and-complicated-relationships-a-hilarious-guide-8569f08dc032">“Data Modeling: Love, Breakups, and Complicated Relationships — A Hilarious Guide!”</a></p><p>As a result, the Snowflake Schema looks like a snowflake, with the fact table in the center and the normalized dimension tables branching out.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/764/1*59O7Wp52_jsSwoyorUtBPw.png" /><figcaption>Snowflake Schema Design</figcaption></figure><h3>Use Case Scenario: DataCo’s Sales Analytics</h3><p>Imagine “DataCo,” a fictional online retailer, eager to understand its sales performance for strategic decision-making. To analyze sales data efficiently, they must design a suitable database schema for their data warehouse.</p><h4>⭐ Application of Star Schema at DataCo</h4><p>DataCo chooses Star Schema for their sales analytics. They create a fact table that stores crucial sales data like order numbers, product IDs, and transaction dates. Surrounding this fact table, they establish dimension tables such as “Product,” “Customer,” and “Time,” which contain detailed information about products, customers, and time periods respectively.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xf8xx7oWp-q1hHKzeqvlzg.png" /><figcaption>DataCo — Star Schema</figcaption></figure><p>When DataCo needs to analyze sales based on specific products or customer segments, the Star Schema allows for swift queries by directly connecting the required dimension tables to the fact table.
This design streamlines data retrieval, making it ideal for real-time reporting and business intelligence purposes.</p><h4>❄️ Application of Snowflake Schema at DataCo</h4><p>Now, DataCo aims to expand its sales analytics to encompass a deeper level of data granularity. They choose Snowflake Schema for its ability to normalize data and reduce redundancy.</p><p>With Snowflake Schema, DataCo further breaks down dimension tables, like <strong>Product</strong> and <strong>Customer</strong>, into sub-dimensions. For example, the <strong>Product</strong> dimension can be divided into <strong>Product Category</strong> and <strong>Product Subcategory</strong> tables. Similarly, the <strong>Customer</strong> dimension can be split into <strong>Customer Details</strong> and <strong>Customer Location</strong> tables.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gJ1rqnyen9W4FgEA0xdxfw.png" /><figcaption>DataCo — Snowflake Schema</figcaption></figure><p>While the Snowflake Schema demands more complex joins between tables compared to the Star Schema, it results in significant storage savings and facilitates easier maintenance. This schema is particularly useful when DataCo wants to scale its database for massive amounts of data without compromising performance.</p><h3>Main Differences and Reasons for Choosing Star Schema and Snowflake Schema</h3><ol><li><strong>Complexity —</strong> Star Schema is simpler to implement and query due to its denormalized structure, making it more suitable for smaller data volumes and real-time reporting. On the other hand, Snowflake Schema’s normalization increases complexity but offers scalability advantages, making it suitable for handling vast data and long-term data management.</li><li><strong>Query Performance —</strong> Star Schema often outperforms Snowflake Schema in query response time because of its reduced joins between tables. However, Snowflake Schema can still achieve excellent performance with proper indexing and optimization.</li><li><strong>Storage Efficiency —</strong> Snowflake Schema excels in storage efficiency due to normalization, eliminating redundant data. This makes it an excellent choice for large enterprises dealing with vast amounts of data.</li><li><strong>Maintainability —</strong> Star Schema is easier to maintain due to its straightforward design, but it may suffer from data duplication. Snowflake Schema, despite being more complex, offers better maintainability by reducing redundancy and adhering to database normalization principles.</li></ol><p>In conclusion, Star Schema and Snowflake Schema are two powerful database modeling techniques that can significantly impact an organization’s data analytics capabilities. Star Schema’s simplicity and real-time reporting advantages make it a compelling choice for smaller businesses and quick data insights. On the other hand, Snowflake Schema’s advanced normalization and scalability make it an attractive option for larger enterprises dealing with extensive data volumes and long-term data storage needs.</p><p>Understanding the nuances of these schemas can help organizations design robust and efficient data warehousing solutions tailored to their unique needs. 
By leveraging the strengths of both Star and Snowflake Schemas, companies like DataCo can extract valuable insights from their data, empowering them to make informed decisions and gain a competitive edge in their industry.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c909892be45c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Brief Guide to Database Normalization]]></title>
            <link>https://medium.com/@ndleah/a-brief-guide-to-database-normalization-5ac59f093161?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/5ac59f093161</guid>
            <category><![CDATA[normalization]]></category>
            <category><![CDATA[data-modeling]]></category>
            <category><![CDATA[database]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Tue, 18 Jul 2023 15:08:19 GMT</pubDate>
            <atom:updated>2025-07-14T02:07:13.039Z</atom:updated>
            <content:encoded><![CDATA[<h4>Understanding the Basics and Advanced Levels of Database Normalization</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2tHEzpv47QLGx2wHN_ZJpA.jpeg" /></figure><p>If you’re new to database design, you’ve probably heard about database normalization. This is the process of organizing data in a database so that it is consistent, efficient, and easy to manage. There are several levels of normalization, each with its own benefits and use cases. In this article, we’ll cover the basics of normalization, including first normal form (1NF), second normal form (2NF), third normal form (3NF), and other advanced normal forms.</p><h3>Normal Form</h3><p>The concepts of <strong>normalization</strong> and <strong>normal forms</strong> were introduced after the invention of the relational model. Database normalization is an essential procedure to avoid inconsistency in a relational database management system. It should be performed in the design phase. To achieve this, redundant fields should be refactored into smaller pieces.</p><p><strong>Normal forms</strong> are defined structures for relations, with sets of constraints that relations must satisfy in order to detect data redundancy and correct anomalies. The following anomalies can occur while performing database operations:</p><ul><li><strong>insert:</strong> data is known but cannot be inserted</li><li><strong>update:</strong> updating data requires modifications in multiple tuples (rows)</li><li><strong>delete:</strong> deleting some data causes some other data to be lost</li></ul><p>First Normal Form sets the initial constraints; further normal forms like 2NF, 3NF, BCNF, 4NF, and 5NF add new constraints cumulatively. In other words, every relation in 2NF is also in 1NF; every relation in 3NF is also in 2NF. If the groups of relations are represented as sets, the following figure can be drawn:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/523/1*Xcfjy-EnEZKWmxxLStmi0A.png" /></figure><h3>First Normal Form (1NF)</h3><p><strong>First normal form (1NF)</strong> is the simplest level of normalization. It involves ensuring that each table in the database has a primary key and that each column in the table contains atomic values. In other words, each row in the table should have a unique identifier, and each value in the table should be indivisible.</p><blockquote>Let’s take an example to understand this better. Consider a table that stores information about employees. The table might have columns like employee_id, name, address, and phone_number. However, the address column could contain multiple values, like street name, city, state, and zip code.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5IzcZ7zI4akepxF4Pt6m3w.png" /><figcaption>Example Table</figcaption></figure><blockquote>To bring this table to 1NF, we need to split the address column into separate columns, each containing a single value.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*67BH2RjG4VZAn8sjnLOPrg.png" /><figcaption>1NF Output</figcaption></figure><h3>Second Normal Form (2NF)</h3><p><strong>Second normal form (2NF) </strong>builds on the foundation of 1NF and involves ensuring that each non-key column in a table is dependent on the whole primary key. In other words, there should be no partial dependencies in the table.</p><blockquote>Let’s continue with our employee table example. Suppose we add a column for department to the table.
If we find that the value in the department column is dependent on the employee_id and name columns, but not on the phone_number column, we need to split the table into two tables, one for employee information and one for department information.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mdPhWT6tr18bKj4mXNwxBQ.png" /><figcaption>2NF Output</figcaption></figure><h3>Third Normal Form (3NF)</h3><p><strong>Third normal form (3NF)</strong> builds on the foundation of 2NF and involves ensuring that each non-key column in a table is not transitively dependent on the primary key. In other words, there should be no transitive dependencies in the table.</p><blockquote>Let’s take another example. Consider a table that stores information about books. The table might have columns like book_id, title, author, and publisher.</blockquote><blockquote>However, the publisher column could be dependent on the author column, rather than on the book_id column. To bring this table to 3NF, we need to split it into two tables, one for book information and one for author information.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YFSuF-sGzyNp40RsUreOVQ.png" /><figcaption>3NF Output</figcaption></figure><h3>BCNF — <strong>Boyce-Codd Normal Form</strong></h3><p><strong>Boyce-Codd Normal Form (BCNF)</strong> is a higher level of normalization than 3NF. It is used to eliminate the possibility of functional dependencies between non-key attributes. A table is in BCNF if and only if every determinant in the table is a candidate key.</p><blockquote>To understand BCNF better, consider a table that stores information about students and their courses. The table might have columns like student_id, course_id, instructor, and instructor_office. In this table, the determinant is course_id, and the non-key attribute is instructor. However, a course can have multiple instructors, so there is a possibility of functional dependencies between non-key attributes. To bring this table to BCNF, we need to split it into two tables, one for course information and one for instructor information.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bD032SNd-49-LQgmJOIStQ.png" /><figcaption>BCNF Output</figcaption></figure><h3><strong>Fourth Normal Form (4NF)</strong></h3><p><strong>Fourth Normal Form (4NF)</strong> is used to eliminate the possibility of multi-valued dependencies in a table. A multi-valued dependency occurs when one or more attributes are dependent on a part of the primary key, but not on the entire primary key.</p><blockquote>To understand 4NF better, consider a table that stores information about employees and their skills. The table might have columns like employee_id, skill, and proficiency_level. In this table, the primary key is a combination of employee_id and skill. However, the proficiency level is dependent on the skill, but not on the entire primary key. To bring this table to 4NF, we need to split it into two tables, one for employee information and one for skill information.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SYq3zac5iFfATP91CiXFUw.png" /><figcaption>4NF Output</figcaption></figure><h3>Fifth Normal Form (5NF)</h3><p><strong>Fifth normal form (5NF) </strong>is the highest level of normalization and is also known as Project-Join Normal Form (PJNF). 
It is used to handle complex many-to-many relationships in a database.</p><p>In a many-to-many relationship, where each table has a composite primary key, it is possible for a non-trivial functional dependency to exist between the primary key and a non-key attribute. 5NF deals with these situations by decomposing the tables into smaller tables that preserve the relationships between the attributes.</p><blockquote>To understand this better, consider a database that stores information about movies and their actors. The tables might have columns like movie_id, actor_id, character_name, and salary. In this database, it is possible for a non-trivial functional dependency to exist between the primary key (movie_id, actor_id) and the salary attribute.</blockquote><blockquote>To bring this database to 5NF, we need to decompose the tables into smaller tables. For example, we might create tables for movies, actors, and characters, and then use a join table to connect them. Each table would have a single primary key, and the join table would include foreign keys to the other tables.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KlUJX0U6DxspNH1Qdj924A.png" /><figcaption>5NF Output</figcaption></figure><h3>Reflection</h3><p>Today, many organizations rely on databases to store, manage, and retrieve their data. In order to ensure that the data is organized in a way that is both efficient and consistent, normalization is often used. There are several levels of normalization that can be applied, with 1NF, 2NF, and 3NF being the most commonly used.</p><p>In addition to 1NF, 2NF, and 3NF, there are also advanced normalization techniques such as Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF). BCNF is used to eliminate the possibility of functional dependencies between non-key attributes. 4NF is used to eliminate the possibility of multi-valued dependencies in a table. 5NF, also known as Project-Join Normal Form (PJNF), is used to handle complex many-to-many relationships in a database.</p><p>While these levels of normalization can provide further data consistency and management benefits, they can also result in more complex table relationships, slower queries, and larger numbers of tables. Therefore, it’s important to carefully consider the use cases and benefits of each technique before applying them in database design.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D4kA1faz1mvK5QePXPtpdA.png" /><figcaption>Normal Forms Comparison Table</figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5ac59f093161" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data Modeling: Love, Breakups, and Complicated Relationships — A Hilarious Guide!]]></title>
            <link>https://blog.dataengineerthings.org/data-modeling-love-breakups-and-complicated-relationships-a-hilarious-guide-8569f08dc032?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/8569f08dc032</guid>
            <category><![CDATA[data-modeling]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Sat, 15 Jul 2023 14:02:01 GMT</pubDate>
            <atom:updated>2023-07-29T04:49:06.127Z</atom:updated>
            <content:encoded><![CDATA[<h3>Data Modeling: Love, Breakups, and Complicated Relationships — A Hilarious Guide!</h3><h4>How to Relationally Date Your Data</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*BzXRT8qBcBtZ2-Nlzd5Sew.png" /><figcaption>Image by the author</figcaption></figure><p>Welcome to the wild and wacky world of data modeling! When I try to picture it in a “non-technical” way, it’s like a whimsical adventure where we play matchmaker for our data, creating connections and relationships that would make Cupid proud.</p><p>Imagine a dating app, but instead of finding your soulmate, you’re finding the perfect match for your data. Buckle up and get ready to swipe right on some hilarious concepts of relational data modeling.</p><h3>Normalization: The Art of Breaking Up</h3><p>Ah, the early stages of dating. You don’t want to rush into a committed relationship and risk having your heart (or in this case, your data) broken. That’s where normalization comes in! It’s like a breakup, but a friendly one. We break up our data into smaller, more manageable tables, so they can keep their options open. No one wants to be stuck in a suffocating relationship with one massive table, right? It’s all about flexibility and giving your data the freedom to explore.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/815/1*yZLWT6o_9JnJSXGsXFOexA.png" /><figcaption>Image by the author</figcaption></figure><blockquote>For instance, picture a table of customers. Each customer has a phone number, email address, and mailing address. Instead of cramming all that info into one overwhelming table, we break it down into smaller, more digestible tables. We create a table for phone numbers, another for email addresses, and one more for mailing addresses. Now, if a customer changes their phone number, you only need to update one table instead of the whole enchilada. Talk about efficient dating!</blockquote><h3>Denormalization: Getting Back Together</h3><p>Sometimes, though, breaking up is harder than we thought. You might start missing the good old days when everything was in one place. That’s when denormalization swoops in like a knight in shining armor. It’s all about reconciliation, my friend! We merge those smaller tables back together, like rekindling an old flame. Suddenly, your data is reunited, stronger than ever.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/758/1*gyjx7S16lnHfyFKafrBpjw.png" /><figcaption>Image by the author</figcaption></figure><blockquote>Imagine you have a table of orders, each with a customer ID, a product ID, and a quantity. On the other hand, you have separate tables for customers and products. Rather than playing the “join the tables” game every time you want to know what a customer ordered, you can denormalize like a boss. Add the customer’s name and the product’s name to the order table. Boom! Now you only need to consult one table to get all the juicy details. It’s like having a cozy reunion with all your data snuggled up in one place.</blockquote><h3>Fact/Dimension Tables: It’s Complicated</h3><p>Life isn’t always about dating just one person. Sometimes, you’re in a full-on relationship with a whole group of people. That’s when things get a bit complicated, my friend. Enter the fact/dimension tables! It’s like managing a bustling love triangle, but with data. 
Intriguing, right?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/663/1*IPDPvpWPB9zzW6oH5G6Rvg.png" /><figcaption>Image by the author</figcaption></figure><p>The fact tables hold the raw data, like sales or inventory, while the dimension tables provide the juicy metadata. Think dates, locations, or product information. It’s like balancing multiple partners without the drama. Smooth, huh?</p><blockquote>Imagine you have a fact table of sales. Each sale has a date, a product ID, a store ID, and a quantity. You also have dimension tables for dates, products, and stores. Instead of throwing everything into one chaotic table, you use the fact table as the link to the dimension tables. It’s like playing matchmaker for your data. They all get along, and you get the information you need without the headache. Complicated relationships can work, my friend, especially in the world of data modeling!</blockquote><h3>Different Schema Models: It’s Not You, It’s Me</h3><p>Just like in dating, we all have different preferences when it comes to organizing our data. It’s not about finding “the one” schema model, but rather finding what works best for you. Let’s explore some of the options together, shall we?</p><p>First up, we have the <strong>star schema</strong>. It’s like a simple, straightforward date night. Perfect for easy queries and reporting. Everything revolves around one central table, like sales, and branches out to other related tables, such as customers, products, and dates. It’s like a star-studded romantic comedy where everyone has their role.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Eohfu_BNTZtRWLJyAs5_Pg.png" /><figcaption>Image by the author</figcaption></figure><p>Then we have the <strong>snowflake schema</strong>. It’s for those who love a bit of complexity and enjoy diving deep into their data. You take the star schema and add more intricate details, like regions and stores. It’s like a snowflake falling from the sky, each branch leading to more data. It may be chilly, but it’s worth it!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/950/1*I5PlzMlsk0_H8DSuW37GEA.png" /><figcaption>Image by the author</figcaption></figure><p>Lastly, we have the fact constellation schema. It’s like the grand finale, designed for huge data sets and multiple fact tables. You want to gather all the information you can get, right? Picture a constellation in the sky, connecting different points of data in a mesmerizing dance.</p><p>So there you have it, the wonderful world of relational data modeling with a humorous twist! With these concepts in your arsenal, you’ll be able to create loving relationships between your data and make them giggle with joy for years to come. Happy data matchmaking!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8569f08dc032" width="1" height="1" alt=""><hr><p><a href="https://blog.dataengineerthings.org/data-modeling-love-breakups-and-complicated-relationships-a-hilarious-guide-8569f08dc032">Data Modeling: Love, Breakups, and Complicated Relationships — A Hilarious Guide!</a> was originally published in <a href="https://blog.dataengineerthings.org">Data Engineer Things</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How I Went from Clueless to Confused in the Ever-Changing World of Data Engineering]]></title>
            <link>https://medium.com/@ndleah/how-i-went-from-clueless-to-confused-in-the-ever-changing-world-of-data-engineering-6b85dffb627f?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/6b85dffb627f</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[big-data]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Wed, 12 Jul 2023 07:32:17 GMT</pubDate>
            <atom:updated>2023-07-12T07:32:17.958Z</atom:updated>
            <content:encoded><![CDATA[<h4>From Noob to Hero, and Back Again</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*vx-FI3aWmaEbCyISz9-vIA.png" /></figure><p>Picture this: It’s a sunny day in the world of data professionals. Birds are chirping, and everyone is sipping their morning coffee while trying to catch up on the latest data engineering news. But wait, what’s that? Oh, it’s just another article proclaiming the arrival of a groundbreaking data tool that will revolutionize the industry. <strong>Sigh… Here we go again.</strong></p><h3>The Trend’s whack-a-mole</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*15qvCga4sMN2dxEQOo8Z5A.jpeg" /><figcaption>2023 MAD (Machine Learning, Artificial Intelligence &amp; Data)</figcaption></figure><p>Tell me, looking at the picture above, what do you see? Ah, behold the magnificent picture that represents the ever-so- “<strong>MAD</strong>” landscape of Machine Learning, Artificial Intelligence, and Data! It has managed to capture the hearts of two distinct types of individuals, each with their unique reactions. On one side, we have the enthusiasts, brimming with excitement, their minds racing with the countless tools they can devour and master. On the other side, we have the unfortunate souls who gaze upon this visual chaos, clutching a bucket in hand, as if ready to unleash the contents of their stomachs. I must confess, I’ve personally experienced both of these extreme states.</p><p>Initially, this bizarre realm appeared as a fascinating wonderland, where I, a mere nut crack in the vast expanse of data, could find my place. But as time went on, and the updates came in at an ever-increasing pace, I found myself growing more and more disenchanted with this so-called wonderland. Hadoop, Kafka, Docker, Kubernetes — the list goes on and on. And just when you think you’ve caught up with the latest tool, it slips away and another one comes crashing down. It’s like trying to catch a wave in the ocean. It’s a never-ending cycle of hype, disappointment, and confusion. But hey, at least I get to add more buzzwords to my LinkedIn profile, right?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*MZXKYlLV8GoVt_7jqYJQ_w.png" /></figure><p>Remember when Hadoop was hailed as the savior of big data? It was supposed to be the Twitter killer of the data world. But then came Spark, swooping in like Threads, stealing the spotlight and leaving Hadoop in the shadows. And let’s not forget about poor Flink, who was briefly hailed as the next big thing until something shinier came along.</p><p>In the midst of this chaos, I’ve come to realize that chasing every new tool is a fool’s errand.</p><h3><strong>Here’s How to Win</strong></h3><h3><strong>1. Prioritize the purpose behind the tools</strong></h3><p>Instead of chasing every new tool, ask yourself why you need it. Is it to cut down on costs? Improve efficiency? Or just to impress your boss at the next meeting? By understanding the purpose behind the tools, you can better evaluate which ones are truly necessary and which ones are just adding to the noise.</p><blockquote><strong>Example:</strong> f you’re working on a project that requires analyzing large amounts of data, you may want to invest in a tool like Apache Spark that can handle big data processing. On the other hand, if you’re working on a small project with a limited budget, a simpler tool like Microsoft Excel may be sufficient for your needs. 
By prioritizing the purpose behind the tools, you can make better decisions about which tools to invest your time and resources in.</blockquote><h3><strong>2. Master the fundamentals</strong></h3><p>No matter how many new tools emerge, the most foundational ones are often the most important. Focus on mastering the basics like SQL, Python, and Excel. This will give you a strong foundation of knowledge and understanding that will help you navigate the ever-changing landscape of data engineering.</p><h3><strong>3. Take a break from chasing the next big thing</strong></h3><p>Oh, how we love being told <strong>we need more tools</strong>. Because obviously, we don’t already have enough on our plates. The media insists on bombarding us with countless reasons why we need certain tools and how they can enhance our data game. Yawn. It’s like we hear the same phrase over and over again. And yet, we blindly follow with the thought, “Yeah, we <strong><em>might </em></strong>need that!” Been there, done that, and it always ends up being a waste of time.</p><p><strong>Here’s the catch:</strong> if you don’t feel like you need it, you don’t need it. I know, I know, it sounds too simple. But trust me, it works. Instead of mindlessly following the latest trends, ask yourself why you shouldn’t adopt them. It’s like a breath of fresh air in the midst of all the hype. So go ahead, take a deep breath, and embrace the power of saying no to unnecessary tools.</p><h3>4. Embrace a sense of humor</h3><p>When you feel like you’re drowning in a sea of buzzwords and acronyms, it’s easy to get frustrated and overwhelmed. A little humor goes a long way in keeping you sane. So, next time someone tries to one-up you with their knowledge of the latest tool, just casually drop a phrase like:</p><blockquote><strong>“Oh, yeah, I remember using that back when it was still in beta.”</strong></blockquote><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6b85dffb627f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Predict Next Month Transaction with Linear Regression (Final)]]></title>
            <link>https://medium.com/@ndleah/predict-next-month-transaction-with-linear-regression-final-3c4d6b62793b?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/3c4d6b62793b</guid>
            <category><![CDATA[data-modeling]]></category>
            <category><![CDATA[linear-regression]]></category>
            <category><![CDATA[feature-selection]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Fri, 28 Oct 2022 22:31:22 GMT</pubDate>
            <atom:updated>2022-10-28T22:42:59.026Z</atom:updated>
            <content:encoded><![CDATA[<h4>Feature Selection and Data Modelling</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*AkUt5UmKdIsaIxNT" /><figcaption>Photo by <a href="https://unsplash.com/@markuswinkler?utm_source=medium&amp;utm_medium=referral">Markus Winkler</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>A Rewind</h3><p>In this project, we have followed the CRISP-DM management approach to construct a comprehensive ML project framework. Some of the covered topics from the previous parts are illustrated as follows:</p><p>✅ Business understanding</p><p>✅ Data understanding</p><p>✅ Data preparation</p><p>❌ Modelling</p><p>❌ Evaluation</p><p>❌ Deployment</p><p>In this article, I will cover the rest of the approach, including — <strong>Modelling</strong>, <strong>Evaluation</strong> and <strong>Deployment</strong></p><p>Review previous parts —</p><ul><li><a href="https://medium.com/@ndleah/predict-next-month-transaction-with-linear-regression-part-1-917a79b6ae0c">Predict Next Month Transaction with Linear Regression (Part 1)</a></li><li><a href="https://medium.com/@ndleah/predict-next-month-transaction-with-linear-regression-part-2-7c7e6106b5e">Predict Next Month Transaction with Linear Regression (Part 2)</a></li></ul><p>GitHub code repository —</p><p><a href="https://github.com/ndleah/transactions">GitHub - ndleah/transactions: 🪙 Linear regression model, predict monthly transaction amount</a></p><h3><strong>Feature Selection Technique (FST)</strong></h3><p>In machine learning, it is essential to provide a pre-processed and high-quality input dataset in order to achieve better results. Typically, the dataset consists of half <strong>noisy</strong>,<strong> irrelevant</strong> data, and half <strong>useful data</strong>.</p><p>The massive amount of data slows down the training process of the model, and if there is noise or irrelevant data, the model may not accurately predict and perform. In order to eliminate these noises and unimportant data features from the dataset, Feature selection techniques need to be adopted and used wisely in order to retain only the best feature for the Machine Learning model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/550/1*LQV6UVulmVG7U3FEVJcglg.png" /><figcaption>Feature Selection Techniques</figcaption></figure><p>Some benefits of using feature selection in machine learning:</p><ul><li><strong>It helps in avoiding the curse of dimensionality.</strong></li><li><strong>It helps in the simplification of the model so that it can be easily interpreted by the researchers.</strong></li><li><strong>It reduces the training time.</strong></li><li><strong>It reduces overfitting hence enhances the generalization.</strong></li></ul><p>From <a href="https://medium.com/@ndleah/predict-next-month-transaction-with-linear-regression-part-2-7c7e6106b5e">Part 2</a>, we have discovered that the problem for this project is a Supervised Regression problem. This can be overcome by deploying a <strong>Linear Regression model</strong>. Therefore, this section will focus on explaining different terms and techniques used for Feature Selection for Supervised models with <strong>Wrapper Method</strong>.</p><h3><strong>A closer look at FST — Wrapper Method</strong></h3><p>Wrapper methodology approaches feature selection as a search problem, in which different combinations are created, evaluated, and compared to other combinations. 
It iteratively trains the algorithm using the subset of features.</p><p>Features are added or subtracted based on the model’s output, and the model is trained again with this feature set.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/340/1*wpukVhPeML8CQuKzBMVRyw.png" /></figure><p>Some techniques of wrapper methods are:</p><ul><li><strong>Forward selection</strong> — Forward selection is an iterative process which begins with an empty set of features. At each iteration, it adds one more feature and evaluates whether the performance improves. The process continues until adding a new variable/feature no longer improves the performance of the model.</li><li><strong>Backward elimination</strong> — Backward elimination is also an iterative approach, but it is the opposite of forward selection. This technique begins by considering all the features and removing the least significant one. The elimination continues until removing a feature no longer improves the performance of the model (a minimal R sketch of this idea follows right after this list).</li><li><strong>Exhaustive Feature Selection —</strong> Exhaustive feature selection evaluates every feature set by brute force. It tries each possible combination of features and returns the best-performing subset.</li><li><strong>Recursive Feature Elimination — </strong>Recursive feature elimination is a greedy optimization approach, where features are selected by recursively taking a smaller and smaller subset of features. An estimator is trained on each set of features, and the importance of each feature is determined using the <em>coef_</em> attribute or the <em>feature_importances_</em> attribute.</li></ul>
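<p>Before moving on, here is a minimal R sketch of how backward elimination can be run automatically. The data frame name train_df and the predictor set are placeholders, and the built-in step() function drops terms by AIC rather than by the manual R-squared comparison used below, so treat this as an illustration of the idea rather than the exact code of this project.</p><pre><em># illustrative sketch only — train_df and the predictor names are placeholders</em><br><em># fit the full model with all candidate predictors</em><br>full_model &lt;- lm(monthly_amount ~ date + month_number, data = train_df)<br><br><em># backward elimination: drop the weakest term at each step while the AIC keeps improving</em><br>reduced_model &lt;- step(full_model, direction = &quot;backward&quot;)<br><br>summary(reduced_model)</pre><p>In the sections below I apply the same idea by hand, comparing a few candidate models through their R-squared values.</p>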
<h3>Modelling</h3><p>The scope of this project consists of 2 areas of ML modelling:</p><ul><li><strong>Basic Model Fitting — </strong>Developing a linear regression model with monthly_amount as the target for industry = 1 and location = 1.</li><li><strong>Advanced Model Fitting — </strong>Developing a linear regression model with monthly_amount as the target for all industries and locations.</li></ul><h3>Basic Model Fitting</h3><p>In this section, numerous Multiple Linear Regression (MLR) models will be developed and assessed with various combinations of predictor variables, filtered to <strong>Location 1</strong> &amp; <strong>Industry 1</strong>. As for the approach, I will adopt the <strong>stepwise model selection</strong> method, using <strong>backward elimination</strong>.</p><h4>Model 1 — Full Model</h4><p>Firstly, I start with a full model, one with all possible covariates or predictors included, and then drop variables one at a time until a parsimonious model is reached. Note that even though we start with all variables, I exclude location, industry and year, as the data is already filtered to Location 1, Industry 1 and the years 2013-2015, and including them could overfit our MLR model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cClnZ6L9X-lBWe18_BuDsQ.png" /><figcaption>Basic Model Fitting: Model 1 — Output</figcaption></figure><p>The month number variable is introduced to account for the seasonality of the sales amount. As summarized in the linear model with the formula monthly_amount ~ date + month_number, this model performs quite impressively, with an Adjusted R-squared of 0.7457. In other words, approximately 74.57% of the variance in the training set is explained by the model.</p><h4>Model 2 — Fit the model with month_number variable</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WdxsMrmqzrL0VM-5By-5vg.png" /><figcaption>Basic Model Fitting: Model 2 — Output</figcaption></figure><p>Based on the Multiple R-squared value, our Model 2 can only account for approximately 54% of the variance. This indicates that fitting month_number alone provides only a moderate predictor of monthly_amount and performs worse than the first model. The p-value of 0.02583 also suggests that the month predictors alone are unlikely to be a good fit to the data.</p><h4>Model 3: Fit the model with date variable</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8w9WDp9cNvWA7caxZwA_PA.png" /><figcaption>Basic Model Fitting: Model 3 — Output</figcaption></figure><p>The third model, where we fit only the date variable, performs even worse: only 36% of the variability in the average monthly sales amount is explained, leaving a whopping 64% of the variance unexplained.</p><blockquote>In conclusion, <strong>Model 1</strong> provides the best fit of the three combinations. Thus, I will use this model to predict monthly_amount for December 2016.</blockquote><p>After choosing <strong>Model 1</strong> as the final model for Basic Model Fitting, my next step is to create a new data frame containing only 2016 records. I then made the prediction for the transaction amount in December 2016.</p><p>I will examine whether our December 2016 forecast is reasonable by plotting a line plot with the predicted data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oVjf4Q0hsjQW8XkaLNZz9A.png" /></figure><p>We can then quantify the residuals by calculating a number of commonly used evaluation metrics. I’ll focus on the following three (a short R sketch of how they can be computed follows below):</p><ul><li><strong>Mean Square Error (MSE):</strong> The mean of the squared differences between predicted and actual values. This yields a relative metric in which the smaller the value, the better the fit of the model.</li><li><strong>Root Mean Square Error (RMSE):</strong> The square root of the MSE. This yields an absolute metric in the same unit as the label. The smaller the value, the better the model.</li><li><strong>Coefficient of Determination (usually known as R-squared or R2):</strong> A relative metric in which the higher the value, the better the fit of the model. In essence, this metric represents how much of the variance between predicted and actual label values the model is able to explain.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F8ka1_Na7a5HveiXk8rqtA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IOrPXtAt3z8J7Ip_zRmLAw.png" /></figure><p>The performance on the prediction is significantly lower than the model’s performance on the training data. The R2 of 0.55 is lower than the 0.83 achieved on the training set, which reflects a weaker fit to the actual data. The prediction error RMSE is 11293, representing an error rate of ~6%, which is still good.</p>
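<p>For readers following along in R, the three metrics above can be reproduced in a few lines. This is a minimal sketch assuming a fitted model basic_model and a held-out data frame test_2016; both names are placeholders rather than the exact objects used in the project.</p><pre><em># illustrative sketch — basic_model and test_2016 are placeholder names</em><br>predicted &lt;- predict(basic_model, newdata = test_2016)<br>actual &lt;- test_2016$monthly_amount<br><br>mse &lt;- mean((actual - predicted)^2)   <em># mean squared error</em><br>rmse &lt;- sqrt(mse)                     <em># same unit as monthly_amount</em><br>r2 &lt;- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)</pre><p>The modelr and caret packages loaded at the start of this series also provide helpers for these metrics.</p>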
<h3>Advanced Model Fitting</h3><p>We want to apply our model (mean amount + date + month number) across all industries and geographical locations. To do this, I will construct a loop function called calculate_predictions to run everything through.</p><p>To be more specific, the loop function will do the following tasks:</p><ol><li>Train the model for each industry and location.</li><li>Include a column for December 2016 in the table.</li><li>Calculate the mean square error (MSE) and root mean square error (RMSE).</li><li>Make a December 2016 prediction.</li><li>Consolidate all data into a dataframe.</li></ol><p>We run all locations and industries through the model below:</p><ul><li>mean_monthly ~ time_number</li></ul><p>Here, time_number represents the date order.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RDbYesZZxqfXsiWLtoy0gg.png" /><figcaption>Worst industries and locations — assessed by RMSE score</figcaption></figure><p>Among the data sets, we picked the two worst-performing industry and location pairs by their highest RMSE:</p><ul><li><strong>Industry 6 &amp; Location 1</strong></li><li><strong>Industry 10 &amp; Location 8</strong></li></ul><p>In order to find potential reasons for the poor performance of these segments, I will retrain the model on these two industry and location pairs and then plot the models to see how they perform.</p><p>Let’s take a look at the diagnostic plots for these 2 components:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kEpu8S4HJrSbwuFfQ7mIvw.png" /><figcaption><em>The diagnostic plot — Industry 6 &amp; Location 1</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yndet9Bf2QlYlrVQoggFpw.png" /><figcaption><em>The diagnostic plot — </em>Industry 10 &amp; Location 8</figcaption></figure><p>We can see that both models contain outlier records. For industry 6 and location 1, one point sits far above the fitted model and two points sit far below it. For industry 10 and location 8, three outliers sit above the plotted line.</p><p>To confirm our theory, I plot another linear model plot for industry 10 &amp; location 8 and industry 6 &amp; location 1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p9gXla45VFbvlEW_RoLrJg.png" /></figure><p>In this plot, we get a clearer view of the prominent outliers contained in both models. It is also observed that the fitted line cannot capture the constant up-and-down movement of the monthly mean caused by seasonality, which could be a reason for the poor performance. Developing more advanced models that account for those fluctuations, and removing these outliers, would likely lead to a more accurate and powerful model.</p><p>Code of the project and relevant files-</p><p><a href="https://github.com/ndleah/transactions">GitHub - ndleah/transactions: 🪙 Linear regression model, predict monthly transaction amount</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3c4d6b62793b" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Predict Next Month Transaction with Linear Regression (Part 2)]]></title>
            <link>https://medium.com/@ndleah/predict-next-month-transaction-with-linear-regression-part-2-7c7e6106b5e?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/7c7e6106b5e</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[exploratory-data-analysis]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[feature-engineering]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Sat, 18 Jun 2022 03:44:53 GMT</pubDate>
            <atom:updated>2022-06-18T03:44:53.857Z</atom:updated>
            <content:encoded><![CDATA[<h4>Exploratory Data Analysis and Feature Engineering</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MI3v6YgaWGdSx8fh" /><figcaption>Photo by <a href="https://unsplash.com/@cgower?utm_source=medium&amp;utm_medium=referral">Christopher Gower</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In Part 1 of this blog, we worked on basic data analysis and understanding the transaction dataset following the CRISP-DM methodology. We found out that our problem is the supervised regression problem from looking at the data type of the target variable — monthly amount.</p><p>For the Part 1, please visit-</p><p><a href="https://medium.com/@ndleah/predict-next-month-transaction-with-linear-regression-part-1-917a79b6ae0c">Predict Next Month Transaction with Linear Regression (Part 1)</a></p><p>In this part, we will continue to work on the Exploratory Data Analysis of the data, which will help to uncover business insights for the later modelling stage as well as perform feature engineering for variables selection.</p><p>You can view all my code using for this project on <a href="https://github.com/ndleah/transactions">GitHub</a>.</p><h3>Exploratory Data Analysis (Part 2) — The Business Insights</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yw29g9IF7m2roUoWURpWsQ.png" /><figcaption>Transaction amount vs. transaction number trend over time</figcaption></figure><p>The number of transactions and the total amount of sales rose sharply throughout the years, from 2013 to 2017. The seasonal trend can be found in the total amount of sales while the up trend for the number of transactions is quite smooth.</p><p>We could see that there is a seasonal pattern, although the trend is not clear yet. To investigate more, we would make a yearly polar plot:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8KmLu1d90ftq-FzaFgNs6A.png" /><figcaption>Seasonal Trend Over the Years</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zUqsObgkB8LYh3B77qz8sA.png" /></figure><p>A closer examination reveals that the total volume of transaction amount increases significantly from January to October and subsequently decreases from November to the end of the year. This pattern can be ascribed to the fact that people trade less during the holidays, particularly during the month surrounding big holidays like Christmas and New Year.</p><p>This, however, might be based on a variety of different factors rather than on individual conclusions about each region or industry. As a result, additional information is required to substantiate these hypotheses.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XYNBWDioJZlnQuTpmaALMg.png" /><figcaption>Transaction amount by Location vs. Industry</figcaption></figure><p>When looking at the monthly amount by location and industry, it is not surprising that total sales of locations 1 and 2 increased significantly compared to other locations. Meanwhile, in terms of industry, industry 2, 3 and 1 shows rapid growth over the years while others’ progress is quite slow.</p><h3>Data Preparation</h3><h4>Feature Engineering</h4><p>When the data has been fully understood, data scientists generally need to go back to the data collection and data cleaning phases of the data science pipeline so as to transform the data set as per the expected business outcomes. 
To expand on the information already at hand and better represent it, the best practice is to perform <strong>Data Preparation</strong> or <em>Feature Engineering</em>: the creation of new features from the ones that already exist.</p><p>In this case study, the data will need to be modified as we will be applying a linear regression model later on.</p><pre><em># write a reusable function</em><br>aggregate_transactions &lt;- <strong>function</strong>(df) {<br>  <br>  <em># aggregate the data, grouping by date, industry and location, </em><br>  <em># and calculating the mean monthly_amount</em><br>  output = df %&gt;%<br>    group_by(date, industry, location) %&gt;%<br>    summarize(monthly_amount = mean(monthly_amount, na.rm = TRUE))<br>  <br>  <em># create a column for the month number and another one for year number</em><br>  output = output %&gt;%<br>    <em># create new column for month number</em><br>    mutate(month_number = format(as.Date(date), &quot;%m&quot;)) %&gt;%<br>    <em># create new column for year number</em><br>    mutate(year_number = format(as.Date(date), &quot;%Y&quot;))<br>  <br>  <em># Make sure the new columns are of the correct type</em><br>  output$month_number = as.character(output$month_number)<br>  output$year_number = as.character(output$year_number)<br>  <br>  <em># note: assign this result (output &lt;- transform(...)) for the integer conversion to persist;</em><br>  <em># as written, both columns stay character, as the preview below shows</em><br>  transform(output, month_number = as.integer(month_number), year_number = as.integer(year_number))<br>  <strong>return</strong>(output)<br>}<br><br><em># create a new variable that stores the new df with transformed features</em><br>aggregated_transactions &lt;- aggregate_transactions(df)<br><em># A tibble: 3,886 x 6</em><br><em># Groups:   date, industry [470]</em><br><em># date       industry location monthly_amount month_number year_number</em><br><em># &lt;date&gt;     &lt;chr&gt;    &lt;chr&gt;             &lt;dbl&gt; &lt;chr&gt;        &lt;chr&gt;      </em><br><em>#   1 2013-01-01 1        1               136081. 01           2013       </em><br><em># 2 2013-01-01 1        10              188735. 01           2013       </em><br><em># 3 2013-01-01 1        2               177840. 01           2013       </em><br><em># 4 2013-01-01 1        3               141632. 01           2013       </em><br><em># 5 2013-01-01 1        4               221058. 01           2013       </em><br><em># 6 2013-01-01 1        5               178138. 01           2013       </em><br><em># 7 2013-01-01 1        6               133400. 01           2013       </em><br><em># 8 2013-01-01 1        7               231599. 01           2013       </em><br><em># 9 2013-01-01 1        8               143778. 01           2013       </em><br><em># 10 2013-01-01 1        9               157416. 01           2013       </em><br><em># ... with 3,876 more rows</em><br><br><br><em># turn the df into a Markdown table format </em><br>rmarkdown::paged_table(aggregated_transactions)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WDGQri7gqCzwPgbAstlBpg.png" /><figcaption>A snapshot of new feature engineering variables</figcaption></figure><p>An aggregated data set is created using the fields date, industry and location, with the mean of the monthly amount. There are a total of 3,886 rows, each presenting a mean monthly amount, ranging from 2013 to 2016.</p><h4>Train-Test split</h4><p>Now that we have a new adjusted data set, I’m going to split the data into a train and a test set for the aim of building a prediction model. The train set includes three years of data, 2013 through 2015, while the test set holds out the last year, 2016 (a small dplyr sketch of this split is shown below).</p>
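<p>The split itself only takes a couple of lines. This sketch reuses the aggregated_transactions data frame and year_number column created above, though the exact filtering logic in the project notebook may differ.</p><pre><em># illustrative split — hold out the final year (2016) for testing</em><br>train_set &lt;- aggregated_transactions %&gt;% filter(as.integer(year_number) &lt;= 2015)<br>test_set &lt;- aggregated_transactions %&gt;% filter(as.integer(year_number) == 2016)</pre>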
<p>Additionally, we have 2 requirements for this assignment, which are:</p><ol><li><strong>Basic Model Fitting</strong>: Developing a linear regression model with monthly_amount as the target for industry = 1 and location = 1.</li><li><strong>Advanced Model Fitting</strong>: Developing a linear regression model with monthly_amount as the target for all industries and locations.</li></ol><p>I will generate an additional data set that filters only Industry 1 and Location 1 records. The train-test split for the Advanced Model Fitting section can be kept the same, as no further adjustments are needed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*E0VXhHUTtEM9VKtBpq0_GA.png" /></figure><p>As the new dataset is created, I will also use it to create a line plot of the variable monthly_amount for industry = 1 and location = 1, with the purpose of gaining more insights from the targeted areas.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aIi2gqH33ogSFtTpEN8wUA.png" /></figure><p>It is clear from the graph that there is a seasonal trend in the mean transaction amount of Industry 1 &amp; Location 1. More specifically, a downtrend at the end of the year is followed by an uptrend at the beginning of the year: December and January are low months for this industry and location, and sales bounce back from March to June. This pattern of fluctuation repeats each year over the span from 2013 to 2017. On average, the monthly mean amount of sales is increasing slowly over time.</p><p>However, it is worth mentioning that the year-end trend in 2016 was upward, which was the inverse of previous years. As a result, we will need to take a closer look at this occurrence by examining the amount of money moved by month for each year using the graphic below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I8zRgaxpFJuFngXo5VZ7IA.png" /></figure><p>As can be seen, the anomalous increase towards the end of 2016 is explained by a lack of transaction data for December 2016. This gives us another insight drawn from the trend chart above.</p><p>Code of the project and relevant files-</p><p><a href="https://github.com/ndleah/transactions">GitHub - ndleah/transactions: 🪙 Linear regression model, predict monthly transaction amount</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7c7e6106b5e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Predict Next Month Transaction with Linear Regression (Part 1)]]></title>
            <link>https://medium.com/@ndleah/predict-next-month-transaction-with-linear-regression-part-1-917a79b6ae0c?source=rss-7ee083e5e515------2</link>
            <guid isPermaLink="false">https://medium.com/p/917a79b6ae0c</guid>
            <category><![CDATA[exploratory-data-analysis]]></category>
            <category><![CDATA[data-wrangling]]></category>
            <category><![CDATA[linear-regression]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Leah Nguyen]]></dc:creator>
            <pubDate>Sat, 11 Jun 2022 03:39:09 GMT</pubDate>
            <atom:updated>2022-06-18T03:53:46.726Z</atom:updated>
            <content:encoded><![CDATA[<h4>Basic Exploration of the dataset</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*V-wyCavYLSQC5lNF" /><figcaption>Photo by <a href="https://unsplash.com/@joshappel?utm_source=medium&amp;utm_medium=referral">Josh Appel</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>In<strong><em>troduction</em></strong></h3><p>This article aims to analyse and provide insights from the monthly transaction data set to understand the customer transaction patterns better. The article also offers a study on the linear regression model, an essential concept in the field of machine learning and explains how this model can assist in the decision-making process of identifying trends in bank transactions within the years 2013–2016.</p><p>To well capture this information, the CRISP-DM management model is adopted to provide a structured planning approach to a data mining project with 6 high-level phases. In particular, these phases assist companies in comprehending the data mining process and serve as a road map for planning and executing a data mining project (Medeiros, 2021).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/319/1*2MNtsFaEYAWcm-ArHwfUaw.png" /><figcaption>Cross-Industry Standard Process for Data Mining (CRISP-DM project, 2000)</figcaption></figure><p>This study explores each of the six phases and the tasks associated with each in the following orders:</p><ol><li>Business understanding</li><li>Data understanding</li><li>Data preparation</li><li>Modelling</li><li>Evaluation</li><li>Deployment</li></ol><p>In the scope of this article, I will cover the first 2 points of the CRISP-DM: <strong>Business Understanding</strong> and <strong>Data Understanding </strong>(EDA — Part 1).</p><p>You can view all my code using for this project on <a href="https://github.com/ndleah/transactions">GitHub</a>.</p><h3><strong>Business Understanding</strong></h3><p><strong>Business Understanding</strong> is the first taken step in the CRISP-DM methodology. In this stage, the main task is to understand the purpose of the analysis and to provide a clear and crisp definition of the problem in respect of understanding the <em>Business objectives</em> and <em>Data mining objectives</em>.</p><p>In our case study, the posed question-related Business object paraphrased from the sales manager’s request is:</p><blockquote>What is driving the trends and increasing total monthly revenue?</blockquote><p>On the other hand, we wish to achieve the data mining object by applying data visualization tools to identify any underlying patterns from the dataset.</p><h3>Data Understanding</h3><p>Following that, the <strong>Data Understanding</strong> phase is where we focus on understanding the data collected to support the Business Understanding and resolve the business challenge (Wijaya, 2021). Data preprocessing and data visualization techniques play an essential role in this. Thus, I’m going to divide the section into 2 main components:</p><ol><li>Exploratory Data Analysis (Part 1) — The Dataset, including:</li></ol><ul><li><strong><em>Stage 1:</em></strong><em> Basic Exploration</em></li><li><strong><em>Stage 2:</em></strong><em> Univariate, Bivariate &amp; Multivariate Analysis</em></li></ul><p>2. 
Exploratory Data Analysis (Part 2) — The Business Insights</p><p>The data was imported into the software package R to construct visualizations representing the findings found during the analysis.</p><h3>Exploratory Data Analysis (Part 1) — The Dataset</h3><h4>Stage 1: Basic Exploration</h4><p>First, I will run the libraries which will be necessary for reading &amp; manipulating our data and then conducting the graphs.</p><pre><em>##----------------------------------------------------------------</em><br><em>##  Load the Libraries                                          --</em><br><em>##----------------------------------------------------------------</em><br><strong>library</strong>(here)            <em># assess the file path</em><br><strong>library</strong>(DataExplorer)    <em># EDA visualizations</em><br><strong>library</strong>(tidyverse)       <em># data wrangling</em><br><strong>library</strong>(kableExtra)      <em># write table</em><br><strong>library</strong>(bannerCommenter) <em># create comment banner</em><br><strong>library</strong>(ggplot2)         <em># data visualization</em><br><strong>library</strong>(forecast)        <em># times-series forecasting</em><br><strong>library</strong>(ggradar)         <em># plot seasonal trend</em><br><strong>library</strong>(sqldf)           <em># using SQL</em><br><strong>library</strong>(dplyr)           <em># data processing</em><br><strong>library</strong>(ggpubr)          <em># combine plots into single page</em><br>theme_set(theme_pubr())<br><strong>library</strong>(reshape2)        <em># transpose table</em><br><strong>library</strong>(fmsb)            <em># create radar chart</em><br><strong>library</strong>(modelr)          <em># computing regression model performance metrics</em><br><strong>library</strong>(caret)           <em># streamline the model training process</em><br><strong>library</strong>(xts)             <em># convert df to ts object</em></pre><p>Once libraries are loaded, we explore the data with the goal of understanding its dimensions, data types, and distribution of values. In this assignment, a time series data set of financial transactions was used as the major source of data. The attributes information is specifically presented as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o4CJn7fYPHWHVWuXJ1T44A.png" /><figcaption>Data Description</figcaption></figure><p>After having a good idea of the data description, I want to have an understanding of what the data look like in general. TheDataExplorer package can help to retrieve that piece of information within a few lines of code:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/588/1*c3_miQaEglIvKN2Cnkk6kQ.png" /><figcaption>Data preview</figcaption></figure><p>As apparent from the table, the data records 470,000+ observations across 5 columns, which are equivalent to 94,000+ bank transactions. The 5 features contained in this data set including date, customer_id, industry, location, monthly_amount, clearly indicate the total transaction amounts for customers each month spanning a 3-year period over a range of industries and locations. Therefore, no further justification needs to be made on column names.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U1eVnZM15AodDEHsPl2-nw.png" /><figcaption>Data columns inspection</figcaption></figure><p>It is also worthwhile to note that features are made up in multiple formats that include both numerical and time-series data. 
However, the output shows that the date column has the wrong data type and will need to be converted to a date format later.</p><p>Additionally, I investigate further by looking at the response field. Recalling the business question, we expect to use the monthly_amount column as the target field, since our goal is to predict the monthly transaction value for next month. Since the observations in this column are continuous, I can conclude that our problem is a supervised regression problem. Knowing this is essential for selecting the right Machine Learning model in the later stages of this report.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kDoA7RcBh5IzzlX9cKJJsQ.png" /><figcaption>Plot missing values</figcaption></figure><p>The plot shows that there are no missing values in any field of the data. Nevertheless, some data sets encode missing observations in categorical/character columns as a new category such as &quot;NA&quot;, &quot;NULL&quot;, etc., so there is a chance we could miss such observations, which would distort the real data distribution. Consequently, a further check of the missing values in our categorical columns needs to be made to confirm this observation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*daR4rtIoSV4xIo9-LEwW3w.png" /></figure><p>The code output below shows that no such missing-value category exists in the categorical columns. Thus, we can confirm our hypothesis that there are no missing values in either the numerical or categorical columns of this data set. Furthermore, it also indicates that there is 1 row containing an odd value in the monthly_amount column that will need to be resolved.</p><h4>Stage 2: Univariate, Bivariate &amp; Multivariate Analysis</h4><p>To evaluate the impact of each feature on the phenomenon, a univariate, bivariate, and multivariate analysis is performed with all features.</p><blockquote><strong><em>Univariate: Check the distribution of each field</em></strong></blockquote><p>The univariate analysis is the study of the data distribution. In research from Sharma (2020), the distributions of the independent variable and the target variable are assumed to be crucial components in building linear models. Therefore, understanding the skewness of the data helps us in creating better models.</p><p>Firstly, I will plot a histogram to check which industry and location groups account for the largest share of observations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GWKX3ncUOLjjxDPUc4uBCQ.png" /><figcaption>Distribution histogram</figcaption></figure><p>As can be seen from the plot, locations 1 and 2 make the top contributions among locations, while industries 2 and 1 occupy the highest frequency distribution among industries. These results imply that the model can perform better at predicting the total transaction amount for next month for locations 1 and 2 and/or industries 1 and 2.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Qe6MgT8Pm1mIIevkYx_PPw.png" /><figcaption>Boxplot to check for outliers when plotting Monthly Amount against Location &amp; Industry</figcaption></figure><p>Next, the boxplot of sale transactions by industry and location presents their high variance, with a considerable number of outliers. 
The median amount of spending per customer for industry 6 and 9 are highest, over 500,000 while the lowest ones belong to industry 1 and 10, at less than 200,000. In terms of locations, most of the locations had a median amount of spending of less than 500,000.</p><blockquote><strong><em>Bivariate Analysis: Relationship between each column and target field &amp; Collinearity</em></strong></blockquote><p>After having known the distribution of our transaction dataset, it is essential to also check for correlation and collinearity assumptions between fields in the Bivariate Analysis. Some basic transformations regarding data types are performed beforehand for the sake of plotting visualizations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jFmRyJuR9ekHMKBpVawVrQ.png" /><figcaption>Correlation plot</figcaption></figure><p>Having known this information is essentially important to gain a better understanding of the transaction data set and provide great insights for transforming data in the later stage.</p><p>For the Part 2, please visit-</p><p><a href="https://medium.com/@ndleah/predict-next-month-transaction-with-linear-regression-part-2-7c7e6106b5e">Predict Next Month Transaction with Linear Regression (Part 2)</a></p><h3>References</h3><ol><li>Medeiros, L. (2021, December 19). The CRISP-DM methodology — Lucas Medeiros. Medium. <a href="https://medium.com/@lucas.medeiross/the-crisp-dm-methodology-d1b1fc2dc653">https://medium.com/@lucas.medeiross/the-crisp-dm-methodology-d1b1fc2dc653</a></li><li>Sharma, A. (2020, December 23). What is Skewness in Statistics? | Statistics for Data Science. Analytics Vidhya. <a href="https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/Wijaya">https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/Wijaya</a>, C. Y. (2021, December 19).</li><li>CRISP-DM Methodology For Your First Data Science Project. Medium. <a href="https://towardsdatascience.com/crisp-dm-methodology-for-your-first-data-science-project-769f35e0346c">https://towardsdatascience.com/crisp-dm-methodology-for-your-first-data-science-project-769f35e0346c</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=917a79b6ae0c" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>