<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by John Elisa on Medium]]></title>
        <description><![CDATA[Stories by John Elisa on Medium]]></description>
        <link>https://medium.com/@johnelisaaa?source=rss-d1bc041ef9a3------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*w8IHLPq1mQLS7JjGhKXJrg.png</url>
            <title>Stories by John Elisa on Medium</title>
            <link>https://medium.com/@johnelisaaa?source=rss-d1bc041ef9a3------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 09:29:35 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@johnelisaaa/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Azure & Power BI Project: Build a Retail Data Warehouse from Scratch]]></title>
            <link>https://medium.com/@johnelisaaa/azure-power-bi-project-build-a-retail-data-warehouse-from-scratch-97ec96c4ba3c?source=rss-d1bc041ef9a3------2</link>
            <guid isPermaLink="false">https://medium.com/p/97ec96c4ba3c</guid>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[etl-pipeline]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[azure-devops]]></category>
            <category><![CDATA[cloud-services]]></category>
            <dc:creator><![CDATA[John Elisa]]></dc:creator>
            <pubDate>Wed, 01 Oct 2025 04:21:43 GMT</pubDate>
            <atom:updated>2025-10-01T04:21:43.567Z</atom:updated>
            <content:encoded><![CDATA[<p>Before starting the tutorial, download these .csv files: <a href="https://drive.google.com/drive/folders/10BKlhBGrjFuPNvgBb5czxNiil5R8xXUy?usp=sharing">https://drive.google.com/drive/folders/10BKlhBGrjFuPNvgBb5czxNiil5R8xXUy?usp=sharing</a></p><p><em>These CSV files are short and simple, but nonetheless will be sufficient for the purposes of the tutorial.</em></p><h3>1. The Story: Why We Are Building This</h3><p>Bobby is an Operations Manager for a growing retail chain called “BobbyMart”. As the business grows, he wants to start making data-driven decisions.</p><h4>1.1 How Retail Data is Commonly Handled</h4><p>A common situation in the business world, especially in sectors like retail, is that it’s rare for a single system to handle everything. Data is almost always separated into at least two categories: <strong>Transactional Data</strong> and <strong>Master Data</strong>.</p><p><strong>Transactional Data</strong> is the high-volume data generated by daily operations. For a retailer, this comes directly from the Point-of-Sale (POS) system.</p><ul><li><strong>Legacy/On-Premise POS Systems:</strong> Commonly, at the end of the business day, the system will automatically generate a file — most often a CSV — containing every single sales transaction from that day.</li><li><strong>Modern/Cloud POS Systems (e.g., Shopify, Square):</strong> These systems often have APIs for real-time data access and cloud storage.</li></ul><p><strong>Master Data</strong> is the descriptive, low-volume data that doesn’t change often. This data is almost always <strong>stored in a relational SQL database</strong> that is part of a larger system.</p><h4>1.2. Why build a data warehouse?</h4><p>You might be thinking: isn’t it possible to answer critical questions like “What are our top 10 selling products?” with a standard SQL database?</p><p>All we would need to do is set up a MySQL server instance, create tables, load the CSV data into the tables, and run a single query to get the answer.</p><p><strong>So, why build a full data warehouse?</strong></p><p>A standard database (like MySQL) is an <strong>OLTP (Online Transaction Processing) system</strong>. It’s designed to be fast and efficient at handling thousands of small transactions per second, but it is not designed to handle huge, complex analytical queries that scan millions of rows.</p><p>A Data Warehouse (like Azure Synapse) is an <strong>OLAP (Online Analytical Processing) system.</strong> It is specifically architected to be incredibly fast at running large, complex analytical queries across massive datasets.</p><p><strong>A simple SQL approach requires manual work every time.</strong> Creating a pipeline to automate the whole process ensures proper data transformation, reliability, and quality.</p><p>When a product category changes, a transactional (OLTP) database usually just overwrites the value. A data warehouse <strong>(OLAP) instead uses Slowly Changing Dimensions (SCDs) to preserve history.</strong> We don’t need to understand all the details of that yet; just knowing it exists is good for now.</p>
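<p>To make that idea concrete, here is a minimal sketch of what an SCD Type 2 change could look like. The table and columns (DimProduct_SCD, EffectiveDate, ExpiryDate, IsCurrent) are hypothetical and are not part of this tutorial’s schema; they only illustrate how a change adds a row instead of overwriting one:</p><pre>-- Hypothetical SCD Type 2 dimension: a category change adds a new row<br>-- instead of overwriting the old one, so history is preserved.<br>CREATE TABLE DimProduct_SCD (<br>    ProductKey      INT IDENTITY(1,1) NOT NULL,<br>    SourceProductID VARCHAR(50) NOT NULL,<br>    Category        VARCHAR(50),<br>    EffectiveDate   DATE NOT NULL, -- when this version became valid<br>    ExpiryDate      DATE,          -- NULL while this version is current<br>    IsCurrent       BIT NOT NULL<br>);<br><br>-- Step 1: close off the old version of the product...<br>UPDATE DimProduct_SCD<br>SET ExpiryDate = &#39;2025-09-30&#39;, IsCurrent = 0<br>WHERE SourceProductID = &#39;PROD-001&#39; AND IsCurrent = 1;<br><br>-- Step 2: ...then insert the new version alongside it.<br>INSERT INTO DimProduct_SCD (SourceProductID, Category, EffectiveDate, ExpiryDate, IsCurrent)<br>VALUES (&#39;PROD-001&#39;, &#39;Home Goods&#39;, &#39;2025-10-01&#39;, NULL, 1);</pre><p>Old sales rows keep pointing at the old version’s key, so historical reports still reflect the category that was true at the time of sale.</p>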
<h3>2. Azure &amp; DevOps Setup</h3><h4>2.1. Create Azure DevOps Project:</h4><p>Go to <a href="http://dev.azure.com">dev.azure.com</a> and click the “Get started with Azure” button. You need to sign in &amp; have an active subscription, which Azure provides a free trial for.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ypt17OsiXQ2mpoCm5TIdVg.png" /></figure><p>You will be redirected to the main Azure portal. Then, click the “Azure DevOps organizations” service.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*17hIGAZNXiOqtppn2Pc1jg.png" /></figure><p>Then, click the “Create new organization” button. An organization is basically a container for all your future projects; name it after yourself or your company.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RGUI61M0RM-Igee6Eq41Ow.png" /></figure><p>We can now create a project. I’ll name mine “Retail-Data-Warehouse”, keeping the visibility as private.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*au4tc6BSyV1TYQyASEwsRg.png" /></figure><p>Navigate to “Repos” and initialize a main branch with a README file.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HUpdqQ21WYd4mIhG55OHUQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xyOKkGhzZejWD0giL5LUqg.png" /></figure><p>Then, create a new folder “sql-scripts” for our .sql files. Git folders can’t be empty, so just add a text file in there for now. Ensure you commit the change.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pZGgTtnXKioJH7z_t2Cl4A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EVcGyLrjwIoRzZ_XHsSYTw.png" /></figure><h4><strong>2.2. Create Azure Resource Group:</strong></h4><p>Great! Now that we’ve set up our repo, let’s head back to the Azure portal (portal.azure.com).</p><p>A Resource Group is used for:</p><ul><li><strong>Access control</strong>: Assigning the Engineering team a “Contributor” role and the Finance team a “Reader” role.</li><li><strong>Lifecycle management</strong>: If our project has 10 different Azure services, deleting the entire resource group saves us from having to shut down all of the services manually, one at a time.</li><li><strong>Billing &amp; Cost Management</strong>: In large companies, different departments have to pay for their own cloud usage (e.g., running a report comparing the cost of the resource groups “retail-data-warehouse” vs “webapp-prod”).</li></ul><p>Click on the “Resource groups” option.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*laEH3Ph78IAsaLXp34iKwg.png" /></figure><p>Create a new Azure resource group. Give it a descriptive name like “rg-retail-dw-project” (rg = resource group; dw = data warehouse) and choose your region.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*06XyOdZRnlpAiwclLoxrlg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZdB3pAWY2GFPDjSGj-DBUw.png" /></figure>
<h4>2.3. Create Storage Account (Data Lake):</h4><p>This is a highly scalable cloud storage service, which will be the central landing zone for all our raw, unprocessed data (our .csv files).</p><p>Then, go to your newly created resource group and click “Create”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UxR5wGcqMU3Nmx3rchoEjg.png" /></figure><p>Search for “Storage Account”, choose the one from Microsoft, and click “Create”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YzzpKDZmAZIE1EQXfD_mZg.png" /></figure><p>Name it something unique like “stadatalakebobbymart” (st = storage; adatalake = its purpose; bobbymart = the retail store’s name).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KjeC_fq1zRgFZj6aFDsONQ.png" /></figure><p>Crucially, on the “Advanced” tab, <strong>check “Enable hierarchical namespace”</strong> (we’ll explain later on why this is crucial).</p><p>Then, click on Review + Create.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Pdc0nW6sHOi0EoHF3V-k-A.png" /></figure><p>Once created, go to the resource, navigate to “Containers”, and create a container named raw-data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PQOxCGBG2CviEQnjmQsaCA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*zKHui2ELnd_iLyTsMLmTvg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NaqKHvCjJAx9wJpil0aepQ.png" /></figure><p>Inside the “raw-data” container, upload the three CSV files you downloaded earlier.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wIMlsGszpXPeZFW2UCBtiw.png" /></figure><p>Enabling “hierarchical namespace” is the specific action that transforms a standard Azure Storage Account into an Azure Data Lake Storage Gen2 (ADLS Gen2).</p><p>Without it, when we use the portal to upload the .csv files into a “folder” we named “raw-data”, the system doesn’t create a real folder. It simply creates a blob and gives it the literal name “raw-data/products.csv”.</p><p>This flat layout causes problems for a few reasons:</p><ul><li><strong>Listing Files is Slow:</strong> If we want to get all the files in the raw-data folder, the system can’t just open a directory. It has to scan the name of every single blob in the entire container to find the ones that start with the prefix “raw-data/”.</li><li><strong>Renaming “Folders” is Impossible:</strong> You cannot perform a single operation to rename “raw-data” to “processed-data”. You would have to manually run a script that copies every single one of your files to a new blob with the new name prefix and then deletes the old ones.</li><li><strong>Directory-Level Security Doesn’t Exist:</strong> You can’t apply a security rule like “Grant the sales team access to the raw-data folder.” Your only options are to grant them access to the entire container or to manage permissions for every single file individually.</li></ul><p>Enabling hierarchical namespace is great for scalability and efficiency. 
Although we don’t need it for this particular project, it’s worth knowing about for when we work with big data.</p><h4>2.4: Create Azure Data Factory (ADF):</h4><p>ADF tells other services when and how to perform their tasks.</p><p>In the resource group, click “Create”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iz7JUs9U4bdjuAZBljF04g.png" /></figure><p>Search for “Data Factory”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c7Y3GSQ2eCcrPiLIE3nA1A.png" /></figure><p>Name it “adf-retail-dw-project” (dw = data warehouse).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uN9c6Kwu83gGh_6n7HsELg.png" /></figure><p>Click “Review + create”.</p><h4>2.5: Create Azure Synapse Analytics Workspace:</h4><p>This serves as both the final, structured data warehouse and the workbench for querying and analysing that data.</p><p>In the resource group, click “Create”. Search for “Azure Synapse Analytics”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*o0okjtf14OtTnjVoyoXbTg.png" /></figure><p>Name the workspace “synw-retail-dw-project”.</p><p>Create a new Data Lake Storage Gen2 account when prompted (this is for Synapse’s own use); it is separate from the data lake we made earlier, which holds our raw CSV data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/694/1*rYThI7vnfg-XpmP6LGcR4w.png" /></figure><p>During creation, set the SQL administrator password. Remember this password!</p><p>Click “Review + create”. This can take a few minutes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EIhaerqCq5UP4jCe5ypANg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0m5fkNzT2x-mA8JO60_rfg.png" /></figure><h3>3. Data Modeling — Build the Warehouse Structure</h3><blockquote>Ensure you pause your dedicated SQL pool to save your free credits when you’re not using it. Navigate to your Synapse workspace, go to SQL pools, select your pool, and click Pause.</blockquote><p>First, we need to create a dedicated SQL pool.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U4FDtkG77VgE5Rs16PdqJg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/933/1*8jF4otmKVG1QRbAFLe7Dag.png" /><figcaption>Ensure the performance level slider is all the way to the left for this project</figcaption></figure><p>Click the “Review + create” button, then “Create”. The deployment will begin; it can take 5–10 minutes for the dedicated SQL pool to be ready.</p><p>In the meantime, let’s take a look at what our .csv files look like:</p><p>products.csv:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/296/1*Vd2IF_v9o0ugCodXpsT_BQ.png" /></figure><p>stores.csv:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/374/1*t0BJEyt2-FHm5eR2blPCjQ.png" /></figure><p>sales_*.csv:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/427/1*_Xvm-139WdyvEVGhxO-rmQ.png" /></figure><p>Let’s reiterate our goal here: transform raw CSV files into a fast, interactive dashboard that enables data-driven decisions. We’ve now finished setting up an empty warehouse, so it’s time to define its structure and blueprint.</p><p>This is a prerequisite for ingesting data. 
We first want to create tables that match our CSV data structure, and then INSERT data into them.</p><p>So, let’s open Synapse Studio:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YmWvSYa5dYfXrOOVsC2AJQ.png" /></figure><p>Click the “Develop” tab, and create 3 SQL scripts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*587gNSQemncZQoUxfh1Odg.png" /></figure><blockquote>For each of these SQL scripts, ensure you are connected to the dedicated SQL pool you created earlier before running them.</blockquote><p>When you create a Synapse Workspace, you get two different SQL engines (serverless &amp; dedicated). We’ll be using the dedicated SQL pool.</p><p>A <strong>Dedicated SQL Pool</strong> is a traditional, stateful database. You can do everything (CREATE, INSERT, UPDATE, DELETE), and the data is permanently stored inside the pool.</p><p>The <strong>Serverless SQL Pool</strong> is a stateless query engine. It doesn’t store any data itself. Its only job is to run SELECT queries on files that live outside of it, in our data lake. We cannot run UPDATE or DELETE commands because we’re just reading external files, not managing data in a database.</p><h4>SQL Script 1:</h4><p>This script builds the foundational <strong>structure</strong> of our final, optimized data warehouse. It creates the three permanent, internal tables (DimProduct, DimStore, FactSales) that will hold the clean and structured data ready for analysis.</p><pre>-- Drop existing tables<br>IF OBJECT_ID(&#39;dbo.FactSales&#39;, &#39;U&#39;) IS NOT NULL<br>    DROP TABLE dbo.FactSales;<br>GO<br><br>IF OBJECT_ID(&#39;dbo.DimProduct&#39;, &#39;U&#39;) IS NOT NULL<br>    DROP TABLE dbo.DimProduct;<br>GO<br><br>IF OBJECT_ID(&#39;dbo.DimStore&#39;, &#39;U&#39;) IS NOT NULL<br>    DROP TABLE dbo.DimStore;<br>GO<br><br>-- Create tables fresh<br>-- Dimension for Products<br>CREATE TABLE DimProduct (<br>    ProductKey INT IDENTITY(1,1) NOT NULL,<br>    SourceProductID VARCHAR(50) NOT NULL,<br>    ProductName VARCHAR(100) NOT NULL,<br>    Category VARCHAR(50),<br>    CONSTRAINT PK_DimProduct PRIMARY KEY NONCLUSTERED (ProductKey) NOT ENFORCED<br>)<br>WITH (<br>    DISTRIBUTION = REPLICATE,<br>    CLUSTERED COLUMNSTORE INDEX<br>);<br><br>-- Dimension for Stores<br>CREATE TABLE DimStore (<br>    StoreKey INT IDENTITY(1,1) NOT NULL,<br>    SourceStoreID VARCHAR(50) NOT NULL,<br>    StoreName VARCHAR(100) NOT NULL,<br>    State VARCHAR(50),<br>    CONSTRAINT PK_DimStore PRIMARY KEY NONCLUSTERED (StoreKey) NOT ENFORCED<br>)<br>WITH (<br>    DISTRIBUTION = REPLICATE,<br>    CLUSTERED COLUMNSTORE INDEX<br>);<br><br>-- Fact Table for Sales<br>CREATE TABLE FactSales (<br>    SalesKey INT IDENTITY(1,1) NOT NULL,<br>    ProductKey INT NOT NULL,<br>    StoreKey INT NOT NULL,<br>    SaleTimestamp DATETIME,<br>    Quantity INT NOT NULL,<br>    UnitPrice DECIMAL(10, 2) NOT NULL,<br>    TotalSaleAmount DECIMAL(10, 2) NOT NULL,<br>    CONSTRAINT PK_FactSales PRIMARY KEY NONCLUSTERED (SalesKey) NOT ENFORCED<br>)<br>WITH (<br>    DISTRIBUTION = HASH(ProductKey),<br>    CLUSTERED COLUMNSTORE INDEX<br>);</pre><p>To learn more about T-SQL, I recommend watching:</p><ul><li><a href="https://www.youtube.com/watch?v=k9DpO91W76o">https://www.youtube.com/watch?v=k9DpO91W76o</a></li><li><a href="https://www.youtube.com/watch?v=BxAj3bl00-o">https://www.youtube.com/watch?v=BxAj3bl00-o</a></li></ul><p>The naming convention (<em>DimProduct, FactSales, SourceProductID</em>) is a best practice that comes from a data 
warehousing design pattern called a <strong>Star Schema.</strong></p><p>In a star schema, you have two types of tables:</p><ul><li><strong>Dimension Tables (Dim)</strong>: These tables hold descriptive, contextual information about our data. DimProduct describes our products and DimStore describes our stores; their role is to provide context.</li><li><strong>Fact Tables (Fact):</strong> This is the central table that holds the numbers; FactSales stores the sales transactions.</li></ul><h4>SQL Script 2:</h4><p>This script creates a direct connection between Synapse and the raw CSV files stored in our data lake. It defines “external tables” that act as pointers, allowing us to read and query the raw source data using SQL without having to import it first.</p><pre>-- Step 1: Create Database Scoped Credential<br>CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential<br>WITH IDENTITY = &#39;Managed Identity&#39;;<br>GO<br><br>-- Step 2: Create External Data Source<br>CREATE EXTERNAL DATA SOURCE RawDataSource<br>WITH (<br>    TYPE = HADOOP,<br>    LOCATION = &#39;abfss://raw-data@&lt;your-data-lake-name&gt;.dfs.core.windows.net&#39;,<br>    CREDENTIAL = AzureStorageCredential<br>);<br>GO<br><br>-- Step 3: Create External File Format<br>CREATE EXTERNAL FILE FORMAT CsvFormat<br>WITH (<br>    FORMAT_TYPE = DELIMITEDTEXT,<br>    FORMAT_OPTIONS (<br>        FIELD_TERMINATOR = &#39;,&#39;,<br>        STRING_DELIMITER = &#39;&quot;&#39;,<br>        FIRST_ROW = 2,<br>        USE_TYPE_DEFAULT = FALSE<br>    )<br>);<br>GO<br><br>-- Step 4: Create External Tables<br>CREATE EXTERNAL TABLE ext_products (<br>    ProductID VARCHAR(50),<br>    ProductName VARCHAR(100),<br>    Category VARCHAR(50)<br>)<br>WITH (<br>    LOCATION = &#39;/products.csv&#39;,<br>    DATA_SOURCE = RawDataSource,<br>    FILE_FORMAT = CsvFormat<br>);<br>GO<br><br>CREATE EXTERNAL TABLE ext_stores (<br>    StoreID VARCHAR(50),<br>    StoreName VARCHAR(100),<br>    State VARCHAR(50)<br>)<br>WITH (<br>    LOCATION = &#39;/stores.csv&#39;,<br>    DATA_SOURCE = RawDataSource,<br>    FILE_FORMAT = CsvFormat<br>);<br>GO<br><br>CREATE EXTERNAL TABLE ext_sales (<br>    SaleTimestamp DATETIME,<br>    ProductID VARCHAR(50),<br>    StoreID VARCHAR(50),<br>    Quantity INT,<br>    UnitPrice DECIMAL(10, 2)<br>)<br>WITH (<br>    LOCATION = &#39;/sales_2025-09-23.csv&#39;,<br>    DATA_SOURCE = RawDataSource,<br>    FILE_FORMAT = CsvFormat<br>);<br>GO<br></pre><p>Here’s what’s going on:</p><pre>CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential<br>WITH IDENTITY = &#39;Managed Identity&#39;;</pre><p>This step creates a security credential. Instead of using a password, it uses the Managed Identity of your Synapse workspace.</p><pre>CREATE EXTERNAL DATA SOURCE RawDataSource<br>WITH (<br>    TYPE = HADOOP,<br>    LOCATION = &#39;abfss://raw-data@&lt;your-data-lake-name&gt;.dfs.core.windows.net&#39;,<br>    CREDENTIAL = AzureStorageCredential<br>);<br>GO</pre><p>This simply gives a name (RawDataSource) to the physical location of your data lake for easy access.</p><pre>CREATE EXTERNAL FILE FORMAT CsvFormat<br>WITH (<br>    FORMAT_TYPE = DELIMITEDTEXT,<br>    FORMAT_OPTIONS (<br>        FIELD_TERMINATOR = &#39;,&#39;,<br>        STRING_DELIMITER = &#39;&quot;&#39;,<br>        FIRST_ROW = 2,<br>        USE_TYPE_DEFAULT = FALSE<br>    )<br>);<br>GO</pre><p>This tells Synapse how to read the files. 
It’s a set of instructions:</p><ul><li><em>FORMAT_TYPE = DELIMITEDTEXT:</em> The file uses a character to separate values.</li><li><em>FIELD_TERMINATOR = ‘,’:</em> The character used is a comma.</li><li><em>FIRST_ROW = 2:</em> This is important. It tells Synapse to skip the first line (the header row) and start reading the actual data from the second line.</li></ul><pre>CREATE EXTERNAL TABLE ext_products ( ... )<br>WITH ( LOCATION = &#39;/products.csv&#39;, ... );</pre><p>This is the final step where everything comes together.</p><p><em>CREATE EXTERNAL TABLE</em> does not create a real table that stores data inside your database. Instead, it creates a metadata-only object that combines:</p><ul><li><em>The Schema: </em>The column names and data types (ProductID VARCHAR(50), etc.).</li><li><em>The Location:</em> The specific file to point to (/products.csv).</li><li><em>The Address: </em>Where to find that file (using RawDataSource).</li><li><em>The file format:</em> How to read that file (using CsvFormat).</li></ul><p>After running this, you can write a query like<em> “SELECT * FROM ext_products”</em>, and Synapse will, in real-time, go to the data lake, open products.csv, use the rules from CsvFormat to read it, and show you the results as if it were a normal SQL table.</p><h4>SQL Script 3:</h4><p>This script packages the entire data transformation and loading process into a single, repeatable command. It creates a stored procedure that extracts data from the external tables, transforms it by joining tables and calculating new columns, and loads the final, clean results into our permanent warehouse tables.</p><pre>-- Create the procedure<br>CREATE PROCEDURE sp_LoadRetailData<br>AS<br>BEGIN<br>    SET NOCOUNT ON;<br><br>    TRUNCATE TABLE DimProduct;<br>    TRUNCATE TABLE DimStore;<br>    TRUNCATE TABLE FactSales;<br>    <br>    -- Load DimProduct from external table<br>    INSERT INTO DimProduct (SourceProductID, ProductName, Category)<br>    SELECT ProductID, ProductName, Category<br>    FROM ext_products;<br>    <br>    -- Load DimStore from external table<br>    INSERT INTO DimStore (SourceStoreID, StoreName, State)<br>    SELECT StoreID, StoreName, State<br>    FROM ext_stores;<br>    <br>    -- Load FactSales from external table<br>    INSERT INTO FactSales (ProductKey, StoreKey, SaleTimestamp, Quantity, UnitPrice, TotalSaleAmount)<br>    SELECT<br>        dp.ProductKey,<br>        ds.StoreKey,<br>        rs.SaleTimestamp,<br>        rs.Quantity,<br>        rs.UnitPrice,<br>        rs.Quantity * rs.UnitPrice AS TotalSaleAmount<br>    FROM ext_sales rs<br>    JOIN DimProduct dp ON rs.ProductID = dp.SourceProductID<br>    JOIN DimStore ds ON rs.StoreID = ds.SourceStoreID;<br>END;<br>GO</pre><p>Here’s what’s going on:</p><pre>TRUNCATE TABLE DimProduct;<br>TRUNCATE TABLE DimStore;<br>TRUNCATE TABLE FactSales;</pre><p>The first thing it does is completely empty out your three main data warehouse tables. The TRUNCATE command is a very fast way to delete all rows. This ensures that every time the procedure runs, you are starting fresh with the latest data and not mixing it with old data.</p><pre>INSERT INTO DimProduct ... SELECT ... FROM ext_products;<br>INSERT INTO DimStore ... SELECT ... FROM ext_stores;</pre><p>Next, it performs a simple data copy. 
It reads all the rows from our external “pointer” tables (ext_products and ext_stores) and inserts them directly into our permanent dimension tables (DimProduct and DimStore).</p><pre>INSERT INTO FactSales (ProductKey, StoreKey, SaleTimestamp, Quantity, UnitPrice, TotalSaleAmount)<br>    SELECT<br>        dp.ProductKey,<br>        ds.StoreKey,<br>        rs.SaleTimestamp,<br>        rs.Quantity,<br>        rs.UnitPrice,<br>        rs.Quantity * rs.UnitPrice AS TotalSaleAmount<br>    FROM ext_sales rs<br>    JOIN DimProduct dp ON rs.ProductID = dp.SourceProductID<br>    JOIN DimStore ds ON rs.StoreID = ds.SourceStoreID;</pre><p>This final INSERT command is the most important part, as it transforms and loads our main sales data in one powerful step.</p><p>Here’s exactly what it does:</p><ul><li>It combines the data, reading from the raw sales data (ext_sales) and joining it with the DimProduct and DimStore tables we just loaded.</li><li>It swaps the text IDs for number keys. The join looks up and replaces the original, slow text IDs (like ‘PROD-001’ and ‘MEL-01’) with their fast, integer-based ProductKey and StoreKey.</li><li>It calculates a new column on the fly by multiplying the Quantity and UnitPrice to create the TotalSaleAmount.</li><li>It loads the final, clean data — now enriched with the correct keys and the calculated total — into the permanent FactSales table.</li></ul>
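<p>With the stored procedure created, the entire load is now a single command. As a quick sanity check, here is a minimal sketch that runs the load and then answers Bobby’s question from section 1.2 (the exact numbers will depend on the CSVs):</p><pre>-- Run the full load<br>EXEC sp_LoadRetailData;<br><br>-- Sanity check: what are our top 10 selling products?<br>SELECT TOP 10<br>    dp.ProductName,<br>    SUM(fs.TotalSaleAmount) AS TotalRevenue<br>FROM FactSales fs<br>JOIN DimProduct dp ON fs.ProductKey = dp.ProductKey<br>GROUP BY dp.ProductName<br>ORDER BY TotalRevenue DESC;</pre><p>If this returns sensible rows, the warehouse is loaded and ready for a BI tool to connect to.</p>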
<h3>4. Visualization — Build the Dashboard in Power BI</h3><blockquote>Ensure you have Power BI Desktop installed.</blockquote><ol><li>Open Power BI Desktop.</li><li>On the Home ribbon, click Get data.</li><li>In the search box, type Synapse and select Azure Synapse Analytics SQL. Click Connect.</li><li>Find Your Server Name: You need your Synapse SQL endpoint.</li><li>Go to the Azure Portal and navigate to your Synapse workspace.</li><li>On the Overview page, find the Dedicated SQL endpoint. It will look like your-workspace-name.sql.azuresynapse.net. Copy this value. Paste the endpoint URL into the Server box in Power BI.</li><li>Set Data Connectivity mode to DirectQuery. This means Power BI will query the database live instead of importing the data. Click OK.</li><li>Ensure you’ve signed in.</li><li>Select Tables: A “Navigator” window will appear, showing your database. Expand your database.</li></ol><p><strong>Build Visual 1: Total Sales Card</strong>:</p><p>Check the boxes next to your three tables: DimProduct, DimStore, and FactSales. Click Load.</p><p>In the Visualizations pane on the right, click the Card visual (looks like 123). A blank card will appear on your canvas.</p><p>In the Data pane (far right), expand FactSales and drag the TotalSaleAmount field onto the “Fields” well of the card visual. You’ll now see the grand total of all sales.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*hufIrEc0rDXIbxaI0hTgLw.png" /></figure><p><strong>Build Visual 2: Sales by State Bar Chart</strong>:</p><p>Click on a blank area of the canvas. In the Visualizations pane, click the Stacked bar chart icon.</p><p>Drag fields from the Data pane: From DimStore, drag State to the Y-axis. From FactSales, drag TotalSaleAmount to the X-axis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*Z0f5GVBzFXHYWWu6WoHqIA.png" /></figure><p>Hopefully, this tutorial helped provide a foundation for Azure data warehousing. There are many more things to explore and experiment with; this is just the tip of the iceberg in the world of data analytics!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[7 Essential Advanced SQL Techniques Explained Simply]]></title>
            <link>https://medium.com/@johnelisaaa/7-essential-advanced-sql-techniques-explained-simply-1abbfd1e1823?source=rss-d1bc041ef9a3------2</link>
            <guid isPermaLink="false">https://medium.com/p/1abbfd1e1823</guid>
            <category><![CDATA[database]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-analyst]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[sql]]></category>
            <dc:creator><![CDATA[John Elisa]]></dc:creator>
            <pubDate>Wed, 09 Jul 2025 06:43:53 GMT</pubDate>
            <atom:updated>2025-07-09T06:43:53.558Z</atom:updated>
            <content:encoded><![CDATA[<p>Let’s break down some powerful SQL tricks for solving real-world problems. We’ll use simple examples and step-by-step queries to see how they work, focusing on PostgreSQL.</p><h3><em>1. ROW_NUMBER()</em></h3><p>We have an orders table with <em>order_id, customer_id, order_date</em>, and other fields.</p><pre>CREATE TABLE orders (<br>    order_id SERIAL PRIMARY KEY,<br>    customer_id INT,<br>    order_date DATE,<br>    amount NUMERIC<br>);<br><br>INSERT INTO orders (customer_id, order_date, amount) VALUES<br>(1, &#39;2021-01-01&#39;, 50),<br>(1, &#39;2021-01-05&#39;, 100),<br>(1, &#39;2021-01-03&#39;, 70),<br>(2, &#39;2021-02-01&#39;, 200),<br>(2, &#39;2021-01-15&#39;, 120),<br>(3, &#39;2021-01-20&#39;, 60),<br>(3, &#39;2021-02-02&#39;, 40);</pre><p>Your boss tells you to find the most recent order — well, that’s simple, right?</p><pre>SELECT * FROM orders<br>ORDER BY order_date DESC;</pre><p>But then, your boss wants the most recent order <strong>for each customer</strong> (including its details). How can we turn this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c7V4QcYmVgXgqo28oLCZLA.png" /></figure><p>Into this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Rtemx-6cDy1zbKVTZ2CzMg.png" /></figure><p>A naive approach might use aggregation to get the max date per customer, but then <strong>we still need the order details</strong>.</p><p><strong>Approach 1:</strong></p><pre>SELECT customer_id, MAX(order_date) AS recent_date<br>FROM orders<br>GROUP BY customer_id;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/581/1*BztNrQvZaNK2YjujZAv1-Q.png" /></figure><p><strong>Approach 2:</strong></p><p>One workaround is to join the table from Approach 1 back to the orders table:</p><pre>SELECT o.customer_id, o.order_id, o.order_date, o.amount<br>FROM orders o<br>-- join the table from approach 1 with original orders table<br>JOIN (<br>    SELECT customer_id, MAX(order_date) AS recent_date<br>    FROM orders<br>    GROUP BY customer_id<br>) AS m<br>-- match based on customer id and date<br>  ON o.customer_id = m.customer_id<br>  AND o.order_date = m.recent_date<br>ORDER BY o.customer_id;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RqClf7OpfIL7c5sxGYFQYg.png" /></figure><p><strong>Limitation:</strong> What if two orders have the same MAX(order_date)? Then both rows would show up in the result — so you’d get 2 orders for the same customer_id.</p><h4>☝️🤓 A Better Way: Use ROW_NUMBER() Instead</h4><p>SQL has a window function called ROW_NUMBER() that lets you assign a rank to each row, within groups. Let’s walk through how to use it step by step.</p><p><strong>Step 1: Add Row Numbers (No Grouping Yet)</strong></p><pre>SELECT <br>  order_id, <br>  customer_id, <br>  order_date,<br>  ROW_NUMBER() OVER() AS rn<br>  <br>FROM orders</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IQjjhly9rrn3rhpGPE6Wiw.png" /><figcaption>Assigns a unique row number to each row</figcaption></figure><p><strong>Step 2: Row Numbers Per Customer (Still No Order)</strong></p><pre>SELECT <br>  order_id, <br>  customer_id, <br>  order_date,<br>  ROW_NUMBER() OVER(PARTITION BY customer_id) AS rn<br>  <br>FROM orders</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cQVLiJ1nu5W8-L-5NmRwvQ.png" /></figure><p>This resets the row number for each customer. 
But the order of rows within each customer is still random unless we tell SQL how to sort them.</p><p><strong>Step 3: Add Sorting — Most Recent Order First</strong></p><pre>SELECT <br>  order_id, <br>  customer_id, <br>  order_date,<br>  ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC) AS rn<br>  <br>FROM orders</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S9tpVcH2_7BsNZxOlSJZKQ.png" /></figure><p>Now we’re telling SQL:</p><ol><li>Group by customer</li><li>Sort each customer’s orders by date descending</li><li>Then number the rows</li></ol><p>That means the most recent order for each customer is always rn = 1.</p><p><strong>Step 4: Keep Only the Most Recent Order Per Customer</strong></p><p>Now we wrap the above in a subquery and filter for <em>rn = 1</em>:</p><pre>SELECT order_id, customer_id, order_date, amount<br>FROM (<br>-- Move step three into a subquery<br>  SELECT <br>    order_id, <br>    customer_id, <br>    order_date,<br>    amount,<br>    ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC) AS rn<br>  <br>  FROM orders<br>) as ranked<br>-- filter for rn = 1<br>WHERE rn = 1;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Fm00KxzqSWJefwwfJDMKEQ.png" /><figcaption>This gives you exactly one row per customer, even if multiple orders had the same date.</figcaption></figure><p>As an alternative in PostgreSQL, you could use DISTINCT ON:</p><pre>SELECT DISTINCT ON (customer_id) *<br>FROM orders<br>ORDER BY customer_id, order_date DESC;</pre><p><em>DISTINCT ON (customer_id) *</em> says:<br>→ “Only return one row for each customer. The asterisk is to include all columns from that row”</p><p><em>ORDER BY customer_id, order_date DESC</em> ensures:<br>→ “Pick the row with the latest date per customer.”</p><blockquote><strong>This only works in PostgreSQL</strong> — other databases like MySQL, SQL Server, or SQLite don’t support it.</blockquote><blockquote>If you’re using PostgreSQL, and you want one row per group based on some ordering, then DISTINCT ON is a powerful and clean shortcut.</blockquote><blockquote>However, if you want to stay database-portable, or need more flexibility (like top 3 orders per customer), go with ROW_NUMBER().</blockquote>
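<p>For instance, here’s that “top 3 orders per customer” variant as a minimal sketch. The ranking logic is identical to Step 4; only the filter changes:</p><pre>-- Top 3 most recent orders per customer: same ranking, looser filter<br>SELECT order_id, customer_id, order_date, amount<br>FROM (<br>  SELECT<br>    order_id,<br>    customer_id,<br>    order_date,<br>    amount,<br>    ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC) AS rn<br>  FROM orders<br>) AS ranked<br>WHERE rn &lt;= 3;</pre>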
<h3>2. RANK() or DENSE_RANK()</h3><p>Let’s say you work with a sales table that tracks how much each product sold, and you want to figure out the best-selling products in each category.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/913/1*eDuu0ne61_Ah2ybWZIvHuw.png" /><figcaption>You can see there are ties — multiple products with the same sales in a category.</figcaption></figure><p>A simple way to get “the top seller per category” might be using something like:</p><pre>SELECT <br>  category, <br>  MAX(sales) <br>FROM sales_data <br>GROUP BY category;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*HqUWbXGSHTq6WKsjjhXhog.png" /></figure><p>But that just gives you the number — not which product actually sold the most, or what the full ranking looks like. That’s where ranking functions come in handy.</p><p>We want to rank the products by sales, within each category.</p><pre>SELECT<br>  category,<br>  product,<br>  sales,<br>  RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS rank,<br>  DENSE_RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS dense_rank<br>FROM sales_data;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/929/1*vP8kHaAOHDvBjsE0jS5p_g.png" /></figure><p><strong>What’s the difference?</strong></p><p>Let’s look at Gizmos. WidgetD and WidgetE both sold 150 — a tie!</p><p>RANK() gives them both 1, and then jumps to 3 for the next one. So it skips the number 2. DENSE_RANK() also gives them both 1, but the next rank is just 2 — no gap.</p><blockquote>- Use RANK() if you want to reflect actual placement like in sports. (Ties push the next item down.)</blockquote><blockquote>- Use DENSE_RANK() if you want consecutive numbers with no gaps — easier for filtering top N items.</blockquote><h3>3. LAG() and LEAD()</h3><p>Imagine you’re looking at a table of daily sales:</p><pre>CREATE TABLE daily_sales (<br>  sale_date DATE PRIMARY KEY,<br>  total_sales INTEGER<br>);<br><br>INSERT INTO daily_sales (sale_date, total_sales) VALUES<br>  (&#39;2021-01-01&#39;, 1000),<br>  (&#39;2021-01-02&#39;, 1200),<br>  (&#39;2021-01-03&#39;, 900),<br>  (&#39;2021-01-04&#39;, 1500);</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/937/1*_o5OOJsk279Uh1zJFEftKw.png" /></figure><p>We want to know:</p><ol><li>What were today’s sales?</li><li>What were yesterday’s sales?</li><li>How much did sales go up or down compared to yesterday?</li></ol><h4>LAG() lets you peek at the value from the previous row.</h4><pre>SELECT<br>  sale_date,<br>  total_sales,<br>  LAG(total_sales) OVER (ORDER BY sale_date) AS prev_day_sales<br>FROM daily_sales<br>ORDER BY sale_date;</pre><p><em>LAG(column_name)</em>: Give me the value of column_name from the previous row.</p><p>But how does SQL know what “previous” means? That’s where the OVER clause comes in:</p><p><em>OVER (ORDER BY sale_date)</em>: Order the data by sale_date, then look one row backward from the current row.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/930/1*k3rU7onT_YpqOtyBhIA-kg.png" /><figcaption><strong>Notice how the first row has no previous row, so the result is NULL.</strong></figcaption></figure><h4>LEAD(): Look at the Next Row</h4><p>LEAD() does the opposite — it looks ahead to the next row.</p><pre>SELECT<br>  sale_date,<br>  total_sales,<br>  LEAD(total_sales) OVER (ORDER BY sale_date) AS next_day_sales<br>FROM daily_sales;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/952/1*BdcHqQchYFHB9TiMcEe81g.png" /><figcaption><strong>Again, the last row doesn’t have a “next day,” so it returns NULL.</strong></figcaption></figure><h4>Want to See Sales Differences?</h4><pre>SELECT<br>  sale_date,<br>  total_sales AS sales_today,<br>  LAG(total_sales) OVER (ORDER BY sale_date) AS sales_yesterday,<br>  total_sales - LAG(total_sales) OVER (ORDER BY sale_date) AS diff<br>FROM daily_sales<br>ORDER BY sale_date;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UH40u5DTADv2YyNtBM5o-Q.png" /></figure>
<h3>4. CASE</h3><p>Let’s say you’re working with an orders table that has a column called amount, and you want to label each order as either “small”, “medium”, or “large”, depending on how much was spent.</p><pre>-- Create the orders table<br>CREATE TABLE orders (<br>    order_id SERIAL PRIMARY KEY,<br>    amount NUMERIC<br>);<br><br>-- Insert sample data<br>INSERT INTO orders (amount)<br>VALUES<br>    (25),     -- small<br>    (49.99),  -- small<br>    (50),     -- medium<br>    (120),    -- medium<br>    (199.99), -- medium<br>    (200),    -- large<br>    (350);    -- large</pre><p>Think of CASE like an if/else in code. It lets you check conditions row by row and return a value depending on which condition matches.</p><pre>SELECT order_id,<br>       amount,<br>       CASE <br>         WHEN amount &lt; 50 THEN &#39;small&#39;<br>         WHEN amount &lt; 200 THEN &#39;medium&#39;<br>         ELSE &#39;large&#39;<br>       END AS size_category<br>FROM orders;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/753/1*gvDWhSLrwwJv__Qc7H54oQ.png" /><figcaption><strong>This creates a new column called size_category that classifies each order. Note: The conditions are checked in order, and SQL stops at the first one that matches.</strong></figcaption></figure><p>If you don’t include an ELSE, SQL will just return NULL for rows that don’t meet any condition. So it’s a good habit to always add an ELSE just in case.</p><p>Another Example: Grouping Customers by Region</p><pre>SELECT customer_id, country,<br>       CASE<br>         WHEN country IN (&#39;US&#39;, &#39;Canada&#39;) THEN &#39;North America&#39;<br>         WHEN country = &#39;Mexico&#39; THEN &#39;Central America&#39;<br>         ELSE &#39;Other&#39;<br>       END AS region<br>FROM customers;</pre><p>Now you’ve got a nice region column that tells you where each customer is from, based on their country.</p><p>Remember that:</p><ol><li>Every CASE must end with an END.</li><li>Always list more specific conditions first. If a general one comes first, it can “hide” the others.</li><li>No ELSE means unmatched rows will be NULL.</li></ol>
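<p>CASE also pairs nicely with aggregates. As a minimal sketch using the same orders table, you can count every size bucket in a single pass over the data (a pattern often called conditional aggregation):</p><pre>-- Count each size bucket in one scan of the table<br>SELECT<br>  SUM(CASE WHEN amount &lt; 50 THEN 1 ELSE 0 END) AS small_orders,<br>  SUM(CASE WHEN amount &gt;= 50 AND amount &lt; 200 THEN 1 ELSE 0 END) AS medium_orders,<br>  SUM(CASE WHEN amount &gt;= 200 THEN 1 ELSE 0 END) AS large_orders<br>FROM orders;</pre><p>With the sample data above, this returns 2, 3, and 2.</p>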
<h3>5. Common Table Expressions (CTEs)</h3><p>Imagine you’re working with a sales table that tracks region, sale_date, and amount.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*_tqysVJpbEyqdAcYoUwFNA.png" /></figure><p>Your boss asks:</p><blockquote>“Can you tell me the average monthly sales per region over the past year?” (Assume today is 09/07/2025.)</blockquote><p>This might sound simple, but under the hood, it requires multiple steps:</p><ol><li>First, you need to total up the sales per month and region.</li><li>Then, calculate the average of those monthly totals for each region.</li></ol><p>Instead of cramming all that logic into one messy query, we can use a CTE to break it down.</p><p>A CTE (Common Table Expression) is like a named temporary table you define at the top of your query using WITH: a temporary result set holding a subquery’s output that the rest of the query can reuse.</p><pre>-- Step 1: Build a CTE to calculate monthly sales<br>WITH monthly_sales AS (<br>  SELECT<br>    region,<br>    DATE_TRUNC(&#39;month&#39;, sale_date) AS month,<br>    SUM(amount) AS month_total<br>  FROM sales<br>  WHERE sale_date &gt;= (CURRENT_DATE - INTERVAL &#39;1 year&#39;)<br>  GROUP BY region, DATE_TRUNC(&#39;month&#39;, sale_date)<br>)<br><br>-- Step 2: Use that result to calculate the average monthly sales<br>SELECT<br>  region,<br>  AVG(month_total) AS avg_monthly_sales<br>FROM monthly_sales<br>GROUP BY region;</pre><p><em>DATE_TRUNC(‘month’, sale_date):</em> This trims the sale_date down to just the first day of that month.</p><p>For example:</p><ul><li>‘20-03-2025’ becomes ‘01-03-2025’</li><li>‘15-08-2024’ becomes ‘01-08-2024’</li></ul><p>This makes it easier to group everything that happened in the same month.</p>
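<p>You can also chain multiple CTEs with commas, each one building on the last. Here’s a minimal sketch extending the query above to keep only regions whose average monthly sales exceed 10,000 (an arbitrary threshold for illustration):</p><pre>WITH monthly_sales AS (<br>  SELECT<br>    region,<br>    DATE_TRUNC(&#39;month&#39;, sale_date) AS month,<br>    SUM(amount) AS month_total<br>  FROM sales<br>  WHERE sale_date &gt;= (CURRENT_DATE - INTERVAL &#39;1 year&#39;)<br>  GROUP BY region, DATE_TRUNC(&#39;month&#39;, sale_date)<br>),<br>-- A second CTE can reference the first one<br>region_averages AS (<br>  SELECT region, AVG(month_total) AS avg_monthly_sales<br>  FROM monthly_sales<br>  GROUP BY region<br>)<br>SELECT region, avg_monthly_sales<br>FROM region_averages<br>WHERE avg_monthly_sales &gt; 10000;</pre>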
<h3>6. Cumulative and Running Totals with Window Aggregates</h3><p>Let’s say we have a table that tracks daily sales in different countries. It looks like this:</p><pre>CREATE TABLE sales (<br>  country TEXT,<br>  sale_date DATE,<br>  sales_amount INTEGER<br>);<br><br>INSERT INTO sales (country, sale_date, sales_amount) VALUES<br>(&#39;US&#39;, &#39;2021-01-01&#39;, 500),<br>(&#39;US&#39;, &#39;2021-01-02&#39;, 300),<br>(&#39;FR&#39;, &#39;2021-01-01&#39;, 200),<br>(&#39;US&#39;, &#39;2021-01-03&#39;, 700),<br>(&#39;FR&#39;, &#39;2021-01-02&#39;, 400);</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/949/1*4J2Cb-bXjzn5YLh1rmeB_Q.png" /></figure><p>We want to answer two questions:</p><p><strong>1. What’s the running total of all sales, day by day?</strong></p><p>For each day, add up all the sales from that day and every day before it.</p><pre>SELECT<br>  sale_date,<br>  SUM(sales_amount) OVER (ORDER BY sale_date) AS running_total<br>FROM sales;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/857/1*c5MyNsqceIPQjwO5HwZA2g.png" /></figure><p><strong>2. What’s the running total per country, day by day?</strong></p><p>For each country, show the total sales so far for that country, ordered by date.</p><pre>SELECT<br>  country,<br>  sale_date,<br>  SUM(sales_amount) OVER (<br>    PARTITION BY country<br>    ORDER BY sale_date<br>  ) AS country_running_total<br>FROM sales<br>ORDER BY country, sale_date;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/884/1*GlQGw-TiJgNawKQT4BDpkA.png" /></figure><p><em>PARTITION BY country</em>: This resets the running total for each country. So now, <em>US</em> sales are summed separately from <em>FR</em> sales.</p><h3>7. GROUP BY ROLLUP</h3><p>ROLLUP tells SQL: “<em>Please group by all combinations of these columns, plus their subtotals and the grand total.</em>”</p><p>Let’s say we have a table that tracks how much was sold each month of each year. Like this:</p><pre>CREATE TABLE sales (<br>  year INTEGER,<br>  month INTEGER,<br>  amount INTEGER<br>);<br><br>INSERT INTO sales (year, month, amount) VALUES<br>(2020, 1, 1000),<br>(2020, 2, 1500),<br>(2021, 1, 1200),<br>(2021, 2, 1800);</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/780/1*pP3EHD8aWMnKG2Kg1M9Zxg.png" /></figure><p>Now we want a sales report that shows:</p><ol><li>Monthly sales</li><li>Yearly sales totals</li><li>A final row showing grand total sales across all years and months</li></ol><pre>SELECT <br>  year, <br>  month, <br>  SUM(amount) AS total_sales<br>FROM sales<br>GROUP BY ROLLUP (year, month)<br>ORDER BY year, month;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/803/1*qDe5LxaU1QOMZXM-FnrzcA.png" /><figcaption><strong>ROLLUP walks down the hierarchy — first year/month, then just year, then nothing (the grand total: 2500 + 3000).</strong></figcaption></figure><p>What this does:</p><ol><li>Groups by both year and month to get monthly totals</li><li>Adds subtotal rows per year (month = NULL)</li><li>Adds a grand total row (year = NULL and month = NULL)</li></ol><p>Those NULLs aren’t real missing data. They’re markers saying “this row is a subtotal or the grand total”. But seeing NULL in a report can be confusing.</p><p>We can use COALESCE() to fix that:</p><p>COALESCE(value1, value2, …, valueN) returns the first non-NULL value from the list.</p><p>Example:</p><pre>SELECT COALESCE(NULL, NULL, &#39;hello&#39;); -- returns &#39;hello&#39;<br>SELECT COALESCE(NULL, 0, 99);         -- returns 0</pre><p>So, if we find a NULL in the year column, we change it to the string “ALL YEARS”, and we apply similar logic to month as well!</p><pre>SELECT <br>  COALESCE(CAST(year AS TEXT), &#39;ALL YEARS&#39;) AS year_label,<br>  COALESCE(CAST(month AS TEXT), &#39;TOTAL&#39;) AS month_label,<br>  SUM(amount) AS total_sales<br>FROM sales<br>GROUP BY ROLLUP(year, month)<br>ORDER BY year, month;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/864/1*sJX5TM9Z3wT1ymBeNT3PlA.png" /></figure><p>Because year is an INTEGER and ‘ALL YEARS’ is TEXT (a string), and in SQL all arguments passed to COALESCE() must be of the same data type, we do <em>CAST(year AS TEXT)</em>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[SQL Survival Guide]]></title>
            <link>https://medium.com/@johnelisaaa/sql-survival-guide-9cdfcb738809?source=rss-d1bc041ef9a3------2</link>
            <guid isPermaLink="false">https://medium.com/p/9cdfcb738809</guid>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[business-analysis]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[data-engineer]]></category>
            <category><![CDATA[developer]]></category>
            <dc:creator><![CDATA[John Elisa]]></dc:creator>
            <pubDate>Thu, 03 Jul 2025 07:07:34 GMT</pubDate>
            <atom:updated>2025-07-03T07:07:34.881Z</atom:updated>
            <content:encoded><![CDATA[<p>Back when I was a student taking the Databases unit at Monash University, I remember coming back from mid-sem break with a quiz looming and just wishing I had a no-nonsense recap of everything I needed to know to ace it.</p><p>Are you in the same boat? Maybe you’re cramming for a uni exam. Or maybe you tweaked your resume a little and now you’re scrambling to survive that SQL online assessment. Or maybe you’re just genuinely curious and want to learn SQL the right way.</p><p>Either way — you’re in the right spot.</p><p>In this guide, we’ll walk through the SQL concepts that actually matter, from the academic basics to how things are used in real companies and production environments. Whether you’re prepping for exams or job interviews, this is your one-stop survival kit. Let’s dive in.</p><h3>1. What is SQL, and Why Use It?</h3><p>SQL (Structured Query Language) is how we talk to databases — specifically, relational databases that store information in rows and columns, kind of like a super-powered spreadsheet. With SQL, you can grab the data you need, add new entries, update stuff, or delete things you no longer want. It also lets you define how different pieces of data relate to each other — like linking a user to all the orders they’ve made.</p><p><strong>If your data is structured and predictable </strong>— like a list of users, products, bookings, messages — <strong>SQL is usually the best tool for the job.</strong> It’s great when you need to filter, sort, join, or analyze data, especially when relationships between data matter.</p><p>When you’re first learning about databases, you’ll probably hear people say things like “MongoDB is better for unstructured data,” or “Firebase is good for real-time apps”, but what does that actually mean? Why wouldn’t you just store everything in SQL and call it a day?</p><p><strong>MongoDB is your go-to when your data keeps changing shape.</strong> Say you’re building a marketplace where sellers customize their profiles differently — one adds social links, another lists store hours, and a third includes multiple locations. SQL would force you to constantly redesign your tables. MongoDB just stores everything as flexible documents, so each seller can have completely different fields without breaking anything.</p><p><strong>Firebase shines when you need live updates.</strong> Think group chats where messages appear instantly, or collaborative docs where you see others typing in real-time. Firebase handles this automatically without any complex setup.</p><blockquote>Each tool has its strengths. It’s less about which is better overall and more about which is right for the job.</blockquote><h3>2. Setting up your Database and Tables</h3><p>You can try out code from this tutorial with an online SQL sandbox: <a href="https://sqlplayground.app/">https://sqlplayground.app/</a></p><p><strong>Imagine you’re asked to build a simple employee directory for a company with multiple departments.</strong></p><h4>2.1 We need a table to store our department names.</h4><blockquote>But how do we guarantee that we can uniquely identify each department, even if two have similar names?</blockquote><p>In SQL, we use something called a <strong>Primary Key. </strong>A primary key is a column in a table whose values uniquely identify each row. It has two simple rules: it can’t be empty (NULL), and it must be unique. 
Think of it like a Social Security Number or a student ID.</p><pre>CREATE TABLE departments (<br>    dept_id   INTEGER PRIMARY KEY,  -- A unique ID for each department (1, 2, 3, ...)<br>    name      TEXT                  -- Name of department (e.g., &quot;Sales&quot;)<br>);</pre><p>We told SQL we want a new table named departments. Then we created a column named dept_id, specified its data type as INTEGER (a whole number), and designated it as the PRIMARY KEY. We also added a column for the department’s name, which will hold text.</p><h4>2.2 We need a table to store employee data.</h4><p>This table will also need its own primary key (emp_id) to uniquely identify each employee.</p><blockquote>How do we specify which department an employee belongs to?</blockquote><p>We could add a text column and type “Sales” or “Engineering.” But that leads to potential typos and errors. A much better way is to refer to the ID from our departments table. This creates a reliable link.</p><p>This link is called a <strong>Foreign Key.</strong> A foreign key is a column in one table that refers to the primary key of another table. It’s the “bridge” that connects our data. It enforces a rule: you can only put a department ID in the employees table if that ID already exists in the departments table. This ensures you can’t assign an employee to a non-existent department!</p><pre>CREATE TABLE employees (<br>    emp_id    INTEGER PRIMARY KEY, -- A unique ID for each employee<br>    name      TEXT,                -- The employee&#39;s name<br>    salary    INTEGER,             -- Their salary<br>    dept_id   INTEGER,             -- This will link to the departments table<br><br>    -- Now, we define the relationship<br>    FOREIGN KEY (dept_id) REFERENCES departments(dept_id)<br>);<br><br>-- We&#39;re telling SQL that the dept_id column <br>-- in this table is a foreign key that points <br>-- directly to the dept_id column in the departments table.</pre><h4>2.3 Adding data to our tables</h4><p>Let’s populate our departments and employees tables.</p><pre>-- Insert some departments<br>INSERT INTO <br>  departments (dept_id, name) <br>VALUES<br>  (1, &#39;Sales&#39;),<br>  (2, &#39;Engineering&#39;),<br>  (3, &#39;HR&#39;),<br>  (4, &#39;Marketing&#39;),<br>  (5, &#39;Legal&#39;);</pre><pre>-- Insert some employees<br>INSERT INTO <br>  employees (emp_id, name, dept_id, salary) <br>VALUES<br>  (1, &#39;Alice&#39;,   2, 80000),<br>  (2, &#39;Bob&#39;,     2, 60000),<br>  (3, &#39;Charlie&#39;, 1, 70000),<br>  (4, &#39;Diana&#39;,   3, 50000),<br>  (5, &#39;Ethan&#39;,   1, 40000),<br>  (6, &#39;Fiona&#39;,   4, 75000);</pre><p>Notice how the dept_id for each employee corresponds to one of the IDs we just created. For example, Alice is in department 2, which is “Engineering”. We now have a small, functional database!</p><p>Notice we didn’t add any employees to the “Legal” department (dept_id 5). This will be useful for examples later on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1005/1*b2b3jg7oZ8tJ0KqSu8nxZw.png" /></figure>
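<p>Here’s a quick way to see the foreign key doing its job. “Grace” and department 99 are made up for this example; since no department has dept_id 99, the database should reject the insert (the exact error wording varies by engine, and some sandboxes don’t enforce foreign keys by default):</p><pre>-- This INSERT should fail: no department has dept_id = 99<br>INSERT INTO employees (emp_id, name, dept_id, salary)<br>VALUES (7, &#39;Grace&#39;, 99, 55000);<br>-- ERROR: foreign key constraint violated (wording varies by database)</pre>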
<h3>3. Basic Data Retrieval with SELECT</h3><p>The most common thing you’ll do in SQL is ask for data. The command for this is SELECT.</p><h4><strong>3.1 Getting All Data from a Table</strong></h4><p>Let’s start by looking at everything in our employees table. The <strong>*</strong> symbol is a wildcard that means “all columns.”</p><pre>SELECT * FROM employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qEflIBla02ARSLWHvMwpqw.png" /><figcaption><strong>This gives us a full table view of all our employee records, just as we entered them.</strong></figcaption></figure><h4>3.2 Selecting Specific Columns</h4><pre>SELECT name, salary FROM employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/871/1*y7NbEdITUKX_n5-1VpIDlA.png" /><figcaption><strong>This is much cleaner and gives you just the data you asked for.</strong></figcaption></figure><h4>3.3 Creating Calculations and Renaming Columns with “AS”</h4><p>You can also perform calculations directly in your SELECT statement!</p><p>Let’s say the salary column stores the monthly salary, and you want to see the annual salary. You can do the math right there. But a column header like <em>salary * 12</em> is ugly. We can give it a more readable, temporary name using an alias with the <em>AS</em> keyword.</p><pre>SELECT<br>    name,<br>    salary * 12 AS annual_salary<br>FROM employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/685/1*eOZ96VReT8mwwnO98QnUzA.png" /><figcaption><strong>The output will now have a clean column titled annual_salary.</strong></figcaption></figure><h4>3.4 Finding Unique Values with <em>DISTINCT</em></h4><p>What if you wanted to make sure all departments have at least 1 employee? If you just select <em>dept_id</em> from the employees table, you’ll get duplicates (e.g., 2, 2, 1, 3, 1, 4). To see only the unique department IDs, use <em>SELECT DISTINCT</em>.</p><pre>SELECT DISTINCT dept_id FROM employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/331/1*8FzX7I5ltNSGKMU1-ZzZjg.png" /><figcaption><strong>We find out that the “Legal” department (dept_id=5) has no employees!</strong></figcaption></figure><h4>3.5 Filtering Data with “WHERE”</h4><p>Getting all your data is great, but the real power of SQL comes from filtering it to find exactly what you need. This is done with the <em>WHERE </em>clause.</p><p><strong>3.5.1 Basic Filtering</strong></p><p>Let’s find all the employees who work in the “Engineering” department (dept_id = 2).</p><pre>SELECT name, dept_id FROM employees<br>WHERE dept_id = 2;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/802/1*vBXxr_qybGkaHFGjVNUTTA.png" /><figcaption><strong>Note: SELECT name FROM employees also works, and will only display the names instead</strong></figcaption></figure><p><strong>3.5.2 Combining Filters with AND and OR</strong></p><p>Let’s find employees who are in the Engineering department (dept_id = 2) <strong>AND </strong>earn more than $60,000.</p><pre>SELECT name, salary, dept_id FROM employees<br>WHERE dept_id = 2 AND salary &gt; 60000;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3ofigOWI3VrI-z0IAGwSrg.png" /><figcaption><strong>Only Alice meets both criteria.</strong></figcaption></figure><p>Now, let’s find employees who are in the Sales department (dept_id = 1) <strong>OR </strong>whose name starts with ‘D’.</p><pre>SELECT name, dept_id FROM employees<br>WHERE (dept_id = 1) OR (name LIKE &#39;D%&#39;);</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/925/1*uCbR5-4RQFTlxRo5E6swCw.png" /><figcaption><strong>This will give you Charlie and Ethan (from Sales) and Diana (whose name starts with ‘D’). 
The LIKE ‘D%’ clause matches any name that starts with the letter “D”, where % means “any sequence of characters” after “D”.</strong></figcaption></figure><blockquote><strong>Pro Tip:</strong> When you mix AND and OR, <strong>AND is processed first</strong>. It’s a good habit to use parentheses () to make your logic clear, like WHERE (dept_id = 1 AND salary &gt; 50000) OR dept_id = 3;.</blockquote><p><strong>3.5.3 Other Useful Filtering Tools</strong></p><p><strong>IN:</strong> A shorthand for multiple OR conditions. To find employees in either the Sales (1) or HR (3) departments:</p><pre>SELECT name, dept_id FROM employees<br>WHERE dept_id IN (1, 3);</pre><p><strong>BETWEEN: </strong>For checking a range (inclusive). To find employees earning between $50,000 and $80,000:</p><pre>SELECT name, salary FROM employees<br>WHERE salary BETWEEN 50000 AND 80000;</pre><p><strong>LIKE:</strong> For finding patterns in text. The % wildcard matches any number of characters.</p><ul><li>Find names starting with ‘A’: <em>WHERE name LIKE ‘A%’</em> (matches Alice).</li><li>Find names ending in ‘a’: <em>WHERE name LIKE ‘%a’</em> (matches Diana and Fiona).</li></ul><h3>4. Sorting and Limiting Your Results</h3><p>By default, your database doesn’t guarantee the order of the results. To get a sorted list, you need to be explicit using <em>ORDER BY.</em></p><ul><li><strong><em>ASC</em></strong>: Ascending order (A-Z, lowest to highest). This is the default.</li><li><strong><em>DESC</em></strong>: Descending order (Z-A, highest to lowest).</li></ul><p>Let’s get a list of employees sorted by their salary, from highest to lowest.</p><pre>SELECT name, salary FROM employees<br>ORDER BY salary DESC;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/892/1*5FGYPawMdJuYTRRlP5RPcw.png" /></figure><p>What if you only want the top 3 highest-paid employees? You can combine ORDER BY with LIMIT.</p><pre>SELECT name, salary FROM employees<br>ORDER BY salary DESC<br>LIMIT 3;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/909/1*h_6ijvMUU0-qGC8IUs2kCg.png" /></figure><h3>5. Summarizing Your Data (and Grouping)</h3><p>Before diving into multiple tables, let’s look at another powerful aspect: aggregate functions and grouping. The most common are:</p><ul><li><strong>COUNT(): </strong>Counts the number of rows.</li><li><strong>SUM(): </strong>Adds up the values in a column.</li><li><strong>AVG():</strong> Calculates the average.</li><li><strong>MIN() / MAX(): </strong>Find the minimum or maximum value.</li></ul><p><strong>5.1 Simple Summaries</strong></p><p>How many employees do we have in total?</p><pre>SELECT COUNT(*) AS num_of_employees<br>FROM employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/1*jvayPk3K_qRkk3PAJfR_AQ.png" /></figure><p>What’s the average salary across the entire company?</p><pre>SELECT AVG(salary) AS avg_salary<br>FROM employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/822/1*sdjGnHsLekRUutAyQWkBBg.png" /></figure><h4>5.2 Summarizing by Category with GROUP BY</h4><p>This is where things get really interesting. What if you want to know the average salary per department? This is what GROUP BY is for.
It bundles rows together based on a column and then runs the aggregate function on each bundle.</p><pre>SELECT<br>    dept_id,<br>    AVG(salary) AS average_salary,<br>    COUNT(*) AS number_of_employees<br>FROM employees<br>GROUP BY dept_id;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/773/1*TQNu61Nriyq8ll4I_SeF7A.png" /></figure><p>This query tells SQL:</p><ul><li>First, group the employees by their dept_id. (SQL splits the table into smaller “mini-tables”, one for each unique dept_id.)</li><li>Then, for each of those groups, calculate the average salary and count the number of employees.</li></ul><h4><strong>5.3 Filtering Groups with HAVING</strong></h4><p>The <em>WHERE </em>clause filters rows before they are grouped. But what if you want to filter the groups themselves? For example, “Show me only the departments with more than 1 employee.”</p><p>For this, you use the <em>HAVING </em>clause, which works on the results of your aggregations.</p><pre>SELECT<br>    dept_id,<br>    COUNT(*) AS num_employees<br>FROM employees<br>GROUP BY dept_id<br>HAVING COUNT(*) &gt; 1;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/646/1*agsWVD_pvT7euCHQklTh-Q.png" /><figcaption><strong><em>WHERE </em>filters individual rows BEFORE grouping</strong>, while <strong><em>HAVING</em> filters entire groups AFTER GROUP BY</strong> is done.</figcaption></figure><h3>6. Bringing Our Tables Together with “<em>JOIN</em>”</h3><p>So far, we’ve worked with our employees and departments tables separately. But the real power of a relational database comes from connecting them.</p><p>Let’s say you want a list of all employees and the name of the department they work in. The employee’s name is in the employees table, but the department’s name is in the departments table. How do we pull information from both at the same time?</p><h4><strong>6.1 The Most Common Join: INNER JOIN</strong></h4><p>An <em>INNER JOIN</em> looks for rows where the key (our dept_id) exists in both tables. It finds the perfect matches and returns a combined row.</p><pre>SELECT<br>    e.name AS employee_name,<br>    d.name AS department_name<br>FROM<br>    employees AS e<br>INNER JOIN<br>    departments AS d ON e.dept_id = d.dept_id;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/639/1*PpmLTlZ7EBvkzZIqVXmt1w.png" /></figure><p><strong><em>FROM employees AS e INNER JOIN departments AS d</em>: </strong>We’re telling SQL we want to join employees and departments. We also give them short nicknames (aliases) e and d to make our query easier to read and write.</p><p><strong><em>ON e.dept_id = d.dept_id</em>:</strong> This is the crucial part. It’s the rule for the join. It says, “Match a row from the employees table with a row from the departments table wherever their dept_id values are the same.”</p><h4><strong>6.2 Seeing Everything with LEFT JOIN</strong></h4><p>But what if you wanted to see a list of all departments, and just show which employees are in them, if any? You might want to do this to find departments that are empty. For this, we use a LEFT JOIN.</p><p>A <em>LEFT JOIN</em> says: “Give me every single row from the left table (the one mentioned first), and then bring in any matching rows from the right table.
If there’s no match, just show NULL.” An INNER JOIN, by contrast, <strong>only </strong>shows rows where there is a match in both tables.</p><pre>SELECT<br>    d.name AS department_name,<br>    e.name AS employee_name<br>FROM<br>    departments AS d<br>LEFT JOIN<br>    employees AS e ON d.dept_id = e.dept_id;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/812/1*MYluMpL8-BaxIhSywsP72A.png" /><figcaption>The NULL value instantly tells us that the Legal department has no matching employees.</figcaption></figure><h4><strong>6.3 Other Joins</strong></h4><p>There are also RIGHT JOIN (the opposite of LEFT JOIN) and FULL OUTER JOIN (which shows all rows from both tables, matching where possible and using NULL for any non-matches). However, INNER and LEFT joins will handle over 95% of your needs.</p>
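<p>For completeness, here’s what the FULL OUTER JOIN version of our departments/employees query would look like. (Note: MySQL doesn’t support FULL OUTER JOIN directly; this syntax works in engines like PostgreSQL and recent versions of SQLite.)</p><pre>-- Every department and every employee,<br>-- with NULLs wherever there is no match on either side<br>SELECT<br>    d.name AS department_name,<br>    e.name AS employee_name<br>FROM<br>    departments AS d<br>FULL OUTER JOIN<br>    employees AS e ON d.dept_id = e.dept_id;</pre>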
<h3>7. Modifying Your Data</h3><p>Databases are living things. People get raises, new employees are hired, and sometimes people leave. You need commands to manage these changes.</p><h4>7.1 Changing Existing Data: UPDATE</h4><p>Let’s say Alice got a well-deserved raise, and her new salary is $90,000. We use the UPDATE command to change her existing record.</p><pre>UPDATE employees<br>SET salary = 90000<br>WHERE emp_id = 1; -- Using the unique ID is safest!</pre><p>This is the most important rule of updating: Always use a WHERE clause! If you forget it, the command UPDATE employees SET salary = 90000; would give every single employee a salary of $90,000.</p><h4>7.2 Removing Data: DELETE</h4><p>Now, imagine Bob has decided to leave the company. We need to remove his record from the database using the DELETE command.</p><pre>DELETE FROM employees<br>WHERE emp_id = 2;</pre><h3>8. Ensuring Safety with Transactions</h3><p>Imagine a banking app. When you transfer money, two things must happen:</p><ol><li>Money is subtracted from your account.</li><li>Money is added to the other person’s account.</li></ol><p>What if the system crashes after step 1 but before step 2? The money vanishes! To solve this, databases use transactions.</p><p>A transaction is a wrapper around a sequence of SQL commands that treats them as a single, all-or-nothing operation.</p><ol><li>It starts with BEGIN or START TRANSACTION.</li><li>You run your UPDATE, INSERT, or DELETE commands.</li><li>If everything is successful, you COMMIT the changes, making them permanent.</li><li>If something goes wrong, or you change your mind, you ROLLBACK the changes, and the database returns to the state it was in before you started.</li></ol><p>This is guaranteed by a set of properties known as ACID (Atomicity, Consistency, Isolation, Durability), which ensure your data remains reliable and consistent.</p><p>Let’s safely give all Engineering employees a 10% raise!</p><pre>-- Step 1: Start the transaction<br>BEGIN;<br><br>-- Step 2: Apply the raise (but don’t commit yet!)<br>UPDATE employees<br>SET salary = salary * 1.10<br>WHERE dept_id = 2;<br><br>-- Step 3: Safety check — inspect the updated rows<br>SELECT name, salary<br>FROM employees<br>WHERE dept_id = 2;</pre><p>Our data originally looked like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/296/1*DhSMUw4s1uEw3ntF2kF8KQ.png" /></figure><p>After running the above code, it looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/609/1*vOhLYZw24wWgRBC4tW6aVw.png" /><figcaption>Manually check that these are the updated values we want! It looks correct here!</figcaption></figure><pre>-- (if something looks wrong): Undo it<br>-- ROLLBACK;<br><br>-- (if everything looks good): Save it<br>COMMIT;</pre><p>Without <em>BEGIN</em>, every <em>UPDATE</em>, <em>INSERT</em>, or <em>DELETE </em>is immediately saved — you can’t undo anything unless you manually reverse it.</p><p>With BEGIN…COMMIT or ROLLBACK:</p><ol><li>You can test and inspect changes</li><li>Only COMMIT makes them permanent</li><li>ROLLBACK erases them like they never happened</li></ol><h3>9. Evolving Your Database Structure</h3><p>What happens when your needs change? Maybe you decide you need to store each employee’s email address. You don’t have to start over; you can modify your database’s structure (its schema) on the fly.</p><h4>9.1 ALTER TABLE</h4><p>This command lets you modify an existing table. Let’s add that email column:</p><pre>ALTER TABLE employees<br>ADD email VARCHAR(100); -- A string up to 100 characters</pre><p>Now, the employees table has a new email column, which will be NULL for all existing employees until we UPDATE them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/455/1*h1NTJy-sohMEayk95nRkUw.png" /></figure><h4>9.2 DROP TABLE</h4><p>If you no longer need a table, you can delete it completely. This action is permanent and deletes both the table structure and all the data inside it, so use it with extreme caution!</p><pre>-- example usage<br>DROP TABLE departments;</pre><h3>10. Creating Shortcuts with Views</h3><p>Imagine you frequently need to run a complex JOIN query to see high-earning employees and their department names. Typing it out every time is tedious and prone to error.</p><p>A View is a “virtual table” that is based on the result of a query. You can save a complex query as a View and then interact with it as if it were a simple table.</p><p>Let’s create a view for our high-earners.</p><pre>CREATE VIEW high_earners_view AS<br>SELECT<br>    e.name,<br>    e.salary,<br>    d.name AS department_name<br>FROM<br>    employees AS e<br>INNER JOIN<br>    departments AS d ON e.dept_id = d.dept_id<br>WHERE<br>    e.salary &gt; 60000;</pre><p>Now, instead of re-typing that whole query, you can simply do this:</p><pre>SELECT * FROM high_earners_view;</pre><p>And there you have it — the SQL survival guide I wish I had during uni. Thanks for reading and coding along with me!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Full Stack React App: Integrating AWS Cognito + Lambda + RDS(MySQL) + Amplify + API…]]></title>
            <link>https://medium.com/@johnelisaaa/building-a-full-stack-react-app-integrating-aws-cognito-lambda-rds-mysql-amplify-api-a4f616559acb?source=rss-d1bc041ef9a3------2</link>
            <guid isPermaLink="false">https://medium.com/p/a4f616559acb</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[tutorial]]></category>
            <category><![CDATA[amazon]]></category>
            <category><![CDATA[fullstack-development]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[John Elisa]]></dc:creator>
            <pubDate>Thu, 16 Jan 2025 03:57:48 GMT</pubDate>
            <atom:updated>2025-01-16T03:57:48.793Z</atom:updated>
            <content:encoded><![CDATA[<h3>Building a Full Stack React App: Integrating AWS Cognito + Lambda + RDS(MySQL) + Amplify + API Gateway</h3><p>In this tutorial, we’ll build a secure REST API for our e-commerce website by integrating AWS Cognito, AWS Lambda, and MySQL (RDS).</p><h3>User Authentication</h3><p>We start with an already implemented sign-in button in our navigation bar that leverages Cognito to handle user authentication. If you haven’t implemented one yet, here is a six-minute tutorial of mine that should get you up to speed:</p><p><a href="https://medium.com/@johnelisaaa/setting-up-amazon-cognito-for-your-react-app-787de7999c07">Setting Up Amazon Cognito for your React App</a></p><p><strong>We won’t spend much time on the frontend; for the purposes of this tutorial we’ll mostly focus on the backend.</strong></p><p>Final Result:</p><iframe src="https://cdn.embedly.com/widgets/media.html?url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DRJZTQkexW4o&amp;type=text%2Fhtml&amp;schema=youtube&amp;display_name=YouTube&amp;src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FRJZTQkexW4o%3Ffeature%3Doembed" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c05a23a59761df474c83bc75d04546dd/href">https://medium.com/media/c05a23a59761df474c83bc75d04546dd/href</a></iframe><h3>How it’s going to work:</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/761/1*C1q6c3hlUzR5X-lqAw82vA.png" /></figure><ol><li><strong>User Authentication with AWS Cognito</strong>: The React-based front end uses Cognito for sign-in/sign-out processes. When a user clicks “Sign In,” Cognito manages the authentication flow, securely storing user tokens and profile information.</li><li><strong>Serverless Backend with Lambda</strong>: After a successful sign-in, the application calls various REST endpoints to interact with user data. AWS Lambda functions act as the backend logic, triggered by API Gateway when endpoints like /syncUser, /profile, or /products are called. These Lambda functions use the mysql2/promise library to connect to a MySQL database on RDS.</li><li><strong>Data Persistence with MySQL (RDS)</strong>: User details — such as email, name, and userID — are stored in a MySQL database, managed via MySQL Workbench. The Lambda functions handle the CRUD operations for products and purchases.</li><li><strong>Pages — Products, Sold Items, My Purchases, My Profile:<br></strong>Authenticated users can access these pages. They interact with the backend through functions like fetchUserProfile, updateUserProfile, and fetchProducts, updating the MySQL database via Lambda.</li><li><strong>Conditional UI Rendering</strong>:<br>In the navigation bar, features like the “Sell” button (represented by a ‘+’ symbol) or profile dropdown only appear for authenticated users. This means that only signed-in users can list items for sale or access their account details.</li></ol><h3>Creating the REST API (Lambda + API Gateway)</h3><p>Imagine you want to create a small service on the internet, like a button people can click to get a greeting message (e.g., “Hello, world!”).</p><ul><li>A <strong>REST API </strong>is just a way to connect that button to the code that makes the greeting.</li><li><strong>AWS Lambda</strong> is a tool that runs your code for you on the internet. You write a small piece of code that says, “When someone clicks the button, show them ‘Hello, world!’”, and Lambda runs it without you needing to set up a whole computer or server.</li><li><strong>AWS API Gateway</strong> is like the doorman to your Lambda. It listens for clicks on the button (or requests from anyone on the internet) and sends them to your Lambda code. Once Lambda finishes running, API Gateway sends the result back to whoever clicked the button.</li></ul><h4>How do you connect these?</h4><ol><li>You write the “Hello, world!” code in Lambda.</li><li>You use API Gateway to create a URL (like a website link) that triggers your Lambda when visited.</li><li>Now, anyone who clicks the link gets your greeting!</li></ol><p>This is the same backend architecture we’re going to use for our app! A minimal sketch of that “Hello, world!” Lambda is shown below.</p>
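<p>Here’s what that might look like in code: a minimal, hypothetical Node.js handler that uses the same Lambda proxy response shape as the real functions we’ll write later.</p><pre>// index.mjs - the smallest possible Lambda behind API Gateway<br>export const handler = async (event) =&gt; {<br>  // API Gateway invokes this function and relays whatever we return<br>  return {<br>    statusCode: 200,<br>    headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },<br>    body: JSON.stringify({ message: &quot;Hello, world!&quot; }),<br>  };<br>};</pre>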
<h3>Initializing Database (RDS MySQL):</h3><p>Proceed to your AWS Console, search “RDS”, and create a database. Click “Easy create”, choose your engine type and the free tier, then create your database.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H0Dq8CrkAI1_PsKgZ2H1Lw.png" /></figure><p>Go to your created database, click “Modify”, make the database publicly accessible, and apply the changes immediately.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/765/1*9_sFS0Q_nSERSx3-x-8ekQ.png" /></figure><p>You also want to click on the VPC your database is on, and add these inbound rules. This will allow us to connect to MySQL from our device.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PZjn1buJHgnOv2SdmafMeA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*A5B7gSURDX2EzNq-qVoegA.png" /></figure><p>Now, go to your database instance and take note of your endpoint and port.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*m5LvQL9584Ho1RJgX_mWwA.png" /></figure><p>Paste your endpoint into the hostname field and your port into the port field. Fill in the username and password you created earlier; the default schema should be the name of the database you’re about to create.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hdgTXuW3T3dAtJUvsKOmCg.png" /><figcaption>MySQL Workbench</figcaption></figure><p>Great! Let’s create the database and its tables:</p><pre>-- 1. Create the database (schema) <br>CREATE DATABASE chemazon;<br><br>-- 2. Switch to the database<br>USE chemazon;<br><br>-- 3. Users table<br>CREATE TABLE IF NOT EXISTS users (<br>  id INT AUTO_INCREMENT PRIMARY KEY,<br>  cognito_sub VARCHAR(255) NOT NULL,<br>  email       VARCHAR(255) NOT NULL,<br>  name        VARCHAR(255),<br>  created_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,<br>  updated_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP <br>                 ON UPDATE CURRENT_TIMESTAMP<br>);<br><br>-- 4. Products table<br>CREATE TABLE IF NOT EXISTS products (<br>  id          INT AUTO_INCREMENT PRIMARY KEY,<br>  seller_id   INT NOT NULL,<br>  title       VARCHAR(255) NOT NULL,<br>  description TEXT,<br>  image_url   VARCHAR(255),<br>  price       DECIMAL(10,2),<br>  discount    DECIMAL(3,2),<br>  stock       INT,<br>  rating      DECIMAL(3,2),<br>  status      VARCHAR(20),<br>  created_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,<br>  updated_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP <br>                 ON UPDATE CURRENT_TIMESTAMP<br>);<br>
-- 5. Purchases table<br>CREATE TABLE IF NOT EXISTS purchases (<br>  id           BIGINT AUTO_INCREMENT PRIMARY KEY,<br>  product_id   INT NOT NULL,<br>  buyer_id     INT NOT NULL,<br>  purchased_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,<br>  CONSTRAINT fk_product<br>    FOREIGN KEY (product_id) REFERENCES products(id)<br>      ON DELETE CASCADE<br>      ON UPDATE CASCADE,<br>  CONSTRAINT fk_buyer<br>    FOREIGN KEY (buyer_id) REFERENCES users(id)<br>      ON DELETE CASCADE<br>      ON UPDATE CASCADE<br>);</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/491/1*vx7kcK5BRcC_6sXnoGeqtw.png" /></figure><h3>API Gateway + Services:</h3><ol><li>Sign in to <a href="https://aws.amazon.com/console/">https://aws.amazon.com/console/</a> if you haven’t already and search for “API Gateway”.</li><li>Press “Create API”, scroll down to “REST API” and click “Build”. Go ahead and name your API (e.g. chemazonAPI), and confirm creation.</li></ol><p>We’re going to have these methods:</p><ul><li><strong>syncUserInfo(userInfo):</strong> This method sends user information to the server (API Gateway + Lambda) to sync or store it. It uses an HTTP POST request to send the data and ensures it’s in JSON format.</li><li><strong>fetchUserProfile(userSub):</strong> This method retrieves a user’s profile from the server based on their unique identifier (userSub). It sends an HTTP GET request with the user ID as a query parameter to get the data.</li><li><strong>updateUserProfile(userSub, profileData)</strong>: This method updates a user’s profile on the server with new information. It uses an HTTP PUT request, sending both the user ID and updated profile data in JSON format.</li><li><strong>createProduct(productData)</strong>: Sends a POST request to POST /products with the new product’s details in the request body. Creates a new product listing in the database.</li><li><strong>fetchProducts():</strong> Sends a GET request to GET /products, optionally with query parameters (e.g., filter, page, etc.). Retrieves a list of products from the server.</li><li><strong>buyProduct(productId, userSub)</strong>: Sends a POST request to POST /buyProduct with productId and userSub in the request body. Allows a user to purchase a specific product (e.g., creating a record in the “purchases” table).</li><li><strong>fetchPurchases(userSub)</strong>: Sends a GET request to GET /purchases?userSub={userSub}. Retrieves all the purchase records for a given user from the server.</li><li><strong>fetchUserProducts(sellerSub):</strong> Sends a GET request to GET /products?seller_sub={sellerSub}. Fetches products that belong to a specific seller (identified by their Cognito sub).</li></ul>
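<p>To make the shape of these helpers concrete, here’s a sketch of what syncUserInfo might look like on the frontend. (This is illustrative only: the base URL is a placeholder for the invoke URL that API Gateway generates for your deployed stage.)</p><pre>// api.js - hypothetical client-side helper for the /syncUser endpoint<br>const API_BASE = &quot;https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com/prod&quot;;<br><br>export async function syncUserInfo(userInfo) {<br>  // POST the user&#39;s sub, email, and name as JSON;<br>  // API Gateway forwards the request to our Lambda<br>  const response = await fetch(`${API_BASE}/syncUser`, {<br>    method: &quot;POST&quot;,<br>    headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },<br>    body: JSON.stringify(userInfo),<br>  });<br>  if (!response.ok) throw new Error(&quot;Failed to sync user&quot;);<br>  return response.json();<br>}</pre>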
<h3>Linking API Gateway to Lambda</h3><p>First, go to your database instance and set up a Lambda connection.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LWO6NFTTam8Eip4elENnMQ.png" /></figure><p>I’ll walk through how the “SyncUserFunction” API service works; the rest of the functions are available on my <a href="https://github.com/johnbobelisa/chemazon">GitHub</a> if you would like to see their implementations.</p><p>So, create a new function “SyncUserFunction”, and choose your existing database proxy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4S-AgTccg0y_JvuohPFTjA.png" /></figure><p>Then, proceed to our previously created “chemazonAPI”, click “Create resource”, and type in “syncUser” as the resource name. Then choose POST as the method type, tick “Lambda proxy integration”, select SyncUserFunction in the Lambda function field, and create the method.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QEUrHVloV_xLgw4E9-xweQ.png" /></figure><p>Now, ensure that your Lambda function has the outbound rules below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lPsX8EXKg8jCcSxW8s8zqQ.png" /></figure><p>Finally, let’s add the logic to our Lambda function.</p><p>Proceed to the “syncUserFunction” code section, and replace the index.mjs code with this:</p><pre>import mysql from &#39;mysql2/promise&#39;;<br><br>export const handler = async (event) =&gt; {<br>  // Define CORS headers to include in all responses<br>  const corsHeaders = {<br>    &quot;Access-Control-Allow-Origin&quot;: &quot;*&quot;,<br>    &quot;Access-Control-Allow-Methods&quot;: &quot;OPTIONS,POST,GET,PUT&quot;,<br>    &quot;Access-Control-Allow-Headers&quot;: &quot;Content-Type&quot;<br>  };<br><br>  let connection;<br><br>  try {<br>    console.log(&quot;Received event:&quot;, JSON.stringify(event));<br><br>    // Handle request body parsing<br>    let body;<br>    if (typeof event.body === &#39;string&#39;) {<br>      try {<br>        body = JSON.parse(event.body);<br>      } catch (parseError) {<br>        throw new Error(&quot;Failed to parse JSON body: &quot; + parseError.message);<br>      }<br>    } else if (typeof event.body === &#39;object&#39; &amp;&amp; event.body !== null) {<br>      body = event.body;<br>    } else {<br>      throw new Error(&quot;Request body is not valid JSON.&quot;);<br>    }<br><br>    const { sub, email, name } = body;<br>    if (!sub || !email || !name) {<br>      throw new Error(&quot;Missing required user fields (sub, email, name).&quot;);<br>    }<br><br>    // Create a MySQL connection using mysql2/promise<br>    connection = await mysql.createConnection({<br>      host: process.env.DB_HOST,<br>      port: process.env.DB_PORT || 3306,  // default MySQL port<br>      user: process.env.DB_USER,<br>      password: process.env.DB_PASS,<br>      database: process.env.DB_NAME,<br>      connectTimeout: 5000  // 5-second timeout<br>    });<br><br>    // Check if user already exists<br>    const [rows] = await connection.execute(<br>      &#39;SELECT id FROM users WHERE cognito_sub = ?&#39;,<br>      [sub]<br>    );<br><br>    // Insert new user if not found<br>    if (!rows.length) {<br>      await connection.execute(<br>        `INSERT INTO users (cognito_sub, email, name, created_at, updated_at) <br>         VALUES (?, ?, ?, NOW(), NOW())`,<br>        [sub, email, name]<br>      );<br>    }<br><br>    return {<br>      statusCode: 200,<br>      headers: corsHeaders,<br>      body: JSON.stringify({ message: &#39;User synced successfully.&#39; }),<br>    };<br>  } catch (error) {<br>    console.error(&#39;Error syncing user:&#39;, error);<br>    return {<br>      statusCode: 500,<br>      headers: corsHeaders,<br>      body: JSON.stringify({ error: &#39;Error syncing user.&#39;, details: error.message }),<br>    };<br>  } finally {<br>    if (connection) {<br>      try {<br>        await connection.end();<br>      } catch (endError) {<br>        console.error(&#39;Error ending connection:&#39;, endError);<br>      }<br>    }<br>  }<br>};</pre><p>This AWS Lambda function synchronizes Cognito user details with a MySQL database by checking if the user exists and inserting them if not.
It handles database connections, parses the request, and ensures proper CORS headers and error handling.</p><p>Now you have a working REST API!</p><h3>Automated Build and Deployment for Web Hosting</h3><p>Now all that’s left is to deploy your app for other people to see!</p><p>Here is my app: <a href="https://main.d1ktvyh6vc45ny.amplifyapp.com/">https://main.d1ktvyh6vc45ny.amplifyapp.com/</a></p><p>First, push your project to GitHub.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*E4ld3zA_c9_EgPIPc3kjig.png" /></figure><p>Then, create a new Amplify app in AWS Amplify:</p><p>Select your repository and branch. From now on, every time you push to GitHub, your changes will automatically build and be deployed to your Amplify domain.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i7yUXfG9GAA-_wZzHUTS4g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oIhFEBZtx8InQf_KibR6Ig.png" /></figure>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Setup Web Hosting for your React App with AWS Amplify + S3 + Cognito]]></title>
            <link>https://medium.com/@johnelisaaa/setup-web-hosting-for-your-react-app-with-aws-amplify-s3-cognito-b726ecb13ec2?source=rss-d1bc041ef9a3------2</link>
            <guid isPermaLink="false">https://medium.com/p/b726ecb13ec2</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[cognito]]></category>
            <category><![CDATA[s3-bucket]]></category>
            <category><![CDATA[amplify]]></category>
            <category><![CDATA[web-hosting]]></category>
            <dc:creator><![CDATA[John Elisa]]></dc:creator>
            <pubDate>Tue, 31 Dec 2024 05:26:39 GMT</pubDate>
            <atom:updated>2024-12-31T05:26:39.843Z</atom:updated>
            <content:encoded><![CDATA[<p>This tutorial will teach you how to deploy your React app online using AWS Amplify + S3. If you have set up Cognito on your React app, there is a section at the end of this tutorial on how to configure it with Amplify + S3 as well.</p><p>If you haven’t added Cognito (user authentication, e.g. sign-in/out) to your app and you’re interested in adding it, read this quick six-minute tutorial: <a href="https://medium.com/@johnelisaaa/setting-up-amazon-cognito-for-your-react-app-787de7999c07">https://medium.com/@johnelisaaa/setting-up-amazon-cognito-for-your-react-app-787de7999c07</a>.</p><h4>1. Run “npm run build” on your React project</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/685/1*upROpAJjjM8ZsajVY1aJ1w.png" /></figure><h4>2. Proceed to the AWS Console:</h4><p><strong>2.1: Creating an S3 bucket (storage for all your frontend assets):</strong></p><ul><li>Click “Create Bucket”.</li><li>Add a name to your bucket, keep the default settings for the rest of the fields, and click “Create Bucket”.</li><li>Then, click your created bucket, press “Upload”, and upload your build/ folder. If you’re using Vite + React, this will be a dist/ folder instead.</li><li>Go to your bucket’s properties, scroll down to “Static website hosting”, enable it, and set your index document (index.html).</li><li>After that, click the “Create Amplify app” button.</li></ul><p><strong>2.2: Creating Amplify App + Linking to S3 bucket:</strong></p><ul><li>After clicking the “Create Amplify app” button, choose the “Amazon S3” method, find and choose your previously created S3 bucket, and click “Save and Deploy”.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T8JZ6mAQ6cDPl2HtUp4k7A.png" /></figure><p>It should now deploy your frontend to the given domain.</p>
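<p>As an aside, the console steps in section 2.1 can also be done from the terminal with the AWS CLI. A rough sketch, assuming the CLI is installed and configured, with “my-react-app-bucket” as a placeholder bucket name:</p><pre># Create the bucket (names must be globally unique)<br>aws s3 mb s3://my-react-app-bucket<br><br># Upload your build output (use dist/ for Vite projects)<br>aws s3 sync build/ s3://my-react-app-bucket<br><br># Enable static website hosting with index.html as the index document<br>aws s3 website s3://my-react-app-bucket --index-document index.html</pre>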
<h4>3. Linking Cognito to the newly deployed URL (OPTIONAL):</h4><p><strong>3.1: Changing code configuration:</strong></p><pre><br>const cognitoAuthConfig = {<br>  authority: &quot;&lt;cognitodomain&gt;&quot;,<br>  client_id: &quot;&lt;clientID&gt;&quot;,<br>  // Change redirect_uri from http://localhost:xxxx --&gt; amplify deployed URL<br>  redirect_uri: &quot;https://staging.xxxxxxxxx.amplifyapp.com/&quot;,<br>  response_type: &quot;code&quot;,<br>  scope: &quot;email openid profile&quot;,<br>  onSigninCallback: (_user) =&gt; {<br>    window.history.replaceState({}, document.title, window.location.pathname);<br>  },<br>};</pre><pre>const signOut = async () =&gt; {<br>    await auth.removeUser();<br>    <br>    const clientId = &quot;&lt;clientID&gt;&quot;;<br>    // Change logoutUri from http://localhost:xxxx --&gt; amplify deployed URL<br>    const logoutUri = &quot;https://staging.xxxxxxxxx.amplifyapp.com/&quot;;<br>    const cognitoDomain = &quot;&lt;domain&gt;&quot;;<br>    window.location.href = `${cognitoDomain}/logout?client_id=${clientId}&amp;logout_uri=${encodeURIComponent(logoutUri)}`;<br>  };</pre><p><strong>3.2: Adding Authorized URLs to Cognito and External Providers:</strong></p><ul><li>Begin by going to AWS Console &gt; Cognito &gt; Your project’s user pool &gt; App Clients &gt; Your app client &gt; Login Pages &gt; click “Edit” for the Managed login pages configuration section.</li><li>Add your Amplify deployed URL to both the Callback URLs and Sign Out URLs.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*btcO_zTH_fM5ropHkMjI_g.png" /></figure><ul><li>Next, if you have an external provider like Google, go to Google Cloud Platform console &gt; Your project &gt; APIs &amp; Services &gt; Credentials &gt; Add your deployed Amplify URL to the authorized redirect URIs.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/867/1*NypcEJsf7XRwyu3Z_mgLDg.png" /></figure><h3>Final Result:</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FYE-3BLNNOTBm6c7FKOQQA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9ZNWLScRJhciK9tQBj--BA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OtOdKSCYEw60PVR3E4A07g.png" /></figure>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Setting Up Amazon Cognito for your React App]]></title>
            <link>https://medium.com/@johnelisaaa/setting-up-amazon-cognito-for-your-react-app-787de7999c07?source=rss-d1bc041ef9a3------2</link>
            <guid isPermaLink="false">https://medium.com/p/787de7999c07</guid>
            <category><![CDATA[amazon-cognito]]></category>
            <category><![CDATA[reactjs]]></category>
            <category><![CDATA[user-authentication]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[vitejs]]></category>
            <dc:creator><![CDATA[John Elisa]]></dc:creator>
            <pubDate>Mon, 30 Dec 2024 14:26:43 GMT</pubDate>
            <atom:updated>2024-12-31T04:40:13.199Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qZ09x2978aT0l3bJ-3WlZQ.jpeg" /></figure><h3>Set Up Amazon Cognito for your React App in 6 minutes</h3><p>I have a React project initialized with Vite; the app is basically a simple replica of Amazon. I have a navigation bar here that has a “Sign In” button.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yYRV-cusPLE8nwgV" /></figure><blockquote><strong>The goal of this tutorial:<br></strong>- Implement user registration/login using Cognito.<br>- Add a sign-in method using an external identity provider (Google). <br>- Support user registration and login through username/password. <br>- Once a user is logged in, the “Sign in” button in the navbar changes to a “Sign Out” button.</blockquote><p>If you have absolutely no idea what Amazon Cognito is, watching this video by “be a better dev” will help a lot, but you don’t need to watch it to proceed with this tutorial.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FQEGo6ZoN-ao%3Fstart%3D386%26feature%3Doembed%26start%3D386&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DQEGo6ZoN-ao&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FQEGo6ZoN-ao%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/f93c6789781e6e617a630374803448c5/href">https://medium.com/media/f93c6789781e6e617a630374803448c5/href</a></iframe><h3><strong>Part 1: Setup (AWS)</strong></h3><ol><li>Go to <a href="https://aws.amazon.com/">https://aws.amazon.com/</a>. Create an AWS account and sign in to the console.</li><li>Search “Cognito”. Click “Get started for free in less than five minutes” under the “Add sign-in and sign-up experiences to your app” section.</li><li>Define your application and click “Create”. We’ll skip the return URL section for now.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gVrd9Iq1vYNkqHF0FAhzRg.png" /></figure><h3>Part 2: Setup (Google Sign-in)</h3><ol><li>Go to <a href="https://console.cloud.google.com/">https://console.cloud.google.com/</a> &gt; create a new project &gt; APIs &amp; Services &gt; Credentials &gt; Create Credentials &gt; OAuth client ID. Fill in the name and application type, and click “Create”.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/897/1*VgsHa1EQrTJvT8TivLsA8w.png" /><figcaption><strong>IMPORTANT: Add your localhost/web URL and “&lt;cognito-domain&gt;/oauth2/idpresponse” to the authorized redirect URIs section. After creation, you will be given a ‘Client ID’ and ‘Client Secret’.</strong></figcaption></figure><p>Next, go back to the AWS console &gt; proceed to your newly created Cognito application &gt; user pool &gt; social and external providers &gt; add identity provider &gt; Google.</p><p>Copy your <strong>Client ID</strong> and <strong>Client Secret </strong>obtained earlier and paste them into the appropriate fields.
In the “Authorized scopes” section, add “email profile openid”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Yc36vmGr2IXyYDwq29QlgQ.png" /></figure><p>You’ve successfully added a “Sign in with Google” feature.</p><h3>Part 3: Code Implementation</h3><p><strong>Here was my initial code:</strong></p><pre>// main.jsx<br>import { StrictMode } from &#39;react&#39;<br>import { createRoot } from &#39;react-dom/client&#39;<br>import &#39;./index.css&#39;<br>import App from &#39;./App.jsx&#39;<br><br>createRoot(document.getElementById(&#39;root&#39;)).render(<br>  &lt;StrictMode&gt;<br>    &lt;App /&gt;<br>  &lt;/StrictMode&gt;,<br>)</pre><pre>// App.jsx<br>import Navbar from &quot;./components/Navbar/Navbar&quot;;<br><br>function App() {<br>  return (<br>    &lt;Navbar /&gt;<br>  );<br>}<br><br>export default App;</pre><pre>// Navbar.jsx<br>import React, { useState } from &#39;react&#39;<br>import &#39;./navbar.css&#39;<br>import amazonLogo from &#39;../../assets/amazon_logo.png&#39;<br><br>const Navbar = () =&gt; {<br>  const [isResponsive, setIsResponsive] = useState(false)<br><br>  const toggleResponsiveMenu = () =&gt; {<br>    setIsResponsive(!isResponsive)<br>  }<br><br>  return (<br>    &lt;nav className=&quot;navbar&quot;&gt;<br>      &lt;div className=&quot;navbar-logo&quot;&gt;<br>        &lt;img src={amazonLogo} alt=&quot;Company Logo&quot; /&gt;<br>      &lt;/div&gt;<br><br>      &lt;div className=&quot;search-bar&quot;&gt;<br>        &lt;input type=&quot;text&quot; placeholder=&quot;Search for products...&quot; /&gt;<br>        &lt;button&gt;Search&lt;/button&gt;<br>      &lt;/div&gt;<br><br>      &lt;ul className={`navbar-links ${isResponsive ? &#39;active&#39; : &#39;&#39;}`}&gt;<br>        &lt;li&gt;&lt;a href=&quot;#&quot;&gt;Products&lt;/a&gt;&lt;/li&gt;<br>        &lt;li&gt;&lt;a href=&quot;#&quot;&gt;Cart&lt;/a&gt;&lt;/li&gt;<br>        &lt;li&gt;&lt;button&gt;Sign In&lt;/button&gt;&lt;/li&gt;<br>      &lt;/ul&gt;<br><br>    &lt;/nav&gt;<br>  )<br>}<br><br>export default Navbar;</pre><p>Currently my app only has a navigation bar.
I want users to be able to sign in by clicking the “Sign In” button on my navigation bar. After a successful sign-in, the navigation bar swaps the old “Sign In” button for a “Sign Out” button, and signed-in users can click it to sign out.</p><p>Proceed to the app client you made during the AWS setup, and scroll down to the “quick setup guide”.<strong> Their quick setup guide works, but there are some additional things that need to be added to our code, which I’ll show in the guide below</strong>.</p><h4><strong>Install the oidc-client-ts and react-oidc-context libraries.</strong></h4><pre>npm install oidc-client-ts react-oidc-context --save</pre><h4><strong>Newly modified main.jsx:</strong></h4><pre>// main.jsx<br>import { AuthProvider } from &quot;react-oidc-context&quot;;<br>import App from &quot;./App&quot;;<br>import ReactDOM from &quot;react-dom/client&quot;;<br>import React from &quot;react&quot;;<br><br>const cognitoAuthConfig = {<br>  authority: &quot;&lt;domain-url&gt;&quot;,<br>  client_id: &quot;&lt;your-client-id&gt;&quot;,<br>  redirect_uri: &quot;http://localhost:5173/&quot;, // Point to your localhost<br>  response_type: &quot;code&quot;,<br>  scope: &quot;email openid profile&quot;,<br>  onSigninCallback: (_user) =&gt; {<br>    window.history.replaceState({}, document.title, window.location.pathname);<br>  },<br>};<br><br>const root = ReactDOM.createRoot(document.getElementById(&quot;root&quot;));<br><br>// wrap the application with AuthProvider<br>root.render(<br>  &lt;React.StrictMode&gt;<br>    &lt;AuthProvider {...cognitoAuthConfig}&gt;<br>      &lt;App /&gt;<br>    &lt;/AuthProvider&gt;<br>  &lt;/React.StrictMode&gt;<br>);</pre><blockquote><strong>AuthProvider:</strong> <br>Wraps the app to provide OpenID Connect (OIDC) authentication (the brain of your app that handles login and logout).</blockquote><blockquote><strong>cognitoAuthConfig:<br></strong>- <strong>authority</strong>: The AWS Cognito domain URL (it tells the app where to send authentication requests).</blockquote><blockquote>- <strong>client_id</strong>: A unique identifier for your app, issued by AWS Cognito.</blockquote><blockquote>- <strong>redirect_uri</strong>: The URL to which users are redirected after signing in or out.</blockquote><blockquote>- <strong>response_type</strong>: Indicates the OAuth flow being used (code means Authorization Code Flow).
Basically a website’s way of asking for a secret key safely after you log in.</blockquote><blockquote><strong>- scope</strong>: Tells the app what parts of your profile (like your email) it’s allowed to see.</blockquote><p><strong>The quick setup guide didn’t include:</strong></p><p><strong>onSigninCallback: </strong>This is called after a successful login to clean up the browser’s URL.</p><p>When a user signs in, they are redirected back to your application with a URL containing query parameters like:</p><pre>http://localhost:5173/?code=abc123&amp;state=xyz987</pre><p>If the user refreshes the page while these parameters are still in the URL, this can cause problems like <strong>duplicate processing</strong>: when the app reloads, the react-oidc-context library sees the ‘code’ and ‘state’ in the URL and tries to process them again.</p><p>Since the ‘state’ parameter was already consumed during the initial sign-in, the library can’t find a matching state in its storage, resulting in a common error like:</p><pre>No matching state found in storage</pre><p><strong>The onSigninCallback function fixes this</strong> as it removes the ‘code’ and ‘state’ parameters from the URL <strong>immediately </strong>after they are processed. If the user refreshes the page now, there are no leftover code or state parameters for the library to reprocess. The library simply reads the user’s tokens (already stored) and continues as normal.</p><p>The app is then finally created and wrapped in the AuthProvider so every part of the app can know if you’re logged in or out.</p><h4><strong>Newly modified App.jsx:</strong></h4><pre>import React from &quot;react&quot;;<br>import { useAuth } from &quot;react-oidc-context&quot;;<br>import Navbar from &quot;./components/navbar/navbar&quot;;<br>import &quot;./App.css&quot;;<br><br>function App() {<br>  const auth = useAuth();<br><br>  const signOut = async () =&gt; {<br>    // Remove the user from local session<br>    await auth.removeUser();<br>    <br>    // Then redirect to Cognito’s logout endpoint<br>    const clientId = &quot;&lt;client-id&gt;&quot;;<br>    const logoutUri = &quot;http://localhost:5173/&quot;;<br>    const cognitoDomain = &quot;&lt;cognito-domain&gt;&quot;;<br>    window.location.href = `${cognitoDomain}/logout?client_id=${clientId}&amp;logout_uri=${encodeURIComponent(logoutUri)}`;<br>  };<br><br>  switch (auth.activeNavigator) {<br>    case &quot;signinSilent&quot;:<br>      return &lt;div&gt;Signing you in...&lt;/div&gt;;<br>    case &quot;signoutRedirect&quot;:<br>      return &lt;div&gt;Signing you out...&lt;/div&gt;;<br>  }<br><br>  if (auth.isLoading) {<br>    return &lt;div&gt;Loading...&lt;/div&gt;;<br>  }<br><br>  if (auth.error) {<br>    return &lt;div&gt;Oops... {auth.error.message}&lt;/div&gt;;<br>  }<br><br>  // Pass signOut to Navbar<br>  return (<br>    &lt;div&gt;<br>      &lt;Navbar signOut={signOut} /&gt;<br>    &lt;/div&gt;<br>  );<br>}<br><br>export default App;</pre><blockquote><strong>useAuth:</strong><br>Provides authentication-related state/methods (e.g., auth.isLoading, auth.removeUser). Tracks whether the user is logged in or out.</blockquote><blockquote><strong>signOut Function:<br></strong>- <strong>auth.removeUser()</strong>: Logs the user out locally (removes the user session data from the browser).
<br>- Then redirects to Cognito’s logout endpoint, completing the logout process on Cognito’s side.</blockquote><blockquote><strong>auth.activeNavigator: </strong><br>- <strong>signinSilent</strong>: Renders a message when signing in. <br>- <strong>signoutRedirect:</strong> Renders a message when signing out via redirect.</blockquote><blockquote><strong>auth.isLoading: </strong><br>Displays a loading message while the authentication process is ongoing.</blockquote><blockquote><strong>auth.error: </strong><br>Displays an error message if something goes wrong with authentication.</blockquote><p>Finally, we pass the signOut function to the Navbar component so the sign out button can trigger logout when needed.</p><h4>Newly modified navbar.jsx (Navigation bar)</h4><pre>import React, { useState } from &quot;react&quot;;<br>import &quot;./navbar.css&quot;;<br>import amazonLogo from &quot;../../assets/amazon_logo.png&quot;;<br>import { useAuth } from &quot;react-oidc-context&quot;;<br><br>// Receive the signOut method passed down from App.jsx<br>const Navbar = ({ signOut }) =&gt; {<br>  const [isResponsive, setIsResponsive] = useState(false);<br>  const auth = useAuth();<br><br>  const toggleResponsiveMenu = () =&gt; {<br>    setIsResponsive(!isResponsive);<br>  };<br><br>  return (<br>    &lt;nav className=&quot;navbar&quot;&gt;<br>      &lt;div className=&quot;navbar-logo&quot;&gt;<br>        &lt;img src={amazonLogo} alt=&quot;Company Logo&quot; /&gt;<br>      &lt;/div&gt;<br><br>      &lt;div className=&quot;search-bar&quot;&gt;<br>        &lt;input type=&quot;text&quot; placeholder=&quot;Search for products...&quot; /&gt;<br>        &lt;button&gt;Search&lt;/button&gt;<br>      &lt;/div&gt;<br><br>      &lt;ul className={`navbar-links ${isResponsive ? &quot;active&quot; : &quot;&quot;}`}&gt;<br>        &lt;li&gt;&lt;a href=&quot;#&quot;&gt;Products&lt;/a&gt;&lt;/li&gt;<br>        &lt;li&gt;&lt;a href=&quot;#&quot;&gt;Cart&lt;/a&gt;&lt;/li&gt;<br>        &lt;li&gt;<br>          {/* If the user is logged in, swap the Sign In button for Sign Out */}<br>          {auth.isAuthenticated ? (<br>            &lt;button onClick={signOut}&gt;Sign out&lt;/button&gt;<br>          ) : (<br>            &lt;button onClick={() =&gt; auth.signinRedirect()}&gt;Sign in&lt;/button&gt;<br>          )}<br>        &lt;/li&gt;<br>      &lt;/ul&gt;<br>    &lt;/nav&gt;<br>  );<br>};<br><br>export default Navbar;</pre><blockquote>- <strong>If the user is logged in (auth.isAuthenticated is true):</strong> <br> The navbar shows a Sign Out button.</blockquote><blockquote>- <strong>If the user is not logged in (auth.isAuthenticated is false):<br> </strong>The navbar shows a Sign In button that triggers the auth.signinRedirect() method to start the login process.</blockquote><p><strong>Finished!</strong></p><h3>Final Result:</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nFEHu3QW2qqfdHwHxmicYQ.png" /></figure>]]></content:encoded>
        </item>
    </channel>
</rss>