[Basics] How to create a good ETL flow for your data warehouse
As a BI analyst, I need to ensure that everyone (myself included) writes good ETL flows with standard naming conventions and SQL operations.
Let’s begin this tutorial with a brief explanation of what an ETL flow is.
ETL stands for:
- Extraction
- Transformation
- Loading
These are among the most important phases in developing a data warehouse.
But what is each phase for?
EXTRACT:
This is the raw data retrieval step: we read data from whichever source we need and start uploading it into one of our data warehouse tables.
In this phase it is usually good to read all the data like this:
- If the source is another database, a good practice is to read every column (even numeric ones) as NVARCHAR; see the sketch right after this list.
- If the source is an Excel or .CSV file, just format each column as DT_WSTR, even the columns holding numbers.
- I usually name my source table table_name_source.
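A minimal sketch of the database case (every table, column, and server name here is hypothetical; the Excel/.CSV case is handled in the SSIS designer by setting each column’s type to DT_WSTR):

```sql
-- Extraction sketch (hypothetical names): read every column as NVARCHAR,
-- numbers included, so the raw load never fails on badly typed source data.
INSERT INTO dbo.table_name_source (vendor, amount, sale_date)
SELECT
    CAST(vendor    AS NVARCHAR(255)),
    CAST(amount    AS NVARCHAR(255)),
    CAST(sale_date AS NVARCHAR(255))
FROM SRC.SourceDb.dbo.sales;  -- SRC is a hypothetical linked server
```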
TRANSFORM:
This is the phase where we modify the data with aggregations or other operations.
We have now uploaded the data “as is” into our data warehouse; it is good not to make any modifications during the upload phase, so that our data stays as raw as possible.
In this second phase, called “transform”, we can modify our source data, reading directly from table_name_source.
The extract table (e.g. table_name_extract) will contain some additional columns:
- OP: the column where we decide whether the row will be inserted or not.
- I_Date, U_Date, D_Date: the insert date, update date, and deletion date (the deletion is only logical).
- SC: the source code, which identifies our row by a unique set of data.
- CS: the checksum.
The extract table is filled, at first, with all the data we need (e.g. some columns may already carry an operation such as a percentage or a multiplication).
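For instance, a sketch of that first fill (names still hypothetical; the cast is only there to apply the arithmetic, the column itself can stay NVARCHAR):

```sql
-- Fill sketch: copy the raw rows into the extract table, already applying
-- a simple operation (here a percentage) on one of the columns.
INSERT INTO dbo.table_name_extract (vendor, amount, sale_date)
SELECT
    s.vendor,
    CAST(s.amount AS DECIMAL(18, 2)) * 1.22,  -- e.g. add 22% on the way in
    s.sale_date
FROM dbo.table_name_source AS s;
```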
Once we have filled our extract table, we proceed with some updates.
- I_Date: we update this column with a simple getdate().
- U_Date and D_Date: we update these columns only when needed, by checking whether the ID column (I haven’t mentioned it before, but it is always needed and is usually an IDENTITY column) together with a combination of columns already exists in the final table with different data (then U_Date is filled), or whether a row of the final table no longer arrives from the source (then D_Date is filled, since the deletion is only logical).
- SC: always lets us identify the row in a unique manner; it is needed because this column concatenates data from several other columns, as the code below shows.
- CS: a simple checksum function.
- OP: the operation on the row: “i”, “u”, or “d”.
Now I will sum up all of this information in one .SQL file:
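A sketch of what such a file could contain; every name is hypothetical, and CONCAT/CHECKSUM are just one possible way to build SC and CS:

```sql
-- Transform sketch (hypothetical names): fill the technical columns and
-- decide the operation for every row of the extract table.

-- SC: concatenate the business-key columns into one unique identifier.
UPDATE dbo.table_name_extract
SET SC = CONCAT(vendor, '|', sale_date);

-- CS: a checksum over the value columns, used to detect changed rows.
UPDATE dbo.table_name_extract
SET CS = CHECKSUM(amount);

-- I_Date: every freshly extracted row gets the current date.
UPDATE dbo.table_name_extract
SET I_Date = GETDATE();

-- OP = 'u': the row already exists in the final table with a different checksum.
UPDATE e
SET OP = 'u', U_Date = GETDATE()
FROM dbo.table_name_extract AS e
JOIN dbo.table_name AS f ON f.SC = e.SC
WHERE f.CS <> e.CS;

-- OP = 'i': the row does not exist in the final table yet.
UPDATE e
SET OP = 'i'
FROM dbo.table_name_extract AS e
WHERE NOT EXISTS (SELECT 1 FROM dbo.table_name AS f WHERE f.SC = e.SC);

-- OP = 'd': rows of the final table that no longer arrive from the source are
-- re-inserted into the extract flagged as logical deletes (one possible approach).
INSERT INTO dbo.table_name_extract (SC, CS, OP, D_Date)
SELECT f.SC, f.CS, 'd', GETDATE()
FROM dbo.table_name AS f
WHERE NOT EXISTS (SELECT 1 FROM dbo.table_name_extract AS e WHERE e.SC = f.SC);

-- Rows with an unchanged checksum keep OP NULL and are simply skipped later.
```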
Now that we have uploaded and transformed our data, we pass through the last staging area: the preload table.
In this table we duplicate our columns, for example:
- The Vendor column will have EVendor and PVendor.
E stands for Extract and P for Preload, and this goes for all the columns. Why?
Because here we can add another layer of transformation to our data. For example, we transform the vendor name column into an idVendor column by joining the extract table with another table (a dimension table, precisely) that contains only the vendor names, each paired with an Id, so our data retrieval will be faster!
Also in the preload phase we convert all the numeric columns into integer columns (or decimal, depending on your needs), and even the id columns will be in integer format.
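A minimal sketch of the preload fill, assuming a hypothetical dim_vendor dimension table (idVendor, vendor_name):

```sql
-- Preload sketch (hypothetical names): keep the extract value (E*) next to the
-- preload value (P*), resolve the vendor name to its surrogate key, and cast
-- the raw NVARCHAR numbers to their final types.
INSERT INTO dbo.table_name_preload
    (EVendor, PVendor, EAmount, PAmount, SC, CS, OP, I_Date, U_Date, D_Date)
SELECT
    e.vendor,                          -- E*: the value as it came from the extract
    d.idVendor,                        -- P*: the value as it will reach the final table
    e.amount,
    CAST(e.amount AS DECIMAL(18, 2)),  -- numbers finally leave their NVARCHAR form
    e.SC, e.CS, e.OP, e.I_Date, e.U_Date, e.D_Date
FROM dbo.table_name_extract AS e
LEFT JOIN dbo.dim_vendor AS d          -- LEFT JOIN so the 'd' rows, which carry
    ON d.vendor_name = e.vendor;       -- no vendor, still pass through
```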
LOAD:
We upload the data into the final table without modifying anything.
In this phase we do a raw upload of our data, depending on the operation:
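Still with the same hypothetical names, the load can boil down to three statements driven by OP:

```sql
-- Load sketch (hypothetical names): apply each preload row according to its OP flag.

-- 'i': brand new rows are inserted as they are.
INSERT INTO dbo.table_name (idVendor, amount, SC, CS, I_Date)
SELECT p.PVendor, p.PAmount, p.SC, p.CS, p.I_Date
FROM dbo.table_name_preload AS p
WHERE p.OP = 'i';

-- 'u': changed rows overwrite the existing ones and get their U_Date.
UPDATE f
SET f.idVendor = p.PVendor,
    f.amount   = p.PAmount,
    f.CS       = p.CS,
    f.U_Date   = p.U_Date
FROM dbo.table_name AS f
JOIN dbo.table_name_preload AS p ON p.SC = f.SC
WHERE p.OP = 'u';

-- 'd': the deletion is only logical, so we just stamp D_Date.
UPDATE f
SET f.D_Date = p.D_Date
FROM dbo.table_name AS f
JOIN dbo.table_name_preload AS p ON p.SC = f.SC
WHERE p.OP = 'd';
```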
That’s it! I hope you found this little guide helpful for creating a good workflow. Adapt it to your needs, and tell me if there are mistakes in the syntax or logic (we will discuss them here).