ELT Project London Bus Safety Performance — 2

Y_Lu
3 min readAug 25, 2022

--

In this part, we will divide data into different dimensions and store in different tables for future use. It is the normal practice in OLAP for analytical use.

More about OLAP and dimensional models: https://medium.com/analytics-vidhya/data-warehousing-schemas-961391b3f3de

Here, I will divide it into 4 dimensions(Date, incident place, bus group, incident type). I will create 4 tables (dim_date, dim_place, dim_company_cat, dim_incident_type) and store relative data.

  1. dim_date

There is one column of ‘Date of Accident’ in the original dataset and the format is yyyy-mm-dd. based on this data, we could analyze datetime from different aspects. Here, I will divide it into year, month, day, week, half_year, quarter.

As dividing date into dimensions is a normal practice, we could store it in a procedure for future use.

Create dim_date
Create date procedure

2.dim_place

Subregions of London
Create dim_place
Insert into value

3.dim_company_cat

Based on bus company, we can divide them into different categories for future analyzation. We could consider the company size, national or foreign company, listed or not…

Here, I couted the number of operation lines to decide the company size (1 big, 2 middle, 3 small)

check company size

bonus: (another way to create and insert data using subqueries)

4. dim_incident_type

After selecting distinct incident event type, we could find there are we could classify below events into different categories. Therefore, we could create a dimension for it.

check incident type

In this part, we create four dimensional tables ( dim_date, dim_incident_place, dim_company_cat, dim_incident_type) for analytical use.

--

--

Y_Lu

a data engineer who enjoys exploring data world and new teches