Untangling Databases: Data modeling with Dbeaver
You are searching for answers to challenges like these:
- Longing for a smooth query writing experience
- What the heck do foreign/secondary keys really do?
- Tried data modeling at least two multi-table datasets and got frustrated with the process
- Afraid of ingesting data into database tables
All of the above questions will be answered, and a lot more interesting ideas are shared along the way. Here is the table of contents for your reference.
- Requirements
- A detour on Whys
- Setting up In Memory DB connection in Dbeaver
- Connecting the IDE with the database
- Create tables and load the data inside the DB
- ER diagram: Wealth Indicators dataset
- Horror Story Time
- Why data model the tables?
- Some challenges with the in-memory database
- The final Leg
- Where to Next???
Requirements
- Two multi-table datasets to practice on, located in a local folder. I will be using the world indicators dataset (a 3-table dataset) for the explanations. You can practice on the Microsoft maintenance prediction dataset from Kaggle (a 5-table dataset), or, if you are more adventurous, the snldb dataset (look for the output directory), an 11-table dataset created by Hendrik Hilleckes by scraping SNL data.
- Eager to use DBeaver, the most awesome database query IDE. Install it from here before reading the next sentence…
- Experiment with the “in-memory” database DuckDB, which needs to be, you guessed it, installed before reading any further.
- Allocate 2.5 hours to understand and practice all the steps
A detour on Whys
Why DuckDB? Mainly because you don’t need a PostgreSQL/MySQL database running on your machine. No Python or JavaScript here, just pure database querying. You will only need a working command prompt to wget/curl the DuckDB binary. (We are not using Python’s DuckDB package.)
Why DBeaver? It is the VS Code of database querying. DBeaver lets you connect to DuckDB’s in-memory database and create tables in it. It lets you write SQL queries inside a query_script.sql file and store it for later reference. The best part: if you want to migrate the above CSV files into a PostgreSQL database, all you need to do is change the data source inside DBeaver.
Why the above datasets? I have data modeled all three of them on my Linux machine with DBeaver and DuckDB, so you should be able to go through the entire process without much trouble. These datasets are neither too challenging nor too trivial; they are at the right level to improve your mastery.
Setting up In Memory DB connection in Dbeaver
Once you download the DBeaver IDE (referred to as the IDE from now on), you will need some time to orient yourself to the various options in the IDE’s menus and icons. To set up the DB connection, look for the “plug” icon at the top left of the IDE.
Click on it and you will be greeted with a dialog box to choose the database. Choose DuckDB and click Next. That leads to the following dialog box. This part is important; pay attention to the red ellipses.
In the path field, type “:memory:” and then click “Test Connection” at the bottom. Ensure that you have internet connectivity. If the driver is downloaded and installed correctly, you will get a pop-up like the one below.
Once the test is successful, click the Finish button at the bottom, which will take you back to the root UI of the IDE. It will show the database connection, and you can expand the tree under the connection. There will be no tables there yet.
What if you face an error in DuckDB connectivity? Check your internet connection. Check the spelling of “:memory:” inside the path field. If the error is about the driver not being downloaded, go to Window > Preferences. In the dialog box that appears, navigate to Connections > Drivers > Maven.
In that dialog, click the “Add” button and paste https://mvnrepository.com/ . Then try to create the database connection again. It should go through.
Connecting the IDE with the database
In the IDE, go to SQL Editor > New SQL Script to open a brand new script. By default the script will not have any database connection associated with it. You have to assign the script a database connection first.
After successfully assigning a connection to the database, your script will look like this.
Hold on. We did not use DuckDB directly, so how are we connecting to a database? If you wondered about that, great job. Remember the driver that was downloaded during the database connection step; that driver takes care of creating the “in-memory DB” inside the IDE.
Then why did you force us to download DuckDB? I have faced the IDE crashing at times due to unforeseen issues. When that happens, you can simply continue the tutorial using the DuckDB interface, by running the commands at the duckdb prompt.
# At your command prompt or shell
$ wget https://github.com/duckdb/duckdb/releases/download/v0.6.1/duckdb_cli-linux-i386.zip
$ unzip duckdb_cli-linux-i386.zip
$ ls
duckdb  duckdb_cli-linux-i386.zip
# preferably move the binary to the /usr/bin directory, or add the directory
# containing duckdb to the system path
$ mv duckdb /usr/bin
# or
$ export PATH=$PATH:/path/to/directory/having/duckdb/
# change directory to the folder containing the downloaded dataset
$ ls
dimension_country.csv dimension_indicator.csv facttable.csv
$ duckdb
# that command will yield the duckdb prompt
D
# that D is the prompt of duckdb... Yeah, it can be confusing at first.
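Once you are at the D prompt, the same SQL used in the rest of this article works directly. As a minimal sketch (assuming you are in the dataset folder), here is a quick smoke test that a CSV is readable:
-- Run at the DuckDB prompt (D): a quick smoke test that the CSV is readable
-- before creating any tables. Adjust the filename/path if yours differ.
SELECT COUNT(*) FROM read_csv_auto('facttable.csv', header=True);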
Create tables and load the data inside the DB
We will work with the SQL script file. I assume you are a ninja-level data hacker and got the IDE to create the connection; you are ready to fly. Please type out the commands below in the editor. The auto-completion in the DBeaver IDE is phenomenal. I bet you will never copy-paste a SQL query from this moment on.
/*Please **type** the below code into the SQL script and change the path.*/
/*We are using the wealth indicator dataset from Kaggle:*/
/*https://www.kaggle.com/datasets/robertolofaro/selected-indicators-from-world-bank-20002019?select=dimension_country.csv */
drop table if exists fact_table;
CREATE TABLE fact_table AS SELECT * FROM read_csv_auto('/path/to/data/facttable.csv', header=True);
drop table if exists dimension_country;
CREATE TABLE dimension_country AS SELECT * FROM read_csv_auto('/path/to/data/dimension_country.csv', header=True);
drop table if exists dim_indicator;
CREATE TABLE dim_indicator AS SELECT * FROM read_csv_auto('/path/to/data/dimension_indicator.csv', header=True);
/*The above commands are executed after the in-memory database is connected in
DBeaver.
Note: if the DBeaver instance is closed, the in-memory tables in the database
are lost.*/
select *
from fact_table ft
limit 5;
/*The output will be as below. I copy-pasted the output from the IDE.*/
Country Code|Indicator Code |2000 |2001 |2002 |
------------+-----------------+-----------------+-----------------+-----------------+
AFE |IC.BUS.DISC.XQ | | | |
AFE |IC.CRD.INFO.XQ | | | |
AFE |FS.AST.PRVT.GD.ZS| 74.9798926334367| 77.003129903379| 62.4323760940614|
AFE |EG.USE.ELEC.KH.PC| 780.702623963529| 743.916043970902| 769.080854071567|
AFE |EG.IMP.CONS.ZS |-31.3910700728529|-29.1363228434745|-32.9108840507047|
/*You can check the other tables as well, as shown next.*/
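To check the other tables, and to see which column types DuckDB inferred from the CSVs, a couple of built-in DuckDB statements come in handy; a small sketch:
/*List the tables created so far and inspect the inferred column types.*/
SHOW TABLES;
DESCRIBE dimension_country;
DESCRIBE dim_indicator;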
If everything worked for you, take a look at the clock. How long did that take? Now try doing the same on a regular database. We can tick off “smooth query writing experience” and “data ingestion”. We will see the IDE’s real power when we start data modeling.
ER diagram: Wealth Indicators dataset
Let’s put a smile on that face of yours… Look for the tables under the “Database Navigator” in the side panel of the IDE.
Right-click on those tables and you will find the option “View Diagram”. Click it and you will get the smile I promised earlier: the Entity-Relationship diagram of the dataset.
That diagram provides a complete picture of the dataset. Imagine a dataset with 10-15 tables that you need to model and analyze. How useful would this ER diagram be? Now let’s get on with our task of data modeling. The IDE does not do the data modeling for you…
Horror Story Time
When I decided to write this piece, I thought data modeling was not required when analysing data. I was analysing datasets that were not data modeled. I was joining 5 different tables, and the number of rows exploded past 1,000,000, even though the largest source table had only 870,000 rows. The final table was taking more than 15 minutes and the write was not completing…
Here is the query on the “maintenance prediction dataset” (take a look at telemetry.csv in that dataset). That is when I doubled down and worked on data modeling.
select t.machine_id, t.volt, t.rotate, t.pressure, t.vibration,
       e.errorid, f.failure, mc.comp, m."age", m.model
from telemetry t
join errors e
  on t.machine_id = e.machineid
right outer join failures f
  on t.machine_id = f.machine_id
right outer join maint_comp mc
  on t.machine_id = mc.machine_id
right outer join machines m
  on t.machine_id = m.machine_id;
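If you want to see the explosion for yourself, wrap the same join in a count; the exact number depends on the dataset version you downloaded, but it dwarfs the largest source table.
-- Sketch: count the rows produced by the join above; the exact number
-- depends on the dataset version you downloaded.
select count(*) as joined_rows
from telemetry t
join errors e
  on t.machine_id = e.machineid
right outer join failures f
  on t.machine_id = f.machine_id
right outer join maint_comp mc
  on t.machine_id = mc.machine_id
right outer join machines m
  on t.machine_id = m.machine_id;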
A good horror story, but daily activities will not get you into such a situation. Take our 3-table dataset; that looks “daily” enough. You have the tables, and you have an IDE like DBeaver that makes joining the tables a breeze. Doing the following join will complete the data engineering process. Is it really perfect?
Why data model the tables?
Ans: To gain confidence in the data inside the joined tables.
/*query explanation is not provided...*/
select ft."2000" , dc."Country Name" , di."Indicator Name" ,
ft."Country Code" , dc."Country Code"
from dim_indicator di left outer join fact_table ft
on ft."Indicator Code" = di."Indicator Code"
join dimension_country dc
on ft."Country Code" = dc."Country Code";
/*the output is truncated*/
2000 |Country Name |Indicator Name
-----------------+------------------------------+----------------------------------------
|Africa Eastern and Southern |Business extent of disclosure index (0=l
|Africa Eastern and Southern |Depth of credit information index (0=low
74.9798926334367|Africa Eastern and Southern |Domestic credit to private sector (% of
780.702623963529|Africa Eastern and Southern |Electric power consumption (kWh per capi
-31.3910700728529|Africa Eastern and Southern |Energy imports, net (% of energy use)
|Africa Eastern and Southern |Expense (% of GDP)
|Africa Eastern and Southern |Fixed broadband subscriptions (per 100 p
1.86083099700546|Africa Eastern and Southern |Fixed telephone subscriptions (per 100 p
3.35077349562722|Africa Eastern and Southern |GDP growth (annual %)
|Africa Eastern and Southern |Gini index
3.76910996437073|Africa Eastern and Southern |Government expenditure on education, tot
/*Ensure the row count is not more than the largest table's row count, in this case 8778*/
select count(*)
from dim_indicator di left outer join fact_table ft
on ft."Indicator Code" = di."Indicator Code"
join dimension_country dc
on ft."Country Code" = dc."Country Code";
row_count
8778
After joining, you scroll through and check the country codes from fact_table and dimension_country. They match. You also check by switching the join between left outer and right outer variants. The final table is rock steady. Is that not perfect?
Downsides of not data modeling: Did you realize that, instead of trusting the database server, you are manually checking the final joined table? Humans should not do the work machines can do. Manual checking hurts developer and team productivity, leads to unexpected bugs, and under-utilises the processing power that is at your service.
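As an interim step, you can at least make the database do that comparison for you. Below is a minimal sketch of an anti-join that flags any fact rows whose country code has no match in dimension_country; zero rows returned means the codes line up.
/*Anti-join sketch: any row returned here is a country code in fact_table
  with no match in dimension_country. Zero rows means the codes line up.*/
select distinct ft."Country Code"
from fact_table ft
left join dimension_country dc
on ft."Country Code" = dc."Country Code"
where dc."Country Code" is null;
But that is still a check you have to remember to write and run every time. The real fix is to bake the constraints into the table itself, which is exactly what data modeling does.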
What is data modeling? Ans: A set of constraints fixed in a database table before the data is copied into it. An example will clarify this better.
In the ER diagram you see lines running from one column to another. These lines signify the constraint that one column applies to a column in another table.
Note that the data model is “baked” into the central fact_table before the data is pushed into it. So if we have to data model the wealth indicators fact_table, we have to create the table with those constraints. The code will look like this.
/*Recreating the fact_table as fact_data with the foreign key constraints*/
CREATE TABLE fact_data(country_code varchar references dimension_country("Country Code"),
indicator_code varchar references dim_indicator("Indicator Code"),
year2000 NUMERIC ,year2001 NUMERIC ,year2002 NUMERIC ,year2003 NUMERIC,
year2004 NUMERIC ,year2005 NUMERIC ,year2006 NUMERIC ,year2007 NUMERIC,
year2008 NUMERIC ,year2009 NUMERIC,
year2010 NUMERIC ,year2011 NUMERIC ,year2012 NUMERIC ,year2013 NUMERIC,
year2014 NUMERIC ,year2015 NUMERIC ,year2016 NUMERIC ,year2017 NUMERIC,
year2018 NUMERIC ,year2019 NUMERIC ,year2020 NUMERIC ,year2021 NUMERIC
);
Those REFERENCES clauses create the foreign key constraints on the fact_data table. The same kind of constraints can be added to existing tables in regular SQL databases using an ALTER TABLE ... ADD FOREIGN KEY command, as sketched below.
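For reference, on a regular SQL database the after-the-fact version might look like the following sketch (PostgreSQL-style syntax; the constraint name fk_fact_country is made up, and the referenced column must already be a primary key, like in the primary_country table created in the next section).
/*Sketch only: adding the same foreign key to an existing table.
  The constraint name fk_fact_country is a made-up example.*/
ALTER TABLE fact_data
    ADD CONSTRAINT fk_fact_country
    FOREIGN KEY (country_code)
    REFERENCES primary_country (country_code);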
Some challenges with the in-memory database
If you run the above code, the IDE will complain that the country_code column refers to a column in the dimension_country table that is neither a primary key nor unique.
That is because we created the in-memory table by simply importing the CSV; the columns in that table were created by DuckDB without any constraints. There is another way of importing the data, with the COPY command, as shown below.
create table primary_country(country_code varchar primary key,
country_name varchar);
create table primary_indicator(indicator_code varchar primary key,
indicator_name varchar);
COPY primary_country FROM '/path/to/data/dimension_country.csv' (DELIMITER ',', HEADER);
COPY primary_indicator FROM '/path/to/data/dimension_indicator.csv' (DELIMITER ',', HEADER);
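If you want to confirm the COPY worked, a quick optional check is to compare the row count of a keyed table against its auto-imported counterpart from earlier; the two counts should match.
/*Optional sanity check: the keyed table should hold exactly as many rows
  as the auto-imported dimension_country table from earlier.*/
SELECT (SELECT count(*) FROM primary_country)   AS primary_country_rows,
       (SELECT count(*) FROM dimension_country) AS dimension_country_rows;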
The above method of creating the tables first and then loading the data into them might seem like extra work. The benefit will be apparent very soon. Once you have the above tables, the fact_data table can be rewritten:
drop table if exists fact_data;
CREATE TABLE fact_data(country_code varchar references primary_country(country_code),
indicator_code varchar references primary_indicator(indicator_code),
year2000 NUMERIC ,year2001 NUMERIC ,year2002 NUMERIC ,year2003 NUMERIC,
year2004 NUMERIC ,year2005 NUMERIC ,year2006 NUMERIC ,year2007 NUMERIC,
year2008 NUMERIC ,year2009 NUMERIC,
year2010 NUMERIC ,year2011 NUMERIC ,year2012 NUMERIC ,year2013 NUMERIC,
year2014 NUMERIC ,year2015 NUMERIC ,year2016 NUMERIC ,year2017 NUMERIC,
year2018 NUMERIC ,year2019 NUMERIC ,year2020 NUMERIC ,year2021 NUMERIC
);
This time the IDE will accept the table creation, since the foreign keys in the table refer to primary keys inside the primary_country and primary_indicator tables. With that, the die is cast for the data to flow into the table.
The final Leg
COPY fact_data FROM '/path/to/data/facttable.csv' (DELIMITER ',', HEADER);
When the above COPY command is executed, the data is copied from the CSV into the database table called fact_data. If facttable.csv contains data that doesn’t conform to the constraints set on the fact_data table, the IDE/database server will raise an error and stop the copy process. So you can rest assured that the fact_data table contains no country or indicator codes that are missing from the dimension tables.
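To see the constraint doing its job, you can try to sneak in a row that violates it. Below is a sketch; the country code 'XXX' is made up precisely so that it does not exist in primary_country, and the database should reject the insert with a foreign key violation (the exact error text will vary).
/*Sketch: this insert should be rejected with a foreign key violation,
  because the made-up code 'XXX' does not exist in primary_country.*/
INSERT INTO fact_data (country_code, indicator_code, year2000)
VALUES ('XXX', 'FS.AST.PRVT.GD.ZS', 1.0);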
Where to Next???
The last section should alleviate your fear of “data ingestion” into database tables, give you a working knowledge of foreign keys, and finally make the process of data modeling as tangible as “pouring hot metal into a die”. For the answers to really make sense, you have to go through the above process yourself, step by step.
In between the steps, modify the code and explore the IDE, the SQL JOIN and WHERE clauses. Query and see what happens to the data. Try working with the snldb dataset, the 11-table one, to really get a strong grip on data modeling. After that, if you are learning data science, do the same exercise using PySpark, and then try to create a linear regression model or another prediction model.
My references and thanks to
- https://duckdb.org/docs/installation/index
- https://dbeaver.io/download/
- https://www.youtube.com/@DarshilParmar
- https://www.guru99.com/
- https://docs.rilldata.com/
- https://www.kaggle.com/datasets/
I hope my experience can be helpful to you. Thanks for reading…