Build a Tool for Data Analysis — Connect Data

Patrik Braborec
GoodData Developers
9 min read · Jan 4, 2022

If you are interested in GoodData.CN, please contact us. Alternatively, sign up for a trial version of GoodData Cloud: https://www.gooddata.com/trial/.

The amount of data that analysts need to deal with today is staggering, and this brings opportunities for developers to build tools that can help with data analysis. Fortunately, there are many options for doing that.

In this series of articles, I will show you how to implement the solution using GoodData.CN. All I will need is Docker and Node.js.

I will start with data load to the PostgreSQL database. After that, I will connect data to the analytical solution, where I will analyze data and create data visualizations. Finally, I will create a web application to show my data visualizations.

Note: I am using generated data from my data generator tool. Feel free to use it as well. Remember that the data contains dates; to get current dates, you need to run the generator script yourself.

Online Shop Data Analysis

The tool I am about to build will analyze data from an online shop. Imagine that my company runs an online shop: I have a lot of customer, order, and product data in the PostgreSQL database, and I want to start analyzing it.

What can I find out?

  • What is my actual revenue?
  • How many orders were placed yesterday or two months ago?
  • Who is my typical customer based on the best-selling product?

These data points can help me grow my customer base, conduct better marketing campaigns, etc.

I will analyze the e-commerce data from an IoT online shop. The data is stored in four tables — users, products, orders, and order_items. The following image shows the structure of the database:

You can see that the database structure is quite simple, but it offers many exciting data analysis opportunities.

I have several options for how to analyze the data:

  • Create my own program/script to export data from the PostgreSQL database to CSVs and import the data into a spreadsheet for basic analysis.
  • Use the PostgreSQL ODBC driver to connect live data to a spreadsheet and, again, do basic analysis.
  • Use modern analytical tools to build a data application that I can embed into my environment and that can process vast amounts of data at any time.

Ready-made analytical features in modern tools mean that using them is easier than doing the same analysis in a spreadsheet:

  • I can build a semantic layer above my data to help me with data management.
  • I can combine my data with data from other data sources.
  • Modern analytical tools are cloud-native and use an API-first approach — so you can integrate them into the existing data stack easily.

Start PostgreSQL and GoodData.CN Locally

The analytical solution will store data in the PostgreSQL database.

  • For data analysis, I will use GoodData.CN Community Edition (free for evaluation, development, and testing).
  • For monitoring and managing the PostgreSQL database, I will use the PgAdmin tool.

To put everything together, I will use the docker-compose tool.

Create docker-compose.yml and docker.env files

I will store my credentials in the docker.env file:

POSTGRES_USER=admin
POSTGRES_PASSWORD=admin
POSTGRES_DB=iot-online-shop-production
PGADMIN_DEFAULT_EMAIL=admin@iot-online-shop.com
PGADMIN_DEFAULT_PASSWORD=admin

These credentials are just for demo purposes; you should not use anything similar in a production application.

It is time to create the docker-compose.yml file where all services will be configured. Then, with a single command docker-compose up, all services will be started.

Important: Before you start GoodData.CN, ensure that you are familiar with its terms of use (GoodData non-production license agreement). Then, in the environment variables for gooddata-cn-ce, you must add LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES; otherwise, this service will not run.

I configured the images with the correct ports and volumes. Both files are stored here: Analytical Solution.
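For reference, a minimal docker-compose.yml could look roughly like the sketch below. The image tags, volume names, and the /data path for GoodData.CN are my assumptions; the exact files are in the repository linked above.

version: "3.7"
services:
  postgres:
    image: postgres:13
    env_file: docker.env          # POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
  pgadmin:
    image: dpage/pgadmin4
    env_file: docker.env          # PGADMIN_DEFAULT_EMAIL, PGADMIN_DEFAULT_PASSWORD
    ports:
      - "8080:80"                 # PgAdmin UI on http://localhost:8080
    depends_on:
      - postgres
  gooddata-cn-ce:
    image: gooddata/gooddata-cn-ce:latest
    environment:
      - LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES
    ports:
      - "3000:3000"               # GoodData.CN UI on http://localhost:3000
    volumes:
      - gooddata-data:/data
volumes:
  postgres-data:
  gooddata-data: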

Let’s run docker-compose up in the command line (in the folder where I have the docker-compose.yml file), and after a couple of minutes, the following result should appear:

Now, everything should be prepared to insert data into the PostgreSQL database. I can go to http://localhost:3000 to verify that GoodData.CN is running:

Insert Data into PostgreSQL

I already have the data in the PostgreSQL database. If you want to use the same data I do, get the CSVs and follow these instructions. If you just want to read the article without implementing anything, feel free to skip this section.

Now I need to create a new server using the PgAdmin tool.

The PgAdmin tool is running on http://localhost:8080, and the credentials are the ones from docker.env:

PGADMIN_DEFAULT_EMAIL=admin@iot-online-shop.com
PGADMIN_DEFAULT_PASSWORD=admin

The server settings should be as follows:
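The values come from docker.env and docker-compose.yml; the server name itself can be anything:

  • Host name/address: postgres (the name of the database service in docker-compose.yml)
  • Port: 5432
  • Maintenance database: iot-online-shop-production
  • Username: admin
  • Password: admin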

After I connect to the database, I can see the iot-online-shop-production database in my list of databases:

Now it is time to load the data. The data is stored in CSVs, and I need to import it into the PostgreSQL database. First, I need to create the tables, which I will do using the query tool:

And then, I will execute the following commands:
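The full DDL ships with the data generator, so the commands below are only a sketch. The column names and types (name, price, created_at, quantity, and so on) are my assumptions and must match the headers of the generated CSVs; what matters here is the primary keys and the foreign keys between the tables:

-- illustrative only; align the column lists with your CSV headers
CREATE TABLE users (
  id INTEGER PRIMARY KEY,
  name TEXT,
  age INTEGER
);

CREATE TABLE products (
  id INTEGER PRIMARY KEY,
  name TEXT,
  price NUMERIC
);

CREATE TABLE orders (
  id INTEGER PRIMARY KEY,
  user_id INTEGER REFERENCES users (id),
  created_at TIMESTAMP
);

CREATE TABLE order_items (
  id INTEGER PRIMARY KEY,
  order_id INTEGER REFERENCES orders (id),
  product_id INTEGER REFERENCES products (id),
  quantity INTEGER
);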

After the tables are created, I can load the data into the PostgreSQL database using the PgAdmin tool:

Now, I need to add a CSV file. I will switch to Import in the Import/Export dialogue and click the three dots beside the Filename field:

It will open a new dialogue where I need to click the upload icon:

Then I will drag and drop my CSV file to the dialogue. The result should be as follows:

I will go back to the Import/Export dialogue and choose my uploaded file. Notice that I have to switch the Header option to Yes and select "," as the delimiter. If these settings are overlooked, the import will fail. The result should look like this:

When I click the OK button, the data should be inserted into the users table. I will repeat this for every table in the following order: users, products, orders, and order_items (the order matters because of foreign key constraints). Now, everything should be prepared for analysis! I can check whether the data was inserted using the following simple query in the query tool:

select * from users;

The output should be as follows:
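As a side note, if you prefer the command line over the PgAdmin dialogue, the same import can be done by piping each CSV into psql inside the container. The file name below is my assumption; repeat the command for products, orders, and order_items in that order:

# stream the CSV into the postgres service and load it with COPY
cat users.csv | docker-compose exec -T postgres \
  psql -U admin -d iot-online-shop-production \
  -c "COPY users FROM STDIN WITH (FORMAT csv, HEADER true)"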

Connect GoodData.CN to PostgreSQL

First of all, I need to connect the GoodData.CN platform to the PostgreSQL database, and I will do it using an authenticated API call. To make an authenticated API call, I need a bearer token (bearer can be understood as "give access to the bearer of this token"). It is the same token that is printed to the command line after the gooddata-cn-ce service in docker-compose starts. Then I can use the following command:
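A sketch of that API call could look like this. The exact endpoint path and payload shape depend on your GoodData.CN version, so double-check them against the Data Sources documentation; $GDC_API_TOKEN stands for the bearer token printed to the command line:

# register the PostgreSQL database as a GoodData.CN data source
curl -X POST http://localhost:3000/api/entities/dataSources \
  -H "Content-Type: application/vnd.gooddata.api+json" \
  -H "Authorization: Bearer $GDC_API_TOKEN" \
  -d '{
    "data": {
      "id": "iot-online-shop-production",
      "type": "dataSource",
      "attributes": {
        "name": "iot-online-shop-production",
        "type": "POSTGRESQL",
        "url": "jdbc:postgresql://postgres:5432/iot-online-shop-production",
        "schema": "public",
        "username": "admin",
        "password": "admin"
      }
    }
  }'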

It is also necessary to build a URL that represents the connection to the database. The URL has the following format: jdbc:postgresql://host:port/database. As the host, I am using postgres since it is the name of the service in docker-compose.yml, the port is the standard PostgreSQL port 5432, and the database name is iot-online-shop-production. The username and password are the same ones I used when I created the server in the PgAdmin tool. You can read more about GoodData.CN data sources here: Data Sources.

Now is the time to create a brand new workspace in GoodData.CN. A workspace is an environment where you organize and create your analytics; it contains data sources, the logical data model, dashboards, insights, and measures. You can read more about it here: Workspace.

After I log in to GoodData.CN, I will see the following screen:

As I mentioned, it is necessary to create a workspace. If I click the Create Workspace button, it will open a dialogue where I fill in the name of my workspace. For example, iot-online-shop-analysis.
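If you prefer to stay in the terminal, the workspace can also be created through the API. The sketch below assumes the same entities-style endpoint as the data source call; check the Workspace documentation for the exact path in your version:

# create the workspace via the API instead of the UI
curl -X POST http://localhost:3000/api/entities/workspaces \
  -H "Content-Type: application/vnd.gooddata.api+json" \
  -H "Authorization: Bearer $GDC_API_TOKEN" \
  -d '{"data": {"id": "iot-online-shop-analysis", "type": "workspace", "attributes": {"name": "iot-online-shop-analysis"}}}'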

When the workspace is created, I can click the Connect Data button. It will open a screen where I can create a Logical Data Model (LDM). After clicking the Create model button, another screen is displayed where I can see my data source (I connected that data source from the command line):

I want to scan my PostgreSQL database to create the Logical Data Model. I will click the Scan button, which opens a dialogue where I can configure the scan process. It scans the physical content of the database and, on top of that, creates the Logical Data Model (only if Generate datasets is checked):

The scan process creates this model:

You can see that the model is similar to the model of the PostgreSQL database. Each dataset is mapped to a PostgreSQL table, and the attributes/facts inside a dataset are mapped to its columns. Primary keys of datasets and relationships between the datasets are derived from the database referential integrity. There are also virtual date datasets (blue), each representing a particular date/timestamp column; every date dataset contains date granularities (day, month, etc.) as pre-generated virtual attributes. You can read more about the Logical Data Model here: Logical Data Model. If you want to dig deeper, you can also check the GoodData University Logical Data Model course.

For clarity, I will rename all the Ids in the datasets to more descriptive names. It is simple: I double-click the row and change the name. This renaming pays off during the data analysis because it is immediately clear which Id belongs to which dataset. The result is as follows:

The data seems ready for analysis! You may be asking why the Logical Data Model is so important. While some analytics tools do not require a Logical Data Model to be defined, that usually means every single data visualization requires you to write a separate query, for example, in SQL or a similar language.

With a predefined Logical Data Model, you define the mapping and relations only once. Then you can reuse the same objects multiple times for many different data visualizations. The model will ensure that you will not combine things that should not be combined, which is very useful for self-service analytics. If the data model is well-designed, you or other users (without analytical/database skills) do not need to think about connecting various data points every single time during data visualization creation. All the essential relations are already captured and embedded into the Logical Data Model by its creator.

Simple Data Visualization

Just for a demonstration, I will create a very simple data visualization. I want to know the average age of IoT online shop users. I will open the Analyze tab on http://localhost:3000 where GoodData.CN is running. There I will do a simple analysis:

You can see that I selected the Headline visualization and put the Age fact into the MEASURE (PRIMARY) panel. As the aggregation function, I set Average, and the result is the average age of my customers. 🎉
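If you later want to reuse this measure elsewhere, the same calculation can also be saved as a MAQL metric. Assuming the scan generated a fact with the id age, the definition might look like this:

SELECT AVG({fact/age})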

Summary

In the first article, I created the docker-compose.yml file to start the PostgreSQL database, PgAdmin, and GoodData.CN. Then I used generated data and inserted it into the PostgreSQL database using PgAdmin.

These steps were necessary to execute my data analysis. After I prepared the database, I connected it to GoodData.CN and created a workspace and the Logical Data Model. To show data analysis ability, I created a very simple data visualization.

The following article describes more advanced data visualizations and serves as a tutorial on building a React.js application to display them.
