Starting a Data Model with Repods

Record Evolution
11 min read · May 8, 2018


Author: Rafael Queiroz, FullStack Developer and IT-Infrastructure Management @ Record Evolution GmbH <rafael.queiroz@record-evolution.de>

Image by Hao Wang

Repods is a cloud data platform to create and manage data pods. Data pods are compact data warehouses with flexible storage, vCores, and memory as well as all the required tooling. You can manage personal data projects, work as part of a private team, or collaborate on open data in public data pods. This article shows how to create a data model with the help of the platform.


Before We Start

Before creating a data pod, you need to be aware of the scope of the information you have and what you need for your analysis. The goal is to create a data model that closely reflects the business entities of the specific subject area, without focusing on how reports will be created or how the data model will be filled with the given data. You can start by answering the following questions: “Which business entities are relevant to my area of interest? How are these business entities defined?”

Let us consider a hypothetical factory with machines that produce soft drink bottles. The available data refers to the machines and what they produce. So we would consider the entities Machines and Products. To simplify further, we can create the entity Production, which represents the daily output capacity of our factory and relates Machines and Products.
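As a rough sketch, this entity model could be expressed in SQL DDL as follows. The table and column names here are purely illustrative, not the tables Repods will generate for us:

```sql
-- Illustrative sketch of the entity model (names are hypothetical,
-- not the tables Repods creates for you).
CREATE TABLE Machines (
  machine_id   INTEGER PRIMARY KEY,
  machine_name TEXT
);

CREATE TABLE Products (
  product_id   INTEGER PRIMARY KEY,
  product_name TEXT               -- e.g. 'bottles', 'labels', 'caps'
);

-- Production relates Machines and Products: the daily output.
CREATE TABLE Production (
  machine_id INTEGER REFERENCES Machines (machine_id),
  product_id INTEGER REFERENCES Products (product_id),
  event_date DATE,                -- the production day
  amount     INTEGER              -- units produced on that day
);
```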

We have three different data sources providing information about the production of Bottles, Labels, and Bottle Caps respectively (each from a single file). Looking at the data, we can see that all the information is in place to implement our entities.

The following steps will be:

  1. Model the entities creating tables for each of them;
  2. Import the data from the source (raw files);
  3. Create data pipes to fill out the created model;
  4. Extract some reports from our model.

Let’s keep this in mind as we begin with our tutorial.

Getting Started

To initiate the process, create an account on http://repods.io or log in if you already have one. You can start by creating, copying, or modifying an existing data pod, as well as by searching for existing public pods or by getting invited to a private or public pod. You can also take a look at a live version of the pod that we use for this tutorial at https://repods.io/markode/SodaBottlesFactory.

Creating a New Data Pod

Let us create a new, empty pod for this tutorial. Click on New Pod in the upper left corner of your account:

Figure 1. Adding a new pod

Now we can specify the pod name, add a short description, and select a product plan. Repods offers a range of pricing options, from flat pricing to usage-based pricing (from 0.15 to 48 Cores). To experiment, you can also make use of the free pod option. In addition, you have the option to add a demo data set as a starting point for your practice.

Figure 2. Create a data pod

Starting the Pod

After you have created the data pod, you will see that your new pod has appeared on the list of pods in your account. If the pod has an orange square right next to the pod name, this indicates that the pod is currently stopped. To activate and enter the pod, you need to press the button Start:

Figure 3. Starting your pod

The first time you start the pod, the Repods platform will generate some isolated cloud resources for you, including your database, memory, storage, etc. In this way, a process that would normally take four weeks can be completed in less than one minute. This takes place only the first time you start the data pod. Later on, the process may even get faster.

Click the button Enter to open your pod. This is what your data warehouse looks like:

Figure 4. First view

Importing Data (Alt+1)

Since we have not imported any data yet, we can start importing files by clicking on File Import and then on Upload New File. You can also drag and drop directly into this panel. You can find the demo data we use in this tutorial here.

Figure 5. Checking files

Once the upload is done, you can click on one of the files from the list. This will trigger a pop-up window displaying the file contents. The first tab of this pop-up window shows the data as it is interpreted by the Repods internal system. Although the system is very good at automatically detecting the data structure, in some cases you might still be missing details. In our case, you can look at the second and third tab of the pop-up window to see the beginning and end of the file in raw format.

Figure 6. Import file details

Clicking on Next will take you to the next panel. Here you can adjust the file structure, clean some parts of the data, and apply some type conversions or transformations. Also, the featured import menu will help you inspect your data. We recommend that you read the help section on this panel (located in the top right corner).

Now, let’s have a look at the Package Time Scope sub-panel. Here we can tell Repods about the time scope to which our file (=the data package) is related. In our example, we have production data from mid-September to mid-November 2017. We can derive this time scope from the Date column in the imported file.

We can set Repods to take the beginning of the package from the minimum date entry in the file and the end of the package from the maximum date entry:
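Conceptually, this setting corresponds to a MIN/MAX aggregation over the date column. The sketch below uses the auto-generated source table name from our import; Repods derives this scope for you, so there is no need to run such a query yourself:

```sql
-- Conceptual equivalent of the "min/max date entry" setting:
-- the package time scope spans the earliest to the latest Date.
SELECT
  MIN("Date") AS package_start,
  MAX("Date") AS package_end
FROM
  SC1039_S_BOTTLES;
```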

Figure 7. Time scope

Repods can recognize some date types and pre-select them for us. However, we recommend that you revise and select the appropriate ones.

If everything looks good, we can click on Load and see the imported data with its time scope. We can then repeat this process for the remaining files and end up with something like this:

Figure 8. Imported files

Repods now creates a Raw Table that holds the imported file data. Its name is prefixed with “S_” to indicate a source table. Figure 8 shows the imported file in the raw table as a grey stripe spanning from September 11, 2017 to November 15, 2017 on the time axis displayed above.

Repods lets you know if an error occurs during any stage of the process. We can always click on the raw table to make adjustments.

By the way: You can pan and zoom the time axis with your mouse by dragging and scrolling while hovering over the axis screen area. Just give it a try.

Creating a Model (Alt+3)

As discussed in the section Before We Start, we now want to create the entities for our data model consisting of Machines and Products. For the sake of simplicity, we have created the artificial entity Production to express the relation between Machines and Products.

First, we need to create a new Core Table. In the Repods terminology, these tables express our entities and are the target for our pipes. Let’s create one such table with the name Production. To do so, go to the Core Data panel (Alt+3) and then click on Add Table.

Figure 9. Add table

When we create a Core Table, we need to specify the table type right at the beginning. If we want to track content over time, we create a State Table. If the data describes specific points in time, such as transactions, observations, or log events, we create an Event Table. In our case, each record concerns a single day, so we will create an Event Table.

The Columns are set up as shown in Figure 10:

Figure 10. Creating a data model

You might have noticed that we did not include the Manufacture Date here: Event Tables come with a default EVENT_DATE field, as we will see in the next step. Click the Save button to create the new table. The table name gets the prefix “T_”, indicating that this is a target table.
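Conceptually, the resulting Core Table has a structure like the sketch below. Repods generates this table itself; the DDL is shown only to visualize the implicit EVENT_DATE column alongside the columns from Figure 10:

```sql
-- Illustrative structure of the Event Table T_PRODUCTION.
-- Repods creates this table for you; shown here only to
-- visualize the implicit EVENT_DATE column.
CREATE TABLE T_PRODUCTION (
  EVENT_DATE DATE,     -- added automatically for Event Tables
  "Product"  TEXT,     -- e.g. 'bottles', 'labels', 'caps'
  "Amount"   INTEGER   -- daily produced amount
);
```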

Creating Data Pipes (Alt+2)

With our Core Table created, let’s move on to the Pipes panel (Alt+2). To create a data pipe, click on Add Pipe:

Figure 11. Add pipe

Now you can fill out the information in the pop-up window. The Pipe Name indicates the data to be transformed. Then, we select the previously created table (Choose a Target Table) and the source to be loaded (Choose a Source). Once this is done, you can press Create:

Figure 12. Creating data pipes

After creating the data pipe, a pop-up window containing the draft of the pipe transformation will be displayed; some suggested SQL will be included too:

Figure 13. Pipe transformation

You can use the Pipe Transformation to apply any major transformation from the source data to the Core Table. This is done by giving the columns in the SQL SELECT statement the same names as in the Core Table. An Event Core Table contains the standard field EVENT_DATE, which puts the data into a temporal context. To do this, we fill this column with the Date column from the source (similar to what we did during the file import).

To identify our Product, we simply use the string ‘bottles’ as the name of this Product (likewise ‘labels’ for the Labels pipe and ‘caps’ for the Caps pipe). In our original data, the column holding the produced amount is called Bottles, so we must rename it to Amount (the same applies to Labels and Caps):

SELECT
  "Date"    AS EVENT_DATE,
  'bottles' AS "Product", -- string in single quotes, e.g. 'caps'
  "Bottles" AS "Amount"   -- column identifiers in double quotes
FROM
  SC1039_S_BOTTLES r1     -- keep the auto-generated name from Repods

Code Snippet 1. SQL Sample transformation

We can check if the transformation looks good by clicking on the green Play button located above the code panel. Now we click on Save to keep this transformation and click on the top rightmost icon (Free Flow). In case of errors, we can force the data to flow with the Force Pipe Flow button.

When the pipe starts to flow, the data is collected from the source according to the specified SQL statement and gracefully merged into the existing data stock in the target Core Table. The execution of the merge is represented by the little green square on top of the Core Table icon. The merge ensures that all new data is entered without creating duplicates.
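To illustrate the idea of a duplicate-free merge, here is a sketch of the semantics in plain SQL. This is only an illustration of the behavior, not how Repods actually implements the merge internally:

```sql
-- Illustrative semantics of a duplicate-free merge: insert only
-- those incoming rows that are not already in the target table.
-- A sketch of the idea, not the actual Repods implementation.
INSERT INTO T_PRODUCTION (EVENT_DATE, "Product", "Amount")
SELECT "Date", 'bottles', "Bottles"
FROM SC1039_S_BOTTLES r1
WHERE NOT EXISTS (
  SELECT 1
  FROM T_PRODUCTION t
  WHERE t.EVENT_DATE = r1."Date"
    AND t."Product"  = 'bottles'
);
```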

We can see the pipe data by clicking on the table T_PRODUCTION:

Figure 14. Pipe panel

If we set the pipes to Free Flow mode (indicated by the green color), every time new data arrives, the pipe transformation will be applied automatically.

Creating Reports (Alt+4)

Now we can start creating reports based on our data. Also, we can create Workbooks with more powerful transformations in this panel.

Reports

Let’s start with a simple report, showing the number of Labels, Caps, and Bottles produced from mid-October to mid-November 2017. First, we click on Create Report. We will see something like this:

Figure 15. Reports overview

We can expand the Production table to see its contents and select an operation. We can also adjust the time range and the end time to the desired dates. For our example, we first select the entities by clicking on the field Product and then on Detail (Group by):

Figure 16. Report detail

Now we can sum up the Amount by clicking on Amount and then on Sum. An eye symbol indicates whether the chart is displayed. For a Static report, we first have to press the Execute button and then click on the eye:
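In SQL terms, this report corresponds roughly to the following aggregation. This is only a sketch of what the GUI does for us, and the date boundaries are illustrative stand-ins for “mid-October to mid-November”:

```sql
-- Rough SQL equivalent of the report: total Amount per Product
-- over the selected time range (dates here are illustrative).
SELECT
  "Product",
  SUM("Amount") AS total_amount
FROM T_PRODUCTION
WHERE EVENT_DATE BETWEEN DATE '2017-10-15' AND DATE '2017-11-15'
GROUP BY "Product";
```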

Figure 17. Chart visualization
Figure 18. Result visualization

And that’s it! Repods offers many more features so make sure you read the help section and feel free to play with the various tools.

Workbooks

You can find the Workbooks section in the Analysis panel. Workbooks offer powerful tooling but require more SQL knowledge. For a quick glimpse into this section, let’s write a simple SQL query in the workbook card.

First, let’s click on Dashboard to return to the initial view. Now click on Create Workbook:

Figure 19. Workbook in action

We can insert the following query to inspect some anomalies, i.e. days with unusually low or high production:

SELECT event_date, "Product", "Amount"
FROM T_PRODUCTION
WHERE "Amount" < 200 OR "Amount" > 800
ORDER BY event_date

Code Snippet 2. SQL Select sample for Workbooks.

The Dashboard panel provides an overview of all Workbooks and Reports.

Figure 20. Overview

In summary, the Repods platform is a powerful tool that gives you a lot of freedom to perform all sorts of data analysis: a development environment for ETL processes directly in the browser, SQL code for the logic and one GUI for everything else, automatic layout of ETL dependencies, real-time display of ongoing processes in the process layout, and many other features.

Feel free to offer your feedback and explore this tool in more detail! See more on https://repods.io. You can also check a live version of this pod on https://repods.io/markode/SodaBottlesFactory.
