Efficient Data Engineering with Smart Data Lake Builder

Mandes Schönherr
ELCA IT
Sep 27, 2023

As data engineers, we’ve all embarked on an endless journey of finding tools that let us accomplish our tasks more efficiently. Whether it means writing less code, worrying less about errors or just organizing our solutions in a better way, increasing our working efficiency allows us to focus on the essentials within our projects to achieve better results. In this article, we discuss some of the main features of Smart Data Lake Builder (SDLB), an open-source, Scala-based data orchestration framework. SDLB is the perfect tool to implement modern Lakehouse and Data Mesh architectures, and many of the following features prove to be ideal for increasing efficiency.

Smart Data Lake Builder, like a robot in a digital lakehouse

The open-source, Scala-based, data orchestration framework called Smart Data Lake Builder (SDLB) was initially developed by the Swiss Federal Railways (SBB CFF FFS) and ELCA Informatique to serve the following needs:

  • A declarative framework that includes metadata for data lineage graphs and data catalogs. Describing data objects and their relationships declaratively enables better automation when creating pipelines
  • Early error detection and validation
  • Connectivity: out-of-the-box connectors for most of the current technologies
  • High customizability: custom transformers can be defined in-line, and complex transformations, data objects and other features can be implemented in code
  • Reusability: out-of-the-box data transformation capabilities. In particular, we highlight the historization, deduplication, incremental loading and partitioning of data
  • Human-readable (HOCON), descriptive and text-based configuration, thus well suited for version control
  • Vendor-neutrality and high portability: already deployed on various cloud and on-prem platforms
  • The declarative nature of the SDLB allows for an efficient integration into CI / CD pipelines
  • No installation needed: the SDLB is a Scala application that can be easily deployed on any machine that has a JRE
  • Modern ecosystem: the SDLB uses modern, future-oriented open-source technologies such as Apache Spark as its ETL processing engine in order to guarantee long-lasting project lifecycles

Now that the framework has matured into a robust and modern solution, diverse companies make use of SDLB within various platforms and use cases.

These aspects are discussed in more detail throughout the article, with the focus on increasing data engineering efficiency.

Core concepts

Smart Data Lake Builder (SDLB) is developed for efficiently maintaining large and complex data pipelines. It allows users to configure their pipelines in a declarative manner. This means that users declare data sources and sinks (including their types), as well as possible transformations that happen in between. In other tools, sources and sinks need to be defined separately; one of the advantages of SDLB is that the same structure is used to define both. Users mostly define WHAT the pipeline looks like, but not necessarily HOW it is built (we write “mostly” as the framework also offers the possibility of creating custom code). In general terms, the SDLB uses the following building blocks to describe data pipelines:

  • Data Objects: They represent the data sources and sinks. SDLB comes with a series of predefined data objects out-of-the-box and it offers a template for building custom ones. For example, the CsvFileDataObject is used to read from and write to CSV files.
  • Actions: They represent a link between two or more data objects, defining the inputs, outputs, type of action, etc.
    - Transformers: Actions can contain one or more transformers, which are used to manipulate data. Transformers can be written as SQL or Scala code as well as Python scripts.
  • Connections: They represent the entities needed to access (some) external resources such as databases.
Main building blocks in SDLB: an action takes two source data objects and creates a resulting data object; one of the source data objects uses a connection to an external resource

Example

To illustrate these concepts with an example, we consider a user who wants to read two data sources, a CSV file “A” and a table “B” from a PostgreSQL relational database hosted locally. The user wants to join both data sources on a column “COL” and save the results into a Parquet file “C”. The logical plan for this pipeline would look as follows:

Example #1: data objects of different types and an action containing a SQL transformer

Note that each data object as well as the action have a specific type depending on the processing step required. For this use case, only one transformer is needed. To implement this pipeline, you only need to create the following configuration, written in HOCON (Human-Optimized Config Object Notation, a JSON superset):

connections {
  conn = {
    type = JdbcTableConnection
    url = "jdbc:postgresql://localhost:8080"
    driver = org.postgresql.Driver
  }
}

dataObjects {
  A = {
    type = CsvFileDataObject
    path = "path/to/file.csv"
  }

  B = {
    type = JdbcTableDataObject
    table = {
      db = schemaName
      name = B
    }
    connectionId = conn
  }

  C = {
    type = ParquetFileDataObject
    path = "path/to/file.parquet"
  }
}

actions {
  myAction = {
    type = CustomDataFrameAction
    inputIds = [A, B]
    outputIds = [C]
    transformers = [{
      type = SQLDfsTransformer
      code = { C = "SELECT * FROM A, B WHERE A.COL = B.COL" }
    }]
  }
}

Note how the configuration file even contains the connection to the relational database, so that absolutely no code for establishing a connection is needed. Running the SDLB .jar with this configuration file will execute the data pipeline for the user. This example already shows how efficiently these pipelines can be built. To gain further insights, let’s dive deeper into some of the features offered by the SDLB.

Key Benefits

Declarative dependency definition

As stated before, the SDLB enforces a declarative pipeline definition. More precisely, it is not the pipelines themselves that are declared, but rather the dependencies between data objects, i.e. how data objects are related to each other. This way, users can focus on the semantics of their data as well as the combinations and transformations that are required to get to the desired state instead of spending time programming (and debugging) each connection and data load. The compactness of the configuration files allows the users to have a good overview of their declared dependencies.

Re-use of resources and collaborative team work

Whether it is by means of reusing existing code or by adopting current standards, data engineers are always looking for ways to increase their efficiency by making use of existing resources. The metadata-driven implementation of SDLB strongly supports the re-use of written configurations as the data objects are declared only once and can be used several times in different data flows. The same applies to custom transformers (see “Predefined objects and customizability” below).

The structures in SDLB are very flexible: not only can building blocks (data objects, actions, etc.) be separated into different files and directories, they can also be reused in various locations. Thus, team members can, for example, write their own configuration files for the creation of separate pipelines while re-using common configurations. This is possible since SDLB merges all selected configuration files into one stream.

Two HOCON configuration files are merged and interpreted as one

Additionally, we can select sub-pipelines via the so-called “feed” attribute, which can be specified on actions (see the configuration sketch after the figure below). In the example below, running the SDLB with the option --feed secondPipeline would only compute myAction2, myAction3 and myAction5, resulting in the green data flow.

Usage of feeds to declare two pipelines: each action has a “feed” property that can be used to select a subset of the pipeline DAG
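As a minimal sketch of this mechanism (the action names follow the figure, while the data object names and action types are made up for illustration; depending on the SDLB version, the feed is set in the action’s metadata block), the feed assignment could look like this:

actions {
  myAction2 = {
    type = CopyAction
    inputId = B
    outputId = C
    metadata { feed = secondPipeline }
  }
  myAction3 = {
    type = CopyAction
    inputId = C
    outputId = D
    metadata { feed = secondPipeline }
  }
}

Running the SDLB with --feed secondPipeline then selects only the actions tagged with this feed.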

The HOCON format also allows easy replication and automation of the configuration files; objects can even be generated by scripts, e.g. for creating hundreds or even thousands of similar data objects, actions and other elements. This can be particularly useful when ingesting a large number of database tables into a Lakehouse. SDLB automation can significantly reduce development time in this case, compared to other tools such as SSIS.

A further advantage is code management. HOCON files as well as the underlying SDLB Scala code are well suited for version control systems like git and CI / CD integrations. This enables parallel development, reproducibility and portability. There is no vendor lock-in: SDLB can be deployed on any environment with a JRE and pipeline configuration files can simply be copied.

Out-of-the-box features and customizability

The SDLB provides several predefined data objects, actions (transformers) and connection types. To list just some (see also the SDLB schema viewer):

  • data objects:
    - open source storage layer: Delta Lake, Iceberg
    - file based including: Parquet, CSV, Json, XML, Excel and Access
    - JdbcTableDataObject for connecting to relational databases
    - KafkaTopicDataObject for loading and storing data published within an Apache Kafka cluster
    - AirbyteDataObject for utilizing one of the large list of Airbyte connectors (see an example implementation)
  • actions:
    - historization
    - deduplication
  • connections:
    - DeltaLake
    - Hadoop Filesystem
    - JDBC, Kafka, Splunk, Sftp
    - Snowflake

These and many other predefined building blocks allow for an easy integration of data within a large and otherwise complex ecosystem.
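For example, a Delta Lake table in the Lakehouse is declared just like any other data object. The following minimal sketch uses the DeltaLakeTableDataObject type; the schema, table and path names are made up for illustration:

deltaEmployees = {
  type = DeltaLakeTableDataObject
  path = "path/to/lakehouse/employees"
  table = {
    db = lakehouse
    name = employees
  }
}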

By design, besides the out-of-the-box objects (see data objects above), the framework provides the flexibility to implement custom data objects, actions, transformers, connections etc. In general, such flexibility is often difficult to balance with efficiency, as more flexibility often means higher generalization and therefore more effort to implement concrete use cases. SDLB’s solution to this dilemma is, on one hand, prepared structures for custom implementations, e.g. a SQL transformer where SQL code can be written directly in the HOCON description. On the other hand, users can create custom objects (data objects, actions, connections), e.g. for a custom webservice data source, within their own implementation, without touching the SDLB core. To do so, prepared generic Scala classes are extended with the specific requirements.

We depict the first idea with an example of data transformers:

As seen in the example above, users can define simple SQL (or Scala) transformations directly in their configuration files. For more complex transformations, it can be useful to use more structured transformers instead of a single large SQL query. The SDLB offers the possibility to write the transformations using Scala or Python code. In order to do so, users just need to reference their custom transformer class. In the configuration below, the action myAction loads two data objects as sources, which are first processed with a SQL transformer. The interim result is fed to the next transformer, which consists of more complex Scala/Spark code in a separate file and is called MyCustomTransformer (a sketch of such a class is shown at the end of this section).

myAction = {
  type = CustomDataFrameAction
  inputIds = [A, B]
  outputIds = [C]
  transformers = [
    {
      type = SQLDfsTransformer
      code = { interimResult = "SELECT * FROM A, B WHERE A.COL = B.COL" }
    },
    {
      type = ScalaClassSparkDfsTransformer
      className = MyCustomTransformer
    }
  ]
}

The corresponding data flow looks as follows:

SDLB can combine SQL-based as well as Scala- or Python-based transformers in one single action

Combining different transformer types can be a powerful tool to increase development efficiency. Users can easily choose various transformation types depending on the complexity of the task. It should be mentioned that the interim results between transformers are not persisted; Spark plan optimizations are applied across the different transformers of an action.
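For completeness, the referenced MyCustomTransformer is a plain Scala class. The following is a minimal sketch, assuming SDLB’s CustomDfsTransformer interface; the exact package name depends on the SDLB version, and the business logic (a filter on a hypothetical “active” column plus a load timestamp) is purely illustrative:

// the exact package may differ between SDLB versions
import io.smartdatalake.workflow.action.spark.customlogic.CustomDfsTransformer
import org.apache.spark.sql.functions.{col, current_timestamp}
import org.apache.spark.sql.{DataFrame, SparkSession}

class MyCustomTransformer extends CustomDfsTransformer {
  // dfs maps input names (here the interim result of the preceding SQL transformer) to DataFrames;
  // the returned map assigns a DataFrame to each output data object of the action.
  override def transform(session: SparkSession,
                         options: Map[String, String],
                         dfs: Map[String, DataFrame]): Map[String, DataFrame] = {
    val interim = dfs("interimResult")
    // hypothetical business logic: keep only active records and add a load timestamp
    val result = interim
      .filter(col("active") === true)
      .withColumn("load_ts", current_timestamp())
    Map("C" -> result)
  }
}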

Feature Highlights

SDLB provides various features including a long list of connectors, predefined transformers, checkpoint/restart, streaming, etc. In the following, we look at some handy features that are often needed and are enabled very easily.

Incremental mode

Data typically should not be loaded from scratch every time; only new data points should be processed. SDLB provides incremental modes, which either store the state of the last transaction or compare the max value of a specified column. This way, the amount of data transferred is limited and thus runtime and compute costs are reduced effectively.

Consider an HR database containing a table with salary statements, where only new invoices should be transferred to the Lakehouse. The column invoice_date is used to compare, filter and transfer only new values.

The SDLB incremental mode in action: an initial load copies the full table to the target; afterwards, only newly added rows are processed and added to the target table

In SDLB we only need to add the following lines to the related action:

executionMode = {
  type = DataFrameIncrementalMode
  compareCol = invoice_date
}

Historization

In order to keep track of data changes, as well as the validity of each record, SDLB provides a historization action that effectively assigns and handles a “valid_from” and a “valid_to” value for each data record (slowly changing dimensions type 2).

Let’s take for example an HR database from a software company. Anna gets a promotion on January 1st. The SDLB job is run on that day and notices that there is new data (Anna’s changed record). Since it runs in incremental mode, the job only loads the new record. The action that loads the new data is in historization mode, which is why the old data is not deleted: only its validity values are changed. The columns “VALID_FROM” and “VALID_TO” are created automatically by the SDLB when using historizing actions.

Historization can be used in combination with incremental loads: the old record is invalidated by setting its valid_to column, and a new record with the new values and a new valid_from value is created
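A minimal sketch of such an action could look like the following; the data object names and the compare column are hypothetical, and the combination with DataFrameIncrementalMode mirrors the incremental load shown above:

historizeEmployees = {
  type = HistorizeAction
  inputId = stg_employees
  outputId = int_employees
  executionMode = {
    type = DataFrameIncrementalMode
    compareCol = last_updated
  }
}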

Both incremental loads and data historization are simple configurations within the SDLB. This represents a big advantage over other orchestration tools, where typically some custom development is necessary, and where incremental loads and clean historization become a more complex, error-prone and cumbersome task.

Partitioning

Partition-wise processing is one of the most efficient patterns for incrementally loading high-volume data. Specifying a partition is typically simple, but SDLB also verifies that partitions are handled well through the entire pipeline: it checks for missing partitions, passes partition metadata through the pipeline and makes sure that tables are not accidentally overwritten. The following example partitions the “EMPLOYEES” table according to the job title:

Example of data partitions: a large table is split by JOB_TITLE into multiple smaller partitions, each containing a single job title

Adding the following configuration to the target dataObject achieves the desired goal:

partitions = [JOB_TITLE]

Early validation and error detection

One way of reducing effort in the early development stages of any data project is being able to detect errors before carrying out the ETL jobs. An SDLB run consists of four phases, each of them designed to detect errors early on. In other words, the SDLB tries to fail as early as possible if something is wrong. The phases of each execution are:

  1. Parse configuration: Parses and validates the configuration files. This phase detects formatting errors or missing attributes. It also warns about possible typos and suggests spelling corrections where it can.
  2. DAG prepare: Among other checks, this phase tests connections and verifies that required data objects exist.
  3. DAG init: Since users only define the dependencies between data objects, it is not necessary to define the order in which the actions are performed. This phase creates and validates the whole lineage as a directed acyclic graph (DAG) for the defined pipelines. It is especially powerful as it tests some aspects of data consistency and transformer functionality without actually moving any data. For example, with a column name typo the HOCON may still be valid, but the pipeline would fail; the init phase checks this consistency without processing any data, so we see the error within seconds instead of possibly hours.
  4. DAG exec: Execution of actions. The data is effectively transferred in this phase (and only in this phase!).

All four phases create meaningful logs to better monitor the pipeline execution and easily find possible errors.

Data quality

Being a paramount topic in data governance, data quality must be continuously monitored and improved. Both tasks can imply a huge effort for data engineers, which is why building data quality methods into data pipelines is an important measure. The SDLB provides the following features to improve data quality and therefore reduce the required effort:

  1. Metrics: Every SDLB job collects runtime metrics for each action and (output) data object. The metrics are logged in a separate state file, which can be monitored in real time.
  2. Constraints: Constraints can be defined on data objects to validate data on row level. If one constraint validation fails, an exception is thrown and the action stops (no further data is written to the output data object). For example, one could define that the price of a product in Japanese yen should always be equal to or greater than the price in US dollars (see example below).
  3. Expectations: Expectations are conditions that can be defined on entire data objects to monitor and validate a dataset after it has been written to the data object (a sketch follows after the constraints example below).
productsDataObject {
  ...
  constraints = [{
    name = PriceInYenVSDollar
    description = "If PriceYen is smaller than PriceDollar, then there is an error in the data"
    expression = "priceInYen >= priceInDollar"
    errorMsgCols = [id, priceInYen, priceInDollar]
  }]
}
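Expectations are declared on the data object in a similar way. The following hypothetical sketch (the expectation type and attribute names are assumptions based on the SDLB documentation and may differ in your version) checks that the table is never empty after a load:

productsDataObject {
  ...
  expectations = [{
    type = SQLExpectation
    name = ProductCount
    description = "the products table should never be empty after a load"
    aggExpression = "count(*)"
    expectation = "> 0"
  }]
}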

Vendor-neutrality and high portability

Since Smart Data Lake Builder is a very portable Java (Scala) application, it is well suited for local development as well as remote deployment on any cloud or on-prem system with a JRE. Deploying the built application as a .jar file allows for integration into platforms such as Databricks (see this blog post), without the need to set up a Spark cluster, keeping the focus on the dependencies between data objects. There are already examples of deployments on Azure, AWS and Google Cloud. Furthermore, SDLB can be used to build pipelines across different cloud or on-prem systems, and it is able to manage other SDLB instances on remote systems. For example, one part of the pipeline (collecting and encrypting data, transferring it through firewalls, etc.) may run on-prem, controlled by another SDLB instance in the cloud, which then transforms the data and stores it in the Lakehouse. The whole pipeline can thus be managed with a single code base / repository.

Execution Engines

SDLB is built in a very modular way, also with regard to its required libraries. For pure file transfer, only Java is necessary. When handling DataFrames and performing transformations, Scala / Spark is used as the engine. Other engines can be implemented as well: recently, a Snowflake engine was introduced, supporting pipelines into and out of Snowflake as well as transformations within this optimized environment.
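As a sketch, a Snowflake table is then declared like any other data object via the Snowflake module; the type and attribute names below are assumptions based on the SDLB Snowflake connector and may differ in detail:

employeesSnowflake = {
  type = SnowflakeTableDataObject
  connectionId = snowflakeCon   # a SnowflakeConnection defined in the connections section
  table = {
    db = PUBLIC
    name = EMPLOYEES
  }
}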

Agents: On-Premise and cloud coordination

Especially in hybrid cloud environments, data typically cannot be pulled from on-prem systems by processes running in the cloud, since only outbound traffic is allowed. SDLB provides a feature called agents, where a main SDLB instance in the cloud controls an on-prem SDLB agent, which gathers and pushes data from on-premise to the cloud.

Visualization and data catalogue

For very large projects, it is common for engineers to lose the overview of data dependencies as well as the meaning of the data they are moving. For this matter, the SDLB also has a browser-based visualizer tool that not only displays the existing configurations and dependencies of data objects and actions, but can also act as a data catalogue for these elements.

SDLB Configuration Viewer: list of data objects and a graphical representation of the pipeline

The data catalogue is based on markdown files to document the semantic meaning of the data objects in question. This way, users can navigate through the pipeline lineage and discover the meaning of the data involved.

SDLB Configuration Viewer, data object description: extensive documentation including tables with column names and their descriptions

Conclusion

In this article, we introduced the Smart Data Lake Builder (SDLB) as well as some of its main features. Compared to other frameworks, we believe that this metadata-driven approach is very well suited for projects that benefit from a configuration-based orchestration. Among others, it can be integrated into the major public clouds and supports different execution engines, although it is primarily optimized for Apache Spark. Being open-source, the framework is easily extendable, and it supports data transformation using SQL, Scala, Java and Python, which can be an advantage for integrating it into large projects with heterogeneous teams.

Relevant links


Mandes Schönherr
ELCA IT

As a scientist, high performance computing specialist and data engineer, I am interested in well automated, efficient workflows using SOTA technologies.