Datafy feature release Q4 2021

Stijn De Haes
Published in datamindedbe
Jan 24, 2022 · 6 min read

Today we want to share with you the features the Datafy team has been working on during the last part of 2021. We also hosted a webinar about these features, which you can view here after registering.

In this post we shall highlight the following features:

  • Notebooks: How we want to support data exploration with Datafy.
  • Streaming: Supporting real time use cases.
  • Improved DBT support: Run your DBT SQL pipelines on Datafy with ease.
  • Continuous improvements: We commit to keeping the software we use up to date and to making the latest versions available to our users.

Notebooks

Our notebooks feature has been built with 2 specific goals in mind:

  • We want to enable data exploration on Datafy. When you start a project or want to add new data to an existing project, you often want to play with the data in an interactive environment.
  • When a deployed project breaks after running stably in production for weeks, you want to be able to debug it easily.

The solution we came up with tries to help with both problems. To start using it, check out the datafy notebook commands in the CLI.

When you run datafy notebook create, a new notebook is launched. The screenshot below shows what that looks like.

An example of the new notebook UI; you can also open it full screen

The following functionality is available in the Datafy Notebooks.

  • We build a Docker container containing your code, which means your source code is accessible while writing your notebook files. This makes debugging or exploring an existing project easier.
  • The Docker container is also fully customisable: more libraries, dependencies, etc. can be added to it if needed.
  • Notebooks use the same AWS IAM roles as a scheduled Airflow job that uses the Datafy operators. This means notebooks can reuse the roles defined for your projects running on Airflow.
  • Our notebooks support running Spark, so you can process bigger data sets.
  • A notebook can run in Web UI mode or connect to your local IDE, so you can use whichever you are comfortable with.
  • When using the Web UI mode you can download the notebook files and check them into git when you are done working. This allows you to share the notebooks, or start where you left off the next time.
  • Notebooks use the same Datafy instances and instance lifecycle (on-demand and spot) as jobs scheduled on Airflow.

You can try out the getting started guides in our docs: we have one for the Web UI and one for using your local IDE. More technical details are available here.
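To give an idea of what data exploration in such a notebook can look like, here is a minimal PySpark sketch. The SparkSession setup and the S3 path are illustrative assumptions and not specific to Datafy.

# Minimal exploration sketch you might run inside a Datafy notebook.
# The bucket path is purely illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Load a sample of the raw data you want to explore.
orders = spark.read.parquet("s3a://example-bucket/raw/orders/")

# Quick sanity checks: schema, row count, and a small aggregation.
orders.printSchema()
print("rows:", orders.count())
orders.groupBy("status").agg(F.count("*").alias("n")).show()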

Streaming

We also launched streaming support in the second half of 2021. Streaming is great when you need to immediately respond to data.

This can be useful when doing anomaly detection on a critical system. An example of such a system might be a pump in a vertical farm. If the pump fails, the plants might get too much or too little water, and even a disruption of a couple of hours might impact the vertical farm. To help with this we could train a model to detect anomalies on the pump, and we can evaluate that model in real time using streaming.

Spark structured streaming UI. This application is running on Datafy

To support these kinds of use cases we added support for Spark streaming. We started with Spark streaming because:

  • We already had Spark batch support.
  • Adding Spark streaming support allows our users to mix batch and streaming in one Spark project.

To specify your streaming jobs, add a streaming.yaml file to your project. For example:

streamingApplications:
  - name: producer
    sparkSpec:
      numberOfExecutors: 1
      driverInstanceType: mx.micro
      executorInstanceType: mx.micro
      application: "local:///opt/spark/work-dir/src/pysparkstreaming/producer.py"
      applicationArgs:
        - --env
        - "{{ .Env }}"
      sparkMainVersion: 3
      awsRole: "pyspark_streaming-{{ .Env }}"

The above example shows off multiple features:

  • One or more streaming jobs can be added to your project. You can split up streaming jobs into multiple jobs with different responsibilities.
  • The same settings are available for Spark streaming as for the DatafySparkSubmitOperatorV2.
  • We added templating support to the streaming.yaml file, which allows you to use a different AWS role when deploying to different environments.

If you want to try this out yourself, you can use our Spark getting started or our PySpark getting started guides. The technical documentation can be found here.
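To make the example above more concrete, here is a rough sketch of what a streaming application such as the producer.py referenced in the streaming.yaml could contain. Everything except the --env argument (the rate source, the console sink, the app name) is an illustrative assumption, not part of the Datafy template.

# Hypothetical structured streaming application, sketching the shape of
# the producer.py referenced in the streaming.yaml example.
import argparse
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--env", required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName(f"producer-{args.env}").getOrCreate()

# Read a stream; the built-in rate source is handy for testing.
stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Write the stream out; in a real job this could be Kafka, S3, a database, ...
query = (stream.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()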

Improved DBT support

DBT allows you to build data pipelines using SQL. It makes building your data warehouse much more manageable by using SQL files that can be templated. If you want to test out DBT on Datafy, look at our getting started guide.

DBT has been supported for a while, but we wanted to make it more user friendly. Our previous Datafy DBT template contained two Airflow tasks: one to run your DBT models and one to test them. This let you run DBT jobs, but it was very hard to see which model failed, and it was impossible to rerun just one model from Airflow. This caused issues if you had a lot of data and/or a lot of models.

To solve this, we created the DatafyDBTTaskFactory, which parses your DBT project and builds a nice Airflow graph from it.

Result of the DatafyDBTTaskFactory

As shown in the image, this results in two Airflow tasks for every model: one to create the model and one to test it, with the same dependencies between models as defined in DBT. This way you can easily see which model failed, and you can restart only that model once you have found out what went wrong.
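To illustrate the shape of such a graph, the hand-written Airflow sketch below builds the equivalent for two DBT models by hand. The model names and the exact dependency wiring are illustrative assumptions; the DatafyDBTTaskFactory generates this structure for you from your DBT project, so you never write it yourself.

# Illustration only: the run/test task pairs the factory generates,
# written out by hand for two hypothetical models (stg_orders, orders_mart).
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("dbt_example", start_date=datetime(2021, 10, 1), schedule_interval="@daily") as dag:
    run_stg_orders = BashOperator(task_id="run_stg_orders", bash_command="dbt run --select stg_orders")
    test_stg_orders = BashOperator(task_id="test_stg_orders", bash_command="dbt test --select stg_orders")
    run_orders_mart = BashOperator(task_id="run_orders_mart", bash_command="dbt run --select orders_mart")
    test_orders_mart = BashOperator(task_id="test_orders_mart", bash_command="dbt test --select orders_mart")

    # The downstream mart model only runs after the staging model has been built and tested.
    run_stg_orders >> test_stg_orders >> run_orders_mart >> test_orders_mart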

We also added support for mounting secrets from AWS Secrets Manager and AWS Parameter Store as environment variables. This means you no longer have to write a bash script that first fetches the password for your database and then executes dbt run. This is also available for the DatafyContainerOperatorV2 and the DatafySparkSubmitOperatorV2. For more information on how to use this, look at the docs.
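For context, the manual workaround this replaces looked roughly like the boto3 sketch below; the secret name and environment variable are illustrative assumptions. With the new support, Datafy mounts the secret as an environment variable for you before your task starts.

# What you previously had to do yourself before running dbt:
# fetch the secret and expose it as an environment variable.
import os
import boto3

client = boto3.client("secretsmanager")
secret = client.get_secret_value(SecretId="my-database-password")  # hypothetical secret name
os.environ["DBT_PASSWORD"] = secret["SecretString"]  # hypothetical variable used by the dbt profile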

Continuous improvements

We continuously make big and small improvements to the user experience and regularly release new features. We want to highlight some of them explicitly:

Spark 3.2.0

We released support for Spark 3.2.0; the release notes can be found here. We think the following two features are particularly interesting:

  • Support for the pandas API layer, which replaces the Koalas library introduced by Databricks and includes the functionality natively in Spark (see the short example after this list).
  • Adaptive query execution enabled by default: Spark handles partitions that are too small or too big and automatically coalesces or splits them after certain operations. This makes jobs run more stably and potentially faster by reducing overhead.
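As a quick illustration of the pandas API layer, the snippet below uses pyspark.pandas instead of the separate Koalas package; the file path is a placeholder.

# pandas-like API on top of Spark DataFrames, shipped with Spark 3.2.
import pyspark.pandas as ps

df = ps.read_parquet("s3a://example-bucket/raw/orders/")  # illustrative path
print(df.head())
print(df.groupby("status").size())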

Spark Magic Committer support

We introduced Spark magic committer support for jobs that write a lot of data.

The normal Hadoop committer first writes files to AWS S3 in a _temporary folder and copies them to the correct location when the job is finished. On a normal file system this is instant; on AWS S3, however, it means copying the files again, which can be very slow when writing a lot of data.
The magic committer gets around this issue by writing to the correct location but only committing the files once everything is written. The files then “magically” appear all at once. This requires strongly consistent S3, which AWS released in December 2020.
To enable it, just add the parameter s3_committer="magic" to your DatafySparkSubmitOperatorV2. You can find more info in our documentation.
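A minimal sketch of what this looks like in a DAG is shown below; the s3_committer parameter comes from our docs, while the other operator arguments are illustrative assumptions.

# Enabling the magic committer on a Datafy Spark task (sketch).
write_task = DatafySparkSubmitOperatorV2(
    task_id="write_large_dataset",            # standard Airflow task id
    application="src/app/write_dataset.py",   # hypothetical application path
    s3_committer="magic",                      # use the S3A magic committer
)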

Airflow V1 deprecation

As a reminder, we deprecated Airflow V1 in September 2021 because it is no longer supported by the open source community. We therefore encourage our users to migrate to Airflow V2, which has all the latest features and security updates.
You can find the documentation on how to upgrade here. If you need any help or struggle with the upgrade, don’t hesitate to contact us.

Conclusion

This concludes the highlights of the Datafy features released at the end of 2021. We aim to host a webinar and publish a blog post every quarter about the features and improvements released that quarter. Registering for the previous webinar makes it possible for us to notify you of future webinars and blog posts, so don’t forget to register!
