Byzer 101 — PART 1

Open Source 3-in-ONE Data Tool = Databricks Notebook + dbt + BigQuery ML

Feb 24, 2022

What is Byzer Notebook?

A simple analogy for this exciting open source project: it is a 3-in-one data tool that blends Databricks Notebook, dbt, and BigQuery ML.

Byzer Notebook was created by William Zhu and his team with one mission in mind —

to empower less specialized data practitioners with basic SQL skills to build big data ETL pipelines and machine learning pipelines with confidence, no longer requiring a deep knowledge of distributed computing intricacies.

Additionally, Byzer has its own programming language called Byzer-Lang. The beauty of Byzer-Lang is that it is similar to SQL but goes far beyond it. It blends the best parts of SQL, Python, Java, Scala, Go, Jinja, PySpark, and others, yet keeps the defining quality of SQL: simplicity. A SQL developer can grasp Byzer-Lang in a couple of hours and deliver a common ETL pipeline in half an hour.

Build ETL Pipelines and Machine Learning Pipelines,

REDEFINED.

The more you use Byzer Notebook, the more you realize how much thought has gone into its development.

In this blog series, you’ll learn how to use Byzer Notebook on your local machine to implement some common use cases, ranging from building ETL pipelines to developing a fully connected neural network model from scratch for the CIFAR-10 object classification dataset. You’ll see how Byzer can transform the way you store and analyze data and train ML and AI models.

Without further ado, let’s get our hands dirty.

Prerequisites

Python 3.6.13 or newer
Conda (Anaconda or Miniconda)
Linux, macOS, or Windows

Note that Python support on Windows is not available in Byzer Notebook yet but is coming soon.

Step 1: Install Visual Studio Code

Warning: Please select the Light color theme for a better user experience.

Step 2: Install Byzer Extension within VS Code

1> Download Byzer Extension

https://download.byzer.org/byzer/

2> Click the Extensions icon in the Activity Bar on the side of VS Code, or run the View: Extensions command (Ctrl+Shift+X)

3> Click the three dots icon and choose Install from VSIX from the drop-down menu

4> Pick the downloaded VSIX file and then click Install
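If you prefer the terminal, the same installation can likely be done with the VS Code command-line launcher and its --install-extension option. The snippet below is only a sketch: it assumes the code command is on your PATH, and install_vsix is a hypothetical helper name, not something shipped with Byzer.

```shell
# Hypothetical helper: install a downloaded .vsix file with the VS Code CLI.
# Assumes the `code` launcher is available on PATH.
install_vsix() {
    vsix_path="$1"
    # Refuse anything that does not look like a .vsix file
    case "$vsix_path" in
        *.vsix) ;;
        *) echo "expected a .vsix file, got: $vsix_path" >&2; return 1 ;;
    esac
    # Refuse paths that do not exist
    if [ ! -f "$vsix_path" ]; then
        echo "file not found: $vsix_path" >&2
        return 1
    fi
    code --install-extension "$vsix_path"
}
```

Call it with the path to the VSIX file you downloaded in 1>, then reload VS Code if the extension does not show up right away.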

Step 3: Download Byzer Example Project

1> Download and unzip Byzer Example Project

2> Open the downloaded files in VS Code

Step 4: Install Ray and Other Dependencies

1> Create and activate a new Python virtual environment with Conda.

Warning: Try another CLI tool if the Terminal in VS Code crashes.

conda create -n dev python=3.6.13
conda activate dev
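Before installing anything, it may be worth confirming that the activated environment really provides the Python version from the prerequisites. The sketch below assumes sort -V (version sort) is available, which is the case for GNU coreutils and recent macOS. version_ge and check_python are hypothetical helper names. A version sort is used because a plain string comparison would wrongly rank 3.10 below 3.6.

```shell
# version_ge A B: succeed when dotted version A is at least dotted version B.
# Relies on `sort -V` to order version strings numerically per component.
version_ge() {
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    [ "$lowest" = "$2" ]
}

# Hypothetical helper: compare the active interpreter with the 3.6.13 prerequisite.
check_python() {
    # `python --version` prints something like "Python 3.6.13"
    current="$(python --version 2>&1 | awk '{print $2}')"
    if version_ge "$current" "3.6.13"; then
        echo "Python $current satisfies the prerequisite"
    else
        echo "Python $current is older than 3.6.13" >&2
        return 1
    fi
}
```

Run check_python right after conda activate dev. If it reports an older version, the environment is probably not active in the current terminal.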

2> Install Ray and other dependencies in this new environment

pip install --upgrade \
    pyarrow==4.0.1 \
    "ray[default]==1.8.0" \
    aiohttp==3.7.4 \
    "pandas~=1.0.5" \
    requests \
    "matplotlib~=3.3.4" \
    "uuid~=1.30" \
    pyjava \
    opencv-python \
    pyecharts \
    seaborn \
    scikit-learn \
    keras \
    tensorflow
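A long pip install can fail partway through without being obvious, so a quick import check can save debugging time later. This is a minimal sketch, and check_imports is a hypothetical helper name. Note that it takes import names rather than pip names: opencv-python is imported as cv2 and scikit-learn as sklearn.

```shell
# Hypothetical helper: report which Python modules fail to import.
# Uses `python` if present, otherwise falls back to `python3`.
check_imports() {
    py="$(command -v python || command -v python3)"
    missing=0
    for module in "$@"; do
        if "$py" -c "import $module" >/dev/null 2>&1; then
            echo "ok       $module"
        else
            echo "MISSING  $module"
            missing=1
        fi
    done
    # Non-zero exit status when at least one module could not be imported
    return "$missing"
}

# Example call (import names, not pip names):
#   check_imports pyarrow ray aiohttp pandas requests matplotlib pyjava \
#       cv2 pyecharts seaborn sklearn keras tensorflow
```

Any line marked MISSING points to a package worth reinstalling on its own, so you can read its error message in isolation.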

3> Start a single-node Ray runtime on your local machine

Ray is a next-generation distributed computing framework that originated at UC Berkeley’s RISELab, the successor to the AMPLab where Apache Spark was born. Ray makes it easy to parallelize single-machine code. By following this tutorial, you will see how Byzer’s hybrid runtime (Ray plus Apache Spark) performs when running compute-intensive ML workloads on your local machine.

ray start --head
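To confirm that the head node actually came up before moving on, you can ask the Ray CLI for the cluster status. The sketch below assumes that ray status exits with a non-zero code when no running cluster can be found. ray_is_up is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only when a local Ray cluster answers `ray status`.
ray_is_up() {
    # The ray CLI is only on PATH while the conda environment is active
    if ! command -v ray >/dev/null 2>&1; then
        echo "ray CLI not found; is the conda environment active?" >&2
        return 2
    fi
    if ray status >/dev/null 2>&1; then
        echo "Ray is running"
    else
        echo "Ray is not reachable; try: ray start --head" >&2
        return 1
    fi
}
```

When you are done experimenting, ray stop shuts the local runtime down again.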

Step 5: Perform Your First Byzer Run

1> Open mlsql-lang-example-project-master/src/try_mlsql.mlsqlnb

2> Click the arrow button to run your first Byzer code cell

If you see the following error:

Error: Unable to create database default as failed to create its directory file:/<placeholder>/mlsql-lang-example-project-master/spark-warehouse

Try this command to create the missing folder:

mkdir -p /<placeholder - replace this part with your own path>/mlsql-lang-example-project-master/spark-warehouse
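If you would rather not type the full path, a small helper can build it from the project folder. This is only a sketch: ensure_warehouse is a hypothetical name, and it assumes the folder you pass (or your current directory, if you pass nothing) is the unzipped example project.

```shell
# Hypothetical helper: create the spark-warehouse folder inside a project directory.
# Defaults to the current directory when no argument is given.
ensure_warehouse() {
    project_dir="${1:-$PWD}"
    if [ ! -d "$project_dir" ]; then
        echo "project directory not found: $project_dir" >&2
        return 1
    fi
    # Quoting keeps paths with spaces intact; prints the folder on success
    mkdir -p "$project_dir/spark-warehouse" && echo "$project_dir/spark-warehouse"
}
```

Run it from inside the example project folder, then rerun the code cell.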

Please leave a comment here or join Slack to ask questions, get help, or discuss all things Byzer!
Last but not least, please share Byzer with data enthusiasts around you if you like this open-source project!

Don’t want to miss out on the updates?

Please share this blog and follow me on Medium for upcoming posts.
Thanks!