Low-code big data preparation with Bumblebee

Published in Bumblebee · Mar 11, 2021

At Bumblebee, we believe that everyone should be able to easily explore, wrangle, and get insights from data, no matter their technical knowledge, available infrastructure, or data size and format.

So we created Bumblebee, an open-source low-code web platform in which you can process your data up to 20x faster using GPUs, all in an Excel-like interface. Load data from files or remote services like S3, connect to databases, apply any of the 100+ functions, merge multiple data sources, and save the result back. All of this in an easy-to-use interface.

Bumblebee can use any of multiple engines to process data. Right now you can use Pandas, Dask, cuDF, or Dask-cuDF. The engine that best fits your use case will depend on the data size and the hardware available to you.
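As a rough illustration of how data size and hardware drive that choice, here is a minimal sketch in plain Python. The function `pick_engine`, its parameters, and the 50% headroom factor are our own assumptions for this example, not a rule Bumblebee applies for you; treat it as a common rule of thumb only.

```python
def pick_engine(data_bytes, ram_bytes, gpu_mem_bytes=0, headroom=0.5):
    """Suggest an engine name from data size and available memory.

    Hypothetical helper: the headroom factor is an illustrative assumption
    that leaves room for intermediate copies during processing.
    """
    if gpu_mem_bytes > 0:
        # With a GPU: single-GPU cuDF if the data fits, otherwise Dask-cuDF.
        fits_gpu = data_bytes <= gpu_mem_bytes * headroom
        return "cuDF" if fits_gpu else "Dask-cuDF"
    # Without a GPU: in-memory Pandas if the data fits, otherwise Dask.
    fits_ram = data_bytes <= ram_bytes * headroom
    return "Pandas" if fits_ram else "Dask"


GB = 10**9
small_cpu = pick_engine(1 * GB, ram_bytes=16 * GB)
large_cpu = pick_engine(100 * GB, ram_bytes=16 * GB)
small_gpu = pick_engine(1 * GB, ram_bytes=16 * GB, gpu_mem_bytes=8 * GB)
large_gpu = pick_engine(100 * GB, ram_bytes=16 * GB, gpu_mem_bytes=8 * GB)
```

In practice you would also weigh how many columns you touch and how heavy the operations are, so use this only as a starting point.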

Let’s start

After reading this post you’ll have an idea of how to use Bumblebee and what you can do with it. If you already know what you need to transform, we believe that after navigating the UI and experimenting for a few minutes you’ll feel you already know how to wrangle data.

Installing Bumblebee

First of all, let’s install Bumblebee. You have multiple options: you can install it on your own machine or on a server, and you can do it from scratch or in a Docker container. Below are links to a couple of blog posts we wrote about installing Bumblebee on Digital Ocean from scratch and installing Bumblebee using Docker.

How to install Bumblebee on Digital Ocean

How to run Bumblebee in a Docker Container

With Docker, everything is easier. We just need to have Docker installed and pull the Bumblebee image.

With Bumblebee installed, let’s start by uploading a dataset. Simply click on Load from and use the file field to select a file from your local machine. Click the Preview button to get a preview of the file so you can check that the data was loaded correctly. In case Bumblebee could not detect the correct encoding, you can make small tweaks to the load settings to make it work.

You can also load from a previously configured connection (more on this in another post) or simply from a URL. Here’s an example of loading from an S3 bucket:
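To make the preview and encoding step more concrete, here is a minimal plain-Python sketch of the idea. It is not the code Bumblebee runs internally: the function name `preview_csv` and the list of candidate encodings are assumptions made for this example.

```python
import csv
import io


def preview_csv(raw, encodings=("utf-8", "latin-1"), n_rows=5):
    """Decode raw CSV bytes with the first encoding that works.

    Returns (encoding, header, first_rows). Hypothetical helper for
    illustration; the candidate encodings are an assumption.
    """
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
        rows = list(csv.reader(io.StringIO(text)))
        header = rows[0] if rows else []
        return encoding, header, rows[1:1 + n_rows]
    raise ValueError("none of the candidate encodings could decode the file")


# A small file saved as Latin-1, which is not valid UTF-8.
sample = "city,offense\nS\u00e3o Paulo,theft\nBoston,fraud\n".encode("latin-1")
encoding, header, rows = preview_csv(sample)
```

If the preview shows garbled characters, trying a different encoding first in the list plays the same role as tweaking the encoding field in the UI.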

Loading crime.csv from a private S3 connection

If you want to try the previously shown dataset here’s a direct link you can paste: https://bumblebee.nyc3.digitaloceanspaces.com/crime.csv

Once the data is loaded, all of it is available for us to analyze, both as a profile and as one massive table.

Profiling your data

When analyzing data, it is important to get a quick overview of the dataset we’re working on before transforming anything in it. For this, every column comes with a quality bar, a frequency or histogram chart, and some stats that will help you understand the data. Let’s dive in.

The data quality bar

This bar tells us how many values of a column match its profiled datatype, how many don’t match, and how many are missing.

A column showing a number of mismatches among more than 8k values

When you click a segment of this bar, you can act on the values it represents, either by dropping the matching rows or by replacing all the matching values.

Also, if there are any null values, they show up on this bar in gray.
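The three numbers behind the quality bar can be sketched in a few lines of plain Python. This is only an illustration of the idea under our own assumptions (the helper names `quality_counts` and `is_integer`, and treating empty strings as missing), not Bumblebee's profiler.

```python
def quality_counts(values, matches_type):
    """Count matches, mismatches, and missing values for one column."""
    counts = {"match": 0, "mismatch": 0, "missing": 0}
    for value in values:
        # Assumption: None and blank strings both count as missing.
        if value is None or str(value).strip() == "":
            counts["missing"] += 1
        elif matches_type(value):
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


def is_integer(value):
    """Hypothetical type check: does the value parse as an integer?"""
    try:
        int(str(value))
        return True
    except ValueError:
        return False


column = ["12", "7", "abc", None, "", "3.5", "40"]
counts = quality_counts(column, is_integer)
```

Here `counts` holds 3 matches, 2 mismatches ("abc" and "3.5"), and 2 missing values, which is the information the bar shows as colored segments.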

Histograms and frequency

Every column in your dataset has an overview of its values at the top of the table and in the details section, which is shown when you select one or more columns.

To get a more detailed view, simply select a column.

Here we can see more information, such as how many unique values there are in a column, the exact number of matches and mismatches, a bigger view of the most frequent values, and the most frequent patterns.
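As a hedged sketch of what these details mean, the snippet below computes unique counts, frequent values, and frequent patterns with the Python standard library. The pattern notation (U for uppercase, l for lowercase, # for digits) and the helper names are assumptions for this example and may differ from what Bumblebee displays.

```python
from collections import Counter


def to_pattern(value):
    """Map each character to a class symbol (assumed notation)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("#")
        elif ch.isalpha():
            out.append("U" if ch.isupper() else "l")
        else:
            out.append(ch)  # keep punctuation and spaces as-is
    return "".join(out)


def column_details(values, top=3):
    """Unique count, most frequent values, and most frequent patterns."""
    return {
        "unique": len(set(values)),
        "top_values": Counter(values).most_common(top),
        "top_patterns": Counter(to_pattern(v) for v in values).most_common(top),
    }


codes = ["AB-12", "CD-34", "AB-12", "ef-5", "AB-12"]
details = column_details(codes)
```

Patterns are handy for spotting outliers: four of the five codes share the shape `UU-##`, so the remaining `ll-#` value stands out immediately.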

Columns overview

Bumblebee has a secondary view that can be helpful when dealing with datasets with many columns.

To filter columns from the table view we can select a bunch of them and mark them as hidden or hide them one by one. Also, you’ll be able to filter them by their type.

This is useful to clean up the view on the table section.

Profiled datatypes

Every column in your dataset will have a datatype that extends the commonly used numeric and string ones.

A good example is a column of email addresses. Internally it’s a series of strings, but Bumblebee flags as mismatches all the values that do not meet the constraints of an email address.

Some supported types are:

Integer values
Decimal values
String values
Boolean values
Object values
Array values
Dates
Phone numbers
URLs
Social security numbers
Zip codes
Credit card numbers
US States
Genders
HTTP codes

This is useful to count and filter all the non-matching data in that column, which is shown to us in the data quality bar.
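To illustrate how a profiled datatype can go beyond "string", here is a simplified sketch that picks the type most values match and lists the mismatches. The regular expressions, the majority rule, and the names `DETECTORS` and `infer_type` are all assumptions for this example; real-world validation of emails or zip codes is stricter than this.

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not Bumblebee's).
DETECTORS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "http_code": re.compile(r"[1-5]\d{2}"),
}


def infer_type(values):
    """Return (type_name, mismatches) using a simple majority rule."""
    best_name, best_hits = "string", 0
    for name, pattern in DETECTORS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits > best_hits:
            best_name, best_hits = name, hits
    # Fall back to plain strings unless more than half of the values match.
    if best_hits * 2 <= len(values):
        return "string", []
    pattern = DETECTORS[best_name]
    return best_name, [v for v in values if not pattern.fullmatch(v)]


emails = ["ana@example.com", "bob@example.org", "not-an-email", "eve@example.net"]
dtype, mismatches = infer_type(emails)
plain_dtype, plain_mismatches = infer_type(["hello", "world"])
```

The values in `mismatches` are exactly the ones you would want to inspect, drop, or replace from the quality bar.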

Wrangling the data

Bumblebee has 150+ operations that work on string, numeric, and date types. Let’s look at some of the basic ones.

Replacing values

You can use the replace operation to replace one or more values with a given value. It can also be used to search for a string or a word inside the values of the column.
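The two flavors of replace described above, whole values versus text inside the values, can be sketched in plain Python as follows. The function names and sample data are hypothetical and only illustrate the behavior.

```python
def replace_values(values, old_values, new_value):
    """Replace whole cells that equal any of old_values."""
    targets = set(old_values)
    return [new_value if v in targets else v for v in values]


def replace_substring(values, search, new_text):
    """Replace every occurrence of search inside each value."""
    return [v.replace(search, new_text) for v in values]


districts = ["N/A", "Downtown", "unknown", "North End"]
cleaned = replace_values(districts, ["N/A", "unknown"], "Unspecified")

streets = ["Main St", "Oak St"]
renamed = replace_substring(streets, "St", "Street")
```

Note that the substring variant replaces every occurrence, so a short search term can also match in the middle of a word.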

Splitting and merging

You can also split the values of a column into several pieces by passing a character as the separator.

Similarly, you can merge columns using a separator. In this case, we’re using “, ”.
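Here is a small plain-Python sketch of both operations. The helper names `split_column` and `merge_columns` are assumptions for this example, and the merge uses the same “, ” separator mentioned above.

```python
def split_column(values, separator):
    """Split each value into a list of pieces."""
    return [v.split(separator) for v in values]


def merge_columns(columns, separator=", "):
    """Join the values of several columns row by row."""
    return [separator.join(parts) for parts in zip(*columns)]


dates = ["2021-03-11", "2020-12-01"]
pieces = split_column(dates, "-")

streets = ["Main St", "Oak Ave"]
cities = ["Boston", "Salem"]
merged = merge_columns([streets, cities])
```

Each list inside `pieces` would become a set of new columns in the table, while `merged` is a single new column.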

Extracting strings

You can extract a number of characters from the values of a column.

Here’s an example of a substring operation.

Getting the first n characters of the values

Similarly, there are more operations, like transforming to lowercase, uppercase, or proper case, removing accents, removing special characters, extracting strings, padding, and more.

Mathematical operations

For numeric columns, there are operations like abs, round, exp, mod, pow, floor, ceil, ln, and log, plus all the trigonometric functions, like sin, cos, tan, asin, acos, and atan, and all their hyperbolic variations. You can also calculate statistics like median, mean, kurtosis, skew, and mad.
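For readers who like to see the logic, here is how a few of these string operations look in plain Python. The helpers `first_n` and `remove_accents` are hypothetical names for this sketch, and the accent removal relies on Unicode decomposition, which is one common approach rather than necessarily the one Bumblebee uses.

```python
import unicodedata


def first_n(value, n):
    """Substring: keep the first n characters."""
    return value[:n]


def remove_accents(value):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


name = "jos\u00e9 \u00c1LVAREZ"          # "josé ÁLVAREZ"
prefix = first_n(name, 4)               # first four characters
plain = remove_accents(name)            # accents stripped
proper = name.title()                   # proper case
padded = "42".rjust(5, "0")             # left padding with zeros
```

Lowercase and uppercase are the built-in `str.lower()` and `str.upper()` methods, so they need no helper.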
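A few of these numeric operations, sketched with the Python standard library. One assumption to flag: we interpret mad as the median absolute deviation; some tools use the same name for the mean absolute deviation, so check which one you need.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation (assumed meaning of 'mad')."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


values = [1.5, -2.25, 3.0, 10.0]

absolute = [abs(v) for v in values]        # element-wise abs
floors = [math.floor(v) for v in values]   # element-wise floor
ceils = [math.ceil(v) for v in values]     # element-wise ceil

mean_value = statistics.mean(values)
median_value = statistics.median(values)
mad_value = mad(values)
```

Element-wise operations return a new column of the same length, while statistics such as mean, median, and mad reduce the column to a single number.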

Also, you can make more advanced calculations using values from other columns as shown:

Advanced formulas using the “set” operation
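Conceptually, a formula that uses other columns computes a new value for every row. The sketch below shows that idea in plain Python; `set_column`, the column names, and the lambda-based formula are assumptions for illustration and do not reflect the formula syntax of the “set” operation in the UI.

```python
def set_column(rows, name, formula):
    """Return new rows with column `name` computed from each row."""
    return [{**row, name: formula(row)} for row in rows]


rows = [
    {"price": 10.0, "quantity": 3},
    {"price": 4.5, "quantity": 2},
]

# New column derived from two existing columns.
with_total = set_column(rows, "total", lambda r: r["price"] * r["quantity"])
totals = [row["total"] for row in with_total]
```

The original rows are left untouched, which mirrors how you can preview a transformation before applying it.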

Filtering using the charts

See how you can use the frequency bar to filter out rows with just a few clicks.

You can also select a range of values instead of just one, and instead of removing them you can keep or replace them.
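What happens behind those clicks can be sketched in plain Python: selecting a value or a range defines a condition, and the rows that meet it are dropped, kept, or replaced. The helper names and the half-open range (low included, high excluded) are assumptions for this example.

```python
def drop_values(rows, column, selected):
    """Remove the rows whose value is one of the selected values."""
    return [row for row in rows if row[column] not in selected]


def keep_range(rows, column, low, high):
    """Keep only the rows with low <= value < high (assumed bin bounds)."""
    return [row for row in rows if low <= row[column] < high]


def replace_range(rows, column, low, high, new_value):
    """Replace the values inside the range, leaving other rows as-is."""
    return [
        {**row, column: new_value} if low <= row[column] < high else row
        for row in rows
    ]


rows = [{"age": 15}, {"age": 32}, {"age": 47}, {"age": 90}]
without_15 = drop_values(rows, "age", {15})
adults = keep_range(rows, "age", 30, 50)
capped = replace_range(rows, "age", 80, 200, 80)
```

Clicking a single bar corresponds to the first case, and dragging over several bars of a histogram corresponds to the range cases.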

And finally saving the wrangled data

Bumblebee lets us download the dataset as a single file (use this wisely) or simply save it back to where it came from, such as an S3 bucket or your local filesystem (the one running the Python backend).
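For reference, writing wrangled rows out as a single CSV file looks roughly like this in plain Python. This is a generic sketch with a hypothetical helper name, `to_csv_text`, and it writes to an in-memory buffer; it does not show how Bumblebee saves to S3 or other connections.

```python
import csv
import io


def to_csv_text(rows, columns):
    """Serialize a list of row dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"city": "Boston", "offense": "fraud"},
    {"city": "Salem", "offense": "theft"},
]
text = to_csv_text(rows, ["city", "offense"])
```

To write to disk instead, you could pass a file opened with `open(path, "w", newline="")` to `csv.DictWriter` in place of the buffer.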

Wrap up

We showed some of the basic Bumblebee features, like profiling, exploring, and processing string and numeric data.

Bumblebee has many other functions. You can also join, append, filter, and sort rows, and more. We’re constantly adding new features, including more operations. We will talk about these operations in future posts.

To finish, if you want to know more about Bumblebee, please join our Slack channel or go to our website https://hi-bumblebee.com/.