AWS Glue DataBrew

AWS Glue DataBrew — A no-code visual data preparation tool for data scientists.

Workfall
The Workfall Blog


The next-generation visual data preparation tool

Let’s explore!

AWS Glue is a serverless, managed service that prepares data for analysis through automated ETL processes. It is a simple and cost-effective way to catalog and manage big data in the enterprise. It provides organizations with a data integration tool that formats information from disparate data sources and organizes it in a central repository, where it can be used to inform business decisions.

AWS Glue reduces the time it takes to analyze and present data from months to a few hours. To make this magic possible, AWS Glue provides both code-based and visual interfaces.

To extend the ETL capabilities of AWS Glue, AWS recently announced the AWS Glue DataBrew service, a no-code visual data preparation tool that helps users clean and normalize data without writing code.

DataBrew works with CSV, Parquet, JSON, or XLSX data stored in S3, Redshift, the Relational Database Service (RDS), or any other AWS data store accessible through a JDBC connection.

In addition, it supports more than 250 pre-built transformations that automate data preparation tasks such as filtering anomalies, standardizing data formats, and correcting invalid values. After the data is prepared, you can immediately use it for analytics and machine learning.

Let’s explore this service with a quick exercise, which we have divided into three parts:

Create an S3 bucket & upload a sample .CSV file

Provide DataBrew read permissions to the S3 bucket

Create a DataBrew project to visually explore, understand, combine, clean, and normalize data in a dataset

1. Create an S3 bucket & upload a sample .CSV file

In this part, we will work with an S3 bucket and a .CSV file. We have a sample data file which we will upload to the S3 bucket and make publicly accessible.

So let’s go to the AWS console, quickly create a bucket named “workfallbucket”, and make the necessary changes to its permissions and policy.

(Note: The AWS Glue DataBrew service is available only in certain regions, listed in part 3 below. So before starting this exercise, make sure the bucket you create is in the same supported region where you will use DataBrew.)

Let’s proceed with the Asia Pacific (Tokyo) region for this exercise and create the bucket in that region, as shown in the following image:

Now, let’s make our bucket public by unchecking the “Block all public access” checkbox and clicking Save changes. (Refer to the following image.)

Next, let’s upload a data file as an object and make this file public by applying the required policy settings.

Click on the file name and you will see the following screen:

Now, if you try to access the Object URL of the file, you will see the following error message:

Let’s make the required changes so that the file can be accessed publicly. To do this, go to the Permissions tab of the bucket and edit the bucket policy. You can generate the required policy using the policy generator or write it directly in the policy editor.
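For reference, a bucket policy along these lines makes every object in the bucket publicly readable. The bucket name matches the one created in this exercise; adjust it for your own bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::workfallbucket/*"
    }
  ]
}
```

Note the `/*` in the Resource ARN: the policy applies to the objects inside the bucket, not to the bucket itself.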

Now, as you can see in the following image, our file is publicly available and we can use it for data analytics.

Next, we have to make sure that the user has permission to use DataBrew. Under Access permissions, select an IAM role that grants DataBrew read access to the input S3 bucket. Only roles where DataBrew is the service principal in the trust policy are shown in the DataBrew console. To create one in the IAM console, select DataBrew as the trusted entity.

2. Provide DataBrew read permissions to the S3 bucket

To give DataBrew permission to read the S3 bucket, you need to create a role that grants DataBrew read access to the input bucket. As shown in the image below, create a new role named DataBrewuser and attach the two permissions shown:
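Choosing “DataBrew as trusted entity” in the IAM console means the role’s trust policy names the DataBrew service principal, so the service is allowed to assume the role on your behalf. The generated trust policy looks roughly like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "databrew.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Roles without this trust relationship will not appear in the DataBrew console’s role selector, which is why creating the role through the DataBrew trusted-entity flow is the easiest path.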

3. Create a DataBrew project to visually explore, understand, combine, clean, and normalize data in a dataset

Now it’s time to use the DataBrew service to visualize the data. AWS Glue DataBrew is currently available in the following regions:

  1. US East (N. Virginia)
  2. US East (Ohio)
  3. US West (Oregon)
  4. Europe (Ireland)
  5. Europe (Frankfurt)
  6. Asia Pacific (Tokyo)
  7. Asia Pacific (Sydney)

So before proceeding, we need to make sure we choose Asia Pacific (Tokyo) as the region, because that is where we created the bucket and uploaded the file.

From the AWS Management Console, find and select the AWS Glue DataBrew service as shown in the following image:

You will see the following screen. Let’s create our first DataBrew project by clicking the Create project button.

You will see the following screen. Enter the project name and choose New dataset under Select a dataset. (If you don’t have a dataset, you can also proceed with this exercise by selecting the Sample files option.)

Give the dataset a name and choose the bucket that contains the data file, as shown in the following image:

Next, choose the role we created at the beginning of the exercise and click the Create project button.

It will take some time, and once the dataset is ready, you will get the following screen.

As you can see in the above image, Grid view is the default when we create a new project. In this view, we can see the data as it was imported. For each column, there is a summary of the range of values found; for numerical columns, a statistical distribution is shown.

In the Schema view (shown in the following image), we can drill down into the schema that was inferred and optionally hide some of the columns. In the following image, we can see all the columns of the data. (If you hide a few columns and go back to Grid view, you will no longer see the data in the hidden columns.)

In the Profile view (shown in the following image), we can run a data profile job to examine and collect statistical summaries about the data: an assessment of its structure, content, relationships, and derivation. When the profile job succeeds, we can see a summary of the rows and columns in our dataset, how many rows and columns are valid, and the correlations between columns.
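DataBrew computes these summaries for you, but to build intuition for what a profile job collects per column, here is a minimal local sketch in plain Python. The rows and column names are hypothetical stand-ins for the sample sales file, not the actual dataset:

```python
from statistics import mean

# Hypothetical rows resembling the sample sales dataset used in this exercise.
rows = [
    {"Country": "India", "Total Revenue": 1200.0},
    {"Country": "Japan", "Total Revenue": 800.0},
    {"Country": "India", "Total Revenue": None},  # a missing (invalid) value
]

# Profile a single numeric column: row counts, validity, and basic statistics.
col = [r["Total Revenue"] for r in rows]
valid = [v for v in col if v is not None]

profile = {
    "rows": len(rows),
    "valid": len(valid),
    "missing": len(col) - len(valid),
    "min": min(valid),
    "max": max(valid),
    "mean": mean(valid),
}
print(profile)
```

A real profile job goes further (value distributions, cardinality, column correlations), but the valid/missing counts and min/max/mean summaries are the same kind of output you see in the Profile view.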

You can also check Data Lineage as shown in the following image:

Let’s quickly see how to split a column into multiple columns. This dataset has a column named Order Date, which we will split on a delimiter as shown in the image below:

After applying the split, this column is divided into three columns, as shown in the image below.
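The split-on-delimiter transform can be approximated in plain Python to see exactly what it does. The date format (month/day/year separated by “/”) and the suffixed output column names are assumptions for illustration:

```python
# Approximate DataBrew's "split column on delimiter" transform locally.
# The date layout and the "_1, _2, _3" naming are illustrative assumptions.
def split_on_delimiter(value, column="Order Date", delimiter="/"):
    parts = value.split(delimiter)
    # One new column per delimited piece, named after the source column.
    return {f"{column}_{i + 1}": p for i, p in enumerate(parts)}

print(split_on_delimiter("1/27/2010"))
# {'Order Date_1': '1', 'Order Date_2': '27', 'Order Date_3': '2010'}
```

In DataBrew this happens visually, and the recipe step records the source column and delimiter so the same split is re-applied when the recipe runs as a job.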

You can easily rename columns by clicking on the column heading, as shown in the following image:

Let’s see how the group functions work. Click the Group icon in the grid view to open the group screen. To see country-wise total revenue, choose the columns and functions as shown in the image below:

Now let’s run a job to save this data to the S3 bucket in two different formats, as shown in the image below:

Select the role, then click the Create and run job button; you will see the following:

Once the job has finished, you will see two folders inside your selected bucket, each containing the relevant output file, as shown in the image below:
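To make the “same data, two formats” idea concrete, here is a local sketch that serializes the same rows as both CSV and JSON (assuming those were the two formats chosen for the job; the rows are hypothetical):

```python
import csv
import io
import json

# Hypothetical aggregated output rows from the group step.
rows = [
    {"Country": "India", "Total Revenue": 1500.0},
    {"Country": "Japan", "Total Revenue": 800.0},
]

# CSV rendering, written to an in-memory buffer instead of S3.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["Country", "Total Revenue"])
writer.writeheader()
writer.writerows(rows)

# JSON rendering of the same rows.
json_text = json.dumps(rows)

print(csv_buf.getvalue())
print(json_text)
```

The DataBrew job does the equivalent serialization server-side and writes each format to its own folder under the output location you selected.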

So with this quick exercise, we have set up our first DataBrew project and seen how easy it is to prepare data for analytics.

With the evolution of technology and the great analysis tools currently on the market, AWS Glue DataBrew is arguably one of the easiest ways to prepare data for analytics, ML, and BI. Users can get the right business insights without writing, maintaining, or updating code, simplifying the analysis while getting accurate results.

Hope this information is helpful. We will keep sharing more about how to use new AWS services. Stay tuned!

For any further queries, feel free to post your comments, we are happy to help!

Meanwhile …

Keep Exploring -> Keep Learning -> Keep Mastering

This blog is part of our effort towards building a knowledgeable and kick-ass tech community. At Workfall, we strive to provide the best tech and pay opportunities to AWS-certified talents. If you’re looking to work with global clients, build kick-ass products, and also make big bucks doing so, give it a shot at workfall.com/partner today.
