Preparing Data using SageMaker Data Wrangler Part 1

Published in Impelsys · Mar 28, 2024

Introduction

Preparing data for machine learning models is always challenging, and healthcare data makes it harder because the data is often inconsistent. You have to deal with missing fields, duplicate rows, invalid columns, and more. The quality of your data largely determines the accuracy of your model. So, what can we do to improve the quality of our data?

Data preparation usually requires a good working knowledge of a programming language. Here, however, we will explore a cloud-based data preparation tool called SageMaker Data Wrangler, which is provided by Amazon Web Services.

The primary advantage of this tool is that it is a “no-code” data preparation tool, which makes data preparation far less challenging. Yes, you read that right. SageMaker Data Wrangler doesn’t need any code for common data preparation tasks; it provides built-in functions for them.

But what if our requirement is not covered by the built-in functions? SageMaker Data Wrangler provides a solution for this problem as well. You can write your own function using PySpark, Pandas, or Python. Let’s dive into the details.
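To give a feel for the kind of logic a custom function might hold, here is a minimal plain-Python sketch. The helper name `normalize_yes_no` and the sample values are hypothetical, and the way such a function is wired into a Data Wrangler custom step is not shown here.

```python
from typing import Optional

# Hypothetical helper: map free-text yes/no answers to a clean boolean.
TRUTHY = {"yes", "y", "true", "1"}
FALSY = {"no", "n", "false", "0"}


def normalize_yes_no(value: Optional[str]) -> Optional[bool]:
    """Return True/False for recognizable answers, None for missing or unknown ones."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    if cleaned in TRUTHY:
        return True
    if cleaned in FALSY:
        return False
    return None


# Made-up raw values of the sort an inconsistent column might contain.
raw_values = [" Yes", "no", "", None, "Y"]
cleaned_values = [normalize_yes_no(v) for v in raw_values]
print(cleaned_values)
```

Keeping the rule in a small, pure function makes it easy to check locally before using it on the real dataset.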

Objective

The objective of this article and the next is to clean, plot, and prepare the Medical Students dataset using SageMaker Data Wrangler. The dataset can be obtained from Kaggle. Please go to https://www.kaggle.com/datasets/slmsshk/medical-students-dataset and download your copy.

Workflow

The image below gives a high-level overview of our workflow.

Follow the steps below to achieve this workflow.

1. Upload the Dataset to S3 Bucket
2. Create or Open a SageMaker Canvas
3. Import the Sample Dataset to SageMaker Canvas
4. Apply Data Analysis Methods
5. Apply Data Transformations
6. Apply Data Processing to Complete Dataset

This article covers the first four steps in our workflow. We will explore the remaining two steps in part two.

1. Upload the Data to S3 Bucket

Navigate to the S3 section in the AWS console and create a new bucket, with a name of your choice, for storing our dataset.
After creating the S3 bucket, click Upload, and drag and drop the dataset into the S3 bucket.
Click Upload at the bottom of the page, and confirm that the file has been uploaded to the S3 bucket.
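If you prefer a script to the console, the same three steps can be sketched in Python. This is only a sketch under assumptions: `upload_dataset` is a hypothetical helper, the client passed in is expected to behave like `boto3.client("s3")`, and the bucket name in the usage comment is a placeholder.

```python
from typing import Any, Optional


def upload_dataset(
    s3_client: Any,
    bucket: str,
    local_path: str,
    key: str,
    region: Optional[str] = None,
) -> bool:
    """Create a bucket, upload one file, and confirm the object exists.

    `s3_client` is expected to behave like `boto3.client("s3")`.
    `region` should match the region the client was created for.
    """
    if region is None or region == "us-east-1":
        # us-east-1 is the default location and takes no location constraint.
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )

    # Upload the local file under the given object key.
    s3_client.upload_file(local_path, bucket, key)

    # head_object raises an error if the object is missing, so reaching the
    # return statement means the upload is visible in the bucket.
    s3_client.head_object(Bucket=bucket, Key=key)
    return True


# Example usage (requires boto3 and AWS credentials; names are placeholders):
# import boto3
# client = boto3.client("s3", region_name="us-east-1")
# upload_dataset(client, "my-example-bucket",
#                "medical_students_dataset.csv", "medical_students_dataset.csv")
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before pointing it at a real AWS account. Note that bucket names are globally unique, so creating the bucket fails if the name is already taken by another account.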

2. Create or Open a SageMaker Canvas

Here, we will use SageMaker Canvas, a no-code ML platform that allows for preparing, training, and deploying ML models without writing any code.

Go to SageMaker in the AWS Console and select Canvas in the left-side panel.
Click Create a canvas if you don’t have one already; otherwise, open the existing canvas.

Once Canvas has loaded, head over to Data Wrangler in the side panel and select “Data flows”.

Click on “Create a data flow”, give the flow a name, and click “Create”.

3. Import the Sample Dataset to SageMaker Canvas

Once the data flow is created and the page has loaded, click “Import data” and select Tabular as the dataset type.
Once the page is loaded, click the Data Sources drop-down and select Amazon S3 as the source. Select the S3 bucket where you uploaded the dataset.

Select the “medical_students_dataset.csv” inside the S3 bucket and click “Import data”.

Now, SageMaker Canvas will prepare a data preview for the dataset that we imported, by selecting a sample from it.

Ensure that all the columns appear in the preview and click “Import data”.

4. Apply Data Analysis Methods

SageMaker provides a variety of built-in data analysis methods to understand the quality and quantity of our data. Let’s explore a few of them in this article.

a. Data Quality Report

First, create a data quality report.
Once the data is imported successfully, click on Analysis and select “Data Quality and Insights Report” as the “Analysis Type”.
Give the analysis a name and select “Diabetes” as the “Target column” and “Classification” as the “Problem Type”.
Click on Create to generate the data quality report.

In a few minutes, SageMaker will create a data quality report for us. It will give an overview of our dataset, including the number of rows, duplicate rows, missing values, etc. Scroll down in the quality report to get more insights.
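To make those overview numbers concrete, the sketch below computes a row count, a duplicate-row count, and missing values per column for a tiny in-memory CSV. It is not how SageMaker builds its report; `summarize_quality` is a hypothetical helper and the sample rows are made up, with only the column names taken from this article.

```python
import csv
import io

# Made-up sample rows; only the column names come from the article.
SAMPLE_CSV = """Age,Height,Weight,Diabetes
21,170.5,65.2,No
34,,80.1,Yes
21,170.5,65.2,No
29,160.0,,No
"""


def summarize_quality(csv_text):
    """Return row count, duplicate-row count, and missing values per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # A row counts as a duplicate if an identical row was seen earlier.
    seen = set()
    duplicates = 0
    for row in rows:
        fingerprint = tuple(row.values())
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)

    # An empty or whitespace-only cell is treated as missing.
    missing = {}
    if rows:
        for column in rows[0]:
            missing[column] = sum(
                1 for row in rows if (row[column] or "").strip() == ""
            )

    return {
        "rows": len(rows),
        "duplicate_rows": duplicates,
        "missing_per_column": missing,
    }


report = summarize_quality(SAMPLE_CSV)
print(report)
```

Running the same checks on a small sample is a quick way to sanity-check what the generated report tells you.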

b. Histogram

Next, create a histogram for the age column in our dataset.
After exploring the quality report, click on the flow name at the top (“pre-process-student-data.flow” in our case). It will take you back to the data flow and give you a visual representation of the steps we have applied so far.
Click on the plus icon to the right of “Data types” and select “Add Analysis”.

Select “Histogram” as the Analysis type, choose “Age” for the X-axis, and give the histogram a name.

Click on Preview to see the generated histogram. If you wish to keep the histogram with the workflow, click Add. If you only need it to understand the data, you can clear it once it has been previewed.
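Under the hood, a histogram simply counts how many values fall into each bin. The sketch below shows that idea with fixed-width bins in plain Python; the ages, the bin width, and the helper name `histogram_counts` are made up for illustration and say nothing about how Data Wrangler chooses its bins.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count values per fixed-width bin; keys are each bin's lower edge."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Made-up ages, not taken from the dataset.
ages = [18, 19, 22, 23, 24, 27, 31, 34, 34]
BIN_WIDTH = 5

age_bins = histogram_counts(ages, BIN_WIDTH)

# Print a simple text histogram, one row per bin.
for lower, count in age_bins.items():
    print(f"{lower}-{lower + BIN_WIDTH - 1}: {'#' * count}")
```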

c. Scatter Plot

Now create a scatter plot of height against weight.
Click on Add Analysis, select “Scatter Plot” as the Analysis type, and select Height on the X-axis and Weight on the Y-axis.

Click on Preview to see the generated scatter plot.

d. Feature Correlation

Finally, create a feature correlation matrix, which quantifies how strongly each feature (column) is related to every other feature (column).
Click on Add Analysis and select “Feature Correlation” as the Analysis type and “linear” as the Correlation type.

Click on Preview to see the Feature Correlation Matrix.
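For intuition about what the matrix contains, here is a minimal sketch that computes pairwise Pearson correlation coefficients in plain Python. It assumes that the “linear” option corresponds to Pearson correlation for numeric columns; the helper names and the sample numbers are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names}
        for a in names
    }


# Made-up numeric columns, not taken from the dataset.
sample_columns = {
    "Height": [150, 160, 170, 180],
    "Weight": [52, 58, 71, 79],
}

matrix = correlation_matrix(sample_columns)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.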

We have successfully explored a few of the analysis methods; feel free to explore the other analysis methods provided by SageMaker Canvas.

Conclusion

We have successfully imported the data and applied a few data analysis methods. So far, we have completed the first four steps in our workflow. Remember that we have explored only a few of the built-in analysis methods in SageMaker Data Wrangler. Keep this as a base, and feel free to explore other methods as well. In the next part, we will apply the data transformation methods to this sample dataset.