Preparing Data using SageMaker Data Wrangler Part 1

Musthafa Vakkayil
Impelsys
Published in
6 min read · Mar 28, 2024

Introduction

Preparing data for machine learning models is always challenging, and healthcare data is harder still because it is often inconsistent. You have to deal with missing fields, duplicate rows, invalid columns, and more. The quality of your data largely determines the accuracy of your model. So, what can we do to improve the quality of our data?

Data preparation usually requires a good working knowledge of a programming language. Here, however, we will explore SageMaker Data Wrangler, a cloud-based data preparation tool provided by Amazon Web Services.

The primary advantage of this tool is that it is a “no-code” data preparation tool, which takes much of the difficulty out of data preparation. Yeah, you read that right. Data Wrangler doesn’t need any code for common data preparation tasks; it provides built-in functions for them.

But what if our requirement is not covered by the built-in functions? SageMaker Data Wrangler has a solution for this problem as well: you can write your own function using PySpark, Pandas, or Python. Let’s dive into the details.
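To make the idea of a custom function concrete, here is a minimal sketch in plain Python. The `bmi` helper, the units (centimetres and kilograms), and the sample rows are assumptions made for illustration; they are not taken from the dataset or from Data Wrangler, and how Data Wrangler hands your data to custom code depends on the option you pick in the UI.

```python
from typing import Optional


def bmi(height_cm: Optional[float], weight_kg: Optional[float]) -> Optional[float]:
    """Return the body-mass index, or None when an input is missing or invalid."""
    if height_cm is None or weight_kg is None:
        return None
    if height_cm <= 0 or weight_kg <= 0:
        return None
    height_m = height_cm / 100.0  # assumption: height is recorded in centimetres
    return round(weight_kg / (height_m ** 2), 2)


# Hypothetical rows; the second one has a missing height.
rows = [
    {"Height": 170.0, "Weight": 65.0},
    {"Height": None, "Weight": 70.0},
]

# Derive a new column without failing on missing values.
for row in rows:
    row["BMI"] = bmi(row["Height"], row["Weight"])

print(rows)
```

The point of the sketch is the shape of such a function: it takes raw column values, guards against missing or invalid input, and returns a derived value.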

Objective

In this article and the next, we will work with the Medical Students dataset, which is available on Kaggle. Please go to https://www.kaggle.com/datasets/slmsshk/medical-students-dataset and download your copy. Using this dataset, we will clean, plot, and prepare the data using SageMaker Data Wrangler.

Workflow

The image below gives a high-level overview of our workflow.

Fig 1: Workflow

Follow the steps below to complete this workflow.

  1. Upload the dataset to S3 Bucket
  2. Create or Open a SageMaker Canvas
  3. Import the Sample Dataset to SageMaker Canvas
  4. Apply Data Analysis Methods
  5. Apply Data Transformations
  6. Apply Data Processing to Complete Dataset

This article covers the first four steps in our workflow. We will explore the other two steps in part two.

1. Upload the Data to S3 Bucket

  • Navigate to the S3 section in the AWS console and create a new bucket, with a name of your choice, for storing our dataset.
  • After creating the S3 bucket, click Upload, and drag and drop the dataset into the upload area.
  • Click Upload at the bottom of the page, and confirm that the data has been uploaded to the S3 bucket.
Fig 2: Data in S3 Bucket
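
If you would rather script the upload than use the console, the same step can be sketched in Python. This assumes an S3 client object with an `upload_file(Filename, Bucket, Key)` method, such as the one returned by `boto3.client("s3")`; the `upload_dataset` helper and the bucket name in the note below are hypothetical.

```python
from pathlib import Path


def upload_dataset(s3_client, local_path, bucket, key=None):
    """Upload a local file to an S3 bucket and return its s3:// URI.

    s3_client is expected to behave like a boto3 S3 client, that is, to
    provide upload_file(Filename, Bucket, Key).
    """
    path = Path(local_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    # Default to the file name as the object key.
    object_key = key or path.name
    s3_client.upload_file(str(path), bucket, object_key)
    return f"s3://{bucket}/{object_key}"
```

With boto3 installed and AWS credentials configured, calling `upload_dataset(boto3.client("s3"), "medical_students_dataset.csv", "example-bucket")` would upload the file and return its S3 URI. Passing the client in as an argument keeps the helper easy to try out with a stand-in object.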

2. Create or Open a SageMaker Canvas

Here, we will use SageMaker Canvas, a no-code ML platform that lets you prepare data and train and deploy ML models without writing any code.

  • Go to SageMaker in the AWS Console and select Canvas in the left-side panel.
  • Click Create a canvas if you don’t have one already; otherwise, open the existing canvas.
Fig 3: Open SageMaker Canvas
  • Once the Canvas has loaded, head over to Data Wrangler in the side panel and select “Data flows”.
Fig 4: Data Flows
  • Click on “Create a data flow”, give the flow a name, and click “Create”.

3. Import the Sample Dataset to SageMaker Canvas

  • Once the data flow is created and the page has loaded, click “Import data” and select Tabular as the dataset type.
  • Once the page is loaded, click the Data Sources drop-down and select Amazon S3 as the source. Select the S3 bucket where you uploaded the dataset.
Fig 5: Data Source
  • Select the “medical_students_dataset.csv” inside the S3 bucket and click “Import data”.

Now, SageMaker Canvas will prepare a data preview for the dataset that we imported, by selecting a sample from it.

  • Ensure that all the columns appear in the preview and click “Import data”.
Fig 6: Data Preview
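
The same kind of check can be done locally before uploading. The sketch below reads the column names and the first few rows of a CSV using only the Python standard library. The `preview_csv` helper and the sample rows are made up for illustration (only the column names appear in this article), and taking the first rows is an assumption: it is not necessarily how Canvas picks its sample.

```python
import csv
import io
from itertools import islice


def preview_csv(stream, n_rows=5):
    """Return the column names and the first n_rows records of a CSV stream."""
    reader = csv.DictReader(stream)
    sample = list(islice(reader, n_rows))
    columns = list(reader.fieldnames or [])
    return columns, sample


# Made-up rows for illustration; they are not taken from the dataset.
SAMPLE_CSV = (
    "Age,Height,Weight,Diabetes\n"
    "21,170,65,No\n"
    "34,158,80,Yes\n"
    "27,181,74,No\n"
)

columns, sample = preview_csv(io.StringIO(SAMPLE_CSV), n_rows=2)
print(columns)
for record in sample:
    print(record)
```

To preview the real file, open it with `open("medical_students_dataset.csv", newline="")` and pass the file object to `preview_csv` in place of the in-memory sample.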

4. Apply Data Analysis Methods

SageMaker provides a variety of built-in data analysis methods for understanding the quality and quantity of our data. Let’s explore a few of them in this article.

a. Data Quality Report

  • Firstly, create a data quality report.
  • Once the data is imported successfully, click on Analysis and select “Data Quality and Insights Report” as the “Analysis Type”.
  • Give the analysis a name and select “Diabetes” as the “Target column” and “Classification” as the “Problem Type”.
  • Click on Create to generate the data quality report.
Fig 7: Create Analysis

In a few minutes, SageMaker will create a data quality report for us. It will give an overview of our dataset, including the number of rows, duplicate rows, missing values, etc. Scroll down in the quality report to get more insights.

Fig 8: Data Quality Report
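
To get a feel for the headline numbers in such a report, here is a rough standard-library sketch that counts rows, duplicate rows, and missing values per column. It is not how Data Wrangler computes its report; the `quality_summary` helper, the sample rows, and the definition of a duplicate (every exact copy after the first) are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def quality_summary(stream):
    """Count rows, duplicate rows, and missing values per column in a CSV stream."""
    reader = csv.DictReader(stream)
    columns = list(reader.fieldnames or [])
    missing = {col: 0 for col in columns}
    seen = Counter()
    for row in reader:
        # Treat absent fields and blank strings as missing values.
        values = tuple((row.get(col) or "").strip() for col in columns)
        seen[values] += 1
        for col, value in zip(columns, values):
            if value == "":
                missing[col] += 1
    return {
        "rows": sum(seen.values()),
        # Every exact copy after the first occurrence counts as a duplicate.
        "duplicate_rows": sum(count - 1 for count in seen.values()),
        "missing": missing,
    }


# Made-up rows: one exact duplicate and two missing values.
SAMPLE_CSV = (
    "Age,Height,Weight,Diabetes\n"
    "21,170,65,No\n"
    "21,170,65,No\n"
    "34,,80,Yes\n"
    ",160,55,No\n"
)

summary = quality_summary(io.StringIO(SAMPLE_CSV))
print(summary)
```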

b. Histogram

  • Next, create a histogram for the age column in our dataset.
  • After exploring the quality report, click on “pre-process-student-data.flow” at the top. It will take you back to the data flow and give you a visual representation of the transformations we have applied so far.
  • Click the plus icon to the right of “Data types” and select “Add Analysis”.
Fig 9: Add Analysis
  • Select “Histogram” as the Analysis type and “Age” as the X-axis, and give the histogram a name.
Fig 10: Create Histogram
  • Click on Preview to see the generated histogram. If you wish to keep the histogram with the workflow, click Add. If you only need it to understand the data, you can clear it after previewing.
Fig 11: Histogram preview
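
Under the hood, a histogram is just a count of values per bin. The sketch below shows that idea with the standard library; the `histogram` helper, the bin width of 5, and the ages are made up for illustration and do not reflect Data Wrangler's own binning.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per fixed-width bin, keyed by each bin's lower edge."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter()
    for value in values:
        if value is None:  # skip missing values
            continue
        lower_edge = int(value // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Made-up ages, including one missing value.
ages = [18, 19, 21, 25, 25, 29, 30, 34, None]
bins = histogram(ages, bin_width=5)  # {15: 2, 20: 1, 25: 3, 30: 2}

# Print a simple text rendering, one row per bin.
for lower_edge, count in bins.items():
    print(f"{lower_edge}-{lower_edge + 4}: {'#' * count}")
```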

c. Scatter Plot

  • Now create a scatter plot of height against weight.
  • Select Add Analysis, choose Scatter plot as the Analysis type, and select Height for the X-axis and Weight for the Y-axis.
Fig 12: Create Scatter plot
  • Click on Preview to see the generated Scatter plot.
Fig 13: Scatter Plot preview

d. Feature Correlation

  • Finally, create a feature correlation matrix, which shows how strongly each feature (column) is related to every other feature (column).
  • Click on Add Analysis and select “Feature Correlation” as the Analysis type and “linear” as the Correlation type.
Fig 14: Create Feature Correlation
  • Click on Preview to see the Feature Correlation Matrix.
Fig 15: Feature Correlation preview
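
For intuition about what a linear correlation matrix contains, here is a small sketch that computes Pearson correlation coefficients between numeric columns. It assumes that the "linear" option corresponds to Pearson correlation; the `pearson` and `correlation_matrix` helpers and the sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up numeric columns with no missing values.
data = {
    "Age": [21, 34, 27, 45],
    "Height": [170, 158, 181, 165],
    "Weight": [65, 80, 74, 70],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values close to 1 or -1 indicate a strong linear relationship, values near 0 indicate a weak one, and each column correlates perfectly with itself on the diagonal.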

We have successfully explored a few of the analysis methods; feel free to explore the other analysis methods provided by SageMaker Canvas.

Conclusion

We have successfully imported the data and applied a few data analysis methods. So far, we have completed the first four steps in our workflow. Remember that we have explored only a few of the built-in analysis methods in SageMaker Data Wrangler. Keep this as a base, and feel free to explore other methods as well. In the next part, we will apply the data transformation methods to this sample dataset.
