Streamlining Data Insights: Automating Visualization Reports with SageMaker Studio Interactive Session and Notebook Jobs

Bruno Pistone
6 min read · Feb 7, 2023



As a machine learning developer or data scientist, it can be challenging to navigate the complex landscape of tools and technologies available for building and deploying ML models. That’s where Amazon SageMaker Studio comes in, offering a powerful and streamlined experience for all your ML needs.

One of the key features of SageMaker Studio is its integration with Jupyter notebooks, providing a fully managed and scalable environment for working with popular ML frameworks. Furthermore, last year AWS took things to the next level with the introduction of two new capabilities for SageMaker Studio notebooks: Serverless Interactive Sessions with AWS Glue and Apache Spark & Ray, and Scheduled Notebook Jobs.

With these new tools, data scientists can centralize their work within a single ML platform and make it easier for other business users to visualize reports produced in an automated way. With serverless interactive sessions, you can use Apache Spark and Ray to easily process large datasets without worrying about cluster management. With scheduled notebook jobs, you can execute Jupyter notebooks on a schedule, making it easy to keep your data and reports up-to-date.

In this blog, we’ll demonstrate how you can use these new capabilities in SageMaker Studio to generate forecast datasets and visualize them inside the platform. So, whether you’re a seasoned ML pro or just starting out, Amazon SageMaker Studio has everything you need to streamline your work and bring your ML models to life.

Walkthrough overview

For this blog, we used a dataset publicly available on Kaggle for electricity price forecasting, which contains past values of the electricity price as well as other features related to energy generation and weather conditions.

We’ll take you through the process in three steps:

  1. Generating a time series dataset from raw data using Studio Notebooks and Apache Spark Interactive Sessions
  2. Automating the creation of aggregated time series with a Scheduled Notebook Job
  3. Visualizing insights and results with a dedicated Notebook.

By the end of this blog, you’ll understand how to use SageMaker Studio to automate reporting and unlock valuable insights for data forecasting tasks.

Interactive distributed data processing in Studio Notebooks

In SageMaker Studio, we can easily create notebooks and adopt the most popular machine learning frameworks in a few clicks, by selecting an image from the list of built-in ones provided by SageMaker.

SageMaker Studio — Setup Notebook Environment

By connecting the Studio Notebook to AWS Glue Interactive Sessions, we enable the usage of Apache Spark, an open-source, distributed computing system that can process large amounts of data quickly and efficiently.
We select the SparkAnalytics 2.0 image with the Glue PySpark kernel, and wait while SageMaker Studio provisions the instance backing our notebook.

The second step is to create and connect to an Interactive Session with AWS Glue. We can do this directly in the notebook by running a few magic commands in a new cell:

%session_id_prefix ts-electricity-forecasting-
%glue_version 3.0
%idle_timeout 60
%%configure
{
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://<BUCKET_NAME>/<BUCKET_PREFIX>/logs/"
}

With these magic commands, we provide the configuration for the session being created, such as the idle timeout and the location where we want to store the event logs generated by Apache Spark.
The full list of available commands can be displayed by running the following cell:

%help

This is all we have to do in our notebook: no cluster management, no additional module installation. We have a serverless Apache Spark cluster configured in a single cell.


We can start coding in our notebook and focus on what we want to extract from our data. If you are familiar with Apache Spark, the first thing to do is create the Spark session:

from pyspark.sql import SparkSession

# Create (or reuse) the Spark session provided by the Glue interactive session
spark = SparkSession.builder \
    .config("spark.sql.legacy.timeParserPolicy", "CORRECTED") \
    .getOrCreate()

# Read the raw energy dataset from Amazon S3
df_e = spark.read.csv(
    f"s3://{bucket_name}/{bucket_prefix}/data/input/energy_dataset.csv",
    header=True
)

df_e.show(10)

Spark Dataframe

We can code with ease and without the hassle of managing and interacting with other services. The interactive session can be sized to match our data (for example with the %number_of_workers magic), allowing us to focus solely on our code. This makes SageMaker Studio a powerful tool that streamlines the coding experience.
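As a sketch of the kind of aggregation produced in step 2, the following hypothetical example builds a daily time series of the average electricity price and writes it back to S3. The column names ("time", "price actual") and the output path are assumptions based on the Kaggle dataset, not taken from the original notebook; adjust them to the actual schema:

from pyspark.sql import functions as F

# Hypothetical daily aggregation of the electricity price.
# Assumption: the raw CSV has a "time" timestamp column and a "price actual" column.
df_daily = (
    df_e
    .withColumn("ts", F.to_timestamp("time"))
    .withColumn("price", F.col("price actual").cast("double"))
    .groupBy(F.to_date("ts").alias("date"))
    .agg(F.avg("price").alias("avg_daily_price"))
    .orderBy("date")
)

# Write the aggregated time series back to S3 so the reporting notebook can pick it up
df_daily.write.mode("overwrite").parquet(
    f"s3://{bucket_name}/{bucket_prefix}/data/output/daily_price/"
)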

If we provided an idle timeout when creating the session, it will be closed automatically after that period of inactivity. But we can also terminate it explicitly by executing the stop command:

%stop_session

You can also find the example notebook in this GitHub repository for reference.

Schedule the notebook execution

Let’s imagine the following scenario: I’ve developed my code for generating the aggregated time series forecasting dataset, and every week new data will be available for internal reporting.

We can automatically generate a new version of our dataset every week by scheduling a Notebook Job in a few clicks.
Starting from our Studio notebook, we click Create a notebook job in the toolbar at the top of the notebook.

Create a notebook job

We land automatically in the Notebook Jobs tab, where we can configure the job with several parameters; for example, we can provide the input source location for our data.
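Notebook Jobs can also pass parameters into the notebook. As with papermill-style execution, this typically relies on a cell tagged parameters whose default values are overridden at run time; a minimal sketch of such a cell, where the variable names and values are only placeholders, could be:

# Cell tagged "parameters": defaults that the Notebook Job configuration can override
bucket_name = "<BUCKET_NAME>"        # placeholder, not a real bucket
bucket_prefix = "<BUCKET_PREFIX>"    # placeholder
aggregation_frequency = "daily"      # hypothetical parameter used by the processing code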

But the most important part is the definition of the schedule. SageMaker Studio gives us several options for defining it:

Notebook Job Tab
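If the predefined intervals are not flexible enough, the schedule can typically also be expressed as a cron expression in EventBridge syntax; for example, a hypothetical weekly run every Monday at 09:00 UTC would look like this:

cron(0 9 ? * MON *)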

Once the schedule is defined, we only have to click Create, and Amazon SageMaker creates everything for us.

Scheduled Notebook Jobs Tab

Data Visualization and Analysis

In SageMaker Studio, we can create a dedicated notebook that reports the latest updates on our data using different graphs.

For example, let’s create a report of the actual electricity price at a daily or weekly granularity.

Actual electricity price

Or a histogram of the actual electricity price:

Actual electricity price — Histogram
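As a minimal sketch of how such a report could be produced with pandas and Matplotlib, assuming the scheduled job wrote a daily aggregate to S3 as Parquet with "date" and "avg_daily_price" columns (the path, column names, and the s3fs dependency for reading directly from S3 are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# Assumption: the scheduled job wrote a daily aggregate with "date" and "avg_daily_price" columns
df = pd.read_parquet(f"s3://{bucket_name}/{bucket_prefix}/data/output/daily_price/")  # requires s3fs
df = df.sort_values("date")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

# Line chart of the actual electricity price over time
ax1.plot(df["date"], df["avg_daily_price"])
ax1.set_title("Actual electricity price")
ax1.set_xlabel("Date")
ax1.set_ylabel("Price")

# Histogram of the price distribution
ax2.hist(df["avg_daily_price"], bins=50)
ax2.set_title("Actual electricity price (histogram)")
ax2.set_xlabel("Price")

plt.tight_layout()
plt.show()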

We can also schedule the creation of this report by creating a Notebook Job for this notebook that runs every week, after the data-generation job has completed.

We don’t have to switch between services or tools: we now have a complete set of assets in a single platform. We can export the results or share them with colleagues directly from the SageMaker Studio interface.

You can also find the example notebook in this GitHub repository for reference.

Conclusions

The steps detailed in this blog demonstrate how Amazon SageMaker can reduce the effort of automating the generation of insights and reports for our data, letting us generate value by focusing on coding transformation and manipulation scripts. As a next step, I can use these data to train machine learning models, generate predictions, create new insights, and restart the flywheel, respecting the iterative nature of machine learning.

If you want to test the notebooks, please look at the GitHub repository linked to this blog.

If you want to learn more about Amazon SageMaker, visit AWSome SageMaker on GitHub to find all the relevant and up-to-date resources needed for working with SageMaker.



Bruno Pistone

Senior Gen AI/ML Specialist Solutions Architect at AWS - All opinions are my own