Apache Spark and BigQuery with AWS SageMaker Studio

Ramon Marrero
Published in Analytics Vidhya
11 min read · Apr 20, 2021

Extend the capabilities of SageMaker Studio container images with new libraries.

In this post, you will learn how to extend the SageMaker Studio Spark container image with additional libraries so that it can interact with Google Cloud services such as BigQuery. We will then create a notebook in Amazon SageMaker Studio that retrieves data from a BigQuery table.

SageMaker Studio (Image by author)

Introduction

On December 3, 2019, AWS introduced Amazon SageMaker Studio as the first fully integrated development environment for machine learning. According to AWS, Amazon SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML.

Amazon SageMaker Studio lets you manage your entire ML workflow and provides features that improve the overall ML engineering experience. SageMaker Notebooks let you easily create and share Jupyter notebooks without having to manage infrastructure. SageMaker Experiments helps you organize, track, and compare ML training and model evaluation jobs or data processing jobs run via SageMaker Processing. Amazon SageMaker Debugger lets you analyze complex training issues and receive alerts. SageMaker Autopilot lets you build models automatically with…

Ramon Marrero
Head of Data Engineering | AWS Community Builder | AWS Certified Solutions Architect | Google Cloud Certified Professional