Enabling Data Science Teams of Any Size

Tony Fontana
99P Labs
Apr 6, 2023

As our data science team transitioned to open-source and free tools, we realized that AWS EMR was the last thing we needed to let go of. Although it allowed us to start a Spark cluster quickly, it posed more challenges than benefits. Beyond its high cost, we faced numerous technical hurdles, including difficulty upgrading to and staying on current versions of Spark, the inability to use unsupported applications, and a complex setup process for new users.

Onboarding onto EMR was cumbersome and slow. To access the cluster, users first had to acquire AWS credentials and then use them to generate a key for the SSH server, which was locked down to a single SSH entry point. Configuring compute resources for Spark jobs was challenging, and importing Python packages that AWS did not support proved very difficult. Importing individual datasets into EMR and running Spark commands from the terminal was time-consuming, which precluded quick visualizations. The lack of versioning and Git support further complicated creating and maintaining scripts, leading to unnecessary struggles with collaboration among team members. Onboarding a new user could take up to a week and required significant effort to bring them up to speed.

We resolved to address these issues by determining our team's needs and building a platform we could control ourselves, cutting onboarding to 20 minutes or less. Our data science team primarily used Spark, Trino, and Python/PySpark for analytics, and we opted to run these tools on our Kubernetes cluster. By moving Trino to our platform a year earlier, we had already achieved substantial improvements in speed and cost over EMR.

New Spark Workflow

With the collaboration of our data science team, we created a diagram that illustrated our new workflow. To submit Spark jobs, users host their Python environment and run JupyterLab in a Docker container on their local machine. We provide a preconfigured Docker image that automatically connects to our platform and Spark cluster using Sparkmagic and Lighter/Livy. Users only have to pull and run the image to spawn a JupyterLab app on their machine, open it in their browser, and create a new PySpark notebook that automatically connects to our Spark cluster.
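
As a rough sketch of what the preconfigured image automates (the session name and Lighter/Livy URL below are placeholders, not our real endpoints), attaching a notebook to the cluster through Sparkmagic looks roughly like this:

```
# Notebook cells inside the local JupyterLab container.
# Cell 1: load the Sparkmagic extension and register a remote PySpark session
# against the Lighter/Livy endpoint (URL and session name are hypothetical).
%load_ext sparkmagic.magics
%spark add -s demo -l python -u http://lighter.internal.example:8998
```

```
%%spark -s demo
# Cell 2: anything inside a %%spark cell runs on the remote Spark cluster.
spark.range(10).count()
```

In practice the image ships with this connection already wired up, so users start straight from a PySpark notebook.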

Our new workflow allows data science teams to configure their Spark jobs and compute resources, with options to request memory and CPU from the cluster and customize Spark settings. Sparkmagic also enables users to create visualizations computed on the Spark cluster, which AWS EMR did not support.
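
As an illustration (the resource values and table name below are assumptions for the sketch, not our defaults), a Sparkmagic PySpark notebook lets users size the session before the first Spark statement runs:

```
%%configure -f
{
  "driverMemory": "2g",
  "executorMemory": "4g",
  "executorCores": 2,
  "numExecutors": 4,
  "conf": { "spark.sql.shuffle.partitions": "64" }
}
```

Aggregations run on the cluster, and the small result can be handed back to a local pandas DataFrame with the -o flag:

```
%%sql -o daily_trips
SELECT trip_date, COUNT(*) AS trips
FROM telematics.trips   -- hypothetical table name
GROUP BY trip_date
```

```
%%local
# Runs on the laptop, not the cluster: daily_trips is now a pandas DataFrame.
daily_trips.plot(x="trip_date", y="trips")
```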

The benefits of this new workflow include:

  • Streamlined setup process without the need for configuration or permissions.
  • Simplified scripting through a visual notebook and terminal access.
  • Version control of scripts and easier collaboration through Git integration.
  • Access to a comprehensive catalog of example code from previous projects.
  • Immediate access to datasets via our Spark cluster (see the sketch after this list).
  • Rapid visualization through PySpark, facilitating quick insights.
  • Quick and easy setup that takes only 20 minutes.
  • Unlimited scaling potential for future growth.
  • Compatibility with any Python package.

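To make the dataset point concrete: once a notebook is attached to the cluster, a dataset is one Spark call away. The catalog, table, and column names below are placeholders for illustration, not actual 99P Labs datasets:

```
# Inside a PySpark notebook cell (executes on the Spark cluster).
# "lake.telematics_trips" and "vehicle_type" are hypothetical names.
df = spark.table("lake.telematics_trips")
df.printSchema()
df.groupBy("vehicle_type").count().show()
```
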
Built entirely with open-source tools, our deployment of this stack on our platform is completely customizable and scalable, and it costs only a fraction of what we paid for EMR. We are excited to see how our data science team will utilize this new workflow. For more information, please contact us at support@99plabs.com.
