Setting Up a Python Data Engineering Development Environment on Windows

Paulo de Jesus
5 min read · Apr 18, 2023


Are you starting your journey into the world of data engineering but not sure what you need to get started? Are you uncertain about signing up for free trials with cloud providers, and you wish you could just try things out on your local laptop?

In this blog I will show you how to set up a basic Python-based development environment, so that you can start playing around with some of the popular technologies, all on your local Windows machine. The development environment will include Visual Studio Code, Java, Python, Spark, and Delta Lake.

Here are the steps that we will be following:

  • Installing Visual Studio Code (Latest stable version)
  • Installing Java 11 (version 11.0.18)
  • Installing Python 3 (version 3.10.11)
  • Copying Winutils files (version 3.2.2)
  • Installing PySpark (version 3.3.2)
  • Installing Delta-Spark (version 2.3.0)

NOTE: The trickiest part of setting up the environment is making sure that all of the versions are compatible with one another; at the time of writing, the combination above worked together. The other tricky part is making sure that the additional files are copied and the environment variables are set up correctly.
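Once you have worked through the steps below, you can double-check the installed versions from a command prompt (assuming everything is on your PATH):

java -version
python --version
pip show pyspark delta-spark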

Installing Visual Studio Code

Visual Studio Code is a powerful development tool, with support for almost every programming language under the sun. You can download the installation package of the latest version here for free, and then follow the installation instructions. There shouldn’t be any compatibility issues with regard to Visual Studio Code itself.

On the download page, select the Windows option.

Installing Java

You can download the installation package for Java from the official Oracle website here.

NOTE: Take note of the installation path of Java as you will need it soon to set up environment variables.

On the Oracle download page, select the Windows installer.

Next, the Java environment variables need to be set up. To do this, go to System Properties and select the Environment Variables button:

Under System variables click the New button and add a new variable called JAVA_HOME with a value of the path where Java was installed:

Example of what the path could look like
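With a default JDK 11 installation, for example, the value would look something like this (the exact folder name depends on the update you installed, so copy the path from your own machine):

C:\Program Files\Java\jdk-11.0.18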

Once that system variable has been created, find a system variable called “Path” in the list and click the Edit button. Then add a new line with the following value:
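This new line typically points at the “bin” subfolder of the Java installation, which you can reference through the variable you just created:

%JAVA_HOME%\bin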

NOTE: Also check whether Oracle created another folder in Program Files (or wherever it was installed). If it did, add that path to the “Path” system variable as well:
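On a default Oracle installation this extra folder is usually the shared javapath directory; if it exists on your machine, the entry would look something like this (verify the exact folder under your own Program Files):

C:\Program Files\Common Files\Oracle\Java\javapath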

Installing Python

You can download the installation package for Python here.

NOTE: Take note of the installation path of Python as you will need it soon to set up environment variables.

On the download page, select the Windows installer.

As with the Java installation, you need to add two lines to the “Path” system variable, both based on the installation path of Python:
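For example, with a default per-user installation of Python 3.10 the two lines would look something like this (<username> is a placeholder for your own Windows user name; adjust both lines to match your actual install path):

C:\Users\<username>\AppData\Local\Programs\Python\Python310
C:\Users\<username>\AppData\Local\Programs\Python\Python310\Scripts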

Copying Winutils files

These files are required by the Spark configuration and need to be manually copied into a folder and then referenced in system variables.

First create a folder on your local drive. I created one called “C:\Program Files\Hadoop\bin”. The exact path isn’t important, but you need to download two files from here (typically winutils.exe and hadoop.dll) and copy them into the “bin” subfolder. The files in the link have been compiled as version 3.2.2:

Once the files are in the folder, go back to environment variables and create a new system variable called HADOOP_HOME, with its value set to the parent of the “bin” folder you created:

Example of what the path could look like
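Using the example folder created earlier, the value would be:

C:\Program Files\Hadoop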

Then add the following line to the “Path” system variable:
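As with Java, this is typically a reference to the “bin” subfolder through the variable you just created:

%HADOOP_HOME%\bin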

Installing PySpark (Python Spark library)

Open a command prompt and run the following:

pip install pyspark==3.3.2

Go to environment variables again, and under System variables click the New button and add a new variable called PYSPARK_PYTHON with a value of the full path to the Python executable file:

Full path to python.exe
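For example, with the default per-user Python installation used earlier (<username> is again a placeholder):

C:\Users\<username>\AppData\Local\Programs\Python\Python310\python.exe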

Next, add another system variable called PYSPARK_SUBMIT_ARGS with the following value:
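A typical value, assuming you want the Delta Lake package from the next section loaded automatically (note that PYSPARK_SUBMIT_ARGS must end with the pyspark-shell token):

--packages io.delta:delta-core_2.12:2.3.0 pyspark-shell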

Installing Delta-Spark (Python Delta Lake library)

Open a command prompt and run the following:

pip install delta-spark==2.3.0

Using Delta-Spark

Delta-Spark requires a bit of additional configuration when running PySpark. Here are two ways of configuring it:

In Visual Studio Code:

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Build a session with the Delta Lake SQL extensions and catalog enabled
builder = SparkSession.builder.appName("MyDeltaLakeApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# configure_spark_with_delta_pip adds the delta-spark JARs to the session
spark = configure_spark_with_delta_pip(builder).getOrCreate()
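To check that the session really can read and write Delta tables, here is a minimal smoke test (the C:/tmp/delta-table path is just an example; any writable folder will do):

# Write a small DataFrame as a Delta table, then read it back
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("C:/tmp/delta-table")
spark.read.format("delta").load("C:/tmp/delta-table").show()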

From the command line:

pyspark --packages io.delta:delta-core_2.12:2.3.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Summary

That’s it! If you’ve installed everything and set up the environment variables correctly, you should now have a working Python development environment with Spark and Delta Lake capabilities on your local Windows machine.
