Setting Up a Python Data Engineering Development Environment on Windows
Are you starting your journey into the world of data engineering but not sure what you need to get started? Are you hesitant about signing up for free trials with cloud providers, wishing you could just try things out on your local laptop?
In this blog post I will show you how to set up a basic Python-based development environment, so that you can start playing around with some of the popular technologies, all on your local Windows machine. The development environment will include Visual Studio Code, Java, Python, Spark, and Delta Lake.
Here are the steps that we will be following:
- Installing Visual Studio Code (Latest stable version)
- Installing Java 11 (version 11.0.18)
- Installing Python 3 (version 3.10.11)
- Copying Winutils files (version 3.2.2)
- Installing PySpark (version 3.3.2)
- Installing Delta-Spark (version 2.3.0)
NOTE: The trickiest part of setting up the environment is making sure that all of the versions are compatible with one another. At the time of writing, the versions listed above were all mutually compatible. The other tricky part is making sure that the additional files are copied and the environment variables are set up correctly.
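Since version compatibility is the main pitfall, it can help to pin the combination in one place. Here is a small sketch (the version numbers are the ones from this post; the helper function is my own, not part of any library):

```python
# Pinned, mutually compatible versions used in this post (at time of writing).
PINNED_VERSIONS = {
    "java": "11.0.18",
    "python": "3.10.11",
    "winutils": "3.2.2",
    "pyspark": "3.3.2",
    "delta-spark": "2.3.0",
}

def matches_pinned(pinned: str, actual: tuple) -> bool:
    """True if `actual` (e.g. sys.version_info) matches the pinned
    major.minor version; patch-level differences are usually harmless."""
    major, minor = (int(part) for part in pinned.split(".")[:2])
    return tuple(actual[:2]) == (major, minor)
```

For example, `matches_pinned(PINNED_VERSIONS["python"], sys.version_info)` tells you whether your interpreter is on the 3.10 line.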
Installing Visual Studio Code
Visual Studio Code is a powerful development tool, with support for almost every programming language under the sun. You can download the installation package of the latest version here for free, and then follow the installation instructions. There shouldn’t be any compatibility issues with regards to Visual Studio Code.
Installing Java
You can download the installation package for Java from the official Oracle website here.
NOTE: Take note of the installation path of Java as you will need it soon to set up environment variables.
Next, the Java environment variables need to be set up. To do this go to System Properties and select the Environment Variables button:
Under System variables click the New button and add a new variable called JAVA_HOME with a value of the path where Java was installed:
Once that system variable has been created, find a system variable called “Path” in the list and click the Edit button. Then add a new line with the following value:
NOTE: Also check to see if Oracle created another folder in Program Files (or wherever it was installed). If it did then add that path to the “Path” system variable as well:
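To sanity-check the Java variables before moving on, a small helper like the one below can flag what is missing. This is a sketch under the assumptions of this post (JAVA_HOME set, and %JAVA_HOME%\bin on the "Path" variable); the function name is mine, not part of any library:

```python
def missing_java_env(environ: dict) -> list:
    """Return a list of problems with the Java-related variables.

    Expects JAVA_HOME to be set and %JAVA_HOME%\\bin to appear on Path
    (Windows separates Path entries with ';').
    """
    problems = []
    java_home = environ.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems
    expected_bin = java_home.rstrip("\\") + "\\bin"
    entries = environ.get("Path", "").split(";")
    if not any(e.rstrip("\\").lower() == expected_bin.lower() for e in entries):
        problems.append(expected_bin + " is not on Path")
    return problems
```

Passing `dict(os.environ)` from a fresh command prompt should return an empty list once everything is set up.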
Installing Python
You can download the installation package for Python here.
NOTE: Take note of the installation path of Python as you will need it soon to set up environment variables.
As with the Java installation, you need to add two lines to the “Path” system variable. This is based on the installation path of Python:
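The two entries are the Python install folder itself (where python.exe lives) and its Scripts subfolder (where pip and other entry points live). A quick sketch of how they derive from the install path (the example path is an assumption; use your own):

```python
from pathlib import PureWindowsPath

def python_path_entries(install_dir: str) -> list:
    """Return the two lines to add to the "Path" system variable:
    the install folder (python.exe) and its Scripts subfolder (pip.exe)."""
    base = PureWindowsPath(install_dir)
    return [str(base), str(base / "Scripts")]
```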
Copying Winutils files
These files are required by the Spark configuration and need to be manually copied into a folder and then referenced in system variables.
First create a folder on your local drive. I created one called “C:\Program Files\Hadoop\bin”. The name of the path isn’t really important, but you need to download two files from here and copy them into the “bin” subfolder. The files in the link were compiled as version 3.2.2:
Once the files are in the folder, go back to environment variables and create a new system variable called HADOOP_HOME whose value is the parent of the “bin” folder you created (in my case, “C:\Program Files\Hadoop”):
Then add the following line to the “Path” system variable:
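A missing or misplaced winutils.exe tends to produce confusing Spark errors much later, so it is worth verifying the file landed where HADOOP_HOME expects it. A small sketch (the helper name is mine):

```python
from pathlib import Path

def winutils_present(hadoop_home: str) -> bool:
    """True if winutils.exe exists under <hadoop_home>\\bin."""
    return (Path(hadoop_home) / "bin" / "winutils.exe").is_file()
```

Calling `winutils_present(os.environ["HADOOP_HOME"])` should print True before you proceed.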
Installing PySpark (Python Spark library)
Open a command prompt and run the following:
pip install pyspark==3.3.2
Go to environment variables again, and under System variables click the New button and add a new variable called PYSPARK_PYTHON with a value of the full path to the Python executable file:
Next add another system variable called PYSPARK_SUBMIT_ARGS with the following value:
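At this point several system variables should exist. A small checklist function can confirm nothing was skipped; the variable names below are the ones this post sets up, and the helper itself is my own sketch:

```python
# System variables this post creates, in the order they appear.
REQUIRED_VARS = ["JAVA_HOME", "HADOOP_HOME", "PYSPARK_PYTHON", "PYSPARK_SUBMIT_ARGS"]

def missing_vars(environ: dict) -> list:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Run `missing_vars(dict(os.environ))` in a fresh command prompt; an empty list means you are ready to continue.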
Installing Delta-Spark (Python Delta Lake library)
Open a command prompt and run the following:
pip install delta-spark==2.3.0
Using Delta-Spark
Delta-Spark requires a bit of additional configuration when running pyspark. Here are two ways of configuring it:
In Visual Studio Code:
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("MyDeltaLakeApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
From the command line:
pyspark --packages io.delta:delta-core_2.12:2.3.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Summary
That’s it! If you’ve installed everything and set up the environment variables correctly, you should now have a working Python development environment with Spark and Delta Lake capabilities on your local Windows machine.