Starting out as a Data Engineer (and how to apply() a regex replace on an entire DataFrame in Spark)
If you want to go straight to the really simple solution, head over to the Performing a Regular Expression (Regex) replace operation on a DataFrame section, near the bottom of the post.
You’re probably wondering how I got here…
Recently, my boss asked me to do something really simple.
“Hey, JL! I want you to take a look at the code our consultants are trying to fix. You’re really only supposed to remove the double-quotes on our data but they say it’d take a few weeks.”
And so begins my rabbit-hole of trying to look for a way to apply a regex replace function on an entire DataFrame.
Some background about me and my job
So, how can I start with Data Engineering?
Anyway, if you are starting out with Data Engineering and Big Data stuff, just read up and practice the different things you can do with Spark (PySpark, Pandas-On-Spark stuff) and, from what I can tell, you’d have a great start already. Databricks platform has a community edition too so you can use that. This is just a stepping off point though. Maybe, if I have more time in the future, I can get more in-depth with what resources to look at.
So, if you start searching on Google right how “How to perform regex operation on DataFrames”, you’d actually see a lot of answers and they seem like really easy solutions. But, I may have “over-read” the search results and started seeing reasons why you shouldn’t use stuff like apply(), applymap(), transform() because of performance issues. I guess being anxious about it was reasonable, it was going to be my first code snippet implemented on our Production Pipeline and my boss placed a big emphasis on being faster than what the consultants can do. So, I started trying to implement a lot of really complex code based on stuff I saw from StackOverflow and doing a lot of crazy stuff over the course of 2 days. In the end, I went back to just using apply() (apply() seems to be a lot faster than applymap(), so just take note of that). Why did I go back to using apply()? Well, my solution was overengineered-as-heck, and it wasn’t even fast. Plus, I was having trouble using apply() (and applymap() and transform() for that matter). I could see that apply() wouldn’t give me the appropriate output and it was because I was using the replace() method on a Series (the solution for this is in the next section).
Performing a Regular Expression (Regex) replace operation on a DataFrame
Credits to Tambayan404 discord community and my best-friend’s kuya (Filipino word for ‘brother’) for helping my figure out that I needed to pass the str property and helping me understand the replace method better. (Thanks kuya Carding)
Take note that I ingested my data and it was initially a Spark Dataframe (df), so I also converted into a pandas-on-spark DataFrame (psdf) in order for the .apply() method to work.
#importsimport pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession# Creating a spark dataframe from the CSV file# File location and type# Note, replace the  with actual stuff
file_location = "[fill this with your file location]"
file_type = "CSV"# CSV options
infer_schema = "[true or false]"
first_row_is_header = "[true or false]"
delimiter = ","# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)# Creating a pandas-on-spark dataframe from spark dataframe
psdf = ps.DataFrame(df)
display(psdf)# Perform an apply on the pandas-on-spark dataframe# Instead of simply passing x to the lambda, you must access the .str property of the Pandas Series that is taken as a parameter by the lambda function
# For beginning and end of string only
psdf2 = psdf.apply(lambda x: x.str.strip(‘\”’))
# For whole string
psdf3 = psdf.apply(lambda x: x.str.replace(r’\”’, ‘’, regex=True))display(psdf2)
Running this code and displaying psdf2 and psdf3, you should see that psdf2 no longer has the double-quotes at the beginning and end of the strings in the entire DataFrame. While in psdf3, you should see that all double-quotes are eliminated entirely. Of course, you can replace the lambda function with whatever you would like to apply the transformation on the DataFrame (and save it to a new one!).