3 Steps to Clean and Ingest Flat File Dates
We may have AI and holograms now, but the business world still lives on spreadsheets. So, after we ingest terabytes of data in record time, we are told our new report will be meaningless without that one spreadsheet sitting out on the file system.
Here we will review three steps to take a flat file, standardize its date columns, and remove invalid dates.
- Import libraries.
- Set date column list.
- Call function.
Import Libraries
This function uses parse from dateutil. The other libraries are standard.
from datetime import timedelta, datetime
from pyspark.sql.functions import *
from pyspark.sql.types import *
from dateutil.parser import parse
Set a List of Columns to Be Processed
For this demo, we will hard-code the list. In practice, you will build the list by inspecting column names or a metadata table.
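If the list comes from column names, a minimal sketch looks like the following (assuming date columns share a `_date` suffix; the column names here are made up — with Spark you would start from `df.columns`):

```python
# Hypothetical column names; in Spark you would use df.columns instead.
all_columns = ['customer_id', 'order_date', 'ship_date', 'amount']

# Keep only the columns that follow the date-naming convention.
col_to_clean = [c for c in all_columns if c.endswith('_date')]
print(col_to_clean)  # ['order_date', 'ship_date']
```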
Function
The function accepts the dataframe and a list of columns to process. Because of Python syntax restrictions, the steps are broken into multiple statements and a helper function. Luckily, Spark's lazy evaluation will optimize them at runtime.
The function:
def string_to_date(df_date):
    # Default invalid dates to 1900-01-01
    try:
        cleaned_date = parse(df_date)
    except (ValueError, TypeError):  # TypeError covers NULL (None) input
        cleaned_date = datetime(1900, 1, 1)
    return cleaned_date

udf_string_to_date = udf(lambda i: string_to_date(i), returnType=DateType())

def clean_flat_file_dates(cdf, date_col=[]):
    # Process date columns:
    # columns will be converted to date, invalid dates will return NULL
    for c in date_col:
        cdf = cdf.withColumn(c, udf_string_to_date(cdf[c].cast(StringType())))
        cdf = cdf.withColumn(c, cdf[c].cast(DateType()))
        cdf = cdf.withColumn(c,
                             when(cdf[c] <= '1900-01-01',
                                  to_date(lit(None), "yyyy-MM-dd"))
                             .otherwise(cdf[c]))
    return cdf
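The parse-and-fallback core can be exercised outside Spark. A quick sketch of what the UDF does per value (requires only `dateutil`):

```python
from datetime import datetime
from dateutil.parser import parse

def string_to_date(df_date):
    # Default invalid (unparseable) dates to 1900-01-01
    try:
        return parse(df_date)
    except (ValueError, TypeError):  # TypeError covers None input
        return datetime(1900, 1, 1)

print(string_to_date('March 8, 2021'))   # 2021-03-08 00:00:00
print(string_to_date('03/08/2021'))      # 2021-03-08 00:00:00
print(string_to_date('March 32, 2021'))  # 1900-01-01 00:00:00
```

Note that `dateutil` reads `03/08/2021` as month-first by default; pass `dayfirst=True` to `parse` if your files use day-first dates.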
Test Case
We create a dataframe with multiple date formats: one row with valid dates and one row with invalid dates. As you saw in the function, we trap parse errors and default them, and the default is then converted to null. The year 1900 is used as the default earliest date because it meets most use cases. If you are dealing with dates before that (e.g., old buildings), you can push the cutoff back as far as 1582-10-15, since Spark 3.x uses the Proleptic Gregorian calendar.
Code to build the sample dataframe and run the function
query = """select '2021-03-08' as standard_date,
'2021-03-08' as hyphen_date,
'03/08/2021' as slash_date,
cast(20210308 as int) as int_date,
'March 8, 2021' as text_date
union all
select '2021-03-32' as standard_date,
'2021-03-32' as hyphen_date,
'03/32/2021' as slash_date,
cast(20210332 as int) as int_date,
'March 32, 2021' as text_date"""
df = spark.sql(query)
print("Source file")
display(df)
print(df.dtypes)
col_to_clean = ['standard_date', 'hyphen_date', 'slash_date', 'int_date', 'text_date']
df = clean_flat_file_dates(df, col_to_clean)
print(" ")
print("Cleaned file")
display(df)
print(df.dtypes)
Output
You can see the different formats are converted to a standard format, the column types are now date, and the invalid dates are null.
Conclusion
It does not look like we are going to get away from flat files anytime soon. Adding this routine to your toolbox will help reduce data prep time. If you find it useful or have recommendations to expand it, let us know in the comments.
Originally published at https://robertjblackburn.com on March 8, 2021.