3 Steps to Clean and Ingest Flat File Dates

Bob Blackburn
Hitachi Solutions Braintrust
3 min read · Mar 8, 2021

We may have AI and holograms now, but the business world still lives on spreadsheets. So, after we ingest terabytes of data in record time, we will be told our new report is meaningless without one more spreadsheet sitting out on the file system.

Here we will review three steps to take a flat file, standardize its date columns, and remove invalid dates.

  1. Import libraries.
  2. Set date column list.
  3. Call function.

Import Libraries

This function uses parse from dateutil. The other libraries are standard.

from datetime import datetime
from pyspark.sql.functions import udf, when, lit
from pyspark.sql.types import DateType, StringType
from dateutil.parser import parse

Set Date Column List

For this demo, we will hard-code the list (you will see it in the Test Case section). In practice, you would set the list by inspecting column names or reading a metadata table, as in the sketch below.
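A minimal sketch of both approaches, assuming date columns follow a _date naming convention and assuming a hypothetical metadata table named etl_metadata (both names are illustrative, not from the original post):

# Hypothetical convention: date columns end in "_date"
col_to_clean = [c for c in df.columns if c.lower().endswith('_date')]

# Or pull the list from a metadata table (hypothetical name and schema):
# meta = spark.table('etl_metadata')
# col_to_clean = [r.column_name for r in meta.where("is_date = 1").collect()]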

Function

The function accepts the dataframe and a list of columns to process. Because of PySpark syntax restrictions, the steps are broken out into multiple statements and a helper function. Luckily, Spark’s lazy evaluation will optimize them at runtime.

The function:

def string_to_date(df_date):
    # Parse a string into a datetime; invalid values default to the
    # 1900-01-01 sentinel, which is nulled out below
    try:
        cleaned_date = parse(df_date)
    except (ValueError, TypeError, OverflowError):
        cleaned_date = datetime(1900, 1, 1)
    return cleaned_date

udf_string_to_date = udf(string_to_date, returnType=DateType())

def clean_flat_file_dates(cdf, date_col=[]):
    # Process date columns:
    #   - each column is parsed through the UDF and cast to DateType
    #   - invalid dates return NULL
    for c in date_col:
        cdf = cdf.withColumn(c, udf_string_to_date(cdf[c].cast(StringType())))
        cdf = cdf.withColumn(c, cdf[c].cast(DateType()))
        # Null out the sentinel (and anything at or before 1900-01-01)
        cdf = cdf.withColumn(c,
                             when(cdf[c] <= '1900-01-01', lit(None).cast(DateType()))
                             .otherwise(cdf[c]))
    return cdf

Test Case

We create a dataframe with multiple date formats: one row with valid dates and one row with invalid dates. As you saw in the function, parse errors are trapped and replaced with the 1900-01-01 sentinel, which is then nulled out. The year 1900 is used as the default earliest date because it meets most use cases. If you are dealing with earlier dates (e.g., old buildings), you can move the sentinel back as far as 1582-10-15, since Spark 3.x uses the Proleptic Gregorian calendar.
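A minimal sketch of that adjustment, assuming you change the sentinel in both places and nothing else:

# Variant for pre-1900 data: use Spark 3.x's earliest proleptic
# Gregorian date as the sentinel instead of 1900-01-01
EARLIEST = datetime(1582, 10, 15)

def string_to_date(df_date):
    try:
        return parse(df_date)
    except (ValueError, TypeError, OverflowError):
        return EARLIEST  # sentinel; nulled out in clean_flat_file_dates

# ...and in clean_flat_file_dates, compare against the same sentinel:
# when(cdf[c] <= '1582-10-15', lit(None).cast(DateType())).otherwise(cdf[c])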

Code to build the sample dataframe and run the function

query = """select   '2021-03-08' as standard_date, 
'2021-03-08' as hyphen_date,
'03/08/2021' as slash_date,
cast(20210308 as int) as int_date,
'March 8, 2021' as text_date
union all
select '2021-03-32' as standard_date,
'2021-03-32' as hyphen_date,
'03/32/2021' as slash_date,
cast(20210332 as int) as int_date,
'March 32, 2021' as text_date"""

df = spark.sql(query)
print("Source file")
display(df)
print(df.dtypes)

col_to_clean = ['standard_date', 'hyphen_date', 'slash_date', 'int_date', 'text_date']
df = clean_flat_file_dates(df, col_to_clean)

print(" ")
print("Cleaned file")
display(df)
print(df.dtypes)

Output

You can see the different formats are converted to a standard format, the column types are now date, and the invalid dates are null.
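For reference, the cleaned output should look roughly like this (df.show() style; display() on Databricks renders an interactive grid instead):

+-------------+-----------+----------+----------+----------+
|standard_date|hyphen_date|slash_date|  int_date| text_date|
+-------------+-----------+----------+----------+----------+
|   2021-03-08| 2021-03-08|2021-03-08|2021-03-08|2021-03-08|
|         null|       null|      null|      null|      null|
+-------------+-----------+----------+----------+----------+

with every dtype reported as date.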

Conclusion

It does not look like we are going to get away from flat files anytime soon. Adding this routine to your toolbox will help reduce data prep time. If you find it useful or have recommendations to expand it, let us know in the comments.

Originally published at https://robertjblackburn.com on March 8, 2021.
