It’s a date: Python ❤ Excel

How to deal with those pesky end users

Carl Follows
Version 1
8 min readSep 4, 2023

--

Shredding data always bring challanges, but when the data is poor quality dates stored in Excel this bring an extra dimension of complexity. Let’s take a look at how Python can improve data quality for you and your organisation.

Everyone loves Excel

Excel is a great tool for playing with data and through this strength it has become ubiquitous in businesses for analyses and reporting.

Data Engineers however hate Excel, because of this flexibility.

  • It allows mixing of different date types: like “to be confirmed” in a column that otherwise contains dates.
  • It conflates values with presentation: try writing 1E2 in a cell, it is interpreted as 1e+2 = 100.
  • It assumes too much: numbers can’t have a leading 0, not even phone numbers.

But when you combine these problems with the usual joys of dates, then you get something special. End users love the flexibility of Excel whilst senior managers want data integrated, standardised and centrally available for visualisation by lots of users. This is the data engineers challange.

Calendar showing January with multiple languages
Photo by Waldemar on Unsplash

Storing Dates

The standard way to present dates differs in almost every region, but the English DD-MM-YYYY v.s. American MM-DD-YYY is one of the biggest headaches. For instance 02-04-2024, is this February or April?

To handle this problem Excel (like most programs) stores dates in a internal structure separate to presentation. Specifically Excel stores them as decimal numbers with the whole element representing days since 0 Jan 1900 and the fraction holding hours; so 6am on 1st Jan 1900 is 1.25. You can easily see this when you change the format of a date to Number, a common mistake in large spreadsheets containing lots of information maintained by different people copying data around.

But what happens if the date is before 1900, well this is out of range for the Excel data type, so it stores the textual representation of a date instead.

Engineers love Python

Since around 2015 Python has been one of the go-to languages for data engineering and vendors have been busy integrating it into many of their offerings. Whilst big data platforms, like Microsoft Synapse Workspace, favour PySpark (the Python API for Apache Spark) for scalability. Some of the older libraries, like pandas, have a wealth of functionality that should not be underestimated, particularly when dealing with smaller scale challange of shredding Excel.

For example: loading a sheet from an Excel file into a dataframe, storing each column of data in an object datatype within the dataframe.

import pandas as pd
df = pd.read_excel('abfss://filesystemname@accountname.dfs.core.windows.net/file.xlsx', sheet_name='Sheet1')

Note: Whilst my examples are written on Microsoft Synapse Workspace, most should be easily transferable to other runtimes.

All too easy eh?

Well yes if your dates are all stored correctly, but what about the following spreadsheet with the reigns of recent(ish) English Monarchs containing a few mistakes. Straight away you can see dates before 1900 are formatted differently, hinting that they are stored differently (text instead of date). But there are a variety of other problems hidden in this data.

Excel sheet showing recent English Monarchs and their reigns
Excel sheet showing recent English Monarchs and their reigns

Let’s Shred

Let’s start by excluding standard non-sensical text values by listing them in the na_values parameter of the read_excel function, helping to translate them to a Not Applicable value. Depending on the content of your Spreadsheet this could mean data loss and you may need to import these text value into a second column, but I’m going to assume not for now.

na_values = 'Ongoing'

How about if we then just use the simple (but powerful) pandas to_datetime function to create a new column storing the converted value.

df['ReignedFrom_FIX'] = pd.to_datetime(df_raw['ReignedFrom'], errors='coerce')
dataframe after using the pandas to_datetime() function
dataframe after using the pandas to_datetime() function

This shows that many of the original fields are correctly parsed. However, there are some problems with the earliest dates, the numeric end of George VI and the typo in the start of Charles III which have all resulted in errors. These errors are stored as (N)ot-(A)-(T)ime values.

This demonstrates that the Python timestamp datatype clearly uses a different internal date structure to Excel: specifically a count of nanoseconds stored using a 64 bit integer, with a valid range of 1677–09–21 00:12:43 to 2262–04–11 23:47:16. So the early dates and the typo for Charles III result in dates out of this valid range hence the NaT error code.

This is not a fault in Python, any language has to balance precision and the range of possible values against size of storage on disk. Whilst it’s reasonable to assume the years 1677 to 2262 will cover most scenarios, typos like mine can result in unusual values that we may still want to retain without losing the value entered.

Divide and Conquer

Now we realise that the values in Excel are stored differently (numeric, text and date) we need to created an approach that will deal with each of these in turn. This means splitting the dataframe and applying different operations to each component. We will also need to decide how to store our data in a way that can handle the large range of values. Since we are only interested in the date part we could use the IS0 8601 standard of YYYY-MM-DD, which can also be written YYYYMMDD and as such conveniently stored as an integer that can hold a very wide range of dates. Indeed if we can’t convert the supplied value into this format then we must consider if it is a date at all and do we need to just retain the original textual value?

Before we get started let’s convert the storage datatype from object into string to allow us to use regular expressions and import the datetime library that will help with date parsing.

import datetime as dt
df['ReignedTo'] = df['ReignedTo'].astype(str)

Numeric

The numeric data can be identified by the Regular Expression (\d*\.)?\d+$ which we can use with the pandas str.match function and the dataframe.loc property to create a new dataframe containing only the numeric data.

Since we only need the date component (time is not important in this scenario) we can convert the number to an integer and apply a transformation to add this number of days to Excel’s beginning of time.

Now we have a date where the year, month and day aspects can be extracted and combined into the ISO structure, with the complete code as follows.

df_rt_int = df.loc[df['ReignedTo'].str.match(r'(\d*\.)?\d+$', na=False)]
df_rt_int['ReignedTo'] = df_rt_int['ReignedTo'].astype(int)
df_rt_int['ReignedTo_FIX'] = df_rt_int['ReignedTo'].apply(lambda x:dt.date(1900,1,1)+dt.timedelta(days=x-2))
df_rt_int['ReignedTo_ISO'] = df_rt_int['ReignedTo_FIX'].apply(lambda x:x.year*10000 + x.month*100 + x.day)

Why do I deduct 2 from the integer? I can’t use a date of the 0th Jan so using the 1st Jan requires a deduction of 1. The 2nd deduction is due a historic bug/feature in Excel considering 1900 as a leap year.

Running the above code results in a dataframe as follows:

dataframe of corrected numeric data
dataframe of corrected numeric data

Notice that the index numbering of this new dataframe (18) remains the same as the original which will help us immensely when recombining.

Text

For the text values I’ve assumed the consistent format of the 3 character month name, so they can be identified by the Regular Expression \d{1,2}\s\w{3}\s\d{2,4} filtering as before to produce a dataframe containing only those records.

The pandas str.extract function can now be used to extract the 3 component of the string into a new dataframe containing those 3 fields, then the to_datetime function allows us to convert the month name into a numeric value. Now we have the year, numeric month and day these can be combined into the ISO structure similar to previously.

df_rt_txt = df.loc[df['ReignedTo'].str.match(r'\d{1,2}\s\w{3}\s\d{2,4}', na=False)]
df_rt_txt = df_rt_txt['ReignedTo'].str.extract(r'(?P<day>\d{1,2})\s(?P<month_name>\w{3})\s(?P<year>\d{4})')
df_rt_txt['month_num'] = df_rt_txt['month_name'].apply(lambda x:pd.to_datetime(x, format='%b').month)
df_rt_txt['ReignedTo_ISO'] = df_rt_txt.apply(lambda row:(int(row['year'])*10000 + int(row['month_num'])*100 + int(row['day'])), axis=1)
dataframe of corrected textual data
dataframe of corrected textual data

The extract function produced a dataframe containing only the captured date elements thereby loosing our original fields, however (like with numeric) the index has been retained, so we don’t need to worry about this for now.

Date

The date values can be treated in a very similar fashion to the text values.

It may seem counter-intuitive to use Regular Expressions to shred valid dates. Remember though that in the original data the typo in the start of Charles III of 2822 is a valid date for the Excel data type, but not valid in the Python data type. So without using this Regular Expression approach we may lose such future dates.

df_rt_ymd = df.loc[df['ReignedTo'].str.match(r'\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}', na=False)]
df_rt_ymd = df_rt_ymd['ReignedTo'].str.extract(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\s+\d{2}:\d{2}:\d{2}')
df_rt_ymd['ReignedTo_ISO'] = df_rt_ymd.apply(lambda row:(int(row['year'])*10000 + int(row['month'])*100 + int(row['day'])), axis=1)
dataframe of corrected date data
dataframe of corrected date data

Bring it back together

Above I’ve demonstrated the code for a number of scenarios in the ReignedTo field (creating dataframes df_rt_int, df_rt_txt & df_rt_ymd). You may have other formats and these will likely vary depending on the regional settings (locale) where your python is running.

In my sample I would need to repeat the exercise for the ReignedFrom field creating additional dataframes df_rf_int, df_rf_txt & df_rf_ymd.

The dataframes can now be concatenated back together, and since they retained their original indexes during the divide and conquer, it’s trivial to join the results back to the original dataframe.

df_rf = pd.concat([df_rf_int[['ReignedFrom_ISO']], df_rf_txt[['ReignedFrom_ISO']], df_rf_ymd[['ReignedFrom_ISO']]])
df_rt = pd.concat([df_rt_int[['ReignedTo_ISO']], df_rt_txt[['ReignedTo_ISO']], df_rt_ymd[['ReignedTo_ISO']]])

df = df.join(df_rf[['ReignedFrom_ISO']])
df = df.join(df_rt[['ReignedTo_ISO']])
dataframe of corrected data
dataframe of corrected data

Quality Check

Those keen of eye may have spotted an uncorrected ReignedTo date for William III since this had the month name in long format which would require a slightly different Regular Expression to resolve.

With any complex data cleansing like this there will often be new types of errors than need checking and coding for. So I would recommend you implement an explicitly check to highlight problems, for example:

df.loc[df['ReignedTo_ISO'].isna()]

And finally why is ReignedTo_ISO showing a decimal place whilst ReignedFrom_ISO is an integer? That’s thanks to the NaN value being a float which can’t exist in a integer datatype, forcing the datatype of the rest of the column to change. Whether this matters, as always, depends on what you’re going to do with your data next.

Power of Python

Above I have shown how a modern data platform, like Microsoft Synapse Workspace, allows Data Engineers to break complex problems into solvable parts. Taking advantage of rich Python libraries to pull conflated data into strongly typed datasets without losing original meaning.

Whilst users will continue to abuse the flexibility of Excel, departmental data can now be centralised, to enable a more data driven organisation.

About the Author:
Carl Follows is a Data Analytics Solution Architect at Version 1.

--

--

Carl Follows
Version 1

Data Analytics Solutions Architect @ Version 1 | Practical Data Modeller | Builds Data Platforms on Azure