How to extract all Date variants from real-world data using python.

Joachim Kuleafenu
Analytics Vidhya
Published in
4 min readJun 28, 2021


Your practical definitive guide to mining all Date variants from both structured and unstructured data using data mining techniques in python.

Photo by insuring Yoon on Unsplash

Whether you are a Data Scientist or an Analysts, you may one day encounter a problem that has to do with date and time.

Lifes becomes easier when the date is well cleaned and structured in a single column of a data frame like 06/18/2021, 09/27/2000. Unfortunately, it doesn’t always happen that way.

How do you extract dates when you get something like 7/20 or even in a sentence like ‘He came to the Hospital on 2/21 at dawn’

Before we get overwhelmed, let get started!!

Project Goal
Our goal is to correctly identify all of the different date variants encoded in a dataset and to properly normalize and sort them.

Things to learn
1. Perform basic operations on medical data using Pandas.
2. Extract all date variants using python regular expressions ( Regex).
3. Sort the dates in ascending chronological order according to some outlined rules.
4. Summary

Below are some of the date variants we may have to deal with:

  • 04/20/2009; 04/20/09; 4/20/09; 4/3/09
  • Mar-20–2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
  • 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
  • Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
  • Feb 2009; Sep 2009; Oct 2010
  • 6/2008; 12/2009
  • 2009; 2010

Let’s learn some useful regular expressions to prepare us for the journey

[]: matches one of the sets of characters within []

[a-z]: matches one of the range characters a,b, …,z

a|b: matches either a or b, where a and b are strings

() : Scoping for operators

\: Escape character for special characters (\t, \n, \b)

\b: Matches word boundary

\d: Any digit, equivalent to [0–9]

\w: Alphanumeric character, equivalent to [a-zA-Z0–9_]

\s: Any whitespace

*’: matches zeros or more occurrences

+’: matches one or more occurrences

?: matches zero or one occurrence

{n} :exactly n repetitions,n≥0

{n,} :at least n repetitions

{,n} :at most n repetitions

{m,n} :at least m and at most n repetitions

Some useful functions,string,flags=0) : This function searches for the first occurrence of the RE pattern within a string with optional flags.

re.match(pattern, string, flags=0) : This function attempts to match the RE pattern to string with optional flags.

  1. Import libraries

NB: re is the python regular expression

2. Loading data

3. Date Extraction

i. We extract dates with the following format: 04/20/2009; 04/20/09; 4/20/09; 4/3/09

ii. Extract dates with formats # Mar-20–2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;

re.I Performs case-insensitive matching.

re.M Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).

iii. Extract dates with format # 6/2008; 12/2009

iv. Extract dates with the format; # 2009; 2010

v. Concatenate all the series

vi. Let's put all the above together in a single function

4. Reformating the extracted dates

Although we have extracted our dates, there are some outlined rules to put the date into good shape.

Rule #1
Assume all dates in xx/xx/xx format is mm/dd/yy

Rule #2
Assume all dates where year is encoded in only two digits are years from the 1900s (e.g. 1/5/89 is January 5th, 1989). Let’s do this conversion.

Rule #3
If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009). Let's do this:

Rule #4
If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010). Let's do this:

Rule #5
Watch out for potential typos as this is a raw, real-life derived dataset.

5. Converting date in words to numbers

We need to sort the data into a specific chronological way. Because of that, all dates written in words would have to be converted to numbers. eg. January as 1, February as 2 … December as 12.

Let us write some function to tackle this.

Now let’s Combine the entire code parts into one whole

By calling the function date_sorter() , all the dates would be extracted from the text and be arranged in ascending order of magnitude as shown below.

6. Summary

  • we learnt the required regular expression for this project.
  • We introduced some useful re functions thus match and search.
  • All the date variants were extracted from the given document.
  • With specific rules, the data was reformated
  • All the dates in words were converted to digits for sorting
  • The extracted dates were then sorted in ascending order of magnitude.



Joachim Kuleafenu
Analytics Vidhya

Software Engineer. I build smart features with Machine Learning techniques.