How to extract all Date variants from real-world data using python.
--
Your practical definitive guide to mining all Date variants from both structured and unstructured data using data mining techniques in python.
Whether you are a Data Scientist or an Analysts, you may one day encounter a problem that has to do with date and time.
Lifes becomes easier when the date is well cleaned and structured in a single column of a data frame like 06/18/2021, 09/27/2000
. Unfortunately, it doesn’t always happen that way.
How do you extract dates when you get something like 7/20
or even in a sentence like ‘He came to the Hospital on 2/21
at dawn’
Before we get overwhelmed, let get started!!
Project Goal
Our goal is to correctly identify all of the different date variants encoded in a dataset and to properly normalize and sort them.
Things to learn
1. Perform basic operations on medical data using Pandas.
2. Extract all date variants using python regular expressions ( Regex).
3. Sort the dates in ascending chronological order according to some outlined rules.
4. Summary
Below are some of the date variants we may have to deal with:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20–2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20
2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
Let’s learn some useful regular expressions to prepare us for the journey
[]: matches one of the sets of characters within []
[a-z]: matches one of the range characters a,b, …,z
a|b: matches either a
or b
, where a
and b
are strings
() : Scoping for operators
\: Escape character for special characters (\t, \n, \b)
\b: Matches word boundary
\d: Any digit, equivalent to [0–9]
\w: Alphanumeric character, equivalent to [a-zA-Z0–9_]
\s: Any whitespace
‘*’: matches zeros or more occurrences
‘+’: matches one or more occurrences
?: matches zero or one occurrence
{n} :exactly n repetitions,n≥0
{n,} :at least n
repetitions
{,n} :at most n
repetitions
{m,n} :at least m
and at most n
repetitions
Some useful functions
re.search(pattern,string,flags=0)
: This function searches for the first occurrence of the RE pattern within a string with optional flags.
re.match(pattern, string, flags=0)
: This function attempts to match the RE pattern to string with optional flags.
- Import libraries
NB: re
is the python regular expression
2. Loading data
3. Date Extraction
i. We extract dates with the following format: 04/20/2009; 04/20/09; 4/20/09; 4/3/09
ii. Extract dates with formats # Mar-20–2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
re.I Performs case-insensitive matching.
re.M Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
iii. Extract dates with format # 6/2008; 12/2009
iv. Extract dates with the format; # 2009; 2010
v. Concatenate all the series
vi. Let's put all the above together in a single function
4. Reformating the extracted dates
Although we have extracted our dates, there are some outlined rules to put the date into good shape.
Rule #1
Assume all dates in xx/xx/xx
format is mm/dd/yy
Rule #2
Assume all dates where year is encoded in only two digits are years from the 1900s (e.g. 1/5/89 is January 5th, 1989). Let’s do this conversion.
Rule #3
If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009). Let's do this:
Rule #4
If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010). Let's do this:
Rule #5
Watch out for potential typos as this is a raw, real-life derived dataset.
5. Converting date in words to numbers
We need to sort the data into a specific chronological way. Because of that, all dates written in words would have to be converted to numbers. eg. January as 1, February as 2 … December as 12.
Let us write some function to tackle this.
Now let’s Combine the entire code parts into one whole
By calling the function date_sorter()
, all the dates would be extracted from the text and be arranged in ascending order of magnitude as shown below.
6. Summary
- we learnt the required regular expression for this project.
- We introduced some useful re functions thus
match
andsearch
. - All the date variants were extracted from the given document.
- With specific rules, the data was reformated
- All the dates in words were converted to digits for sorting
- The extracted dates were then sorted in ascending order of magnitude.