How much information can you extract from timestamps?

Alberto Boschetti
Discovering Scikit-learn and Pandas
5 min read · Nov 7, 2019

Timestamps contain a huge amount of information for your machine learning models. From a timestamp, you can model behaviors, report insights, extract trends and forecasts, analyze time series, and explore your dataset for anomalies. Unfortunately, timestamps cannot be used straight away, since they're neither numerical nor categorical features: to use them in machine learning models, we need to feature engineer them.

From timestamps, a large number of features can be created; most of them are ordinal or categorical, but there are some numerical ones too. Mind that we won't be exploring automatic feature generation in this post, as you would with PolynomialFeatures in Scikit-learn*, but we'll focus mostly on human-curated features with real meaning for human lives and/or companies and financial markets.

The features we'll extract have a direct connection with human patterns and seasonality trends; moreover, we'll try to extract country-agnostic features, so you'll be able to use them everywhere at scale.

Warning: not all of these features may be relevant to your project or bring additional information to your ML algorithm; therefore, always use a feature-selection/elimination approach to cherry-pick the best ones.

Let's analyze the feature engineering task of extracting features from a timestamp with an example: in our dataset, one event happened in the US at the following timestamp: 2019-10-20 18:23:36.0000 -04

Warning: mind that sometimes the timezone is not explicit and must be derived/guessed from the location of the data point (from the GPS coordinates, the IP address, a feature containing the location, …)

The first operation is to convert your timestamp to the UTC timezone. This will simplify the analysis, especially when your data contains timestamps from multiple timezones. After doing so, remember to keep the country (or the timezone offset) as a separate feature in your model, since it's very likely to affect the pattern and the distribution of your data (the morning peak is at 8AM UTC in the EU, at 12PM UTC in the US, and so on).

In this example, we’ll use pandas to illustrate how to derive the features. Let’s start with the conversion to UTC. Documentation is available here <link>
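A minimal sketch of the conversion with pandas (the variable names are my own):

```python
import pandas as pd

# Original event timestamp, recorded with its -04:00 offset (US Eastern, DST)
ts = pd.Timestamp("2019-10-20 18:23:36-04:00")

# Convert to UTC: the offset is folded into the clock time
ts_utc = ts.tz_convert("UTC")
print(ts_utc)  # 2019-10-20 22:23:36+00:00
```

If the source timestamp is tz-naive but you know its timezone, first attach it with `.tz_localize("America/New_York")`, then call `.tz_convert("UTC")`.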

From timestamp_utc (and the country) we can now feature engineer many other features which may help in detecting patterns and seasonality trends in your data. Here are 11 of them that are very simple to derive:

  • Age, i.e. the distance between a set date, like today, and the timestamp: pd.Timestamp.utcnow() - timestamp_utc The distance can be the number of days, months, years, but also hours, seconds, milliseconds, etc. Age is an integer or floating-point numerical feature.
  • The Year, timestamp_utc.year It is an integer numerical feature.
  • The Month of the year, timestamp_utc.month It is an integer, between 1 and 12 in pandas (some libraries use 0 to 11), or a string (January, February, …). It's an ordinal/categorical feature.
  • The Day of the month, timestamp_utc.day Integer, usually between 1 and 31. Ordinal/categorical feature.
  • Percentage within the month, timestamp_utc.day/timestamp_utc.daysinmonth A floating-point feature, with values between 0.0 and 1.0. This feature carries almost the same information as the previous one, but it's normalized between 0 and 1 for all months. It's a numerical feature.
  • Day of the week, timestamp_utc.dayofweek It is an integer, usually between 0 and 6 (or between 1 and 7), or a string (Monday, Tuesday, …). It's usually treated as a categorical feature. From this feature, a working_day vs weekend_day boolean feature can be derived.
  • Day of the year, timestamp_utc.dayofyear It is an integer, between 1 and 365 (or 366 in leap years). It's an ordinal/categorical feature.
  • Percentage of days within the year, timestamp_utc.dayofyear/(366 if timestamp_utc.is_leap_year else 365) It's a floating-point value between 0.0 and 1.0. This feature carries almost the same information as the previous one, but it's normalized between 0 and 1 across years. It's a numerical feature.
  • The Quarter of the year, timestamp_utc.quarter It's an integer, between 1 and 4 in pandas (some representations use 0 to 3). It's an ordinal/categorical feature.
  • The Week of the year, timestamp_utc.isocalendar()[1] (the older timestamp_utc.week accessor has been deprecated in recent pandas versions). It is an integer, between 1 and 53. It's an ordinal/categorical feature.
  • Type of day (weekday/public holiday). This feature requires the country; for example, for the US check out pandas.tseries.holiday.USFederalHolidayCalendar. It's a binary feature (0 or 1). For other countries, see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-holiday
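The 11 features above can be sketched in a few lines of pandas (the dictionary keys are my own naming; the commented values were checked for this particular timestamp):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

ts_utc = pd.Timestamp("2019-10-20 18:23:36-04:00").tz_convert("UTC")

features = {
    # 1. Age in days relative to "now" (numerical)
    "age_days": (pd.Timestamp.now(tz="UTC") - ts_utc).days,
    "year": ts_utc.year,                       # 2. -> 2019
    "month": ts_utc.month,                     # 3. -> 10
    "day_of_month": ts_utc.day,                # 4. -> 20
    "pct_of_month": ts_utc.day / ts_utc.daysinmonth,   # 5. -> 20/31
    "day_of_week": ts_utc.dayofweek,           # 6. -> 6 (Sunday; Monday is 0)
    "is_weekend": ts_utc.dayofweek >= 5,       #    derived boolean -> True
    "day_of_year": ts_utc.dayofyear,           # 7. -> 293
    "pct_of_year": ts_utc.dayofyear
        / (366 if ts_utc.is_leap_year else 365),       # 8. -> 293/365
    "quarter": ts_utc.quarter,                 # 9. -> 4
    "week_of_year": ts_utc.isocalendar()[1],   # 10. ISO week -> 42
}

# 11. Type of day: is it a US federal holiday?
holidays = USFederalHolidayCalendar().holidays(start="2019-01-01",
                                               end="2019-12-31")
features["is_us_holiday"] = ts_utc.normalize().tz_localize(None) in holidays
```

Note that the holiday check compares the tz-stripped midnight of the event against the (tz-naive) holiday index; October 20, 2019 is a Sunday, not a federal holiday, so the feature comes out False.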

Some features are marked as both ordinal and categorical because they really sit in between the two. For example, the day of the month is ordinal, but it's also categorical, since the distance between the 2nd and the 3rd of March and the distance between the 31st of March and the 1st of April should be the same. If you're unsure, or if the day of the month is one of the most important features in your model, I'd strongly suggest treating it as categorical.
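Treating such a feature as categorical typically means one-hot encoding it, so the model sees no artificial ordering between the values. A quick sketch with pandas (the column name and prefix are my own):

```python
import pandas as pd

# A toy day-of-week column: Monday, Wednesday, Saturday, Sunday
df = pd.DataFrame({"day_of_week": [0, 2, 5, 6]})

# One-hot encode: each distinct day becomes its own binary column,
# removing the implicit ordering of the integer codes
one_hot = pd.get_dummies(df["day_of_week"], prefix="dow")
print(one_hot.columns.tolist())  # ['dow_0', 'dow_2', 'dow_5', 'dow_6']
```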

Show me the code

Here it is: this Colab gist contains the code for extracting the features shown above for a couple of timestamps, one from an event in the US and another from an event based in the UK. It extensively uses the .apply() functionality of pandas, and it shows how to create a new holiday calendar for the UK.
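The gist itself isn't reproduced here, but a custom UK calendar along those lines can be sketched with pandas' holiday machinery. The rules below are a simplified subset of UK bank holidays chosen for illustration (real bank holidays also have substitute-day rules when they fall on a weekend):

```python
import pandas as pd
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday, MO

class UKBankHolidayCalendar(AbstractHolidayCalendar):
    # Simplified subset of UK bank holidays, without substitution rules
    rules = [
        Holiday("New Year's Day", month=1, day=1),
        Holiday("Early May Bank Holiday", month=5, day=1,
                offset=pd.DateOffset(weekday=MO(1))),  # first Monday of May
        Holiday("Christmas Day", month=12, day=25),
        Holiday("Boxing Day", month=12, day=26),
    ]

holidays_2019 = UKBankHolidayCalendar().holidays(start="2019-01-01",
                                                 end="2019-12-31")
print(pd.Timestamp("2019-05-06") in holidays_2019)  # True: first Monday of May
```

The `weekday=MO(1)` offset rolls the anchor date forward to the first Monday, which is how pandas' own `USFederalHolidayCalendar` expresses floating holidays like Memorial Day.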

The final output is the following:

Output Pandas dataframe with all the features derived from the first two columns

In the next post, we'll see how to deal with categorical, ordinal, and numerical features, and which options you have for feeding these features into your ML algorithm. Moreover, we'll see how to deal with "circular" features, such as the day of the month.

[*] you can still apply PolynomialFeatures to these features afterwards :)

If you liked this post, check out our book on Python Data Science essentials, https://www.amazon.com/Python-Data-Science-Essentials-practitioners/dp/178953786X/
