Population Lifecycle Tracking

Using the Python data science toolkit to track trends in lifecycles of any group

Eric Ness

Published in

When I Work Data

8 min read · Apr 12, 2019

Introduction

There are many lifecycles that happen in populations. A lifecycle is a process with a beginning, a middle and an end that population members transition through. These can include customer, biological, and equipment lifecycles. For example, an equipment lifecycle starts with the purchase, continues through operation and ends with the disposal of the equipment.

During the course of a lifecycle there are metrics that measure the state of the population members. For customer lifecycles it could be the number of orders; for equipment it could be temperature. These metrics vary in a potentially infinite number of patterns. While it can be interesting to examine the journeys of individual population members, the most powerful insights come from seeing the patterns of movement in the population as a whole.

Discovering patterns among the lifecycles of all of the members of the population can be challenging. One challenge is that the cycles start and end at different times and have different durations. Therefore, comparing members of the population at a particular point of time isn’t helpful because members will be at different stages in their lifecycle. A system is needed to align all of the lifecycles so that they can be compared in a uniform way.

Analyzing Customer Activity

This story looks at a method of comparing the metrics of population members across their lifecycle regardless of start times, end times and durations. This example illustrates tracking customer activity on a car review website. Customers come to the website to research cars. They are active for a period during the buying process then disappear once they have made a purchase. Although this example is focused on customer activity, the same methods can be used to analyze the members of any population.

In order to perform the analysis, we’ll need customer activity data. For this fictional example we’ll generate fake data. The generated data starts low, increases, then descends back to zero. There is also noise added to the generated data. For details on the method of data generation visit the code on GitHub. Here is what customer lifecycles could look like:

Each of these customers has an activity value for each day during their lifecycle. The patterns are quite erratic due to random variation so it is difficult to see any patterns across all of the lifecycles. The first step in comparing the lifecycles is to align the start times and end times.
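As a rough illustration of the kind of data described above, here is a minimal sketch that generates bump-shaped daily activity with noise. It is not the generation method from the GitHub code: the function name `generate_customer_activity`, the half-sine trend, the Gaussian noise and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def generate_customer_activity(start, n_days, peak, rng, noise_scale=1.0):
    """Return a hypothetical daily activity series shaped like a single bump."""
    # Position of each day within the lifecycle, from 0 (first day) to 1 (last day).
    position = np.linspace(0.0, 1.0, n_days)
    # Half a sine wave: starts at zero, peaks mid-lifecycle, returns to zero.
    trend = peak * np.sin(np.pi * position)
    # Gaussian noise, clipped so that activity never goes negative.
    noise = rng.normal(0.0, noise_scale, size=n_days)
    activity = np.clip(trend + noise, 0.0, None)
    index = pd.date_range(start=start, periods=n_days, freq="D")
    return pd.Series(activity, index=index)


# Three hypothetical customers with different start dates, durations and peaks.
rng = np.random.default_rng(seed=0)
customers = {
    "customer_a": generate_customer_activity("2018-01-01", 30, 10.0, rng),
    "customer_b": generate_customer_activity("2018-02-15", 45, 6.0, rng),
    "customer_c": generate_customer_activity("2018-03-10", 20, 14.0, rng),
}
```

Because each series has its own start date and length, plotting them on a shared calendar axis shows exactly the misalignment problem that the next section addresses.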

Standardizing Timelines

The start times and end times are aligned using standardization. Each lifecycle will start at time 0 and end at time 1. Customers can then be compared at corresponding times in their lifecycles. The function in the snippet below performs this standardization regardless of whether the granularity of the data is days, hours, minutes or seconds.

The input of the function is a pandas.Series with a DatetimeIndex. The function converts each datetime in the index to the number of seconds since the epoch. It then uses sklearn's MinMaxScaler to convert the seconds values into a scale from 0 to 1. Finally it reindexes the time_series using these standardized values. Let’s look at the function in action:

2018-01-01     0
2018-01-02     1
2018-01-03     2
2018-01-04     3
2018-01-05     4
2018-01-06     5
2018-01-07     6
2018-01-08     7
2018-01-09     8
2018-01-10     9
2018-01-11    10
2018-01-12    11
2018-01-13    12
2018-01-14    13
2018-01-15    14
2018-01-16    15
2018-01-17    16
2018-01-18    17
2018-01-19    18
2018-01-20    19
Freq: D, dtype: int64

0.000000     0
0.052632     1
0.105263     2
0.157895     3
0.210526     4
0.263158     5
0.315789     6
0.368421     7
0.421053     8
0.473684     9
0.526316    10
0.578947    11
0.631579    12
0.684211    13
0.736842    14
0.789474    15
0.842105    16
0.894737    17
0.947368    18
1.000000    19
dtype: int64
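A function matching this description, together with a call that prints a series before and after standardization, could be sketched as follows. This is a minimal sketch under stated assumptions rather than the author's code: it uses plain NumPy arithmetic in place of sklearn's MinMaxScaler (the same min-max formula for a 0 to 1 range), builds a new Series instead of reindexing in place, and assumes a timezone-naive DatetimeIndex.

```python
import numpy as np
import pandas as pd


def standardize_timeline(time_series):
    """Re-index a datetime-indexed series onto a 0-to-1 timeline."""
    # Seconds since the Unix epoch for each timestamp (assumes a tz-naive index).
    epoch = pd.Timestamp("1970-01-01")
    seconds = ((time_series.index - epoch) / pd.Timedelta(seconds=1)).to_numpy()
    span = seconds.max() - seconds.min()
    if span == 0:
        raise ValueError("need at least two distinct timestamps")
    # Min-max scaling: the first timestamp maps to 0, the last to 1.
    scaled = (seconds - seconds.min()) / span
    return pd.Series(time_series.to_numpy(), index=scaled)


# Twenty consecutive days holding the values 0 through 19.
dates = pd.date_range("2018-01-01", periods=20, freq="D")
time_series = pd.Series(np.arange(20), index=dates)
standardized = standardize_timeline(time_series)
print(time_series)
print(standardized)
```

Working in seconds is what makes the function independent of granularity: daily, hourly or per-second data all reduce to the same 0 to 1 scale.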

The datetimes in the index have all been converted to floats. The graph shows what the customer lifecycles look like after standardize_timeline is applied.

The lifecycles are now all synchronized and have the same length which makes them easier to compare. Now that the lifecycle timelines are standardized, let’s look at how to generate a function that can accurately approximate the changes of activity over time.

Approximating Customer Activity with Splines

In order to compare lifecycles, the activity value needs to be compared at any point along the timeline. One way to do this is to create a function that will closely approximate the activity values for an individual customer. The activity values are highly variable and a single linear or quadratic function won’t be able to fit the data closely. Cubic splines can be used in order to generate a function flexible enough to match the original data.

Splines approximate the shape of even highly variable data sets. They divide the x-axis into multiple intervals in which they fit a polynomial. Since each polynomial has to fit only a small chunk of the entire range it can match the data closely. Here is a graph that shows two splines fit to a data set.

Figure source: https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html#d-interpolation-interp1d

The yellow line represents a linear spline. This will fit each region of data with a straight line. The linear spline exactly fits the data here, but won’t do a great job of interpolating the values between real data points. For example, the linear spline is flat for the region between 7 and 8, but the true data values are likely above this flat region.

The dashed green line represents a cubic spline. Cubic splines are fit with polynomials of degree 3 so they can more closely model the curves of complex functions. The cubic spline in the graph shows a much smoother approximation of the data.
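The difference between the two kinds of spline can be checked numerically. The sketch below samples an arbitrary smooth test function (an assumption chosen for illustration), fits both a linear and a cubic interpolant to eleven points, and reports how far each strays from the true function between the data points. It uses `numpy.interp` for the linear case and `scipy.interpolate.CubicSpline` with its default boundary conditions for the cubic case.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def true_function(x):
    """Arbitrary smooth test function (an assumption for illustration)."""
    return np.cos(x ** 2 / 9.0)


# Eleven known data points.
x_known = np.linspace(0.0, 10.0, 11)
y_known = true_function(x_known)

# Dense grid on which both interpolants are evaluated.
x_dense = np.linspace(0.0, 10.0, 201)

# Linear spline: straight segments between neighbouring points.
y_linear = np.interp(x_dense, x_known, y_known)

# Cubic spline: piecewise degree-3 polynomials with smooth joins.
cubic = CubicSpline(x_known, y_known)
y_cubic = cubic(x_dense)

linear_error = np.max(np.abs(y_linear - true_function(x_dense)))
cubic_error = np.max(np.abs(y_cubic - true_function(x_dense)))
print(f"max error, linear spline: {linear_error:.3f}")
print(f"max error, cubic spline:  {cubic_error:.3f}")
```

Both interpolants pass exactly through the eleven known points; they differ only in how they fill the gaps between them.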

Interpolating Values

The flexibility of splines can make them difficult to calculate by hand. Fortunately, scipy has a CubicSpline class that will perform all of the calculations. You need to provide the data points that you want to fit and the class will generate the function for you.

After fitting a spline to each customer’s lifecycle, we interpolate a specific number of regularly-spaced points along the timeline. The function interpolate_time_series will perform this work for us.

This function fits a cubic spline to the time series that is passed in. It will then use this spline to estimate the values of n_points evenly spaced points on the timeline between 0 and 1. To get an idea of what this looks like in action, the snippet below creates 20 data points on a line. interpolate_time_series fits a spline to this line and then estimates the values of 12 points evenly spaced along this line. Most of the interpolated points lie at points along the timeline where there is no actual data.

0.00000     0.00000
0.09091     1.72729
0.18182     3.45458
0.27273     5.18187
0.36364     6.90916
0.45455     8.63645
0.54545    10.36355
0.63636    12.09084
0.72727    13.81813
0.81818    15.54542
0.90909    17.27271
1.00000    19.00000
dtype: float64
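A minimal sketch of what `interpolate_time_series` and this sanity check might look like is shown below. The signature and body are assumptions based on the description above, not the author's code; the sketch relies on `scipy.interpolate.CubicSpline` with its default boundary conditions and expects a series whose index is already standardized to the range 0 to 1.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import CubicSpline


def interpolate_time_series(time_series, n_points):
    """Fit a cubic spline and sample it at n_points evenly spaced positions."""
    # Expects an index that already runs from 0 to 1 in strictly increasing order.
    spline = CubicSpline(time_series.index.to_numpy(), time_series.to_numpy())
    timeline = np.linspace(0.0, 1.0, n_points)
    return pd.Series(spline(timeline), index=timeline)


# Twenty points on a straight line, already on the standardized 0-to-1 timeline.
line = pd.Series(np.arange(20, dtype=float), index=np.linspace(0.0, 1.0, 20))
interpolated = interpolate_time_series(line, n_points=12)
print(interpolated)
```

Since the input lies on a straight line, every interpolated value should land on that same line, that is, at 19 times its position on the timeline. Small differences in the trailing digits are possible depending on rounding.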

With this simple sanity check complete, let’s see how a cubic spline fits a customer’s activity. The activity for the customer below has already been standardized to the timeline 0 to 1. A cubic spline was fit using 20 interpolated points. The spline fits the activity values closely even though a low number of interpolated points was used. A higher number of interpolated points would follow the activity values even more smoothly.

Tracking Quartiles

Once a spline is fitted for a single customer, it is a simple matter to expand this to all customers. Then for each customer in our population, we have a spline that can interpolate values.

Now we’re ready to examine how the population as a whole moves through its lifecycle. We want to find the most common path that a typical account takes. One way to do this is to take the quartile values at each interpolated point along the timeline. This will track how the top, median and lower members of the customer population are moving. By analyzing the movements of the quartiles of the population, the randomness of individual customers will be smoothed out and we can see how the whole population behaves.

This chart shows the movement of the quartiles in the generated activity data. The 50th percentile line shows that the typical account increases its activity until halfway through its lifecycle and then gradually decreases its activity down to zero in the second half of the lifecycle. The 25th and 75th percentile show the same patterns with lower and higher peaks respectively.
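The quartile computation itself is short once every customer has been interpolated onto the same timeline. The sketch below assumes the interpolated values are stored in a DataFrame with one row per customer and one column per timeline position. The data here is a randomly generated stand-in (a bump-shaped trend plus noise), and the names `activity` and `quartiles` are illustrative rather than taken from the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n_customers, n_points = 200, 20
timeline = np.linspace(0.0, 1.0, n_points)

# Stand-in for the interpolated activity: one row per customer, one column
# per point on the standardized timeline (bump-shaped trend plus noise).
peaks = rng.uniform(5.0, 15.0, size=(n_customers, 1))
noise = rng.normal(0.0, 1.0, size=(n_customers, n_points))
activity = pd.DataFrame(peaks * np.sin(np.pi * timeline) + noise, columns=timeline)

# Quartiles across customers at each timeline position.
quartiles = activity.quantile([0.25, 0.5, 0.75])
print(quartiles.round(2).T)
```

Each row of `quartiles` is one of the percentile lines; plotting the rows against the timeline gives a chart of the kind described above.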

Conclusion

Knowing typical activity patterns helps us serve our customers on the car research site. We can customize their experiences based on their activity levels on the site. While their activity is increasing they are still exploring options and we can guide them to lesser-known models that they may be interested in. Once their activity starts to decrease, they are narrowing down their options and looking in detail at a few models. At this point in their journey we could help them with tips on how to find and negotiate the best deal.

Tracking a population’s movement through a lifecycle is applicable in many domains beyond just tracking customer activity. It can be applied in any area where population members go through a cycle with offset beginnings and ends and variable duration.

Code

All of the code for this story is available on GitHub.

References

Introduction to Statistical Learning

www-bcf.usc.edu