Analysis of Blood Donor Behavior using Hidden Markov Model
As we all know, blood donation is a vital part of medical treatment all over the world. Despite that, hospitals and clinics often face blood shortages.
To increase the donation rate, engaging non-remunerated donors has proven to be one of the effective solutions. Accordingly, each blood organization needs a specialized, customized strategy to engage each donor demographic.
P.S. Non-remunerated donation means donating blood of one's own free will without receiving any payment.
So, in this post, I would like to share my research on analyzing blood donation behavior to find donation patterns and the demographic characteristics associated with each pattern, using a Hidden Markov Model, and to explain why I used this technique.
The data used in this analysis are the visit histories of 2,000 individuals from a blood donation center, covering January 2010 to December 2014. The analysis is based on the number of visits per individual, aggregated by half-year, so each time sequence has length l = 10 (t₁, t₂, …, t₁₀). I call the aggregated visit record the trajectory data (Trᵢ, i = 1, …, 2000), and I encode the number of visits in each period to simplify the later analysis. Along with the aggregated visit counts, the data include demographic variables such as gender, age, and blood type. Complete information on the demographic variables is shown below:
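To make the half-year aggregation concrete, here is a minimal Python sketch. The post only mentions R's TraMineR for plotting, so the language, the function names, and the cap of 3 visits per period are all assumptions for illustration. The actual codes used in the analysis are the ones shown in Fig 4.

```python
from datetime import date

START_YEAR, END_YEAR = 2010, 2014
N_PERIODS = (END_YEAR - START_YEAR + 1) * 2  # 10 half-year periods


def period_index(d):
    """Map a date to its 0-based half-year index (0 = Jan-Jun 2010)."""
    return (d.year - START_YEAR) * 2 + (0 if d.month <= 6 else 1)


def to_trajectory(visit_dates):
    """Count visits per half-year; dates outside the window are ignored."""
    counts = [0] * N_PERIODS
    for d in visit_dates:
        if START_YEAR <= d.year <= END_YEAR:
            counts[period_index(d)] += 1
    return counts


def encode(counts, max_code=3):
    """Cap counts at max_code (a hypothetical choice) to keep the alphabet small."""
    return [min(c, max_code) for c in counts]


# Made-up visit dates for one donor.
visits = [date(2010, 2, 14), date(2010, 5, 3), date(2011, 9, 21)]
trajectory = to_trajectory(visits)
print(trajectory)  # [2, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(encode(trajectory))
```

Each donor ends up with one list of length 10, which is the kind of sequence the later steps work on.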
Let’s take a look at the demographic composition for each variable through Fig 2 below.
From Fig 2, we can see that the split between genders is fairly balanced, so the inferences we obtain may be generalizable. The age composition is dominated by people in their 20s, followed by people in their 30s. This tells me that most people who donated were in their active years.
To make this clearer, below is a sample visualization of the data. (I used the “TraMineR” package in R to create the trajectory graph.)
As shown in Fig 3, the numbers 1, 2, and 3 (shown vertically) represent three individuals (individual 1, individual 2, and individual 3) with their visit trajectories. The segments of each individual's bar (T1, T2, up to T10) show the time sequence over 5 years. The colors and codes (“M”, “N”, “O”) in each bar represent different numbers of visits, as shown in Fig 4 below.
For example, Fig 3 shows that individual 1 donated only once, in the first period, and did not come back afterward. It also shows that individual 2 donated more frequently than the other two individuals.
For this type of data (trajectory data) and purpose (analyzing donation behavior), I applied clustering to extract more general information and patterns about when individuals donate their blood. However, since trajectory data differ from the data types we are more familiar with (e.g. numeric and categorical data), I used a Hidden Markov Model (HMM) to calculate the distances before proceeding to the clustering step with K-medoids.
P.S. For numeric or categorical data, researchers usually apply Euclidean distance to measure how different one record is from another. Euclidean distance compares values coordinate by coordinate, so it captures only one aspect of the data (the number or category itself) and not how a sequence evolves over time.
Why use an HMM? An HMM can accommodate the time-varying aspect of the data, and the dynamics of how individuals donate over time matter as much as the values themselves. This makes an HMM a good fit for both the purpose of the analysis and the data type. The next question is how “the distance” is calculated with an HMM.
Basically, an HMM is fitted to each individual's trajectory, and the probability density (likelihood) from each model is then used to derive a meaningful distance between the models. Fig 5 shows a brief workflow of the calculation.
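The post does not spell out the distance formula, so the sketch below is an assumption: it uses a common symmetric, length-normalised log-likelihood distance between two HMMs, with the likelihood computed by the scaled forward algorithm. Fitting the HMMs (e.g. with Baum-Welch) is left out, and the two example models and their parameters are made up for illustration.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: returns log P(obs | HMM).

    start[i]    : probability of beginning in hidden state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability that state i emits symbol o
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def hmm_distance(obs_a, model_a, obs_b, model_b):
    """Symmetric distance: how much worse each model explains the other's data."""
    d_ab = (log_likelihood(obs_a, *model_a) - log_likelihood(obs_a, *model_b)) / len(obs_a)
    d_ba = (log_likelihood(obs_b, *model_b) - log_likelihood(obs_b, *model_a)) / len(obs_b)
    return 0.5 * (d_ab + d_ba)


# Hypothetical 2-state models over symbols 0 (no visit) and 1 (visit).
# State 0 behaves like "active", state 1 like "inactive".
emit = [[0.1, 0.9], [0.95, 0.05]]
one_time = ([0.9, 0.1], [[0.1, 0.9], [0.05, 0.95]], emit)
regular = ([0.9, 0.1], [[0.9, 0.1], [0.3, 0.7]], emit)

obs_one_time = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
obs_regular = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

print(hmm_distance(obs_one_time, one_time, obs_regular, regular))
```

Computing this value for every pair of individuals gives a distance matrix that a medoid-based clustering method can consume directly. Emission probabilities of exactly zero would make some distances infinite, which is why the example models avoid them.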
As for the clustering method, I applied basic K-medoids with different values of k (k = 2, …, n) and chose the k with the best criterion values. The criteria were the Silhouette Index (highest score), the Dunn Index (highest score), and the DB (Davies-Bouldin) Index (lowest score).
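As a rough illustration of choosing k, the Python sketch below runs a simplified alternating K-medoids (not the full PAM swap procedure) on a precomputed distance matrix and scores each k with the mean silhouette width. The Dunn and DB indices are omitted for brevity, and the toy one-dimensional points stand in for the HMM distance matrix. All names here are hypothetical.

```python
import random


def k_medoids(dist, k, n_iter=100, seed=0):
    """Alternating k-medoids on a distance matrix; returns (medoids, labels)."""
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)

    def assign(meds):
        # Attach every point to its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    for _ in range(n_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                # New medoid: the member with the smallest total distance to the rest.
                best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            else:
                best = medoids[c]  # keep the old medoid if the cluster is empty
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


def silhouette(dist, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    n = len(dist)
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Toy data: two well-separated groups on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(p - q) for q in points] for p in points]

scores = {}
for k in range(2, 5):
    _, labels = k_medoids(dist, k)
    scores[k] = silhouette(dist, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 2
```

Working from a distance matrix rather than raw coordinates is what lets a medoid-based method accept the HMM distances, since no cluster mean has to be computed.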
Firstly, let’s see the general pattern of visitation based on several conditions.
Based on Fig 6, female and male donors show a relatively similar pattern: most of them donated only on their first visit and did not come back (shown by the green area). In both genders, only a small group donates regularly (shown by the blue area).
Fig 7 shows a similar pattern to Fig 6: the return rate decreases continuously in each period, indicating a high drop-out rate, and the majority are people who donated only once.
What if we take a look at the “real” grouping based on HMM clustering?
After the first run using PAM (Partitioning Around Medoids) with different values of k, k = 2 was found to be optimal. Figure 8 shows the multidimensional scaling (MDS) plot of the result.
There are two apparent clusters, and cluster 1 has more members than cluster 2. Members of cluster 1 appear to behave identically (in Figure 8 they sit at exactly one location). Cluster 2, on the other hand, is more spread out, which may indicate more varied donation behavior.
Nevertheless, the result of the first run does not yet describe the individuals in enough detail. Hence, a second clustering run was performed using only the data from cluster 2. In the second run, k = 3 gave the best result. Fig 9 compares the criterion values for the first and second clustering runs, and Fig 10 shows the multidimensional scaling of the second run.
The two runs together give four clusters (cluster 1 from the first run plus the three clusters from the second run), so the natural questions are what differentiates them and what each one consists of. To find out, I visualized each cluster using “TraMineR” to investigate its donation pattern. It turns out that cluster 1 consists of people with exactly the same donation pattern: they donate once and never return. Cluster 2 represents people with a “repeat donor” pattern, most of whom donate regularly. Cluster 3 consists of people with a “drop-out” pattern, since they donate only 2 or 3 times in the early periods and never return. Cluster 4 shows a wider range of donation patterns than cluster 3 but less donation activity than cluster 2, so I named it “lapsed donor”. Figure 11 shows the pattern visualization for each cluster.
Let’s take a look at the demographic characteristics of each cluster shown in Fig 12.
Based on Fig 12, female donors form the majority in every cluster except cluster 1, where the majority are male. This suggests that male donors are less likely than female donors to donate regularly. In terms of age, the largest share comes from individuals aged 20–29.
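A comparison like the one in Fig 12 boils down to a per-cluster share of each demographic value. Here is a minimal Python sketch of that calculation. The cluster labels and genders are made-up toy values, not the study data, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def composition(labels, attribute):
    """Share of each attribute value within each cluster."""
    counts = defaultdict(Counter)
    for cluster, value in zip(labels, attribute):
        counts[cluster][value] += 1
    return {
        cluster: {v: c / sum(ctr.values()) for v, c in ctr.items()}
        for cluster, ctr in counts.items()
    }


# Toy example: six donors, three clusters.
labels = [1, 1, 1, 2, 2, 3]
gender = ["M", "M", "F", "F", "F", "F"]

shares = composition(labels, gender)
for cluster in sorted(shares):
    print(cluster, shares[cluster])
```

The same function works for age group or blood type by passing a different attribute list.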
To sum up, there are four general blood donation patterns: people who donate regularly (repeat donors), people who donate only once (one-time donors), people who donate several times and never return (drop-out donors), and people who donate occasionally (lapsed donors). Clustering with HMM also proved effective for exploring and extracting the different donation patterns together with their demographic characteristics.
- G. B. Schreiber et al., “First-year donation patterns predict long-term commitment for first-time donors,” Vox Sang., vol. 88, no. 2, pp. 114–121, 2005.
- P. L. H. Yu, K. H. Chung, C. K. Lin, J. S. K. Chan, and C. K. Lee, “Predicting potential drop-out and future commitment for first-time donors based on first 1.5-year donation patterns: The case in Hong Kong Chinese donors,” Vox Sang., vol. 93, no. 1, pp. 57–63, 2007.
- S. Ghassempour, “Clustering Longitudinal Health Data Using Hidden Markov Models,” University of Western Sydney, 2014.
- S. Ghassempour and F. Girosi, “Clustering Health Trajectories Using Hidden Markov Models,” pp. 1–6, 2013.
- S. Helske, J. Helske, and M. Eerola, “Analysing Complex Life Sequence Data with Hidden Markov Modelling,” in LaCOSA II: Proceedings of the International Conference on Sequence Analysis and Related Methods, 2016, no. June, pp. 209–240.