Curious about Birth Counts?
Using machine learning regression models to predict the number of births per location by period
Overview:
In the last post, Population Pop (where it was determined that the world’s fertility is in decline, but that that’s not the end of the world), one conclusion reached from looking at the data was that while fertility declines and population still rises, the number of births seemed to operate on a different system period to period (Fig. 1). That raised a question: what sort of things determine the number of births from period to period? An answer could be found through machine learning. Using various regression models (algorithms), I was able to create a few models that could predict the number of births period by period with relatively high accuracy. Working backwards from the best-performing model, I will conclude which information is key in determining the number of births each period.
Data Scientist Translator:
- Dataset/Dataframe: a table of observations belonging to various features
- Model: an algorithm that uses deductions from data to make predictions
- Train/Val/Test split: in machine learning, a way of testing the performance of a model is to apply it to data that wasn’t used to train the model. Holding out data in the form of Validation and Test sets helps inform the data scientist that the model is useful for more than the dataset used to create it
- X matrix/Y vector: X matrix is a dataframe containing the data of the feature(s) to be used on a model. Y vector is a series representing the target feature.
- Shapiro Rank: a feature-ranking technique (e.g. Yellowbrick’s Rank1D) that considers a single feature at a time and scores it with the Shapiro-Wilk test, which assesses how normally distributed that feature’s observations are
- Function: convenient way to store code so that it is reusable for future applications
- Mean Squared Error: the average squared difference between the estimated values and the actual values; squaring penalizes large errors more heavily
- Root Mean Squared Error: the square root of the mean squared error, which puts the error back into the units of the target
- Mean Absolute Error: the average absolute difference between the estimated values and the actual values; roughly, the error we can expect from a typical prediction
- R² Score: the proportion of the variance in the true values that is explained by the predicted values
- Adjusted R²: adds a penalty to R-squared for each predictor, revealing whether added independent variables genuinely improve the fit
- Explained Variance Score: the share of the total variance in the actual values that the model accounts for, after subtracting the residual variance of the predictions. When the explained variance equals R-squared, the mean error is zero.
- Max Error: worst case error between the predicted value and the actual value
- Mean Error: the average signed error of the predictions; a nonzero mean error indicates the model is systematically biased high or low
- Partial Dependence Plots: also referred to as PD plots, show the marginal effect of one or more features on the model’s predictions
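The Train/Val/Test split described above can be sketched with scikit-learn. The 60/20/20 proportions and toy arrays here are my own illustration, not the post’s actual split:

```python
# A minimal sketch of a train/validation/test split using scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy X matrix: 50 observations, 2 features
y = np.arange(50)                  # toy Y vector (target)

# First hold out a test set, then carve a validation set from the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

The model is fit on the training set, tuned against the validation set, and scored once on the held-out test set.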
Stat Trekking:
Let’s take a look at the data…
Feature Key:
- Total Fertility: live births per woman (in other words, the number of successful births per woman)
- NRR : net reproduction rate (in other words the number of surviving daughters per woman)
- Births: number of births (thousands)
- LEx: Life expectancy at birth for both sexes combined (years)
- LEx Male: Male life expectancy at birth (years)
- LEx Female: Female life expectancy at birth (years)
- Infant mortality rate: infant deaths per 1,000 live births
- Under-five mortality: deaths under age five per 1,000 live births
- Crude death rate: deaths per 1,000 population
- Deaths: number of deaths, both sexes combined (thousands)
- Deaths Male: number of male deaths (thousands)
- Deaths Female: number of female deaths (thousands)
- NMR: net migration rate per 1,000 population
- Net Migrants: net number of migrants (thousands)
- Sex ratio: male births per female births
- MAC: female mean age of childbearing (ages 12–51)
- Pop total: total population (thousands)
- Pop density: population per square kilometre (thousands)
- Fertility Declined: indicates whether total fertility is in decline (1 = yes)
Data:
Statistics: the first frontier. Data science is made up of the voyages of statistical enterprisers. Its continuing mission: to explore strange and unique data, to seek out new facts to aid civilization…to boldly know what no one has known before. A major difference between trekking the stars and trekking stats is that if someone out for discovery in the stat-trekking world veers too far from the Prime Directive (for the sake of non-Trekkies, we’ll define this as a set of rules to prevent malfeasance and consequence), the results can be catastrophic or flawed (e.g. correlation isn’t proof of causation; it could be proof of data leakage).
As every journey needs a starting point, and journeying with data science is no different, after forming my question I began my quest with the dataset remaining from the Population Pop post. After realizing the dataset was a fair bit from perfect, I transformed it to suit my purpose: first creating a column indicating whether fertility was in decline, then creating a subset of observations with no missing data. From there I wrangled the data with a function, cleaning up column names and removing features that might prove problematic, like ‘Location’ (removed due to high cardinality) and ‘Crude Birth Rate’ (removed due to data leakage), so that the X matrix for all datasets resembled the one seen in Fig. 2.
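A wrangling function along the lines described might look like the sketch below. The column names and toy rows are assumptions for illustration; the real UN dataset’s columns differ:

```python
# A hedged sketch of the wrangling step: tidy names, drop problematic
# features, keep complete observations. Column names are illustrative.
import pandas as pd

def wrangle(df):
    df = df.copy()
    # Clean up column names: strip whitespace, replace spaces with underscores.
    df.columns = df.columns.str.strip().str.replace(" ", "_")
    # Drop the high-cardinality and leaky features; ignore if already absent.
    df = df.drop(columns=["Location", "Crude_Birth_Rate"], errors="ignore")
    # Keep only observations with no missing data.
    return df.dropna()

raw = pd.DataFrame({
    "Location": ["World", "Oceania"],
    "Crude_Birth_Rate": [18.5, 16.9],
    "Pop_total": [7_794_799, 42_678],
    "Births": [140_000.0, 700.0],
})
clean = wrangle(raw)
print(list(clean.columns))  # ['Pop_total', 'Births']
```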
Baseline:
An important step in machine learning is developing a baseline: usually the simplest model you can create to make predictions. A baseline matters because it gives you something to compare later models against. More often than not, this involves using the mean of the Y vector to make predictions in regression models, or using the modal class in classification models. I opted instead for a simple Linear Regression model using one feature, Total Population. The Linear Regression model seemed a better starting point than the mean, since the dataset consisted of locations with relatively low birth counts like Oceania (Fig. 1) alongside those with high birth counts like the World in its entirety. Linear Regression doesn’t require a normal distribution, so it is unlikely to be thrown off by the skewed data (Fig. 3).
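A single-feature baseline of this kind can be sketched in a few lines. The toy population and birth figures below are made up for illustration; the post fits the real UN data:

```python
# A minimal sketch of the baseline: predict Births from Total Population
# alone with a Linear Regression. Numbers are invented toy data (thousands).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[100], [500], [1000], [5000]])  # total population (thousands)
y = np.array([2.0, 9.5, 21.0, 98.0])          # births (thousands)

baseline = LinearRegression().fit(X, y)
pred = baseline.predict([[2000]])  # births predicted for a new population
print(float(pred[0]))
```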
Figure 4 shows the visual projection of the model, while Figure 5 gives us a sample of both the true target values and the baseline’s predicted values. To the naked eye, the baseline model doesn’t appear to perform very well. However, relying on the naked eye to form conclusions is a less than reliable method. To confirm this, the next step in our data journey is deciding what diagnostic processes (or metrics) we will apply to the models to check performance. If you’ve peeked at the key terms, you already have some idea of what metrics I chose to use. If you’re still unsure what those terms mean, I will explain and show you how I chose to use them.
However well the definitions above are understood, what matters here is how the regression metrics measure improvement from one model to the next.
I feel it important to mention that, of the metrics available for regression models (e.g. R-squared score, mean squared error, max error, etc.), it is not typical to use many of them at once, as there is a potential for overkill (a.k.a. counter-productivity). I chose to use multiple metrics anyway, to build a level of certainty about the model’s performance by seeing how it scores on each. I created a function that reports the results of the various chosen metrics as such:
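A sketch of such a report function, assuming scikit-learn’s metrics module (the exact metrics mirror the key-terms list above; the function name and layout are my own):

```python
# A hedged sketch of a multi-metric regression report function.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, explained_variance_score, max_error)

def regression_report(y_true, y_pred, n_features=1):
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2,
        # Adjusted R² penalizes for the number of predictors used.
        "Adj_R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        "Explained_Variance": explained_variance_score(y_true, y_pred),
        "Max_Error": max_error(y_true, y_pred),
        # Mean (signed) error: nonzero means systematic over/under-prediction.
        "Mean_Error": float(np.mean(np.asarray(y_pred) - np.asarray(y_true))),
    }

report = regression_report(np.array([3.0, 5.0, 7.0]), np.array([2.5, 5.0, 8.0]))
print(report["MAE"], report["Max_Error"])  # 0.5 1.0
```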
Interpreting Regression Metrics:
Looking at the results of the regression metrics, the first thing that stands out is the baseline’s R² score of 0.84. The R² score is designed in such a way that it tends to increase with the number of predictors (features), regardless of how the new predictors actually impact the model. Since the baseline contains only 1 predictor and 18 more features are available, even if we assume, however arbitrarily, that each added predictor inflates R² by a mere 0.001, a model using all the features would appear to be an automatic improvement on the baseline.
Sounds like easy work? It shouldn’t. Remembering the purpose of the baseline, as a comparative for improvement, one is left with the sense that there is little room for improvement, especially considering that the data “prime directive” suggests the R² of a baseline is usually near 0. Regardless of the regression method I use, the most I can improve on the baseline’s R² is 15.99%. This is where the adjusted R² comes into play. Unlike the R² score, the adjusted R² can decrease when a predictor is less than satisfactory, and it will. The mission with these metrics is finding a model that improves in both R² and adjusted R², with features that positively impact the model.
The next thing that stands out in the regression metrics (Fig. 6) is the difference between the explained variance and R², which indicates that the mean error is not zero. Paired with the explained variance being greater than R², I believe the simple model assumes a little too much. This is confirmed when weighing in the results of the max error, mean squared error, root mean squared error, and mean absolute error. Still, looking at those results, I’d say they speak again to how little room there is for improvement. Mission accepted.
Operation: Linear Lasso
Lasso regression is a linear model with built-in feature selection: it regularizes the coefficients, shrinking the weights of features (some all the way to zero) to limit over-fitting to the training data, which makes the model more comprehensible and precise. Stardate, cell 32: I admit to being a little suspicious of my data, which is why I decided a good linear model to test against the baseline would be a Lasso regression model. Imagine my disappointment when my results were:
Using the same function to produce regression metric scores, I developed another to produce a visual comparison table (Fig. 7). The short version: the first attempt at the mission was unsuccessful; mistakes were made. The longer version: the model’s only improvement was a decrease in max error, and not a significant enough one to be of any substance. It wasn’t hard to see that I was going to need a better model, so after gathering some tuning advice from a GridSearchCV, I adjusted a few hyper-parameters and added a few more features (the first attempt was as simple as the baseline, with population total as its only feature).
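The GridSearchCV tuning step can be sketched as follows. The alpha grid and toy data are assumptions; the post’s actual search space isn’t shown:

```python
# A hedged sketch of tuning a Lasso model's alpha with GridSearchCV.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
# Only two of the four features actually carry signal.
y = X @ np.array([3.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=80)

grid = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0]},
    scoring="r2",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The best alpha and cross-validated R² reported here then inform how the refit model is configured.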
With confidence and excitement ablaze I fit my new model to the data(Fig.8) with expectation of great improvement to reveal…
the predictions were:
Okay, so maybe “great” improvement was asking a bit much, but for the sake of rationalization… First, the max error increased, which was less than ideal. Second, the mean error has not descended to zero, though it did improve a bit. Now, let’s go back to that 15.99% possible best improvement in R². If that’s my scale for improvement, the new model improves on the old one by about 38% of it. And since each of the various mean scores has decreased, that’s another mark in favor of the new Lasso model.
Unfortunately, these achievements do not complete the mission.
Operation: Dancing leaves
I will admit to being a bit biased towards tree-based models. They work by forming a set of conditions to run the data through and making predictions from the results. The best part is that tree-based models are flexible: they can be used for both regression and classification. As my problem deals with regression, I opted to use light gradient boosting with DART (a sort of super tree model) to attempt to achieve my mission.
The results in Fig. 13 and Fig. 14 are precisely why I’m biased towards decision trees, albeit with a little boost. Also, the model is pretty consistent:
Checking the R² across multiple splits of the data, a technique known as cross validation, speaks further to the consistency of the model. Of course, it also shows the effect of the model’s bias, as it doesn’t always perform close to the mean (Fig. 15).
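The cross-validation check itself is a one-liner with scikit-learn. The model and data below are stand-ins; the post runs this on the boosted model and the UN data:

```python
# A sketch of scoring R² across five folds with cross_val_score.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

# Each fold holds out a different slice of the data and scores the model on it.
scores = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=5)
print(scores.round(3), round(scores.mean(), 3))
```

Consistently high fold scores, with little spread, are what “consistency” means here.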
For the most part the results convinced me, because I trust the regression metrics I used. The mission certainly seemed like a success, but there were still some steps to be taken. Everyone knows the best part of a mission is the debriefing, and sometimes that includes visualizations! But first, let’s give the naked eye a chance on that light gradient boosting model.
At first glance you can definitely see the errors; however, they’re not too bad when compared to the results of earlier models:
Visualizing the predictions against the actual values of the test data, you can see how the predictions closed in on them.
Conclusion:
Fig. 19 and Fig. 20 both point to population’s role in the model: total population plays a pretty important part in the light gradient boosting model’s predictions.
To the question of what sort of things (features) determine the number of births period to period, the true answer is very broad. From the UN data, the most important feature, unsurprisingly, is the population total (Fig. 19). Funnily enough, ranking the features by how shuffling them impacts the model (a.k.a. permutation importances) tells a very different story:
Fig. 21 shows that the most influential feature by that measure is the number of male deaths in the period. That initially struck me as odd, but it makes more sense when you consider the trend of births spiking near times of war (e.g. the Baby Boom), which would influence the number of male deaths as well. The feature importances and permutation importances agree that features like knowing whether the fertility rate is in decline, and the ratio of male to female births, were not beneficial to the model.
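Permutation importance can be computed with scikit-learn as in the sketch below. The model and three toy features (one strong, one weak, one pure noise) are stand-ins for the post’s LightGBM model and UN features:

```python
# A hedged sketch of permutation importances: shuffle each feature and
# measure how much the model's score drops.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))  # columns: strong signal, weak signal, noise
y = 10 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# The strong feature should dominate; the pure-noise column should sit near 0.
print(result.importances_mean.round(2))
```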
Mission Accomplished
We set out on a journey to find a model that could predict births within a reasonable margin of error. The light gradient boosting model delivers those results. Of course, now there’s another question. You’ve got all this information, and a great model…what can be done with it? What can you do with a model that predicts birth counts?
Imagine someone collecting world data who has struggled to get all the information they need. Say they are having trouble obtaining birth counts and need a rough estimate to complete their dataset: the model could be used to fill in that missing data. And there are more ways to apply the model.
Building a bridge from Data Science to Fictional Writing:
Imagine being a writer with the goal of realistic world-building. There aren’t many resources to aid with this goal, and doing all the research and math required can be pretty exhausting, depending on what kind of world you’re building. A predictive model such as the one built in this post could certainly aid the process. For example:
This map shows us the thriving kingdom of Galtrea; some data has been gathered, but birth counts were never taken into account. However, the number of births for a particular year in the story is something that is needed. Let’s say this is the data that’s been collected:
Using an LGB model similar to the winning model above, we applied the information about Galtrea’s year 1167 to the model…
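The shape of that step, in a heavily hedged sketch: the kingdom’s figures, feature names, and the stand-in linear model are all invented here for illustration, standing in for the fitted LGB model and real feature set:

```python
# A sketch of feeding a fictional location's features to a fitted model.
# LinearRegression is a stand-in for the LGB model; all numbers are invented.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in training data (thousands), purely illustrative.
train = pd.DataFrame({"Pop_total": [100, 500, 1000, 5000],
                      "Births": [2.0, 9.5, 21.0, 98.0]})
model = LinearRegression().fit(train[["Pop_total"]], train["Births"])

# One row of invented kingdom stats for the story's year 1167.
galtrea_1167 = pd.DataFrame({"Pop_total": [12_000]})
est = float(model.predict(galtrea_1167)[0])  # estimated births (thousands)
print(round(est, 1))
```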
Voila! In the year 1167, approximately 253,464 babies were born, one of whom was the main character of the story Galtrea belongs to. This gives me, the writer, new information to utilize while creating the reality of the fictional world.
There are many things to learn about the machine learning process, the first being that there’s no such thing as perfect. There is, however, a real possibility of coming close; you just have to know how to learn from, shape, and work with the information you have. Who knows how far we can go with it? Well, I suppose if we get the right data and apply the right model, we could know.
(Edited: visualizations and predictions were updated to match current data. Someone forgot to apply a random state to their train/test splits. Note: always remember to do that…it’s helpful.)