Curious about Birth Counts?

Meghan Felker
13 min read · Aug 28, 2020


Using machine learning regression models to predict the number of births per location by period

Image by Meghan Felker

Overview:

In the last post, Population Pop (in which it was determined that the world's fertility is in decline, but that's not the end of the world), one conclusion reached from looking at the data was that even as fertility declines and population still rises, the number of births seems to operate on its own system from period to period (Fig. 1). A question developed about what sort of things determine the number of births period to period. An answer could be found through machine learning. Using various regression models (algorithms), I was able to create a few models that could predict the number of births period by period with relatively high accuracy. Working backwards from the best-performing model, I will conclude which information is key in determining the number of births each period.

Fig. 1 Line graph portraying births per period from the 1950s to the 2020s

Data Scientist Translator:

  • Dataset/Dataframe: a table of observations belonging to various features
  • Model: an algorithm that uses deductions from data to make predictions
  • Train/Val/Test split: in machine learning, a way of testing the performance of a model is to apply it to data that wasn’t used to train the model. Holding out data in the form of Validation and Test sets helps inform the data scientist that the model is useful for more than the dataset used to create it
  • X matrix/Y vector: X matrix is a dataframe containing the data of the feature(s) to be used on a model. Y vector is a series representing the target feature.
  • Shapiro Rank: a ranking of features by the Shapiro-Wilk statistic, which assesses how normally distributed a feature's observations are; the ranking algorithm considers a single feature at a time (much like a histogram analysis)
  • Function: convenient way to store code so that it is reusable for future applications
  • Mean Squared Error: the average of the squared differences between the estimated values and the actual values; squaring the errors penalizes large differences more heavily
  • Root Mean Squared Error: the square root of the mean squared error, which expresses the error in the same units as the target
  • Mean Absolute Error: an average of the error we can expect from the prediction
  • R² Score: shows how well the predicted values fit the true values, measured as the proportion of the target's variance the model explains
  • Adjusted R²: adds a penalty to R-squared for every predictor added, revealing whether the independent variables genuinely improve the fit between the true and predicted values
  • Explained Variance Score: the share of the total variance in the actual values that remains after subtracting the residual variance of the prediction errors; when the explained variance equals R-squared, the mean error is zero
  • Max Error: worst case error between the predicted value and the actual value
  • Mean Error: the average signed difference between the actual and predicted values; a non-zero mean error indicates bias in the model
  • Partial Dependence Plots: also referred to as PD plots, show the marginal effect of one or more features on the model's predictions

Stat Trekking:

Let’s take a look at the data…

Fig. 2 Sample of the first five rows and last five rows of the data used

Feature Key:

  • Total Fertility: live births per woman (or number of successful births per woman)
  • NRR: net reproduction rate (in other words, the number of surviving daughters per woman)
  • Births: number of births (thousands)
  • LEx: Life expectancy at birth for both sexes combined (years)
  • LEx Male: Male life expectancy at birth (years)
  • LEx Female: Female life expectancy at birth (years)
  • Infant mortality rate: infant deaths per 1,000 live births
  • Under-five mortality: deaths under age five per 1,000 live births
  • Crude death rate: deaths per 1,000 population
  • Deaths: number of deaths, both sexes combined (thousands)
  • Deaths Male: number of male deaths (thousands)
  • Deaths Female: number of female deaths (thousands)
  • NMR: net migration rate per 1,000 population
  • Net Migrants: net number of migrants (thousands)
  • Sex ratio: male births per female births
  • MAC: female mean age of childbearing (12–51 years of age)
  • Pop total: total population (thousands)
  • Pop density: population per square kilometre (thousands)
  • Fertility Declined: indicates if total fertility is in decline (1 = yes)

Fig. 3 Shapiro Ranking of the twenty features remaining after wrangling the data

Data:

Painted image of the U.S.S. Enterprise from Star Trek: The Original Series

Statistics: the first frontier. Data science is made up of the voyages of statistical enterprisers. Its continuing mission: to explore strange and unique data, to seek out new facts to aid civilization… to boldly know what no one has known before. A major difference between trekking the stars and trekking stats is that if someone out for discovery in the stat-trekking world veers too far from the Prime Directive (for the sake of non-Trekkies, we'll define this as a set of rules to prevent malfeasance and consequence), the results can be catastrophic or flawed (e.g. correlation isn't proof of causation; it could be proof of data leakage).

As every journey needs a starting point, and journeying with data science is no different, I began my quest, after forming my question, with the dataset remaining from the Population Pop post. After realizing the dataset was a bit less than perfect, I transformed it to suit my purpose, first creating a column that indicated whether fertility was in decline, and then creating a subset of observations that weren't missing data. From there I wrangled the data with a function, cleaning up column names and removing features that might prove problematic, like 'Location' (removed due to its high number of unique values) and 'Crude Birth Rate' (removed due to data leakage), so that the X matrix for all datasets resembled the one seen in Fig. 2.
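A minimal sketch of that wrangling step, assuming the combined UN dataframe is loaded as df; the exact column names and the way the decline flag is computed are illustrative, not copied from the notebook:

```python
import pandas as pd

def wrangle(df):
    """Prepare the UN population data for modelling (illustrative sketch)."""
    df = df.copy()
    # Flag whether total fertility fell relative to the previous period,
    # computed per location before 'Location' itself is dropped
    df['Fertility Declined'] = (
        df.groupby('Location')['Total Fertility'].diff() < 0
    ).astype(int)
    # Keep only complete observations
    df = df.dropna()
    # Tidy up column names
    df.columns = df.columns.str.strip()
    # Drop problematic features: 'Location' (too many unique values) and
    # 'Crude Birth Rate' (leaks information about the target)
    return df.drop(columns=['Location', 'Crude Birth Rate'])
```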

Baseline:

Fig. 4 Baseline Linear Regression Plot

An important step in machine learning is developing a baseline. A baseline is usually the simplest model you can create to make predictions. It is important to have a baseline because, when creating a model, it helps to have something to compare it to. More often than not, this involves using the mean of the Y vector to make predictions in regression models, or the mode class in classification models. I opted instead to use a simple Linear Regression model with one feature, Total Population. The Linear Regression model seemed a better starting point than using the mean, since the dataset consisted of locations with relatively low birth counts, like Oceania (Fig. 1), and those with high birth counts, like the World in its entirety. Linear Regression doesn't require a normal distribution, and is unlikely to be affected by the skewed data (Fig. 3).
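Here's a sketch of that baseline, assuming the wrangled data has already been split into train and validation frames; the variable and column names are mine:

```python
from sklearn.linear_model import LinearRegression

target = 'Births'
baseline_feature = ['Pop Total']   # single-feature X matrix

X_train, y_train = train[baseline_feature], train[target]
X_val, y_val = val[baseline_feature], val[target]

# Baseline: predict the number of births from total population alone
baseline = LinearRegression()
baseline.fit(X_train, y_train)
baseline_preds = baseline.predict(X_val)
```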

Fig. 5 Ten samples of true values and baseline predictions

Figure 4 shows the visual projection of the model, while Figure 5 gives us a sample of both the true target values and the baseline's predicted values. To the naked eye, the baseline model doesn't appear to perform very well. However, relying on the naked eye to form conclusions is a less than reliable method. To confirm this, the next step in our data journey is deciding which diagnostic processes (or metrics) we will apply to the models to check performance. If you've peeked at the key terms, you already have some idea of which metrics I chose. If you're still lost about what those terms mean, I will explain and show you how I chose to use them.

However well the definitions of these metrics are understood on their own, what matters here is that, together, the regression metrics measure how much one model improves on another.

I feel it important to mention that, of the metrics available for regression models (e.g. R-squared score, mean squared error, max error, etc.), it is not typical to use multiple metrics at once, as there is potential for overkill (a.k.a. counter-productivity). I chose to use multiple metrics anyway, to provide a level of certainty about each model's performance by seeing how it holds up under all of them. I created a function that reports the results of the chosen metrics as such:

Fig. 6 Photo of Regression Metrics function results annotated to exhibit how the results are interpreted
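A sketch of a metrics-reporting function along those lines, built on scikit-learn's metrics module; the exact layout in Fig. 6 differs, and the adjusted R² line uses the standard formula rather than code quoted from the notebook:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, explained_variance_score, max_error)

def regression_metrics(y_true, y_pred, n_features):
    """Report the suite of regression metrics used throughout this post."""
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    # Adjusted R² penalizes R² for every additional predictor
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        'R2': r2,
        'Adjusted R2': adj_r2,
        'Explained Variance': explained_variance_score(y_true, y_pred),
        'Max Error': max_error(y_true, y_pred),
        'Mean Error': np.mean(np.asarray(y_true) - np.asarray(y_pred)),  # non-zero means bias
        'MSE': mse,
        'RMSE': np.sqrt(mse),
        'MAE': mean_absolute_error(y_true, y_pred),
    }

regression_metrics(y_val, baseline_preds, n_features=1)
```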

Interpreting Regression Metrics:

Looking at the results of the regression metrics, the first thing that stands out is the baseline having an R² score of 0.84. The R² score is designed in such a way that it will increase with the number of predictors (features) regardless of how the new predictors actually impact the model. Since the baseline model contains only one predictor and eighteen more features are available to add, even assuming, however arbitrarily, that each added predictor nudges R² up by only 0.001, a model using all the features would automatically look like an improvement on the baseline.

Sounds like easy work? It shouldn't. Remembering the purpose of the baseline, as a comparator for improvement, one is left with the sense that there is little room to improve, especially considering that the data "prime directive" expects a baseline's R² to sit near 0. Regardless of the regression method I use, the most I can improve the baseline's R² by is 15.99 percentage points. This is where the adjusted R² comes into play. Unlike the R² score, the adjusted R² is capable of decreasing when a predictor is less than satisfactory, and it will. The mission with these metrics is to find a model that improves in both R² and adjusted R², with features that positively impact the model.

The next thing that stands out in the regression metrics (Fig. 6) is the difference between the explained variance and R². This indicates that the mean error is not zero, and, paired with the explained variance being greater than R², suggests the simple model assumes a little too much. This is confirmed when weighing in the results of the max error, mean squared error, root mean squared error, and mean absolute error. Looking at those results, though, I'd say they speak again to how little room there is for improvement. Mission Accepted.

Operation: Linear Lasso

Lasso regression is a linear model that regularizes the data, shrinking the weights of less useful features (a form of feature selection) to limit over-fitting the training data and to make the model more comprehensible and precise. Stat-date, cell 32: I admit to being a little suspicious of my data, which is why I decided a good linear model to test against the baseline would be a Lasso regression model. Imagine my disappointment when my results were:

Fig. 7 Regression Metrics comparison results for two regression models (A is the baseline, B is the Lasso model); green rows highlight increases, red rows highlight decreases

Using the same function to produce regression metric scores, I developed another to produce a visual comparison table (Fig. 7). The short version… the first attempt to achieve the mission was unsuccessful; mistakes were made. The longer version: the only improvement in the model is that the max error decreased, but not significantly enough to be of any substance. It wasn't hard to see that I was going to need a better model, so after gathering some tuning advice from a GridSearchCV, I adjusted a few hyper-parameters and added a few more features (the first attempt was as simple as the baseline in having only population total as its feature).
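A sketch of that tuning step; the alpha grid, the scaler, the fold count, and the scoring choice below are placeholders standing in for whatever the actual search used, and X_train_multi is my name for the expanded feature set:

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Candidate regularization strengths -- placeholder values
param_grid = {'lasso__alpha': [0.01, 0.1, 1, 10, 100]}

pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring='neg_mean_absolute_error')
search.fit(X_train_multi, y_train)   # X_train_multi: the expanded feature set

print(search.best_params_)
lasso_model = search.best_estimator_
```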

Fig. 8 Sample of the X train data for the hyper-tuned Lasso Regression model

With confidence and excitement ablaze, I fit my new model to the data (Fig. 8) with the expectation of great improvement, only to reveal…

Fig. 9 Regression Comparison Results for the Validation data from the Baseline(A) and new Lasso(B) model

the predictions were:

Fig. 10 sample of predictions from new lasso model

Okay, so maybe "great" improvement was asking a bit much, but for the sake of rationalization… First, the max error increased, which was less than ideal. Second, the mean error has not descended to zero, though it did improve a bit. Now, let's go back to that 15.99% best possible improvement in R². If that's my scale for improvement, the new model improves on the old one by about 38%. And looking at the various mean scores, each one has decreased, another mark in favor of the new Lasso model.

Fig. 11 Scatter plot displaying the true values and predicted values, and the difference between them

Unfortunately, these achievements do not complete the mission.

Operation: Dancing leaves

Fig. 12 Example of how a decision is made in the light gradient boosting model

I will admit to being a bit biased towards tree-based models. Tree-based models work by forming a set of conditions to run the data through and making predictions from the results. The best thing is that tree-based models are flexible and can be used for both regression and classification. As my problem deals with regression, I opted to use a light gradient boosting DART regressor (a sort of super tree model) to attempt to achieve my mission.
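A sketch of such a model using LightGBM's scikit-learn API; the hyper-parameter values shown are illustrative, not the tuned ones from the notebook:

```python
from lightgbm import LGBMRegressor

# DART = Dropouts meet Multiple Additive Regression Trees
lgb_model = LGBMRegressor(
    boosting_type='dart',
    n_estimators=500,      # illustrative values,
    learning_rate=0.1,     # not the tuned hyper-parameters
    num_leaves=31,
    random_state=42,
)
lgb_model.fit(X_train_multi, y_train)
lgb_preds = lgb_model.predict(X_val_multi)
```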

Fig. 13 Regression Metrics for Validation data comparing the baseline model to the hyper-tuned light gradient boosting model

The results in Fig. 13 and Fig. 14 are precisely why I’m biased towards decision trees, albeit with a little boost. Also, the model is pretty consistent:

Fig. 14 Regression Metrics for Test data comparing baseline to LGBM

Checking the R² across multiple versions of the data, in a technique known as cross-validation, speaks further to the consistency of the model; of course, it also shows the effect of the model's bias, as it doesn't always score close to the mean (Fig. 15).

Fig. 15 Cross Validation Scores for the Validation and Test data on the LGB model
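For reference, a sketch of how a cross-validation check like Fig. 15 can be produced with scikit-learn's cross_val_score; the fold count and variable names are assumptions:

```python
from sklearn.model_selection import cross_val_score

# Five-fold cross-validated R² on the validation and test sets
val_scores = cross_val_score(lgb_model, X_val_multi, y_val, cv=5, scoring='r2')
test_scores = cross_val_score(lgb_model, X_test_multi, y_test, cv=5, scoring='r2')

print('Validation R²:', val_scores.round(3), 'mean:', val_scores.mean().round(3))
print('Test R²:      ', test_scores.round(3), 'mean:', test_scores.mean().round(3))
```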

The results for the most part convinced me, because I trust the regression metrics I used. The mission certainly seemed like a success, but there were still some steps to be taken. Everyone knows the best part of a mission… debriefing, and sometimes that includes visualizations! But first, let's give the naked eye a chance on that light gradient boosting model.

Fig. 16 Sample of the Predictions from the LGB model of the test data

At first glance you can definitely see the errors; however, they're not too bad when compared to the results of the earlier models:

Fig. 17 Table of samples from the actual birth counts compared to the predictions from the various models

Visualizing the predictions against the actual values of the test data, you can see how the predictions closed in on them.

Fig. 18 Actual birth counts against LGBM predictions for test data

Conclusion:

Fig. 19 LGBM Feature Importances (test)
Fig. 20 PD plot showing the working relationship between the population total and the number of births (test)

Fig. 19 shows that population total plays a very important role in the light gradient boosting model, while Fig. 20 shows the shape of that role: how the model's predicted births change with the population total.
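One way to reproduce plots along the lines of Fig. 19 and Fig. 20, using LightGBM's built-in importances and scikit-learn's partial dependence display; the original figures may have been drawn with different tools, and the 'Pop Total' column name is assumed:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.inspection import PartialDependenceDisplay

# Fig. 19-style: split-based feature importances from the fitted booster
importances = pd.Series(lgb_model.feature_importances_,
                        index=X_test_multi.columns).sort_values()
importances.plot.barh(title='LGBM Feature Importances')

# Fig. 20-style: partial dependence of predicted births on total population
PartialDependenceDisplay.from_estimator(lgb_model, X_test_multi, ['Pop Total'])
plt.show()
```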

To the question of what sort of things (features) determine the number of births period to period, the true answer is very broad. From the data obtained from the UN, the most important feature, unsurprisingly, is the population total (Fig. 19). Funnily enough, the ranking of how those features impact the predictions when shuffled (a.k.a. permutation importances) is very different:

Fig. 21 Permutation Importances for LGBM Regression

Fig. 21 shows that the most influential feature is the number of male deaths in the period. That initially struck me as odd, but consider that births tend to spike near times of war (e.g. the Baby Boom), which would influence the number of male deaths as well. The feature importances and permutation importances agree that features like knowing whether the fertility rate is in decline, and the ratio of male to female births, were not beneficial to the model.
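A sketch of how permutation importances like those in Fig. 21 can be computed with scikit-learn; the post's figure may have come from another library, and the variable names are assumptions:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the R² score drops
result = permutation_importance(lgb_model, X_test_multi, y_test,
                                scoring='r2', n_repeats=10, random_state=42)

perm_importances = pd.Series(result.importances_mean, index=X_test_multi.columns)
print(perm_importances.sort_values(ascending=False))
```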

Mission Accomplished

We set out on a journey to find a model that could predict births within a reasonable margin of error. The light gradient boosting model delivers those results. Of course, now there’s another question. You’ve got all this information, and a great model…what can be done with it? What can you do with a model that predicts birth counts?

Imagine someone collecting world population data and struggling to get all the information they need. In this scenario, they are having trouble getting birth counts and need a rough estimate to complete their data. The model could be used to fill in that missing data. And there are more ways to apply the model.

Building a bridge from Data Science to Fictional Writing:

Imagine being a writer with the goal of realistic world-building. There aren't many resources to aid with this goal, and it can be pretty exhausting doing all the research and math required, depending on what kind of world you're building. A predictive model such as the one built in this post could certainly be used to aid the process. For example:

Map of fictional Earth-like world pointing out the Kingdom of Galtrea

This map shows us the thriving kingdom of Galtrea. Some data has been gathered, but birth counts were never taken into account. However, the number of births for a particular year in the story is something that is needed. Let's say this is the data that's been collected:

Galtrean data for year 1167

Using an LGB model similar to the winning model above, we apply the information about Galtrea's year 1167 to the model…
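A minimal sketch of that step, assuming the fictional observation lives in a one-row DataFrame called galtrea_1167 with the same feature columns the model was trained on (both names are hypothetical):

```python
# Predict births for the fictional kingdom's year 1167
predicted_births = lgb_model.predict(galtrea_1167)
galtrea_1167 = galtrea_1167.assign(Births=predicted_births.round())
```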

updated observation with newly predicted births column

Voila! In the year 1167, approximately 253,464 babies were born, one of whom was the main character of the story Galtrea belongs to. This gives me, the writer, new information to utilize while creating the reality of the fictional world.

There are many things to learn about the machine learning process, the first being that there's no such thing as perfect. There is, however, a real possibility of coming close; you just have to know how to learn from, shape, and work with the information you have. Who knows how far we can go with it? Well, I suppose if we get the right data and apply the right model, we could know.

(Edited: visualizations and predictions were updated to match current data. Someone forgot to apply a random state to their train_test splits. Note: always remember to do that…it’s helpful)

Notebook

Portfolio
