
Logistic Regression with Spark

Vanessa Fotso
Published in CodeX

6 min read · Mar 14, 2021


As I dive into Spark, in this post I will be analyzing the Low Birth Weight dataset.

The CSV file containing the dataset analyzed here can be found in my repo: https://github.com/vanessuniq/Logistic-Regression-With-Spark

The following is my analysis:

Logistic Regression on Low-Birth-Weight Data

Objective

Low birth weight is a significant concern for infants because it leaves their bodies more fragile, making it harder for them to eat, breathe, grow, maintain their body temperature, or fight infections. The objective of this analysis was to assess the factors that influence low birth weight in newborn babies and to raise awareness among pregnant mothers. Low birth weight can be evaluated through maternal, nutritional, and financial factors, and it can lead to many disorders observed in infants. A logistic regression model was used in this analysis to identify the most influential factors in predicting low birth weight, using the medical records of eligible mothers.

Dataset Description

The low-birth-weight dataset is composed of 189 observations representing births at the Baystate Medical Center, Springfield, Massachusetts, in 1986 (Hosmer, Lemeshow, & Sturdivant, 2013). The dataset has 9 variables, with the variable of interest being the birth weight of newborn children, a binary variable describing the risk of a child being born with low birth weight (0 represents a normal birth weight of at least 2.5 kg, and 1 indicates a low birth weight under 2.5 kg). The predictor variables include the mother's age (numeric), race (categorical with 3 levels), smoking status during pregnancy (categorical with 2 levels), history of premature labor (categorical), history of hypertension (categorical with 2 levels), presence of uterine irritability (categorical with 2 possible outcomes), and the number of physician visits during the first trimester. The ID variable serves as a unique identifier and is irrelevant to the study.
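The dataset can be loaded directly into a Spark DataFrame from the CSV file. Below is a minimal PySpark sketch; the file name lowbwt.csv is an assumption, so point it at the CSV from the repo:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LowBirthWeight").getOrCreate()

# lowbwt.csv is a placeholder name; use the CSV file from the repo
df = spark.read.csv("lowbwt.csv", header=True, inferSchema=True)
df.printSchema()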

Data Exploration

The data exploration reveals that the dataset is heavily imbalanced, with about 31% of observations being infants born with low weight and 69% with normal weight.
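This frequency table can be produced with a simple aggregation; a sketch, assuming the loaded DataFrame is named df and the label column is named LOW:

# Count observations per class of the binary label
df.groupBy("LOW").count().show()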

+---+-----+
|LOW|count|
+---+-----+
|  1|   59|
|  0|  130|
+---+-----+

Table 1: Low Birth Weight Frequency Table

Additionally, when analyzing the mothers' race against the occurrence of low birth weight, it was found that 42% of black women had children born with low birth weight, compared with 24% for white women and 37% for other races. This suggests that black women are at a higher risk of having children with low birth weight.

Figure 1: Low Birth Weight by Mothers’ Race

Similarly, women who smoke during pregnancy tend to have a higher risk than those who do not. The data shows that 40% of women who smoked gave birth to children with low birth weight, while the percentage was considerably lower for non-smoking women (25%).

Figure 2: Low Birth Weight by Mothers' Smoking Status
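Both of these breakdowns can be computed with the same aggregation: averaging the binary LOW column within each group gives the proportion of low-birth-weight births. A sketch, assuming the columns are named RACE and SMOKE as in the standard dataset:

from pyspark.sql import functions as F

# The mean of a 0/1 label within each group is the proportion of LBW births
df.groupBy("RACE").agg(F.avg("LOW").alias("lbw_rate")).show()
df.groupBy("SMOKE").agg(F.avg("LOW").alias("lbw_rate")).show()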

Lastly, the age variable was found to be well distributed across the dataset, with the average age being about the same in both classes (low and normal weight); however, the cases of low birth weight were mostly observed in women between 20 and 25 years old.

Figure 3: Age Distribution of Women Having Children with Low Birth Weight
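The per-class age summary can be verified with another aggregation; a sketch, assuming the age column is named AGE:

# Compare the age distribution between the two classes
df.groupBy("LOW").agg(
    F.avg("AGE").alias("mean_age"),
    F.stddev("AGE").alias("sd_age")).show()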

Data Modeling

The dataset was not preprocessed prior to building the model. Given the small size of the dataset, it was partitioned into a 60% training set (108 cases) and a 40% test set (81 cases). This was done so that the model would have enough blind data to test its efficiency. Additionally, a random seed was applied to ensure that the analysis is reproducible. Finally, the model was built with a maximum of 50 iterations and the regParam parameter set to 0.001 to optimize the results.
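A sketch of this modeling step in PySpark. The split ratios, maxIter, and regParam follow the description above; the feature column names and the seed value are assumptions:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble the predictors into a single feature vector (column names assumed)
features = ["AGE", "RACE", "SMOKE", "PTL", "HT", "UI", "FTV"]
assembler = VectorAssembler(inputCols=features, outputCol="features")
data = assembler.transform(df).withColumnRenamed("LOW", "label")

# 60/40 split; the seed value is arbitrary but fixed for reproducibility
train, test = data.randomSplit([0.6, 0.4], seed=42)

lr = LogisticRegression(maxIter=50, regParam=0.001)
model = lr.fit(train)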

Results

The model properly classifies children with low birth weight with 76.1% accuracy on the training set and 67.9% on the test set. The accuracy decreased considerably (by about 8 percentage points) between the two sets, which suggests that the model overfit. This may be because there is not enough data to train and test the model, and because of the disproportion between the low-birth-weight cases and the normal-weight cases. Precision also dropped drastically, from 69.2% on the training set to 44.4% on the test set, with recall going from 50% to 33.3%. This shows that the derived model performed poorly at identifying actual cases of low birth weight (< 2.5 kg) in newborns. Likewise, the F score sharply decreased from 58% on the training set to 38% on the test set, as it depends on the recall and precision values. Furthermore, the model showed an area of 0.4 under the PR curve, which follows from the weak recall and precision values observed, and an area of 0.58 under the ROC curve, which is not far from the minimum value of 0.5 and distant from the maximum value of 1.0. This again confirms that the model fails to identify relevant cases in the data.
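These metrics can be computed with Spark's built-in evaluators; a sketch using the pyspark.ml.evaluation API:

from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

preds = model.transform(test)

# Overall accuracy and F1 on the test set
accuracy = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(preds)
f1 = MulticlassClassificationEvaluator(metricName="f1").evaluate(preds)

# Areas under the ROC and PR curves
auc_roc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(preds)
auc_pr = BinaryClassificationEvaluator(metricName="areaUnderPR").evaluate(preds)

print(accuracy, f1, auc_roc, auc_pr)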

Figure 4: ROC Curve of Low Birth Dataset

Conclusion

A logistic regression model was built to predict the risk factors linked to low birth weight. The mother's race and smoking status were found to be strong predictors of the prevalence of low-birth-weight infants. However, the model fell short, with an accuracy of 67.9% and an F score of 38% on the test set. The results demonstrate that the model is not only unable to classify unknown data, but can barely recognize true instances of low birth weight in infants. Logistic regression fell short here because the dataset is considerably small and the target variable is heavily unbalanced (130 normal cases vs. 59 LOW cases). Due to the small size of the dataset, increasing the number of cases in the test set might not solve the problem of overfitting. One could instead attempt to rebalance the target classes, for example by weighting or resampling the minority class.
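One way to rebalance without resampling is to pass per-row class weights to the classifier through its weightCol parameter. A sketch using inverse-frequency weights; this scheme is an assumption for illustration, not something done in this analysis:

from pyspark.sql import functions as F

# Inverse-frequency weights: the minority class counts more in the loss
n_total = train.count()
n_low = train.filter(F.col("label") == 1).count()
w_low = n_total / (2.0 * n_low)
w_normal = n_total / (2.0 * (n_total - n_low))

weighted = train.withColumn(
    "weight", F.when(F.col("label") == 1, w_low).otherwise(w_normal))

lr_weighted = LogisticRegression(maxIter=50, regParam=0.001, weightCol="weight")
weighted_model = lr_weighted.fit(weighted)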

Preprocessing the dataset before running the model may improve the model's accuracy. Logistic regression is affected by missing values and non-normal distributions, and the input variables must be meaningfully related to the target variable to produce acceptable results. One way to evaluate the relationship of all variables to the target variable is to run a correlation matrix (see the sketch below). Cleaning the dataset and using only significant inputs could greatly improve the accuracy of this model. Another limitation is the broadness of some categorical variables (for instance, a simple yes or no for the smoking variable is not granular enough).

Several other machine learning algorithms, including random forest, decision tree, support vector machine, or Naïve Bayes, could be good candidates for analyzing this dataset. Decision trees can be suitable for this analysis, as they can better handle small datasets by recursively splitting the data along boundaries parallel to the input axes until convergence is reached, and they can work on an unprocessed dataset. Alternatively, Naïve Bayes could be used here, as it requires less training time, which is appropriate for a small dataset with relatively few input variables. Naïve Bayes converges more quickly than discriminative algorithms like logistic regression, and given its simplicity, the problem of overfitting can be avoided.
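A sketch of that correlation check using pyspark.ml.stat.Correlation; the column names again follow the standard dataset:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

# Pack all variables, including the label, into one vector column
cols = ["LOW", "AGE", "RACE", "SMOKE", "PTL", "HT", "UI", "FTV"]
vec = VectorAssembler(inputCols=cols, outputCol="corr_vec").transform(df)

# Pearson correlation matrix across all variables
corr_matrix = Correlation.corr(vec, "corr_vec").head()[0]
print(corr_matrix.toArray().round(2))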

References

Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). New York, NY: Wiley.
