Using Multivariate Feature Implementation to predict HIV Rates in Third World Countries

Published in

Analytics Vidhya

4 min readSep 29, 2019

For my module 2 assignment I found a data set on world development indicators from the world bank (https://datacatalog.worldbank.org/dataset/world-development-indicators). This data set was massive and had hundreds of indicators for hundreds of countries ranging from 1960 to 2018 and had over 400,000 rows of data. I noticed that many indicators involved HIV rates and decided to analyze the percentage of people infected with HIV from 1980–2018 in third world countries.

After doing some research and finding out which countries had robust data I narrowed down my analysis to the countries of Lesotho, Botswana, Zimbabwe, Zambia, Malawi and Uganda. These countries have rates that are all among the highest in the world.

How the data frame looks after running previous code

My next task was to narrow the data frame down even more to just the countries that I wanted and the HIV percentage of the population.

The first thing I noticed was that there is no data prior to 1990 and in order to conduct a more complete analysis I needed to find a way to estimate those numbers. In the 1980’s HIV rates worldwide were extremely high and I can assume that they would be high for these countries as well.

I decided to plot the data that I had so far just to better understand the data that I was dealing with.

The strategy that I took to fill this missing data came to me from one of the readings posted to canvas (https://scikit-learn.org/stable/modules/impute.html). Multivariate feature imputation models each feature (percentage) with missing values as a function of other features, and uses that estimate for imputation. This was my first experience with machine learning and although I know that I made some mistakes along the way I really enjoyed the process of better understanding how to use machine learning algorithms.

How the iterative imputer module from sklearn works is that it is fitted with an multi-dimensional array ex: [1,2] [3,4] [5, NaN] and then is passed another array with NaN values and makes an estimate as to what that value would be.

The first step is to convert the data from each country into an array structured like this [year, percent][year, percent] and have the algorithm predict the percentage based on the year and previous percentages.

Once I had the arrays it was now time to pass them into the IterativeImputer algorithm.

I then needed to update my data frame with the calculated estimates

Now that I had my updated data it was time to re create the plot with the new data.

Right away it is apparent that the algorithm was not very accurate. I expected this as I am still relatively new to sklearn and machine learning algorithms. But the numbers that the algorithm produced validated the fact that HIV rates in the 1980’s were reasonably higher than today’s standards. It seems that HIV rates in the third world are declining and this can be accredited to increase in global healthcare spending and access to medical facilities.

Using Multivariate Feature Implementation to predict HIV Rates in Third World Countries

Written by Hayden Poore