Using Multivariate Feature Implementation to predict HIV Rates in Third World Countries

Hayden Poore
Analytics Vidhya
Published in
4 min readSep 29, 2019
Photo by Bret Kavanaugh on Unsplash

For my module 2 assignment I found a data set on world development indicators from the world bank (https://datacatalog.worldbank.org/dataset/world-development-indicators). This data set was massive and had hundreds of indicators for hundreds of countries ranging from 1960 to 2018 and had over 400,000 rows of data. I noticed that many indicators involved HIV rates and decided to analyze the percentage of people infected with HIV from 1980–2018 in third world countries.

After doing some research and finding out which countries had robust data I narrowed down my analysis to the countries of Lesotho, Botswana, Zimbabwe, Zambia, Malawi and Uganda. These countries have rates that are all among the highest in the world.

Loading in and looking at data
Cleaning data frame
How the data frame looks after running previous code

My next task was to narrow the data frame down even more to just the countries that I wanted and the HIV percentage of the population.

Creating a rates data frame by country
Rates data frame

The first thing I noticed was that there is no data prior to 1990 and in order to conduct a more complete analysis I needed to find a way to estimate those numbers. In the 1980’s HIV rates worldwide were extremely high and I can assume that they would be high for these countries as well.

I decided to plot the data that I had so far just to better understand the data that I was dealing with.

Code to create graph
Graph

The strategy that I took to fill this missing data came to me from one of the readings posted to canvas (https://scikit-learn.org/stable/modules/impute.html). Multivariate feature imputation models each feature (percentage) with missing values as a function of other features, and uses that estimate for imputation. This was my first experience with machine learning and although I know that I made some mistakes along the way I really enjoyed the process of better understanding how to use machine learning algorithms.

How the iterative imputer module from sklearn works is that it is fitted with an multi-dimensional array ex: [1,2] [3,4] [5, NaN] and then is passed another array with NaN values and makes an estimate as to what that value would be.

The first step is to convert the data from each country into an array structured like this [year, percent][year, percent] and have the algorithm predict the percentage based on the year and previous percentages.

Create arrays for algorithm

Once I had the arrays it was now time to pass them into the IterativeImputer algorithm.

Predicting percentages

I then needed to update my data frame with the calculated estimates

Updating data frame
Head of updated data frame

Now that I had my updated data it was time to re create the plot with the new data.

Updated graph

Right away it is apparent that the algorithm was not very accurate. I expected this as I am still relatively new to sklearn and machine learning algorithms. But the numbers that the algorithm produced validated the fact that HIV rates in the 1980’s were reasonably higher than today’s standards. It seems that HIV rates in the third world are declining and this can be accredited to increase in global healthcare spending and access to medical facilities.

--

--

Hayden Poore
Analytics Vidhya

Information and Data Science Student with a minor in Business Analytics at the University of Colorado Boulder. Currently seeking a Summer 19 Internship.