Find relation between food Sub groups and ckd mortality: CKD: Chronic Kidney Disease

The code shows the relation between food sub groups and CKD mortality Food Sub Group concept is used as are used by CDC/USDA/USRDS

Dietray Guidelines are reviewed every couple of years and published https://health.gov/our-work/food-nutrition/previous-dietary-guidelines/2015

A background database was there where some data pre-processing and adjustments were made

Food Groups/Subgroups Survey Data and Mortality Data were merged together to create the dataset as used in this project Mortality Data is provided by USRDS

Food intake surveys are conducted by USDA and also published periodically https://www.nal.usda.gov/fnic/food-and-nutrition-surveys

The code will use PCA to find the important food groups affecting mortality

Will use heatmap to show the correlations between mortality and food groups

The code also shows univariate and bi-variate analysis of the food group data

The final outcome is correlation of Mortality and Food Groups

# PCA section, Correlation, Heatmaps, and then Bivariate plots can be seen to be important
# Might have to adjust/improve my conclusions from the plots in future work
# Target variable ESRD patients: Avg. Annual Mortality rates is given the most importance
# One another target variable is: ESRD patients: Total (or %) deaths for target year
# Other target variables: Dialysis patients: Total (or %) deaths for target year
# For plotting
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Is not using ratio with recommended amounts. Ratios as used with Food Groups did not show any difference with the actual amount taken. However, that might change with food subgroups.

# data exploration
import pandas as pd
df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head(30)
png
df.describe()
png
df_actual_only = df.drop(['Age-group: From USRDS', 'Age-group: To USRDS', 'Gender'], axis=1)
df_actual_only.head()
png
####----
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_actual_only)
standardisedX = pd.DataFrame(standardisedX, index=df_actual_only.index, columns=df_actual_only.columns)
standardisedX.apply(np.mean), '---\n', standardisedX.apply(np.std)
(Actual Dark-green vegetables Intake -1.727014e-16
Actual Red and orange vegetables Intake -3.083953e-16
Actual Starchy vegetables Intake 5.242720e-17
Actual Other vegetables Intake 2.220446e-16
Actual Whole grains intakes 1.258253e-15
Actual Taken Refined grains amount -1.619075e-15
Avg Meat, Poultry and Eggs subgroup taken 3.577385e-16
Avg Seafood taken 2.158767e-16
Avg Nuts, Seeds, and Soy Products taken 6.167906e-16
Avg Added Sugars/Sugars and sweets taken 2.652199e-16
Avg Oils taken -2.713879e-16
Avg Solid Fats taken -1.233581e-16
Avg Milks and milk drinks taken 2.405483e-16
Avg Water, noncarbonated intake 0.000000e+00
Avg Alcoholic beverages intake -2.713879e-16
Avg Nonalcoholic beverages taken -3.824102e-16
Avg Milk desserts, sauces, gravies taken 7.401487e-16
ESRD patients: Total (or %) deaths for target year 9.868649e-17
ESRD patients: Avg. Annual Mortality rates -9.868649e-17
Dialysis patients: Total (or %) deaths for target year 4.934325e-17
Dialysis patients: Avg. Annual Mortality rates 7.401487e-17
dtype: float64,
'---\n',
Actual Dark-green vegetables Intake 1.0
Actual Red and orange vegetables Intake 1.0
Actual Starchy vegetables Intake 1.0
Actual Other vegetables Intake 1.0
Actual Whole grains intakes 1.0
Actual Taken Refined grains amount 1.0
Avg Meat, Poultry and Eggs subgroup taken 1.0
Avg Seafood taken 1.0
Avg Nuts, Seeds, and Soy Products taken 1.0
Avg Added Sugars/Sugars and sweets taken 1.0
Avg Oils taken 1.0
Avg Solid Fats taken 1.0
Avg Milks and milk drinks taken 1.0
Avg Water, noncarbonated intake 1.0
Avg Alcoholic beverages intake 1.0
Avg Nonalcoholic beverages taken 1.0
Avg Milk desserts, sauces, gravies taken 1.0
ESRD patients: Total (or %) deaths for target year 1.0
ESRD patients: Avg. Annual Mortality rates 1.0
Dialysis patients: Total (or %) deaths for target year 1.0
Dialysis patients: Avg. Annual Mortality rates 1.0
dtype: float64)
y = abs(standardisedX['ESRD patients: Avg. Annual Mortality rates']) > 0.5
standardisedX_with_target = standardisedX
standardisedX = standardisedX.drop(['ESRD patients: Avg. Annual Mortality rates', 'ESRD patients: Total (or %) deaths for target year',
'Dialysis patients: Total (or %) deaths for target year', 'Dialysis patients: Avg. Annual Mortality rates'], axis=1)
from sklearn import decomposition
#pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pca
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summary
####----
summary = pca_summary(pca, standardisedX)
####----
Importance of components:
png
# First five component can define over 90%# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
np.sum( summary.sdev ** 2 )
plt.rcParams['figure.figsize'] = 8, 8

Important Components

def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.title('PCA variance in Components')
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_variance' + '.png')
plt.show()

screeplot(pca, standardisedX)
#plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_variance' + '.png')
png

comp 3 to 4 has the most slope change — comp 7 starts the flat line

first three are the most important

Will retain first five though first three will be analyzed primarily

#summary.sdev**2
#pca.components_[0]
#np.sum(pca.components_[0]**2)
####----
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function with some adjustments
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pc
####----
calcpc(standardisedX, pca.components_[0])
pca.transform(standardisedX)[:, 0]
pca.transform(standardisedX)[:, 0]
pca.components_[1]
np.sum(pca.components_[1]**2)
#highest loadings for
0.9999999999999989# Define high and low mortality
# The code below is not used.
# y as defined earlier will rather be used

#import sklearn
#from sklearn import preprocessing
#standardisedX_mortality = sklearn.preprocessing.scale(df_diff_ratio_mortality)
#standardisedX_mortality = pd.DataFrame(standardisedX_mortality, index=df_diff_ratio_mortality.index, columns=df_diff_ratio_mortality.columns)
#standardisedX_mortality.apply(np.mean), '---', standardisedX_mortality.apply(np.std)
#standardisedX_mortality
####----
# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
#plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_separating_high_low_mortality' + '.png')
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Class separations True = High Mortality')
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_separating_high_low_mortality' + '.png')
standardisedX
png
png
# first two components really cannot separate high and low mortality very effectively

Plot first two components

Next section (i.e. four components) is more important than this

####----
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21


pca_components_cont = pca.components_[0:2]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(standardisedX.columns)),standardisedX.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_food_groups_what_contributes_to_PCA_components' + '.png')
plt.show() #
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component
png

First Five components

# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
plt.rcParams['figure.figsize'] = 30, 16
components_to_count = 5

pca_components_cont = pca.components_[0:components_to_count]
ylabels = []

for c in range (components_to_count):
ylabels.append('Component ' + str(c) )

xlabels = []
for c in range (components_to_count):
xlabels.append(c)

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks(xlabels, ylabels, fontsize=10)
plt.colorbar()
plt.xlabel('\nContributors to the first five components')
plt.xticks(range(len(standardisedX.columns)), standardisedX.columns, rotation=90, ha='left')
plt.show()#
png
standardisedX.columnsIndex(['Actual Dark-green vegetables Intake',
'Actual Red and orange vegetables Intake',
'Actual Starchy vegetables Intake', 'Actual Other vegetables Intake',
'Actual Whole grains intakes', 'Actual Taken Refined grains amount',
'Avg Meat, Poultry and Eggs subgroup taken', 'Avg Seafood taken',
'Avg Nuts, Seeds, and Soy Products taken',
'Avg Added Sugars/Sugars and sweets taken', 'Avg Oils taken',
'Avg Solid Fats taken', 'Avg Milks and milk drinks taken',
'Avg Water, noncarbonated intake', 'Avg Alcoholic beverages intake',
'Avg Nonalcoholic beverages taken',
'Avg Milk desserts, sauces, gravies taken'],
dtype='object')

First five components explain the data by 90%

Contributors for the First Component: Positively: Avg Milks and milk drinks taken -> (0.4)

Negatively (-0.3 to -0.4) contributing: Actual Dark-green vegetables, Actual Starchy vegetables, Actual Whole grains, Avg Meat Poultry and Eggs subgroup, Avg Oils taken, Avg Water noncarbonated intake, Avg Nonalcoholic beverages taken

Contributors for 2nd Component: Positively: Actual Red and orange vegetables Intake, Actual Other vegetables Negatively: ‘Avg Nuts, Seeds, and Soy Products’, ‘Avg Added Sugars/Sugars and sweets’, ‘Avg Alcoholic beverages’

Contributors for 3rd component: Positively: ‘Actual Taken Refined grains amount’, ‘Avg Milk desserts, sauces, gravies taken’ Negatively: Avg Water noncarbonated intake’

Features that do not contribute much: ‘Avg Seafood taken’, ‘Avg Solid Fats taken’,

Considering all Five Components:

Positively the most contributing:
‘Actual Taken Refined grains amount’, ‘Avg Milk desserts, sauces, gravies taken’ ‘Avg Nuts, Seeds, and Soy Products taken’, ‘Actual Red and orange vegetables Intake’

Most Negatively:
‘Actual Other vegetables Intake’, ‘Avg Added Sugars/Sugars and sweets taken’, Avg Alcoholic beverages intake

Can do regression with all food sub groups except ‘Avg Seafood taken’, ‘Avg Solid Fats taken’. Though will do a comparison with them as well

standardisedX_important = standardisedX_with_target.drop(['Avg Seafood taken', 'Avg Solid Fats taken'], axis=1)standardisedX_important.head()
png
standardisedX_important.corr()
png
plt.figure(figsize=(12, 12))


corr = standardisedX_important.corr() #df_diff_ratio_with_mortality.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True,
cmap="BuPu");
plt.suptitle('Heatmap, Correlation with Mortality Rates');
plt.savefig('heatmap-ckd-mortality-subgroup-regression-after-pca.png')
standardisedX_important.columns
Index(['Actual Dark-green vegetables Intake',
'Actual Red and orange vegetables Intake',
'Actual Starchy vegetables Intake', 'Actual Other vegetables Intake',
'Actual Whole grains intakes', 'Actual Taken Refined grains amount',
'Avg Meat, Poultry and Eggs subgroup taken',
'Avg Nuts, Seeds, and Soy Products taken',
'Avg Added Sugars/Sugars and sweets taken', 'Avg Oils taken',
'Avg Milks and milk drinks taken', 'Avg Water, noncarbonated intake',
'Avg Alcoholic beverages intake', 'Avg Nonalcoholic beverages taken',
'Avg Milk desserts, sauces, gravies taken',
'ESRD patients: Total (or %) deaths for target year',
'ESRD patients: Avg. Annual Mortality rates',
'Dialysis patients: Total (or %) deaths for target year',
'Dialysis patients: Avg. Annual Mortality rates'],
dtype='object')
png

Most Positvely Correlated: ‘Actual Other vegetables Intake’, ‘Actual Red and orange vegetables Intake’, ‘Actual Starchy vegetables Intake’,

Most Negatively Correlated: ‘Avg Alcoholic beverages intake’ (-0.79), ‘Avg Added Sugars/Sugars and sweets taken’ (-0.64), ‘Avg Nuts, Seeds, and Soy Products taken’ (-0.55)

Check: the data

# Positive correlationplt.rcParams['figure.figsize'] = 5, 5 
positively = standardisedX_important[ ['Actual Other vegetables Intake', 'Actual Red and orange vegetables Intake', 'Actual Starchy vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_other = positively[['Actual Other vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_other.plot.line()
plt.savefig('positive_subgroup_line_1')

positively_red = positively[['Actual Red and orange vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_red.plot.line()
plt.savefig('positive_subgroup_line_2')

positively_starchy = positively[['Actual Starchy vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_starchy.plot.line()
plt.savefig('positive_subgroup_line_3')
png
png
png

negatively for Food Subgroup

negatively= standardisedX_important[ ['Avg Alcoholic beverages intake', 'Avg Added Sugars/Sugars and sweets taken', 'Avg Nuts, Seeds, and Soy Products taken',  'ESRD patients: Avg. Annual Mortality rates']]
negatively_alcohol = negatively[['Avg Alcoholic beverages intake', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_alcohol.plot.line()
plt.savefig('./saved-images/negatively_subgroup_line_1')

negatively_sugar = negatively[['Avg Added Sugars/Sugars and sweets taken', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_sugar.plot.line()
plt.savefig('./saved-images/negatively_subgroup_line_2')

negatively_nuts = negatively[['Avg Nuts, Seeds, and Soy Products taken', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_nuts.plot.line()
plt.savefig('./saved-images/negatively_subgroup_line_3')
png
png
png

The following code were relevant for exploration phase. For experiment and methodology the above code are relevant

Univariate

Checked and plotted each column.

Recommended intake amount is not included in the data (can be calculated like food groups)

"""
plt.figure(figsize=(16, 5));
#plt.rcParams['figure.figsize'] = 14, 14
df[' Actual Dark-green vegetables Intake'].plot.bar();
#plt.setxlabels(df['age_to'])
plt.suptitle('Actual Dark-green vegetables Intake')
plt.xticks(range(len(df[' Age-group: From USRDS'])), df[' Age-group: From USRDS']);
plt.xlabel('Age Group: From')
"""
"\nplt.figure(figsize=(16, 5));\n#plt.rcParams['figure.figsize'] = 14, 14\ndf[' Actual Dark-green vegetables Intake'].plot.bar();\n#plt.setxlabels(df['age_to'])\nplt.suptitle('Actual Dark-green vegetables Intake')\nplt.xticks(range(len(df[' Age-group: From USRDS'])), df[' Age-group: From USRDS']);\nplt.xlabel('Age Group: From')\n"df = pd.read_csv('./data-for-code/older_no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
#plt.figure(figsize=(16, 5));
plt.rcParams['figure.figsize'] = 14, 7
df1 = df.iloc[:,3:6]
df2 = df.iloc[:,6:9]
df3 = df.iloc[:,9:12]
df4 = df.iloc[:,12:15]
df5 = df.iloc[:,15:20]
df6 = df.iloc[:,20:24]

#df7 = df.iloc[:,22:24]
#df8 = df.iloc[:,30:33]
#df9 = df.iloc[:,33:36]

df1.plot.bar();
plt.title('Univariate plots, Check Legends for Aspects Plotted')

df2.plot.bar();
df3.plot.bar();
df4.plot.bar();
df5.plot.bar();
df6.plot.bar();
#df7.plot.bar();
#df8.plot.bar();
#df9.plot.bar();

plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
Text(0.5,0.98,'Univariate plots, Check Legends for Aspect')
png
png
png
png
png
png
df1.plot.line();
df2.plot.line();
df3.plot.line();
df4.plot.line();
df5.plot.line();
df6.plot.line();
#df7.plot.line();


plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
png
png
png
png
png
png
df1.plot.area();
df2.plot.area();
df3.plot.area();
df4.plot.area();
df5.plot.area();
df6.plot.area();
#df7.plot.area();

plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
png
png
png
png
png
png
df1.plot.hist();
df2.plot.hist();
df3.plot.hist();
df4.plot.hist();
df5.plot.hist();
df6.plot.hist();
#df7.plot.hist();


plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
png
png
png
png
png
png
df1.plot.box();
df2.plot.box();
df3.plot.box();
df4.plot.box();
df5.plot.box();
df6.plot.box();
#df7.plot.box();


plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
png
png
png
png
png
png
df.corr()
png
import numpy as np
import pandas as pd
from subprocess import check_output

from IPython.display import display, HTML
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches

import seaborn as sns
%matplotlib inline
df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head()
png
df.corr()
png

Age-group: From USRDS Age-group: To USRDS Actual Dark-green vegetables Intake Actual Red and orange vegetables Intake Actual Starchy vegetables Intake Actual Other vegetables Intake Actual Whole grains intakes Actual Taken Refined grains amount Avg Meat, Poultry and Eggs subgroup taken Avg Seafood taken … Avg Solid Fats taken Avg Milks and milk drinks taken Avg Water, noncarbonated intake Avg Alcoholic beverages intake Avg Nonalcoholic beverages taken Avg Milk desserts, sauces, gravies taken ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates ESRD patients: Total (or %) deaths for target year 0.909230 0.909458 0.056973 0.572518 0.517307 0.542376 -0.469825 0.112844 -0.036558 0.107024 … -0.236728 -0.598140 -0.023290 -0.738501 0.384077 0.353425 1.000000 0.833280 0.999627 0.868295 ESRD patients: Avg. Annual Mortality rates 0.854852 0.854771 -0.082390 0.545127 0.444551 0.676325 -0.609984 0.057939 -0.216943 -0.113325 … -0.365416 -0.461741 -0.150719 -0.786412 0.192810 0.336838 0.833280 1.000000 0.847302 0.995724 Dialysis patients: Total (or %) deaths for target year 0.914521 0.914720 0.054157 0.577456 0.518309 0.554746 -0.478563 0.114916 -0.042973 0.100156 … -0.241894 -0.598720 -0.028184 -0.744098 0.381084 0.357477 0.999627 0.847302 1.000000 0.880661 Dialysis patients: Avg. Annual Mortality rates 0.878499 0.879277 -0.076082 0.564846 0.453687 0.675571 -0.613042 0.035448 -0.205311 -0.097894 … -0.369415 -0.485633 -0.136271 -0.791673 0.225472 0.314988 0.868295 0.995724 0.880661 1.000000

len(df.columns)24plt.figure(figsize=(12, 8))

corr = df.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
png
df = df.drop(['Gender'], axis=1)df_normalized = (df - df.mean())/ (df.max() - df.min())
df_normalized.head()
png
df_normalized.corr()
png
plt.figure(figsize=(12, 8))

corr = df_normalized.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
png

Bivariate

Bivariate plots on actual amount intake and target variables. will be saved in bivariate_food_subgroup.png. The correlation pattern can be checked in the image saved

The correlation are shown in heatmaps and Pearson’s correlations. Now linearity and non-linearity can be seen using the Bivariate plots below.

The plots show similarity with correlations

df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head()
png
df.columnsIndex(['Actual Dark-green vegetables Intake',
'Actual Red and orange vegetables Intake',
'Actual Starchy vegetables Intake', 'Actual Other vegetables Intake',
'Actual Whole grains intakes', 'Actual Taken Refined grains amount',
'Avg Meat, Poultry and Eggs subgroup taken', 'Avg Seafood taken',
'Avg Nuts, Seeds, and Soy Products taken',
'Avg Added Sugars/Sugars and sweets taken', 'Avg Oils taken',
'Avg Solid Fats taken', 'Avg Milks and milk drinks taken',
'Avg Water, noncarbonated intake', 'Avg Alcoholic beverages intake',
'Avg Nonalcoholic beverages taken',
'Avg Milk desserts, sauces, gravies taken',
'ESRD patients: Total (or %) deaths for target year',
'ESRD patients: Avg. Annual Mortality rates',
'Dialysis patients: Total (or %) deaths for target year',
'Dialysis patients: Avg. Annual Mortality rates'],
dtype='object')
# on actual amounts
# plt.figure(figsize=(16, 300))
#df = df.drop(['Gender', 'Age-group: From USRDS', 'Age-group: To USRDS'], axis=1)
sns.pairplot(df, vars=df.columns, size=5, kind='reg'); # diag_kind='kde',
plt.title('Bivariate Plot, Food Subgroups');
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/bivariate_food_subgroup' + '.png')
plt.show()
png

Bivariate plots with few variables at each plot

‘ESRD patients: Avg. Annual Mortality rates’ is used as the target variable

df1 = df.iloc[:,3:6]
df1['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']


df2 = df.iloc[:,6:9]
df2['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']

df3 = df.iloc[:,9:12]
df3['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']

df4 = df.iloc[:,12:15]
df4['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']


df5 = df.iloc[:,15:17]
df5['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']

#df6 = df.iloc[:,18:21]
df7 = df.iloc[:,17:24]
#df7[' ESRD patients: Avg. Annual Mortality rates'] = df[' ESRD patients: Avg. Annual Mortality rates']
plt.figure(figsize=(14, 14))
sns.pairplot(df1, diag_kind='kde', kind='reg', size=5);

#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=70);
#plt.xticks(rotation=70);
#g.set_xlabels(g.get_xlabels(), rotation=90)
#plt.xlabel(range(len(df1.columns)), df1.columns, rotation=90);
# plt.setyylable('dd', rotation=90);
<Figure size 1008x1008 with 0 Axes>
png
sns.pairplot(df2, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=90);
png
sns.pairplot(df3, diag_kind='kde', kind='reg', size=5);
png
sns.pairplot(df4, diag_kind='kde', kind='reg', size=5);
png
sns.pairplot(df5, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=90);
png
sns.pairplot(df6, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=90);
png

PCA: Just experiment and Exploration only:

Will reuse some methods from a github project and apply on our data

#steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head()
png
df_pca = df.drop(['Age-group: From USRDS', 'Age-group: To USRDS', 'Gender'],axis=1)
import sklearn
from sklearn import preprocessing
import numpy as np
standardisedX = sklearn.preprocessing.scale(df_pca)
standardisedX = pd.DataFrame(standardisedX, index=df_pca.index, columns=df_pca.columns)
standardisedX.apply(np.mean)
Actual Dark-green vegetables Intake -1.727014e-16
Actual Red and orange vegetables Intake -3.083953e-16
Actual Starchy vegetables Intake 5.242720e-17
Actual Other vegetables Intake 2.220446e-16
Actual Whole grains intakes 1.258253e-15
Actual Taken Refined grains amount -1.619075e-15
Avg Meat, Poultry and Eggs subgroup taken 3.577385e-16
Avg Seafood taken 2.158767e-16
Avg Nuts, Seeds, and Soy Products taken 6.167906e-16
Avg Added Sugars/Sugars and sweets taken 2.652199e-16
Avg Oils taken -2.713879e-16
Avg Solid Fats taken -1.233581e-16
Avg Milks and milk drinks taken 2.405483e-16
Avg Water, noncarbonated intake 0.000000e+00
Avg Alcoholic beverages intake -2.713879e-16
Avg Nonalcoholic beverages taken -3.824102e-16
Avg Milk desserts, sauces, gravies taken 7.401487e-16
ESRD patients: Total (or %) deaths for target year 9.868649e-17
ESRD patients: Avg. Annual Mortality rates -9.868649e-17
Dialysis patients: Total (or %) deaths for target year 4.934325e-17
Dialysis patients: Avg. Annual Mortality rates 7.401487e-17
dtype: float64
standardisedX.apply(np.std)Actual Dark-green vegetables Intake 1.0
Actual Red and orange vegetables Intake 1.0
Actual Starchy vegetables Intake 1.0
Actual Other vegetables Intake 1.0
Actual Whole grains intakes 1.0
Actual Taken Refined grains amount 1.0
Avg Meat, Poultry and Eggs subgroup taken 1.0
Avg Seafood taken 1.0
Avg Nuts, Seeds, and Soy Products taken 1.0
Avg Added Sugars/Sugars and sweets taken 1.0
Avg Oils taken 1.0
Avg Solid Fats taken 1.0
Avg Milks and milk drinks taken 1.0
Avg Water, noncarbonated intake 1.0
Avg Alcoholic beverages intake 1.0
Avg Nonalcoholic beverages taken 1.0
Avg Milk desserts, sauces, gravies taken 1.0
ESRD patients: Total (or %) deaths for target year 1.0
ESRD patients: Avg. Annual Mortality rates 1.0
Dialysis patients: Total (or %) deaths for target year 1.0
Dialysis patients: Avg. Annual Mortality rates 1.0
dtype: float64
from sklearn import decomposition
pca = decomposition.PCA().fit(standardisedX)
pca
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summary
summary = pca_summary(pca, standardisedX)Importance of components:
png
# First Five component can define over 91%np.sum(summary.sdev**2)Standard deviation    21.0
dtype: float64
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.show()

plt.rcParams['figure.figsize'] = 8, 8
screeplot(pca, standardisedX)
png
# comp 3 to comp 4 is the most change for slope
# first three or at best first 4 can be retained
summary.sdev**2
png
pca.components_[0]array([-0.29107597, -0.14117176, -0.2469594 , -0.17548548, -0.25393257,
-0.21937776, -0.30982985, -0.28356691, -0.0489583 , -0.14248964,
-0.31487843, -0.26718644, 0.26503007, -0.30282359, -0.14339706,
-0.28353417, -0.23557178, -0.03748978, 0.01047719, -0.03648573,
0.00854257])
np.sum(pca.components_[0]**2)0.9999999999999997# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pc
calcpc(standardisedX, pca.components_[0])array([ 9.84586938, 4.93557236, 1.7245801 , 0.51665275, -2.23906467,
-2.37480972, -1.57908354, -1.82857302, -2.5779144 , -1.48209187,
-1.18821375, -1.74464742, -2.54901027, -1.14542236, -0.61057954,
0.39468681, 0.5706472 , 1.33140195])
pca.transform(standardisedX)[:, 0]array([ 9.84586938, 4.93557236, 1.7245801 , 0.51665275, -2.23906467,
-2.37480972, -1.57908354, -1.82857302, -2.5779144 , -1.48209187,
-1.18821375, -1.74464742, -2.54901027, -1.14542236, -0.61057954,
0.39468681, 0.5706472 , 1.33140195])
pca.components_[1]array([-0.02860489, 0.23507636, 0.18185158, 0.2330107 , -0.22042602,
0.00767883, -0.06542032, -0.01520477, -0.23666744, -0.27940994,
-0.05868199, -0.12681133, -0.19603246, -0.03891072, -0.3029888 ,
0.11437419, 0.10127936, 0.34621579, 0.35198445, 0.34906457,
0.3593115 ])
np.sum(pca.components_[1]**2)1.0000000000000009# for following code : Classes will/might be defined such as like High, low, neutral mortality in final worksorted(standardisedX['ESRD patients: Avg. Annual Mortality rates'])[-0.8508781732392158,
-0.8447380350810376,
-0.807897206131968,
-0.7541709972479083,
-0.6983980756444559,
-0.6114127850702639,
-0.535172736272884,
-0.4788881364895833,
-0.45330422749717386,
-0.4405122730009692,
-0.29059056630545005,
-0.13094697419281548,
0.08753960860236082,
0.32905170949070545,
0.6810862972262587,
1.1635988208231,
1.6916507024264298,
2.9439830476048696]
y = standardisedX['ESRD patients: Avg. Annual Mortality rates'] > 0.5
y
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
15 True
16 True
17 True
Name: ESRD patients: Avg. Annual Mortality rates, dtype: bool
# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

#y = df_pca[' ESRD patients: Total (or %) deaths for target year']
#y = np.std(pca.transform(standardisedX), axis=0)**2
pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
png
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21



#keep four components
pca_components_cont = pca.components_[0:5]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1,2,3],['1st Comp','2nd Comp', '3rd comp', '4th comp'],fontsize=10)
plt.colorbar()

plt.xticks(range(len(df_pca.columns)), df_pca.columns, rotation=65,ha='left')

plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_food_sub_groups_what_contributes_to_PCA_components' + '.png')
plt.show()#
png
plt.figure(figsize=(16, 10));

diffs = list(df_pca.columns[:-1])
import seaborn as sns
s = sns.heatmap(df_pca[diffs].corr(),cmap='coolwarm')
#s.set_yticklabels(s.get_yticklabels(),rotation=60,fontsize=7)
s.set_xticklabels(s.get_xticklabels(),rotation=90,fontsize=7)
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_food_sub_groups_how_in_together_influencing_PCA_components' + '.png')
plt.show()
png

https://www.kaggle.com/etakla/exploring-the-dataset-univariate-analysis https://www.kaggle.com/etakla/exploring-the-dataset-bivariate-analysis https://towardsdatascience.com/survival-analysis-part-a-70213df21c2e https://lifelines.readthedocs.io/en/latest/ https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html https://www.statsdirect.com/help/survival_analysis/cox_regression.htm https://courses.lumenlearning.com/suny-natural-resources-biometrics/chapter/chapter-7-correlation-and-simple-linear-regression/

References

--

--

Justetc Social Services (non-profit)
Data Science Project Development

All proceeds from Medium will go to Justetc Social Services ( non-profit). Justetc Social Services provides services in the Training and Education Areas.