Find relation between food Sub groups and ckd mortality: CKD: Chronic Kidney Disease

Published in

Data Science Project Development

20 min readMar 9, 2021

The code shows the relation between food sub groups and CKD mortality Food Sub Group concept is used as are used by CDC/USDA/USRDS

Dietray Guidelines are reviewed every couple of years and published https://health.gov/our-work/food-nutrition/previous-dietary-guidelines/2015

A background database was there where some data pre-processing and adjustments were made

Food Groups/Subgroups Survey Data and Mortality Data were merged together to create the dataset as used in this project Mortality Data is provided by USRDS

Food intake surveys are conducted by USDA and also published periodically https://www.nal.usda.gov/fnic/food-and-nutrition-surveys

The code will use PCA to find the important food groups affecting mortality

Will use heatmap to show the correlations between mortality and food groups

The code also shows univariate and bi-variate analysis of the food group data

The final outcome is correlation of Mortality and Food Groups

# PCA section, Correlation, Heatmaps, and then Bivariate plots can be seen to be important
# Might have to adjust/improve my conclusions from the plots in future work# Target variable ESRD patients: Avg. Annual Mortality rates is given the most importance
# One another target variable is: ESRD patients: Total (or %) deaths for target year
# Other target variables: Dialysis patients: Total (or %) deaths for target year# For plotting
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
import numpy as npimport warnings
warnings.filterwarnings('ignore')

Is not using ratio with recommended amounts. Ratios as used with Food Groups did not show any difference with the actual amount taken. However, that might change with food subgroups.

# data exploration
import pandas as pd
df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head(30)

df.describe()

df_actual_only = df.drop(['Age-group: From USRDS', 'Age-group: To USRDS', 'Gender'], axis=1)
df_actual_only.head()

####----
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_actual_only)
standardisedX = pd.DataFrame(standardisedX, index=df_actual_only.index, columns=df_actual_only.columns)
standardisedX.apply(np.mean), '---\n', standardisedX.apply(np.std)(Actual Dark-green vegetables Intake                      -1.727014e-16
 Actual Red and orange vegetables Intake                  -3.083953e-16
 Actual Starchy vegetables Intake                          5.242720e-17
 Actual Other vegetables Intake                            2.220446e-16
 Actual Whole grains intakes                               1.258253e-15
 Actual Taken Refined grains amount                       -1.619075e-15
 Avg Meat, Poultry and Eggs subgroup taken                 3.577385e-16
 Avg Seafood taken                                         2.158767e-16
 Avg Nuts, Seeds, and Soy Products taken                   6.167906e-16
 Avg Added Sugars/Sugars and sweets taken                  2.652199e-16
 Avg Oils taken                                           -2.713879e-16
 Avg Solid Fats taken                                     -1.233581e-16
 Avg Milks and milk drinks  taken                          2.405483e-16
 Avg Water, noncarbonated intake                           0.000000e+00
 Avg Alcoholic beverages intake                           -2.713879e-16
 Avg Nonalcoholic beverages taken                         -3.824102e-16
 Avg Milk desserts, sauces, gravies taken                  7.401487e-16
 ESRD patients: Total (or %) deaths for target year        9.868649e-17
 ESRD patients: Avg. Annual Mortality rates               -9.868649e-17
 Dialysis patients: Total (or %) deaths for target year    4.934325e-17
 Dialysis patients: Avg. Annual Mortality rates            7.401487e-17
 dtype: float64,
 '---\n',
 Actual Dark-green vegetables Intake                       1.0
 Actual Red and orange vegetables Intake                   1.0
 Actual Starchy vegetables Intake                          1.0
 Actual Other vegetables Intake                            1.0
 Actual Whole grains intakes                               1.0
 Actual Taken Refined grains amount                        1.0
 Avg Meat, Poultry and Eggs subgroup taken                 1.0
 Avg Seafood taken                                         1.0
 Avg Nuts, Seeds, and Soy Products taken                   1.0
 Avg Added Sugars/Sugars and sweets taken                  1.0
 Avg Oils taken                                            1.0
 Avg Solid Fats taken                                      1.0
 Avg Milks and milk drinks  taken                          1.0
 Avg Water, noncarbonated intake                           1.0
 Avg Alcoholic beverages intake                            1.0
 Avg Nonalcoholic beverages taken                          1.0
 Avg Milk desserts, sauces, gravies taken                  1.0
 ESRD patients: Total (or %) deaths for target year        1.0
 ESRD patients: Avg. Annual Mortality rates                1.0
 Dialysis patients: Total (or %) deaths for target year    1.0
 Dialysis patients: Avg. Annual Mortality rates            1.0
 dtype: float64)y = abs(standardisedX['ESRD patients: Avg. Annual Mortality rates']) > 0.5 
standardisedX_with_target = standardisedX
standardisedX = standardisedX.drop(['ESRD patients: Avg. Annual Mortality rates', 'ESRD patients: Total (or %) deaths for target year',
                                   'Dialysis patients: Total (or %) deaths for target year', 'Dialysis patients: Avg. Annual Mortality rates'], axis=1)from sklearn import decomposition
#pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pcaPCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
    names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    a = list(np.std(pca.transform(standardised_data), axis=0))
    b = list(pca.explained_variance_ratio_)
    c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
    summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
    if out:
        print("Importance of components:")
        display(summary)
    return summary####----
summary = pca_summary(pca, standardisedX)
####----Importance of components:

# First five component can define over 90%# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
np.sum( summary.sdev ** 2 )
plt.rcParams['figure.figsize'] = 8, 8

Important Components

def screeplot(pca, standardised_values):
    y = np.std(pca.transform(standardised_values), axis=0)**2
    x = np.arange(len(y)) + 1
    plt.plot(x, y, "o-")
    plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
    plt.ylabel("Variance")
    plt.title('PCA variance in Components')
    plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_variance' + '.png')            
    plt.show()

screeplot(pca, standardisedX)
#plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_variance' + '.png')

comp 3 to 4 has the most slope change — comp 7 starts the flat line

first three are the most important

Will retain first five though first three will be analyzed primarily

#summary.sdev**2
#pca.components_[0]
#np.sum(pca.components_[0]**2)####----
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function with some adjustments
def calcpc(variables, loadings):
    # find the number of samples in the data set and the number of variables
    numsamples, numvariables = variables.shape
    # make a vector to store the component
    pc = np.zeros(numsamples)
    # calculate the value of the component for each sample
    for i in range(numsamples):
        valuei = 0
        for j in range(numvariables):
            valueij = variables.iloc[i, j]
            loadingj = loadings[j]
            valuei = valuei + (valueij * loadingj)
        pc[i] = valuei
    return pc####----
calcpc(standardisedX, pca.components_[0])
pca.transform(standardisedX)[:, 0]
pca.transform(standardisedX)[:, 0]
pca.components_[1]
np.sum(pca.components_[1]**2)
#highest loadings for0.9999999999999989# Define high and low mortality
# The code below is not used.
# y as defined earlier will rather be used

#import sklearn 
#from sklearn import preprocessing
#standardisedX_mortality = sklearn.preprocessing.scale(df_diff_ratio_mortality)
#standardisedX_mortality = pd.DataFrame(standardisedX_mortality, index=df_diff_ratio_mortality.index, columns=df_diff_ratio_mortality.columns)
#standardisedX_mortality.apply(np.mean), '---', standardisedX_mortality.apply(np.std)
#standardisedX_mortality####----
# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
    foo = pca.transform(standardised_values)
    bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
    #plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_separating_high_low_mortality' + '.png')            
    sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Class separations True = High Mortality')
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_separating_high_low_mortality' + '.png')            
standardisedX

# first two components really cannot separate high and low mortality very effectively

Plot first two components

Next section (i.e. four components) is more important than this

####----
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21


pca_components_cont = pca.components_[0:2]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(standardisedX.columns)),standardisedX.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_food_groups_what_contributes_to_PCA_components' + '.png')            
plt.show() # 
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component

First Five components

# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
plt.rcParams['figure.figsize'] = 30, 16
components_to_count = 5

pca_components_cont = pca.components_[0:components_to_count]
ylabels = []

for c in range (components_to_count):
    ylabels.append('Component ' + str(c) )

xlabels = []    
for c in range (components_to_count):
    xlabels.append(c) 
    
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks(xlabels, ylabels, fontsize=10)
plt.colorbar()
plt.xlabel('\nContributors to the first five components')
plt.xticks(range(len(standardisedX.columns)), standardisedX.columns, rotation=90, ha='left')
plt.show()#

standardisedX.columnsIndex(['Actual Dark-green vegetables Intake',
       'Actual Red and orange vegetables Intake',
       'Actual Starchy vegetables Intake', 'Actual Other vegetables Intake',
       'Actual Whole grains intakes', 'Actual Taken Refined grains amount',
       'Avg Meat, Poultry and Eggs subgroup taken', 'Avg Seafood taken',
       'Avg Nuts, Seeds, and Soy Products taken',
       'Avg Added Sugars/Sugars and sweets taken', 'Avg Oils taken',
       'Avg Solid Fats taken', 'Avg Milks and milk drinks  taken',
       'Avg Water, noncarbonated intake', 'Avg Alcoholic beverages intake',
       'Avg Nonalcoholic beverages taken',
       'Avg Milk desserts, sauces, gravies taken'],
      dtype='object')

First five components explain the data by 90%

Contributors for the First Component: Positively: Avg Milks and milk drinks taken -> (0.4)

Negatively (-0.3 to -0.4) contributing: Actual Dark-green vegetables, Actual Starchy vegetables, Actual Whole grains, Avg Meat Poultry and Eggs subgroup, Avg Oils taken, Avg Water noncarbonated intake, Avg Nonalcoholic beverages taken

Contributors for 2nd Component: Positively: Actual Red and orange vegetables Intake, Actual Other vegetables Negatively: ‘Avg Nuts, Seeds, and Soy Products’, ‘Avg Added Sugars/Sugars and sweets’, ‘Avg Alcoholic beverages’

Contributors for 3rd component: Positively: ‘Actual Taken Refined grains amount’, ‘Avg Milk desserts, sauces, gravies taken’ Negatively: Avg Water noncarbonated intake’

Features that do not contribute much: ‘Avg Seafood taken’, ‘Avg Solid Fats taken’,

Considering all Five Components:

Positively the most contributing:
‘Actual Taken Refined grains amount’, ‘Avg Milk desserts, sauces, gravies taken’ ‘Avg Nuts, Seeds, and Soy Products taken’, ‘Actual Red and orange vegetables Intake’

Most Negatively:
‘Actual Other vegetables Intake’, ‘Avg Added Sugars/Sugars and sweets taken’, Avg Alcoholic beverages intake

Can do regression with all food sub groups except ‘Avg Seafood taken’, ‘Avg Solid Fats taken’. Though will do a comparison with them as well

standardisedX_important = standardisedX_with_target.drop(['Avg Seafood taken', 'Avg Solid Fats taken'], axis=1)standardisedX_important.head()

standardisedX_important.corr()

plt.figure(figsize=(12, 12))


corr = standardisedX_important.corr() #df_diff_ratio_with_mortality.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True,
           cmap="BuPu");
plt.suptitle('Heatmap, Correlation with Mortality Rates');
plt.savefig('heatmap-ckd-mortality-subgroup-regression-after-pca.png')
standardisedX_important.columnsIndex(['Actual Dark-green vegetables Intake',
       'Actual Red and orange vegetables Intake',
       'Actual Starchy vegetables Intake', 'Actual Other vegetables Intake',
       'Actual Whole grains intakes', 'Actual Taken Refined grains amount',
       'Avg Meat, Poultry and Eggs subgroup taken',
       'Avg Nuts, Seeds, and Soy Products taken',
       'Avg Added Sugars/Sugars and sweets taken', 'Avg Oils taken',
       'Avg Milks and milk drinks  taken', 'Avg Water, noncarbonated intake',
       'Avg Alcoholic beverages intake', 'Avg Nonalcoholic beverages taken',
       'Avg Milk desserts, sauces, gravies taken',
       'ESRD patients: Total (or %) deaths for target year',
       'ESRD patients: Avg. Annual Mortality rates',
       'Dialysis patients: Total (or %) deaths for target year',
       'Dialysis patients: Avg. Annual Mortality rates'],
      dtype='object')

Most Positvely Correlated: ‘Actual Other vegetables Intake’, ‘Actual Red and orange vegetables Intake’, ‘Actual Starchy vegetables Intake’,

Most Negatively Correlated: ‘Avg Alcoholic beverages intake’ (-0.79), ‘Avg Added Sugars/Sugars and sweets taken’ (-0.64), ‘Avg Nuts, Seeds, and Soy Products taken’ (-0.55)

Check: the data

# Positive correlationplt.rcParams['figure.figsize'] = 5, 5 
positively = standardisedX_important[ ['Actual Other vegetables Intake', 'Actual Red and orange vegetables Intake', 'Actual Starchy vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_other = positively[['Actual Other vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_other.plot.line()
plt.savefig('positive_subgroup_line_1')

positively_red = positively[['Actual Red and orange vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_red.plot.line()
plt.savefig('positive_subgroup_line_2')

positively_starchy = positively[['Actual Starchy vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_starchy.plot.line()
plt.savefig('positive_subgroup_line_3')

negatively for Food Subgroup

negatively= standardisedX_important[ ['Avg Alcoholic beverages intake', 'Avg Added Sugars/Sugars and sweets taken', 'Avg Nuts, Seeds, and Soy Products taken',  'ESRD patients: Avg. Annual Mortality rates']]
negatively_alcohol = negatively[['Avg Alcoholic beverages intake', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_alcohol.plot.line()
plt.savefig('./saved-images/negatively_subgroup_line_1')

negatively_sugar = negatively[['Avg Added Sugars/Sugars and sweets taken', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_sugar.plot.line()
plt.savefig('./saved-images/negatively_subgroup_line_2')

negatively_nuts = negatively[['Avg Nuts, Seeds, and Soy Products taken', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_nuts.plot.line()
plt.savefig('./saved-images/negatively_subgroup_line_3')

The following code were relevant for exploration phase. For experiment and methodology the above code are relevant

Univariate

Checked and plotted each column.

Recommended intake amount is not included in the data (can be calculated like food groups)

"""
plt.figure(figsize=(16, 5));
#plt.rcParams['figure.figsize'] = 14, 14
df[' Actual Dark-green vegetables Intake'].plot.bar();
#plt.setxlabels(df['age_to'])
plt.suptitle('Actual Dark-green vegetables Intake')
plt.xticks(range(len(df[' Age-group: From USRDS'])), df[' Age-group: From USRDS']);
plt.xlabel('Age Group: From')
""""\nplt.figure(figsize=(16, 5));\n#plt.rcParams['figure.figsize'] = 14, 14\ndf[' Actual Dark-green vegetables Intake'].plot.bar();\n#plt.setxlabels(df['age_to'])\nplt.suptitle('Actual Dark-green vegetables Intake')\nplt.xticks(range(len(df[' Age-group: From USRDS'])), df[' Age-group: From USRDS']);\nplt.xlabel('Age Group: From')\n"df = pd.read_csv('./data-for-code/older_no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
#plt.figure(figsize=(16, 5));
plt.rcParams['figure.figsize'] = 14, 7
df1 = df.iloc[:,3:6]
df2 = df.iloc[:,6:9]
df3 = df.iloc[:,9:12]
df4 = df.iloc[:,12:15]
df5 = df.iloc[:,15:20]
df6 = df.iloc[:,20:24]

#df7 = df.iloc[:,22:24]
#df8 = df.iloc[:,30:33]
#df9 = df.iloc[:,33:36]

df1.plot.bar();
plt.title('Univariate plots, Check Legends for Aspects Plotted')

df2.plot.bar();
df3.plot.bar();
df4.plot.bar();
df5.plot.bar();
df6.plot.bar();
#df7.plot.bar();
#df8.plot.bar();
#df9.plot.bar();

plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')Text(0.5,0.98,'Univariate plots, Check Legends for Aspect')

df1.plot.line();
df2.plot.line();
df3.plot.line();
df4.plot.line();
df5.plot.line();
df6.plot.line();
#df7.plot.line();


plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);

df1.plot.area();
df2.plot.area();
df3.plot.area();
df4.plot.area();
df5.plot.area();
df6.plot.area();
#df7.plot.area();

plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);

df1.plot.hist();
df2.plot.hist();
df3.plot.hist();
df4.plot.hist();
df5.plot.hist();
df6.plot.hist();
#df7.plot.hist();


plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);

df1.plot.box();
df2.plot.box();
df3.plot.box();
df4.plot.box();
df5.plot.box();
df6.plot.box();
#df7.plot.box();


plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);

df.corr()

import numpy as np
import pandas as pd 
from subprocess import check_output

from IPython.display import display, HTML
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches

import seaborn as sns
%matplotlib inlinedf = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head()

df.corr()

Age-group: From USRDS Age-group: To USRDS Actual Dark-green vegetables Intake Actual Red and orange vegetables Intake Actual Starchy vegetables Intake Actual Other vegetables Intake Actual Whole grains intakes Actual Taken Refined grains amount Avg Meat, Poultry and Eggs subgroup taken Avg Seafood taken … Avg Solid Fats taken Avg Milks and milk drinks taken Avg Water, noncarbonated intake Avg Alcoholic beverages intake Avg Nonalcoholic beverages taken Avg Milk desserts, sauces, gravies taken ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates ESRD patients: Total (or %) deaths for target year 0.909230 0.909458 0.056973 0.572518 0.517307 0.542376 -0.469825 0.112844 -0.036558 0.107024 … -0.236728 -0.598140 -0.023290 -0.738501 0.384077 0.353425 1.000000 0.833280 0.999627 0.868295 ESRD patients: Avg. Annual Mortality rates 0.854852 0.854771 -0.082390 0.545127 0.444551 0.676325 -0.609984 0.057939 -0.216943 -0.113325 … -0.365416 -0.461741 -0.150719 -0.786412 0.192810 0.336838 0.833280 1.000000 0.847302 0.995724 Dialysis patients: Total (or %) deaths for target year 0.914521 0.914720 0.054157 0.577456 0.518309 0.554746 -0.478563 0.114916 -0.042973 0.100156 … -0.241894 -0.598720 -0.028184 -0.744098 0.381084 0.357477 0.999627 0.847302 1.000000 0.880661 Dialysis patients: Avg. Annual Mortality rates 0.878499 0.879277 -0.076082 0.564846 0.453687 0.675571 -0.613042 0.035448 -0.205311 -0.097894 … -0.369415 -0.485633 -0.136271 -0.791673 0.225472 0.314988 0.868295 0.995724 0.880661 1.000000

len(df.columns)24plt.figure(figsize=(12, 8))

corr = df.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True);

df = df.drop(['Gender'], axis=1)df_normalized = (df - df.mean())/ (df.max() - df.min())
df_normalized.head()

df_normalized.corr()

plt.figure(figsize=(12, 8))

corr = df_normalized.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True);

Bivariate

Bivariate plots on actual amount intake and target variables. will be saved in bivariate_food_subgroup.png. The correlation pattern can be checked in the image saved

The correlation are shown in heatmaps and Pearson’s correlations. Now linearity and non-linearity can be seen using the Bivariate plots below.

The plots show similarity with correlations

df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head()

df.columnsIndex(['Actual Dark-green vegetables Intake',
       'Actual Red and orange vegetables Intake',
       'Actual Starchy vegetables Intake', 'Actual Other vegetables Intake',
       'Actual Whole grains intakes', 'Actual Taken Refined grains amount',
       'Avg Meat, Poultry and Eggs subgroup taken', 'Avg Seafood taken',
       'Avg Nuts, Seeds, and Soy Products taken',
       'Avg Added Sugars/Sugars and sweets taken', 'Avg Oils taken',
       'Avg Solid Fats taken', 'Avg Milks and milk drinks  taken',
       'Avg Water, noncarbonated intake', 'Avg Alcoholic beverages intake',
       'Avg Nonalcoholic beverages taken',
       'Avg Milk desserts, sauces, gravies taken',
       'ESRD patients: Total (or %) deaths for target year',
       'ESRD patients: Avg. Annual Mortality rates',
       'Dialysis patients: Total (or %) deaths for target year',
       'Dialysis patients: Avg. Annual Mortality rates'],
      dtype='object')# on actual amounts
# plt.figure(figsize=(16, 300))
#df = df.drop(['Gender', 'Age-group: From USRDS', 'Age-group: To USRDS'], axis=1)
sns.pairplot(df, vars=df.columns, size=5, kind='reg'); # diag_kind='kde',
plt.title('Bivariate Plot, Food Subgroups');
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/bivariate_food_subgroup' + '.png')            
plt.show()

Bivariate plots with few variables at each plot

‘ESRD patients: Avg. Annual Mortality rates’ is used as the target variable

df1 = df.iloc[:,3:6]
df1['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']


df2 = df.iloc[:,6:9]
df2['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']

df3 = df.iloc[:,9:12]
df3['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']

df4 = df.iloc[:,12:15]
df4['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']


df5 = df.iloc[:,15:17]
df5['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']

#df6 = df.iloc[:,18:21]
df7 = df.iloc[:,17:24]
#df7[' ESRD patients: Avg. Annual Mortality rates'] = df[' ESRD patients: Avg. Annual Mortality rates']plt.figure(figsize=(14, 14))
sns.pairplot(df1, diag_kind='kde', kind='reg', size=5);

#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=70);
#plt.xticks(rotation=70);
#g.set_xlabels(g.get_xlabels(), rotation=90)
#plt.xlabel(range(len(df1.columns)), df1.columns, rotation=90);
# plt.setyylable('dd', rotation=90);<Figure size 1008x1008 with 0 Axes>

sns.pairplot(df2, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=90);

sns.pairplot(df3, diag_kind='kde', kind='reg', size=5);

sns.pairplot(df4, diag_kind='kde', kind='reg', size=5);

sns.pairplot(df5, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=90);

sns.pairplot(df6, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=90);

PCA: Just experiment and Exploration only:

Will reuse some methods from a github project and apply on our data

#steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head()

df_pca = df.drop(['Age-group: From USRDS', 'Age-group: To USRDS', 'Gender'],axis=1)
import sklearn 
from sklearn import preprocessing
import numpy as np
standardisedX = sklearn.preprocessing.scale(df_pca)
standardisedX = pd.DataFrame(standardisedX, index=df_pca.index, columns=df_pca.columns)
standardisedX.apply(np.mean)Actual Dark-green vegetables Intake                      -1.727014e-16
Actual Red and orange vegetables Intake                  -3.083953e-16
Actual Starchy vegetables Intake                          5.242720e-17
Actual Other vegetables Intake                            2.220446e-16
Actual Whole grains intakes                               1.258253e-15
Actual Taken Refined grains amount                       -1.619075e-15
Avg Meat, Poultry and Eggs subgroup taken                 3.577385e-16
Avg Seafood taken                                         2.158767e-16
Avg Nuts, Seeds, and Soy Products taken                   6.167906e-16
Avg Added Sugars/Sugars and sweets taken                  2.652199e-16
Avg Oils taken                                           -2.713879e-16
Avg Solid Fats taken                                     -1.233581e-16
Avg Milks and milk drinks  taken                          2.405483e-16
Avg Water, noncarbonated intake                           0.000000e+00
Avg Alcoholic beverages intake                           -2.713879e-16
Avg Nonalcoholic beverages taken                         -3.824102e-16
Avg Milk desserts, sauces, gravies taken                  7.401487e-16
ESRD patients: Total (or %) deaths for target year        9.868649e-17
ESRD patients: Avg. Annual Mortality rates               -9.868649e-17
Dialysis patients: Total (or %) deaths for target year    4.934325e-17
Dialysis patients: Avg. Annual Mortality rates            7.401487e-17
dtype: float64standardisedX.apply(np.std)Actual Dark-green vegetables Intake                       1.0
Actual Red and orange vegetables Intake                   1.0
Actual Starchy vegetables Intake                          1.0
Actual Other vegetables Intake                            1.0
Actual Whole grains intakes                               1.0
Actual Taken Refined grains amount                        1.0
Avg Meat, Poultry and Eggs subgroup taken                 1.0
Avg Seafood taken                                         1.0
Avg Nuts, Seeds, and Soy Products taken                   1.0
Avg Added Sugars/Sugars and sweets taken                  1.0
Avg Oils taken                                            1.0
Avg Solid Fats taken                                      1.0
Avg Milks and milk drinks  taken                          1.0
Avg Water, noncarbonated intake                           1.0
Avg Alcoholic beverages intake                            1.0
Avg Nonalcoholic beverages taken                          1.0
Avg Milk desserts, sauces, gravies taken                  1.0
ESRD patients: Total (or %) deaths for target year        1.0
ESRD patients: Avg. Annual Mortality rates                1.0
Dialysis patients: Total (or %) deaths for target year    1.0
Dialysis patients: Avg. Annual Mortality rates            1.0
dtype: float64from sklearn import decomposition
pca = decomposition.PCA().fit(standardisedX)
pcaPCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
    names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    a = list(np.std(pca.transform(standardised_data), axis=0))
    b = list(pca.explained_variance_ratio_)
    c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
    summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
    if out:
        print("Importance of components:")
        display(summary)
    return summarysummary = pca_summary(pca, standardisedX)Importance of components:

# First Five component can define over 91%np.sum(summary.sdev**2)Standard deviation    21.0
dtype: float64# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def screeplot(pca, standardised_values):
    y = np.std(pca.transform(standardised_values), axis=0)**2
    x = np.arange(len(y)) + 1
    plt.plot(x, y, "o-")
    plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
    plt.ylabel("Variance")
    plt.show()

plt.rcParams['figure.figsize'] = 8, 8   
screeplot(pca, standardisedX)

# comp 3 to comp 4 is the most change for slope
# first three or at best first 4 can be retainedsummary.sdev**2

pca.components_[0]array([-0.29107597, -0.14117176, -0.2469594 , -0.17548548, -0.25393257,
       -0.21937776, -0.30982985, -0.28356691, -0.0489583 , -0.14248964,
       -0.31487843, -0.26718644,  0.26503007, -0.30282359, -0.14339706,
       -0.28353417, -0.23557178, -0.03748978,  0.01047719, -0.03648573,
        0.00854257])np.sum(pca.components_[0]**2)0.9999999999999997# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function
def calcpc(variables, loadings):
    # find the number of samples in the data set and the number of variables
    numsamples, numvariables = variables.shape
    # make a vector to store the component
    pc = np.zeros(numsamples)
    # calculate the value of the component for each sample
    for i in range(numsamples):
        valuei = 0
        for j in range(numvariables):
            valueij = variables.iloc[i, j]
            loadingj = loadings[j]
            valuei = valuei + (valueij * loadingj)
        pc[i] = valuei
    return pccalcpc(standardisedX, pca.components_[0])array([ 9.84586938,  4.93557236,  1.7245801 ,  0.51665275, -2.23906467,
       -2.37480972, -1.57908354, -1.82857302, -2.5779144 , -1.48209187,
       -1.18821375, -1.74464742, -2.54901027, -1.14542236, -0.61057954,
        0.39468681,  0.5706472 ,  1.33140195])pca.transform(standardisedX)[:, 0]array([ 9.84586938,  4.93557236,  1.7245801 ,  0.51665275, -2.23906467,
       -2.37480972, -1.57908354, -1.82857302, -2.5779144 , -1.48209187,
       -1.18821375, -1.74464742, -2.54901027, -1.14542236, -0.61057954,
        0.39468681,  0.5706472 ,  1.33140195])pca.components_[1]array([-0.02860489,  0.23507636,  0.18185158,  0.2330107 , -0.22042602,
        0.00767883, -0.06542032, -0.01520477, -0.23666744, -0.27940994,
       -0.05868199, -0.12681133, -0.19603246, -0.03891072, -0.3029888 ,
        0.11437419,  0.10127936,  0.34621579,  0.35198445,  0.34906457,
        0.3593115 ])np.sum(pca.components_[1]**2)1.0000000000000009# for following code : Classes will/might be defined such as like High, low, neutral mortality in final worksorted(standardisedX['ESRD patients: Avg. Annual Mortality rates'])[-0.8508781732392158,
 -0.8447380350810376,
 -0.807897206131968,
 -0.7541709972479083,
 -0.6983980756444559,
 -0.6114127850702639,
 -0.535172736272884,
 -0.4788881364895833,
 -0.45330422749717386,
 -0.4405122730009692,
 -0.29059056630545005,
 -0.13094697419281548,
 0.08753960860236082,
 0.32905170949070545,
 0.6810862972262587,
 1.1635988208231,
 1.6916507024264298,
 2.9439830476048696]y = standardisedX['ESRD patients: Avg. Annual Mortality rates'] > 0.5
y0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14     True
15     True
16     True
17     True
Name: ESRD patients: Avg. Annual Mortality rates, dtype: bool# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
    foo = pca.transform(standardised_values)
    bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
    sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

#y = df_pca[' ESRD patients: Total (or %) deaths for target year']
#y = np.std(pca.transform(standardisedX), axis=0)**2
pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');

# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21



#keep four components
pca_components_cont = pca.components_[0:5]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1,2,3],['1st Comp','2nd Comp', '3rd comp', '4th comp'],fontsize=10)
plt.colorbar()

plt.xticks(range(len(df_pca.columns)), df_pca.columns, rotation=65,ha='left')

plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_food_sub_groups_what_contributes_to_PCA_components' + '.png')            
plt.show()#

plt.figure(figsize=(16, 10));

diffs = list(df_pca.columns[:-1]) 
import seaborn as sns
s = sns.heatmap(df_pca[diffs].corr(),cmap='coolwarm') 
#s.set_yticklabels(s.get_yticklabels(),rotation=60,fontsize=7)
s.set_xticklabels(s.get_xticklabels(),rotation=90,fontsize=7)
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_food_sub_groups_how_in_together_influencing_PCA_components' + '.png')            
plt.show()

https://www.kaggle.com/etakla/exploring-the-dataset-univariate-analysis https://www.kaggle.com/etakla/exploring-the-dataset-bivariate-analysis https://towardsdatascience.com/survival-analysis-part-a-70213df21c2e https://lifelines.readthedocs.io/en/latest/ https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html https://www.statsdirect.com/help/survival_analysis/cox_regression.htm https://courses.lumenlearning.com/suny-natural-resources-biometrics/chapter/chapter-7-correlation-and-simple-linear-regression/

References

Linear Regression Example - scikit-learn 0.24.1 documentation

The example below uses only the first feature of the dataset, in order to illustrate the data points within the…

scikit-learn.org

Python Pandas DataFrame: load, edit, view data | Shane Lynn

If you're developing in data science, and moving from excel-based analysis to the world of Python, scripting, and…

www.shanelynn.ie

A Little Book of Python for Multivariate Analysis - A Little Book of Python for Multivariate…

Although there are a number of ways of getting Python to your system, for a hassle free install and quick start using…

python-for-multivariate-analysis.readthedocs.io