Find relation between foodgroups and ckd mortality

Published in

Data Science Project Development

29 min readMar 9, 2021

The code shows the relation between food groups and CKD mortality Food Group concept is used as are used by CDC/USDA/USRDS

Dietray Guidelines are reviewed every couple of years and published https://health.gov/our-work/food-nutrition/previous-dietary-guidelines/2015

A background database was there where some data pre-processing and adjustments were made

Food Groups Survey Data and Mortality Data were merged together to create the dataset as used in this project Mortality Data is provided by USRDS

Food intake surveys are conducted by USDA and also published periodically https://www.nal.usda.gov/fnic/food-and-nutrition-surveys

The code will use PCA to find the important food groups affecting mortality

Will use heatmap to show the correlations between mortality and food groups

The code also shows univariate and bi-variate analysis of the food group data

The final outcome is correlation of Mortality and Food Groups

Related : database table: age_food_group_recommended_amount_adjusted
stored procedures: adjusted_gms_age_food_group_recommended_amount multi_day_get_food_group_based_dietary_intake_by_participants

Cup http://dish.allrecipes.com/cup-to-gram-conversions/ More Cup and grams conversion references are provided in the stored procedures as well

# notes on regenerating some of the core data can be found near the end. Also, reports can be checked for other details# PCA section, Correlation, Heatmaps,  plots can be seen to be the most important
# Might have to adjust/improve my conclusions from the plots in future work# Target variable ESRD patients: Avg. Annual Mortality rates is given the most importance

#Potential Others:   
# One another target variable is: ESRD patients: Total (or %) deaths for target year
# Other target variables: Dialysis patients: Total (or %) deaths for target year# For plotting, import libraries

from matplotlib import pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline
import numpy as npimport warnings
warnings.filterwarnings('ignore')

Core Input data File: Food Group

mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv

Intake data and recommended data

Recommended data:

Removed (no longer relevant): The ones with format: From: Recommended Vegetable Intake and To: Recommended Vegetable Intake: health.gov recommendation data (age group based) is distributed to each age then took average using age groups as used by USRDS

Relevant: The ones with format: vegetables_recommended_high and vegetables_recommended_low: health.gov recommendation data (age group based) is mapped to ages (rather than dividing and distributing) then regrouped using USRDS age groups. To mention: health.gov and USRDS provided aggregated data using different age groups. Hence, such approaches were taken

# data exploration
import pandas as pd
df = pd.read_csv('./data-for-code/mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df.head()

df.describe()

PCA

Applying PCA on Actual Intake Amount

For recommended amounts: fields with format: vegetables_recommended_high and vegetables_recommended_low

df.columnsIndex(['age_from', 'age_to', 'Gender', 'vegetables_recommended_low',
       'vegetables_recommended_high', 'Actual Vegetable Intake',
       'protein_recommended_low', 'protein_recommended_high',
       'Actual Protein Intake', 'grains_recommended_low',
       'grains_recommended_high', 'Actual Grain Intake',
       'dairy_recommended_low', 'dairy_recommended_high',
       'Actual Dairy Intake', 'fruits_recommended_low',
       'fruits_recommended_high', 'Actual Fruit intakes', 'sugars_recommended',
       'sugars_recommended_low', 'sugars_recommended_high',
       'Actual Added Sugars Taken', 'oils_recommended_low',
       'oils_recommended_high', 'Avg Fats oils and salad dressings taken',
       'ESRD patients: Avg. Annual Mortality rates',
       'Dialysis patients: Avg. Annual Mortality rates',
       'ESRD patients: Total (or %) deaths for target year',
       'Dialysis patients: Total (or %) deaths for target year'],
      dtype='object')# Will Keep only important columnsdf_esrdonly = df [
    [        
        #'age_from', 
        #'age_to', 
        #'Gender', 
        #'From: Recommended Vegetable Intake',
       'vegetables_recommended_low', 
        #'To: Recommended Vegetable Intake',
       'vegetables_recommended_high', 
        'Actual Vegetable Intake',
       #'From: Recommended Protein Intake', 
        'protein_recommended_low',
       #'To: Recommended Protein Intake', 
        'protein_recommended_high',
       'Actual Protein Intake', 
        #'From: Recommended Grain Intake',
       'grains_recommended_low', 
        #'To: Recommended Grain Intake',
       'grains_recommended_high', 
        'Actual Grain Intake',
       #'From: Recommended Dairy Intake', 
        'dairy_recommended_low',
       #'To: Recommended Dairy Intake', 
        'dairy_recommended_high',
       'Actual Dairy Intake', 
        #'From: Recommended Fruit category intakes',
       'fruits_recommended_low', 
        #'To: Recommended Fruit category intakes',
       'fruits_recommended_high', 
        'Actual Fruit intakes',
        
        
        'sugars_recommended_low',
       'sugars_recommended_high', 
        'Actual Added Sugars Taken',
       #'Actual Taken Sugars sweets and beverages amount',
       'oils_recommended_low', 
        'oils_recommended_high',
       'Avg Fats oils and salad dressings taken', 
        
                        
       #'% Population got CKD',
       #'People (or %) progressed to Stage 3 CKD',
       #'People (or %) progressed to Stage 4 CKD',
       #'People (or %) progressed to ESRD',
       #'Received dietitian care (Optional)',
       #'Did not Receive dietitian care (Optional)',
       #'Patients went for dialysis',
       #'Patients went for Kidney Transplantation',
       'ESRD patients: Avg. Annual Mortality rates',
       #'Dialysis patients: Avg. Annual Mortality rates',
       #'ESRD patients: Total (or %) deaths for target year',
       #'Dialysis patients: Total (or %) deaths for target year'
        
    ]   
        
]
df_esrdonly.head()

df_esrdonly.columnsIndex(['vegetables_recommended_low', 'vegetables_recommended_high',
       'Actual Vegetable Intake', 'protein_recommended_low',
       'protein_recommended_high', 'Actual Protein Intake',
       'grains_recommended_low', 'grains_recommended_high',
       'Actual Grain Intake', 'dairy_recommended_low',
       'dairy_recommended_high', 'Actual Dairy Intake',
       'fruits_recommended_low', 'fruits_recommended_high',
       'Actual Fruit intakes', 'sugars_recommended_low',
       'sugars_recommended_high', 'Actual Added Sugars Taken',
       'oils_recommended_low', 'oils_recommended_high',
       'Avg Fats oils and salad dressings taken',
       'ESRD patients: Avg. Annual Mortality rates'],
      dtype='object')# only actual intake by the participants -- othe fields such as recommended amounts are removed from this dataframedf_esrdonly_actual_only = df_esrdonly [    
    [  
        #'vegetables_recommended_low', 
        #'vegetables_recommended_high',
       'Actual Vegetable Intake', 
        #'protein_recommended_low',
       #'protein_recommended_high', 
        'Actual Protein Intake',
       #'grains_recommended_low', 
        #'grains_recommended_high',
       'Actual Grain Intake', 
        #'dairy_recommended_low',
       #'dairy_recommended_high', 
        'Actual Dairy Intake',
       #'fruits_recommended_low', 
        #'fruits_recommended_high', 
        'Actual Fruit intakes',
       #'sugars_recommended_low', 
        #'sugars_recommended_high',
       'Actual Added Sugars Taken',
        #'Actual Taken Sugars sweets and beverages amount',
       #'oils_recommended_low', 
        #'oils_recommended_high',
       'Avg Fats oils and salad dressings taken',
       'ESRD patients: Avg. Annual Mortality rates'
    ]
    
]

df_esrdonly_mortality_only = df_esrdonly[['ESRD patients: Avg. Annual Mortality rates']]# saving data to csv files to verify with excel analysisdata_folder = 'C:/Users/Sayed Ahmed/mrp_project_implementation/phase methodology and experiments/excel-xlstat-analysis/food-groups/'
data_folder = './data-for-code/'
df_esrdonly_actual_only.to_csv(data_folder + 'actual_only_for_excel_analysis_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')

Load file with ratios of actual intake with recommended high amount

this is done to reduce the bias for age i.e. younger kids/children might be taking less and recommendation can be less at least in couple of cases

df_diff = pd.read_csv('./data-for-code/diff_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df_diff.head()

df_diff.columnsIndex(['age_from', 'age_to', 'Gender', 'vegetables_recommended_low',
       'vegetables_recommended_high', 'Actual Vegetable Intake',
       'vegetables_ratio', 'protein_recommended_low',
       'protein_recommended_high', 'Actual Protein Intake', 'protein_ratio',
       'grains_recommended_low', 'grains_recommended_high',
       'Actual Grain Intake', 'grain_ratio', 'dairy_recommended_low',
       'dairy_recommended_high', 'Actual Dairy Intake', 'dairy_ratio',
       'fruits_recommended_low', 'fruits_recommended_high',
       'Actual Fruit intakes', 'fruits_ratio', 'sugars_recommended',
       'sugars_recommended_low', 'sugars_recommended_high',
       'Actual Added Sugars Taken', 'sugars_ratio', 'oils_recommended_low',
       'oils_recommended_high', 'Avg Fats oils and salad dressings taken',
       'oils_ratio', 'ESRD patients: Avg. Annual Mortality rates',
       'Dialysis patients: Avg. Annual Mortality rates',
       'ESRD patients: Total (or %) deaths for target year',
       'Dialysis patients: Total (or %) deaths for target year'],
      dtype='object')

only fields that provide the ratios

df_diff_ratio_only = df_diff [
          
        [
            #'age_from', 
            #'age_to', 
            #'Gender', 
            #'From: Recommended Vegetable Intake',
            #'vegetables_recommended_low', 
            #'To: Recommended Vegetable Intake',
            #'vegetables_recommended_high', 
            #'Actual Vegetable Intake',
            'vegetables_ratio', 
            #'From: Recommended Protein Intake',
            #'protein_recommended_low', 'To: Recommended Protein Intake',
            #'protein_recommended_high', 'Actual Protein Intake', 
            'protein_ratio',
            #'From: Recommended Grain Intake', 'grains_recommended_low',
            #'To: Recommended Grain Intake', 'grains_recommended_high',
            #'Actual Grain Intake', 
            'grain_ratio', 
            #'From: Recommended Dairy Intake',
            #'dairy_recommended_low', 'To: Recommended Dairy Intake',
            #'dairy_recommended_high', 'Actual Dairy Intake', 
            'dairy_ratio',
            #'From: Recommended Fruit category intakes', 'fruits_recommended_low',
            #'To: Recommended Fruit category intakes', 'fruits_recommended_high',
            #'Actual Fruit intakes', 
            'fruits_ratio', 
            #'sugars_recommended_low',
            #'sugars_recommended_high', 'Actual Added Sugars Taken', 
            'sugars_ratio',
            #'Actual Taken Sugars sweets and beverages amount',
            #'oils_recommended_low', 'oils_recommended_high',
            #'Avg Fats oils and salad dressings taken', 
            'oils_ratio',
            #'% Population got CKD', 'People (or %) progressed to Stage 3 CKD',
            #'People (or %) progressed to Stage 4 CKD',
            #'People (or %) progressed to ESRD',
            #'Received dietitian care (Optional)',
            #'Did not Receive dietitian care (Optional)',
            #'Patients went for dialysis',
            #'Patients went for Kidney Transplantation',
            'ESRD patients: Avg. Annual Mortality rates', #<------------------
            #'Dialysis patients: Avg. Annual Mortality rates',
            #'ESRD patients: Total (or %) deaths for target year',
            #'Dialysis patients: Total (or %) deaths for target year'        
    ]
    
]

df_diff_ratio_mortality = df_diff[['ESRD patients: Avg. Annual Mortality rates']]
df_diff_ratio_only.head()

df_diff_ratio_only.to_csv(data_folder + 'ratio_only_diff_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')

Want to focus on Ratio only i.e. ratio of taken against recommended_high

Actual amount might not reflect because age can bias the outcome

####----
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_diff_ratio_only)
standardisedX = pd.DataFrame(standardisedX, index=df_diff_ratio_only.index, columns=df_diff_ratio_only.columns)
standardisedX.apply(np.mean), '---\n', standardisedX.apply(np.std)(vegetables_ratio                             -5.628214e-16
 protein_ratio                                 8.635068e-17
 grain_ratio                                  -2.713879e-16
 dairy_ratio                                   4.826386e-16
 fruits_ratio                                 -6.599659e-16
 sugars_ratio                                  6.476301e-16
 oils_ratio                                   -1.060880e-15
 ESRD patients: Avg. Annual Mortality rates    1.480297e-16
 dtype: float64, '---\n', vegetables_ratio                              1.0
 protein_ratio                                 1.0
 grain_ratio                                   1.0
 dairy_ratio                                   1.0
 fruits_ratio                                  1.0
 sugars_ratio                                  1.0
 oils_ratio                                    1.0
 ESRD patients: Avg. Annual Mortality rates    1.0
 dtype: float64)

Target variable data : standard

dividing into high and low mortality

y = abs(standardisedX['ESRD patients: Avg. Annual Mortality rates']) > 0.5 
standardisedX_with_target = standardisedX
standardisedX = standardisedX.drop(['ESRD patients: Avg. Annual Mortality rates'], axis=1)from sklearn import decomposition
#pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pcaPCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)####---

#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
    names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    a = list(np.std(pca.transform(standardised_data), axis=0))
    b = list(pca.explained_variance_ratio_)
    c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
    summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
    if out:
        print("Importance of components:")
        display(summary)
    return summary####----
summary = pca_summary(pca, standardisedX)
####----Importance of components:

# First three component can define over 91%# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
np.sum( summary.sdev ** 2 )
plt.rcParams['figure.figsize'] = 8, 8

Important Components

def screeplot(pca, standardised_values):
    y = np.std(pca.transform(standardised_values), axis=0)**2
    x = np.arange(len(y)) + 1
    plt.plot(x, y, "o-")
    plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
    plt.ylabel("Variance")
    plt.title('PCA variance in Components')
    plt.savefig('./images/pca_components_variance' + '.png')            
    plt.show()

screeplot(pca, standardisedX)

comp 2 to comp 3 or 3 to 4 has the most slope change — comp 4 starts the flat line

first two or three at best first 4 can be retained

Will retain until 3 components

#summary.sdev**2
#pca.components_[0]
#np.sum(pca.components_[0]**2)####----
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function with some adjustments
def calcpc(variables, loadings):
    # find the number of samples in the data set and the number of variables
    numsamples, numvariables = variables.shape
    # make a vector to store the component
    pc = np.zeros(numsamples)
    # calculate the value of the component for each sample
    for i in range(numsamples):
        valuei = 0
        for j in range(numvariables):
            valueij = variables.iloc[i, j]
            loadingj = loadings[j]
            valuei = valuei + (valueij * loadingj)
        pc[i] = valuei
    return pc####----
calcpc(standardisedX, pca.components_[0])
pca.transform(standardisedX)[:, 0]
pca.transform(standardisedX)[:, 0]
pca.components_[1]
np.sum(pca.components_[1]**2)
#highest loadings for0.9999999999999991# Define high and low mortality
# The code below is not used.
# y as defined earlier will rather be used

#import sklearn 
#from sklearn import preprocessing
#standardisedX_mortality = sklearn.preprocessing.scale(df_diff_ratio_mortality)
#standardisedX_mortality = pd.DataFrame(standardisedX_mortality, index=df_diff_ratio_mortality.index, columns=df_diff_ratio_mortality.columns)
#standardisedX_mortality.apply(np.mean), '---', standardisedX_mortality.apply(np.std)
#standardisedX_mortality####----
# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
    foo = pca.transform(standardised_values)
    bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
    plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')            
    sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Class separations True = High Mortality')
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')            
standardisedX

<Figure size 576x576 with 0 Axes>

Plot first two components

Next section (i.e. four components) is more important than this

####----
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21


pca_components_cont = pca.components_[0:2]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(standardisedX.columns)),standardisedX.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('./images/pca_food_groups_what_contributes_to_PCA_components' + '.png')            
plt.show() # 
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component

# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
plt.rcParams['figure.figsize'] = 30, 16
components_to_count = 4

pca_components_cont = pca.components_[0:components_to_count]
ylabels = []

for c in range (components_to_count):
    ylabels.append('Component ' + str(c) )

xlabels = []    
for c in range (components_to_count):
    xlabels.append(c) 
    
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks(xlabels, ylabels, fontsize=10)
plt.colorbar()

plt.xticks(range(len(standardisedX.columns)), standardisedX.columns, rotation=90, ha='left')
plt.show()#

First three components explain the data by 91%

Contributors for First Component: Sugar, Oil, Protein and then Vegetables negatively (-0.6 approx). Dairy somewhat (0.5) positively, and Fruits (.3) minimally positively

Contributors for 2nd Component: grain strongly and negatively. Oil and protein minimally and negatively. Vegetables strongly and positively. Sugars minimally positively

Contributors for 3rd component: Fruit positively where sugars minimally positive.

Contributors for 4th component: Oil strongly and negatively

Considering all Components: Grain and Oil contribute the most negatively then protein and sugar

Fruits contributes the most positively then vegetable and dairy.

—

Can do regression with all food groups

#df_diff_ratio_with_mortality = df_diff_ratio_only
#df_diff_ratio_with_mortality['mortality_rates'] = df_diff_ratio_mortality['ESRD patients: Avg. Annual Mortality rates']#df_diff_ratio_with_mortality.corr()
standardisedX_with_target.corr()

pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
#pd.set_option('display.height', 1000)plt.figure(figsize=(8, 8))

corr = standardisedX_with_target.corr() #df_diff_ratio_with_mortality.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True,
           cmap="BuPu");
plt.suptitle('Heatmap, Correlation with Mortality Rates');
plt.savefig('./images/heatmap-ckd-mortality-regression-after-pca.png')

just with actual intake data — this can be biased by ages — still for a comparison

df_esrdonly_mortality_only.columnsIndex(['ESRD patients: Avg. Annual Mortality rates'], dtype='object')# PCA
# Applying PCA on Difference data

####----
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html

import sklearn 
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_esrdonly_actual_only)
standardisedX = pd.DataFrame(standardisedX, index=df_esrdonly_actual_only.index, columns=df_esrdonly_actual_only.columns)
standardisedX.apply(np.mean)Actual Vegetable Intake                      -3.407768e-16
Actual Protein Intake                         7.771561e-16
Actual Grain Intake                          -4.440892e-16
Actual Dairy Intake                           9.637353e-17
Actual Fruit intakes                         -3.380012e-15
Actual Added Sugars Taken                    -1.634495e-16
Avg Fats oils and salad dressings taken       8.018277e-16
ESRD patients: Avg. Annual Mortality rates    1.480297e-16
dtype: float64standardisedX_with_target = standardisedX
standardisedX_with_target

standardisedX = standardisedX.drop(['ESRD patients: Avg. Annual Mortality rates'], axis=1)

####----

standardisedX.apply(np.std)

####---

from sklearn import decomposition
#pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pca

####---

#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
    names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    a = list(np.std(pca.transform(standardised_data), axis=0))
    b = list(pca.explained_variance_ratio_)
    c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
    summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
    if out:
        print("Importance of components:")
        display(summary)
    return summary

####----

summary = pca_summary(pca, standardisedX)Importance of components:

####----

# First two component can define over 95% 
np.sum(summary.sdev**2)

# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
plt.rcParams['figure.figsize'] = 8, 8   

def screeplot(pca, standardised_values):
    y = np.std(pca.transform(standardised_values), axis=0)**2
    x = np.arange(len(y)) + 1
    plt.plot(x, y, "o-")
    plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
    plt.ylabel("Variance")
    plt.title('PCA variance in Components')
    plt.savefig('./images/pca_components_variance' + '.png')            
    plt.show()



screeplot(pca, standardisedX)
plt.savefig('./images/pca_components_variance' + '.png')

<Figure size 576x576 with 0 Axes>####---

# comp 2 to comp 3 is the most change - slope
# first two or at best first 3 can be retained

summary.sdev**2

#Can be retained
#PC1 4.263124 PC2 1.449435

pca.components_[0]
np.sum(pca.components_[0]**2)

####----

# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function
def calcpc(variables, loadings):
    # find the number of samples in the data set and the number of variables
    numsamples, numvariables = variables.shape
    # make a vector to store the component
    pc = np.zeros(numsamples)
    # calculate the value of the component for each sample
    for i in range(numsamples):
        valuei = 0
        for j in range(numvariables):
            valueij = variables.iloc[i, j]
            loadingj = loadings[j]
            valuei = valuei + (valueij * loadingj)
        pc[i] = valuei
    return pc

####----

calcpc(standardisedX, pca.components_[0])
pca.transform(standardisedX)[:, 0]
pca.transform(standardisedX)[:, 0]
pca.components_[1]
np.sum(pca.components_[1]**2)1.0000000000000007#highest loadings for

####----

# for following code : Classes will/might be defined such as like High, low, neutral mortality in final work

#df_esrdonly_mortality_only['ESRD patients: Avg. Annual Mortality rates']
#sorted(standardisedX['ESRD patients: Avg. Annual Mortality rates'])

# Define high and low mortality


#y = df_esrdonly_mortality[' ESRD patients: Avg. Annual Mortality rates'] #< 0.5
# from normalized data. > 0.5 = high mortality
#print('as the older age show higher mortality, does it mean th')
#y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
#y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
#y


#import sklearn 
#from sklearn import preprocessing
#standardisedX_mortality = sklearn.preprocessing.scale(df_esrdonly_mortality_only)
#standardisedX_mortality = pd.DataFrame(standardisedX_mortality, index=df_esrdonly_mortality_only.index,  columns=df_esrdonly_mortality_only.columns)
#standardisedX_mortality.apply(np.mean), '---', standardisedX_mortality.apply(np.std)
#standardisedX_mortality

y = standardisedX_with_target['ESRD patients: Avg. Annual Mortality rates']  > 0.5 
y0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14     True
15     True
16     True
17     True
Name: ESRD patients: Avg. Annual Mortality rates, dtype: bool####----

# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
    foo = pca.transform(standardised_values)
    bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
    plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')            
    sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

#y = df_esrdonly[' ESRD patients: Total (or %) deaths for target year']
#y = np.std(pca.transform(standardisedX), axis=0)**2
pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Only two components can separate the mortality data. True = High ')
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')<Figure size 576x576 with 0 Axes>

####----

# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
pca_components_cont = pca.components_[0:2]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(standardisedX.columns)),standardisedX.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('./images/pca_food_groups_what_contributes_to_PCA_components' + '.png')            
plt.show() # 
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component

# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
plt.rcParams['figure.figsize'] = 30, 16
components_to_count = 4

pca_components_cont = pca.components_[0:components_to_count]
ylabels = []

for c in range (components_to_count):
    ylabels.append('Component ' + str(c) )

xlabels = []    
for c in range (components_to_count):
    xlabels.append(c) 
    
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks(xlabels, ylabels, fontsize=10)
plt.colorbar()

plt.xticks(range(len(standardisedX.columns)), standardisedX.columns, rotation=90, ha='left')
plt.show()#

plt.figure(figsize=(16, 16))

corr = standardisedX_with_target.corr() #df_esrdonly_actual_only.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True,
           cmap="BuPu");
plt.suptitle('Heatmap, Correlation with Mortality Rates');

Univariate and Bivariate sections are kept for historical reasons. The input data file is changed;

UNIVARIATE

Check and plot each column. Can show differences between intake and recommended amounts

Data are not missing here. These are processed data. Data adjustments were made for age or age group missing as well as adjusted for food groups.

Fats and Sugars do not have recommended amounts in the data. Subject to be updated in future work

Bar Plot and Box Plots can show the relation clearly. Used other plots such as line, area, hist

matplotlib.style.use('ggplot')
plt.figure(figsize=(16, 5));

"""
plt.rcParams['image.cmap'] = 'jet'
#plt.rcParams['figure.figsize'] = 14, 14
df[' Actual Vegetable Intake'].plot.bar();
#plt.setxlabels(df['age_to'])
plt.suptitle('Vegetable Intake')
plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From\n Colors do not mean anything')
plt.ylabel('Intake amount in gms')
"""
# might change color palette
# https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html"\nplt.rcParams['image.cmap'] = 'jet'\n#plt.rcParams['figure.figsize'] = 14, 14\ndf[' Actual Vegetable Intake'].plot.bar();\n#plt.setxlabels(df['age_to'])\nplt.suptitle('Vegetable Intake')\nplt.xticks(range(len(df['age_to'])), df['age_from']);\nplt.xlabel('Age group: From\n Colors do not mean anything')\nplt.ylabel('Intake amount in gms')\n"




<Figure size 1152x360 with 0 Axes>df = pd.read_csv(data_folder + 'older_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
# plt.figure(figsize=(16, 5));
plt.rcParams['figure.figsize'] = 14, 5
df1 = df.iloc[:,3:6]

# df1['ESRD patients: Total (or %) deaths for target year'] = df[' ESRD patients: Total (or %) deaths for target year']
# df1['ESRD patients: Avg. Annual Mortality rates'] = df[' ESRD patients: Avg. Annual Mortality rates']
        
df2 = df.iloc[:,6:9]
df3 = df.iloc[:,9:12]
df4 = df.iloc[:,12:15]
df5 = df.iloc[:,15:18]
df6 = df.iloc[:,18:21]
df7 = df.iloc[:,21:24]
df8 = df.iloc[:,30:32]
df9 = df.iloc[:,32:34]


df1.plot.bar();
df2.plot.bar();
df3.plot.bar();
df4.plot.bar();
df5.plot.bar();
df6.plot.bar();
df7.plot.bar();
df8.plot.bar();
df9.plot.bar();

# plt.xlabel('Age Groups (from)')
# plt.ylabel('Amount in Gms')

plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From\n Colors do not mean anything, just the different column')
plt.ylabel('Intake amount in gms')Text(0,0.5,'Intake amount in gms')

df1.plot.line();
df2.plot.line();
df3.plot.line();
df4.plot.line();
df5.plot.line();
df6.plot.line();
df7.plot.line();
df8.plot.line();
df9.plot.line();

# plt.xlabel('Age Groups (from)')
# plt.ylabel('Amount in Gms')

plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From')
plt.ylabel('Intake amount in gms')Text(0,0.5,'Intake amount in gms')

df1.plot.area();
df2.plot.area();
df3.plot.area();
df4.plot.area();
df5.plot.area();
df6.plot.area();
df7.plot.area();
df8.plot.area();
df9.plot.area();


plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From')
plt.ylabel('Intake amount in gms')Text(0,0.5,'Intake amount in gms')

df1.plot.hist();
df2.plot.hist();
df3.plot.hist();
df4.plot.hist();
df5.plot.hist();
df6.plot.hist();
df7.plot.hist();
df8.plot.hist();
df9.plot.hist();

#plt.xlabel('Age Groups (from)')
#plt.ylabel('Amount in Gms')
#plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Intake/Recommended data distribution')Text(0.5,0,'Intake/Recommended data distribution')

df1.plot.box();
df2.plot.box();
df3.plot.box();
df4.plot.box();
df5.plot.box();
df6.plot.box();
plt.xticks(range(1, len(df6.columns)+1),df6.columns,rotation=90)

df77 = df.iloc[:,21:22]
df88 = df.iloc[:,30:34]
df99 = df.iloc[:,34:36]

df77.plot.box();


df88.plot.box();
#df99.plot.box();
plt.xticks(range(1, len(df88.columns)+1),df88.columns,rotation=90)

#plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')

#plt.xticks(range(len(df['age_to'])), df['age_from']);Text(0,0.5,'Amount in Gms')

df.corr()

import numpy as np
import pandas as pd 
from subprocess import check_output

from IPython.display import display, HTML
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches

import seaborn as sns
%matplotlib inlinedf = pd.read_csv('./data-for-code/older_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df = df.drop( ['Recommended Sugars sweets and beverages amount', ' % Population got CKD', ' People (or %) progressed to Stage 3 CKD', ' People (or %) progressed to Stage 4 CKD', ' People (or %) progressed to ESRD', ' Received dietitian care (Optional)', ' Did not Receive dietitian care (Optional)', ' Patients went for dialysis', ' Patients went for Kidney Transplantation' ], axis=1)
df.head()

df.corr()

plt.figure(figsize=(12, 8))

corr = df.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True);
plt.suptitle('Heatmap, Correlation All Variables');

Will Check on Actual Taken Amount Only

age_from and age_to can be removed

df = pd.read_csv('./data-for-code/older_no-empty-data-only-actual-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
# df = df.drop([ 'age_from', 'age_to' ], axis=1)
df.head()

df.describe()

#find correlation with price after standardization
df.corr()

ESRD patients: Avg. Annual Mortality rates 0.854841 0.854759 0.444925 -0.151035 -0.572187 -0.238108 -0.369639 0.022898 -0.202818 0.828148 1.000000 0.847294 0.995676 Dialysis patients: Total (or %) deaths for target year 0.914524 0.914723 0.603179 0.031982 -0.462540 -0.404490 -0.375241 0.172767 -0.087055 0.999315 0.847294 1.000000 0.880927 Dialysis patients: Avg. Annual Mortality rates 0.878659 0.879439 0.480531 -0.134234 -0.580723 -0.264096 -0.370520 0.050212 -0.208805 0.864052 0.995676 0.880927 1.000000

plt.figure(figsize=(12, 8))

corr = df.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Amounts Only');

# Without Age Groupsdf_without_ages = df.drop(['age_from', 'age_to', ' Gender'], axis=1)
df_without_ages.head()

df_without_ages.corr()

plt.figure(figsize=(12, 8))

corr = df_without_ages.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Amounts Only');
plt.savefig('./images/univariate_food_groups_heatmaps_actual_intake_amount' + '.png')

Will use Normalization all data

df = pd.read_csv('./data-for-code/older_no-empty-data-only-actual-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df = df.drop([' Gender'], axis=1)
df.head()

# normalize data
df_normalized = (df - df.mean())/ (df.max() - df.min())
df_normalized.head()
# df.head()

df_normalized.corr()

age_from age_to Actual Vegetable Intake Actual Protein Intake Actual Grain Intake Actual Dairy Intake Actual Fruit intakes Actual Taken Sugars sweets and beverages amount Avg Fats oils and salad dressings taken ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates ESRD patients: Total (or %) deaths for target year 0.907545 0.907790 0.606775 0.040950 -0.452625 -0.408543 -0.372665 0.178760 -0.081677 1.000000 0.828148 0.999315 0.864052 ESRD patients: Avg. Annual Mortality rates 0.854841 0.854759 0.444925 -0.151035 -0.572187 -0.238108 -0.369639 0.022898 -0.202818 0.828148 1.000000 0.847294 0.995676 Dialysis patients: Total (or %) deaths for target year 0.914524 0.914723 0.603179 0.031982 -0.462540 -0.404490 -0.375241 0.172767 -0.087055 0.999315 0.847294 1.000000 0.880927 Dialysis patients: Avg. Annual Mortality rates 0.878659 0.879439 0.480531 -0.134234 -0.580723 -0.264096 -0.370520 0.050212 -0.208805 0.864052 0.995676 0.880927 1.000000

plt.figure(figsize=(12, 8))

corr = df_normalized.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Amounts Only');
plt.savefig('./images/univariate_food_groups_heatmaps_normalized_actual_intake_amount' + '.png')

Will check only on the difference from average recommended amount

df = pd.read_csv('./data-for-code/copy-only-diff-no-empty-data-diff-recomm-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing .csv')
df.head()

df.describe()

# find correlation with price after standardization
df.corr()

age_from age_to Actual Vegetable Intake Actual Protein Intake Actual Grain Intake Actual Dairy Intake Actual Fruit intakes Actual Taken Sugars sweets and beverages amount Avg Fats oils and salad dressings taken Diff Vegetable diff protein diff grain diff dairy diff fruit ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates ESRD patients: Total (or %) deaths for target year

0.907545 0.907790 0.606775 0.040950 -0.452625 -0.408543 -0.372665 0.178760 -0.081677 0.426499 -0.273596 -0.268453 -0.364665 -0.488155 1.000000 0.828148 0.999315 0.864052

ESRD patients: Avg. Annual Mortality rates 0.854841 0.854759 0.444925 -0.151035 -0.572187 -0.238108 -0.369639 0.022898 -0.202818 0.359590 -0.266776 -0.263227 -0.213864 -0.392090 0.828148 1.000000 0.847294 0.995676

Dialysis patients: Total (or %) deaths for target year 0.914524 0.914723 0.603179 0.031982 -0.462540 -0.404490 -0.375241 0.172767 -0.087055 0.424340 -0.278817 -0.272938 -0.362140 -0.489873 0.999315 0.847294 1.000000 0.880927

Dialysis patients: Avg. Annual Mortality rates 0.878659 0.879439 0.480531 -0.134234 -0.580723 -0.264096 -0.370520 0.050212 -0.208805 0.401294 -0.244088 -0.246893 -0.229239 -0.387759 0.864052 0.995676 0.880927 1.000000

plt.figure(figsize=(12, 8))

corr = df.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Variables, and Difference from Recommended Intake Variables');

df.head()
df = df.drop([' Gender'], axis=1)
df.head()

# normalize data

df_normalized = (df - df.mean())/ (df.max() - df.min())
df_normalized.head()
#df.head()

df_normalized.corr()

plt.figure(figsize=(12, 8))

corr = df_normalized.corr()
sns.heatmap(corr, 
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values,
            annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Variables, and Difference from Recommended Intake Variables');
plt.savefig('./images/univariate_food_groups_heatmaps_diff_intake_amount_normalized' + '.png')

Bivariate

The most important for Bivariate: bivariate_diff_norm.png and bivariate_diff_norm_rate_only.png. Difference in intake amounts from recommended, also normalized.

Bivariate plots on actual amount intake and target variables. will be saved in bivariate.png. The correlation pattern can be checked in the image saved

Correlation within the intake amounts do not say much as we are using actual intake amounts not differences with recomended amount. Also because, higher aged people will take higher amount. Also, normalized data will give true relation.

The correlation was shown above. Now linearity and non-linearity can be seen using the Bivariate plots.

The plots show similarity with correlation

import numpy as np 
import pandas as pd 

from IPython.display import display, HTML

from matplotlib import pyplot as plt
import matplotlib.patches as mpatches

import seaborn as sns
%matplotlib inline

# ref in ref section

On Actual intake Amounts

df = pd.read_csv('./data-for-code/older_no-empty-data-only-actual-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_actual = df.drop([' Gender', 'age_from', 'age_to'], axis=1)
df_actual.head()

# on actual amounts
#plt.figure(figsize=(16, 300))
sns.pairplot(df_actual,   vars=df_actual.columns, size=5, kind='reg'); # diag_kind='kde',
plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable');
plt.savefig('./images/bivariate_food_group' + '.png')            
plt.show()

On difference from recommended amount (data not normalized)

df = pd.read_csv('./data-for-code/copy-only-diff-no-actual-no-empty-cell-diff-recomm-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing .csv')
df.head()

# On difference from recommended amount (data not normalized)

#df_actual_diff = df.drop(['age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
#df_actual_diff.head()
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_actual_diff = df.drop(['age_from', 'age_to'], axis=1)
df_actual_diff.head()
#df.head()

# On difference from recommended amount (data not normalized)

sns.pairplot(df_actual_diff,   vars=df_actual_diff.columns, size=5, kind='reg'); 
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable');
#plt.savefig('./saved_images_from_visualizations/' + 'line_' +indicator.replace(' ', '_')[0:10] + '_' + str(np.random.randint(0, 99999)) + '.png')            
plt.savefig('./images/bivariate_diff' + '.png')            
plt.show()

On difference from recommended amount (data/diff normalized)

df_normalized_diff = (df_actual_diff - df_actual_diff.mean())/ (df_actual_diff.max() - df_actual_diff.min())
df_normalized_diff.head()

sns.pairplot(df_normalized_diff,   vars=df_normalized_diff.columns, size=5, kind='reg', aspect=1); # diag_kind='kde',
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable');
plt.savefig('./images/bivariate_diff_norm' + '.png')            
plt.show()

df_normalized_diff = df_normalized_diff.drop([' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates'], axis=1)
df_normalized_diff.head()

sns.pairplot(df_normalized_diff,   vars=df_normalized_diff.columns, size=5, kind='reg', aspect=1); # diag_kind='kde',
plt.suptitle('Bivariate : Diff : Food Group: Normalized\n')
plt.ylabel('Difference in Intake amount from Recommended : Normalized')
plt.xlabel('Intakes')
plt.savefig('./images/bivariate_diff_norm_rate_only' + '.png')            
plt.show()

# can be ignored
"""
df_esrdonly = df.drop(['age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_esrdonly.head()

plt.figure(figsize=(14, 14))
sns.pairplot(df_esrdonly, diag_kind='kde');
plt.xlabel('Bivariate Plot, Difference from Recommended Variables, Total ESRD target variable');

#plt.figure(figsize=(16, 16))
plt.rcParams['figure.figsize'] = 16, 16
pd.plotting.scatter_matrix(df_esrdonly, diagonal="kde")
plt.suptitle('Scatter Matrix');
plt.show();
"""'\ndf_esrdonly = df.drop([\'age_from\', \'age_to\',\' ESRD patients: Avg. Annual Mortality rates\', \' Dialysis patients: Total (or %) deaths for target year\', \' Dialysis patients: Avg. Annual Mortality rates\' ], axis=1)\ndf_esrdonly.head()\n\nplt.figure(figsize=(14, 14))\nsns.pairplot(df_esrdonly, diag_kind=\'kde\');\nplt.xlabel(\'Bivariate Plot, Difference from Recommended Variables, Total ESRD target variable\');\n\n#plt.figure(figsize=(16, 16))\nplt.rcParams[\'figure.figsize\'] = 16, 16\npd.plotting.scatter_matrix(df_esrdonly, diagonal="kde")\nplt.suptitle(\'Scatter Matrix\');\nplt.show();\n'

PCA

Applying PCA on Difference data

df_esrdonly = df.drop(['age_from', 'age_to',' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_esrdonly.head()
#df_esrdonly.head()

# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html

import sklearn 
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_esrdonly)
standardisedX = pd.DataFrame(standardisedX, index=df_esrdonly.index, columns=df_esrdonly.columns)
standardisedX.apply(np.mean)Diff Vegetable                                -4.332954e-16
diff protein                                  -8.018277e-16
diff grain                                     5.366078e-16
diff dairy                                    -3.885781e-16
diff fruit                                    -9.251859e-17
 ESRD patients: Avg. Annual Mortality rates    1.480297e-16
dtype: float64standardisedX.apply(np.std)Diff Vegetable                                 1.0
diff protein                                   1.0
diff grain                                     1.0
diff dairy                                     1.0
diff fruit                                     1.0
 ESRD patients: Avg. Annual Mortality rates    1.0
dtype: float64from sklearn import decomposition
pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pcaPCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
    names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    a = list(np.std(pca.transform(standardised_data), axis=0))
    b = list(pca.explained_variance_ratio_)
    c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
    summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
    if out:
        print("Importance of components:")
        display(summary)
    return summarysummary = pca_summary(pca, standardisedX)Importance of components:

# First two component can define over 95%np.sum(summary.sdev**2)Standard deviation    6.0
dtype: float64# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
plt.rcParams['figure.figsize'] = 8, 8   

def screeplot(pca, standardised_values):
    y = np.std(pca.transform(standardised_values), axis=0)**2
    x = np.arange(len(y)) + 1
    plt.plot(x, y, "o-")
    plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
    plt.ylabel("Variance")
    plt.title('PCA variance in Components')
    plt.savefig('./images/pca_components_variance' + '.png')            
    plt.show()



screeplot(pca, standardisedX)

# comp 2 to comp 3 is the most change - slope
# first two or at best first 3 can be retainedsummary.sdev**2

Can be retained

PC1 4.263124 PC2 1.449435

pca.components_[0]array([ 0.33558246,  0.47787327,  0.4773956 ,  0.45412539,  0.45708252,
       -0.12633561])np.sum(pca.components_[0]**2)0.999999999999999# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function
def calcpc(variables, loadings):
    # find the number of samples in the data set and the number of variables
    numsamples, numvariables = variables.shape
    # make a vector to store the component
    pc = np.zeros(numsamples)
    # calculate the value of the component for each sample
    for i in range(numsamples):
        valuei = 0
        for j in range(numvariables):
            valueij = variables.iloc[i, j]
            loadingj = loadings[j]
            valuei = valuei + (valueij * loadingj)
        pc[i] = valuei
    return pccalcpc(standardisedX, pca.components_[0])array([ 7.33661891,  3.2759593 ,  0.87786527, -0.76096561, -1.27436192,
       -1.11656647, -1.05210251, -0.55640492, -0.52219388, -0.70806478,
       -1.0803665 , -0.08831969, -0.22064849, -0.32119398, -0.82738023,
       -0.54845766, -1.0429077 , -1.37050914])pca.transform(standardisedX)[:, 0]array([ 7.33661891,  3.2759593 ,  0.87786527, -0.76096561, -1.27436192,
       -1.11656647, -1.05210251, -0.55640492, -0.52219388, -0.70806478,
       -1.0803665 , -0.08831969, -0.22064849, -0.32119398, -0.82738023,
       -0.54845766, -1.0429077 , -1.37050914])pca.components_[1]array([ 0.56910521,  0.00870248,  0.00721787, -0.03193287, -0.18125629,
        0.80132258])np.sum(pca.components_[1]**2)0.9999999999999991

highest loadings for

# for following code : Classes will/might be defined such as like High, low, neutral mortality in final workdf_esrdonly[' ESRD patients: Avg. Annual Mortality rates']
sorted(standardisedX[' ESRD patients: Avg. Annual Mortality rates'])[-0.8508567828409127,
 -0.8447169016030269,
 -0.8078776141757118,
 -0.7541536533442106,
 -0.6983830654334141,
 -0.6114014145633645,
 -0.5351645558596153,
 -0.478882311178995,
 -0.45329947268780396,
 -0.44050805344220845,
 -0.29059261988382884,
 -0.13095570769879675,
 0.08752173301597484,
 0.32902372837281824,
 0.681043586011607,
 1.1635359199554702,
 1.6915657064136536,
 2.9441014789423674]# Define high and low mortalityy = df_esrdonly[' ESRD patients: Avg. Annual Mortality rates'] #< 0.5
# from normalized data. > 0.5 = high mortality
print('as the older age show higher mortality, does it mean th')
y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
yas the older age show higher mortality, does it mean th





0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14     True
15     True
16     True
17     True
Name:  ESRD patients: Avg. Annual Mortality rates, dtype: bool# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
    foo = pca.transform(standardised_values)
    bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
    plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')            
    sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

#y = df_esrdonly[' ESRD patients: Total (or %) deaths for target year']
#y = np.std(pca.transform(standardisedX), axis=0)**2
pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Only two components can separate the mortality data. True = High ')
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')<Figure size 576x576 with 0 Axes>

# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
pca_components_cont = pca.components_[0:2]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(df_esrdonly.columns)),df_esrdonly.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('./images/pca_food_groups_what_contributes_to_PCA_components' + '.png')            
plt.show()#

# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component

How the intake differences contributed for the affecting PCA components

diffs = list(df_esrdonly.columns[:-1]) 
import seaborn as sns
s = sns.heatmap(df_esrdonly[diffs].corr(),cmap='coolwarm') 
s.set_yticklabels(s.get_yticklabels(),rotation=30,fontsize=7)
s.set_xticklabels(s.get_xticklabels(),rotation=30,fontsize=7)
plt.savefig('./images/pca_food_groups_how_in_together_influencing_PCA_components' + '.png')            
plt.show()

Vegetable by itself is important — contributing factors

protein + grain + then fruit

grain with fruit — slighly with dairy

dairy with protein and grain*

fruit with protein and grain then dairy

References

https://www.kaggle.com/etakla/exploring-the-dataset-univariate-analysis https://www.kaggle.com/etakla/exploring-the-dataset-bivariate-analysis https://towardsdatascience.com/survival-analysis-part-a-70213df21c2e https://lifelines.readthedocs.io/en/latest/ https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html https://www.statsdirect.com/help/survival_analysis/cox_regression.htm https://courses.lumenlearning.com/suny-natural-resources-biometrics/chapter/chapter-7-correlation-and-simple-linear-regression/ https://www.kaggle.com/residentmario/univariate-plotting-with-pandas

Linear Regression Model On (NBA) Dataset

Explore and run machine learning code with Kaggle Notebooks | Using data from National Basketball Association(NBA)…

www.kaggle.com

Practical Statistics for Data Scientists

Chapter 4. Regression and Prediction Perhaps the most common goal in statistics is to answer the question: Is the…

www.oreilly.com

Pearson Correlation and Linear Regression

A correlation or simple linear regression analysis can determine if two numeric variables are significantly linearly…

sites.utexas.edu

What is the difference between correlation and linear regression?

When investigating the relationship between two or more numeric variables, it is important to know the difference…

www.graphpad.com

Simple Linear Regression and Correlation

Menu location: Analysis_Regression and Correlation_Simple Linear and Correlation. This function provides simple linear…

www.statsdirect.com

https://erc.barnard.edu/spss/pearsons_r

https://www.statpac.com/statistics-calculator/correlation-regression.htm

Correlation (Pearson, Kendall, Spearman) - Statistics Solutions

Pearson r correlation: Pearson r correlation is the most widely used correlation statistic to measure the degree of the…

www.statisticssolutions.com

11. Correlation and regression

The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a…

www.bmj.com

Conduct and Interpret a Bivariate (Pearson) Correlation - Statistics Solutions

Correlation is a widely used term in statistics. In fact, it entered the English language in 1561, 200 years before…

www.statisticssolutions.com

Python Statistics - Python p-Value, Correlation, T-test, KS Test - DataFlair

In this Python Statistics tutorial, we will learn how to calculate the p-value and Correlation in Python. Moreover, we…

data-flair.training

Bivariate plotting with pandas

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

www.kaggle.com

Exploratory Data Analysis(beginner) , Univariate , Bivariate and Multivariate — Habberman dataset.

Edit description

purnasaigudikandula.medium.com

A Comprehensive Guide to Data Exploration

Overview A complete tutorial on data exploration (EDA) We cover several data exploration aspects, including missing…

www.analyticsvidhya.com

A Little Book of Python for Multivariate Analysis - A Little Book of Python for Multivariate…

Although there are a number of ways of getting Python to your system, for a hassle free install and quick start using…

python-for-multivariate-analysis.readthedocs.io

Univariate plotting with pandas

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

www.kaggle.com

#ref https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html #https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/

Note to help with re-generating the data

To include oil and sugar recommendation amount, the sql as below was used.

also for other food groups

#/****** Script for SelectTopNRows command from SSMS ******/

Stored Procedure: [find_recommended_amounts_for_food_groups]

SELECT [age] ,[food_group] ,[average_taken] ,[recommended_low] ,[recommended_high] FROM [mrp_experiment_phase_data_process].[dbo].[age_food_group_recommended_amount] where age in ( select ‘1’ union select age_groups.[from] from age_groups) and food_group in (‘Fats, oils, and salad dressings’, ‘Sugars, sweets, and beverages’) order by food_group

The python file: for_each_age_recommended_amount_take_from_related_group.ipynb helped to generate the recommended amounts based on health.gov data Where for_each_age_recommended_amount.ipynb was the earlier versions : there are some changes in the approch

The study as below also the recommendations from health.gov was used for the recommended amount calculations https://health.gov/dietaryguidelines/2015/guidelines/chapter-2/a-closer-look-at-current-intakes-and-recommended-shifts/#figure-2-7-desc-toggle

For sugar recommended amounts: multiple sources were used. The article above and health.gov recommendations were used. health.gov recommends 10% of calory from sugar using the recommended calory intake per day Other articles were also. check the ipynb files and notes there

Folders such as: input_csvdietfiles, output_agescsvdietfiles are related to recommendation amount calculations from health.gov or shift_recommendation article data

Find relation between foodgroups and ckd mortality

The code will use PCA to find the important food groups affecting mortality

Will use heatmap to show the correlations between mortality and food groups

The code also shows univariate and bi-variate analysis of the food group data

The final outcome is correlation of Mortality and Food Groups

Core Input data File: Food Group

PCA

For recommended amounts: fields with format: vegetables_recommended_high and vegetables_recommended_low

Load file with ratios of actual intake with recommended high amount

this is done to reduce the bias for age i.e. younger kids/children might be taking less and recommendation can be less at least in couple of cases

only fields that provide the ratios

Want to focus on Ratio only i.e. ratio of taken against recommended_high

Actual amount might not reflect because age can bias the outcome

Target variable data : standard

Important Components

comp 2 to comp 3 or 3 to 4 has the most slope change — comp 4 starts the flat line

first two or three at best first 4 can be retained

Will retain until 3 components

Plot first two components

just with actual intake data — this can be biased by ages — still for a comparison

Univariate and Bivariate sections are kept for historical reasons. The input data file is changed;

UNIVARIATE

Will Check on Actual Taken Amount Only

age_from and age_to can be removed

Will use Normalization all data

Will check only on the difference from average recommended amount

Bivariate

The most important for Bivariate: bivariate_diff_norm.png and bivariate_diff_norm_rate_only.png. Difference in intake amounts from recommended, also normalized.

On Actual intake Amounts

On difference from recommended amount (data not normalized)

On difference from recommended amount (data/diff normalized)

PCA

How the intake differences contributed for the affecting PCA components

References

Linear Regression Model On (NBA) Dataset

Explore and run machine learning code with Kaggle Notebooks | Using data from National Basketball Association(NBA)…

Practical Statistics for Data Scientists

Chapter 4. Regression and Prediction Perhaps the most common goal in statistics is to answer the question: Is the…

Pearson Correlation and Linear Regression

A correlation or simple linear regression analysis can determine if two numeric variables are significantly linearly…

What is the difference between correlation and linear regression?

When investigating the relationship between two or more numeric variables, it is important to know the difference…

Simple Linear Regression and Correlation

Menu location: Analysis_Regression and Correlation_Simple Linear and Correlation. This function provides simple linear…

Correlation (Pearson, Kendall, Spearman) - Statistics Solutions

Pearson r correlation: Pearson r correlation is the most widely used correlation statistic to measure the degree of the…

11. Correlation and regression

The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a…

Conduct and Interpret a Bivariate (Pearson) Correlation - Statistics Solutions

Correlation is a widely used term in statistics. In fact, it entered the English language in 1561, 200 years before…

Python Statistics - Python p-Value, Correlation, T-test, KS Test - DataFlair

In this Python Statistics tutorial, we will learn how to calculate the p-value and Correlation in Python. Moreover, we…

Bivariate plotting with pandas

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

Exploratory Data Analysis(beginner) , Univariate , Bivariate and Multivariate — Habberman dataset.

Edit description

A Comprehensive Guide to Data Exploration

Overview A complete tutorial on data exploration (EDA) We cover several data exploration aspects, including missing…

A Little Book of Python for Multivariate Analysis - A Little Book of Python for Multivariate…

Although there are a number of ways of getting Python to your system, for a hassle free install and quick start using…

Univariate plotting with pandas

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

Note to help with re-generating the data

To include oil and sugar recommendation amount, the sql as below was used.

also for other food groups

Written by Justetc Social Services (non-profit)