Find relation between foodgroups and ckd mortality

--

The code shows the relation between food groups and CKD mortality Food Group concept is used as are used by CDC/USDA/USRDS

Dietray Guidelines are reviewed every couple of years and published https://health.gov/our-work/food-nutrition/previous-dietary-guidelines/2015

A background database was there where some data pre-processing and adjustments were made

Food Groups Survey Data and Mortality Data were merged together to create the dataset as used in this project Mortality Data is provided by USRDS

Food intake surveys are conducted by USDA and also published periodically https://www.nal.usda.gov/fnic/food-and-nutrition-surveys

The code will use PCA to find the important food groups affecting mortality

Will use heatmap to show the correlations between mortality and food groups

The code also shows univariate and bi-variate analysis of the food group data

The final outcome is correlation of Mortality and Food Groups

Related : database table: age_food_group_recommended_amount_adjusted
stored procedures: adjusted_gms_age_food_group_recommended_amount multi_day_get_food_group_based_dietary_intake_by_participants

Cup http://dish.allrecipes.com/cup-to-gram-conversions/ More Cup and grams conversion references are provided in the stored procedures as well

# notes on regenerating some of the core data can be found near the end. Also, reports can be checked for other details# PCA section, Correlation, Heatmaps,  plots can be seen to be the most important
# Might have to adjust/improve my conclusions from the plots in future work
# Target variable ESRD patients: Avg. Annual Mortality rates is given the most importance

#Potential Others:
# One another target variable is: ESRD patients: Total (or %) deaths for target year
# Other target variables: Dialysis patients: Total (or %) deaths for target year
# For plotting, import libraries

from matplotlib import pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Core Input data File: Food Group

mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv

Intake data and recommended data

Recommended data:

Removed (no longer relevant): The ones with format: From: Recommended Vegetable Intake and To: Recommended Vegetable Intake: health.gov recommendation data (age group based) is distributed to each age then took average using age groups as used by USRDS

Relevant: The ones with format: vegetables_recommended_high and vegetables_recommended_low: health.gov recommendation data (age group based) is mapped to ages (rather than dividing and distributing) then regrouped using USRDS age groups. To mention: health.gov and USRDS provided aggregated data using different age groups. Hence, such approaches were taken

# data exploration
import pandas as pd
df = pd.read_csv('./data-for-code/mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df.head()
png
df.describe()
png

PCA

Applying PCA on Actual Intake Amount

For recommended amounts: fields with format: vegetables_recommended_high and vegetables_recommended_low

df.columnsIndex(['age_from', 'age_to', 'Gender', 'vegetables_recommended_low',
'vegetables_recommended_high', 'Actual Vegetable Intake',
'protein_recommended_low', 'protein_recommended_high',
'Actual Protein Intake', 'grains_recommended_low',
'grains_recommended_high', 'Actual Grain Intake',
'dairy_recommended_low', 'dairy_recommended_high',
'Actual Dairy Intake', 'fruits_recommended_low',
'fruits_recommended_high', 'Actual Fruit intakes', 'sugars_recommended',
'sugars_recommended_low', 'sugars_recommended_high',
'Actual Added Sugars Taken', 'oils_recommended_low',
'oils_recommended_high', 'Avg Fats oils and salad dressings taken',
'ESRD patients: Avg. Annual Mortality rates',
'Dialysis patients: Avg. Annual Mortality rates',
'ESRD patients: Total (or %) deaths for target year',
'Dialysis patients: Total (or %) deaths for target year'],
dtype='object')
# Will Keep only important columnsdf_esrdonly = df [
[
#'age_from',
#'age_to',
#'Gender',
#'From: Recommended Vegetable Intake',
'vegetables_recommended_low',
#'To: Recommended Vegetable Intake',
'vegetables_recommended_high',
'Actual Vegetable Intake',
#'From: Recommended Protein Intake',
'protein_recommended_low',
#'To: Recommended Protein Intake',
'protein_recommended_high',
'Actual Protein Intake',
#'From: Recommended Grain Intake',
'grains_recommended_low',
#'To: Recommended Grain Intake',
'grains_recommended_high',
'Actual Grain Intake',
#'From: Recommended Dairy Intake',
'dairy_recommended_low',
#'To: Recommended Dairy Intake',
'dairy_recommended_high',
'Actual Dairy Intake',
#'From: Recommended Fruit category intakes',
'fruits_recommended_low',
#'To: Recommended Fruit category intakes',
'fruits_recommended_high',
'Actual Fruit intakes',


'sugars_recommended_low',
'sugars_recommended_high',
'Actual Added Sugars Taken',
#'Actual Taken Sugars sweets and beverages amount',
'oils_recommended_low',
'oils_recommended_high',
'Avg Fats oils and salad dressings taken',


#'% Population got CKD',
#'People (or %) progressed to Stage 3 CKD',
#'People (or %) progressed to Stage 4 CKD',
#'People (or %) progressed to ESRD',
#'Received dietitian care (Optional)',
#'Did not Receive dietitian care (Optional)',
#'Patients went for dialysis',
#'Patients went for Kidney Transplantation',
'ESRD patients: Avg. Annual Mortality rates',
#'Dialysis patients: Avg. Annual Mortality rates',
#'ESRD patients: Total (or %) deaths for target year',
#'Dialysis patients: Total (or %) deaths for target year'

]

]
df_esrdonly.head()
png
df_esrdonly.columnsIndex(['vegetables_recommended_low', 'vegetables_recommended_high',
'Actual Vegetable Intake', 'protein_recommended_low',
'protein_recommended_high', 'Actual Protein Intake',
'grains_recommended_low', 'grains_recommended_high',
'Actual Grain Intake', 'dairy_recommended_low',
'dairy_recommended_high', 'Actual Dairy Intake',
'fruits_recommended_low', 'fruits_recommended_high',
'Actual Fruit intakes', 'sugars_recommended_low',
'sugars_recommended_high', 'Actual Added Sugars Taken',
'oils_recommended_low', 'oils_recommended_high',
'Avg Fats oils and salad dressings taken',
'ESRD patients: Avg. Annual Mortality rates'],
dtype='object')
# only actual intake by the participants -- othe fields such as recommended amounts are removed from this dataframedf_esrdonly_actual_only = df_esrdonly [
[
#'vegetables_recommended_low',
#'vegetables_recommended_high',
'Actual Vegetable Intake',
#'protein_recommended_low',
#'protein_recommended_high',
'Actual Protein Intake',
#'grains_recommended_low',
#'grains_recommended_high',
'Actual Grain Intake',
#'dairy_recommended_low',
#'dairy_recommended_high',
'Actual Dairy Intake',
#'fruits_recommended_low',
#'fruits_recommended_high',
'Actual Fruit intakes',
#'sugars_recommended_low',
#'sugars_recommended_high',
'Actual Added Sugars Taken',
#'Actual Taken Sugars sweets and beverages amount',
#'oils_recommended_low',
#'oils_recommended_high',
'Avg Fats oils and salad dressings taken',
'ESRD patients: Avg. Annual Mortality rates'
]

]

df_esrdonly_mortality_only = df_esrdonly[['ESRD patients: Avg. Annual Mortality rates']]
# saving data to csv files to verify with excel analysisdata_folder = 'C:/Users/Sayed Ahmed/mrp_project_implementation/phase methodology and experiments/excel-xlstat-analysis/food-groups/'
data_folder = './data-for-code/'
df_esrdonly_actual_only.to_csv(data_folder + 'actual_only_for_excel_analysis_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')

Load file with ratios of actual intake with recommended high amount

this is done to reduce the bias for age i.e. younger kids/children might be taking less and recommendation can be less at least in couple of cases

df_diff = pd.read_csv('./data-for-code/diff_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df_diff.head()
png
df_diff.columnsIndex(['age_from', 'age_to', 'Gender', 'vegetables_recommended_low',
'vegetables_recommended_high', 'Actual Vegetable Intake',
'vegetables_ratio', 'protein_recommended_low',
'protein_recommended_high', 'Actual Protein Intake', 'protein_ratio',
'grains_recommended_low', 'grains_recommended_high',
'Actual Grain Intake', 'grain_ratio', 'dairy_recommended_low',
'dairy_recommended_high', 'Actual Dairy Intake', 'dairy_ratio',
'fruits_recommended_low', 'fruits_recommended_high',
'Actual Fruit intakes', 'fruits_ratio', 'sugars_recommended',
'sugars_recommended_low', 'sugars_recommended_high',
'Actual Added Sugars Taken', 'sugars_ratio', 'oils_recommended_low',
'oils_recommended_high', 'Avg Fats oils and salad dressings taken',
'oils_ratio', 'ESRD patients: Avg. Annual Mortality rates',
'Dialysis patients: Avg. Annual Mortality rates',
'ESRD patients: Total (or %) deaths for target year',
'Dialysis patients: Total (or %) deaths for target year'],
dtype='object')

only fields that provide the ratios

df_diff_ratio_only = df_diff [

[
#'age_from',
#'age_to',
#'Gender',
#'From: Recommended Vegetable Intake',
#'vegetables_recommended_low',
#'To: Recommended Vegetable Intake',
#'vegetables_recommended_high',
#'Actual Vegetable Intake',
'vegetables_ratio',
#'From: Recommended Protein Intake',
#'protein_recommended_low', 'To: Recommended Protein Intake',
#'protein_recommended_high', 'Actual Protein Intake',
'protein_ratio',
#'From: Recommended Grain Intake', 'grains_recommended_low',
#'To: Recommended Grain Intake', 'grains_recommended_high',
#'Actual Grain Intake',
'grain_ratio',
#'From: Recommended Dairy Intake',
#'dairy_recommended_low', 'To: Recommended Dairy Intake',
#'dairy_recommended_high', 'Actual Dairy Intake',
'dairy_ratio',
#'From: Recommended Fruit category intakes', 'fruits_recommended_low',
#'To: Recommended Fruit category intakes', 'fruits_recommended_high',
#'Actual Fruit intakes',
'fruits_ratio',
#'sugars_recommended_low',
#'sugars_recommended_high', 'Actual Added Sugars Taken',
'sugars_ratio',
#'Actual Taken Sugars sweets and beverages amount',
#'oils_recommended_low', 'oils_recommended_high',
#'Avg Fats oils and salad dressings taken',
'oils_ratio',
#'% Population got CKD', 'People (or %) progressed to Stage 3 CKD',
#'People (or %) progressed to Stage 4 CKD',
#'People (or %) progressed to ESRD',
#'Received dietitian care (Optional)',
#'Did not Receive dietitian care (Optional)',
#'Patients went for dialysis',
#'Patients went for Kidney Transplantation',
'ESRD patients: Avg. Annual Mortality rates', #<------------------
#'Dialysis patients: Avg. Annual Mortality rates',
#'ESRD patients: Total (or %) deaths for target year',
#'Dialysis patients: Total (or %) deaths for target year'
]

]

df_diff_ratio_mortality = df_diff[['ESRD patients: Avg. Annual Mortality rates']]
df_diff_ratio_only.head()
png
df_diff_ratio_only.to_csv(data_folder + 'ratio_only_diff_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')

Want to focus on Ratio only i.e. ratio of taken against recommended_high

Actual amount might not reflect because age can bias the outcome

####----
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_diff_ratio_only)
standardisedX = pd.DataFrame(standardisedX, index=df_diff_ratio_only.index, columns=df_diff_ratio_only.columns)
standardisedX.apply(np.mean), '---\n', standardisedX.apply(np.std)
(vegetables_ratio -5.628214e-16
protein_ratio 8.635068e-17
grain_ratio -2.713879e-16
dairy_ratio 4.826386e-16
fruits_ratio -6.599659e-16
sugars_ratio 6.476301e-16
oils_ratio -1.060880e-15
ESRD patients: Avg. Annual Mortality rates 1.480297e-16
dtype: float64, '---\n', vegetables_ratio 1.0
protein_ratio 1.0
grain_ratio 1.0
dairy_ratio 1.0
fruits_ratio 1.0
sugars_ratio 1.0
oils_ratio 1.0
ESRD patients: Avg. Annual Mortality rates 1.0
dtype: float64)

Target variable data : standard

dividing into high and low mortality

y = abs(standardisedX['ESRD patients: Avg. Annual Mortality rates']) > 0.5 
standardisedX_with_target = standardisedX
standardisedX = standardisedX.drop(['ESRD patients: Avg. Annual Mortality rates'], axis=1)
from sklearn import decomposition
#pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pca
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
####---

#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summary
####----
summary = pca_summary(pca, standardisedX)
####----
Importance of components:
png
# First three component can define over 91%# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
np.sum( summary.sdev ** 2 )
plt.rcParams['figure.figsize'] = 8, 8

Important Components

def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.title('PCA variance in Components')
plt.savefig('./images/pca_components_variance' + '.png')
plt.show()

screeplot(pca, standardisedX)
png

comp 2 to comp 3 or 3 to 4 has the most slope change — comp 4 starts the flat line

first two or three at best first 4 can be retained

Will retain until 3 components

#summary.sdev**2
#pca.components_[0]
#np.sum(pca.components_[0]**2)
####----
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function with some adjustments
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pc
####----
calcpc(standardisedX, pca.components_[0])
pca.transform(standardisedX)[:, 0]
pca.transform(standardisedX)[:, 0]
pca.components_[1]
np.sum(pca.components_[1]**2)
#highest loadings for
0.9999999999999991# Define high and low mortality
# The code below is not used.
# y as defined earlier will rather be used

#import sklearn
#from sklearn import preprocessing
#standardisedX_mortality = sklearn.preprocessing.scale(df_diff_ratio_mortality)
#standardisedX_mortality = pd.DataFrame(standardisedX_mortality, index=df_diff_ratio_mortality.index, columns=df_diff_ratio_mortality.columns)
#standardisedX_mortality.apply(np.mean), '---', standardisedX_mortality.apply(np.std)
#standardisedX_mortality
####----
# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Class separations True = High Mortality')
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
standardisedX
png
<Figure size 576x576 with 0 Axes>
png

Plot first two components

Next section (i.e. four components) is more important than this

####----
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21


pca_components_cont = pca.components_[0:2]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(standardisedX.columns)),standardisedX.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('./images/pca_food_groups_what_contributes_to_PCA_components' + '.png')
plt.show() #
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component
png
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
plt.rcParams['figure.figsize'] = 30, 16
components_to_count = 4

pca_components_cont = pca.components_[0:components_to_count]
ylabels = []

for c in range (components_to_count):
ylabels.append('Component ' + str(c) )

xlabels = []
for c in range (components_to_count):
xlabels.append(c)

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks(xlabels, ylabels, fontsize=10)
plt.colorbar()

plt.xticks(range(len(standardisedX.columns)), standardisedX.columns, rotation=90, ha='left')
plt.show()#
png

First three components explain the data by 91%

Contributors for First Component: Sugar, Oil, Protein and then Vegetables negatively (-0.6 approx). Dairy somewhat (0.5) positively, and Fruits (.3) minimally positively

Contributors for 2nd Component: grain strongly and negatively. Oil and protein minimally and negatively. Vegetables strongly and positively. Sugars minimally positively

Contributors for 3rd component: Fruit positively where sugars minimally positive.

Contributors for 4th component: Oil strongly and negatively

Considering all Components: Grain and Oil contribute the most negatively then protein and sugar

Fruits contributes the most positively then vegetable and dairy.

Can do regression with all food groups

#df_diff_ratio_with_mortality = df_diff_ratio_only
#df_diff_ratio_with_mortality['mortality_rates'] = df_diff_ratio_mortality['ESRD patients: Avg. Annual Mortality rates']
#df_diff_ratio_with_mortality.corr()
standardisedX_with_target.corr()
png
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
#pd.set_option('display.height', 1000)
plt.figure(figsize=(8, 8))

corr = standardisedX_with_target.corr() #df_diff_ratio_with_mortality.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True,
cmap="BuPu");
plt.suptitle('Heatmap, Correlation with Mortality Rates');
plt.savefig('./images/heatmap-ckd-mortality-regression-after-pca.png')
png

just with actual intake data — this can be biased by ages — still for a comparison

df_esrdonly_mortality_only.columnsIndex(['ESRD patients: Avg. Annual Mortality rates'], dtype='object')# PCA
# Applying PCA on Difference data

####----
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html

import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_esrdonly_actual_only)
standardisedX = pd.DataFrame(standardisedX, index=df_esrdonly_actual_only.index, columns=df_esrdonly_actual_only.columns)
standardisedX.apply(np.mean)
Actual Vegetable Intake -3.407768e-16
Actual Protein Intake 7.771561e-16
Actual Grain Intake -4.440892e-16
Actual Dairy Intake 9.637353e-17
Actual Fruit intakes -3.380012e-15
Actual Added Sugars Taken -1.634495e-16
Avg Fats oils and salad dressings taken 8.018277e-16
ESRD patients: Avg. Annual Mortality rates 1.480297e-16
dtype: float64
standardisedX_with_target = standardisedX
standardisedX_with_target
png
standardisedX = standardisedX.drop(['ESRD patients: Avg. Annual Mortality rates'], axis=1)

####----

standardisedX.apply(np.std)

####---

from sklearn import decomposition
#pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pca

####---

#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summary

####----

summary = pca_summary(pca, standardisedX)
Importance of components:
png
####----

# First two component can define over 95%
np.sum(summary.sdev**2)

# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
plt.rcParams['figure.figsize'] = 8, 8

def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.title('PCA variance in Components')
plt.savefig('./images/pca_components_variance' + '.png')
plt.show()



screeplot(pca, standardisedX)
plt.savefig('./images/pca_components_variance' + '.png')
png
<Figure size 576x576 with 0 Axes>####---

# comp 2 to comp 3 is the most change - slope
# first two or at best first 3 can be retained

summary.sdev**2

#Can be retained
#PC1 4.263124 PC2 1.449435

pca.components_[0]
np.sum(pca.components_[0]**2)

####----

# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pc

####----

calcpc(standardisedX, pca.components_[0])
pca.transform(standardisedX)[:, 0]
pca.transform(standardisedX)[:, 0]
pca.components_[1]
np.sum(pca.components_[1]**2)
1.0000000000000007#highest loadings for

####----

# for following code : Classes will/might be defined such as like High, low, neutral mortality in final work

#df_esrdonly_mortality_only['ESRD patients: Avg. Annual Mortality rates']
#sorted(standardisedX['ESRD patients: Avg. Annual Mortality rates'])

# Define high and low mortality


#y = df_esrdonly_mortality[' ESRD patients: Avg. Annual Mortality rates'] #< 0.5
# from normalized data. > 0.5 = high mortality
#print('as the older age show higher mortality, does it mean th')
#y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
#y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
#y


#import sklearn
#from sklearn import preprocessing
#standardisedX_mortality = sklearn.preprocessing.scale(df_esrdonly_mortality_only)
#standardisedX_mortality = pd.DataFrame(standardisedX_mortality, index=df_esrdonly_mortality_only.index, columns=df_esrdonly_mortality_only.columns)
#standardisedX_mortality.apply(np.mean), '---', standardisedX_mortality.apply(np.std)
#standardisedX_mortality

y = standardisedX_with_target['ESRD patients: Avg. Annual Mortality rates'] > 0.5
y
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
15 True
16 True
17 True
Name: ESRD patients: Avg. Annual Mortality rates, dtype: bool
####----

# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

#y = df_esrdonly[' ESRD patients: Total (or %) deaths for target year']
#y = np.std(pca.transform(standardisedX), axis=0)**2
pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Only two components can separate the mortality data. True = High ')
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
<Figure size 576x576 with 0 Axes>
png
####----

# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
pca_components_cont = pca.components_[0:2]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(standardisedX.columns)),standardisedX.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('./images/pca_food_groups_what_contributes_to_PCA_components' + '.png')
plt.show() #
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component
png
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
plt.rcParams['figure.figsize'] = 30, 16
components_to_count = 4

pca_components_cont = pca.components_[0:components_to_count]
ylabels = []

for c in range (components_to_count):
ylabels.append('Component ' + str(c) )

xlabels = []
for c in range (components_to_count):
xlabels.append(c)

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks(xlabels, ylabels, fontsize=10)
plt.colorbar()

plt.xticks(range(len(standardisedX.columns)), standardisedX.columns, rotation=90, ha='left')
plt.show()#
png
plt.figure(figsize=(16, 16))

corr = standardisedX_with_target.corr() #df_esrdonly_actual_only.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True,
cmap="BuPu");
plt.suptitle('Heatmap, Correlation with Mortality Rates');
png

Univariate and Bivariate sections are kept for historical reasons. The input data file is changed;

UNIVARIATE

Check and plot each column. Can show differences between intake and recommended amounts

Data are not missing here. These are processed data. Data adjustments were made for age or age group missing as well as adjusted for food groups.

Fats and Sugars do not have recommended amounts in the data. Subject to be updated in future work

Bar Plot and Box Plots can show the relation clearly. Used other plots such as line, area, hist

matplotlib.style.use('ggplot')
plt.figure(figsize=(16, 5));

"""
plt.rcParams['image.cmap'] = 'jet'
#plt.rcParams['figure.figsize'] = 14, 14
df[' Actual Vegetable Intake'].plot.bar();
#plt.setxlabels(df['age_to'])
plt.suptitle('Vegetable Intake')
plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From\n Colors do not mean anything')
plt.ylabel('Intake amount in gms')
"""
# might change color palette
# https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html
"\nplt.rcParams['image.cmap'] = 'jet'\n#plt.rcParams['figure.figsize'] = 14, 14\ndf[' Actual Vegetable Intake'].plot.bar();\n#plt.setxlabels(df['age_to'])\nplt.suptitle('Vegetable Intake')\nplt.xticks(range(len(df['age_to'])), df['age_from']);\nplt.xlabel('Age group: From\n Colors do not mean anything')\nplt.ylabel('Intake amount in gms')\n"




<Figure size 1152x360 with 0 Axes>
df = pd.read_csv(data_folder + 'older_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
# plt.figure(figsize=(16, 5));
plt.rcParams['figure.figsize'] = 14, 5
df1 = df.iloc[:,3:6]

# df1['ESRD patients: Total (or %) deaths for target year'] = df[' ESRD patients: Total (or %) deaths for target year']
# df1['ESRD patients: Avg. Annual Mortality rates'] = df[' ESRD patients: Avg. Annual Mortality rates']

df2 = df.iloc[:,6:9]
df3 = df.iloc[:,9:12]
df4 = df.iloc[:,12:15]
df5 = df.iloc[:,15:18]
df6 = df.iloc[:,18:21]
df7 = df.iloc[:,21:24]
df8 = df.iloc[:,30:32]
df9 = df.iloc[:,32:34]


df1.plot.bar();
df2.plot.bar();
df3.plot.bar();
df4.plot.bar();
df5.plot.bar();
df6.plot.bar();
df7.plot.bar();
df8.plot.bar();
df9.plot.bar();

# plt.xlabel('Age Groups (from)')
# plt.ylabel('Amount in Gms')

plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From\n Colors do not mean anything, just the different column')
plt.ylabel('Intake amount in gms')
Text(0,0.5,'Intake amount in gms')
png
png
png
png
png
png
png
png
png
df1.plot.line();
df2.plot.line();
df3.plot.line();
df4.plot.line();
df5.plot.line();
df6.plot.line();
df7.plot.line();
df8.plot.line();
df9.plot.line();

# plt.xlabel('Age Groups (from)')
# plt.ylabel('Amount in Gms')

plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From')
plt.ylabel('Intake amount in gms')
Text(0,0.5,'Intake amount in gms')
png
png
png
png
png
png
png
png
png
df1.plot.area();
df2.plot.area();
df3.plot.area();
df4.plot.area();
df5.plot.area();
df6.plot.area();
df7.plot.area();
df8.plot.area();
df9.plot.area();


plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From')
plt.ylabel('Intake amount in gms')
Text(0,0.5,'Intake amount in gms')
png
png
png
png
png
png
png
png
png
df1.plot.hist();
df2.plot.hist();
df3.plot.hist();
df4.plot.hist();
df5.plot.hist();
df6.plot.hist();
df7.plot.hist();
df8.plot.hist();
df9.plot.hist();

#plt.xlabel('Age Groups (from)')
#plt.ylabel('Amount in Gms')
#plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Intake/Recommended data distribution')
Text(0.5,0,'Intake/Recommended data distribution')
png
png
png
png
png
png
png
png
png
df1.plot.box();
df2.plot.box();
df3.plot.box();
df4.plot.box();
df5.plot.box();
df6.plot.box();
plt.xticks(range(1, len(df6.columns)+1),df6.columns,rotation=90)

df77 = df.iloc[:,21:22]
df88 = df.iloc[:,30:34]
df99 = df.iloc[:,34:36]

df77.plot.box();


df88.plot.box();
#df99.plot.box();
plt.xticks(range(1, len(df88.columns)+1),df88.columns,rotation=90)

#plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')

#plt.xticks(range(len(df['age_to'])), df['age_from']);
Text(0,0.5,'Amount in Gms')
png
png
png
png
png
png
png
png
df.corr()
png
import numpy as np
import pandas as pd
from subprocess import check_output

from IPython.display import display, HTML
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches

import seaborn as sns
%matplotlib inline
df = pd.read_csv('./data-for-code/older_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df = df.drop( ['Recommended Sugars sweets and beverages amount', ' % Population got CKD', ' People (or %) progressed to Stage 3 CKD', ' People (or %) progressed to Stage 4 CKD', ' People (or %) progressed to ESRD', ' Received dietitian care (Optional)', ' Did not Receive dietitian care (Optional)', ' Patients went for dialysis', ' Patients went for Kidney Transplantation' ], axis=1)
df.head()
png
df.corr()
png
plt.figure(figsize=(12, 8))

corr = df.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation All Variables');
png

Will Check on Actual Taken Amount Only

age_from and age_to can be removed

df = pd.read_csv('./data-for-code/older_no-empty-data-only-actual-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
# df = df.drop([ 'age_from', 'age_to' ], axis=1)
df.head()
png
df.describe()
png
#find correlation with price after standardization
df.corr()
png

age_from age_to Actual Vegetable Intake Actual Protein Intake Actual Grain Intake Actual Dairy Intake Actual Fruit intakes Actual Taken Sugars sweets and beverages amount Avg Fats oils and salad dressings taken ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates 0.907545 0.907790 0.606775 0.040950 -0.452625 -0.408543 -0.372665 0.178760 -0.081677 1.000000 0.828148 0.999315 0.864052

ESRD patients: Avg. Annual Mortality rates 0.854841 0.854759 0.444925 -0.151035 -0.572187 -0.238108 -0.369639 0.022898 -0.202818 0.828148 1.000000 0.847294 0.995676 Dialysis patients: Total (or %) deaths for target year 0.914524 0.914723 0.603179 0.031982 -0.462540 -0.404490 -0.375241 0.172767 -0.087055 0.999315 0.847294 1.000000 0.880927 Dialysis patients: Avg. Annual Mortality rates 0.878659 0.879439 0.480531 -0.134234 -0.580723 -0.264096 -0.370520 0.050212 -0.208805 0.864052 0.995676 0.880927 1.000000

plt.figure(figsize=(12, 8))

corr = df.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Amounts Only');
png
# Without Age Groupsdf_without_ages = df.drop(['age_from', 'age_to', ' Gender'], axis=1)
df_without_ages.head()
png
df_without_ages.corr()
png
plt.figure(figsize=(12, 8))

corr = df_without_ages.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Amounts Only');
plt.savefig('./images/univariate_food_groups_heatmaps_actual_intake_amount' + '.png')
png

Will use Normalization all data

df = pd.read_csv('./data-for-code/older_no-empty-data-only-actual-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df = df.drop([' Gender'], axis=1)
df.head()
png
# normalize data
df_normalized = (df - df.mean())/ (df.max() - df.min())
df_normalized.head()
# df.head()
png
df_normalized.corr()
png

age_from age_to Actual Vegetable Intake Actual Protein Intake Actual Grain Intake Actual Dairy Intake Actual Fruit intakes Actual Taken Sugars sweets and beverages amount Avg Fats oils and salad dressings taken ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates ESRD patients: Total (or %) deaths for target year 0.907545 0.907790 0.606775 0.040950 -0.452625 -0.408543 -0.372665 0.178760 -0.081677 1.000000 0.828148 0.999315 0.864052 ESRD patients: Avg. Annual Mortality rates 0.854841 0.854759 0.444925 -0.151035 -0.572187 -0.238108 -0.369639 0.022898 -0.202818 0.828148 1.000000 0.847294 0.995676 Dialysis patients: Total (or %) deaths for target year 0.914524 0.914723 0.603179 0.031982 -0.462540 -0.404490 -0.375241 0.172767 -0.087055 0.999315 0.847294 1.000000 0.880927 Dialysis patients: Avg. Annual Mortality rates 0.878659 0.879439 0.480531 -0.134234 -0.580723 -0.264096 -0.370520 0.050212 -0.208805 0.864052 0.995676 0.880927 1.000000

plt.figure(figsize=(12, 8))

corr = df_normalized.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Amounts Only');
plt.savefig('./images/univariate_food_groups_heatmaps_normalized_actual_intake_amount' + '.png')
png

Will check only on the difference from average recommended amount

df = pd.read_csv('./data-for-code/copy-only-diff-no-empty-data-diff-recomm-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing .csv')
df.head()
png
df.describe()
png
# find correlation with price after standardization
df.corr()
png

age_from age_to Actual Vegetable Intake Actual Protein Intake Actual Grain Intake Actual Dairy Intake Actual Fruit intakes Actual Taken Sugars sweets and beverages amount Avg Fats oils and salad dressings taken Diff Vegetable diff protein diff grain diff dairy diff fruit ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates ESRD patients: Total (or %) deaths for target year

0.907545 0.907790 0.606775 0.040950 -0.452625 -0.408543 -0.372665 0.178760 -0.081677 0.426499 -0.273596 -0.268453 -0.364665 -0.488155 1.000000 0.828148 0.999315 0.864052

ESRD patients: Avg. Annual Mortality rates 0.854841 0.854759 0.444925 -0.151035 -0.572187 -0.238108 -0.369639 0.022898 -0.202818 0.359590 -0.266776 -0.263227 -0.213864 -0.392090 0.828148 1.000000 0.847294 0.995676

Dialysis patients: Total (or %) deaths for target year 0.914524 0.914723 0.603179 0.031982 -0.462540 -0.404490 -0.375241 0.172767 -0.087055 0.424340 -0.278817 -0.272938 -0.362140 -0.489873 0.999315 0.847294 1.000000 0.880927

Dialysis patients: Avg. Annual Mortality rates 0.878659 0.879439 0.480531 -0.134234 -0.580723 -0.264096 -0.370520 0.050212 -0.208805 0.401294 -0.244088 -0.246893 -0.229239 -0.387759 0.864052 0.995676 0.880927 1.000000

plt.figure(figsize=(12, 8))

corr = df.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Variables, and Difference from Recommended Intake Variables');
png
df.head()
df = df.drop([' Gender'], axis=1)
df.head()
png
# normalize data

df_normalized = (df - df.mean())/ (df.max() - df.min())
df_normalized.head()
#df.head()
png
df_normalized.corr()
png
plt.figure(figsize=(12, 8))

corr = df_normalized.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Variables, and Difference from Recommended Intake Variables');
plt.savefig('./images/univariate_food_groups_heatmaps_diff_intake_amount_normalized' + '.png')
png

Bivariate

The most important for Bivariate: bivariate_diff_norm.png and bivariate_diff_norm_rate_only.png. Difference in intake amounts from recommended, also normalized.

Bivariate plots on actual amount intake and target variables. will be saved in bivariate.png. The correlation pattern can be checked in the image saved

Correlation within the intake amounts do not say much as we are using actual intake amounts not differences with recomended amount. Also because, higher aged people will take higher amount. Also, normalized data will give true relation.

The correlation was shown above. Now linearity and non-linearity can be seen using the Bivariate plots.

The plots show similarity with correlation

import numpy as np 
import pandas as pd

from IPython.display import display, HTML

from matplotlib import pyplot as plt
import matplotlib.patches as mpatches

import seaborn as sns
%matplotlib inline

# ref in ref section

On Actual intake Amounts

df = pd.read_csv('./data-for-code/older_no-empty-data-only-actual-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_actual = df.drop([' Gender', 'age_from', 'age_to'], axis=1)
df_actual.head()
png
# on actual amounts
#plt.figure(figsize=(16, 300))
sns.pairplot(df_actual, vars=df_actual.columns, size=5, kind='reg'); # diag_kind='kde',
plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable');
plt.savefig('./images/bivariate_food_group' + '.png')
plt.show()
png

On difference from recommended amount (data not normalized)

df = pd.read_csv('./data-for-code/copy-only-diff-no-actual-no-empty-cell-diff-recomm-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing .csv')
df.head()
png
# On difference from recommended amount (data not normalized)

#df_actual_diff = df.drop(['age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
#df_actual_diff.head()
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_actual_diff = df.drop(['age_from', 'age_to'], axis=1)
df_actual_diff.head()
#df.head()
png
# On difference from recommended amount (data not normalized)

sns.pairplot(df_actual_diff, vars=df_actual_diff.columns, size=5, kind='reg');
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable');
#plt.savefig('./saved_images_from_visualizations/' + 'line_' +indicator.replace(' ', '_')[0:10] + '_' + str(np.random.randint(0, 99999)) + '.png')
plt.savefig('./images/bivariate_diff' + '.png')
plt.show()
png

On difference from recommended amount (data/diff normalized)

df_normalized_diff = (df_actual_diff - df_actual_diff.mean())/ (df_actual_diff.max() - df_actual_diff.min())
df_normalized_diff.head()

sns.pairplot(df_normalized_diff, vars=df_normalized_diff.columns, size=5, kind='reg', aspect=1); # diag_kind='kde',
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable');
plt.savefig('./images/bivariate_diff_norm' + '.png')
plt.show()
png
df_normalized_diff = df_normalized_diff.drop([' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates'], axis=1)
df_normalized_diff.head()
png
sns.pairplot(df_normalized_diff,   vars=df_normalized_diff.columns, size=5, kind='reg', aspect=1); # diag_kind='kde',
plt.suptitle('Bivariate : Diff : Food Group: Normalized\n')
plt.ylabel('Difference in Intake amount from Recommended : Normalized')
plt.xlabel('Intakes')
plt.savefig('./images/bivariate_diff_norm_rate_only' + '.png')
plt.show()
png
# can be ignored
"""
df_esrdonly = df.drop(['age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_esrdonly.head()

plt.figure(figsize=(14, 14))
sns.pairplot(df_esrdonly, diag_kind='kde');
plt.xlabel('Bivariate Plot, Difference from Recommended Variables, Total ESRD target variable');

#plt.figure(figsize=(16, 16))
plt.rcParams['figure.figsize'] = 16, 16
pd.plotting.scatter_matrix(df_esrdonly, diagonal="kde")
plt.suptitle('Scatter Matrix');
plt.show();
"""
'\ndf_esrdonly = df.drop([\'age_from\', \'age_to\',\' ESRD patients: Avg. Annual Mortality rates\', \' Dialysis patients: Total (or %) deaths for target year\', \' Dialysis patients: Avg. Annual Mortality rates\' ], axis=1)\ndf_esrdonly.head()\n\nplt.figure(figsize=(14, 14))\nsns.pairplot(df_esrdonly, diag_kind=\'kde\');\nplt.xlabel(\'Bivariate Plot, Difference from Recommended Variables, Total ESRD target variable\');\n\n#plt.figure(figsize=(16, 16))\nplt.rcParams[\'figure.figsize\'] = 16, 16\npd.plotting.scatter_matrix(df_esrdonly, diagonal="kde")\nplt.suptitle(\'Scatter Matrix\');\nplt.show();\n'

PCA

Applying PCA on Difference data

df_esrdonly = df.drop(['age_from', 'age_to',' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_esrdonly.head()
#df_esrdonly.head()
png
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html

import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_esrdonly)
standardisedX = pd.DataFrame(standardisedX, index=df_esrdonly.index, columns=df_esrdonly.columns)
standardisedX.apply(np.mean)
Diff Vegetable -4.332954e-16
diff protein -8.018277e-16
diff grain 5.366078e-16
diff dairy -3.885781e-16
diff fruit -9.251859e-17
ESRD patients: Avg. Annual Mortality rates 1.480297e-16
dtype: float64
standardisedX.apply(np.std)Diff Vegetable 1.0
diff protein 1.0
diff grain 1.0
diff dairy 1.0
diff fruit 1.0
ESRD patients: Avg. Annual Mortality rates 1.0
dtype: float64
from sklearn import decomposition
pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pca
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summary
summary = pca_summary(pca, standardisedX)Importance of components:
png
# First two component can define over 95%np.sum(summary.sdev**2)Standard deviation    6.0
dtype: float64
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
plt.rcParams['figure.figsize'] = 8, 8

def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.title('PCA variance in Components')
plt.savefig('./images/pca_components_variance' + '.png')
plt.show()



screeplot(pca, standardisedX)
png
# comp 2 to comp 3 is the most change - slope
# first two or at best first 3 can be retained
summary.sdev**2
png

Can be retained

PC1 4.263124 PC2 1.449435

pca.components_[0]array([ 0.33558246,  0.47787327,  0.4773956 ,  0.45412539,  0.45708252,
-0.12633561])
np.sum(pca.components_[0]**2)0.999999999999999# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pc
calcpc(standardisedX, pca.components_[0])array([ 7.33661891, 3.2759593 , 0.87786527, -0.76096561, -1.27436192,
-1.11656647, -1.05210251, -0.55640492, -0.52219388, -0.70806478,
-1.0803665 , -0.08831969, -0.22064849, -0.32119398, -0.82738023,
-0.54845766, -1.0429077 , -1.37050914])
pca.transform(standardisedX)[:, 0]array([ 7.33661891, 3.2759593 , 0.87786527, -0.76096561, -1.27436192,
-1.11656647, -1.05210251, -0.55640492, -0.52219388, -0.70806478,
-1.0803665 , -0.08831969, -0.22064849, -0.32119398, -0.82738023,
-0.54845766, -1.0429077 , -1.37050914])
pca.components_[1]array([ 0.56910521, 0.00870248, 0.00721787, -0.03193287, -0.18125629,
0.80132258])
np.sum(pca.components_[1]**2)0.9999999999999991

highest loadings for

# for following code : Classes will/might be defined such as like High, low, neutral mortality in final workdf_esrdonly[' ESRD patients: Avg. Annual Mortality rates']
sorted(standardisedX[' ESRD patients: Avg. Annual Mortality rates'])
[-0.8508567828409127,
-0.8447169016030269,
-0.8078776141757118,
-0.7541536533442106,
-0.6983830654334141,
-0.6114014145633645,
-0.5351645558596153,
-0.478882311178995,
-0.45329947268780396,
-0.44050805344220845,
-0.29059261988382884,
-0.13095570769879675,
0.08752173301597484,
0.32902372837281824,
0.681043586011607,
1.1635359199554702,
1.6915657064136536,
2.9441014789423674]
# Define high and low mortalityy = df_esrdonly[' ESRD patients: Avg. Annual Mortality rates'] #< 0.5
# from normalized data. > 0.5 = high mortality
print('as the older age show higher mortality, does it mean th')
y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
y
as the older age show higher mortality, does it mean th





0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
15 True
16 True
17 True
Name: ESRD patients: Avg. Annual Mortality rates, dtype: bool
# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)

#y = df_esrdonly[' ESRD patients: Total (or %) deaths for target year']
#y = np.std(pca.transform(standardisedX), axis=0)**2
pca_scatter(pca, standardisedX, y)

# y can be used as classes like High, low, neutral mortality

# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Only two components can separate the mortality data. True = High ')
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
<Figure size 576x576 with 0 Axes>
png
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
pca_components_cont = pca.components_[0:2]

plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(df_esrdonly.columns)),df_esrdonly.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('./images/pca_food_groups_what_contributes_to_PCA_components' + '.png')
plt.show()#
png
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component

How the intake differences contributed for the affecting PCA components

diffs = list(df_esrdonly.columns[:-1]) 
import seaborn as sns
s = sns.heatmap(df_esrdonly[diffs].corr(),cmap='coolwarm')
s.set_yticklabels(s.get_yticklabels(),rotation=30,fontsize=7)
s.set_xticklabels(s.get_xticklabels(),rotation=30,fontsize=7)
plt.savefig('./images/pca_food_groups_how_in_together_influencing_PCA_components' + '.png')
plt.show()
png

Vegetable by itself is important — contributing factors

protein + grain + then fruit

grain with fruit — slighly with dairy

dairy with protein and grain*

fruit with protein and grain then dairy

References

https://www.kaggle.com/etakla/exploring-the-dataset-univariate-analysis https://www.kaggle.com/etakla/exploring-the-dataset-bivariate-analysis https://towardsdatascience.com/survival-analysis-part-a-70213df21c2e https://lifelines.readthedocs.io/en/latest/ https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html https://www.statsdirect.com/help/survival_analysis/cox_regression.htm https://courses.lumenlearning.com/suny-natural-resources-biometrics/chapter/chapter-7-correlation-and-simple-linear-regression/ https://www.kaggle.com/residentmario/univariate-plotting-with-pandas

https://erc.barnard.edu/spss/pearsons_r

https://www.statpac.com/statistics-calculator/correlation-regression.htm

#ref https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html #https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/

Note to help with re-generating the data

To include oil and sugar recommendation amount, the sql as below was used.

also for other food groups

#/****** Script for SelectTopNRows command from SSMS ******/

Stored Procedure: [find_recommended_amounts_for_food_groups]

SELECT [age] ,[food_group] ,[average_taken] ,[recommended_low] ,[recommended_high] FROM [mrp_experiment_phase_data_process].[dbo].[age_food_group_recommended_amount] where age in ( select ‘1’ union select age_groups.[from] from age_groups) and food_group in (‘Fats, oils, and salad dressings’, ‘Sugars, sweets, and beverages’) order by food_group

The python file: for_each_age_recommended_amount_take_from_related_group.ipynb helped to generate the recommended amounts based on health.gov data Where for_each_age_recommended_amount.ipynb was the earlier versions : there are some changes in the approch

The study as below also the recommendations from health.gov was used for the recommended amount calculations https://health.gov/dietaryguidelines/2015/guidelines/chapter-2/a-closer-look-at-current-intakes-and-recommended-shifts/#figure-2-7-desc-toggle

For sugar recommended amounts: multiple sources were used. The article above and health.gov recommendations were used. health.gov recommends 10% of calory from sugar using the recommended calory intake per day Other articles were also. check the ipynb files and notes there

Folders such as: input_csvdietfiles, output_agescsvdietfiles are related to recommendation amount calculations from health.gov or shift_recommendation article data

--

--

Justetc Social Services (non-profit)
Data Science Project Development

All proceeds from Medium will go to Justetc Social Services ( non-profit). Justetc Social Services provides services in the Training and Education Areas.