Find relation between food Sub groups and ckd mortality: CKD: Chronic Kidney Disease
The code shows the relation between food sub groups and CKD mortality Food Sub Group concept is used as are used by CDC/USDA/USRDS
Dietray Guidelines are reviewed every couple of years and published https://health.gov/our-work/food-nutrition/previous-dietary-guidelines/2015
A background database was there where some data pre-processing and adjustments were made
Food Groups/Subgroups Survey Data and Mortality Data were merged together to create the dataset as used in this project Mortality Data is provided by USRDS
Food intake surveys are conducted by USDA and also published periodically https://www.nal.usda.gov/fnic/food-and-nutrition-surveys
The code will use PCA to find the important food groups affecting mortality
Will use heatmap to show the correlations between mortality and food groups
The code also shows univariate and bi-variate analysis of the food group data
The final outcome is correlation of Mortality and Food Groups
# PCA section, Correlation, Heatmaps, and then Bivariate plots can be seen to be important
# Might have to adjust/improve my conclusions from the plots in future work# Target variable ESRD patients: Avg. Annual Mortality rates is given the most importance
# One another target variable is: ESRD patients: Total (or %) deaths for target year
# Other target variables: Dialysis patients: Total (or %) deaths for target year# For plotting
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
import numpy as npimport warnings
warnings.filterwarnings('ignore')
Is not using ratio with recommended amounts. Ratios as used with Food Groups did not show any difference with the actual amount taken. However, that might change with food subgroups.
# data exploration
import pandas as pd
df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head(30)
df.describe()
df_actual_only = df.drop(['Age-group: From USRDS', 'Age-group: To USRDS', 'Gender'], axis=1)
df_actual_only.head()
####----
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_actual_only)
standardisedX = pd.DataFrame(standardisedX, index=df_actual_only.index, columns=df_actual_only.columns)
standardisedX.apply(np.mean), '---\n', standardisedX.apply(np.std)(Actual Dark-green vegetables Intake -1.727014e-16
Actual Red and orange vegetables Intake -3.083953e-16
Actual Starchy vegetables Intake 5.242720e-17
Actual Other vegetables Intake 2.220446e-16
Actual Whole grains intakes 1.258253e-15
Actual Taken Refined grains amount -1.619075e-15
Avg Meat, Poultry and Eggs subgroup taken 3.577385e-16
Avg Seafood taken 2.158767e-16
Avg Nuts, Seeds, and Soy Products taken 6.167906e-16
Avg Added Sugars/Sugars and sweets taken 2.652199e-16
Avg Oils taken -2.713879e-16
Avg Solid Fats taken -1.233581e-16
Avg Milks and milk drinks taken 2.405483e-16
Avg Water, noncarbonated intake 0.000000e+00
Avg Alcoholic beverages intake -2.713879e-16
Avg Nonalcoholic beverages taken -3.824102e-16
Avg Milk desserts, sauces, gravies taken 7.401487e-16
ESRD patients: Total (or %) deaths for target year 9.868649e-17
ESRD patients: Avg. Annual Mortality rates -9.868649e-17
Dialysis patients: Total (or %) deaths for target year 4.934325e-17
Dialysis patients: Avg. Annual Mortality rates 7.401487e-17
dtype: float64,
'---\n',
Actual Dark-green vegetables Intake 1.0
Actual Red and orange vegetables Intake 1.0
Actual Starchy vegetables Intake 1.0
Actual Other vegetables Intake 1.0
Actual Whole grains intakes 1.0
Actual Taken Refined grains amount 1.0
Avg Meat, Poultry and Eggs subgroup taken 1.0
Avg Seafood taken 1.0
Avg Nuts, Seeds, and Soy Products taken 1.0
Avg Added Sugars/Sugars and sweets taken 1.0
Avg Oils taken 1.0
Avg Solid Fats taken 1.0
Avg Milks and milk drinks taken 1.0
Avg Water, noncarbonated intake 1.0
Avg Alcoholic beverages intake 1.0
Avg Nonalcoholic beverages taken 1.0
Avg Milk desserts, sauces, gravies taken 1.0
ESRD patients: Total (or %) deaths for target year 1.0
ESRD patients: Avg. Annual Mortality rates 1.0
Dialysis patients: Total (or %) deaths for target year 1.0
Dialysis patients: Avg. Annual Mortality rates 1.0
dtype: float64)y = abs(standardisedX['ESRD patients: Avg. Annual Mortality rates']) > 0.5
standardisedX_with_target = standardisedX
standardisedX = standardisedX.drop(['ESRD patients: Avg. Annual Mortality rates', 'ESRD patients: Total (or %) deaths for target year',
'Dialysis patients: Total (or %) deaths for target year', 'Dialysis patients: Avg. Annual Mortality rates'], axis=1)from sklearn import decomposition
#pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pcaPCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summary####----
summary = pca_summary(pca, standardisedX)
####----Importance of components:
# First five component can define over 90%# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
np.sum( summary.sdev ** 2 )
plt.rcParams['figure.figsize'] = 8, 8
Important Components
def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.title('PCA variance in Components')
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_variance' + '.png')
plt.show()
screeplot(pca, standardisedX)
#plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_variance' + '.png')
comp 3 to 4 has the most slope change — comp 7 starts the flat line
first three are the most important
Will retain first five though first three will be analyzed primarily
#summary.sdev**2
#pca.components_[0]
#np.sum(pca.components_[0]**2)####----
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function with some adjustments
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pc####----
calcpc(standardisedX, pca.components_[0])
pca.transform(standardisedX)[:, 0]
pca.transform(standardisedX)[:, 0]
pca.components_[1]
np.sum(pca.components_[1]**2)
#highest loadings for0.9999999999999989# Define high and low mortality
# The code below is not used.
# y as defined earlier will rather be used
#import sklearn
#from sklearn import preprocessing
#standardisedX_mortality = sklearn.preprocessing.scale(df_diff_ratio_mortality)
#standardisedX_mortality = pd.DataFrame(standardisedX_mortality, index=df_diff_ratio_mortality.index, columns=df_diff_ratio_mortality.columns)
#standardisedX_mortality.apply(np.mean), '---', standardisedX_mortality.apply(np.std)
#standardisedX_mortality####----
# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
#plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_separating_high_low_mortality' + '.png')
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)
pca_scatter(pca, standardisedX, y)
# y can be used as classes like High, low, neutral mortality
# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Class separations True = High Mortality')
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_components_separating_high_low_mortality' + '.png')
standardisedX
# first two components really cannot separate high and low mortality very effectively
Plot first two components
Next section (i.e. four components) is more important than this
####----
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
pca_components_cont = pca.components_[0:2]
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(standardisedX.columns)),standardisedX.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_food_groups_what_contributes_to_PCA_components' + '.png')
plt.show() #
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component
First Five components
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
plt.rcParams['figure.figsize'] = 30, 16
components_to_count = 5
pca_components_cont = pca.components_[0:components_to_count]
ylabels = []
for c in range (components_to_count):
ylabels.append('Component ' + str(c) )
xlabels = []
for c in range (components_to_count):
xlabels.append(c)
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks(xlabels, ylabels, fontsize=10)
plt.colorbar()
plt.xlabel('\nContributors to the first five components')
plt.xticks(range(len(standardisedX.columns)), standardisedX.columns, rotation=90, ha='left')
plt.show()#
standardisedX.columnsIndex(['Actual Dark-green vegetables Intake',
'Actual Red and orange vegetables Intake',
'Actual Starchy vegetables Intake', 'Actual Other vegetables Intake',
'Actual Whole grains intakes', 'Actual Taken Refined grains amount',
'Avg Meat, Poultry and Eggs subgroup taken', 'Avg Seafood taken',
'Avg Nuts, Seeds, and Soy Products taken',
'Avg Added Sugars/Sugars and sweets taken', 'Avg Oils taken',
'Avg Solid Fats taken', 'Avg Milks and milk drinks taken',
'Avg Water, noncarbonated intake', 'Avg Alcoholic beverages intake',
'Avg Nonalcoholic beverages taken',
'Avg Milk desserts, sauces, gravies taken'],
dtype='object')
First five components explain the data by 90%
Contributors for the First Component: Positively: Avg Milks and milk drinks taken -> (0.4)
Negatively (-0.3 to -0.4) contributing: Actual Dark-green vegetables, Actual Starchy vegetables, Actual Whole grains, Avg Meat Poultry and Eggs subgroup, Avg Oils taken, Avg Water noncarbonated intake, Avg Nonalcoholic beverages taken
Contributors for 2nd Component: Positively: Actual Red and orange vegetables Intake, Actual Other vegetables Negatively: ‘Avg Nuts, Seeds, and Soy Products’, ‘Avg Added Sugars/Sugars and sweets’, ‘Avg Alcoholic beverages’
Contributors for 3rd component: Positively: ‘Actual Taken Refined grains amount’, ‘Avg Milk desserts, sauces, gravies taken’ Negatively: Avg Water noncarbonated intake’
Features that do not contribute much: ‘Avg Seafood taken’, ‘Avg Solid Fats taken’,
Considering all Five Components:
Positively the most contributing:
‘Actual Taken Refined grains amount’, ‘Avg Milk desserts, sauces, gravies taken’ ‘Avg Nuts, Seeds, and Soy Products taken’, ‘Actual Red and orange vegetables Intake’
Most Negatively:
‘Actual Other vegetables Intake’, ‘Avg Added Sugars/Sugars and sweets taken’, Avg Alcoholic beverages intake
Can do regression with all food sub groups except ‘Avg Seafood taken’, ‘Avg Solid Fats taken’. Though will do a comparison with them as well
standardisedX_important = standardisedX_with_target.drop(['Avg Seafood taken', 'Avg Solid Fats taken'], axis=1)standardisedX_important.head()
standardisedX_important.corr()
plt.figure(figsize=(12, 12))
corr = standardisedX_important.corr() #df_diff_ratio_with_mortality.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True,
cmap="BuPu");
plt.suptitle('Heatmap, Correlation with Mortality Rates');
plt.savefig('heatmap-ckd-mortality-subgroup-regression-after-pca.png')
standardisedX_important.columnsIndex(['Actual Dark-green vegetables Intake',
'Actual Red and orange vegetables Intake',
'Actual Starchy vegetables Intake', 'Actual Other vegetables Intake',
'Actual Whole grains intakes', 'Actual Taken Refined grains amount',
'Avg Meat, Poultry and Eggs subgroup taken',
'Avg Nuts, Seeds, and Soy Products taken',
'Avg Added Sugars/Sugars and sweets taken', 'Avg Oils taken',
'Avg Milks and milk drinks taken', 'Avg Water, noncarbonated intake',
'Avg Alcoholic beverages intake', 'Avg Nonalcoholic beverages taken',
'Avg Milk desserts, sauces, gravies taken',
'ESRD patients: Total (or %) deaths for target year',
'ESRD patients: Avg. Annual Mortality rates',
'Dialysis patients: Total (or %) deaths for target year',
'Dialysis patients: Avg. Annual Mortality rates'],
dtype='object')
Most Positvely Correlated: ‘Actual Other vegetables Intake’, ‘Actual Red and orange vegetables Intake’, ‘Actual Starchy vegetables Intake’,
Most Negatively Correlated: ‘Avg Alcoholic beverages intake’ (-0.79), ‘Avg Added Sugars/Sugars and sweets taken’ (-0.64), ‘Avg Nuts, Seeds, and Soy Products taken’ (-0.55)
Check: the data
# Positive correlationplt.rcParams['figure.figsize'] = 5, 5
positively = standardisedX_important[ ['Actual Other vegetables Intake', 'Actual Red and orange vegetables Intake', 'Actual Starchy vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_other = positively[['Actual Other vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_other.plot.line()
plt.savefig('positive_subgroup_line_1')
positively_red = positively[['Actual Red and orange vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_red.plot.line()
plt.savefig('positive_subgroup_line_2')
positively_starchy = positively[['Actual Starchy vegetables Intake', 'ESRD patients: Avg. Annual Mortality rates']]
positively_starchy.plot.line()
plt.savefig('positive_subgroup_line_3')
negatively for Food Subgroup
negatively= standardisedX_important[ ['Avg Alcoholic beverages intake', 'Avg Added Sugars/Sugars and sweets taken', 'Avg Nuts, Seeds, and Soy Products taken', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_alcohol = negatively[['Avg Alcoholic beverages intake', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_alcohol.plot.line()
plt.savefig('./saved-images/negatively_subgroup_line_1')
negatively_sugar = negatively[['Avg Added Sugars/Sugars and sweets taken', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_sugar.plot.line()
plt.savefig('./saved-images/negatively_subgroup_line_2')
negatively_nuts = negatively[['Avg Nuts, Seeds, and Soy Products taken', 'ESRD patients: Avg. Annual Mortality rates']]
negatively_nuts.plot.line()
plt.savefig('./saved-images/negatively_subgroup_line_3')
The following code were relevant for exploration phase. For experiment and methodology the above code are relevant
Univariate
Checked and plotted each column.
Recommended intake amount is not included in the data (can be calculated like food groups)
"""
plt.figure(figsize=(16, 5));
#plt.rcParams['figure.figsize'] = 14, 14
df[' Actual Dark-green vegetables Intake'].plot.bar();
#plt.setxlabels(df['age_to'])
plt.suptitle('Actual Dark-green vegetables Intake')
plt.xticks(range(len(df[' Age-group: From USRDS'])), df[' Age-group: From USRDS']);
plt.xlabel('Age Group: From')
""""\nplt.figure(figsize=(16, 5));\n#plt.rcParams['figure.figsize'] = 14, 14\ndf[' Actual Dark-green vegetables Intake'].plot.bar();\n#plt.setxlabels(df['age_to'])\nplt.suptitle('Actual Dark-green vegetables Intake')\nplt.xticks(range(len(df[' Age-group: From USRDS'])), df[' Age-group: From USRDS']);\nplt.xlabel('Age Group: From')\n"df = pd.read_csv('./data-for-code/older_no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
#plt.figure(figsize=(16, 5));
plt.rcParams['figure.figsize'] = 14, 7
df1 = df.iloc[:,3:6]
df2 = df.iloc[:,6:9]
df3 = df.iloc[:,9:12]
df4 = df.iloc[:,12:15]
df5 = df.iloc[:,15:20]
df6 = df.iloc[:,20:24]
#df7 = df.iloc[:,22:24]
#df8 = df.iloc[:,30:33]
#df9 = df.iloc[:,33:36]
df1.plot.bar();
plt.title('Univariate plots, Check Legends for Aspects Plotted')
df2.plot.bar();
df3.plot.bar();
df4.plot.bar();
df5.plot.bar();
df6.plot.bar();
#df7.plot.bar();
#df8.plot.bar();
#df9.plot.bar();
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')Text(0.5,0.98,'Univariate plots, Check Legends for Aspect')
df1.plot.line();
df2.plot.line();
df3.plot.line();
df4.plot.line();
df5.plot.line();
df6.plot.line();
#df7.plot.line();
plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
df1.plot.area();
df2.plot.area();
df3.plot.area();
df4.plot.area();
df5.plot.area();
df6.plot.area();
#df7.plot.area();
plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
df1.plot.hist();
df2.plot.hist();
df3.plot.hist();
df4.plot.hist();
df5.plot.hist();
df6.plot.hist();
#df7.plot.hist();
plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
df1.plot.box();
df2.plot.box();
df3.plot.box();
df4.plot.box();
df5.plot.box();
df6.plot.box();
#df7.plot.box();
plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
plt.suptitle('Univariate plots, Check Legends for Aspect')
plt.xticks(range(len(df[' Age-group: To USRDS'])), df[' Age-group: From USRDS']);
df.corr()
import numpy as np
import pandas as pd
from subprocess import check_output
from IPython.display import display, HTML
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
%matplotlib inlinedf = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head()
df.corr()
Age-group: From USRDS Age-group: To USRDS Actual Dark-green vegetables Intake Actual Red and orange vegetables Intake Actual Starchy vegetables Intake Actual Other vegetables Intake Actual Whole grains intakes Actual Taken Refined grains amount Avg Meat, Poultry and Eggs subgroup taken Avg Seafood taken … Avg Solid Fats taken Avg Milks and milk drinks taken Avg Water, noncarbonated intake Avg Alcoholic beverages intake Avg Nonalcoholic beverages taken Avg Milk desserts, sauces, gravies taken ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates ESRD patients: Total (or %) deaths for target year 0.909230 0.909458 0.056973 0.572518 0.517307 0.542376 -0.469825 0.112844 -0.036558 0.107024 … -0.236728 -0.598140 -0.023290 -0.738501 0.384077 0.353425 1.000000 0.833280 0.999627 0.868295 ESRD patients: Avg. Annual Mortality rates 0.854852 0.854771 -0.082390 0.545127 0.444551 0.676325 -0.609984 0.057939 -0.216943 -0.113325 … -0.365416 -0.461741 -0.150719 -0.786412 0.192810 0.336838 0.833280 1.000000 0.847302 0.995724 Dialysis patients: Total (or %) deaths for target year 0.914521 0.914720 0.054157 0.577456 0.518309 0.554746 -0.478563 0.114916 -0.042973 0.100156 … -0.241894 -0.598720 -0.028184 -0.744098 0.381084 0.357477 0.999627 0.847302 1.000000 0.880661 Dialysis patients: Avg. Annual Mortality rates 0.878499 0.879277 -0.076082 0.564846 0.453687 0.675571 -0.613042 0.035448 -0.205311 -0.097894 … -0.369415 -0.485633 -0.136271 -0.791673 0.225472 0.314988 0.868295 0.995724 0.880661 1.000000
len(df.columns)24plt.figure(figsize=(12, 8))
corr = df.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
df = df.drop(['Gender'], axis=1)df_normalized = (df - df.mean())/ (df.max() - df.min())
df_normalized.head()
df_normalized.corr()
plt.figure(figsize=(12, 8))
corr = df_normalized.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
Bivariate
Bivariate plots on actual amount intake and target variables. will be saved in bivariate_food_subgroup.png. The correlation pattern can be checked in the image saved
The correlation are shown in heatmaps and Pearson’s correlations. Now linearity and non-linearity can be seen using the Bivariate plots below.
The plots show similarity with correlations
df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head()
df.columnsIndex(['Actual Dark-green vegetables Intake',
'Actual Red and orange vegetables Intake',
'Actual Starchy vegetables Intake', 'Actual Other vegetables Intake',
'Actual Whole grains intakes', 'Actual Taken Refined grains amount',
'Avg Meat, Poultry and Eggs subgroup taken', 'Avg Seafood taken',
'Avg Nuts, Seeds, and Soy Products taken',
'Avg Added Sugars/Sugars and sweets taken', 'Avg Oils taken',
'Avg Solid Fats taken', 'Avg Milks and milk drinks taken',
'Avg Water, noncarbonated intake', 'Avg Alcoholic beverages intake',
'Avg Nonalcoholic beverages taken',
'Avg Milk desserts, sauces, gravies taken',
'ESRD patients: Total (or %) deaths for target year',
'ESRD patients: Avg. Annual Mortality rates',
'Dialysis patients: Total (or %) deaths for target year',
'Dialysis patients: Avg. Annual Mortality rates'],
dtype='object')# on actual amounts
# plt.figure(figsize=(16, 300))
#df = df.drop(['Gender', 'Age-group: From USRDS', 'Age-group: To USRDS'], axis=1)
sns.pairplot(df, vars=df.columns, size=5, kind='reg'); # diag_kind='kde',
plt.title('Bivariate Plot, Food Subgroups');
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/bivariate_food_subgroup' + '.png')
plt.show()
Bivariate plots with few variables at each plot
‘ESRD patients: Avg. Annual Mortality rates’ is used as the target variable
df1 = df.iloc[:,3:6]
df1['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']
df2 = df.iloc[:,6:9]
df2['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']
df3 = df.iloc[:,9:12]
df3['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']
df4 = df.iloc[:,12:15]
df4['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']
df5 = df.iloc[:,15:17]
df5['ESRD patients: Avg. Annual Mortality rates'] = df['ESRD patients: Avg. Annual Mortality rates']
#df6 = df.iloc[:,18:21]
df7 = df.iloc[:,17:24]
#df7[' ESRD patients: Avg. Annual Mortality rates'] = df[' ESRD patients: Avg. Annual Mortality rates']plt.figure(figsize=(14, 14))
sns.pairplot(df1, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=70);
#plt.xticks(rotation=70);
#g.set_xlabels(g.get_xlabels(), rotation=90)
#plt.xlabel(range(len(df1.columns)), df1.columns, rotation=90);
# plt.setyylable('dd', rotation=90);<Figure size 1008x1008 with 0 Axes>
sns.pairplot(df2, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=90);
sns.pairplot(df3, diag_kind='kde', kind='reg', size=5);
sns.pairplot(df4, diag_kind='kde', kind='reg', size=5);
sns.pairplot(df5, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=90);
sns.pairplot(df6, diag_kind='kde', kind='reg', size=5);
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable')
#plt.yticks(rotation=90);
PCA: Just experiment and Exploration only:
Will reuse some methods from a github project and apply on our data
#steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
df = pd.read_csv('./data-for-code/no-empty-cell-mortality_subgroup_data_june_9th_gender_neutral-based_data_after_processing.csv')
df.head()
df_pca = df.drop(['Age-group: From USRDS', 'Age-group: To USRDS', 'Gender'],axis=1)
import sklearn
from sklearn import preprocessing
import numpy as np
standardisedX = sklearn.preprocessing.scale(df_pca)
standardisedX = pd.DataFrame(standardisedX, index=df_pca.index, columns=df_pca.columns)
standardisedX.apply(np.mean)Actual Dark-green vegetables Intake -1.727014e-16
Actual Red and orange vegetables Intake -3.083953e-16
Actual Starchy vegetables Intake 5.242720e-17
Actual Other vegetables Intake 2.220446e-16
Actual Whole grains intakes 1.258253e-15
Actual Taken Refined grains amount -1.619075e-15
Avg Meat, Poultry and Eggs subgroup taken 3.577385e-16
Avg Seafood taken 2.158767e-16
Avg Nuts, Seeds, and Soy Products taken 6.167906e-16
Avg Added Sugars/Sugars and sweets taken 2.652199e-16
Avg Oils taken -2.713879e-16
Avg Solid Fats taken -1.233581e-16
Avg Milks and milk drinks taken 2.405483e-16
Avg Water, noncarbonated intake 0.000000e+00
Avg Alcoholic beverages intake -2.713879e-16
Avg Nonalcoholic beverages taken -3.824102e-16
Avg Milk desserts, sauces, gravies taken 7.401487e-16
ESRD patients: Total (or %) deaths for target year 9.868649e-17
ESRD patients: Avg. Annual Mortality rates -9.868649e-17
Dialysis patients: Total (or %) deaths for target year 4.934325e-17
Dialysis patients: Avg. Annual Mortality rates 7.401487e-17
dtype: float64standardisedX.apply(np.std)Actual Dark-green vegetables Intake 1.0
Actual Red and orange vegetables Intake 1.0
Actual Starchy vegetables Intake 1.0
Actual Other vegetables Intake 1.0
Actual Whole grains intakes 1.0
Actual Taken Refined grains amount 1.0
Avg Meat, Poultry and Eggs subgroup taken 1.0
Avg Seafood taken 1.0
Avg Nuts, Seeds, and Soy Products taken 1.0
Avg Added Sugars/Sugars and sweets taken 1.0
Avg Oils taken 1.0
Avg Solid Fats taken 1.0
Avg Milks and milk drinks taken 1.0
Avg Water, noncarbonated intake 1.0
Avg Alcoholic beverages intake 1.0
Avg Nonalcoholic beverages taken 1.0
Avg Milk desserts, sauces, gravies taken 1.0
ESRD patients: Total (or %) deaths for target year 1.0
ESRD patients: Avg. Annual Mortality rates 1.0
Dialysis patients: Total (or %) deaths for target year 1.0
Dialysis patients: Avg. Annual Mortality rates 1.0
dtype: float64from sklearn import decomposition
pca = decomposition.PCA().fit(standardisedX)
pcaPCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summarysummary = pca_summary(pca, standardisedX)Importance of components:
# First Five component can define over 91%np.sum(summary.sdev**2)Standard deviation 21.0
dtype: float64# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.show()
plt.rcParams['figure.figsize'] = 8, 8
screeplot(pca, standardisedX)
# comp 3 to comp 4 is the most change for slope
# first three or at best first 4 can be retainedsummary.sdev**2
pca.components_[0]array([-0.29107597, -0.14117176, -0.2469594 , -0.17548548, -0.25393257,
-0.21937776, -0.30982985, -0.28356691, -0.0489583 , -0.14248964,
-0.31487843, -0.26718644, 0.26503007, -0.30282359, -0.14339706,
-0.28353417, -0.23557178, -0.03748978, 0.01047719, -0.03648573,
0.00854257])np.sum(pca.components_[0]**2)0.9999999999999997# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pccalcpc(standardisedX, pca.components_[0])array([ 9.84586938, 4.93557236, 1.7245801 , 0.51665275, -2.23906467,
-2.37480972, -1.57908354, -1.82857302, -2.5779144 , -1.48209187,
-1.18821375, -1.74464742, -2.54901027, -1.14542236, -0.61057954,
0.39468681, 0.5706472 , 1.33140195])pca.transform(standardisedX)[:, 0]array([ 9.84586938, 4.93557236, 1.7245801 , 0.51665275, -2.23906467,
-2.37480972, -1.57908354, -1.82857302, -2.5779144 , -1.48209187,
-1.18821375, -1.74464742, -2.54901027, -1.14542236, -0.61057954,
0.39468681, 0.5706472 , 1.33140195])pca.components_[1]array([-0.02860489, 0.23507636, 0.18185158, 0.2330107 , -0.22042602,
0.00767883, -0.06542032, -0.01520477, -0.23666744, -0.27940994,
-0.05868199, -0.12681133, -0.19603246, -0.03891072, -0.3029888 ,
0.11437419, 0.10127936, 0.34621579, 0.35198445, 0.34906457,
0.3593115 ])np.sum(pca.components_[1]**2)1.0000000000000009# for following code : Classes will/might be defined such as like High, low, neutral mortality in final worksorted(standardisedX['ESRD patients: Avg. Annual Mortality rates'])[-0.8508781732392158,
-0.8447380350810376,
-0.807897206131968,
-0.7541709972479083,
-0.6983980756444559,
-0.6114127850702639,
-0.535172736272884,
-0.4788881364895833,
-0.45330422749717386,
-0.4405122730009692,
-0.29059056630545005,
-0.13094697419281548,
0.08753960860236082,
0.32905170949070545,
0.6810862972262587,
1.1635988208231,
1.6916507024264298,
2.9439830476048696]y = standardisedX['ESRD patients: Avg. Annual Mortality rates'] > 0.5
y0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
15 True
16 True
17 True
Name: ESRD patients: Avg. Annual Mortality rates, dtype: bool# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)
#y = df_pca[' ESRD patients: Total (or %) deaths for target year']
#y = np.std(pca.transform(standardisedX), axis=0)**2
pca_scatter(pca, standardisedX, y)
# y can be used as classes like High, low, neutral mortality
plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
#keep four components
pca_components_cont = pca.components_[0:5]
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1,2,3],['1st Comp','2nd Comp', '3rd comp', '4th comp'],fontsize=10)
plt.colorbar()
plt.xticks(range(len(df_pca.columns)), df_pca.columns, rotation=65,ha='left')
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_food_sub_groups_what_contributes_to_PCA_components' + '.png')
plt.show()#
plt.figure(figsize=(16, 10));
diffs = list(df_pca.columns[:-1])
import seaborn as sns
s = sns.heatmap(df_pca[diffs].corr(),cmap='coolwarm')
#s.set_yticklabels(s.get_yticklabels(),rotation=60,fontsize=7)
s.set_xticklabels(s.get_xticklabels(),rotation=90,fontsize=7)
plt.savefig('../../progress_reports/to_submit/pca_univariate_bivariate/pca_food_sub_groups_how_in_together_influencing_PCA_components' + '.png')
plt.show()
https://www.kaggle.com/etakla/exploring-the-dataset-univariate-analysis https://www.kaggle.com/etakla/exploring-the-dataset-bivariate-analysis https://towardsdatascience.com/survival-analysis-part-a-70213df21c2e https://lifelines.readthedocs.io/en/latest/ https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html https://www.statsdirect.com/help/survival_analysis/cox_regression.htm https://courses.lumenlearning.com/suny-natural-resources-biometrics/chapter/chapter-7-correlation-and-simple-linear-regression/