Find relation between foodgroups and ckd mortality
The code shows the relation between food groups and CKD mortality Food Group concept is used as are used by CDC/USDA/USRDS
Dietray Guidelines are reviewed every couple of years and published https://health.gov/our-work/food-nutrition/previous-dietary-guidelines/2015
A background database was there where some data pre-processing and adjustments were made
Food Groups Survey Data and Mortality Data were merged together to create the dataset as used in this project Mortality Data is provided by USRDS
Food intake surveys are conducted by USDA and also published periodically https://www.nal.usda.gov/fnic/food-and-nutrition-surveys
The code will use PCA to find the important food groups affecting mortality
Will use heatmap to show the correlations between mortality and food groups
The code also shows univariate and bi-variate analysis of the food group data
The final outcome is correlation of Mortality and Food Groups
Related : database table: age_food_group_recommended_amount_adjusted
stored procedures: adjusted_gms_age_food_group_recommended_amount multi_day_get_food_group_based_dietary_intake_by_participants
Cup http://dish.allrecipes.com/cup-to-gram-conversions/ More Cup and grams conversion references are provided in the stored procedures as well
# notes on regenerating some of the core data can be found near the end. Also, reports can be checked for other details# PCA section, Correlation, Heatmaps, plots can be seen to be the most important
# Might have to adjust/improve my conclusions from the plots in future work# Target variable ESRD patients: Avg. Annual Mortality rates is given the most importance
#Potential Others:
# One another target variable is: ESRD patients: Total (or %) deaths for target year
# Other target variables: Dialysis patients: Total (or %) deaths for target year# For plotting, import libraries
from matplotlib import pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline
import numpy as npimport warnings
warnings.filterwarnings('ignore')
Core Input data File: Food Group
mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv
Intake data and recommended data
Recommended data:
Removed (no longer relevant): The ones with format: From: Recommended Vegetable Intake and To: Recommended Vegetable Intake: health.gov recommendation data (age group based) is distributed to each age then took average using age groups as used by USRDS
Relevant: The ones with format: vegetables_recommended_high and vegetables_recommended_low: health.gov recommendation data (age group based) is mapped to ages (rather than dividing and distributing) then regrouped using USRDS age groups. To mention: health.gov and USRDS provided aggregated data using different age groups. Hence, such approaches were taken
# data exploration
import pandas as pd
df = pd.read_csv('./data-for-code/mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df.head()
df.describe()
PCA
Applying PCA on Actual Intake Amount
For recommended amounts: fields with format: vegetables_recommended_high and vegetables_recommended_low
df.columnsIndex(['age_from', 'age_to', 'Gender', 'vegetables_recommended_low',
'vegetables_recommended_high', 'Actual Vegetable Intake',
'protein_recommended_low', 'protein_recommended_high',
'Actual Protein Intake', 'grains_recommended_low',
'grains_recommended_high', 'Actual Grain Intake',
'dairy_recommended_low', 'dairy_recommended_high',
'Actual Dairy Intake', 'fruits_recommended_low',
'fruits_recommended_high', 'Actual Fruit intakes', 'sugars_recommended',
'sugars_recommended_low', 'sugars_recommended_high',
'Actual Added Sugars Taken', 'oils_recommended_low',
'oils_recommended_high', 'Avg Fats oils and salad dressings taken',
'ESRD patients: Avg. Annual Mortality rates',
'Dialysis patients: Avg. Annual Mortality rates',
'ESRD patients: Total (or %) deaths for target year',
'Dialysis patients: Total (or %) deaths for target year'],
dtype='object')# Will Keep only important columnsdf_esrdonly = df [
[
#'age_from',
#'age_to',
#'Gender',
#'From: Recommended Vegetable Intake',
'vegetables_recommended_low',
#'To: Recommended Vegetable Intake',
'vegetables_recommended_high',
'Actual Vegetable Intake',
#'From: Recommended Protein Intake',
'protein_recommended_low',
#'To: Recommended Protein Intake',
'protein_recommended_high',
'Actual Protein Intake',
#'From: Recommended Grain Intake',
'grains_recommended_low',
#'To: Recommended Grain Intake',
'grains_recommended_high',
'Actual Grain Intake',
#'From: Recommended Dairy Intake',
'dairy_recommended_low',
#'To: Recommended Dairy Intake',
'dairy_recommended_high',
'Actual Dairy Intake',
#'From: Recommended Fruit category intakes',
'fruits_recommended_low',
#'To: Recommended Fruit category intakes',
'fruits_recommended_high',
'Actual Fruit intakes',
'sugars_recommended_low',
'sugars_recommended_high',
'Actual Added Sugars Taken',
#'Actual Taken Sugars sweets and beverages amount',
'oils_recommended_low',
'oils_recommended_high',
'Avg Fats oils and salad dressings taken',
#'% Population got CKD',
#'People (or %) progressed to Stage 3 CKD',
#'People (or %) progressed to Stage 4 CKD',
#'People (or %) progressed to ESRD',
#'Received dietitian care (Optional)',
#'Did not Receive dietitian care (Optional)',
#'Patients went for dialysis',
#'Patients went for Kidney Transplantation',
'ESRD patients: Avg. Annual Mortality rates',
#'Dialysis patients: Avg. Annual Mortality rates',
#'ESRD patients: Total (or %) deaths for target year',
#'Dialysis patients: Total (or %) deaths for target year'
]
]
df_esrdonly.head()
df_esrdonly.columnsIndex(['vegetables_recommended_low', 'vegetables_recommended_high',
'Actual Vegetable Intake', 'protein_recommended_low',
'protein_recommended_high', 'Actual Protein Intake',
'grains_recommended_low', 'grains_recommended_high',
'Actual Grain Intake', 'dairy_recommended_low',
'dairy_recommended_high', 'Actual Dairy Intake',
'fruits_recommended_low', 'fruits_recommended_high',
'Actual Fruit intakes', 'sugars_recommended_low',
'sugars_recommended_high', 'Actual Added Sugars Taken',
'oils_recommended_low', 'oils_recommended_high',
'Avg Fats oils and salad dressings taken',
'ESRD patients: Avg. Annual Mortality rates'],
dtype='object')# only actual intake by the participants -- othe fields such as recommended amounts are removed from this dataframedf_esrdonly_actual_only = df_esrdonly [
[
#'vegetables_recommended_low',
#'vegetables_recommended_high',
'Actual Vegetable Intake',
#'protein_recommended_low',
#'protein_recommended_high',
'Actual Protein Intake',
#'grains_recommended_low',
#'grains_recommended_high',
'Actual Grain Intake',
#'dairy_recommended_low',
#'dairy_recommended_high',
'Actual Dairy Intake',
#'fruits_recommended_low',
#'fruits_recommended_high',
'Actual Fruit intakes',
#'sugars_recommended_low',
#'sugars_recommended_high',
'Actual Added Sugars Taken',
#'Actual Taken Sugars sweets and beverages amount',
#'oils_recommended_low',
#'oils_recommended_high',
'Avg Fats oils and salad dressings taken',
'ESRD patients: Avg. Annual Mortality rates'
]
]
df_esrdonly_mortality_only = df_esrdonly[['ESRD patients: Avg. Annual Mortality rates']]# saving data to csv files to verify with excel analysisdata_folder = 'C:/Users/Sayed Ahmed/mrp_project_implementation/phase methodology and experiments/excel-xlstat-analysis/food-groups/'
data_folder = './data-for-code/'
df_esrdonly_actual_only.to_csv(data_folder + 'actual_only_for_excel_analysis_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
Load file with ratios of actual intake with recommended high amount
this is done to reduce the bias for age i.e. younger kids/children might be taking less and recommendation can be less at least in couple of cases
df_diff = pd.read_csv('./data-for-code/diff_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df_diff.head()
df_diff.columnsIndex(['age_from', 'age_to', 'Gender', 'vegetables_recommended_low',
'vegetables_recommended_high', 'Actual Vegetable Intake',
'vegetables_ratio', 'protein_recommended_low',
'protein_recommended_high', 'Actual Protein Intake', 'protein_ratio',
'grains_recommended_low', 'grains_recommended_high',
'Actual Grain Intake', 'grain_ratio', 'dairy_recommended_low',
'dairy_recommended_high', 'Actual Dairy Intake', 'dairy_ratio',
'fruits_recommended_low', 'fruits_recommended_high',
'Actual Fruit intakes', 'fruits_ratio', 'sugars_recommended',
'sugars_recommended_low', 'sugars_recommended_high',
'Actual Added Sugars Taken', 'sugars_ratio', 'oils_recommended_low',
'oils_recommended_high', 'Avg Fats oils and salad dressings taken',
'oils_ratio', 'ESRD patients: Avg. Annual Mortality rates',
'Dialysis patients: Avg. Annual Mortality rates',
'ESRD patients: Total (or %) deaths for target year',
'Dialysis patients: Total (or %) deaths for target year'],
dtype='object')
only fields that provide the ratios
df_diff_ratio_only = df_diff [
[
#'age_from',
#'age_to',
#'Gender',
#'From: Recommended Vegetable Intake',
#'vegetables_recommended_low',
#'To: Recommended Vegetable Intake',
#'vegetables_recommended_high',
#'Actual Vegetable Intake',
'vegetables_ratio',
#'From: Recommended Protein Intake',
#'protein_recommended_low', 'To: Recommended Protein Intake',
#'protein_recommended_high', 'Actual Protein Intake',
'protein_ratio',
#'From: Recommended Grain Intake', 'grains_recommended_low',
#'To: Recommended Grain Intake', 'grains_recommended_high',
#'Actual Grain Intake',
'grain_ratio',
#'From: Recommended Dairy Intake',
#'dairy_recommended_low', 'To: Recommended Dairy Intake',
#'dairy_recommended_high', 'Actual Dairy Intake',
'dairy_ratio',
#'From: Recommended Fruit category intakes', 'fruits_recommended_low',
#'To: Recommended Fruit category intakes', 'fruits_recommended_high',
#'Actual Fruit intakes',
'fruits_ratio',
#'sugars_recommended_low',
#'sugars_recommended_high', 'Actual Added Sugars Taken',
'sugars_ratio',
#'Actual Taken Sugars sweets and beverages amount',
#'oils_recommended_low', 'oils_recommended_high',
#'Avg Fats oils and salad dressings taken',
'oils_ratio',
#'% Population got CKD', 'People (or %) progressed to Stage 3 CKD',
#'People (or %) progressed to Stage 4 CKD',
#'People (or %) progressed to ESRD',
#'Received dietitian care (Optional)',
#'Did not Receive dietitian care (Optional)',
#'Patients went for dialysis',
#'Patients went for Kidney Transplantation',
'ESRD patients: Avg. Annual Mortality rates', #<------------------
#'Dialysis patients: Avg. Annual Mortality rates',
#'ESRD patients: Total (or %) deaths for target year',
#'Dialysis patients: Total (or %) deaths for target year'
]
]
df_diff_ratio_mortality = df_diff[['ESRD patients: Avg. Annual Mortality rates']]
df_diff_ratio_only.head()
df_diff_ratio_only.to_csv(data_folder + 'ratio_only_diff_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
Want to focus on Ratio only i.e. ratio of taken against recommended_high
Actual amount might not reflect because age can bias the outcome
####----
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_diff_ratio_only)
standardisedX = pd.DataFrame(standardisedX, index=df_diff_ratio_only.index, columns=df_diff_ratio_only.columns)
standardisedX.apply(np.mean), '---\n', standardisedX.apply(np.std)(vegetables_ratio -5.628214e-16
protein_ratio 8.635068e-17
grain_ratio -2.713879e-16
dairy_ratio 4.826386e-16
fruits_ratio -6.599659e-16
sugars_ratio 6.476301e-16
oils_ratio -1.060880e-15
ESRD patients: Avg. Annual Mortality rates 1.480297e-16
dtype: float64, '---\n', vegetables_ratio 1.0
protein_ratio 1.0
grain_ratio 1.0
dairy_ratio 1.0
fruits_ratio 1.0
sugars_ratio 1.0
oils_ratio 1.0
ESRD patients: Avg. Annual Mortality rates 1.0
dtype: float64)
Target variable data : standard
dividing into high and low mortality
y = abs(standardisedX['ESRD patients: Avg. Annual Mortality rates']) > 0.5
standardisedX_with_target = standardisedX
standardisedX = standardisedX.drop(['ESRD patients: Avg. Annual Mortality rates'], axis=1)from sklearn import decomposition
#pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pcaPCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)####---
#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summary####----
summary = pca_summary(pca, standardisedX)
####----Importance of components:
# First three component can define over 91%# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
np.sum( summary.sdev ** 2 )
plt.rcParams['figure.figsize'] = 8, 8
Important Components
def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.title('PCA variance in Components')
plt.savefig('./images/pca_components_variance' + '.png')
plt.show()
screeplot(pca, standardisedX)
comp 2 to comp 3 or 3 to 4 has the most slope change — comp 4 starts the flat line
first two or three at best first 4 can be retained
Will retain until 3 components
#summary.sdev**2
#pca.components_[0]
#np.sum(pca.components_[0]**2)####----
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function with some adjustments
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pc####----
calcpc(standardisedX, pca.components_[0])
pca.transform(standardisedX)[:, 0]
pca.transform(standardisedX)[:, 0]
pca.components_[1]
np.sum(pca.components_[1]**2)
#highest loadings for0.9999999999999991# Define high and low mortality
# The code below is not used.
# y as defined earlier will rather be used
#import sklearn
#from sklearn import preprocessing
#standardisedX_mortality = sklearn.preprocessing.scale(df_diff_ratio_mortality)
#standardisedX_mortality = pd.DataFrame(standardisedX_mortality, index=df_diff_ratio_mortality.index, columns=df_diff_ratio_mortality.columns)
#standardisedX_mortality.apply(np.mean), '---', standardisedX_mortality.apply(np.std)
#standardisedX_mortality####----
# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)
pca_scatter(pca, standardisedX, y)
# y can be used as classes like High, low, neutral mortality
# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Class separations True = High Mortality')
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
standardisedX
<Figure size 576x576 with 0 Axes>
Plot first two components
Next section (i.e. four components) is more important than this
####----
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
pca_components_cont = pca.components_[0:2]
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(standardisedX.columns)),standardisedX.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('./images/pca_food_groups_what_contributes_to_PCA_components' + '.png')
plt.show() #
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
plt.rcParams['figure.figsize'] = 30, 16
components_to_count = 4
pca_components_cont = pca.components_[0:components_to_count]
ylabels = []
for c in range (components_to_count):
ylabels.append('Component ' + str(c) )
xlabels = []
for c in range (components_to_count):
xlabels.append(c)
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks(xlabels, ylabels, fontsize=10)
plt.colorbar()
plt.xticks(range(len(standardisedX.columns)), standardisedX.columns, rotation=90, ha='left')
plt.show()#
First three components explain the data by 91%
Contributors for First Component: Sugar, Oil, Protein and then Vegetables negatively (-0.6 approx). Dairy somewhat (0.5) positively, and Fruits (.3) minimally positively
Contributors for 2nd Component: grain strongly and negatively. Oil and protein minimally and negatively. Vegetables strongly and positively. Sugars minimally positively
Contributors for 3rd component: Fruit positively where sugars minimally positive.
Contributors for 4th component: Oil strongly and negatively
Considering all Components: Grain and Oil contribute the most negatively then protein and sugar
Fruits contributes the most positively then vegetable and dairy.
—
Can do regression with all food groups
#df_diff_ratio_with_mortality = df_diff_ratio_only
#df_diff_ratio_with_mortality['mortality_rates'] = df_diff_ratio_mortality['ESRD patients: Avg. Annual Mortality rates']#df_diff_ratio_with_mortality.corr()
standardisedX_with_target.corr()
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
#pd.set_option('display.height', 1000)plt.figure(figsize=(8, 8))
corr = standardisedX_with_target.corr() #df_diff_ratio_with_mortality.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True,
cmap="BuPu");
plt.suptitle('Heatmap, Correlation with Mortality Rates');
plt.savefig('./images/heatmap-ckd-mortality-regression-after-pca.png')
just with actual intake data — this can be biased by ages — still for a comparison
df_esrdonly_mortality_only.columnsIndex(['ESRD patients: Avg. Annual Mortality rates'], dtype='object')# PCA
# Applying PCA on Difference data
####----
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_esrdonly_actual_only)
standardisedX = pd.DataFrame(standardisedX, index=df_esrdonly_actual_only.index, columns=df_esrdonly_actual_only.columns)
standardisedX.apply(np.mean)Actual Vegetable Intake -3.407768e-16
Actual Protein Intake 7.771561e-16
Actual Grain Intake -4.440892e-16
Actual Dairy Intake 9.637353e-17
Actual Fruit intakes -3.380012e-15
Actual Added Sugars Taken -1.634495e-16
Avg Fats oils and salad dressings taken 8.018277e-16
ESRD patients: Avg. Annual Mortality rates 1.480297e-16
dtype: float64standardisedX_with_target = standardisedX
standardisedX_with_target
standardisedX = standardisedX.drop(['ESRD patients: Avg. Annual Mortality rates'], axis=1)
####----
standardisedX.apply(np.std)
####---
from sklearn import decomposition
#pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pca
####---
#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summary
####----
summary = pca_summary(pca, standardisedX)Importance of components:
####----
# First two component can define over 95%
np.sum(summary.sdev**2)
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
plt.rcParams['figure.figsize'] = 8, 8
def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.title('PCA variance in Components')
plt.savefig('./images/pca_components_variance' + '.png')
plt.show()
screeplot(pca, standardisedX)
plt.savefig('./images/pca_components_variance' + '.png')
<Figure size 576x576 with 0 Axes>####---
# comp 2 to comp 3 is the most change - slope
# first two or at best first 3 can be retained
summary.sdev**2
#Can be retained
#PC1 4.263124 PC2 1.449435
pca.components_[0]
np.sum(pca.components_[0]**2)
####----
# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pc
####----
calcpc(standardisedX, pca.components_[0])
pca.transform(standardisedX)[:, 0]
pca.transform(standardisedX)[:, 0]
pca.components_[1]
np.sum(pca.components_[1]**2)1.0000000000000007#highest loadings for
####----
# for following code : Classes will/might be defined such as like High, low, neutral mortality in final work
#df_esrdonly_mortality_only['ESRD patients: Avg. Annual Mortality rates']
#sorted(standardisedX['ESRD patients: Avg. Annual Mortality rates'])
# Define high and low mortality
#y = df_esrdonly_mortality[' ESRD patients: Avg. Annual Mortality rates'] #< 0.5
# from normalized data. > 0.5 = high mortality
#print('as the older age show higher mortality, does it mean th')
#y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
#y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
#y
#import sklearn
#from sklearn import preprocessing
#standardisedX_mortality = sklearn.preprocessing.scale(df_esrdonly_mortality_only)
#standardisedX_mortality = pd.DataFrame(standardisedX_mortality, index=df_esrdonly_mortality_only.index, columns=df_esrdonly_mortality_only.columns)
#standardisedX_mortality.apply(np.mean), '---', standardisedX_mortality.apply(np.std)
#standardisedX_mortality
y = standardisedX_with_target['ESRD patients: Avg. Annual Mortality rates'] > 0.5
y0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
15 True
16 True
17 True
Name: ESRD patients: Avg. Annual Mortality rates, dtype: bool####----
# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)
#y = df_esrdonly[' ESRD patients: Total (or %) deaths for target year']
#y = np.std(pca.transform(standardisedX), axis=0)**2
pca_scatter(pca, standardisedX, y)
# y can be used as classes like High, low, neutral mortality
# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Only two components can separate the mortality data. True = High ')
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')<Figure size 576x576 with 0 Axes>
####----
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
pca_components_cont = pca.components_[0:2]
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(standardisedX.columns)),standardisedX.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('./images/pca_food_groups_what_contributes_to_PCA_components' + '.png')
plt.show() #
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
plt.rcParams['figure.figsize'] = 30, 16
components_to_count = 4
pca_components_cont = pca.components_[0:components_to_count]
ylabels = []
for c in range (components_to_count):
ylabels.append('Component ' + str(c) )
xlabels = []
for c in range (components_to_count):
xlabels.append(c)
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks(xlabels, ylabels, fontsize=10)
plt.colorbar()
plt.xticks(range(len(standardisedX.columns)), standardisedX.columns, rotation=90, ha='left')
plt.show()#
plt.figure(figsize=(16, 16))
corr = standardisedX_with_target.corr() #df_esrdonly_actual_only.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True,
cmap="BuPu");
plt.suptitle('Heatmap, Correlation with Mortality Rates');
Univariate and Bivariate sections are kept for historical reasons. The input data file is changed;
UNIVARIATE
Check and plot each column. Can show differences between intake and recommended amounts
Data are not missing here. These are processed data. Data adjustments were made for age or age group missing as well as adjusted for food groups.
Fats and Sugars do not have recommended amounts in the data. Subject to be updated in future work
Bar Plot and Box Plots can show the relation clearly. Used other plots such as line, area, hist
matplotlib.style.use('ggplot')
plt.figure(figsize=(16, 5));
"""
plt.rcParams['image.cmap'] = 'jet'
#plt.rcParams['figure.figsize'] = 14, 14
df[' Actual Vegetable Intake'].plot.bar();
#plt.setxlabels(df['age_to'])
plt.suptitle('Vegetable Intake')
plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From\n Colors do not mean anything')
plt.ylabel('Intake amount in gms')
"""
# might change color palette
# https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html"\nplt.rcParams['image.cmap'] = 'jet'\n#plt.rcParams['figure.figsize'] = 14, 14\ndf[' Actual Vegetable Intake'].plot.bar();\n#plt.setxlabels(df['age_to'])\nplt.suptitle('Vegetable Intake')\nplt.xticks(range(len(df['age_to'])), df['age_from']);\nplt.xlabel('Age group: From\n Colors do not mean anything')\nplt.ylabel('Intake amount in gms')\n"
<Figure size 1152x360 with 0 Axes>df = pd.read_csv(data_folder + 'older_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
# plt.figure(figsize=(16, 5));
plt.rcParams['figure.figsize'] = 14, 5
df1 = df.iloc[:,3:6]
# df1['ESRD patients: Total (or %) deaths for target year'] = df[' ESRD patients: Total (or %) deaths for target year']
# df1['ESRD patients: Avg. Annual Mortality rates'] = df[' ESRD patients: Avg. Annual Mortality rates']
df2 = df.iloc[:,6:9]
df3 = df.iloc[:,9:12]
df4 = df.iloc[:,12:15]
df5 = df.iloc[:,15:18]
df6 = df.iloc[:,18:21]
df7 = df.iloc[:,21:24]
df8 = df.iloc[:,30:32]
df9 = df.iloc[:,32:34]
df1.plot.bar();
df2.plot.bar();
df3.plot.bar();
df4.plot.bar();
df5.plot.bar();
df6.plot.bar();
df7.plot.bar();
df8.plot.bar();
df9.plot.bar();
# plt.xlabel('Age Groups (from)')
# plt.ylabel('Amount in Gms')
plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From\n Colors do not mean anything, just the different column')
plt.ylabel('Intake amount in gms')Text(0,0.5,'Intake amount in gms')
df1.plot.line();
df2.plot.line();
df3.plot.line();
df4.plot.line();
df5.plot.line();
df6.plot.line();
df7.plot.line();
df8.plot.line();
df9.plot.line();
# plt.xlabel('Age Groups (from)')
# plt.ylabel('Amount in Gms')
plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From')
plt.ylabel('Intake amount in gms')Text(0,0.5,'Intake amount in gms')
df1.plot.area();
df2.plot.area();
df3.plot.area();
df4.plot.area();
df5.plot.area();
df6.plot.area();
df7.plot.area();
df8.plot.area();
df9.plot.area();
plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Age group: From')
plt.ylabel('Intake amount in gms')Text(0,0.5,'Intake amount in gms')
df1.plot.hist();
df2.plot.hist();
df3.plot.hist();
df4.plot.hist();
df5.plot.hist();
df6.plot.hist();
df7.plot.hist();
df8.plot.hist();
df9.plot.hist();
#plt.xlabel('Age Groups (from)')
#plt.ylabel('Amount in Gms')
#plt.xticks(range(len(df['age_to'])), df['age_from']);
plt.xlabel('Intake/Recommended data distribution')Text(0.5,0,'Intake/Recommended data distribution')
df1.plot.box();
df2.plot.box();
df3.plot.box();
df4.plot.box();
df5.plot.box();
df6.plot.box();
plt.xticks(range(1, len(df6.columns)+1),df6.columns,rotation=90)
df77 = df.iloc[:,21:22]
df88 = df.iloc[:,30:34]
df99 = df.iloc[:,34:36]
df77.plot.box();
df88.plot.box();
#df99.plot.box();
plt.xticks(range(1, len(df88.columns)+1),df88.columns,rotation=90)
#plt.xlabel('Age Groups (from)')
plt.ylabel('Amount in Gms')
#plt.xticks(range(len(df['age_to'])), df['age_from']);Text(0,0.5,'Amount in Gms')
df.corr()
import numpy as np
import pandas as pd
from subprocess import check_output
from IPython.display import display, HTML
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
%matplotlib inlinedf = pd.read_csv('./data-for-code/older_mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df = df.drop( ['Recommended Sugars sweets and beverages amount', ' % Population got CKD', ' People (or %) progressed to Stage 3 CKD', ' People (or %) progressed to Stage 4 CKD', ' People (or %) progressed to ESRD', ' Received dietitian care (Optional)', ' Did not Receive dietitian care (Optional)', ' Patients went for dialysis', ' Patients went for Kidney Transplantation' ], axis=1)
df.head()
df.corr()
plt.figure(figsize=(12, 8))
corr = df.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation All Variables');
Will Check on Actual Taken Amount Only
age_from and age_to can be removed
df = pd.read_csv('./data-for-code/older_no-empty-data-only-actual-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
# df = df.drop([ 'age_from', 'age_to' ], axis=1)
df.head()
df.describe()
#find correlation with price after standardization
df.corr()
age_from age_to Actual Vegetable Intake Actual Protein Intake Actual Grain Intake Actual Dairy Intake Actual Fruit intakes Actual Taken Sugars sweets and beverages amount Avg Fats oils and salad dressings taken ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates 0.907545 0.907790 0.606775 0.040950 -0.452625 -0.408543 -0.372665 0.178760 -0.081677 1.000000 0.828148 0.999315 0.864052
ESRD patients: Avg. Annual Mortality rates 0.854841 0.854759 0.444925 -0.151035 -0.572187 -0.238108 -0.369639 0.022898 -0.202818 0.828148 1.000000 0.847294 0.995676 Dialysis patients: Total (or %) deaths for target year 0.914524 0.914723 0.603179 0.031982 -0.462540 -0.404490 -0.375241 0.172767 -0.087055 0.999315 0.847294 1.000000 0.880927 Dialysis patients: Avg. Annual Mortality rates 0.878659 0.879439 0.480531 -0.134234 -0.580723 -0.264096 -0.370520 0.050212 -0.208805 0.864052 0.995676 0.880927 1.000000
plt.figure(figsize=(12, 8))
corr = df.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Amounts Only');
# Without Age Groupsdf_without_ages = df.drop(['age_from', 'age_to', ' Gender'], axis=1)
df_without_ages.head()
df_without_ages.corr()
plt.figure(figsize=(12, 8))
corr = df_without_ages.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Amounts Only');
plt.savefig('./images/univariate_food_groups_heatmaps_actual_intake_amount' + '.png')
Will use Normalization all data
df = pd.read_csv('./data-for-code/older_no-empty-data-only-actual-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
df = df.drop([' Gender'], axis=1)
df.head()
# normalize data
df_normalized = (df - df.mean())/ (df.max() - df.min())
df_normalized.head()
# df.head()
df_normalized.corr()
age_from age_to Actual Vegetable Intake Actual Protein Intake Actual Grain Intake Actual Dairy Intake Actual Fruit intakes Actual Taken Sugars sweets and beverages amount Avg Fats oils and salad dressings taken ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates ESRD patients: Total (or %) deaths for target year 0.907545 0.907790 0.606775 0.040950 -0.452625 -0.408543 -0.372665 0.178760 -0.081677 1.000000 0.828148 0.999315 0.864052 ESRD patients: Avg. Annual Mortality rates 0.854841 0.854759 0.444925 -0.151035 -0.572187 -0.238108 -0.369639 0.022898 -0.202818 0.828148 1.000000 0.847294 0.995676 Dialysis patients: Total (or %) deaths for target year 0.914524 0.914723 0.603179 0.031982 -0.462540 -0.404490 -0.375241 0.172767 -0.087055 0.999315 0.847294 1.000000 0.880927 Dialysis patients: Avg. Annual Mortality rates 0.878659 0.879439 0.480531 -0.134234 -0.580723 -0.264096 -0.370520 0.050212 -0.208805 0.864052 0.995676 0.880927 1.000000
plt.figure(figsize=(12, 8))
corr = df_normalized.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Amounts Only');
plt.savefig('./images/univariate_food_groups_heatmaps_normalized_actual_intake_amount' + '.png')
Will check only on the difference from average recommended amount
df = pd.read_csv('./data-for-code/copy-only-diff-no-empty-data-diff-recomm-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing .csv')
df.head()
df.describe()
# find correlation with price after standardization
df.corr()
age_from age_to Actual Vegetable Intake Actual Protein Intake Actual Grain Intake Actual Dairy Intake Actual Fruit intakes Actual Taken Sugars sweets and beverages amount Avg Fats oils and salad dressings taken Diff Vegetable diff protein diff grain diff dairy diff fruit ESRD patients: Total (or %) deaths for target year ESRD patients: Avg. Annual Mortality rates Dialysis patients: Total (or %) deaths for target year Dialysis patients: Avg. Annual Mortality rates ESRD patients: Total (or %) deaths for target year
0.907545 0.907790 0.606775 0.040950 -0.452625 -0.408543 -0.372665 0.178760 -0.081677 0.426499 -0.273596 -0.268453 -0.364665 -0.488155 1.000000 0.828148 0.999315 0.864052
ESRD patients: Avg. Annual Mortality rates 0.854841 0.854759 0.444925 -0.151035 -0.572187 -0.238108 -0.369639 0.022898 -0.202818 0.359590 -0.266776 -0.263227 -0.213864 -0.392090 0.828148 1.000000 0.847294 0.995676
Dialysis patients: Total (or %) deaths for target year 0.914524 0.914723 0.603179 0.031982 -0.462540 -0.404490 -0.375241 0.172767 -0.087055 0.424340 -0.278817 -0.272938 -0.362140 -0.489873 0.999315 0.847294 1.000000 0.880927
Dialysis patients: Avg. Annual Mortality rates 0.878659 0.879439 0.480531 -0.134234 -0.580723 -0.264096 -0.370520 0.050212 -0.208805 0.401294 -0.244088 -0.246893 -0.229239 -0.387759 0.864052 0.995676 0.880927 1.000000
plt.figure(figsize=(12, 8))
corr = df.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Variables, and Difference from Recommended Intake Variables');
df.head()
df = df.drop([' Gender'], axis=1)
df.head()
# normalize data
df_normalized = (df - df.mean())/ (df.max() - df.min())
df_normalized.head()
#df.head()
df_normalized.corr()
plt.figure(figsize=(12, 8))
corr = df_normalized.corr()
sns.heatmap(corr,
xticklabels = corr.columns.values,
yticklabels = corr.columns.values,
annot = True);
plt.suptitle('Heatmap, Correlation Actual Intake Variables, and Difference from Recommended Intake Variables');
plt.savefig('./images/univariate_food_groups_heatmaps_diff_intake_amount_normalized' + '.png')
Bivariate
The most important for Bivariate: bivariate_diff_norm.png and bivariate_diff_norm_rate_only.png. Difference in intake amounts from recommended, also normalized.
Bivariate plots on actual amount intake and target variables. will be saved in bivariate.png. The correlation pattern can be checked in the image saved
Correlation within the intake amounts do not say much as we are using actual intake amounts not differences with recomended amount. Also because, higher aged people will take higher amount. Also, normalized data will give true relation.
The correlation was shown above. Now linearity and non-linearity can be seen using the Bivariate plots.
The plots show similarity with correlation
import numpy as np
import pandas as pd
from IPython.display import display, HTML
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
%matplotlib inline
# ref in ref section
On Actual intake Amounts
df = pd.read_csv('./data-for-code/older_no-empty-data-only-actual-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing.csv')
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_actual = df.drop([' Gender', 'age_from', 'age_to'], axis=1)
df_actual.head()
# on actual amounts
#plt.figure(figsize=(16, 300))
sns.pairplot(df_actual, vars=df_actual.columns, size=5, kind='reg'); # diag_kind='kde',
plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable');
plt.savefig('./images/bivariate_food_group' + '.png')
plt.show()
On difference from recommended amount (data not normalized)
df = pd.read_csv('./data-for-code/copy-only-diff-no-actual-no-empty-cell-diff-recomm-mortality_recom_added_group_data_june_9th_gender_based_data_after_processing .csv')
df.head()
# On difference from recommended amount (data not normalized)
#df_actual_diff = df.drop(['age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
#df_actual_diff.head()
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
#df_actual = df.drop([' Gender', 'age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_actual_diff = df.drop(['age_from', 'age_to'], axis=1)
df_actual_diff.head()
#df.head()
# On difference from recommended amount (data not normalized)
sns.pairplot(df_actual_diff, vars=df_actual_diff.columns, size=5, kind='reg');
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable');
#plt.savefig('./saved_images_from_visualizations/' + 'line_' +indicator.replace(' ', '_')[0:10] + '_' + str(np.random.randint(0, 99999)) + '.png')
plt.savefig('./images/bivariate_diff' + '.png')
plt.show()
On difference from recommended amount (data/diff normalized)
df_normalized_diff = (df_actual_diff - df_actual_diff.mean())/ (df_actual_diff.max() - df_actual_diff.min())
df_normalized_diff.head()
sns.pairplot(df_normalized_diff, vars=df_normalized_diff.columns, size=5, kind='reg', aspect=1); # diag_kind='kde',
#plt.title('Bivariate Plot, All Actual Taken Variables, Total ESRD target variable');
plt.savefig('./images/bivariate_diff_norm' + '.png')
plt.show()
df_normalized_diff = df_normalized_diff.drop([' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates'], axis=1)
df_normalized_diff.head()
sns.pairplot(df_normalized_diff, vars=df_normalized_diff.columns, size=5, kind='reg', aspect=1); # diag_kind='kde',
plt.suptitle('Bivariate : Diff : Food Group: Normalized\n')
plt.ylabel('Difference in Intake amount from Recommended : Normalized')
plt.xlabel('Intakes')
plt.savefig('./images/bivariate_diff_norm_rate_only' + '.png')
plt.show()
# can be ignored
"""
df_esrdonly = df.drop(['age_from', 'age_to',' ESRD patients: Avg. Annual Mortality rates', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_esrdonly.head()
plt.figure(figsize=(14, 14))
sns.pairplot(df_esrdonly, diag_kind='kde');
plt.xlabel('Bivariate Plot, Difference from Recommended Variables, Total ESRD target variable');
#plt.figure(figsize=(16, 16))
plt.rcParams['figure.figsize'] = 16, 16
pd.plotting.scatter_matrix(df_esrdonly, diagonal="kde")
plt.suptitle('Scatter Matrix');
plt.show();
"""'\ndf_esrdonly = df.drop([\'age_from\', \'age_to\',\' ESRD patients: Avg. Annual Mortality rates\', \' Dialysis patients: Total (or %) deaths for target year\', \' Dialysis patients: Avg. Annual Mortality rates\' ], axis=1)\ndf_esrdonly.head()\n\nplt.figure(figsize=(14, 14))\nsns.pairplot(df_esrdonly, diag_kind=\'kde\');\nplt.xlabel(\'Bivariate Plot, Difference from Recommended Variables, Total ESRD target variable\');\n\n#plt.figure(figsize=(16, 16))\nplt.rcParams[\'figure.figsize\'] = 16, 16\npd.plotting.scatter_matrix(df_esrdonly, diagonal="kde")\nplt.suptitle(\'Scatter Matrix\');\nplt.show();\n'
PCA
Applying PCA on Difference data
df_esrdonly = df.drop(['age_from', 'age_to',' ESRD patients: Total (or %) deaths for target year', ' Dialysis patients: Total (or %) deaths for target year', ' Dialysis patients: Avg. Annual Mortality rates' ], axis=1)
df_esrdonly.head()
#df_esrdonly.head()
# steps followed from ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
import sklearn
from sklearn import preprocessing
standardisedX = sklearn.preprocessing.scale(df_esrdonly)
standardisedX = pd.DataFrame(standardisedX, index=df_esrdonly.index, columns=df_esrdonly.columns)
standardisedX.apply(np.mean)Diff Vegetable -4.332954e-16
diff protein -8.018277e-16
diff grain 5.366078e-16
diff dairy -3.885781e-16
diff fruit -9.251859e-17
ESRD patients: Avg. Annual Mortality rates 1.480297e-16
dtype: float64standardisedX.apply(np.std)Diff Vegetable 1.0
diff protein 1.0
diff grain 1.0
diff dairy 1.0
diff fruit 1.0
ESRD patients: Avg. Annual Mortality rates 1.0
dtype: float64from sklearn import decomposition
pca = decomposition.PCA(n_components=2).fit(standardisedX)
pca = decomposition.PCA().fit(standardisedX)
pcaPCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)#ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
def pca_summary(pca, standardised_data, out=True):
names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
a = list(np.std(pca.transform(standardised_data), axis=0))
b = list(pca.explained_variance_ratio_)
c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
summary = pd.DataFrame( list(zip(a, b, c)), index=names, columns=columns)
if out:
print("Importance of components:")
display(summary)
return summarysummary = pca_summary(pca, standardisedX)Importance of components:
# First two component can define over 95%np.sum(summary.sdev**2)Standard deviation 6.0
dtype: float64# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
plt.rcParams['figure.figsize'] = 8, 8
def screeplot(pca, standardised_values):
y = np.std(pca.transform(standardised_values), axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.xticks(x, ["Comp."+str(i) for i in x], rotation=60)
plt.ylabel("Variance")
plt.title('PCA variance in Components')
plt.savefig('./images/pca_components_variance' + '.png')
plt.show()
screeplot(pca, standardisedX)
# comp 2 to comp 3 is the most change - slope
# first two or at best first 3 can be retainedsummary.sdev**2
Can be retained
PC1 4.263124 PC2 1.449435
pca.components_[0]array([ 0.33558246, 0.47787327, 0.4773956 , 0.45412539, 0.45708252,
-0.12633561])np.sum(pca.components_[0]**2)0.999999999999999# ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code, I am using this as (similar to) a library function
def calcpc(variables, loadings):
# find the number of samples in the data set and the number of variables
numsamples, numvariables = variables.shape
# make a vector to store the component
pc = np.zeros(numsamples)
# calculate the value of the component for each sample
for i in range(numsamples):
valuei = 0
for j in range(numvariables):
valueij = variables.iloc[i, j]
loadingj = loadings[j]
valuei = valuei + (valueij * loadingj)
pc[i] = valuei
return pccalcpc(standardisedX, pca.components_[0])array([ 7.33661891, 3.2759593 , 0.87786527, -0.76096561, -1.27436192,
-1.11656647, -1.05210251, -0.55640492, -0.52219388, -0.70806478,
-1.0803665 , -0.08831969, -0.22064849, -0.32119398, -0.82738023,
-0.54845766, -1.0429077 , -1.37050914])pca.transform(standardisedX)[:, 0]array([ 7.33661891, 3.2759593 , 0.87786527, -0.76096561, -1.27436192,
-1.11656647, -1.05210251, -0.55640492, -0.52219388, -0.70806478,
-1.0803665 , -0.08831969, -0.22064849, -0.32119398, -0.82738023,
-0.54845766, -1.0429077 , -1.37050914])pca.components_[1]array([ 0.56910521, 0.00870248, 0.00721787, -0.03193287, -0.18125629,
0.80132258])np.sum(pca.components_[1]**2)0.9999999999999991
highest loadings for
# for following code : Classes will/might be defined such as like High, low, neutral mortality in final workdf_esrdonly[' ESRD patients: Avg. Annual Mortality rates']
sorted(standardisedX[' ESRD patients: Avg. Annual Mortality rates'])[-0.8508567828409127,
-0.8447169016030269,
-0.8078776141757118,
-0.7541536533442106,
-0.6983830654334141,
-0.6114014145633645,
-0.5351645558596153,
-0.478882311178995,
-0.45329947268780396,
-0.44050805344220845,
-0.29059261988382884,
-0.13095570769879675,
0.08752173301597484,
0.32902372837281824,
0.681043586011607,
1.1635359199554702,
1.6915657064136536,
2.9441014789423674]# Define high and low mortalityy = df_esrdonly[' ESRD patients: Avg. Annual Mortality rates'] #< 0.5
# from normalized data. > 0.5 = high mortality
print('as the older age show higher mortality, does it mean th')
y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
y = standardisedX[' ESRD patients: Avg. Annual Mortality rates'] > 0.5
yas the older age show higher mortality, does it mean th
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
15 True
16 True
17 True
Name: ESRD patients: Avg. Annual Mortality rates, dtype: bool# # ref: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
# not my code from the URL above, using this as a library function
def pca_scatter(pca, standardised_values, classifs):
foo = pca.transform(standardised_values)
bar = pd.DataFrame(list(zip(foo[:, 0], foo[:, 1], classifs)), columns=["PC1", "PC2", "Class"])
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')
sns.lmplot("PC1", "PC2", bar, hue="Class", fit_reg=False)
#y = df_esrdonly[' ESRD patients: Total (or %) deaths for target year']
#y = np.std(pca.transform(standardisedX), axis=0)**2
pca_scatter(pca, standardisedX, y)
# y can be used as classes like High, low, neutral mortality
# plt.suptitle('Mortality class and Principle components, y can be used as classes like High, low, neutral mortality');
plt.title('Only two components can separate the mortality data. True = High ')
plt.savefig('./images/pca_components_separating_high_low_mortality' + '.png')<Figure size 576x576 with 0 Axes>
# reference: https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
pca_components_cont = pca.components_[0:2]
plt.matshow(pca_components_cont, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'],fontsize=10)
plt.colorbar()
#plt.xticks(range(1, len(df_esrdonly.columns)),df_esrdonly.columns[1:len(df_esrdonly.columns)],rotation=65)
plt.xticks(range(len(df_esrdonly.columns)),df_esrdonly.columns,rotation=65,ha='left')
#plt.tight_layout()
plt.savefig('./images/pca_food_groups_what_contributes_to_PCA_components' + '.png')
plt.show()#
# from the abobe plot, Vegetable, Grain, Protein contribute the most to the 1st component
How the intake differences contributed for the affecting PCA components
diffs = list(df_esrdonly.columns[:-1])
import seaborn as sns
s = sns.heatmap(df_esrdonly[diffs].corr(),cmap='coolwarm')
s.set_yticklabels(s.get_yticklabels(),rotation=30,fontsize=7)
s.set_xticklabels(s.get_xticklabels(),rotation=30,fontsize=7)
plt.savefig('./images/pca_food_groups_how_in_together_influencing_PCA_components' + '.png')
plt.show()
Vegetable by itself is important — contributing factors
protein + grain + then fruit
grain with fruit — slighly with dairy
dairy with protein and grain*
fruit with protein and grain then dairy
References
https://www.kaggle.com/etakla/exploring-the-dataset-univariate-analysis https://www.kaggle.com/etakla/exploring-the-dataset-bivariate-analysis https://towardsdatascience.com/survival-analysis-part-a-70213df21c2e https://lifelines.readthedocs.io/en/latest/ https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html https://www.statsdirect.com/help/survival_analysis/cox_regression.htm https://courses.lumenlearning.com/suny-natural-resources-biometrics/chapter/chapter-7-correlation-and-simple-linear-regression/ https://www.kaggle.com/residentmario/univariate-plotting-with-pandas
https://erc.barnard.edu/spss/pearsons_r
https://www.statpac.com/statistics-calculator/correlation-regression.htm
#ref https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html #https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
Note to help with re-generating the data
To include oil and sugar recommendation amount, the sql as below was used.
also for other food groups
#/****** Script for SelectTopNRows command from SSMS ******/
Stored Procedure: [find_recommended_amounts_for_food_groups]
SELECT [age] ,[food_group] ,[average_taken] ,[recommended_low] ,[recommended_high] FROM [mrp_experiment_phase_data_process].[dbo].[age_food_group_recommended_amount] where age in ( select ‘1’ union select age_groups.[from] from age_groups) and food_group in (‘Fats, oils, and salad dressings’, ‘Sugars, sweets, and beverages’) order by food_group
The python file: for_each_age_recommended_amount_take_from_related_group.ipynb helped to generate the recommended amounts based on health.gov data Where for_each_age_recommended_amount.ipynb was the earlier versions : there are some changes in the approch
The study as below also the recommendations from health.gov was used for the recommended amount calculations https://health.gov/dietaryguidelines/2015/guidelines/chapter-2/a-closer-look-at-current-intakes-and-recommended-shifts/#figure-2-7-desc-toggle
For sugar recommended amounts: multiple sources were used. The article above and health.gov recommendations were used. health.gov recommends 10% of calory from sugar using the recommended calory intake per day Other articles were also. check the ipynb files and notes there
Folders such as: input_csvdietfiles, output_agescsvdietfiles are related to recommendation amount calculations from health.gov or shift_recommendation article data