Hyderabad AQI Prediction Project From Scratch To Deployment
Air Quality Index prediction
Table of contents:
(i) Introduction
(ii) The Motivation for the project
(iii) Data Collection
(iv) Data Pre-processing
(v) Feature Importance
(vi) Model Building
(vii) Deployment with streamlit
(viii) Conclusion
(ix) References
(i) Introduction
Air is a mixture of many gases and dust particles. It is the clear gas in which living things live and breathe. Air is a mixture of about 78% of nitrogen, 21% of oxygen, 0.9% of argon, 0.04% of carbon dioxide, and very small amounts of other gases.
In this project, air quality is measured by the PM 2.5 concentration, which serves as the Air Quality Index (AQI) value.
PM 2.5 is fine particulate matter, an air pollutant that becomes a concern for people's health when its levels in the air are high.
(ii) The Motivation For The Project:
Because of COVID, transportation came to a near standstill around the world, and the air became noticeably fresher. I currently live in Hyderabad, one of the more polluted cities in India, but in recent days the air here has also been fresh. That made me wonder what the air quality was like last year, and this project was born from that thought.
(iii) Data Collection:
I collected the data for 2013 to 2018 from https://en.tutiempo.net/climate/ws-431280.html using the requests module.
I created a function that downloads each monthly page from the website in HTML format.
import os
import time
import sys
import requests

def data_collection():
    for year in range(2013, 2019):  # years 2013 to 2018
        for month in range(1, 13):  # months 1 to 12
            if month < 10:
                # Months below 10 need a leading zero in the URL
                url = 'https://en.tutiempo.net/climate/0{}-{}/ws-431280.html'.format(month, year)
            else:
                # Months 10 to 12
                url = 'https://en.tutiempo.net/climate/{}-{}/ws-431280.html'.format(month, year)
            collected_texts = requests.get(url)  # download the HTML page
            collected_text_utf = collected_texts.text.encode('utf-8')  # encode the page text as UTF-8 bytes
            # Store the page under Data_collection/Html_data/<year>/
            if not os.path.exists("Data_collection/Html_data/{}".format(year)):
                os.makedirs("Data_collection/Html_data/{}".format(year))
            # Write the page to <year>/<month>.html
            with open("Data_collection/Html_data/{}/{}.html".format(year, month), 'wb') as Result:
                Result.write(collected_text_utf)
        sys.stdout.flush()

if __name__ == "__main__":
    start_time = time.time()
    data_collection()  # function call
    stop_time = time.time()
    print('Time Taken {}'.format(stop_time - start_time))  # time taken to store the data
Executing the above function creates the Html_data directory, which holds the HTML data for each year from 2013 to 2018 in a separate folder.
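As a side note, the two URL branches for months below and above 10 can be folded into one format string. The sketch below is only an illustration: the helper name `month_url` and its `station` parameter are hypothetical and are not part of the project code.

```python
def month_url(month, year, station="ws-431280"):
    """Build the monthly climate page URL; {:02d} zero-pads the month."""
    return "https://en.tutiempo.net/climate/{:02d}-{}/{}.html".format(month, year, station)

print(month_url(3, 2013))
print(month_url(11, 2018))
```

The `{:02d}` format pads single-digit months with a leading zero, so no separate condition is needed.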
The HTML data contains: T = Average temperature (°C), TM = Maximum temperature (°C), Tm = Minimum temperature (°C), SLP = Atmospheric pressure at sea level (hPa), H = Average relative humidity (%), VV = Average visibility (km), V = Average wind speed (km/h), VM = Maximum sustained wind speed (km/h).
We also need the PM 2.5 values, which were collected from a paid API for 2013 to 2018.
If you want the PM 2.5 data, visit the GitHub link given in the conclusion.
PM 2.5 (AQI) is the dependent feature.
T, TM, Tm, SLP, H, VV, V, and VM are the independent features.
The dependent feature is a real-valued quantity, so this is a regression problem.
The PM 2.5 values are recorded hourly, so we first need to convert them from hourly values to daily values.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Folder that holds the hourly files aqi2013.csv to aqi2018.csv
AQI_DIR = r'C:\Users\Unify\Desktop\janibasha\Complete Data Science life cycle\Data_collection\AQI'
# Markers used in the files for missing or invalid readings
INVALID_VALUES = ('NoData', 'PwrFail', '---', 'InVld')

def day_values(year):
    # The per-year functions differ only in the file they read,
    # so they all share this helper. Each chunk of 24 rows is one day.
    average = []
    for rows in pd.read_csv(os.path.join(AQI_DIR, 'aqi{}.csv'.format(year)), chunksize=24):
        add_var = 0
        for i in rows['PM2.5']:
            if isinstance(i, (int, float, np.integer)):
                add_var = add_var + i
            elif isinstance(i, str) and i not in INVALID_VALUES:
                add_var = add_var + float(i)
        # The sum is divided by 24 even when some readings are missing
        average.append(add_var / 24)
    return average

def Day_values_2013():
    return day_values(2013)

def Day_values_2014():
    return day_values(2014)

def Day_values_2015():
    return day_values(2015)

def Day_values_2016():
    return day_values(2016)

def Day_values_2017():
    return day_values(2017)

def Day_values_2018():
    return day_values(2018)

if __name__ == "__main__":
    lst2013 = Day_values_2013()
    lst2014 = Day_values_2014()
    lst2015 = Day_values_2015()
    lst2016 = Day_values_2016()
    lst2017 = Day_values_2017()
    lst2018 = Day_values_2018()
    plt.plot(range(0, 365), lst2013, label="2013 data")
    plt.plot(range(0, 364), lst2014, label="2014 data")
    plt.plot(range(0, 365), lst2015, label="2015 data")
    plt.plot(range(0, 365), lst2016, label="2016 data")
    plt.xlabel('Day')
    plt.ylabel('PM 2.5')
    plt.legend(loc='upper right')
    plt.show()
Executing the above functions gives the daily PM 2.5 values for each year.
This gives us the dependent feature.
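The same hourly-to-daily conversion can also be sketched with pandas grouping. This is an alternative under stated assumptions, not the code used in the project: the helper name `daily_means` and the synthetic readings are made up for illustration, and unlike the project code it averages over the valid readings only instead of always dividing by 24.

```python
import pandas as pd

def daily_means(hourly, hours_per_day=24):
    """Average hourly readings per day, ignoring non-numeric markers."""
    # Markers such as 'NoData' become NaN
    values = pd.to_numeric(pd.Series(hourly), errors="coerce")
    # 0 for the first 24 rows, 1 for the next 24, and so on
    day = values.index // hours_per_day
    # mean() skips NaN by default
    return values.groupby(day).mean()

# Two synthetic days: day 0 is all 10.0, day 1 has one missing reading
hourly = ["10.0"] * 24 + ["NoData"] + ["20.0"] * 23
print(daily_means(hourly).tolist())
```

Whether missing hours should be skipped or counted as zero is a design choice; skipping them avoids pulling the daily average down on days with sensor failures.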
Next we need to extract the independent features from the HTML files. For that I use BeautifulSoup to parse the HTML data and write it to a CSV file.
After collecting the independent features into a CSV file, we add the dependent feature, PM 2.5, to it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
from Hours_to_Day import Day_values_2013, Day_values_2014, Day_values_2015, Day_values_2016, Day_values_2017, Day_values_2018
import sys
from bs4 import BeautifulSoup
import os
import csv

def html_scraping(month, year):
    # Read the saved HTML page for the given month and year
    with open('Data_collection/Html_data/{}/{}.html'.format(year, month), 'rb') as file_path:
        Scaraped_data = file_path.read()
    # Two empty lists: the raw cell texts and the final rows
    Sample_data = []
    Finale_data = []
    # Initialise BeautifulSoup(markup, parser)
    soup = BeautifulSoup(Scaraped_data, "lxml")
    # We need the table data, so we loop through the table tag with this class
    for table in soup.findAll('table', {'class': 'medias mensuales numspan'}):
        # Loop through the table body
        for tbody in table:
            # Loop through the rows and collect the text of each entry
            for tr in tbody:
                Extract_data = tr.get_text()
                Sample_data.append(Extract_data)
    # The HTML table has 15 features, so the number of rows is the entry count divided by 15
    No_rows = len(Sample_data) / 15
    # Rebuild the rows, 15 values at a time
    for iterate in range(round(No_rows)):
        lst = []
        for i in range(15):
            lst.append(Sample_data[0])
            Sample_data.pop(0)
        Finale_data.append(lst)
    # Remove the last row and the first row of the table
    Length_Finale_data = len(Finale_data)
    Finale_data.pop(Length_Finale_data - 1)
    Finale_data.pop(0)
    # Remove the empty features from the table because they do not contain any values
    for feature in range(len(Finale_data)):
        Finale_data[feature].pop(6)
        Finale_data[feature].pop(13)
        Finale_data[feature].pop(12)
        Finale_data[feature].pop(11)
        Finale_data[feature].pop(10)
        Finale_data[feature].pop(9)
        Finale_data[feature].pop(0)
    return Finale_data

# The HTML table features are the independent variables
# The PM2.5 feature from Hours_to_Day is the dependent feature
# We need to combine both in a CSV file, so we write a helper function
def combine_dependent_independent(year, cs):
    for i in pd.read_csv('Data_collection/Html_scraping_data/real_' + str(year) + '.csv', chunksize=cs):
        df = pd.DataFrame(data=i)
        mylist = df.values.tolist()
    return mylist

if __name__ == "__main__":
    # Create a directory to store the CSV files
    if not os.path.exists("Data_collection/Html_scraping_data"):
        os.makedirs("Data_collection/Html_scraping_data")
    # Write one CSV file per year, from 2013 to 2018
    for year in range(2013, 2019):
        final_data = []
        with open("Data_collection/Html_scraping_data/real_" + str(year) + ".csv", 'w') as csvfile:
            writting_csv = csv.writer(csvfile, dialect='excel')
            writting_csv.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM2.5'])
        # Collect the scraped rows for every month of the year
        for month in range(1, 13):
            temp = html_scraping(month, year)
            final_data = final_data + temp
        # Call the matching Day_values_<year> function dynamically with getattr
        dependent = getattr(sys.modules[__name__], 'Day_values_{}'.format(year))()
        # Add the dependent feature PM2.5 to the independent features
        for i in range(len(final_data) - 1):
            final_data[i].insert(8, dependent[i])
        with open('Data_collection/Html_scraping_data/real_' + str(year) + '.csv', 'a') as csvfile:
            wr = csv.writer(csvfile, dialect='excel')
            for row in final_data:
                # Skip rows that contain an empty or missing value
                flag = 0
                for elem in row:
                    if elem == "" or elem == "-":
                        flag = 1
                if flag != 1:
                    wr.writerow(row)
    # Read each yearly file back
    data_2013 = combine_dependent_independent(2013, 600)
    data_2014 = combine_dependent_independent(2014, 600)
    data_2015 = combine_dependent_independent(2015, 600)
    data_2016 = combine_dependent_independent(2016, 600)
    data_2017 = combine_dependent_independent(2017, 600)
    data_2018 = combine_dependent_independent(2018, 600)
    # Combine the data for all years into a single CSV file
    total = data_2013 + data_2014 + data_2015 + data_2016 + data_2017 + data_2018
    with open('Data_collection/Html_scraping_data/Real_Combine.csv', 'w') as csvfile:
        wr = csv.writer(csvfile, dialect='excel')
        wr.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'])
        wr.writerows(total)
    # Read the combined file from the location it was written to
    df = pd.read_csv('Data_collection/Html_scraping_data/Real_Combine.csv')
After scraping and appending the PM 2.5 values, we get a separate CSV file for each year.
We then combine the data for all years into a single CSV file.
This finally gives us the Real_Combine.csv file.
We use this data to build an ML application that predicts future AQI.
(iv) Data Pre-Processing :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the csv file
combine_data = pd.read_csv(r'C:\Users\Desktop\janibasha\Complete Data Science life cycle\Data_collection\Html_scraping_data\Real_combine.csv')
# Checking the number of numerical features
combine_data.info()
# To get statistical information
combine_data.describe()
# Now we need to check for null values
combine_data.isnull()
combine_data.isnull().sum()
# We also visualize nulls with seaborn
sns.heatmap(combine_data.isnull(), yticklabels=False)
From the visualization, we can see that there are no null values in the data frame.
All features are numerical and none are categorical, so no encoding is needed.
# Checking outliers
combine_data.boxplot(column='Tm')
plt.show()
Similarly, we can check for outliers in every feature.
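A box plot with default settings marks points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The sketch below applies that same rule to every numeric column at once. The helper name `iqr_outlier_counts` and the small data frame are assumptions made for illustration; in the project the function would be applied to combine_data.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the data frame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Synthetic stand-in for combine_data; 60.0 is an obvious outlier in 'Tm'
demo = pd.DataFrame({"Tm": [18.0, 19.0, 20.0, 21.0, 22.0, 60.0],
                     "H": [50.0, 52.0, 54.0, 56.0, 58.0, 60.0]})
print(iqr_outlier_counts(demo))
```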
# Multivariate analysis
sns.pairplot(combine_data)
Looking at the pair plot, there is no clear linear relationship between the independent features and the dependent feature, so linear algorithms are unlikely to give good results.
# We also check the correlation between the dependent and independent features
combine_data.corr()
relation = combine_data.corr()
relation_index = relation.index
sns.heatmap(combine_data[relation_index].corr(), annot=True)
The heatmap above gives an idea of the relationship between the dependent feature (PM 2.5) and the independent features (T, TM, Tm, SLP, H, VV, V, VM).
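When only the relationship with the target matters, the relevant column of the correlation matrix can be read directly instead of from the full heatmap. A minimal sketch, assuming a hypothetical helper `correlation_with_target` and made-up numbers in place of the real data:

```python
import pandas as pd

def correlation_with_target(df, target):
    """Pearson correlation of every other column with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic data: 'PM 2.5' falls linearly as 'T' rises
demo = pd.DataFrame({
    "T": [1.0, 2.0, 3.0, 4.0, 5.0],
    "H": [2.0, 1.0, 4.0, 3.0, 5.0],
    "PM 2.5": [10.0, 8.0, 6.0, 4.0, 2.0],
})
print(correlation_with_target(demo, "PM 2.5"))
```

Sorting by absolute value keeps strong negative correlations at the top, which a plain sort would push to the bottom.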
(v) Feature Importance :
We have 8 independent features, but we do not know which of them matter most for predicting the PM 2.5 value.
To estimate feature importance we use ExtraTreesRegressor (model-based feature selection).
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:,:-1], combine_data.iloc[:,-1], test_size=0.3, random_state=0)

reg = ExtraTreesRegressor()
reg.fit(X_train, y_train)
reg.feature_importances_

feat_importances = pd.Series(reg.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.show()
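To see what the importances mean, it can help to run the same steps on data where the answer is known in advance. The sketch below uses synthetic data in which only one column drives the target. The column names and values are made up for illustration and are not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"])
# Only 'signal' drives y; the other two columns are pure noise
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=300)

reg = ExtraTreesRegressor(n_estimators=100, random_state=0)
reg.fit(X, y)

# Importances are non-negative and sum to 1 across features
importances = pd.Series(reg.feature_importances_, index=X.columns)
top_features = importances.nlargest(2).index.tolist()
print(importances.round(3))
print(top_features)
```

The names returned by `nlargest(...).index` can be used to select the columns for the deployed model instead of hard-coding their positions.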
(vi) Model Building :
We have only 8 features, so for model building I use all of them and compare the performance of different models.
At deployment time, we train the model on the top 5 features and deploy it on a cloud platform.
(1) KNN Regressor :
Beyond k = 17 the results are stable, so we take k = 17.
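One common way to choose k is to fit the model for a range of values and look at where the test error flattens out. The sketch below shows that loop on synthetic data. The data, the range of k, and the variable names are assumptions for illustration, not the search used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data standing in for the weather features
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit one model per k and record its test MSE
k_values = list(range(1, 31))
test_mse = []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train, y_train)
    test_mse.append(mean_squared_error(y_test, model.predict(X_test)))

best_k = k_values[int(np.argmin(test_mse))]
print(best_k, round(min(test_mse), 4))
```

Plotting `k_values` against `test_mse` shows the curve; the point after which it stops improving is a reasonable choice for k.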
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics
# weighted knn
weighted_tuned_reg = KNeighborsRegressor(n_neighbors=17, weights='distance')
weighted_tuned_reg.fit(X_train, y_train)
weighted_tuned_reg.score(X_train, y_train)
# Output: 1.0
# performance of the model on the test dataset
weighted_tuned_reg.score(X_test, y_test)
# Output: 0.5238963653617966
# cross validation
from sklearn.model_selection import cross_val_score
X = combine_data.iloc[:,:-1]
y = combine_data.iloc[:,-1]
score = cross_val_score(weighted_tuned_reg, X, y, cv=5)
# cross validation performance
score.mean()
# Output: 0.43669012578295
# Model evaluation
prediction = weighted_tuned_reg.predict(X_test)
# Comparing predicted PM 2.5 and labeled PM 2.5
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# Output: MSE: 3247.1281849543693
(2) Linear, Lasso, and Ridge Regressor :
Linear Regressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:,:-1], combine_data.iloc[:,-1], test_size=0.3, random_state=0)
# creating the linear regression model
# Note: older scikit-learn versions accepted LinearRegression(normalize=True);
# the argument was removed in version 1.2
reg_model = LinearRegression()
# fit the independent variables to the dependent variable
reg_model.fit(X_train, y_train)
reg_model.score(X_train, y_train)
# Output: 0.40996129270540077
reg_model.score(X_test, y_test)
# Output: 0.3894144479464322
# for slope
reg_model.coef_
# Output: array([-8.71256785, -0.78565823, -0.64082652, 2.64932719, -1.44568242, 0.27712587, -1.83610819, -1.01908448])
# for intercept
reg_model.intercept_
# Output: -2180.4321527938205
# cross validation
from sklearn.model_selection import cross_val_score
score = cross_val_score(reg_model, combine_data.iloc[:,:-1], combine_data.iloc[:,-1], cv=5)
score.mean()
# Output: 0.3229764710803792
prediction = reg_model.predict(X_test)
# checking predicted y and labeled y
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# Output: MSE: 4164.32350260401
Ridge Regressor
# Randomized search CV
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Initializing the model
reg_model_1 = Ridge()
# randint samples integers, so alpha is drawn from 1 to 99
parameters = {'alpha': randint(1, 100)}
ridge_reg_1 = RandomizedSearchCV(reg_model_1, parameters, scoring='neg_mean_squared_error', cv=5)
ridge_reg_1.fit(combine_data.iloc[:,:-1], combine_data.iloc[:,-1])
ridge_reg_1.best_score_
# Output: -4265.905962013921
ridge_reg_1.best_params_
# Output: {'alpha': 75}
prediction = ridge_reg_1.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# Output: MSE: 3912.4550098306554
Lasso Regressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
# Initializing the model
reg_model_2 = Lasso()
# hyperparameter range
hyperparameters_range = {'alpha': [1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]}
# Searching for the best hyperparameter
lasso_reg = GridSearchCV(reg_model_2, hyperparameters_range, scoring='neg_mean_squared_error', cv=5)
lasso_reg.fit(combine_data.iloc[:,:-1], combine_data.iloc[:,-1])
lasso_reg.best_params_
# Output: {'alpha': 5}
lasso_reg.best_score_
# Output: -4249.911163771522
prediction = lasso_reg.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# Output: MSE: 3922.586481270448
(3) Decision Tree Regressor :
from datetime import datetime
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
# creating the Decision tree regression model
reg_decision_model = DecisionTreeRegressor()
# Hyperparameter ranges for tuning
parameters = {"splitter": ["best", "random"],
              "max_depth": [1,3,5,7,9,11,12],
              "min_samples_leaf": [1,2,3,4,5,6,7,8,9,10],
              # valid values lie between 0 and 0.5
              "min_weight_fraction_leaf": [0.1,0.2,0.3,0.4,0.5],
              # "auto" is not accepted by recent scikit-learn versions; None uses all features
              "max_features": ["log2","sqrt",None],
              "max_leaf_nodes": [None,10,20,30,40,50,60,70,80,90]}
# Grid search over the ranges above
tuning_model = GridSearchCV(reg_decision_model, param_grid=parameters, scoring='neg_mean_squared_error', n_jobs=-1, cv=10, verbose=3)
# function for measuring how much time hyperparameter tuning takes
def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print(thour, ":", tmin, ':', round(tsec, 2))
X = combine_data.iloc[:,:-1]
y = combine_data.iloc[:,-1]
start_time = timer(None)
tuning_model.fit(X, y)
timer(start_time)
tuning_model.best_params_
# Output: {'max_depth': 9, 'max_features': None, 'max_leaf_nodes': 20, 'min_samples_leaf': 8, 'min_weight_fraction_leaf': 0.1, 'splitter': 'random'}
tuning_model.best_score_
# Output: -3621.0007087939457
# Model Evaluation
prediction = tuning_model.predict(X_test)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# Output: MSE: 4589.011220616968
(4) RandomForest Regressor
from sklearn.ensemble import RandomForestRegressor
# Hyperparameter tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Hyperparameter ranges
from scipy.stats import randint
parameters = {'n_estimators': randint(100,1200),
              # None uses all features, which is what 'auto' meant in older scikit-learn versions
              'max_features': [None, 'sqrt'],
              'max_depth': randint(5,40),
              'min_samples_split': randint(2,30),
              'min_samples_leaf': randint(1,10)}
# Model for tuning
base_learner = RandomForestRegressor()
# Tuning
tuned_model = RandomizedSearchCV(estimator=base_learner, param_distributions=parameters, scoring='neg_mean_squared_error', n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)
tuned_model.fit(X_train, y_train)
tuned_model.best_params_
# Output: {'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 16, 'n_estimators': 901}
tuned_model.best_score_
# Output: -3425.3665578465598
# Predicting X_test values using tuned_model
prediction = tuned_model.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# Output: MSE: 3308.584324808751
(5) Xgboost Regressor :
import xgboost as xgb
# Hyperparameter tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Hyperparameter ranges
from scipy.stats import randint
parameters = {'n_estimators': randint(100,1200),
              'learning_rate': [0.001,0.002,0.003,0.005,0.01,0.04,0.05,0.1,0.2,0.3,0.4,0.5,0.6],
              'max_depth': randint(5,40),
              'subsample': [0.5,0.6,0.7,0.8],
              'min_child_weight': randint(1,10)}
# Model for tuning
base_learner = xgb.XGBRegressor()
# Tuning
tuned_model = RandomizedSearchCV(estimator=base_learner, param_distributions=parameters, scoring='neg_mean_squared_error', n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)
tuned_model.fit(X_train, y_train)
tuned_model.best_params_
# Output: {'learning_rate': 0.005, 'max_depth': 5, 'min_child_weight': 8, 'n_estimators': 611, 'subsample': 0.6}
tuned_model.best_score_
# Output: -3656.933662545248
# Predicting X_test values using tuned_model
prediction = tuned_model.predict(X_test)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# Output: MSE: 3458.1210809592762
(vii) Deployment :
Comparing the MSE of all models:
KNN: 3247.1281849543693
Linear Regression: 4164.32350260401
Ridge Regression: 3912.4550098306554
Lasso Regression: 3922.586481270448
Decision Tree Regressor: 4589.011220616968
Random Forest Regressor: 3308.584324808751
Xgboost Regressor: 3458.1210809592762
KNN has the lowest test MSE, but it scores 1.0 on the training set and only about 0.52 on the test set. Random Forest is the more generalized model, so I deploy the Random Forest Regressor with the top 5 features using Streamlit.
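The comparison can also be ranked programmatically. The sketch below uses the MSE values reported above and adds the RMSE for each model, which is in the same units as PM 2.5 and therefore easier to read. The dictionary and variable names are chosen here only for illustration.

```python
import math

# Test-set MSE values reported in the comparison above
mse = {
    "KNN": 3247.1281849543693,
    "Linear Regression": 4164.32350260401,
    "Ridge Regression": 3912.4550098306554,
    "Lasso Regression": 3922.586481270448,
    "Decision Tree Regressor": 4589.011220616968,
    "Random Forest Regressor": 3308.584324808751,
    "Xgboost Regressor": 3458.1210809592762,
}

# Sort from lowest to highest error
ranked = sorted(mse.items(), key=lambda item: item[1])
for name, value in ranked:
    print("{:<26} MSE={:9.2f}  RMSE={:6.2f}".format(name, value, math.sqrt(value)))
```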
Random Forest Regressor for Top 5 features
# Taking the top 5 features for deployment
# The top 5 features are taken from the ExtraTreesRegressor
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:,[0,1,2,3,6]], combine_data.iloc[:,-1], test_size=0.3, random_state=0)
# initializing the model
random_forest_reg1 = RandomForestRegressor(n_estimators=901, max_depth=5, min_samples_split=16, min_samples_leaf=1, max_features='sqrt', n_jobs=-1)
# fitting the model
random_forest_reg1.fit(X_train, y_train)
random_forest_reg1.score(X_train, y_train)
# Output: 0.650728501356922
random_forest_reg1.score(X_test, y_test)
# Output: 0.5218627277628385
prediction = random_forest_reg1.predict(X_test)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# Output: MSE: 3260.998026487038
Dumping the model into a pickle file
import pickle
# The with block closes the file automatically
with open("Random_forest_regressor.pkl", "wb") as pickle_out:
    pickle.dump(random_forest_reg1, pickle_out)
Streamlit framework
import pickle
import streamlit as st

# Load the trained model
with open("Random_forest_regressor.pkl", "rb") as pickle_in:
    random_forest_regressor = pickle.load(pickle_in)

def welcome():
    return " welcome all"

def predict_AQI(Average_Temperature, Maximum_Temperature, Minimum_Temperature, Atm_pressure_at_sea_level, Average_wind_speed):
    # st.text_input returns strings, so convert the inputs to float before predicting
    features = [float(Average_Temperature), float(Maximum_Temperature), float(Minimum_Temperature), float(Atm_pressure_at_sea_level), float(Average_wind_speed)]
    prediction = random_forest_regressor.predict([features])
    print(prediction)
    return prediction

def main():
    st.title("Hyderabad AQI prediction")
    html_temp = """
    <div style="background-color:green;padding:20px">
    <h2 style="color:white;text-align:center;">AQI prediction ML App </h2>
    </div>
    """
    st.markdown(html_temp, unsafe_allow_html=True)
    Average_Temperature = st.text_input("Average_Temperature ", "Type Here")
    Maximum_Temperature = st.text_input("Maximum_Temperature ", "Type Here")
    Minimum_Temperature = st.text_input("Minimum_Temperature ", "Type Here")
    Atm_pressure_at_sea_level = st.text_input("Atm_pressure_at_sea_level ", "Type Here")
    Average_wind_speed = st.text_input("Average_wind_speed ", "Type Here")
    result = ""
    if st.button("Predict"):
        result = predict_AQI(Average_Temperature, Maximum_Temperature, Minimum_Temperature, Atm_pressure_at_sea_level, Average_wind_speed)
    st.success('The output is {}'.format(result))
    if st.button("About"):
        st.text("Lets LEarn")
        st.text("Built with Streamlit")

if __name__ == '__main__':
    main()
Using the above Streamlit app, we can deploy the model on a cloud platform.
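Because the text boxes return strings, and their default value is "Type Here", it is worth validating the inputs before they reach the model. The sketch below is one possible way to do that. The helper name `parse_inputs` and the sample values are hypothetical and are not part of the app above.

```python
def parse_inputs(raw_values):
    """Convert text-box strings to floats; return (values, error_message)."""
    values = []
    for name, text in raw_values.items():
        try:
            values.append(float(text))
        except ValueError:
            # Report the first field that is not a number
            return None, "Please enter a number for {}".format(name)
    return values, None

ok, err = parse_inputs({"Average_Temperature": "26.7", "Average_wind_speed": "6.5"})
print(ok, err)
bad, err = parse_inputs({"Average_Temperature": "Type Here"})
print(bad, err)
```

In the app, the error message could be shown with st.error and the prediction skipped when it is not None.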
(viii) Conclusion:
Thank you for your interest in the blog. Please leave any comments, feedback, or suggestions you may have.
Github: https://github.com/jani-excergy/Complete_Data_Science_Life_Cycle_Projects
(ix) References :
Wikipedia: https://simple.wikipedia.org/wiki/Air
Krish Naik youtube channel