Hyderabad AQI Prediction Project From Scratch To Deployment

Janibasha Shaik · Published in Analytics Vidhya · 12 min read · Sep 14, 2020

Air Quality Index prediction

Table of contents:

(i) Introduction

(ii) The Motivation for the project

(iii) Data Collection

(iv) Data Pre-processing

(v) Feature Importance

(vi) Model Building

(vii) Deployment with Streamlit

(viii) Conclusion

(ix) References

(i) Introduction

Air is a mixture of many gases and dust particles. It is the clear gas in which living things live and breathe. Air is about 78% nitrogen, 21% oxygen, 0.9% argon, 0.04% carbon dioxide, and very small amounts of other gases.

Air quality is measured with the Air Quality Index (AQI); in this project we use PM 2.5 as the indicator.

PM 2.5 is fine particulate matter, an air pollutant that becomes a health concern when its levels in the air are high.
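For reference, the AQI reported for PM 2.5 is usually obtained by piecewise linear interpolation between breakpoint concentrations. The minimal sketch below uses the US EPA breakpoint table (an assumption here; India's CPCB publishes its own breakpoints), and it is only for context: the models later in this post predict the PM 2.5 value directly.

# AQI sub-index from a 24-hour PM 2.5 concentration (µg/m³), using assumed US EPA breakpoints
def pm25_to_aqi(conc):
    # (C_low, C_high, I_low, I_high) for each AQI category
    breakpoints = [
        (0.0, 12.0, 0, 50),
        (12.1, 35.4, 51, 100),
        (35.5, 55.4, 101, 150),
        (55.5, 150.4, 151, 200),
        (150.5, 250.4, 201, 300),
        (250.5, 350.4, 301, 400),
        (350.5, 500.4, 401, 500),
    ]
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= conc <= c_high:
            # linear interpolation inside the matching range
            return round((i_high - i_low) / (c_high - c_low) * (conc - c_low) + i_low)
    return None  # concentration outside the table

print(pm25_to_aqi(40.0))  # ≈ 112, "Unhealthy for Sensitive Groups"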

(ii) The Motivation For The Project:

Due to COVID there was almost no transportation anywhere in the world, and because of that the air became noticeably fresher. I currently live in Hyderabad, one of the most polluted cities in India, yet in recent days even here the air has been fresh. That made me wonder why the air is so fresh nowadays and what the air quality was like in previous years. This project was born from that thought.

(iii) Data Collection:

I collected the data from https://en.tutiempo.net/climate/ws-431280.html for the years 2013 to 2018 using the requests module.

I created a function to collect the data from the website in HTML format.

import os
import time
import sys
import requests

def data_collection():
    for year in range(2013, 2019):      # loop over the years
        for month in range(1, 13):      # loop over the months
            if month < 10:
                # months below 10 need a leading zero in the URL
                url = 'https://en.tutiempo.net/climate/0{}-{}/ws-431280.html'.format(month, year)
            else:
                # months 10 to 12
                url = 'https://en.tutiempo.net/climate/{}-{}/ws-431280.html'.format(month, year)

            collected_texts = requests.get(url)                          # download the HTML page
            collected_text_utf = collected_texts.text.encode('utf-8')    # the page mixes characters, so encode it as UTF-8

            # Store the data in an Html_data directory with one sub-directory per year
            if not os.path.exists("Data_collection/Html_data/{}".format(year)):
                os.makedirs("Data_collection/Html_data/{}".format(year))

            # Write the month's page into the year directory
            with open("Data_collection/Html_data/{}/{}.html".format(year, month), 'wb') as Result:
                Result.write(collected_text_utf)

        sys.stdout.flush()

if __name__ == "__main__":
    start_time = time.time()
    data_collection()                    # function call
    stop_time = time.time()
    print('Time Taken {}'.format(stop_time - start_time))   # time taken to download and store the data

Executing the above function collects the data in HTML format from 2013 to 2018: it creates an Html_data directory, and inside it each year's pages are stored in a separate folder.

The HTML data contains the following columns:
T = Average temperature (°C)
TM = Maximum temperature (°C)
Tm = Minimum temperature (°C)
SLP = Atmospheric pressure at sea level (hPa)
H = Average relative humidity (%)
VV = Average visibility (km)
V = Average wind speed (km/h)
VM = Maximum sustained wind speed (km/h)

We also need the PM 2.5 values; these were collected from a paid API for 2013 to 2018.

If you want the PM 2.5 data, visit the GitHub link below:

https://github.com/jani-excergy/Complete_Data_Science_Life_Cycle_Projects/tree/master/Data_collection

PM 2.5 (AQI) is the dependent feature.

T, TM, Tm, SLP, H, VV, V, VM are the independent features.

The dependent feature is real-valued, so we are solving a regression problem.

We have PM 2.5 values for each hour, so first we need to convert those values from hourly readings to daily averages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def Day_values(year):
    # Read the hourly PM 2.5 CSV for the given year in chunks of 24 rows (one day at a time)
    average = []
    for rows in pd.read_csv(r'C:\Users\Unify\Desktop\janibasha\Complete Data Science life cycle\Data_collection\AQI\aqi{}.csv'.format(year), chunksize=24):
        add_var = 0
        avg = 0.0
        data = []
        df = pd.DataFrame(data=rows)
        for index, row in df.iterrows():
            data.append(row['PM2.5'])
        for i in data:
            if type(i) is float or type(i) is int:
                add_var = add_var + i
            elif type(i) is str:
                # skip the sensor error codes before converting to float
                if i != 'NoData' and i != 'PwrFail' and i != '---' and i != 'InVld':
                    temp = float(i)
                    add_var = add_var + temp
        avg = add_var / 24
        average.append(avg)
    return average

if __name__ == "__main__":
    lst2013 = Day_values(2013)
    lst2014 = Day_values(2014)
    lst2015 = Day_values(2015)
    lst2016 = Day_values(2016)
    lst2017 = Day_values(2017)
    lst2018 = Day_values(2018)
    plt.plot(range(len(lst2013)), lst2013, label="2013 data")
    plt.plot(range(len(lst2014)), lst2014, label="2014 data")
    plt.plot(range(len(lst2015)), lst2015, label="2015 data")
    plt.plot(range(len(lst2016)), lst2016, label="2016 data")
    plt.xlabel('Day')
    plt.ylabel('PM 2.5')
    plt.legend(loc='upper right')
    plt.show()

Executing the above function, we get the daily average PM 2.5 values for each year.

This gives us the dependent feature.

Now we need to extract the independent features from the HTML. For that I use BeautifulSoup to parse the HTML data into a CSV file.

After collecting the independent features into a CSV, we need to add the dependent feature PM 2.5 to that CSV file.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
from Hours_to_Day import Day_values   # daily PM 2.5 averages from the previous script
from bs4 import BeautifulSoup
import os
import csv

def html_scraping(month, year):
    # To scrape the data we need to give the path of the files
    file_path = open('Data_collection/Html_data/{}/{}.html'.format(year, month), 'rb')

    # After reading the file we store its contents in a variable
    scraped_data = file_path.read()

    # Two empty lists: one for the raw cell text, one for the cleaned rows
    Sample_data = []
    Finale_data = []

    # Initialise BeautifulSoup: BeautifulSoup(markup, parser)
    soup = BeautifulSoup(scraped_data, "lxml")

    # We need the table data, so we loop through the table tag and its class
    for table in soup.findAll('table', {'class': 'medias mensuales numspan'}):
        # Inside the table we need the body to find the features
        for tbody in table:
            # Inside the body we need the rows to get the feature values
            for tr in tbody:
                # Extract the row text and append it to Sample_data
                Extract_data = tr.get_text()
                Sample_data.append(Extract_data)

    # Checking the HTML manually shows 15 columns per row,
    # so the number of rows is len(Sample_data) / 15
    No_rows = len(Sample_data) / 15

    # Rebuild the rows: take 15 cells at a time from Sample_data
    for iterate in range(round(No_rows)):
        # empty list to store the feature values of each row
        lst = []
        for i in range(15):
            # move one cell from Sample_data into the row list
            lst.append(Sample_data[0])
            Sample_data.pop(0)
        # add the 15-column row to the final list
        Finale_data.append(lst)

    # Drop the header row and the trailing monthly-summary row
    Length_Finale_data = len(Finale_data)
    Finale_data.pop(Length_Finale_data - 1)
    Finale_data.pop(0)

    # Remove the empty columns because these features don't contain any value
    for feature in range(len(Finale_data)):
        Finale_data[feature].pop(6)
        Finale_data[feature].pop(13)
        Finale_data[feature].pop(12)
        Finale_data[feature].pop(11)
        Finale_data[feature].pop(10)
        Finale_data[feature].pop(9)
        Finale_data[feature].pop(0)

    return Finale_data

# After scraping the HTML table data:
# - the HTML table features are the independent variables
# - the Hours_to_Day PM 2.5 feature is the dependent variable
# We combine both into a CSV file per year; this helper reads a year's CSV back as a list of rows
# (a chunksize of 600 covers a full year, so the first chunk holds all rows)
def combine_dependent_independent(year, cs):
    for i in pd.read_csv('Data_collection/Html_scraping_data/real_' + str(year) + '.csv', chunksize=cs):
        df = pd.DataFrame(data=i)
        mylist = df.values.tolist()
        return mylist

if __name__ == "__main__":
    # Create a directory to store the CSV files
    if not os.path.exists("Data_collection/Html_scraping_data"):
        os.makedirs("Data_collection/Html_scraping_data")

    # Write one CSV per year from 2013 to 2018
    for year in range(2013, 2019):
        final_data = []
        with open("Data_collection/Html_scraping_data/real_" + str(year) + ".csv", 'w') as csvfile:
            writting_csv = csv.writer(csvfile, dialect='excel')
            writting_csv.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM2.5'])

        # Scrape every month of the year and collect the rows
        for month in range(1, 13):
            temp = html_scraping(month, year)
            final_data = final_data + temp

        # Daily PM 2.5 averages for this year (dependent feature)
        dependent = Day_values(year)

        # Insert the dependent feature PM 2.5 next to the independent features
        for i in range(len(final_data) - 1):
            final_data[i].insert(8, dependent[i])

        # Append the rows, skipping any row with missing values
        with open('Data_collection/Html_scraping_data/real_' + str(year) + '.csv', 'a') as csvfile:
            wr = csv.writer(csvfile, dialect='excel')
            for row in final_data:
                flag = 0
                for elem in row:
                    if elem == "" or elem == "-":
                        flag = 1
                if flag != 1:
                    wr.writerow(row)

    # Read each year's CSV back
    data_2013 = combine_dependent_independent(2013, 600)
    data_2014 = combine_dependent_independent(2014, 600)
    data_2015 = combine_dependent_independent(2015, 600)
    data_2016 = combine_dependent_independent(2016, 600)
    data_2017 = combine_dependent_independent(2017, 600)
    data_2018 = combine_dependent_independent(2018, 600)

    # Combine all years' data into a single CSV
    total = data_2013 + data_2014 + data_2015 + data_2016 + data_2017 + data_2018

    with open('Data_collection/Html_scraping_data/Real_Combine.csv', 'w') as csvfile:
        wr = csv.writer(csvfile, dialect='excel')
        wr.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'])
        wr.writerows(total)

    # read the combined file back
    df = pd.read_csv('Data_collection/Html_scraping_data/Real_Combine.csv')

After scraping the HTML and appending the PM 2.5 values to each CSV, we get one CSV file per year.

We then combine all years' data into a single CSV.

Finally we get the Real_Combine.csv file.

We use this data to build the ML application to predict future AQI.

Real_combine.csv sample data
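As a quick sanity check, the combined data can be previewed with pandas. This is a minimal sketch; the path assumes the file was written by the script above.

import pandas as pd

combine_data = pd.read_csv('Data_collection/Html_scraping_data/Real_Combine.csv')
print(combine_data.shape)   # expect 9 columns: 8 weather features + PM 2.5
print(combine_data.head())  # first few rows of the combined data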

(iv) Data Pre-Processing :

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the combined csv file
combine_data = pd.read_csv(r'C:\Users\Desktop\janibasha\Complete Data Science life cycle\Data_collection\Html_scraping_data\Real_combine.csv')

# checking the number of numerical features
combine_data.info()

# statistical summary of the data
combine_data.describe()

# checking for null values
combine_data.isnull()
combine_data.isnull().sum()

# we also visualize the null values with seaborn
sns.heatmap(combine_data.isnull(), yticklabels=False)
Visualizing null values in the data frame

From the visualization, we can see there are no null values in the data frame.

All features are numerical and there are no categorical features, so no encoding is needed.

# checking outliers
combine_data.boxplot(column='Tm')
plt.show()
Outliers checking

Similarly, we can check the outliers for every feature.
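For example, a minimal sketch that draws a boxplot for every column of the combine_data frame defined above (not the author's original code):

# boxplot for each independent feature plus PM 2.5
for col in combine_data.columns:
    combine_data.boxplot(column=col)
    plt.title(col)
    plt.show()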

# Multivariate analysis
sns.pairplot(combine_data)
Multivariate analysis between Dependent(y-axis) vs independent(x-axis) features

If we observe the above analysis, there is no linear relation between the independent and dependent features, so linear algorithms are unlikely to give good results.

# We also check the correlation between the dependent and independent features
combine_data.corr()
relation = combine_data.corr()
relation_index = relation.index
sns.heatmap(combine_data[relation_index].corr(), annot=True)
correlation heatmap

From the above heatmap, we get an idea of the relationship between the dependent feature (PM 2.5) and the independent features (T, TM, Tm, SLP, H, VV, V, VM).
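To read the same information numerically, one option (a sketch; the target column name is assumed to be 'PM 2.5', as in the combined CSV header) is to sort the correlations of all features against the target:

# correlation of each independent feature with the target, strongest first
print(relation['PM 2.5'].drop('PM 2.5').sort_values(ascending=False))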

(v) Feature Importance :

We have 8 independent features, and we don't know which of them are important for predicting the PM 2.5 value.

To find the feature importances, we use ExtraTreesRegressor (model-based feature selection).

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# train/test split (the same split used later in the model-building section)
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:, :-1], combine_data.iloc[:, -1], test_size=0.3, random_state=0)

# fit an ExtraTreesRegressor and read its feature importances
reg = ExtraTreesRegressor()
reg.fit(X_train, y_train)
reg.feature_importances_

feat_importances = pd.Series(reg.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.show()
Top 5 features to predict PM 2.5 (model-based feature selection)

(vi) Model Building :

We have only 8 features, so for model building I consider all of them and compare the performance of different models.

At deployment time, we train the model on the top 5 features and deploy it on a cloud platform.
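Before tuning each model individually, a quick baseline comparison can be run with cross-validation. This is a minimal sketch (not from the original post) using untuned models and 5-fold negative MSE; the detailed tuning of each model follows in the subsections below.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X = combine_data.iloc[:, :-1]
y = combine_data.iloc[:, -1]

# mean MSE from 5-fold cross validation for a few untuned baselines
models = {
    'KNN': KNeighborsRegressor(),
    'Linear': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(),
    'RandomForest': RandomForestRegressor(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
    print(name, -scores.mean())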

(1) KNN Regressor :

Hyperparameter tuning (K)

The error curve stabilizes around K = 17, so we take K = 17.
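The tuning plot itself is not reproduced here. A minimal sketch of how the sweep can be done (not the author's exact code; the K range is an assumption, and it reuses the train/test split defined earlier):

from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics

# test-set MSE for each value of K
errors = []
k_range = range(1, 30)
for k in k_range:
    knn = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn.fit(X_train, y_train)
    errors.append(metrics.mean_squared_error(y_test, knn.predict(X_test)))

plt.plot(k_range, errors)
plt.xlabel('K')
plt.ylabel('Test MSE')
plt.show()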

from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics

# full feature matrix and target for cross validation
X = combine_data.iloc[:, :-1]
y = combine_data.iloc[:, -1]

# weighted knn
weighted_tuned_reg = KNeighborsRegressor(n_neighbors=17, weights='distance')
weighted_tuned_reg.fit(X_train, y_train)
weighted_tuned_reg.score(X_train, y_train)
# 1.0

# performance of the model on the test dataset
weighted_tuned_reg.score(X_test, y_test)
# 0.5238963653617966

# cross validation
from sklearn.model_selection import cross_val_score
score = cross_val_score(weighted_tuned_reg, X, y, cv=5)

# cross validation performance
score.mean()
# 0.43669012578295

# Model evaluation
prediction = weighted_tuned_reg.predict(X_test)

# Comparing predicted PM 2.5 and labelled PM 2.5
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3247.1281849543693
scatter plot between y_test and y_prediction

(2) Linear, Lasso, and Ridge Regressor :

Linear Regressor

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:,:-1], combine_data.iloc[:,-1], test_size=0.3, random_state=0)

from sklearn.linear_model import LinearRegression

# creating the linear regression model
reg_model = LinearRegression(normalize=True)

# fit the independent variables to the dependent variable
reg_model.fit(X_train, y_train)
reg_model.score(X_train, y_train)
# 0.40996129270540077
reg_model.score(X_test, y_test)
# 0.3894144479464322

# slope coefficients
reg_model.coef_
# array([-8.71256785, -0.78565823, -0.64082652, 2.64932719, -1.44568242, 0.27712587, -1.83610819, -1.01908448])

# intercept
reg_model.intercept_
# -2180.4321527938205

# cross validation
from sklearn.model_selection import cross_val_score
score = cross_val_score(reg_model, combine_data.iloc[:,:-1], combine_data.iloc[:,-1], cv=5)
score.mean()
# 0.3229764710803792

prediction = reg_model.predict(X_test)

# comparing predicted y and labelled y
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 4164.32350260401
Scatter between y_test and y_prediction

Ridge Regressor

# RandomizedSearchCV over the Ridge alpha
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# base model (assumed: the original snippet uses reg_model_1 without showing its definition)
reg_model_1 = Ridge()

parameters = {'alpha': randint(1e-8, 100)}
ridge_reg_1 = RandomizedSearchCV(reg_model_1, parameters, scoring='neg_mean_squared_error', cv=5)
ridge_reg_1.fit(combine_data.iloc[:,:-1], combine_data.iloc[:,-1])
ridge_reg_1.best_score_
# -4265.905962013921
ridge_reg_1.best_params_
# {'alpha': 75}

prediction = ridge_reg_1.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3912.4550098306554

Lasso Regressor

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Initializing the model
reg_model_2 = Lasso()

# hyperparameter range
hyperparameters_range = {'alpha': [1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]}

# Searching for the best hyperparameter
lasso_reg = GridSearchCV(reg_model_2, hyperparameters_range, scoring='neg_mean_squared_error', cv=5)
lasso_reg.fit(combine_data.iloc[:,:-1], combine_data.iloc[:,-1])
lasso_reg.best_params_
# {'alpha': 5}
lasso_reg.best_score_
# -4249.911163771522

prediction = lasso_reg.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3922.586481270448

(3) Decision Tree Regressor :

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from datetime import datetime
# creating the decision tree regression model
reg_decision_model = DecisionTreeRegressor()
# hyperparameter ranges for tuning
parameters = {"splitter": ["best", "random"],
              "max_depth": [1,3,5,7,9,11,12],
              "min_samples_leaf": [1,2,3,4,5,6,7,8,9,10],
              "min_weight_fraction_leaf": [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
              "max_features": ["auto","log2","sqrt",None],
              "max_leaf_nodes": [None,10,20,30,40,50,60,70,80,90]}
# grid search over the ranges, scored with negative MSE
tuning_model = GridSearchCV(reg_decision_model, param_grid=parameters, scoring='neg_mean_squared_error', n_jobs=-1, cv=10, verbose=3)
# helper for measuring how long the hyperparameter tuning takes
def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print(thour, ":", tmin, ':', round(tsec, 2))
X = combine_data.iloc[:,:-1]
y = combine_data.iloc[:,-1]
start_time = timer(None)
tuning_model.fit(X, y)
timer(start_time)
tuning_model.best_params_
# {'max_depth': 9, 'max_features': None, 'max_leaf_nodes': 20,
#  'min_samples_leaf': 8, 'min_weight_fraction_leaf': 0.1, 'splitter': 'random'}
tuning_model.best_score_
# -3621.0007087939457
# Model evaluation
prediction = tuning_model.predict(X_test)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 4589.011220616968

(4) RandomForest Regressor

from sklearn.ensemble import RandomForestRegressor
# Hyperparameter tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Hyperparameter ranges
from scipy.stats import randint
parameters = {'n_estimators': randint(100,1200),
              'max_features': ['auto','sqrt'],
              'max_depth': randint(5,40),
              'min_samples_split': randint(2,30),
              'min_samples_leaf': randint(1,10)}
# Model for tuning
base_learner = RandomForestRegressor()
# Tuning
tuned_model = RandomizedSearchCV(estimator=base_learner, param_distributions=parameters, scoring='neg_mean_squared_error', n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)
tuned_model.fit(X_train, y_train)
tuned_model.best_params_
# {'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1,
#  'min_samples_split': 16, 'n_estimators': 901}
tuned_model.best_score_
# -3425.3665578465598
# Predicting X_test values using the tuned model
prediction = tuned_model.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3308.584324808751
Scatter plot between y_test and y_prediction

(5) Xgboost Regressor :

import xgboost as xgb
# Hyperparameter tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Hyperparameter ranges
from scipy.stats import randint
parameters = {'n_estimators': randint(100,1200),
              'learning_rate': [0.001,0.002,0.003,0.005,0.01,0.04,0.05,0.1,0.2,0.3,0.4,0.5,0.6],
              'max_depth': randint(5,40),
              'subsample': [0.5,0.6,0.7,0.8],
              'min_child_weight': randint(1,10)}
# Model for tuning
base_learner = xgb.XGBRegressor()
# Tuning
tuned_model = RandomizedSearchCV(estimator=base_learner, param_distributions=parameters, scoring='neg_mean_squared_error', n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)
tuned_model.fit(X_train, y_train)
tuned_model.best_params_
# {'learning_rate': 0.005, 'max_depth': 5, 'min_child_weight': 8,
#  'n_estimators': 611, 'subsample': 0.6}
tuned_model.best_score_
# -3656.933662545248
# Predicting X_test values using the tuned model
prediction = tuned_model.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3458.1210809592762
Scatter plot between y_test and y_prediction

(vii) Deployment :

Comparing MSE of all models

KNN: 3247.1281849543693

Linear Regression: 4164.32350260401

Ridge Regression: 3912.4550098306554

Lasso Regression: 3922.586481270448

Decision Tree Regressor: 4589.011220616968

Random Forest Regressor: 3308.584324808751

Xgboost Regressor: 3458.1210809592762
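The same comparison in code (a small sketch; the values are copied from the runs above and rounded):

# test MSE of each tuned model
mse = {
    'KNN': 3247.13,
    'Linear Regression': 4164.32,
    'Ridge Regression': 3912.46,
    'Lasso Regression': 3922.59,
    'Decision Tree': 4589.01,
    'Random Forest': 3308.58,
    'Xgboost': 3458.12,
}
print(min(mse, key=mse.get))  # KNN has the lowest test MSE, but see the note below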

Although KNN has the lowest test MSE, it fits the training data perfectly (train score 1.0) while scoring much lower on the test set, so the Random Forest is the more generalized model. I therefore deploy the Random Forest Regressor with the top 5 features using Streamlit.

Random Forest Regressor for Top 5 features

# Taking the top 5 features for deployment
# (the top 5 features come from the ExtraTreesRegressor feature importances)
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:,[0,1,2,3,6]], combine_data.iloc[:,-1], test_size=0.3, random_state=0)
# initializing the model with the tuned hyperparameters
random_forest_reg1 = RandomForestRegressor(n_estimators=901, max_depth=5, min_samples_split=16, min_samples_leaf=1, max_features='sqrt', n_jobs=-1)
# fitting the model
random_forest_reg1.fit(X_train, y_train)
random_forest_reg1.score(X_train, y_train)
# 0.650728501356922
random_forest_reg1.score(X_test, y_test)
# 0.5218627277628385
prediction = random_forest_reg1.predict(X_test)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3260.998026487038

Dumping model into pickle

import pickle
pickle_out = open("Random_forest_regressor.pkl", "wb")
pickle.dump(random_forest_reg1, pickle_out)
pickle_out.close()

Streamlit framework

import pickle
import streamlit as st

pickle_in = open("Random_forest_regressor.pkl", "rb")
random_forest_regressor = pickle.load(pickle_in)

def welcome():
    return "welcome all"

def predict_AQI(Average_Temperature, Maximum_Temperature, Minimum_Temperature, Atm_pressure_at_sea_level, Average_wind_speed):
    # predict PM 2.5 from the five deployed features
    prediction = random_forest_regressor.predict([[Average_Temperature, Maximum_Temperature, Minimum_Temperature, Atm_pressure_at_sea_level, Average_wind_speed]])
    print(prediction)
    return prediction

def main():
    st.title("Hyderabad AQI prediction")
    html_temp = """
    <div style="background-color:green;padding:20px">
    <h2 style="color:white;text-align:center;">AQI prediction ML App </h2>
    </div>
    """
    st.markdown(html_temp, unsafe_allow_html=True)
    Average_Temperature = st.text_input("Average_Temperature", "Type Here")
    Maximum_Temperature = st.text_input("Maximum_Temperature", "Type Here")
    Minimum_Temperature = st.text_input("Minimum_Temperature", "Type Here")
    Atm_pressure_at_sea_level = st.text_input("Atm_pressure_at_sea_level", "Type Here")
    Average_wind_speed = st.text_input("Average_wind_speed", "Type Here")
    result = ""
    if st.button("Predict"):
        result = predict_AQI(Average_Temperature, Maximum_Temperature, Minimum_Temperature, Atm_pressure_at_sea_level, Average_wind_speed)
    st.success('The output is {}'.format(result))
    if st.button("About"):
        st.text("Lets Learn")
        st.text("Built with Streamlit")

if __name__ == '__main__':
    main()

Using the above Streamlit app, we can deploy the model to any cloud platform (locally, assuming the script is saved as app.py, it can be started with streamlit run app.py).

(viii) Conclusion:

Thank you for your interest in the blog. Please leave comments, feedback, and suggestions if you have any.

Github: https://github.com/jani-excergy/Complete_Data_Science_Life_Cycle_Projects

(ix) References :

Wikipedia: https://simple.wikipedia.org/wiki/Air

Krish Naik's YouTube channel
