Hyderabad AQI Prediction Project From Scratch To Deployment

Janibasha Shaik · Published in Analytics Vidhya · 12 min read · Sep 14, 2020

Air Quality Index prediction

Table of contents:

(i) Introduction

(ii) The Motivation for the project

(iii) Data Collection

(iv) Data Pre-processing

(v) Feature Importance

(vi) Model Building

(vii) Deployment with Streamlit

(viii) Conclusion

(ix) References

(i) Introduction

Air is a mixture of many gases and dust particles. It is the clear gas in which living things live and breathe. Air is about 78% nitrogen, 21% oxygen, 0.9% argon, 0.04% carbon dioxide, and very small amounts of other gases.

Air quality is measured with the Air Quality Index (AQI); in this project we use PM 2.5 as the indicator.

PM 2.5 is fine particulate matter, an air pollutant that becomes a health concern when its levels in the air are high.
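For reference, the AQI reported for PM 2.5 is usually obtained by piecewise linear interpolation between breakpoint concentrations. The minimal sketch below uses the US EPA breakpoint table (an assumption here; India's CPCB publishes its own breakpoints), and it is only for context: the models later in this post predict the PM 2.5 value directly.

# AQI sub-index from a 24-hour PM 2.5 concentration (µg/m³), using assumed US EPA breakpoints
def pm25_to_aqi(conc):
    # (C_low, C_high, I_low, I_high) for each AQI category
    breakpoints = [
        (0.0, 12.0, 0, 50),
        (12.1, 35.4, 51, 100),
        (35.5, 55.4, 101, 150),
        (55.5, 150.4, 151, 200),
        (150.5, 250.4, 201, 300),
        (250.5, 350.4, 301, 400),
        (350.5, 500.4, 401, 500),
    ]
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= conc <= c_high:
            # linear interpolation inside the matching range
            return round((i_high - i_low) / (c_high - c_low) * (conc - c_low) + i_low)
    return None  # concentration outside the table

print(pm25_to_aqi(40.0))  # ≈ 112, "Unhealthy for Sensitive Groups"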

(ii) The Motivation For The Project:

Due to COVID there was almost no transportation anywhere in the world, and because of that the air became noticeably fresher. I currently live in Hyderabad, one of the most polluted cities in India, yet in recent days even here the air has been fresh. That made me wonder why the air is so fresh nowadays and what the air quality was like in previous years. This project was born from that thought.

(iii) Data Collection:

I collected the data from https://en.tutiempo.net/climate/ws-431280.html for the years 2013 to 2018 using the requests module.

I created a function to collect the data from the website in HTML format.

import os
import time
import sys
import requests

def data_collection():
    for year in range(2013, 2019):      # loop over the years
        for month in range(1, 13):      # loop over the months
            if month < 10:
                # months below 10 need a leading zero in the URL
                url = 'https://en.tutiempo.net/climate/0{}-{}/ws-431280.html'.format(month, year)
            else:
                # months 10 to 12
                url = 'https://en.tutiempo.net/climate/{}-{}/ws-431280.html'.format(month, year)

            collected_texts = requests.get(url)                          # download the HTML page
            collected_text_utf = collected_texts.text.encode('utf-8')    # the page mixes characters, so encode it as UTF-8

            # Store the data in an Html_data directory with one sub-directory per year
            if not os.path.exists("Data_collection/Html_data/{}".format(year)):
                os.makedirs("Data_collection/Html_data/{}".format(year))

            # Write the month's page into the year directory
            with open("Data_collection/Html_data/{}/{}.html".format(year, month), 'wb') as Result:
                Result.write(collected_text_utf)

        sys.stdout.flush()

if __name__ == "__main__":
    start_time = time.time()
    data_collection()                    # function call
    stop_time = time.time()
    print('Time Taken {}'.format(stop_time - start_time))   # time taken to download and store the data

Executing the above function collects the data in HTML format from 2013 to 2018: it creates an Html_data directory, and inside it each year's pages are stored in a separate folder.

The HTML data contains the following columns:
T = Average temperature (°C)
TM = Maximum temperature (°C)
Tm = Minimum temperature (°C)
SLP = Atmospheric pressure at sea level (hPa)
H = Average relative humidity (%)
VV = Average visibility (km)
V = Average wind speed (km/h)
VM = Maximum sustained wind speed (km/h)

We also need the PM 2.5 values; these were collected from a paid API for 2013 to 2018.

If you want the PM 2.5 data, visit the GitHub link below:

https://github.com/jani-excergy/Complete_Data_Science_Life_Cycle_Projects/tree/master/Data_collection

PM 2.5 (AQI) is the dependent feature.

T, TM, Tm, SLP, H, VV, V, VM are the independent features.

The dependent feature is real-valued, so we are solving a regression problem.

We have PM 2.5 values for each hour, so first we need to convert those values from hourly readings to daily averages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def Day_values(year):
    # Read the hourly PM 2.5 CSV for the given year in chunks of 24 rows (one day at a time)
    average = []
    for rows in pd.read_csv(r'C:\Users\Unify\Desktop\janibasha\Complete Data Science life cycle\Data_collection\AQI\aqi{}.csv'.format(year), chunksize=24):
        add_var = 0
        avg = 0.0
        data = []
        df = pd.DataFrame(data=rows)
        for index, row in df.iterrows():
            data.append(row['PM2.5'])
        for i in data:
            if type(i) is float or type(i) is int:
                add_var = add_var + i
            elif type(i) is str:
                # skip the sensor error codes before converting to float
                if i != 'NoData' and i != 'PwrFail' and i != '---' and i != 'InVld':
                    temp = float(i)
                    add_var = add_var + temp
        avg = add_var / 24
        average.append(avg)
    return average

if __name__ == "__main__":
    lst2013 = Day_values(2013)
    lst2014 = Day_values(2014)
    lst2015 = Day_values(2015)
    lst2016 = Day_values(2016)
    lst2017 = Day_values(2017)
    lst2018 = Day_values(2018)
    plt.plot(range(len(lst2013)), lst2013, label="2013 data")
    plt.plot(range(len(lst2014)), lst2014, label="2014 data")
    plt.plot(range(len(lst2015)), lst2015, label="2015 data")
    plt.plot(range(len(lst2016)), lst2016, label="2016 data")
    plt.xlabel('Day')
    plt.ylabel('PM 2.5')
    plt.legend(loc='upper right')
    plt.show()

Executing the above function, we get the daily average PM 2.5 values for each year.

This gives us the dependent feature.

Now we need to extract the independent features from the HTML. For that I use BeautifulSoup to parse the HTML data into a CSV file.

After collecting the independent features into a CSV, we need to add the dependent feature PM 2.5 to that CSV file.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
from Hours_to_Day import Day_values   # daily PM 2.5 averages from the previous script
from bs4 import BeautifulSoup
import os
import csv

def html_scraping(month, year):
    # To scrape the data we need to give the path of the files
    file_path = open('Data_collection/Html_data/{}/{}.html'.format(year, month), 'rb')

    # After reading the file we store its contents in a variable
    scraped_data = file_path.read()

    # Two empty lists: one for the raw cell text, one for the cleaned rows
    Sample_data = []
    Finale_data = []

    # Initialise BeautifulSoup: BeautifulSoup(markup, parser)
    soup = BeautifulSoup(scraped_data, "lxml")

    # We need the table data, so we loop through the table tag and its class
    for table in soup.findAll('table', {'class': 'medias mensuales numspan'}):
        # Inside the table we need the body to find the features
        for tbody in table:
            # Inside the body we need the rows to get the feature values
            for tr in tbody:
                # Extract the row text and append it to Sample_data
                Extract_data = tr.get_text()
                Sample_data.append(Extract_data)

    # Checking the HTML manually shows 15 columns per row,
    # so the number of rows is len(Sample_data) / 15
    No_rows = len(Sample_data) / 15

    # Rebuild the rows: take 15 cells at a time from Sample_data
    for iterate in range(round(No_rows)):
        # empty list to store the feature values of each row
        lst = []
        for i in range(15):
            # move one cell from Sample_data into the row list
            lst.append(Sample_data[0])
            Sample_data.pop(0)
        # add the 15-column row to the final list
        Finale_data.append(lst)

    # Drop the header row and the trailing monthly-summary row
    Length_Finale_data = len(Finale_data)
    Finale_data.pop(Length_Finale_data - 1)
    Finale_data.pop(0)

    # Remove the empty columns because these features don't contain any value
    for feature in range(len(Finale_data)):
        Finale_data[feature].pop(6)
        Finale_data[feature].pop(13)
        Finale_data[feature].pop(12)
        Finale_data[feature].pop(11)
        Finale_data[feature].pop(10)
        Finale_data[feature].pop(9)
        Finale_data[feature].pop(0)

    return Finale_data

# After scraping the HTML table data:
# - the HTML table features are the independent variables
# - the Hours_to_Day PM 2.5 feature is the dependent variable
# We combine both into a CSV file per year; this helper reads a year's CSV back as a list of rows
# (a chunksize of 600 covers a full year, so the first chunk holds all rows)
def combine_dependent_independent(year, cs):
    for i in pd.read_csv('Data_collection/Html_scraping_data/real_' + str(year) + '.csv', chunksize=cs):
        df = pd.DataFrame(data=i)
        mylist = df.values.tolist()
        return mylist

if __name__ == "__main__":
    # Create a directory to store the CSV files
    if not os.path.exists("Data_collection/Html_scraping_data"):
        os.makedirs("Data_collection/Html_scraping_data")

    # Write one CSV per year from 2013 to 2018
    for year in range(2013, 2019):
        final_data = []
        with open("Data_collection/Html_scraping_data/real_" + str(year) + ".csv", 'w') as csvfile:
            writting_csv = csv.writer(csvfile, dialect='excel')
            writting_csv.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM2.5'])

        # Scrape every month of the year and collect the rows
        for month in range(1, 13):
            temp = html_scraping(month, year)
            final_data = final_data + temp

        # Daily PM 2.5 averages for this year (dependent feature)
        dependent = Day_values(year)

        # Insert the dependent feature PM 2.5 next to the independent features
        for i in range(len(final_data) - 1):
            final_data[i].insert(8, dependent[i])

        # Append the rows, skipping any row with missing values
        with open('Data_collection/Html_scraping_data/real_' + str(year) + '.csv', 'a') as csvfile:
            wr = csv.writer(csvfile, dialect='excel')
            for row in final_data:
                flag = 0
                for elem in row:
                    if elem == "" or elem == "-":
                        flag = 1
                if flag != 1:
                    wr.writerow(row)

    # Read each year's CSV back
    data_2013 = combine_dependent_independent(2013, 600)
    data_2014 = combine_dependent_independent(2014, 600)
    data_2015 = combine_dependent_independent(2015, 600)
    data_2016 = combine_dependent_independent(2016, 600)
    data_2017 = combine_dependent_independent(2017, 600)
    data_2018 = combine_dependent_independent(2018, 600)

    # Combine all years' data into a single CSV
    total = data_2013 + data_2014 + data_2015 + data_2016 + data_2017 + data_2018

    with open('Data_collection/Html_scraping_data/Real_Combine.csv', 'w') as csvfile:
        wr = csv.writer(csvfile, dialect='excel')
        wr.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'])
        wr.writerows(total)

    # read the combined file back
    df = pd.read_csv('Data_collection/Html_scraping_data/Real_Combine.csv')

After scraping the HTML and appending the PM 2.5 values to each CSV, we get one CSV file per year.

We then combine all years' data into a single CSV.

Finally we get the Real_Combine.csv file.

We use this data to build the ML application to predict future AQI.

Real_combine.csv sample data
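As a quick sanity check, the combined data can be previewed with pandas. This is a minimal sketch; the path assumes the file was written by the script above.

import pandas as pd

combine_data = pd.read_csv('Data_collection/Html_scraping_data/Real_Combine.csv')
print(combine_data.shape)   # expect 9 columns: 8 weather features + PM 2.5
print(combine_data.head())  # first few rows of the combined data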

(iv) Data Pre-Processing :

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the combined csv file
combine_data = pd.read_csv(r'C:\Users\Desktop\janibasha\Complete Data Science life cycle\Data_collection\Html_scraping_data\Real_combine.csv')

# checking the number of numerical features
combine_data.info()

# statistical summary of the data
combine_data.describe()

# checking for null values
combine_data.isnull()
combine_data.isnull().sum()

# we also visualize the null values with seaborn
sns.heatmap(combine_data.isnull(), yticklabels=False)
Visualizing null values in the data frame

From the visualization, we can see there are no null values in the data frame.

All features are numerical and there are no categorical features, so no encoding is needed.

# checking outliers
combine_data.boxplot(column='Tm')
plt.show()
Outliers checking

Similarly, we can check the outliers for every feature.
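For example, a minimal sketch that draws a boxplot for every column of the combine_data frame defined above (not the author's original code):

# boxplot for each independent feature plus PM 2.5
for col in combine_data.columns:
    combine_data.boxplot(column=col)
    plt.title(col)
    plt.show()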

# Multivariate analysis
sns.pairplot(combine_data)
Multivariate analysis between Dependent(y-axis) vs independent(x-axis) features

If we observe the above analysis, there is no linear relation between the independent and dependent features, so linear algorithms are unlikely to give good results.

# We also check the correlation between the dependent and independent features
combine_data.corr()
relation = combine_data.corr()
relation_index = relation.index
sns.heatmap(combine_data[relation_index].corr(), annot=True)
correlation heatmap

From the above heatmap, we get an idea of the relationship between the dependent feature (PM 2.5) and the independent features (T, TM, Tm, SLP, H, VV, V, VM).
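To read the same information numerically, one option (a sketch; the target column name is assumed to be 'PM 2.5', as in the combined CSV header) is to sort the correlations of all features against the target:

# correlation of each independent feature with the target, strongest first
print(relation['PM 2.5'].drop('PM 2.5').sort_values(ascending=False))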

(v) Feature Importance :

We have 8 independent features, and we don't know which of them are important for predicting the PM 2.5 value.

To find the feature importances, we use ExtraTreesRegressor (model-based feature selection).

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# train/test split (the same split used later in the model-building section)
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:, :-1], combine_data.iloc[:, -1], test_size=0.3, random_state=0)

# fit an ExtraTreesRegressor and read its feature importances
reg = ExtraTreesRegressor()
reg.fit(X_train, y_train)
reg.feature_importances_

feat_importances = pd.Series(reg.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.show()
Top 5 features to predict PM 2.5 (model-based feature selection)

(vi) Model Building :

We have only 8 features, so for model building I consider all of them and compare the performance of different models.

At deployment time, we train the model on the top 5 features and deploy it on a cloud platform.
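Before tuning each model individually, a quick baseline comparison can be run with cross-validation. This is a minimal sketch (not from the original post) using untuned models and 5-fold negative MSE; the detailed tuning of each model follows in the subsections below.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X = combine_data.iloc[:, :-1]
y = combine_data.iloc[:, -1]

# mean MSE from 5-fold cross validation for a few untuned baselines
models = {
    'KNN': KNeighborsRegressor(),
    'Linear': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(),
    'RandomForest': RandomForestRegressor(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
    print(name, -scores.mean())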

(1) KNN Regressor :

Hyperparameter tuning (K)

The error curve stabilizes around K = 17, so we take K = 17.
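The tuning plot itself is not reproduced here. A minimal sketch of how the sweep can be done (not the author's exact code; the K range is an assumption, and it reuses the train/test split defined earlier):

from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics

# test-set MSE for each value of K
errors = []
k_range = range(1, 30)
for k in k_range:
    knn = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn.fit(X_train, y_train)
    errors.append(metrics.mean_squared_error(y_test, knn.predict(X_test)))

plt.plot(k_range, errors)
plt.xlabel('K')
plt.ylabel('Test MSE')
plt.show()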

from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics

# full feature matrix and target for cross validation
X = combine_data.iloc[:, :-1]
y = combine_data.iloc[:, -1]

# weighted knn
weighted_tuned_reg = KNeighborsRegressor(n_neighbors=17, weights='distance')
weighted_tuned_reg.fit(X_train, y_train)
weighted_tuned_reg.score(X_train, y_train)
# 1.0

# performance of the model on the test dataset
weighted_tuned_reg.score(X_test, y_test)
# 0.5238963653617966

# cross validation
from sklearn.model_selection import cross_val_score
score = cross_val_score(weighted_tuned_reg, X, y, cv=5)

# cross validation performance
score.mean()
# 0.43669012578295

# Model evaluation
prediction = weighted_tuned_reg.predict(X_test)

# Comparing predicted PM 2.5 and labelled PM 2.5
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3247.1281849543693
scatter plot between y_test and y_prediction

(2) Linear, Lasso, and Ridge Regressor :

Linear Regressor

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:,:-1], combine_data.iloc[:,-1], test_size=0.3, random_state=0)

from sklearn.linear_model import LinearRegression

# creating the linear regression model
reg_model = LinearRegression(normalize=True)

# fit the independent variables to the dependent variable
reg_model.fit(X_train, y_train)
reg_model.score(X_train, y_train)
# 0.40996129270540077
reg_model.score(X_test, y_test)
# 0.3894144479464322

# slope coefficients
reg_model.coef_
# array([-8.71256785, -0.78565823, -0.64082652, 2.64932719, -1.44568242, 0.27712587, -1.83610819, -1.01908448])

# intercept
reg_model.intercept_
# -2180.4321527938205

# cross validation
from sklearn.model_selection import cross_val_score
score = cross_val_score(reg_model, combine_data.iloc[:,:-1], combine_data.iloc[:,-1], cv=5)
score.mean()
# 0.3229764710803792

prediction = reg_model.predict(X_test)

# comparing predicted y and labelled y
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 4164.32350260401
Scatter between y_test and y_prediction

Ridge Regressor

# RandomizedSearchCV over the Ridge alpha
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# base model (assumed: the original snippet uses reg_model_1 without showing its definition)
reg_model_1 = Ridge()

parameters = {'alpha': randint(1e-8, 100)}
ridge_reg_1 = RandomizedSearchCV(reg_model_1, parameters, scoring='neg_mean_squared_error', cv=5)
ridge_reg_1.fit(combine_data.iloc[:,:-1], combine_data.iloc[:,-1])
ridge_reg_1.best_score_
# -4265.905962013921
ridge_reg_1.best_params_
# {'alpha': 75}

prediction = ridge_reg_1.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3912.4550098306554

Lasso Regressor

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Initializing the model
reg_model_2 = Lasso()

# hyperparameter range
hyperparameters_range = {'alpha': [1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]}

# Searching for the best hyperparameter
lasso_reg = GridSearchCV(reg_model_2, hyperparameters_range, scoring='neg_mean_squared_error', cv=5)
lasso_reg.fit(combine_data.iloc[:,:-1], combine_data.iloc[:,-1])
lasso_reg.best_params_
# {'alpha': 5}
lasso_reg.best_score_
# -4249.911163771522

prediction = lasso_reg.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3922.586481270448

(3) Decision Tree Regressor :

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from datetime import datetime
# creating the decision tree regression model
reg_decision_model = DecisionTreeRegressor()
# hyperparameter ranges for tuning
parameters = {"splitter": ["best", "random"],
              "max_depth": [1,3,5,7,9,11,12],
              "min_samples_leaf": [1,2,3,4,5,6,7,8,9,10],
              "min_weight_fraction_leaf": [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
              "max_features": ["auto","log2","sqrt",None],
              "max_leaf_nodes": [None,10,20,30,40,50,60,70,80,90]}
# grid search over the ranges, scored with negative MSE
tuning_model = GridSearchCV(reg_decision_model, param_grid=parameters, scoring='neg_mean_squared_error', n_jobs=-1, cv=10, verbose=3)
# helper for measuring how long the hyperparameter tuning takes
def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print(thour, ":", tmin, ':', round(tsec, 2))
X = combine_data.iloc[:,:-1]
y = combine_data.iloc[:,-1]
start_time = timer(None)
tuning_model.fit(X, y)
timer(start_time)
tuning_model.best_params_
# {'max_depth': 9, 'max_features': None, 'max_leaf_nodes': 20,
#  'min_samples_leaf': 8, 'min_weight_fraction_leaf': 0.1, 'splitter': 'random'}
tuning_model.best_score_
# -3621.0007087939457
# Model evaluation
prediction = tuning_model.predict(X_test)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 4589.011220616968

(4) RandomForest Regressor

from sklearn.ensemble import RandomForestRegressor
# Hyperparameter tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Hyperparameter ranges
from scipy.stats import randint
parameters = {'n_estimators': randint(100,1200),
              'max_features': ['auto','sqrt'],
              'max_depth': randint(5,40),
              'min_samples_split': randint(2,30),
              'min_samples_leaf': randint(1,10)}
# Model for tuning
base_learner = RandomForestRegressor()
# Tuning
tuned_model = RandomizedSearchCV(estimator=base_learner, param_distributions=parameters, scoring='neg_mean_squared_error', n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)
tuned_model.fit(X_train, y_train)
tuned_model.best_params_
# {'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1,
#  'min_samples_split': 16, 'n_estimators': 901}
tuned_model.best_score_
# -3425.3665578465598
# Predicting X_test values using the tuned model
prediction = tuned_model.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3308.584324808751
Scatter plot between y_test and y_prediction

(5) Xgboost Regressor :

import xgboost as xgb
# Hyperparameter tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Hyperparameter ranges
from scipy.stats import randint
parameters = {'n_estimators': randint(100,1200),
              'learning_rate': [0.001,0.002,0.003,0.005,0.01,0.04,0.05,0.1,0.2,0.3,0.4,0.5,0.6],
              'max_depth': randint(5,40),
              'subsample': [0.5,0.6,0.7,0.8],
              'min_child_weight': randint(1,10)}
# Model for tuning
base_learner = xgb.XGBRegressor()
# Tuning
tuned_model = RandomizedSearchCV(estimator=base_learner, param_distributions=parameters, scoring='neg_mean_squared_error', n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)
tuned_model.fit(X_train, y_train)
tuned_model.best_params_
# {'learning_rate': 0.005, 'max_depth': 5, 'min_child_weight': 8,
#  'n_estimators': 611, 'subsample': 0.6}
tuned_model.best_score_
# -3656.933662545248
# Predicting X_test values using the tuned model
prediction = tuned_model.predict(X_test)
plt.scatter(y_test, prediction)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3458.1210809592762
Scatter plot between y_test and y_prediction

(vii) Deployment :

Comparing MSE of all models

KNN: 3247.1281849543693

Linear Regression: 4164.32350260401

Ridge Regression: 3912.4550098306554

Lasso Regression: 3922.586481270448

Decision Tree Regressor: 4589.011220616968

Random Forest Regressor: 3308.584324808751

Xgboost Regressor: 3458.1210809592762
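The same comparison in code (a small sketch; the values are copied from the runs above and rounded):

# test MSE of each tuned model
mse = {
    'KNN': 3247.13,
    'Linear Regression': 4164.32,
    'Ridge Regression': 3912.46,
    'Lasso Regression': 3922.59,
    'Decision Tree': 4589.01,
    'Random Forest': 3308.58,
    'Xgboost': 3458.12,
}
print(min(mse, key=mse.get))  # KNN has the lowest test MSE, but see the note below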

Although KNN has the lowest test MSE, it fits the training data perfectly (train score 1.0) while scoring much lower on the test set, so the Random Forest is the more generalized model. I therefore deploy the Random Forest Regressor with the top 5 features using Streamlit.

Random Forest Regressor for Top 5 features

# Taking the top 5 features for deployment
# (the top 5 features come from the ExtraTreesRegressor feature importances)
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:,[0,1,2,3,6]], combine_data.iloc[:,-1], test_size=0.3, random_state=0)
# initializing the model with the tuned hyperparameters
random_forest_reg1 = RandomForestRegressor(n_estimators=901, max_depth=5, min_samples_split=16, min_samples_leaf=1, max_features='sqrt', n_jobs=-1)
# fitting the model
random_forest_reg1.fit(X_train, y_train)
random_forest_reg1.score(X_train, y_train)
# 0.650728501356922
random_forest_reg1.score(X_test, y_test)
# 0.5218627277628385
prediction = random_forest_reg1.predict(X_test)
print('MSE:', metrics.mean_squared_error(y_test, prediction))
# MSE: 3260.998026487038

Dumping model into pickle

import pickle
pickle_out = open("Random_forest_regressor.pkl", "wb")
pickle.dump(random_forest_reg1, pickle_out)
pickle_out.close()

Streamlit framework

import pickle
import streamlit as st

pickle_in = open("Random_forest_regressor.pkl", "rb")
random_forest_regressor = pickle.load(pickle_in)

def welcome():
    return "welcome all"

def predict_AQI(Average_Temperature, Maximum_Temperature, Minimum_Temperature, Atm_pressure_at_sea_level, Average_wind_speed):
    # predict PM 2.5 from the five deployed features
    prediction = random_forest_regressor.predict([[Average_Temperature, Maximum_Temperature, Minimum_Temperature, Atm_pressure_at_sea_level, Average_wind_speed]])
    print(prediction)
    return prediction

def main():
    st.title("Hyderabad AQI prediction")
    html_temp = """
    <div style="background-color:green;padding:20px">
    <h2 style="color:white;text-align:center;">AQI prediction ML App </h2>
    </div>
    """
    st.markdown(html_temp, unsafe_allow_html=True)
    Average_Temperature = st.text_input("Average_Temperature", "Type Here")
    Maximum_Temperature = st.text_input("Maximum_Temperature", "Type Here")
    Minimum_Temperature = st.text_input("Minimum_Temperature", "Type Here")
    Atm_pressure_at_sea_level = st.text_input("Atm_pressure_at_sea_level", "Type Here")
    Average_wind_speed = st.text_input("Average_wind_speed", "Type Here")
    result = ""
    if st.button("Predict"):
        result = predict_AQI(Average_Temperature, Maximum_Temperature, Minimum_Temperature, Atm_pressure_at_sea_level, Average_wind_speed)
    st.success('The output is {}'.format(result))
    if st.button("About"):
        st.text("Lets Learn")
        st.text("Built with Streamlit")

if __name__ == '__main__':
    main()

Using the above Streamlit app, we can deploy the model to any cloud platform (locally, assuming the script is saved as app.py, it can be started with streamlit run app.py).

(viii) Conclusion:

Thank you for your interest in the blog. Please leave comments, feedback, and suggestions if you have any.

Github: https://github.com/jani-excergy/Complete_Data_Science_Life_Cycle_Projects

(ix) References :

Wikipedia: https://simple.wikipedia.org/wiki/Air

Krish Naik's YouTube channel
