Predicting Air Quality Index using Python

Himani Bansal
Published in DataFlair
4 min read · Feb 27, 2024

In today’s rapidly urbanizing world, monitoring and predicting air quality has become imperative for public health. The Air Quality Index (AQI) serves as a crucial metric and offers insights into the levels of various pollutants in the atmosphere.

This project uses Python and machine learning to predict AQI values from environmental parameters. Several regression algorithms could be applied here (Linear Regression, Random Forest Regressor, etc.); the goal is a model that takes environmental measurements as input and predicts the corresponding AQI. In this Python project we use the Random Forest Regressor. Let’s build it.


Random Forest Regressor

The Random Forest Regressor is an ensemble machine learning algorithm that trains many decision trees, each on a bootstrap sample of the training data, and predicts a numerical outcome by averaging the trees’ predictions. Averaging over many trees reduces the variance of any single tree, which usually improves accuracy on unseen data.

The algorithm handles non-linear relationships and interactions between features without requiring feature scaling, which makes it a practical choice for regression tasks such as AQI prediction.
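To make the averaging idea concrete, here is a minimal sketch on synthetic data (generated with scikit-learn's make_regression, not the air quality dataset). It checks that the forest's prediction is the mean of the predictions of its individual trees. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 200 samples with 6 numeric features.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

sample = X[:5]

# Each fitted decision tree is exposed through forest.estimators_.
per_tree = np.stack([tree.predict(sample) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(sample)

matches = bool(np.allclose(manual_average, forest_prediction))
print("Forest prediction equals the mean of its trees:", matches)
```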

Dataset

The air quality dataset contains the following environmental parameters: temperature, humidity, wind speed, PM2.5 and PM10 particulate matter concentrations, ozone (O3) level, and the Air Quality Index (AQI), along with a Date column. Analyzing these variables provides insight into the interactions that influence air quality, which supports pollution assessment and public health initiatives.

Prerequisites For Predicting Air Quality Index using Python

Familiarity with Python and basic machine learning concepts is assumed. The system requirements are:

  • Python 3.7 or above
  • Any Python editor (VS Code, PyCharm, Jupyter, etc.)

Installation

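Open the Windows Command Prompt (cmd) as administrator.

  1. To install the scikit-learn library, run this command:

pip install scikit-learn

  2. To install the joblib library, run this command:

pip install joblib

  3. The code below also imports pandas, matplotlib, and seaborn. If they are not already installed, run this command:

pip install pandas matplotlib seaborn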
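Before moving on, it can help to confirm that the libraries are importable from the Python environment you will actually run. The sketch below uses only the standard library; the helper name find_missing and the module list are illustrative choices based on the imports used later in this article.

```python
import importlib.util

# Import names used later in this project (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["sklearn", "joblib", "pandas", "matplotlib", "seaborn"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If anything is reported as missing, install it with pip and run the check again.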

Let’s Implement It

  1. Import all the libraries.

# Data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Model, data splitting, and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression  # imported here but not used in the steps below
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Saving and loading the trained model
import joblib

2. Load the Dataset.

data = pd.read_csv('air.csv')

3. Display the first 5 rows of the dataset.

data.head()

Output of this step

[Image: the first five rows of the dataset]
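The later steps refer to columns by exact name, so a misspelled header in the CSV would only surface as a KeyError further down. As a hedged sketch, the helper below (missing_columns is a hypothetical name) compares a header against the column names used in this walkthrough. The example header with the misspelled wind speed column is made up for illustration.

```python
# Column names as they are used elsewhere in this walkthrough.
EXPECTED_COLUMNS = [
    "Date", "Temperature(C)", "Humidity(%)", "Wind_Speed(km/h)",
    "PM2.5(µg/m³)", "PM10(µg/m³)", "O3(ppm)", "AQI",
]


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]


# Hypothetical header in which the wind speed column is misspelled.
example_header = [
    "Date", "Temperature(C)", "Humidity(%)", "Wind Speed(km/h)",
    "PM2.5(µg/m³)", "PM10(µg/m³)", "O3(ppm)", "AQI",
]
print(missing_columns(example_header))  # prints ['Wind_Speed(km/h)']
```

With the loaded DataFrame, the same check would be missing_columns(data.columns), which should return an empty list.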

4. Visualize the distribution of each feature in the dataset using seaborn and matplotlib.

custom_palette = sns.color_palette("husl", 6)
sns.set(style="whitegrid")
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 12))
fig.suptitle('Air Quality Visualization', fontsize=16)
sns.histplot(data['Temperature(C)'], kde=True, ax=axes[0, 0], color=custom_palette[0])
axes[0, 0].set_title('Temperature Distribution')
sns.histplot(data['Humidity(%)'], kde=True, ax=axes[0, 1], color=custom_palette[1])
axes[0, 1].set_title('Humidity Distribution')
sns.histplot(data['Wind_Speed(km/h)'], kde=True, ax=axes[1, 0], color=custom_palette[2])
axes[1, 0].set_title('Wind Speed Distribution')
sns.histplot(data['PM2.5(µg/m³)'], kde=True, ax=axes[1, 1], color=custom_palette[3])
axes[1, 1].set_title('PM2.5 Distribution')
sns.histplot(data['PM10(µg/m³)'], kde=True, ax=axes[2, 0], color=custom_palette[4])
axes[2, 0].set_title('PM10 Distribution')
sns.histplot(data['O3(ppm)'], kde=True, ax=axes[2, 1], color=custom_palette[5])
axes[2, 1].set_title('O3 Distribution')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Output of this step

[Image: histograms of the six features]

5. Drop the Date column, which is not needed for prediction.

data = data.drop('Date', axis=1)

6. Assign all the columns except AQI to the independent variable X (the features).

X = data.drop('AQI', axis=1)

7. Assign the AQI column to the target variable y.

y = data['AQI']

8. Split the dataset into training (80%) and testing (20%) subsets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
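To see what test_size=0.2 and random_state=42 do in practice, here is a small sketch on stand-in arrays (an assumption: 100 rows and 6 feature columns, not the real dataset). It shows the resulting 80/20 row counts and that repeating the call with the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 100 rows, 6 feature columns.
X_demo = np.arange(600).reshape(100, 6)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # prints (80, 6) (20, 6)

# Repeating the call with the same random_state yields the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = bool(np.array_equal(X_te, X_te2))
print("Same split with same seed:", same_split)
```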

9. Create a Random Forest Regressor with 100 trees, leaving all other parameters at their defaults, and fit it to the training data.

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
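If you are curious which settings the model ends up with, get_params() lists every parameter, including those left at their defaults. This is a small sketch that only constructs the estimator; it does not need the dataset. The name demo_model is illustrative.

```python
from sklearn.ensemble import RandomForestRegressor

demo_model = RandomForestRegressor(n_estimators=100, random_state=42)
params = demo_model.get_params()

# The two values set explicitly in this article.
print("n_estimators:", params["n_estimators"])
print("random_state:", params["random_state"])

# A default worth knowing: max_depth is None, so trees grow until
# their leaves are pure or too small to split further.
print("max_depth:", params["max_depth"])
```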

10. Predict the target variable for the test data using the trained Random Forest Regressor.

y_pred = model.predict(X_test)

11. Calculate the error metrics and overall performance of the model.

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

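# For a regressor, model.score() returns the R-squared value, not a classification accuracy,
# so this figure matches the R-squared score printed above.
score = model.score(X_test, y_test)
print(f"Model Score (R-squared): {score:.2%}")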

Output of this step

[Image: error metrics and model performance output]
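For readers who want to see what these three metrics compute, the sketch below implements them in plain Python on four made-up values (not results from the air quality model). The function names are illustrative; in the project itself, the scikit-learn functions shown above do this work.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
y_true_demo = [3.0, 5.0, 2.0, 7.0]
y_pred_demo = [2.5, 5.0, 4.0, 8.0]

mae_demo = mean_absolute_error_manual(y_true_demo, y_pred_demo)
mse_demo = mean_squared_error_manual(y_true_demo, y_pred_demo)
r2_demo = r2_score_manual(y_true_demo, y_pred_demo)

print(f"MAE: {mae_demo:.4f}")  # prints MAE: 0.8750
print(f"MSE: {mse_demo:.4f}")  # prints MSE: 1.3125
print(f"R2:  {r2_demo:.4f}")   # prints R2:  0.6441
```

An R-squared of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the actual values.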

12. Once the model is trained, save it to disk.

model_filename = "air_quality_model_rf.joblib"
joblib.dump(model, model_filename)
print(f"Random Forest Model saved as {model_filename}")
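A quick way to gain confidence in the saved file is a round trip: dump the model, load it back, and check that both copies give identical predictions. The sketch below does this with a small model trained on synthetic data and a temporary directory, so it does not touch air_quality_model_rf.joblib. The file name demo_model.joblib is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption), not the air quality dataset.
X_demo, y_demo = make_regression(n_samples=100, n_features=6, random_state=0)
original = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(original, path)    # write the fitted model to disk
    restored = joblib.load(path)   # read it back into a new object

round_trip_ok = bool(
    np.array_equal(original.predict(X_demo), restored.predict(X_demo))
)
print("Restored model reproduces the original predictions:", round_trip_ok)
```

Saved models are generally best loaded with the same scikit-learn version that created them.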

13. Load the saved model, predict the Air Quality Index for new input data, and print the result.

import pandas as pd
import joblib
model_filename = "air_quality_model_rf.joblib"
model = joblib.load(model_filename)
new_input = pd.DataFrame({
    'Temperature(C)': [21.24],
    'Humidity(%)': [63.29],
    'Wind_Speed(km/h)': [21.70],
    'PM2.5(µg/m³)': [8.55],
    'PM10(µg/m³)': [27.05],
    'O3(ppm)': [0.0731]
})
predicted_aqi = model.predict(new_input)
print(f"Predicted AQI: {predicted_aqi[0]:.2f}")

Output

[Image: predicted AQI output from the loaded model]

Conclusion

In conclusion, we built a Random Forest Regressor in Python to predict the Air Quality Index (AQI) from environmental measurements, and its evaluation on the held-out test set indicated promising accuracy.

By combining an ensemble of decision trees, the model can capture non-linear relationships within the data. We trained it on 80% of the dataset, evaluated it on the remaining 20% using MAE, MSE, and R-squared, and saved it for reuse on new inputs. This approach can support air quality monitoring and a more informed, proactive response to environmental challenges.


Himani Bansal
DataFlair

Doing my Best to Explain Data Science (Data Scientist, technology freak & Blogger)