# Creating a Model for Weather Forecasting Using Linear Regression

Linear regression is a supervised machine learning algorithm that performs a regression task: it models a target value as a function of one or more independent variables. It is mostly used for finding relationships between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables, and in the number of independent variables they use.

This project creates a model that forecasts weather temperature from a few features, mainly humidity, Ppm, and the air quality index (AQI, or PM2.5). We searched many datasets for one that contains all of these features but found none. What we found were two different datasets containing the following features:

**Data-set-1**

This data set contains features such as weather temperature and humidity, with AQI (PM2.5) as the target variable. It is very small, with just **150** entries.

**Data-set-2**

This data set contains almost 24 features, including Ppm and humidity, with weather temperature as the target variable.

**Solution**

We could not record real-time data because of the pandemic, and we had no single data set that contains all the required features. One possible solution was to combine the two data sets into a final data set with all the required features. To do so, we need two separate models. The first is trained on data-set-1 and predicts the AQI (PM2.5) value. Its predictions are embedded into data-set-2 to produce the final data set, on which the second model is trained.

**Program Structure**

We created a linear regression model and trained it on data-set-1 to predict PM2.5 values. Before that, we plotted a heat map to check the correlation between the features and the target variable, and found that only temperature and humidity showed some correlation with PM2.5 (the target variable). So the model was trained using these two features. The following shows the heat map.
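A correlation check of this kind can be sketched as follows. This is a minimal illustration on synthetic data, not the original data-set-1: the column names (`temperature`, `humidity`, `wind_speed`, `pm25`), the generated values, and the 0.3 threshold are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data-set-1 (150 rows); all values are made up.
rng = np.random.default_rng(0)
n = 150
temperature = rng.normal(25, 5, n)
humidity = rng.normal(60, 10, n)
wind_speed = rng.normal(10, 2, n)  # deliberately unrelated to the target
pm25 = 2.0 * temperature - 0.8 * humidity + rng.normal(0, 3, n)

df = pd.DataFrame({
    "temperature": temperature,
    "humidity": humidity,
    "wind_speed": wind_speed,
    "pm25": pm25,
})

# Pearson correlation of every feature with the target column.
corr = df.corr()
target_corr = corr["pm25"].drop("pm25")

# Keep the features whose absolute correlation exceeds an assumed threshold.
THRESHOLD = 0.3
selected = target_corr[target_corr.abs() > THRESHOLD].index.tolist()

print(target_corr)
print("Selected features:", selected)

# The matrix `corr` is what a heat map visualizes,
# for example with seaborn.heatmap(corr, annot=True).
```

Selecting by absolute value matters here because a strongly negative correlation is as useful to a linear model as a strongly positive one.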

We saved this model as a pickle file to use it later. `pickle` is a Python module for serializing objects so they can be stored on disk. Because the train/test split is random, the accuracy varies a little on every run, so we saved the model from the run with the highest accuracy to use it again.
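Saving the best model with `pickle` might look like the sketch below. The data is synthetic, and the file name `pm25_model.pickle` and the number of repetitions (20) are hypothetical choices for illustration, not values taken from the project.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two features and a target (values are made up).
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=150)

# Hypothetical file name, placed in the system temp directory.
MODEL_PATH = os.path.join(tempfile.gettempdir(), "pm25_model.pickle")

best_score = -np.inf
# Each random split gives a slightly different score; keep the best model seen.
for _ in range(20):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = linear_model.LinearRegression().fit(X_train, y_train)
    score = model.score(X_test, y_test)
    if score > best_score:
        best_score = score
        with open(MODEL_PATH, "wb") as f:
            pickle.dump(model, f)

# Later: load the saved model back and use it for prediction.
with open(MODEL_PATH, "rb") as f:
    loaded = pickle.load(f)

print(best_score)
print(loaded.predict(X[:3]))
```

Note that the score of the saved model is the best over several random splits, so it is an optimistic estimate of how the model performs on unseen data.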

Data-set-2 then needed to be extended with PM2.5 values. We took the temperature and humidity columns from data-set-2 and passed them to our trained linear regression model to get PM2.5 values. In this way we created a final data set that has all the features, including Ppm, humidity, and PM2.5.

We then trained another linear regression model on this final data set with temperature as the target variable. As before, we plotted heat maps to check the correlation between the features and the target variable and dropped the unnecessary features.
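The embedding step can be sketched like this. Both data frames are synthetic stand-ins, and the column names (`temperature`, `humidity`, `ppm`, `pm25`) are assumptions for the example rather than the real column names of either data set.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

rng = np.random.default_rng(2)

# Synthetic stand-in for data-set-1: temperature and humidity with a PM2.5 target.
ds1 = pd.DataFrame({
    "temperature": rng.normal(25, 5, 150),
    "humidity": rng.normal(60, 10, 150),
})
ds1["pm25"] = 2.0 * ds1["temperature"] - 0.8 * ds1["humidity"] + rng.normal(0, 3, 150)

# Synthetic stand-in for data-set-2: has temperature, humidity and Ppm, but no PM2.5.
ds2 = pd.DataFrame({
    "temperature": rng.normal(22, 4, 500),
    "humidity": rng.normal(55, 12, 500),
    "ppm": rng.normal(400, 30, 500),
})

feature_cols = ["temperature", "humidity"]

# Model 1: learn PM2.5 from temperature and humidity on data-set-1.
pm25_model = linear_model.LinearRegression()
pm25_model.fit(ds1[feature_cols].to_numpy(), ds1["pm25"].to_numpy())

# Add a PM2.5 column to data-set-2 using model 1's predictions.
final = ds2.copy()
final["pm25"] = pm25_model.predict(ds2[feature_cols].to_numpy())

print(final.head())
```

One property of this approach is worth keeping in mind: the new PM2.5 column is an exact linear function of temperature and humidity, so a second model that uses PM2.5 and humidity as features can recover temperature from them. This can raise the score of the second model without reflecting real predictive power.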

The trained model gave 93% accuracy (the R² score returned by `linear.score`), which is quite good. However, because this is sample data rather than real data, the model might not predict very accurately on real-time data. To overcome this, we must retrain the model on real-time data, and then it will be good to go.
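For a scikit-learn regressor, `score` returns the coefficient of determination R², not a classification-style accuracy. The sketch below, on synthetic data with made-up coefficients, checks that `score` matches both `r2_score` and the manual formula, and also reports the mean absolute error as an additional measure in the units of the target.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: three features with assumed coefficients plus noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
linear = linear_model.LinearRegression().fit(X_train, y_train)
pred = linear.predict(X_test)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("score():   ", linear.score(X_test, y_test))
print("r2_score:  ", r2_score(y_test, pred))
print("manual R^2:", r2_manual)
print("MAE:       ", mean_absolute_error(y_test, pred))
```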

Here is the code snippet of the training method used here.

```python
import numpy as np
import sklearn.model_selection
from sklearn import linear_model

# `data` is assumed to be the final data set, already loaded as a pandas DataFrame.

# Columns excluded from the features (this includes the target column).
lis_drop = ['Date2', 'Time3', 'Weather_Temperature6', 'Exterior_Entalpic_120',
            'Exterior_Entalpic_221', 'Exterior_Entalpic_turbo22', 'Day_Of_Week\n',
            'Lighting_Comedor_Sensor11', 'Lighting_Habitacion_Sensor12',
            'Precipitacion13', 'Meteo_Exterior_Crepusculo14']

# Every remaining column is used as a feature.
features = []
for i in data:
    # print(i)
    if i not in lis_drop:
        features.append(i)

print(len(features))
print(features)

x = np.array(data[features])
y = np.array(data['Weather_Temperature6'])
print(x.shape, y.shape)

# x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)
# print(x_train.shape, y_train.shape)
# print(x_test.shape, y_test.shape)

# Re-split and retrain until int(acc * 100) exceeds 94 (a score of at least 0.95).
while True:
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    acc = linear.score(x_test, y_test)
    if int(acc * 100) > 94:
        # print(acc * 100)
        break

# Compare the predictions with the actual values on the held-out split.
predictions = linear.predict(x_test)
for i in range(len(predictions)):
    print('PREDICTED WEATHER : ' + str(predictions[i]), '\t', 'ACTUAL WEATHER : ' + str(y_test[i]))

print(acc)
```
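The training loop above keeps re-splitting the data until one split scores high enough. It never terminates if no split reaches the threshold, and the score it ends with is the best of many tries. One common alternative, sketched here on synthetic data and not part of the original project, is k-fold cross-validation with `cross_val_score`, which reports a score for every fold so the spread is visible.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Synthetic data with assumed coefficients; the last feature is irrelevant.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# 5-fold cross-validation: every row is used for testing exactly once.
scores = cross_val_score(
    linear_model.LinearRegression(), X, y, cv=5, scoring="r2"
)

print("R^2 per fold:", scores)
print("mean R^2:    ", scores.mean())
```

The mean of the fold scores is usually a steadier estimate than the score of a single favourable split.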

# Conclusion

With these code snippets, we can train on the data and get an approximately 93% accurate model for weather prediction. To get better accuracy, the algorithm could be improved with neural networks, for example a Keras LSTM model, which may work better than linear regression.