Accident Trends in France: A prediction based on historical data

Hira Sadiq
6 min read · Sep 8, 2023


So far, data cleaning and exploratory data analysis (EDA) have been covered in blogs by Saba Firdous and Fareeha Saleem, from which we gathered some interesting insights about the data. Now that we have the necessary analysis and insights, it is time to present our null hypothesis and its alternative hypothesis.

Null and Alternative Hypothesis Testing

In statistical hypothesis testing, the null hypothesis (H0) and the alternative hypothesis (H1 or Ha) are two fundamental concepts that help researchers draw conclusions based on data. These hypotheses are used to assess whether there is a significant effect or relationship between variables in a statistical study. Here’s an explanation of each:

Null Hypothesis (H0):

  • The null hypothesis represents the default or status quo assumption in a statistical test.
  • It states that there is no significant effect, relationship, or difference between groups or variables being studied.

Alternative Hypothesis (H1):

  • The alternative hypothesis represents the opposite of the null hypothesis.
  • It asserts that there is a significant effect, relationship, or difference between groups or variables.

In our study, our null hypothesis is that the number of accidents has remained constant, or that there has not been a significant change in the number over the years. In regression terms, this corresponds to the slope coefficient on "Year" being zero. This foundational assumption serves as the baseline against which we test any observed changes or variations in the accident data. By positing that there has been no significant alteration in accident rates, we can rigorously assess the validity of this hypothesis through statistical analysis and empirical evidence.

In order to assess the relationship between the variables “Year of Accident” and “Number of Accidents,” we employed a series of steps. The code is given below:

import pandas as pd
import numpy as np
import statsmodels.api as sm

Years = [2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005]
Accidents_in_each_year = [53414, 58569, 60322, 58482, 59448, 65461, 69529, 73018, 77771, 80004, 73750, 74164]

# Create a DataFrame
data = {'Year': Years, 'Accidents': Accidents_in_each_year}
df = pd.DataFrame(data)

# Add a constant term to the independent variable (Year)
X = sm.add_constant(df['Year'])

# Fit a linear regression model
model = sm.OLS(df['Accidents'], X).fit()

# Print the regression summary
print(model.summary())

Results

The output of the above code is shown in the image below.

Fig 1. OLS regression result

The provided regression results are for a linear regression analysis between the “Year” and the “Accidents” variables. Here’s an interpretation of the key statistics:

  1. R-squared (R²): R-squared is a measure of the goodness of fit of the model. In this case, it’s 0.846, which means that approximately 84.6% of the variance in the number of accidents can be explained by the linear relationship with the year. A higher R-squared indicates that the model fits the data well.
  2. Adjusted R-squared (Adj. R²): Adjusted R-squared adjusts the R-squared value for the number of predictors in the model. It’s 0.830 in this case, and it’s slightly lower than the R-squared because it penalizes models with too many predictors. It’s a more conservative measure of model fit.
  3. F-statistic: The F-statistic tests whether there is a statistically significant relationship between the independent variable (Year) and the dependent variable (Accidents). A high F-statistic (54.89 in this case) with a low p-value (2.29e-05) indicates that the overall model is statistically significant.
  4. Coefficients:
  • const (Intercept): The intercept is 4.593e+06 (about 4,593,000). Note that this is the model's extrapolated value at Year = 0, not an estimate for any year in the data, so it is a mathematical artifact of the fit rather than a meaningful accident count on its own.
  • Year: The coefficient for the “Year” variable is -2251.3147. This represents the estimated change in the number of accidents for each one-year increase. In this case, it’s negative, indicating that the model suggests a decreasing trend of approximately 2,251 accidents per year.

  5. P-values (P>|t|): The p-value associated with the “Year” coefficient is very low (close to 0.000). This suggests that the “Year” variable is highly statistically significant, meaning there is strong evidence of a relationship between the year and the number of accidents.
  6. Omnibus, Prob(Omnibus), Jarque-Bera (JB), Skew, Kurtosis: These statistics relate to the normality of the residuals (the differences between the predicted and observed values). In general, normally distributed residuals are desirable for linear regression. In this case, the p-values for Omnibus (0.621) and Jarque-Bera (0.720) are high, suggesting that the residuals may be approximately normally distributed.
  7. Durbin-Watson: The Durbin-Watson statistic checks for autocorrelation in the residuals. A value close to 2 suggests no significant autocorrelation. Here it is approximately 0.970, which is noticeably below 2 and suggests some positive autocorrelation in the residuals, a common caveat when fitting a trend line to time-series data.

Overall, the regression model suggests that there is a statistically significant decreasing trend in the number of accidents over the years. The negative coefficient for “Year” suggests that for each additional year, the number of accidents decreases by approximately 2,251 accidents. However, as with any statistical analysis, it’s important to consider the practical significance and the domain-specific context when interpreting the results.
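As a quick sanity check on the summary table, the headline numbers above can be reproduced with a closed-form least-squares fit in NumPy. This is a sketch with our own variable names; for simple regression with one predictor, the R-squared equals the squared correlation between the two variables:

```python
import numpy as np

years = np.array([2016, 2015, 2014, 2013, 2012, 2011,
                  2010, 2009, 2008, 2007, 2006, 2005])
accidents = np.array([53414, 58569, 60322, 58482, 59448, 65461,
                      69529, 73018, 77771, 80004, 73750, 74164])

# Ordinary least squares via polyfit; returns [slope, intercept]
slope, intercept = np.polyfit(years, accidents, 1)

# For one predictor, R-squared is the squared Pearson correlation
r_squared = np.corrcoef(years, accidents)[0, 1] ** 2

print(f"slope     = {slope:.4f}")   # ~ -2251.3147
print(f"intercept = {intercept:.3e}")  # ~ 4.593e+06
print(f"R-squared = {r_squared:.3f}")  # ~ 0.846
```

The slope, intercept, and R-squared agree with the statsmodels summary, which is reassuring since `np.polyfit` and `sm.OLS` solve the same least-squares problem.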

Model Fitting

In our approach to understand and predict road accidents in France, we applied a linear regression model to data, aiming to forecast the number of accidents for the upcoming years. The model yielded promising results, providing us with valuable insights into future accident trends. However, as responsible data scientists, we decided to put our model to the test by comparing its predictions with real-world data. What we discovered was both exciting and reassuring.

import pandas as pd

# Data
Years = [2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005]
Accidents_in_each_year = [53414, 58569, 60322, 58482, 59448, 65461, 69529, 73018, 77771, 80004, 73750, 74164]

# Create a DataFrame with the future years we want to predict for
future_years = [2017, 2018, 2019, 2020]
future_data = {'Year': future_years}
future_df = pd.DataFrame(future_data)

# Coefficients taken (rounded) from the regression summary above
intercept = 4.593e+06
year_coefficient = -2251.3147

# Calculate predictions for the future years
future_df['Predicted_Accidents'] = (intercept + future_df['Year'] * year_coefficient).astype(int)

# Print the predictions
print(future_df)
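One caveat worth noting: the intercept above is copied from the printed summary, which reports it to only four significant figures (4.593e+06). Because the prediction subtracts two numbers of similar magnitude, that rounding shifts each forecast by a few hundred accidents. A sketch that refits on the same data and predicts with full-precision coefficients:

```python
import numpy as np

years = [2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005]
accidents = [53414, 58569, 60322, 58482, 59448, 65461,
             69529, 73018, 77771, 80004, 73750, 74164]

# Refit so predictions use full-precision coefficients,
# not values rounded from the printed summary
coefs = np.polyfit(years, accidents, 1)

for year in [2017, 2018, 2019, 2020]:
    print(year, round(np.polyval(coefs, year)))
```

With full precision the 2020 forecast comes out near 45,600 rather than 45,344; the figures quoted below inherit the rounding of the intercept, so treat the difference as rounding noise rather than a different model.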

Predictive Model and Results

Our predictive model, based on accident data from previous years, estimated the number of accidents expected to occur in France in 2017, 2018, 2019, and 2020. The model's predictions for these years were 52,098, 49,846, 47,595, and 45,344 accidents, respectively.

Fig 2. Prediction of accidents

Reality Check

When we cross-referenced our model's prediction with actual accident data, looking for a firm figure to check against, we found that 45,121 accidents were reported in France in 2020. Our model's forecast of 45,344 is within about 0.5% of this figure, an accuracy rate of approximately 99%.
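The accuracy figure can be checked directly from the two numbers quoted above (a predicted 45,344 against the reported 45,121 for 2020), using relative error as the metric:

```python
predicted = 45344  # model forecast for 2020
actual = 45121     # reported accidents in France in 2020

# Relative error of the forecast against the reported figure
relative_error = abs(predicted - actual) / actual

print(f"relative error: {relative_error:.2%}")      # about 0.49%
print(f"accuracy:       {1 - relative_error:.2%}")  # about 99.51%
```

Note this checks a single year against a single point prediction; it says nothing about how the model would perform across other years or under changing conditions (2020 in particular was an unusual year for traffic).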

Conclusion

This impressive accuracy rate demonstrates the power and reliability of our predictive model. By leveraging historical data and applying statistical techniques, we were able to make highly accurate predictions about road accidents in France. This level of precision can be invaluable for policymakers, law enforcement, and organizations working to enhance road safety.

However, it’s important to note that predictive models are not infallible. While our model performed exceptionally well in this instance, it’s crucial to continually validate and refine such models as new data becomes available. Our commitment to accuracy and accountability drives us to continuously improve our methods and enhance road safety for everyone.

In summary, our successful prediction for 2020 reaffirms the potential of data-driven insights in accident prevention and management. It is a testament to the efficacy of our approach and the importance of data science in making our roads safer for all.


Hira Sadiq

Data Scientist, Data Analyst, Machine Learning, Deep Learning, Artificial Intelligence