Water Quality Prediction using ML: A Simple Guide with Scikit-Learn and Decision Trees

Simran Kaushik
9 min read · Feb 6, 2024

Water quality prediction using machine learning (ML) has emerged as a powerful and innovative approach to address the challenges associated with monitoring and managing water resources. ML algorithms are employed to analyze vast datasets containing diverse water quality parameters, such as chemical concentrations, temperature, and turbidity. By leveraging historical data, these models can learn complex patterns and relationships, enabling them to make accurate predictions about future water quality conditions. The integration of sensor data, satellite imagery, and environmental variables enhances the predictive capabilities of ML models, allowing for real-time monitoring and timely decision-making. This predictive modeling not only aids in identifying potential contamination events but also supports proactive measures to safeguard water supplies. The application of ML in water quality prediction represents a transformative step towards sustainable water management, offering a more efficient and cost-effective way to ensure the safety and integrity of our precious water resources.

Decision Trees

Decision trees are a popular and intuitive machine learning algorithm used for both classification and regression tasks. The algorithm works by recursively partitioning the dataset into subsets based on the most influential features, creating a tree-like structure. Each internal node of the tree represents a decision based on a feature, and each leaf node represents the predicted outcome or target variable. Decision trees are advantageous due to their interpretability, as the resulting model is easily understandable and can be visualized.

The construction of a decision tree involves selecting the best features to split the dataset at each node, typically using criteria such as Gini impurity or information gain. Decision trees are robust to outliers and handle both numerical and categorical data efficiently. However, they are prone to overfitting, especially with deep trees. To mitigate this, techniques like pruning or using ensemble methods like Random Forests can be employed. Decision trees find applications in various fields, including finance, healthcare, and natural language processing, owing to their simplicity and effectiveness in capturing complex decision-making processes.

Mathematics of Decision Trees

The mathematics behind decision trees involves the principles of information theory and statistical measures to determine the optimal splits at each node of the tree. The key concepts include entropy, information gain (or Gini impurity), and recursive partitioning.

  1. Entropy: Entropy is a measure of uncertainty or disorder in a dataset. In the context of decision trees, it is used to quantify the impurity of a node. A node with low entropy means the data is pure (all instances belong to the same class), while high entropy indicates a mix of classes. The goal is to minimize entropy by selecting features that effectively split the data into subsets of higher purity.
  2. Information Gain (or Gini Impurity): Information gain is a metric used to evaluate the effectiveness of a feature in reducing uncertainty (entropy) within a node. The decision tree algorithm selects the feature that maximizes information gain, as it contributes the most to refining the decision boundaries. Gini impurity is an alternative measure often used, and the split is chosen to minimize impurity.
  3. Recursive Partitioning: The construction of a decision tree is a recursive process. At each node, the algorithm evaluates different features and selects the one that maximizes information gain or minimizes impurity. This process is repeated for each subset, creating a tree structure until a stopping criterion is met, such as a maximum depth or a node containing samples from a single class.
  4. Pruning: To prevent overfitting, decision trees are often pruned after construction. Pruning involves removing branches that do not contribute significantly to improving predictive accuracy on unseen data. This is achieved by assessing the impact of removing a subtree and considering the trade-off between model complexity and performance.

The mathematical foundation of decision trees is deeply rooted in these information theory and statistical concepts. The goal is to build a tree that optimally organizes the data, making predictions based on the most relevant features while avoiding overfitting to noise in the training data.
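To make these measures concrete, here is a minimal sketch of how entropy, Gini impurity, and information gain can be computed for a binary split. It uses NumPy (not otherwise required in this project), and the function names are purely illustrative rather than part of scikit-learn.

import numpy as np

def entropy(labels):
    # Shannon entropy: -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity: 1 - sum(p^2) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Entropy of the parent node minus the size-weighted entropy of its children
    n = len(parent)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

# A perfectly mixed node has entropy 1; a perfect split recovers all of it
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1])
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0

A decision tree simply evaluates candidate splits with a criterion like this at every node and keeps the split with the highest gain (or the lowest impurity).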

Practical Example in Daily Life

In our daily lives, a practical example of utilizing decision trees is evident in the choice of transportation modes. When deciding whether to drive, take public transit, or walk to work, individuals consider factors such as weather conditions, distance to work, parking availability, and personal preferences. Imagine a decision tree where the initial split is based on weather conditions — rainy, sunny, or snowy. Subsequent splits occur depending on additional factors, such as parking availability or distance to work. For instance, on a rainy day, the decision tree may guide individuals to consider parking availability, leading them to either drive or opt for public transit. This structured decision-making process simplifies complex choices, enabling individuals to navigate their daily commute efficiently based on the prevailing circumstances.

Advantages and Disadvantages of Decision Trees

Advantages

  • Interpretability and Simplicity: Decision trees offer a transparent and easy-to-understand representation of decision-making processes. The visual nature of the tree allows users to grasp complex relationships and outcomes intuitively.
  • Versatility with Data Types: Decision trees can handle both numerical and categorical data, making them versatile for a wide range of applications. This flexibility allows them to be applied to diverse datasets without extensive preprocessing.
  • Automatic Variable Selection: The algorithm automatically selects the most relevant features and their thresholds for decision-making, reducing the need for manual feature engineering. This makes decision trees suitable for tasks with large and complex datasets.

Disadvantages

  • Overfitting: Decision trees are prone to overfitting, especially when they are deep and capture noise in the training data. To mitigate this, techniques like pruning or using ensemble methods are employed, but careful parameter tuning is essential.
  • Instability to Small Variations: Small changes in the data can result in different tree structures, leading to instability. This sensitivity can make decision trees less robust compared to other algorithms, and it is crucial to validate their performance on different datasets.
  • Biased Towards Dominant Classes: In classification tasks with imbalanced class distributions, decision trees tend to be biased towards the dominant class. This can affect their accuracy, particularly when dealing with minority classes, and additional techniques may be required to address class imbalance.

Libraries Utilized in This Project

For this project, the following libraries will be leveraged to facilitate various aspects of our work:

  1. Pandas: To manipulate and analyze structured data efficiently (1.5.3).
  2. Seaborn: To enhance the aesthetics of our visualizations; it is built on top of Matplotlib (0.11.1).
  3. Scikit-learn: A comprehensive machine learning library for model building and evaluation (1.2.2).
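If any of these are missing from your environment, they can be installed from the command line; the version pins below simply mirror the versions noted above and are optional.

pip install pandas==1.5.3 seaborn==0.11.1 scikit-learn==1.2.2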

Water Quality Prediction using Decision Tree

This tutorial unfolds through a carefully planned sequence of steps:

  • Data Collection
  • Exploratory Data Analysis (EDA)
  • Model Training
  • Model Prediction
  • Model Evaluation

Data Collection

To demonstrate decision trees, the dataset has been taken from Kaggle, a platform that hosts a large collection of public datasets and serves as an invaluable resource for learning, testing, and prototyping machine learning algorithms.

Step 1: Import the required libraries

import pandas as pd
import seaborn as sns

Step 2: Load the dataset

df = pd.read_csv('dataset.csv')

Step 3: Use the .head() and .tail() functions to get a glimpse of the dataset

The .head() method returns the first 5 rows and .tail() returns the last 5 rows. To look at more rows, pass the desired number, e.g. dataframe_name.head(10) returns the first 10 rows of the dataframe.

df.head()
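Similarly, the last rows, or a larger preview, can be requested in the same way:

df.tail()
df.head(10)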

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial initial step in the machine learning (ML) pipeline. It involves analyzing and visualizing the dataset to understand its structure, patterns, and potential insights. Through EDA, data scientists can identify missing values, outliers, and patterns that influence model performance. Techniques such as summary statistics, data visualization, and correlation analysis help in gaining valuable insights into the data’s characteristics. EDA guides the feature engineering process, aids in selecting appropriate ML models, and ensures a more informed approach to building robust and accurate machine learning models.

Let’s start by plotting a pairplot using seaborn to get a comprehensive overview of the relationships across multiple features.

sns.pairplot(df, hue='Potability', palette='Set1')

A countplot is a fundamental data visualization tool commonly used in exploratory data analysis (EDA). It displays the counts of observations in categorical data, presenting a visual representation of the distribution of each category.

sns.countplot(x='Potability', data=df)

The .corr() function is a method used to calculate correlations among features, aiding in the identification of relevant features within a dataset. This technique is valuable for understanding the relationships between different variables, which is essential for feature selection. Additionally, the results obtained from .corr() can be effectively visualized using a heatmap. This visualization tool makes it easy to interpret the correlation matrix, allowing for quick identification of patterns and dependencies among features. The combination of .corr() and a heatmap provides a powerful means to explore data, uncover meaningful insights, and guide decisions in the feature selection process.

df.corr()
sns.heatmap(df.corr(), cmap='coolwarm')
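If you are running these snippets as a standalone script rather than in a notebook, Matplotlib's show() call is needed to actually render the Seaborn figures created above:

import matplotlib.pyplot as plt
plt.show()  # displays the pending figures when not in a notebook environment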

Now, let’s meticulously examine the dataset to determine the count of NaN values, and subsequently, we’ll proceed to remove them. This step is crucial for ensuring data integrity and a cleaner dataset, setting the stage for more robust and accurate analyses.

count_nan = df['ph'].isnull().sum()
count_nan

The output is: 491
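The count above covers only the ph column. To inspect missing values across every column at once, the same check can be applied to the whole DataFrame:

df.isnull().sum()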

df2 = df.dropna()
df2.head()

The .dropna() method is a Pandas function used to remove missing (NaN) values from a DataFrame. When applied to a DataFrame, it drops any row that contains at least one NaN value. The method provides a convenient way to handle missing data by eliminating rows with incomplete information.

However, it’s important to note that blindly deleting all NaN values may not always be the best approach, as it can result in the loss of valuable information. It’s recommended to carefully assess the impact of missing values on your analysis and choose an appropriate strategy, such as imputation or handling missing values in a way that aligns with the goals of your analysis.
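As a rough alternative to discarding roughly a third of the rows, missing values could instead be imputed. The sketch below (df3 is an illustrative name, and mean imputation is only one of several possible strategies) fills each missing value with its column's mean:

df3 = df.fillna(df.mean())
df3.isnull().sum()

The remainder of this tutorial continues with df2, the DataFrame produced by dropping the incomplete rows.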

Let’s print the length of both dataframes using the len() function.

len(df.index)

The output is: 3276

len(df2.index)

The output is: 2011

Model Training with Decision Trees

Model training is a fundamental step in machine learning where an algorithm learns patterns from a training dataset to make accurate predictions on new data. During training, the algorithm adjusts its internal parameters iteratively to minimize the difference between predictions and actual outcomes.

Step 1: Get the values of independent and dependent variables

X = df2.drop('Potability', axis=1)
y = df2['Potability']

Step 2: Splitting the Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Step 3: Importing and Training the Decision Tree Model

from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
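The classifier above uses scikit-learn's defaults, which grow the tree until its leaves are pure and can therefore encourage the overfitting discussed earlier. As a sketch of one mitigation, hyperparameters such as max_depth (or ccp_alpha for cost-complexity pruning) can be constrained; the values below are illustrative rather than tuned:

dtree_limited = DecisionTreeClassifier(max_depth=5, random_state=42)
dtree_limited.fit(X_train, y_train)

The rest of the tutorial continues with the default dtree model trained above.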

Decision Tree Model Prediction

Once the model is trained, use it to make predictions on the test set.

predictions = dtree.predict(X_test)

Decision Tree Model Evaluation

Evaluate the model’s performance using appropriate measures, such as accuracy, precision and recall.

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
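Beyond the classification report, the overall accuracy and the confusion matrix can also be inspected directly:

from sklearn.metrics import accuracy_score, confusion_matrix
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))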

After a thorough examination of the obtained results, it becomes evident that Decision Trees, achieving an accuracy of 59% on the test set, can serve as a viable tool for determining the potability of water. The utilization of Decision Trees in water quality assessment underscores their potential in providing insights into the safety of water sources.

However, it is crucial to acknowledge the existence of alternative algorithms that can be used for predicting water quality. While Decision Trees offer commendable accuracy, exploring other machine learning algorithms may provide a more comprehensive understanding of the dynamics involved in water quality prediction.

For a more in-depth and comparative analysis of machine learning algorithms in the context of water quality prediction, you may refer to the research article titled “Water Quality Prediction Using Machine Learning”.

The above-mentioned research article presents an accuracy comparison of various ML algorithms for this purpose, delving into the intricacies of each algorithm and offering a detailed comparison of their accuracies in predicting water quality. The inclusion of multiple algorithms in this comparison not only broadens the perspective but also allows for a more nuanced evaluation of their respective strengths and weaknesses. As the field of water quality prediction continues to evolve, comparative analyses contribute significantly to advancing the understanding and improving the reliability of predictive models.

Thank you for exploring this tutorial! If you found it helpful, please consider liking, sharing, and subscribing for more blogs in the future. Stay tuned for additional insights and guides! For updates, you can also follow me on LinkedIn.

Simran Kaushik

I am an Analyst at KPMG and a participant of UiPath SDC 2024. Leveraging my expertise, I authored two influential books on Django and Machine Learning.