An Insight into the Water Treatment Plant Dataset using Machine Learning EDA

Saahil Sharma
Published in GreyAtom
Jul 11, 2017

On a peaceful day, when all of us were sitting and thinking about the topic for the day, our instructor, Mayuresh Shilotri, came up with the idea of splitting the group into pairs. Each pair would perform an Exploratory Data Analysis on a dataset he provided and come up with good insights by studying it inside and out. I got the opportunity to explore the Water Treatment Plant dataset for the city of Barcelona, Spain.

Brief Overview —

Water treatment consists of various procedures, with the consistency of the water evaluated at every step. In a basic treatment plan, some of the common tasks are: Screening, Coagulation, Sedimentation, Filtration, Disinfection. These terms are pretty straightforward, so we will not go into detail about each and every step. I would, however, like to talk about coagulation, which might be an unfamiliar word for some people.

Coagulation is a process in which small, sticky particles are added to the water. Dirt is attracted to them and sticks, forming heavier clumps that sink to the bottom of the tank.

About the Dataset —

This dataset holds the daily sensor measurements from the treatment plant. To be precise, it records the waste, chemicals, particles, etc. found in a certain amount of water, tested daily for almost two years (exactly two months are excluded).

The dataset has 527 instances (rows) and 38 attributes (features). It contains some missing values as well.
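
A first check of these numbers could look like the sketch below. It assumes the raw file is comma-separated, with a day label in the first field and `?` marking a missing reading. Both are assumptions on my part, and the three-row sample is made up so the snippet runs without the real file.

```python
import csv
import io

def load_rows(source, missing_marker="?"):
    """Parse comma-separated rows: first field is a day label, the rest are readings.

    Readings equal to `missing_marker` (an assumed convention) become None.
    """
    rows = []
    for record in csv.reader(source):
        if not record:
            continue  # skip blank lines
        day, readings = record[0], record[1:]
        values = [None if v.strip() == missing_marker else float(v) for v in readings]
        rows.append((day, values))
    return rows

def missing_per_feature(rows):
    """Count the missing readings in each feature column."""
    n_features = len(rows[0][1])
    return [sum(1 for _, values in rows if values[i] is None) for i in range(n_features)]

# Made-up sample standing in for the real file (3 features instead of 38).
sample = io.StringIO(
    "D-1/3/90,44101,1.5,7.8\n"
    "D-2/3/90,39024,?,7.7\n"
    "D-4/3/90,32229,5.0,?\n"
)
rows = load_rows(sample)
print(len(rows), "rows,", len(rows[0][1]), "features")
print("missing per feature:", missing_per_feature(rows))
```

On the real data, passing an open file handle instead of the `io.StringIO` sample should give the row count, the feature count, and where the gaps are concentrated.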

Let's talk about some of the features in the dataset. We won't cover every feature, as that would be very time consuming, and some of the features are repeated in a different form.

Looking at a few of them helps to get an overview of the domain and of how to deal with such datasets.

  1. Q-E (input flow to plant): This describes the flow of water into the plant, in L/hr, and ranges from 10K to 60K.
  2. ZN-E (input zinc to plant): Zinc removal is one of the processes involved in the treatment. Mind you, the objective is to remove zinc, not add it. This input measurement therefore indicates how much zinc there is to remove. The unit is unknown.
  3. PH-E (input pH to plant): When the pH is greater than 7, the water contains an excess of negatively charged hydroxide ions. By adjusting the pH, we can remove heavy metals and other toxic metals.

Key Findings —

Since the dataset has so many features, it can also be seen as a high-dimensional problem.

As this is an unlabeled dataset, the traditional supervised methods of classification or regression do not apply here directly. We have to perform clustering on it, which is also the task given in the UCI description.
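
To make the idea of clustering concrete, here is a minimal k-means sketch in plain Python. It is only an illustration, not the method used in the UCI description or in this exercise. The points are made-up two-feature observations, and the farthest-point seeding is a simple deterministic stand-in I chose in place of the usual random initialisation. On the real data, the features would first need scaling and the missing values would need handling.

```python
import math

def kmeans(points, k, n_iter=20):
    """Plain k-means on a list of equal-length tuples; returns (centroids, labels)."""
    # Farthest-point seeding: start from the first observation, then repeatedly
    # add the observation that lies farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Made-up observations forming two visibly separate groups.
points = [
    (1.0, 1.0), (1.2, 1.1), (0.9, 1.3),
    (8.0, 8.2), (8.3, 7.9), (7.8, 8.1),
]
centroids, labels = kmeans(points, k=2)
print("labels:", labels)
print("centroids:", centroids)
```

Each label says which group an observation fell into. On the plant data, the interesting part would be checking whether the groups correspond to recognisable operating states of the plant.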

We were able to find an existing study online in which researchers applied machine learning techniques to this data to extract useful knowledge for improving the business.

These techniques are —

  1. C4.5/inductive decision tree algorithm
  2. CN2 rule induction algorithm
  3. KNN instance-based classification learning algorithm
  4. open case/case based classification

You can read about the whole paper here —

Some questions that come to my mind after going through this exercise are:

  1. Will the two months of missing data have any impact on the overall business and on the model that was built?
  2. There are some missing values present. Would they have been important, and will they affect the overall model we built? How can they be treated so that our model isn't affected much?
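
One common way to treat missing values is to fill each gap with the median of the observed values for that feature. The sketch below shows the idea on made-up rows, with `None` standing for a missing reading (an assumed representation). Whether this is the right treatment for daily sensor data is an open question. Interpolating between neighbouring days, or dropping sparse rows, may suit it better.

```python
from statistics import median

def impute_median(rows):
    """Return a copy of `rows` with each None replaced by its column's median.

    A column with no observed values at all is left as None.
    """
    fills = []
    for column in zip(*rows):
        observed = [v for v in column if v is not None]
        fills.append(median(observed) if observed else None)
    return [
        [fills[i] if v is None else v for i, v in enumerate(row)]
        for row in rows
    ]

# Made-up readings for three features; None marks a missing value.
rows = [
    [44101.0, 1.5, 7.8],
    [39024.0, None, 7.7],
    [32229.0, 5.0, None],
    [35023.0, 3.5, 7.6],
]
filled = impute_median(rows)
for row in filled:
    print(row)
```

The median is used here rather than the mean because it is less sensitive to the occasional extreme reading, which is a design choice and not something the dataset itself dictates.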

Thank you for reading. I hope I have done this exercise perfectly or, if not, at least well. :)
