The core of a data mining task

Mohammad Amin Dadgar
Mar 4 · 5 min read
This image is taken from Freepik website

Hi everybody

At the time of posting this article, I am a senior undergraduate student in computer engineering. Last term I took a course on data mining. I loved its concepts and the results you get after extracting information from data, so I would like to share the basic knowledge a data scientist or data engineer needs, as I learned it.

Before we start we need to get to know some terms used in data mining:

Dataset: A collection of data presented as a table. Most of the time it is a CSV file made up of rows and columns: the columns are the fields (or titles) of your data, and the rows are the actual records.

image1. A dataset of houses and their total value
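As a small illustration of the rows-and-columns idea, here is a sketch that loads a tiny CSV with pandas. The column names (LOT_SQFT, ROOMS, TOTAL_VALUE) and the values are made up for this example and are not taken from the dataset in image1.

```python
import io

import pandas as pd

# Hypothetical CSV content: the column names and values are invented for illustration.
csv_text = """LOT_SQFT,ROOMS,TOTAL_VALUE
9965,6,344.2
6590,10,412.6
7500,8,330.1
"""

# read_csv accepts a file path or any file-like object; StringIO keeps the example self-contained.
housing = pd.read_csv(io.StringIO(csv_text))

print(housing.shape)           # (number of rows, number of columns)
print(list(housing.columns))   # the fields of the dataset
print(housing.iloc[0])         # the first record
```

With a real file you would pass its path to read_csv instead of the StringIO object.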

Data mining tasks: As far as I know, there are two main types of tasks: prediction and classification. Prediction means estimating a numerical value, for example a house price. Classification means assigning a record to a class. For example, if we have two classes, buyers and non-buyers, classification decides whether an account owner is a buyer or a non-buyer.

Model: In a data mining task we always develop a model, which we then use to predict or classify new data.

Outcome: The variable that we predict or classify is our outcome variable. For example, in a prediction task where we want to predict a house price, the house price is the outcome variable. In a classification task where we want to classify a person as a buyer or non-buyer, the outcome variable is the one that records whether the person is a buyer. As you can see in image1, the total value is our outcome variable (we want to predict the total value).

Unsupervised learning: If the dataset does not contain an outcome value, we can use clustering algorithms to group similar records into clusters (in effect, creating classes). We can then train a model on the data labeled with those clusters.
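To make the clustering idea concrete, here is a minimal sketch using scikit-learn's KMeans. The six points and the choice of two clusters are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical records with two numeric columns and no outcome column.
records = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],   # one visually obvious group
    [8.0, 8.0], [9.0, 9.5], [8.5, 9.0],   # another group far away
])

# Ask for two clusters; n_init and random_state are fixed so the run is repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(records)

# Each record now carries a cluster number that can serve as a created class.
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which records end up sharing a number.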

Supervised learning: If our dataset contains the outcome value, we can use supervised algorithms such as linear regression or k-nearest neighbors. Predicting the total value from the dataset shown in image1 is a supervised learning task.

Overfitting: As we said above, for prediction or classification tasks we train a model by fitting it to our data. There is always a risk of overfitting, which means the model fits the training data too closely: if we feed the training data back into it, the model returns almost exactly the right outcome values (the error rate is near zero), but when we give it new data it cannot predict or classify the outcome accurately. To get a better understanding, think of a student who memorizes the problems in his mathematics book: he cannot answer new problems. A student who learns how to solve the problems can handle new ones in the exam and get a good grade.
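The memorizing student can be imitated in a few lines of plain Python. In this sketch the data is invented: one number per person, a buyer/non-buyer label, and one deliberately noisy training label. The "memorizer" copies the label of the closest training record, while the "simple rule" uses a single threshold (5 is an assumed value for this toy data).

```python
# Hypothetical training and validation records: (feature value, class label).
train = [(1, "non-buyer"), (2, "non-buyer"), (3, "non-buyer"), (4, "buyer"),  # 4 is a noisy label
         (6, "buyer"), (7, "buyer"), (8, "buyer"), (9, "buyer")]
validation = [(1.5, "non-buyer"), (3.8, "non-buyer"), (4.2, "non-buyer"), (6.5, "buyer")]


def memorizer(x):
    """Return the label of the closest training record (memorizes every record)."""
    nearest = min(train, key=lambda record: abs(record[0] - x))
    return nearest[1]


def simple_rule(x):
    """A deliberately simple model: one threshold, ignoring the noisy record."""
    return "buyer" if x >= 5 else "non-buyer"


def accuracy(model, records):
    """Share of records whose predicted label matches the true label."""
    hits = sum(model(x) == label for x, label in records)
    return hits / len(records)


for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name, accuracy(model, train), accuracy(model, validation))
```

On this toy data the memorizer scores 1.0 on the training records but only 0.5 on the validation records, while the simple rule scores 0.875 and 1.0. A perfect training score combined with a worse validation score is the typical sign of overfitting.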

So let’s get started

The work is summarized in 7 steps

Find your goal: what do you want from your data?

The better the data, the better the results

Find the best dataset that matches your goal. For example, choose a dataset that has the fewest missing values.

Knowing exactly what each column of your data means is a very important part of a data mining task. If you want to delete uncorrelated columns to reduce the data, you have to know which columns are not useful for your task.

Before we start to use our data, we need to do two things:

  1. Data reduction: If our dataset is very big, or it has columns that are uncorrelated with the outcome value, we delete those columns.
  2. Fill in the missing values: If the dataset has some missing values, we fill them in with the median or with a linear regression algorithm (or any other suitable algorithm). Of course, if a column has a lot of missing values, it is better to delete the column.

Note that to fill in missing values or to remove columns we need to know our data. If a column is highly correlated with the outcome value, we won’t delete it. It is very important to know every aspect of your data.
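The two preparation steps above can be sketched with pandas as follows. The table, the column names, and the rule "drop a column when more than half of its values are missing" are all assumptions chosen for this example, not fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical housing table with some missing values (NaN).
houses = pd.DataFrame({
    "LOT_SQFT": [9965, 6590, np.nan, 7500, 8000],
    "ROOMS": [6, 10, 8, 7, 9],
    "REMODEL_NOTE": [np.nan, np.nan, np.nan, "recent", np.nan],
    "TOTAL_VALUE": [344.2, 412.6, 330.1, 360.0, 390.5],
})

# Step 1, data reduction: drop columns that are mostly empty.
# The 0.5 threshold is an arbitrary choice for this sketch.
missing_share = houses.isna().mean()
reduced = houses.drop(columns=missing_share[missing_share > 0.5].index)

# Step 2, fill in missing values: replace the remaining gap with the column median.
reduced["LOT_SQFT"] = reduced["LOT_SQFT"].fillna(reduced["LOT_SQFT"].median())

# Correlation of each remaining column with the outcome, to help judge usefulness.
print(reduced.corr()["TOTAL_VALUE"])
print(reduced)
```

Checking the correlation with the outcome before deleting a column is one simple way to follow the advice above; with real data the decision should also rest on what the column actually means.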

Partition your data into 2 (or 3) partitions

The first partition is your training data

The second partition is your validation data

Training data definition: For our prediction or classification problem we need a model, and we train that model by fitting it to the training data.

Validation data definition: After the model has been fitted to the training data, we validate it on the validation data to measure its error and variance.

Note that the third partition, also called test data, is not always used. It is used to assess how the chosen model performs on data that played no part in fitting or selecting it.
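A three-way partition as described above can be sketched with the Python standard library alone. The 50/30/20 proportions and the seed are assumptions for this example, and the numbers 1 to 100 stand in for the rows of a real dataset.

```python
import random

records = list(range(1, 101))   # stand-in for 100 dataset rows

# Shuffle a copy with a fixed seed so the split is random but repeatable.
rng = random.Random(1)
shuffled = records[:]
rng.shuffle(shuffled)

# Assumed proportions: 50% training, 30% validation, the remaining 20% test.
n_train = int(0.5 * len(shuffled))
n_valid = int(0.3 * len(shuffled))

train = shuffled[:n_train]
valid = shuffled[n_train:n_train + n_valid]
test = shuffled[n_train + n_valid:]

print(len(train), len(valid), len(test))
```

Shuffling before slicing matters: if the file is sorted (for example by price), taking the first rows as training data would give the model a distorted view. scikit-learn's train_test_split function does the same job for a two-way split.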

After partitioning the dataset, we use the training data to fit a model. Several algorithms are available, for example linear regression and k-nearest neighbors (methods such as clustering and principal component analysis (PCA) are used earlier, for grouping records or reducing the data, rather than for predicting an outcome). After choosing a model and fitting it to the training data, we use the validation data to evaluate the model’s performance. We can fit several models and compare their performance on the validation data.

After choosing the best model, it is ready to use: give it new data and it will predict an outcome value or assign a class to each new record.
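The fit, validate, choose, and predict sequence described above can be sketched with scikit-learn. Everything about the data here is an assumption: the lot sizes and house values are randomly generated, the 60/40 split and k=5 are arbitrary, and the two candidate models are only examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: one predictor (lot size) and a noisy, roughly linear outcome.
rng = np.random.default_rng(0)
X = rng.uniform(1000, 10000, size=(200, 1))
y = 100 + 0.03 * X[:, 0] + rng.normal(0, 10, size=200)

# Partition into training and validation data (60% / 40%, an arbitrary choice).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Candidate models to compare.
candidates = {
    "linear regression": LinearRegression(),
    "k-nearest neighbors (k=5)": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model on the training data and measure its error on the validation data.
validation_rmse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    validation_rmse[name] = float(np.sqrt(mean_squared_error(y_valid, predictions)))

# Keep the model with the lowest validation error and use it on a new record.
best_name = min(validation_rmse, key=validation_rmse.get)
new_house = np.array([[7200.0]])
predicted_value = candidates[best_name].predict(new_house)

print(validation_rmse)
print(best_name, predicted_value)
```

The root mean squared error (RMSE) is used here as the error measure; other measures would work the same way, as long as every model is scored on the same validation data.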

So this was a brief explanation of the core of a data mining task. Of course, there are many more details, and this was only a start for beginners. The referenced book also describes another way to divide the work, on page 57.

I hope you now have some idea of how this works.

There are a lot of libraries that can help you with this, for example pandas, scikit-learn (sklearn), NumPy, TensorFlow, …

If you have any questions or problems, you can comment here. I will be happy to help 🙂.

Referenced book: Data Mining for Business Analytics by Galit Shmueli, Peter C. Bruce, Peter Gedeck, Nitin R. Patel

The image below shows the cover of the book.

image2. Referenced book, Data Mining for Business Analytics

Nerd For Tech
