At the time of posting this article, I am a senior undergraduate studying computer engineering. Last term I took a course named Data Mining. I loved its concepts and the results you get after extracting information from data, so I would like to share with you the basic knowledge of being a data scientist/engineer that I learned.
Before we start, we need to get to know some terms used in data mining:
Dataset: A collection of data presented as a table. Most of the time it is a CSV file containing rows and columns: the columns are the fields (or titles) of your data, and the rows are your actual data.
Data mining tasks: As far as I know, there are two types of tasks: prediction and classification. Prediction means predicting a numeric value, for example a house price. Classification means assigning a record to a class. For example, if we have two classes, buyers and non-buyers, classification means labeling an account owner as a buyer or a non-buyer.
Model: In our data mining tasks we always develop a model to predict or classify new data.
Outcome: The variable that we predict or classify is our outcome variable. For example, in a prediction task where we want to predict a house price, the house price is our outcome variable. In a classification task where we want to classify a person as a buyer or a non-buyer, the outcome variable is the one that records which of the two the person is. As you can see in image1, the total value is our outcome variable (we want to predict the total value).
Unsupervised learning: If we don’t have the outcome value in the dataset, we can use clustering algorithms to group similar records into clusters (which then act as classes), and then we can train a model on the modified data.
Supervised learning: If our dataset has the outcome value, we can use supervised algorithms such as linear regression or k-nearest neighbors. Predicting the total value from the dataset shown in image1 is supervised learning.
Overfitting: As we said above, for prediction or classification tasks we always train a model by fitting it to our data. There is always a chance of overfitting, which means the model fits the training data so closely that, if we feed the same data back in, it returns almost exactly the known outcome values (the error rate is near zero). But when we feed new data into the model, it cannot predict or classify the outcome accurately. To get a better understanding of this, think of a student: if he only memorizes the problems in his mathematics book, he cannot answer new math problems, but if he learns how to solve them, he can solve new problems easily in his exam and get a good grade.
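To see the memorizing student in code, here is a minimal Python sketch. It is my own illustration, not something from the book: the data is randomly generated, and the helper names predict_1nn and mean_abs_error are made up. The "model" simply returns the outcome of the closest training record, which is pure memorization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: the outcome is roughly 3 * x plus random noise.
x = rng.uniform(0, 10, size=40)
y = 3 * x + rng.normal(0, 2, size=40)

# The first 30 records are used for training, the last 10 are held out.
x_train, y_train = x[:30], y[:30]
x_valid, y_valid = x[30:], y[30:]

def predict_1nn(x_new):
    """Return the outcome of the single closest training record (memorization)."""
    distances = np.abs(np.asarray(x_new)[:, None] - x_train[None, :])
    nearest = distances.argmin(axis=1)
    return y_train[nearest]

def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted outcomes."""
    return float(np.mean(np.abs(actual - predicted)))

train_error = mean_abs_error(y_train, predict_1nn(x_train))
valid_error = mean_abs_error(y_valid, predict_1nn(x_valid))

print("error on training data:", train_error)    # every record finds itself, so 0.0
print("error on validation data:", valid_error)  # greater than zero: these records are new
```

The zero error on the training data looks perfect, but it only shows that the model has memorized the noise. The error on the held-out records is the honest number.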
So let’s get started.
The work is summarized in 7 steps:
One. Define your purpose
Find your goal: what do you want from your data?
Two. Find the best dataset that matches your goal
The better the data, the better the results.
For example, prefer a dataset that has the fewest missing values.
Three. Get to know your data
Knowing exactly what your data columns mean is a very important part of a data mining task, because if you want to delete uncorrelated columns for data reduction, you have to know which columns are not useful for your task.
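A first look at the columns can be sketched with pandas. This is only an illustration under my own assumptions: the CSV content and the column names (lot_size, rooms, owner_id, total_value) are made up, and a real task would read an actual file instead of a string.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file; an empty field is a missing value.
csv_text = """lot_size,rooms,owner_id,total_value
9965,6,101,344.2
6590,10,102,412.6
7500,8,103,330.1
13773,9,104,498.6
5000,,105,331.5
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # number of missing values per column

# Correlation of each numeric column with the outcome column.
print(df.corr()["total_value"])
```

A column like owner_id is just an identifier. Even if it happens to correlate with the outcome in a tiny sample like this one, knowing what the column means tells you it is not useful for the task.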
Four. Preprocess and clean your data
Before we start to use our data, we need to do two things:
- Data reduction: If our dataset is very big, or it has columns that are uncorrelated with the outcome value, we need to delete those columns.
- Fill in the missing values: If we have some missing values in our dataset, we can fill them in with the median or with a linear regression algorithm (or any other suitable algorithm). Of course, if a column has a lot of missing values, it’s better to delete that column.
Note that to fill in missing values or to remove columns we need to know our data: if a column is highly correlated with the outcome value, we won’t delete it. It’s highly important to know every aspect of your data.
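The two cleaning jobs above could look roughly like this in pandas. Treat it as a sketch: the records and column names are made up, and the 50% cutoff for "a lot of missing values" is my own assumption, not a rule from the book.

```python
import numpy as np
import pandas as pd

# Made-up records; np.nan marks a missing value.
df = pd.DataFrame({
    "lot_size":    [9965, 6590, 7500, 13773, 5000, 8000],
    "rooms":       [6, 10, np.nan, 9, 7, 8],
    "remodel":     [np.nan, np.nan, "Recent", np.nan, np.nan, np.nan],
    "owner_id":    [101, 102, 103, 104, 105, 106],
    "total_value": [344.2, 412.6, 330.1, 498.6, 331.5, 360.0],
})

# Data reduction: owner_id is only an identifier, so it is not useful for the task.
df = df.drop(columns=["owner_id"])

# Delete columns where more than half of the values are missing (assumed cutoff).
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)

# Fill the remaining missing values in "rooms" with the column median.
df["rooms"] = df["rooms"].fillna(df["rooms"].median())

print(df)
```

Here remodel is dropped because five of its six values are missing, while the single gap in rooms is filled with the median of the other five values.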
Five. Start to partition your data
Partition your data into 2 (or 3) partitions.
The first partition is your training data.
The second partition is your validation data.
Training data: for our prediction or classification problem we need a model, and we train that model by fitting it to the training data.
Validation data: after the model has been fitted to the training data, we validate it on this partition to get to know its error and variance.
Note that the third partition, also called test data, is not always used. It is used to check the performance of the chosen model on data that played no part in training or in choosing between models.
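One simple way to partition is to shuffle the row numbers and cut them into pieces. The sketch below assumes a dataset of 100 rows and a 60/20/20 split; both numbers are my own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n_records = 100                        # pretend the dataset has 100 rows
shuffled = rng.permutation(n_records)  # row numbers 0..99 in random order

# 60% training, 20% validation, the rest test (assumed ratio).
n_train = round(0.6 * n_records)
n_valid = round(0.2 * n_records)

train_idx = shuffled[:n_train]
valid_idx = shuffled[n_train:n_train + n_valid]
test_idx = shuffled[n_train + n_valid:]

print(len(train_idx), len(valid_idx), len(test_idx))  # 60 20 20
```

Shuffling first matters because many files are sorted in some way, and cutting a sorted file would put different kinds of records in each partition. With a pandas table, the rows of one partition can then be selected with df.iloc[train_idx].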
Six. Train a model
After partitioning the dataset, we use the training data to fit a model. Several algorithms are available, for example linear regression and k-nearest neighbors for supervised tasks or clustering for unsupervised ones, and techniques such as principal component analysis (PCA) can reduce the number of columns first. After choosing a model and fitting it to the training data, we use the validation data to evaluate the model’s performance. We can fit several models and compare their performance on the validation data.
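Here is a small sketch of fitting two candidate models and comparing them on the validation data. It uses only NumPy, so both models are hand-written simplifications; the data is randomly generated, and the names fit_linear, fit_knn and rmse are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: one predictor x and an outcome that is roughly linear in x.
x = rng.uniform(0, 10, size=100)
y = 5 + 2 * x + rng.normal(0, 1, size=100)
x_train, y_train = x[:70], y[:70]
x_valid, y_valid = x[70:], y[70:]

def fit_linear(x_tr, y_tr):
    """Least-squares line; returns a function that predicts y from x."""
    design = np.column_stack([np.ones_like(x_tr), x_tr])
    coef, *_ = np.linalg.lstsq(design, y_tr, rcond=None)
    return lambda x_new: coef[0] + coef[1] * x_new

def fit_knn(x_tr, y_tr, k=3):
    """k-nearest neighbors: average the outcome of the k closest training records."""
    def predict(x_new):
        distances = np.abs(x_new[:, None] - x_tr[None, :])
        nearest = np.argsort(distances, axis=1)[:, :k]
        return y_tr[nearest].mean(axis=1)
    return predict

def rmse(actual, predicted):
    """Root mean squared error: lower is better."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

models = {
    "linear regression": fit_linear(x_train, y_train),
    "3-nearest neighbors": fit_knn(x_train, y_train, k=3),
}

# Evaluate every candidate on the validation partition and keep the best one.
errors = {name: rmse(y_valid, model(x_valid)) for name, model in models.items()}
best_name = min(errors, key=errors.get)
print(errors, "->", best_name)
```

The important part is that the comparison uses records the models did not see during fitting, so a model that only memorized the training data gets no advantage.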
Seven. The last step
After choosing the best model, our model is ready to use: feed in new data and predict an outcome value or get the class of each new record.
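For the classification case, using a model on new data can be sketched like this. The training records (age and yearly income in thousands), the classes, and the function name classify are all made up for illustration, and the model is a bare-bones k-nearest neighbors vote rather than a tuned one.

```python
import numpy as np

# Made-up training records: [age, yearly income in thousands] and the known class.
features = np.array(
    [[25, 30], [32, 45], [47, 90], [51, 110], [38, 60], [29, 35]], dtype=float
)
classes = np.array(
    ["non-buyer", "non-buyer", "buyer", "buyer", "buyer", "non-buyer"]
)

def classify(new_record, k=3):
    """Return the majority class among the k closest training records."""
    distances = np.linalg.norm(features - np.asarray(new_record, dtype=float), axis=1)
    nearest = classes[np.argsort(distances)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[counts.argmax()]

print(classify([45, 95]))  # buyer
print(classify([27, 32]))  # non-buyer
```

The new records have no known class. The model assigns one based on what it learned from the records that did.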
So this was a brief explanation of the core of a data mining task. Of course, there are many more details, and this was meant as a starting point for beginners. Also, the referenced book explains another way to divide the work on page 57.
I hope you got a feel for how this works.
If you have any questions or problems, you can comment here. I will be happy to help 🙂.
Referenced book: Data Mining for Business Analytics by Galit Shmueli, Peter C. Bruce, Peter Gedeck, Nitin R. Patel
The posted image is a picture of the book.