How Important is Data in Machine Learning?

Suhas Maddali
Published in Nerd For Tech
5 min read · Apr 13, 2021

We see data all around us. Many companies generate large amounts of data that can be used for machine learning and artificial intelligence. Amazon, Facebook and Google, for example, generate extremely large volumes of data.

Each Google search creates data, and with such a large user base, the total volume generated is huge. Data is therefore abundant, and we have to make sure that we make the best use of it. Let us also look at the types of data we have, so that we get a good understanding of the feature engineering techniques that can be applied to each of them.

Companies with a good amount of data

In machine learning, one of the things we have to take care of is the type of data given to the model. With more data, a machine learning algorithm generally has a better chance of learning the underlying patterns and giving accurate predictions on unseen data.

Often, we have to perform feature engineering on the data to generate new features and columns. In addition, some columns may have missing values. We therefore have to fill in those values and modify the data so that it is useful to machine learning models for making predictions.
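To make the two ideas above concrete, here is a minimal sketch using only the Python standard library. The rows, the column names (`height_cm`, `weight_kg`) and the derived `bmi` feature are all made up for illustration, and filling with the column mean is just one common choice, not the only way to handle missing values.

```python
from statistics import mean

# Hypothetical rows for illustration; None marks a missing value.
rows = [
    {"height_cm": 170.0, "weight_kg": 65.0},
    {"height_cm": None, "weight_kg": 80.0},
    {"height_cm": 180.0, "weight_kg": None},
    {"height_cm": 160.0, "weight_kg": 50.0},
]


def fill_missing_with_mean(rows, column):
    """Return copies of the rows with missing values in `column` replaced by the column mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    column_mean = mean(observed)
    return [
        {**r, column: column_mean if r[column] is None else r[column]}
        for r in rows
    ]


# Step 1: fill the missing values, one column at a time.
filled = fill_missing_with_mean(rows, "height_cm")
filled = fill_missing_with_mean(filled, "weight_kg")

# Step 2: feature engineering, i.e. derive a new column from existing ones.
for r in filled:
    r["bmi"] = r["weight_kg"] / (r["height_cm"] / 100) ** 2

for r in filled:
    print(r)
```

In practice, a median or a "most frequent value" fill might suit some columns better than the mean, so the fill strategy is worth choosing per column.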

Types of data

Data used for machine learning can take many forms. Here, we talk about the main types of data that we give to machine learning algorithms for predictions. The types of data are:

  1. Categorical data
  2. Numerical data
  3. Time series data
  4. Text data

These are the types of data that we work with in most machine learning applications. Let us now discuss what each of these data types means, with some examples, so that it is easy to understand the data that we give to machine learning models for prediction.

1. Categorical data

In this type of data, each value is one of a set of categories that describe a particular object. Consider, for instance, the color of a car. There can be many colors such as green, blue or silver. Since the data is not numerical and consists of different categories of car color, we call it categorical data. Below, we see that there are three categories of colors. We perform one-hot encoding, which creates one binary column per category, so that the categorical data is converted to numerical data for computation. We always have to remember that a machine learning algorithm only takes numerical values as input and performs computations on them. Hence, all of the data must be converted into some numerical form.

Converting categorical to numerical data
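As a rough illustration of the one-hot encoding described above, the sketch below encodes a small list of car colors in plain Python. The function name `one_hot_encode` and the sample list are assumptions made for this example; libraries offer their own encoders, which are not shown here.

```python
def one_hot_encode(values):
    """Return the sorted categories and one binary vector per input value."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


# Hypothetical sample: the color of four cars.
colors = ["green", "blue", "silver", "blue"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['blue', 'green', 'silver']
for color, vector in zip(colors, encoded):
    print(color, vector)
# green [0, 1, 0]
# blue [1, 0, 0]
# silver [0, 0, 1]
# blue [1, 0, 0]
```

Each row contains exactly one 1, in the column of the category it belongs to, so the model receives numbers without any artificial ordering between the colors.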

2. Numerical data

As the name implies, numerical data consists of just numbers. Below, we see a set of features such as attack, defense and so on, each with a specific set of numerical values associated with it. The values are floating point numbers, but they are still numerical in nature. In practice, we might come across a data set that contains numerical features along with categorical features.

Numerical data
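Since a data set can mix numerical and categorical features, a useful first step is often to find out which column is which. Here is a minimal sketch of that check in plain Python. The rows and the `type` column are invented for illustration (only the `attack` and `defense` names come from the example above), and the helper `split_columns_by_type` is a hypothetical name.

```python
# Hypothetical rows: two numerical features and one categorical feature.
rows = [
    {"attack": 49.0, "defense": 49.0, "type": "grass"},
    {"attack": 52.0, "defense": 43.0, "type": "fire"},
    {"attack": 48.0, "defense": 65.0, "type": "water"},
]


def is_number(value):
    """True for ints and floats; bool is excluded even though it subclasses int."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def split_columns_by_type(rows):
    """Return (numerical_columns, categorical_columns) based on the values in each column."""
    numerical, categorical = [], []
    for column in rows[0]:
        if all(is_number(r[column]) for r in rows):
            numerical.append(column)
        else:
            categorical.append(column)
    return numerical, categorical


numerical, categorical = split_columns_by_type(rows)
print(numerical)    # ['attack', 'defense']
print(categorical)  # ['type']
```

The numerical columns can go to the model more or less directly, while the categorical ones first need an encoding step.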

3. Time series data

Sometimes, we also need time series information and the data associated with it to improve the performance of a machine learning model. In that case, we take into consideration the features that contain time series information. In time series data, we record the values of a certain quantity over a specific set of time intervals and link those values to the rest of our data. The machine learning operations are then performed with the time series information included. In the image below, we see a time series of a particular value in the data over the interval from Jan 2019 to Jan 2020. We can add this feature to our original data, which might improve the accuracy of the machine learning model.

Time series data
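One way to turn a time series into features that can be added to the original data is to compute, for every time step, the previous value and a rolling mean. The sketch below does this in plain Python. The monthly values are made up, and lag and rolling-mean features are my own choice of technique for illustration, not something specific to the plot above.

```python
# Hypothetical monthly observations: (month, value).
monthly_values = [
    ("2019-01", 10.0),
    ("2019-02", 12.0),
    ("2019-03", 11.0),
    ("2019-04", 15.0),
    ("2019-05", 14.0),
]


def add_lag_and_rolling_mean(series, window=3):
    """Build one row per time step with the previous value and a rolling mean.

    Features that cannot be computed yet (not enough history) are set to None.
    """
    rows = []
    for i, (month, value) in enumerate(series):
        previous = series[i - 1][1] if i >= 1 else None
        if i + 1 >= window:
            recent = [v for _, v in series[i + 1 - window:i + 1]]
            rolling = sum(recent) / window
        else:
            rolling = None
        rows.append({
            "month": month,
            "value": value,
            "lag_1": previous,
            "rolling_mean": rolling,
        })
    return rows


features = add_lag_and_rolling_mean(monthly_values)
for row in features:
    print(row)
```

The first rows have `None` for features that need more history than is available, so they would have to be dropped or filled before training.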

4. Text data

There is a lot of text data available to us in the form of posts, articles and blogs. We take that data into consideration and apply different machine learning approaches to it. To do so, we convert the text into a mathematical vector using some form of vectorization, so that machine learning algorithms can later use it for prediction. Python has different vectorizers, such as bag-of-words (BOW) and TF-IDF vectorizers, which convert the text at hand into its numerical equivalent.

Text data
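To show roughly what these vectorizers compute, here is a simplified sketch of bag-of-words and TF-IDF written with only the standard library. The three sentences and the helper names are made up for this example, and the formula used is the basic textbook one (term frequency times `log(N / document frequency)`). Library implementations such as scikit-learn's `CountVectorizer` and `TfidfVectorizer` apply extra smoothing and normalization, so their numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical mini corpus.
documents = [
    "the car is blue",
    "the car is fast",
    "data is everywhere",
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_vocabulary(docs):
    """Sorted list of every distinct token in the corpus."""
    return sorted({token for doc in docs for token in tokenize(doc)})


def bag_of_words(docs, vocabulary):
    """One count vector per document, aligned with the vocabulary."""
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocabulary])
    return vectors


def tf_idf(docs, vocabulary):
    """Basic TF-IDF: (count / document length) * log(N / document frequency)."""
    bow = bag_of_words(docs, vocabulary)
    n_docs = len(docs)
    doc_freq = [sum(1 for vec in bow if vec[j] > 0) for j in range(len(vocabulary))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    return [
        [(count / sum(vec)) * weight for count, weight in zip(vec, idf)]
        for vec in bow
    ]


vocabulary = build_vocabulary(documents)
bow = bag_of_words(documents, vocabulary)
tfidf = tf_idf(documents, vocabulary)

print(vocabulary)  # ['blue', 'car', 'data', 'everywhere', 'fast', 'is', 'the']
print(bow[0])      # [1, 1, 0, 0, 0, 1, 1]
print(tfidf[0])
```

Note that the word "is" appears in every document, so its TF-IDF weight is zero here, while a rarer word such as "blue" gets a higher weight. That is the main difference from plain bag-of-words counts.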

Conclusion

We have seen that there are different types of data, such as categorical data, numerical data, time series data and text data, and that all of them can be given to machine learning models for predictions. Each type of feature needs its own feature engineering techniques, and we have to ensure that every feature ends up in a numerical format so that the machine learning models can interpret it.

If you want to connect through LinkedIn, where we can further discuss machine learning and the latest advances, the link to my profile is below. Feel free to connect. Thanks.

LinkedIn: https://www.linkedin.com/in/suhas-maddali-b9b146136/

GitHub: suhasmaddali (Suhas Maddali) (github.com)

Suhas Maddali
Nerd For Tech

🚖 Data Scientist @ NVIDIA 📘 15k+ Followers (LinkedIn) 📝 Author @ Towards Data Science 📹 YouTuber 🤖 200+ GitHub Followers 👨‍💻 Views are my own.