Just starting with a Data Project!!!

Karthik Ravichandran · Published in CodeX · 4 min read · Sep 29, 2021

Today was supposed to be a dull day: no motivation while getting out of bed, and a plan full of reviews and mundane tasks. However, the morning stand-up call changed the perspective of the day. All the managers, fellow employees, and friends were thrilled and excited by a piece of writing I had sent in a mail yesterday about the future work the team is supposed to look into. It explained a pipeline in three segments: the data acquisition phase, data cleaning, and data modeling. As far as a data project is concerned, these three are the core parts of the project; other processes like data wrangling, client discussion, and reinforcement of ideas come in at a later stage and are not necessary at the beginning. Since I got some positive feedback on it, I want to share it in this online forum, one piece after the other, daily. To start with, let me give a head start on these three concepts in the following passages.

The data acquisition phase is the phase in which a team gets or fetches data from different sources. It can be either manual extraction or an automatic task that gathers the data without human intervention; automation not only yields a larger amount of data but also helps later data analysis. For instance, if I'm creating a dog-or-cat classifier, we need to collect images from different websites, either by pruning data from those sites ourselves or through a public request, where we ask people to contribute data of their own. However, these methods come with a few challenges: redundant data, noise introduced during acquisition, or original data hampered by a hack or other malicious activity. For these reasons, one should carefully consider data cleaning, and in many cases it is simply good practice.

As discussed in the above paragraph, data cleaning is a vital part of a data science project. It not only helps in creating a better model but also protects us from misconceptions caused by misleading data. The idea shows up in many areas outside data science too. For example, we always remove oddly shaped laddus (a high-calorie sweet made of jaggery, widely available in the northern part of India) from a sweet pack so that the person receiving it sees uniformity. Another example: in an inter-school competition, teachers keep students with very low skill in that domain from participating, so that all the students representing the school look talented. Ya, these are a few weird analogies that I can think of to confuse you, pun intended. But think them through and you will get the point: we always prefer seeing uniformity rather than odd points popping out, and similarly, the model wants uniformity in the data. However, it would be wrong to consider data cleaning an easy process or to assume it always has a positive impact; sometimes we may end up removing important data that is, indeed, necessary for the model to capture the variance.
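The "oddly shaped laddu" idea maps directly onto dropping duplicates and outliers. A minimal pandas sketch, with my own function name and a simple z-score rule as the outlier criterion (an assumption, not the post's prescription):

```python
import pandas as pd

def clean(df, col, z=3.0):
    """Drop duplicate rows, then drop rows whose value in `col` lies
    more than `z` standard deviations from the mean -- the 'oddly
    shaped' points.

    Caution, as the post warns: an aggressive threshold can remove
    genuinely informative data that the model needs to capture variance.
    """
    df = df.drop_duplicates()
    mu, sigma = df[col].mean(), df[col].std()
    return df[(df[col] - mu).abs() <= z * sigma]
```

The z-score rule is itself sensitive to the outliers it is trying to remove; on small or heavy-tailed data, an IQR-based rule is often a safer choice.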

When it comes to modeling, it's always better to revisit and log the previous two processes, data acquisition and data cleaning, because if those steps were not done with utmost care, we may end up with a weak model that is not useful on any occasion. Modeling itself is a different kind of process: since a model is often treated as a black box, you can neither alter anything inside the box nor fully understand its contents; the only thing possible is to understand the characteristics of the model and do some analysis. Still, the model can be influenced, and its course of learning changed, by altering and governing the input features and the tuning parameters.
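That last point, steering a black box only through its tuning parameters, can be sketched as a plain grid search. Everything here (the evaluation metric, the toy threshold model in the usage) is my own illustrative choice:

```python
def evaluate(model, X, y):
    """Score a black-box model by accuracy: we only observe its outputs,
    never its internals."""
    preds = [model(x) for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def grid_search(make_model, grid, X, y):
    """Influence the black box only through its tuning parameters:
    build a model for each candidate setting and keep the best scorer."""
    best_params, best_score = None, -1.0
    for params in grid:
        score = evaluate(make_model(**params), X, y)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, with a toy one-parameter classifier `make_model = lambda threshold: (lambda x: int(x > threshold))`, searching thresholds 0.5, 2.5, and 3.5 against inputs [1, 2, 3, 4] labeled [0, 0, 1, 1] selects 2.5, without ever looking inside the "model".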

Wait!!!! Time's up… let's look at all those points again, in detail, in future posts.

Burgeoning data science researcher working in the healthcare industry