Data Analytics 101 Series — The ‘Prepare’ Phase

Adith Narasimhan Kumar
Published in Analytics Vidhya
3 min read · Apr 12, 2023


Continuing from my previous article, this one covers the Prepare phase of the data analytics process. It is the second of the 6 phases of any data analytics project.

After defining the problem and the problem statement for a business case, the next step is to prepare the data for analysis. The Prepare phase is essentially the collection of the data that we'll be analyzing. This step is vital because it decides the size, shape, and type of the data for the subsequent steps.
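To make "size, shape, and type" concrete, here is a minimal sketch that summarizes a small CSV using only the Python standard library. The sample data, the column names, and the helper names `infer_type` and `summarize` are my own assumptions for illustration; they are not part of any particular tool.

```python
import csv
import io

def infer_type(values):
    """Guess a column type from its string values: int, float, or str."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except (ValueError, TypeError):
            continue
    return "str"

def summarize(csv_text):
    """Return row count, column count, and a guessed type per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    types = {c: infer_type([r[c] for r in rows]) for c in columns}
    return {"rows": len(rows), "columns": len(columns), "types": types}

# Hypothetical sample data, for illustration only.
SAMPLE = """customer_id,age,spend,region
1,34,120.50,North
2,45,80,South
3,,45.25,North
"""

summary = summarize(SAMPLE)
print(summary)
```

A quick summary like this, run as soon as the data is collected, shows early whether the dataset has the rows and columns the problem statement needs.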

[Image: graphs showing the data analysis process, including the Prepare phase. Photo by Luke Chesser on Unsplash]

Types of data

Data can be classified into 2 categories based on where it resides:

  1. Internal Data
  2. External Data

1. Internal Data

Internal data is the data that resides within the organization that you work for.

2. External Data

External data is the data that resides outside the organization that you work for.

Data can also be classified based on who owns it. Let me explain!

Data is classified into 3 categories based on its ownership. The categories are:

  1. First-Party Data
  2. Second-Party Data
  3. Third-Party Data

1. First-Party Data

Any data that you or your organization collect directly is called first-party data. This type of data collection involves conducting surveys, having people fill out forms, etc. The major advantage is that the collection can be tailored to the needs of the problem statement we are trying to solve.

The data is also more relevant to the problem statement, which reduces the work needed to get it ready for analysis.

Examples of first-party data include data from a CRM system, customer feedback, and survey data.

2. Second-Party Data

Second-party data is data collected and sold by another party. This party may be a business partner or an outsourced vendor. Second-party data may not be as relevant as first-party data, since the vendor has no connection to your problem statement, but it can still be useful.

It is a faster way to collect data, but a little riskier: you pay for it, and the data might not be entirely useful.

Examples of second-party data include data purchased from external vendors or partners.

3. Third-Party Data

Third-party data is data that is collected from external sources. These sources include, but are not limited to, the ones from the first-party and second-party categories. They can include surveys, feedback forms, data purchased from external parties, etc. Data from these sources is often stitched together to arrive at a final dataset.
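As a rough illustration of stitching sources together, the sketch below combines records from two sources and tags every row with where it came from. The source names, the field names, and the `stitch` helper are hypothetical, chosen only for this example.

```python
def stitch(sources):
    """Combine records from several sources, tagging each with its origin."""
    combined = []
    for source_name, records in sources.items():
        for record in records:
            # Copy the record and add a provenance label.
            combined.append({**record, "source": source_name})
    return combined

# Hypothetical sources and field names, for illustration only.
survey = [{"customer_id": 1, "rating": 4}, {"customer_id": 2, "rating": 5}]
purchased = [{"customer_id": 3, "rating": 2}]

dataset = stitch({"survey": survey, "purchased": purchased})
sources_seen = sorted({row["source"] for row in dataset})
print(len(dataset), sources_seen)
```

Keeping a `source` label on every row makes it easier to trace a questionable value back to where it came from later on.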

A disadvantage of third-party data is that its boundaries might not be clearly defined and can vary from source to source. This can introduce outliers into the data, which in turn increases the time spent in the Process phase.
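One common way to spot such outliers is the interquartile range (IQR) rule. To be clear, this is a general-purpose technique I am using for illustration, not something specific to third-party data, and the sample values and the `iqr_outliers` helper below are made up for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical order values; 500 is far from the rest.
order_values = [20, 22, 19, 21, 23, 20, 22, 500]

outliers = iqr_outliers(order_values)
print(outliers)
```

The multiplier `k=1.5` is only a conventional default. Whether a flagged value is a genuine error or a legitimate extreme still needs a human decision.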

Conclusion

To conclude, there are various methods of gathering data for your analysis, and there is no one-size-fits-all approach. The scale, urgency, cost, and various other factors influence the data collection process. All things considered, the one thing to keep in mind is that the quality of the data determines the quality of the insights you derive from it.

Happy Learning!
