Data Collection & Its Techniques


Welcome to the comprehensive guide, where we’ll delve into various data-gathering techniques used in data science. From traditional methods like surveys to modern approaches like web scraping, we’ll cover a wide range of techniques to help you understand how data is collected and prepared for analysis.

Imagine your manager has given you a task to analyze data provided by the company, along with a question like: “Analyze the data and tell me how our company can become profitable by next year.”

Ask Questions:

Your first step would be to ask relevant questions just by looking at the data, such as: “Which features are important for the analysis, and which are not?”, “Do I need to do any data preprocessing?”, and so on.

Data Collection:

The next step is to collect the data, a process known as data gathering.

Now, do not assume that the data will always come in .csv format. Sometimes you might need to collect data from online resources through an API, pull it from a database (such as SQL), or work with files in other formats like JSON or Excel.

Sometimes you will need to scrape data as well. This is where data gathering comes into the picture. Let’s understand it in detail.


Data Assessment:

In this process, your task is to gather basic information about the data, such as (a short pandas sketch follows this list):

  1. Finding the number of rows/columns (shape): to determine the dimensions of the dataset.
  2. Data types of each column (dtypes): to understand the type of data and interpret each column accordingly.
  3. Number of missing values in each column (isnull().sum()): checking missing values is important, as they might cause bias or inaccurate estimates in your analysis.
  4. Checking for duplicate values (duplicated().sum()): to avoid skewed results.
  5. Statistics about the data (describe()): to get a mathematical overview of each column.
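
As a quick illustration, here is a minimal sketch of these checks using pandas; the file name is a placeholder, and the data is assumed to load cleanly into a DataFrame:

```python
import pandas as pd

# Load the dataset (the file name is a placeholder for illustration)
df = pd.read_csv("company_data.csv")

print(df.shape)               # 1. dimensions: (rows, columns)
print(df.dtypes)              # 2. data type of each column
print(df.isnull().sum())      # 3. missing values per column
print(df.duplicated().sum())  # 4. number of fully duplicated rows
print(df.describe())          # 5. summary statistics for numeric columns
```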

Why Is Data Gathering Important?

Data gathering, or data collection, is the backbone of the data science process. It is what enables you to analyze the data and make relevant, data-driven decisions.


Traditional Methods:

  • Surveys: Surveys are a classic method of collecting data by posing a series of questions to a large population. For example, imagine a company wants to understand customer satisfaction. They could create a survey asking customers to rate their experience on a scale of 1 to 5.
  • Interviews: Interviews involve direct interaction between an interviewer and a respondent to gather detailed insights from a select group of individuals. Let’s say a researcher wants to understand people’s eating habits. They might conduct interviews asking individuals about their dietary preferences, cooking habits, and food choices.
  • Observations: Observation entails systematically watching and recording behaviors, events, or activities in their natural surroundings. For instance, a wildlife biologist might observe animals in their natural habitat to study their social interactions and feeding patterns.

Advanced Techniques:

  • Web Scraping: Web scraping is a modern technique used to extract data from websites. Imagine a company that wants to monitor competitor prices. They could use web scraping to collect pricing information from competitor websites automatically (a minimal sketch follows this list).
  • API: API access allows retrieving data from web services through predefined interfaces. For example, a weather app might use an API provided by a meteorological service to fetch real-time weather data for a given location (see the API sketch below).
  • Database Querying: Database querying involves extracting data from relational databases using SQL queries. A company, for instance, might query its customer database to retrieve information about purchase history, demographics, and preferences (see the querying sketch below).
  • File Parsing: File parsing entails reading data from various file formats such as CSV, Excel, JSON, or XML. A data analyst, for instance, might parse a CSV file containing sales data to analyze revenue trends over time (see the parsing sketch below).
  • Public Datasets: Public datasets are freely available datasets provided by government agencies, research institutions, or open data initiatives. Data scientists, for instance, might use public health datasets to analyze disease trends and patterns (see the loading sketch below).
  • Social Media Scraping: It involves extracting data from social media platforms like Twitter, Facebook, or Instagram. A marketing team, for example, might scrape Twitter to analyze customer sentiment about a product launch.
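
To make these techniques concrete, here are a few minimal Python sketches. First, web scraping with the requests and BeautifulSoup libraries; the URL and the CSS class names are assumptions about a hypothetical competitor page, and any real site’s terms of service and robots.txt should be checked first:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical product-listing page (placeholder URL)
URL = "https://example.com/products"

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# The "product" and "price" class names are assumed page structure
for product in soup.find_all("div", class_="product"):
    name = product.find("h2").get_text(strip=True)
    price = product.find("span", class_="price").get_text(strip=True)
    print(name, price)
```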
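
For API access, the sketch below calls the free Open-Meteo weather API, which requires no API key; the endpoint and parameter names follow its public documentation, but verify them before relying on this:

```python
import requests

# Open-Meteo's public forecast endpoint (no API key required)
url = "https://api.open-meteo.com/v1/forecast"
params = {
    "latitude": 52.52,        # Berlin, as an example location
    "longitude": 13.41,
    "current_weather": True,  # ask for the current-conditions block
}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()

data = response.json()
print(data["current_weather"]["temperature"])
```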
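
For database querying, here is a minimal sketch using Python’s built-in sqlite3 module; the database file and the orders table schema are hypothetical stand-ins for a real customer database:

```python
import sqlite3

# Hypothetical local database file
conn = sqlite3.connect("customers.db")

# Assumed schema: orders(customer_id, amount, order_date)
query = """
    SELECT customer_id,
           COUNT(*)    AS num_orders,
           SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10;
"""

# Top ten customers by total spend
for row in conn.execute(query):
    print(row)

conn.close()
```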
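
For file parsing, pandas reads most common tabular formats directly (reading Excel additionally requires an engine such as openpyxl); the file names below are placeholders:

```python
import json

import pandas as pd

# CSV: the most common interchange format
sales = pd.read_csv("sales.csv")

# Excel: needs openpyxl (or a similar engine) installed
inventory = pd.read_excel("inventory.xlsx")

# JSON: pandas for tabular data...
reviews = pd.read_json("reviews.json")

# ...or the standard library for nested structures
with open("config.json") as f:
    config = json.load(f)

print(sales.head())
```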
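
Finally, for public datasets, many libraries ship loaders for well-known open data; as one small example, seaborn can fetch its sample datasets (here the classic Titanic passenger data) from its public repository:

```python
import seaborn as sns

# Downloads the dataset from seaborn's public data repository on first use
titanic = sns.load_dataset("titanic")

print(titanic.shape)
print(titanic["survived"].mean())  # overall survival rate
```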

Enhancing Online Sales Strategy:

Let’s consider an example where a retail company wants to improve its online sales strategy by analyzing customer behavior on its e-commerce platform.

In this scenario, we’ll explore how advanced data-gathering techniques can be applied:

  • Web Scraping: The retail company could use web scraping to gather data from competitor websites to analyze pricing strategies, product offerings, and customer reviews. By scraping data from competitor sites, they can gain insights into market trends and competitive positioning.
  • API Access: The company could integrate with third-party APIs such as Google Analytics or Adobe Analytics to access detailed information about user interactions on their e-commerce platform. These APIs provide data on website traffic, user demographics, browsing behavior, and conversion rates, allowing the company to optimize its online sales funnel.
  • Database Querying: By querying its transactional database, the company can extract valuable insights about customer purchase history, product preferences, and order fulfillment. They can analyze patterns such as repeat purchases, cross-selling opportunities, and seasonal trends to tailor their marketing and sales strategies.
  • File Parsing: The company may receive data feeds from suppliers or partners in various file formats such as CSV or XML. By parsing these files, they can extract product information, inventory levels, and pricing updates to keep their e-commerce platform up-to-date with the latest offerings.
  • Social Media Scraping: Social media scraping can be used to monitor conversations, mentions, and sentiments about the company’s brand and products across platforms like Twitter, Facebook, and Instagram. By analyzing social media data, the company can identify customer preferences, influencers, and emerging trends to inform their marketing campaigns.

By employing advanced data-gathering techniques like web scraping, API access, database querying, file parsing, and social media scraping, the retail company can gain deep insight into customer behavior and market dynamics, and make data-driven decisions that improve its online sales strategy and drive business growth.

Conclusion:

Mastering data-gathering techniques is essential for any data science practitioner.

By leveraging the right tools and methods, one can unlock valuable insights from diverse data sources, paving the way for informed decision-making and impactful analysis.

Whether it’s scraping data from the web, accessing APIs, querying databases, parsing files, or leveraging public datasets, the possibilities are endless in the world of data science.
