Understanding Data Collection: A Closer Look at Our Data Analysis Project

Data Decoded
5 min read · Aug 6, 2023


Previous blog: Our Distributor Data Analysis Project Journey

Data and Information:

Data refers to raw and unprocessed facts or figures, often in the form of numbers, text, images, or any other format. It lacks context and meaning on its own. On the other hand, information is the result of processing and organizing data into a meaningful and useful form. It provides insights, knowledge, and understanding that can be used for decision-making and analysis.

Data Collection:

Data collection is the process of gathering and accumulating data from various sources to create a comprehensive dataset. It is a crucial step in any research, analysis, or project, as the quality and relevance of the data directly impact the outcomes and insights obtained.

Why Data Collection:

Collected data can help a business improve its services, understand consumer needs, refine its strategies, and grow and retain customers; it can even be sold to other businesses as second-party data at a profit.

Multiple Ways of Data Collection:

Data can be collected through many methods, including surveys and questionnaires, observations, interviews, social media monitoring, and the reuse of secondary data. Classified by how it originates, data may be system-generated (e.g., sensor data, images, EDI, web scraping), manually generated (e.g., manual data entry, form filling), or semi-generated (e.g., data scraping, data augmentation, manual annotation). Web scraping, for instance, takes only a few lines of code, as sketched below.
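As an illustration, a basic web-scraping pass needs little more than an HTTP request and an HTML parser. The snippet below is a minimal sketch using the requests and BeautifulSoup libraries; the URL and the "product-name" class are placeholders for illustration, not part of our project.

#minimal web scraping sketch (illustrative only)

import requests
from bs4 import BeautifulSoup

# placeholder URL, not from our project
url = "https://example.com/products"
response = requests.get(url, timeout=10)
response.raise_for_status()

# parse the HTML and collect the text of every element with a placeholder class
soup = BeautifulSoup(response.text, "html.parser")
names = [tag.get_text(strip=True) for tag in soup.find_all(class_="product-name")]
print(names)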

The 5 V’s of Big Data and Data Understanding:

  • Volume: Refers to the sheer amount of data generated and collected. In our project, the dataset comprises 513 MB across 64 folders and 722 files, a significant volume considering it spans multiple years and months.
  • Velocity: Represents the speed at which data is generated, collected, and updated. In our case, the data is updated monthly, indicating a regular velocity of data flow.
  • Variety: Refers to the diversity of data types and sources. Our project deals with data stored in different file formats (.xls and .xlsx files) and is categorized by years and months, showing a variety of data organization.
  • Veracity: Focuses on the reliability and trustworthiness of the data. Since the data is collected automatically through the distributor’s own app, it can be considered highly trustworthy as it avoids manual errors or manipulation.
  • Value: Represents the usefulness and insights that can be derived from the data. The data we collected includes sales information for two brands, “Godrej” and “Marico,” which can provide valuable insights into their performance and market trends.

Data Understanding in our Project:

In our data visualization project, we collected a voluminous dataset from a distributor in Pune. The data was stored on a pendrive, organized into folders by year and further categorized by month. Each month folder contained several files; in total, there were 64 folders holding 722 files, amounting to 513 MB of data.

A snippet displaying the files from the month of April 2018
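The folder, file, and size figures quoted above can be cross-checked with a short Python walk over the dataset root; the path below is a placeholder in the same style as the scripts later in this post.

#sketch: count folders, files, and total size under the dataset root

import os

# placeholder path; point it at the copied pendrive data
root = r"C:/Users/hp/Desktop/ALL DATA"

folder_count = 0
file_count = 0
total_bytes = 0

for dirpath, dirnames, filenames in os.walk(root):
    folder_count += len(dirnames)
    file_count += len(filenames)
    for name in filenames:
        total_bytes += os.path.getsize(os.path.join(dirpath, name))

print(f"{folder_count} folders, {file_count} files, {total_bytes / (1024 * 1024):.0f} MB")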

The data provided by the distributor covered three brands: “Godrej,” “Marico,” and a third brand. However, because the third brand had insufficient data, we narrowed our analysis to “Godrej” and “Marico” to keep our insights reliable and accurate.

To begin the data understanding process, we used Python to extract and segregate the files into two folders, “Input” and “Output.” The “Input” folder ended up with 108 files, while the “Output” folder contained 114 files. The two scripts below do the copying: the first collects the input files and the second the output files.

#for input files

import os
import shutil

#keyword lists for recognising input and output files
check_in = ['Purchase', 'Input', 'INPUT']
check_out = ['Output', 'OUTPUT']

#source and destination directories
source = r"C:/Users/hp/Desktop/ALL DATA/"
destin_IN = r"C:/Users/hp/Desktop/All_Input_Data/"
destin_OUT = r"C:/Users/hp/Desktop/All_Output_Data/"

#req variables
punc = "/"
source_1 = ""
source_2 = ""

#create the destination folder if it does not exist
os.makedirs(destin_IN, exist_ok=True)

#iterating over the year folders in the source directory -> 01
for x in os.listdir(source):
    #appending the year folder name to the source path -> 02
    source_1 = source + x
    #iterating over the month folders inside each year -> 01
    for x in os.listdir(source_1):
        source_2 = source_1 + punc + x + punc
        print(source_2)
        #iterating over the files inside each month folder -> 01
        for x in os.listdir(source_2):
            print(x)
            #check for the input keywords in the file name -> 03
            for y in check_in:
                if y in x:
                    #skip summary and GPI files -> 03
                    if 'SUMMARY' not in x and 'Summary' not in x and 'GPI' not in x:
                        print(x)
                        #copying the file from source to destination
                        shutil.copy(source_2 + x, destin_IN + x)
#for output files

import os
import shutil

#keyword lists for recognising input and output files
check_in = ['Purchase', 'Input', 'INPUT']
check_out = ['Output', 'OUTPUT']

#source and destination directories
source = r"C:/Users/hp/Desktop/ALL DATA/"
destin_IN = r"C:/Users/hp/Desktop/All_Input_Data/"
destin_OUT = r"C:/Users/hp/Desktop/All_Output_Data/"

#req variables
punc = "/"
source_1 = ""
source_2 = ""

#create the destination folder if it does not exist
os.makedirs(destin_OUT, exist_ok=True)

#iterating over the year folders in the source directory -> 01
for x in os.listdir(source):
    #appending the year folder name to the source path -> 02
    source_1 = source + x
    #iterating over the month folders inside each year -> 01
    for x in os.listdir(source_1):
        source_2 = source_1 + punc + x + punc
        print(source_2)
        #iterating over the files inside each month folder -> 01
        for x in os.listdir(source_2):
            print(x)
            #check for the output keywords in the file name -> 03
            for y in check_out:
                if y in x:
                    #skip summary and GPI files -> 03
                    if 'SUMMARY' not in x and 'Summary' not in x and 'GPI' not in x:
                        print(x)
                        #copying the file from source to destination
                        shutil.copy(source_2 + x, destin_OUT + x)
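The two scripts above differ only in the keyword list and the destination folder, so they can be folded into a single helper. The version below is an alternative sketch, not the code we ran: it uses pathlib, creates the destination folders if needed, and walks the whole tree instead of assuming exactly two levels of year and month folders.

#alternative sketch: one helper for both input and output files

import shutil
from pathlib import Path

SOURCE = Path(r"C:/Users/hp/Desktop/ALL DATA")
SKIP = ("SUMMARY", "Summary", "GPI")

def copy_matching(keywords, destination):
    # create the destination folder if it does not exist
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    # walk every file under the source tree
    for path in SOURCE.rglob("*"):
        if not path.is_file():
            continue
        # skip summary and GPI files
        if any(word in path.name for word in SKIP):
            continue
        # copy files whose names contain any of the given keywords
        if any(word in path.name for word in keywords):
            shutil.copy(path, destination / path.name)

copy_matching(["Purchase", "Input", "INPUT"], r"C:/Users/hp/Desktop/All_Input_Data")
copy_matching(["Output", "OUTPUT"], r"C:/Users/hp/Desktop/All_Output_Data")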

Input Folder includes:

  • Supplier Details: Information about the suppliers involved in the business transactions, such as their names, contact details, and other relevant data.
  • Details about Products Sold by Suppliers: Information about the products supplied by each supplier, including product names, codes, quantities, prices, and other relevant attributes.

Output Folder includes:

  • Retailer Details: Information about the retailers who purchased the products from the distributor, such as their names, contact details, and any other relevant data.
  • Details about Products Sold to Retailers: Information about the products sold to retailers, including product names, codes, quantities, prices, and other relevant attributes (a quick way to inspect these files is sketched below).
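Before moving on to preprocessing, it helps to open one file from each folder and confirm which columns it carries. The sketch below does this with pandas; the file names are hypothetical, and reading legacy .xls files requires the xlrd engine while .xlsx files use openpyxl.

#sketch: peek at one input and one output file with pandas

import pandas as pd

# hypothetical file names; substitute real files from the two folders
input_file = r"C:/Users/hp/Desktop/All_Input_Data/Purchase_Apr_2018.xls"
output_file = r"C:/Users/hp/Desktop/All_Output_Data/Output_Apr_2018.xlsx"

purchases = pd.read_excel(input_file)   # supplier-side data
sales = pd.read_excel(output_file)      # retailer-side data

# list the columns and preview the first few rows
print(purchases.columns.tolist())
print(sales.columns.tolist())
print(purchases.head())
print(sales.head())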

By understanding the data in terms of its volume, velocity, variety, veracity, and value, we can now proceed to apply data visualization techniques to derive valuable insights and actionable information. Through effective data visualization, we aim to transform this data into meaningful information, enabling stakeholders to make well-informed decisions and optimize business processes.

Next blog: Data Preprocessing: The Foundation of Effective Data Analysis
