Exploring the Different Levels of Granularity in Data Products

Bryan Yang
A multi hyphen life
5 min read · Mar 5, 2023

From Raw Data to Aggregated Insights

After the data has been verified and cleaned, it is ready to be aggregated (aggregation rolls fine-grained records up into coarser ones) to reduce data volume and storage space according to subsequent usage requirements.

For user behavior collected by an app, as soon as a user triggers an event, the data is sent back immediately and stored in our database or data lake. This means data is generated every second, or even every few hundred milliseconds, and the accumulated volume is enormous. If 100 GB of data is generated per day, that is 3 TB in a month and nearly 40 TB in a year, and this volume causes considerable trouble and cost for subsequent analysis. When downstream use does not require such detailed information, the data can be organized into an acceptable granularity through aggregation.
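As a minimal sketch, rolling millisecond-level events up to hourly counts in plain Python might look like the following (the event tuples, user IDs, and event names are invented for illustration):

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events: (timestamp, user_id, event_type),
# one row per event, recorded at millisecond resolution.
raw_events = [
    ("2023-03-05 10:15:01.123", "u1", "Open"),
    ("2023-03-05 10:47:30.456", "u2", "Impression"),
    ("2023-03-05 11:02:11.789", "u1", "Impression"),
]

# Roll up to hourly granularity: truncate each timestamp to the hour
# and count events per (hour, event_type) bucket.
hourly = Counter()
for ts, user, event in raw_events:
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f").strftime("%Y-%m-%d %H:00")
    hourly[(hour, event)] += 1

for key, n in sorted(hourly.items()):
    print(key, n)
```

Three raw rows collapse into three hourly buckets here, but on real traffic thousands of rows per hour collapse into a handful of counters, which is where the storage savings come from.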

We will use the following example data to discuss two aspects of data aggregation: the aggregation unit and the aggregation method.

Common aggregation units

Figure 1: Raw event data

Time

Time is the most commonly used unit for aggregation, and depending on how it is used, the time granularity can usually be divided into several levels.

  • Milliseconds — the rawest data-generation time
  • Minutes — less time-critical monitoring can be collected every minute or every 15 minutes
  • Hours — Much data is aggregated by hour, which is neither too fine nor too coarse, preserving flexibility for subsequent use.
  • Days — Days are the most commonly used unit of time, especially for user behavior activities such as daily active users, daily clicks, etc.
  • Months/Quarters/Years — These are mostly used for monthly or quarterly reports. Don't save data only at such a coarse time unit, otherwise you won't be able to do more detailed observation later.
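These truncation levels can be illustrated with Python's standard `datetime` (the timestamp below is invented):

```python
from datetime import datetime

ts = datetime(2023, 3, 5, 14, 37, 21, 456000)  # a hypothetical raw event time

# Truncate the same timestamp to progressively coarser units.
by_minute = ts.replace(second=0, microsecond=0)
by_quarter_hour = ts.replace(minute=(ts.minute // 15) * 15, second=0, microsecond=0)
by_hour = ts.replace(minute=0, second=0, microsecond=0)
by_day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
by_month = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

print(by_quarter_hour)  # 2023-03-05 14:30:00
```

Grouping events by any of these truncated keys yields the corresponding aggregation level.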

Location

When collecting user data, it often includes information about the user’s location at the time the signal was sent.

  • GPS — GPS can describe the user’s location very accurately, although there may be some errors depending on the collection device.
  • Base stations — In the case of telecommunication companies, the location of the base station to which the user is connected can be accurately known, and in metropolitan areas, the user’s location can even be pinpointed through multiple base stations.
  • Quadkey — Also known as the Bing Maps Tile System, this divides the map into small square tiles to mark locations; at the finest level, a quadkey is 23 digits long.
  • Township/County/Country — The most common way to label locations.
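For illustration, a quadkey can be computed from a latitude/longitude pair following the published Bing Maps tile-system math; the coordinates below are arbitrary examples:

```python
import math

def quadkey(lat, lon, level):
    """Encode a lat/lon pair as a Bing Maps quadkey at the given zoom
    level (level 23 is the finest)."""
    n = 2 ** level
    sin_lat = math.sin(math.radians(lat))
    # Web Mercator projection to tile coordinates.
    x = int((lon + 180.0) / 360.0 * n)
    y = int((0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)) * n)
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    # Interleave the bits of x and y, most significant bit first.
    digits = []
    for i in range(level, 0, -1):
        digit = 0
        mask = 1 << (i - 1)
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

k = quadkey(25.0330, 121.5654, 23)  # example coordinates, roughly Taipei
print(len(k), k[:10])
```

Quadkeys make aggregation easy because truncation coarsens the location: every point whose level-23 quadkey shares the same 10-digit prefix falls inside the same level-10 tile.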

Any detail that can be omitted can serve as an aggregation unit; some common ones are listed here for the reader's reference.

Other Units

  • User/Device ID — Actions taken by the same user within a unit of time can often be aggregated together.
  • Event Type — Likewise, occurrences of the same event type within a unit of time can often be aggregated together.

Aggregation Method

Since we are converting fine granularity into coarse granularity, only some of the information can be retained, so the aggregation method determines what is kept. When using aggregation formulas, check whether the formula really conveys the meaning you intend.

Figure 2: Aggregated data

Count / Distinct Count

Counting is the simplest and least risky way to aggregate. For example, in the data in Figure 2, a count is used to tally events such as "Open" and "Impression". However, once data has been aggregated by counting, counting it again means something different, so be careful when re-aggregating counts.
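The distinct-count pitfall in particular can be shown with a small sketch (the dates and user IDs are hypothetical):

```python
# Hypothetical raw rows: (date, user_id), one row per event.
daily_events = [
    ("2023-03-05", "u1"), ("2023-03-05", "u1"), ("2023-03-05", "u2"),
    ("2023-03-06", "u1"), ("2023-03-06", "u2"),
]

# Daily distinct users: each user is counted once per day.
daily_users = {}
for day in {d for d, _ in daily_events}:
    daily_users[day] = len({u for d, u in daily_events if d == day})
# daily_users == {"2023-03-05": 2, "2023-03-06": 2}

# Pitfall: summing the daily distinct counts does NOT give the distinct
# user count for the whole period; u1 and u2 are each counted twice.
wrong_total = sum(daily_users.values())          # 4
true_total = len({u for _, u in daily_events})   # 2
```

Plain counts can be rolled up further by summing, but distinct counts cannot; if you need distinct users per month, you must keep the user IDs (or an approximation such as a sketch) at the daily level.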

Sum

If you want to calculate the number of events for each month, you can't simply count the rows again; you need to sum the event counts recorded each day. If summed numbers are to be aggregated again (e.g. hour -> day, day -> month), the result can usually be obtained by summing them directly.
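A minimal sketch of rolling daily counts up to monthly totals (the figures are invented):

```python
from collections import defaultdict

# Hypothetical per-day event counts (already aggregated once by counting).
daily_counts = {
    "2023-03-01": 120,
    "2023-03-02": 95,
    "2023-04-01": 80,
}

# Day -> month roll-up: sum the daily counts. Counting the rows again
# would give 2 for March, not the 215 events that actually occurred.
monthly = defaultdict(int)
for day, n in daily_counts.items():
    monthly[day[:7]] += n  # "2023-03-01" -> month key "2023-03"

# monthly == {"2023-03": 215, "2023-04": 80}
```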

Various division methods (e.g. average, click-through rate)

Regardless of the granularity at which you calculate a rate or an average, you need to reduce the division formula to its numerator and denominator at that granularity, and only then divide. Take the click-through rate (Click count / Impression count) as an example.

  • The formula for the daily click-through rate is:
    total daily Clicks / total daily Impressions
  • The formula for the monthly click-through rate is:
    sum of daily Click counts over the month / sum of daily Impression counts over the month
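A small sketch of why the numerator and denominator must be summed first rather than averaging the daily rates (the daily figures are invented):

```python
# Hypothetical daily counts: (clicks, impressions).
daily = {"day1": (10, 100), "day2": (90, 300)}

# Wrong: averaging the daily click-through rates weights each day
# equally, regardless of how many impressions it had.
avg_of_rates = sum(c / i for c, i in daily.values()) / len(daily)  # 0.2

# Right: reduce to numerator and denominator first, then divide once.
clicks = sum(c for c, _ in daily.values())       # 100
impressions = sum(i for _, i in daily.values())  # 400
period_ctr = clicks / impressions                # 0.25
```

The two numbers disagree whenever traffic is uneven across days, which is the usual case, so storing the summed numerator and denominator at each granularity is the safe choice.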

Other common aggregation formulas

  • Maximum/Minimum — Often used to check whether there are anomalies in the data; it is worth computing the maximum and minimum whenever numerical data is collected.
  • Median — The 50th percentile; it tells you the midpoint of the data distribution.
  • The nth percentile — Gives a fuller picture of the data distribution.
  • Standard deviation — Also used to check the shape of the distribution. Since its formula is more involved, it needs extra care when aggregating.
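All of these formulas are available in Python's standard `statistics` module; a sketch with invented per-event latency values:

```python
import statistics

# Hypothetical per-event latencies in milliseconds; 200 is an outlier
# that max/percentiles will surface while the median ignores it.
latencies_ms = [12, 15, 14, 200, 13, 16, 15, 14]

summary = {
    "min": min(latencies_ms),
    "max": max(latencies_ms),
    "median": statistics.median(latencies_ms),
    "p95": statistics.quantiles(latencies_ms, n=100)[94],  # 95th percentile
    "stdev": statistics.stdev(latencies_ms),
}
print(summary)
```

Note that, unlike counts and sums, medians and percentiles of sub-groups cannot be combined into the median or percentile of the whole; if you need to re-aggregate them later, keep the raw values (or an approximating sketch) as well.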

In conclusion, data aggregation is an essential process in data analysis, allowing for the reduction of data calculation time, storage space, and cost. The choice of aggregation unit and method depends on the data’s subsequent usage requirements, and caution must be exercised when using aggregation methods to avoid losing important information. When done correctly, data aggregation can provide valuable insights for decision-making across various industries.

