Types of data product — Raw Data

Bryan Yang
A multi hyphen life
4 min read · Nov 4, 2022

Garbage in, garbage out

Just as rice is to a bowl of cooked rice and wheat is to noodles, raw data is the most basic ingredient of any data product.

When I was studying statistics in graduate school, my professor often repeated the phrase “garbage in, garbage out.” Statistics will produce a result from whatever data you feed it, but for that result to be meaningful, the data behind the analysis must be of good quality. The quality and type of the data directly affect the final quality of the data product.


Here are some familiar fallacies caused by poor or biased data:

Some trolls claim that annual incomes start at $200,000

Someone is always sharing their salary on the internet: it has only been three years since they changed jobs, or they are making $100,000 fresh out of college. After seeing enough of these posts, you might think that salaries in this world start at $100,000. We all know this is heavily biased, but what exactly is wrong with it? Let’s look at the following points.

Is the data fraudulent?

Any data collected from end users is likely to be biased, whether through faulty memory or deliberate misinformation. Most surveys collect respondents’ basic information, including gender, age, occupation, and income.

In a face-to-face interview we can at least judge gender and age by sight, but in a telephone interview or online survey even these cannot be confirmed directly, let alone occupation and income. Some argue that the anonymity of the internet makes respondents more willing to answer truthfully, but as the internet becomes less anonymous (and individuals can be traced back through IDs or IP addresses), the truthfulness of such surveys is increasingly in question.

Survivorship bias?

Survivorship bias is a selection fallacy (see Wikipedia: survivorship bias). Dcard is a social platform that was used mostly by university students in Taiwan, so the people who stayed and posted on it tended to be well educated and have some work experience; naturally, the starting salaries they reported were much higher than those in the government’s survey.

Sampling error due to bad sampling design

If we can’t survey everyone’s salary for practical reasons, then let’s sample. Sampling itself is a good idea, but how do we do it? Distributing a survey online seems like a fairly random method, but it is still limited to “all possible respondents the survey can reach.” For example, if you only post the survey on Dcard, only Dcard users can fill out the questionnaire; if you post it on TikTok instead, you reach a different group of respondents.

Even with random-digit dialing, you can only reach people who have phone numbers. And if one person owns a thousand phone numbers, their chance of being selected for an interview increases dramatically.

Even if the sample is well-designed, there are still sampling errors
Sampling itself carries a certain degree of statistical uncertainty. Appropriate statistical methods can help us quantify the impact of this sampling error, but errors still occur! Medical tests are also statistical in nature, so a test may occasionally indicate pregnancy in a man or fail to detect it in a pregnant woman; multiple tests are needed to confirm the result.
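A short simulation makes sampling error concrete. This sketch (all numbers are synthetic) draws a sample of 400 salaries from a made-up population and computes an approximate 95% confidence interval for the mean; the interval quantifies the uncertainty, but the true mean still falls outside it about 5% of the time.

```python
import math
import random
import statistics

# Hypothetical example: a synthetic "population" of 100,000 salaries,
# of which we only survey a random sample of 400.
random.seed(42)
population = [random.lognormvariate(11, 0.5) for _ in range(100_000)]
sample = random.sample(population, 400)

mean = statistics.mean(sample)
# Standard error of the mean and a ~95% confidence interval (normal approx.)
se = statistics.stdev(sample) / math.sqrt(len(sample))
ci = (mean - 1.96 * se, mean + 1.96 * se)

print(f"population mean: {statistics.mean(population):,.0f}")
print(f"sample mean:     {mean:,.0f}")
print(f"95% CI:          {ci[0]:,.0f} .. {ci[1]:,.0f}")
```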

How about big data?

It is true that big data helps with the sampling problem, but big data cannot directly tell us whether the data itself is true or false (although we can verify it by cross-checking against other data sources). Moreover, when there is too much data, processing and computation become troublesome in their own right, and we need other means to reduce the complexity of data processing or the computation time.
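One common way to tame processing cost is to work on a uniform subset of the data instead of all of it. As one illustration (not the only technique), here is a minimal sketch of reservoir sampling, which draws k items from a stream of unknown length in a single pass without loading everything into memory:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from an iterable of unknown length in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Each later item replaces a random reservoir slot with probability k/(i+1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=1000, seed=7)
print(len(sample))  # 1000
```

The subset can then stand in for the full dataset in exploratory analysis, at the cost of reintroducing (quantifiable) sampling error.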

Other possible biases in data collection

Even when we use electronic products (e.g., apps) to collect usage data such as clicks and page views, which seem well defined and fully automated, errors can still occur: repeated clicks by users, fake clicks generated by crawlers, or records lost to network transmission failures. There is no infallible way to collect data!
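As an illustration of the kind of cleaning this implies, here is a naive sketch (the log format, bot markers, and one-second window are all assumptions made for the example) that drops suspected crawler traffic and collapses rapid repeat clicks:

```python
from datetime import datetime, timedelta

# Hypothetical raw click log: (user_id, user_agent, timestamp)
clicks = [
    ("u1", "Mozilla/5.0", datetime(2022, 11, 4, 10, 0, 0)),
    ("u1", "Mozilla/5.0", datetime(2022, 11, 4, 10, 0, 0, 300_000)),  # double-click
    ("u2", "Googlebot/2.1", datetime(2022, 11, 4, 10, 1, 0)),         # crawler
    ("u1", "Mozilla/5.0", datetime(2022, 11, 4, 10, 5, 0)),
]

BOT_MARKERS = ("bot", "crawler", "spider")   # naive user-agent heuristic
DEDUP_WINDOW = timedelta(seconds=1)          # collapse clicks within 1s per user

def clean(events):
    last_seen = {}
    kept = []
    for user, ua, ts in sorted(events, key=lambda e: e[2]):
        if any(m in ua.lower() for m in BOT_MARKERS):
            continue  # drop suspected crawler traffic
        if user in last_seen and ts - last_seen[user] < DEDUP_WINDOW:
            continue  # drop rapid repeat clicks
        last_seen[user] = ts
        kept.append((user, ua, ts))
    return kept

print(len(clean(clicks)))  # 2
```

Real pipelines use far more robust bot detection, but the point stands: even "automatic" data needs cleaning rules, and those rules themselves introduce judgment calls.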

Consciously handling possible biases in raw data

We need raw data for analysis, but it is impossible to avoid every kind of bias in it; that is the reality we face. We therefore need to treat every “insight” with care when analyzing data.

For example, if you collect Dcard data with a crawler and find that the average salary is $100,000, you need to trace back to which pages the data came from and the context in which people mentioned their salary: were they joking, or genuinely discussing and seeking advice?
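A toy sketch of that tracing step (the post format and joke markers are invented for illustration; real context classification is much harder than keyword matching):

```python
# Hypothetical crawled posts that mention a salary; before averaging the
# numbers, tag each mention with the context it appeared in.
posts = [
    {"board": "job",    "text": "Three years in backend, salary 100k, AMA"},
    {"board": "joke",   "text": "My salary is 100k... in Monopoly money lol"},
    {"board": "salary", "text": "Fresh grad offer 100k, is this normal?"},
]

JOKE_MARKERS = ("lol", "joke", "monopoly")  # naive heuristic, purely an assumption

def tag_context(post):
    text = post["text"].lower()
    if post["board"] == "joke" or any(m in text for m in JOKE_MARKERS):
        return "joking"
    return "serious"

serious = [p for p in posts if tag_context(p) == "serious"]
print(len(serious))  # 2
```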

When you find that your app has lost 80% of its daily active users (DAU), don’t panic yet: first list the data sources used in your analysis report and check whether any part of the data is missing or was handled incorrectly.
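A sketch of that sanity check, assuming daily row counts per date are available from the pipeline (the numbers and thresholds here are invented): a missing day, or a day with suspiciously few rows, often explains a "crashing" DAU chart better than any product change.

```python
from datetime import date, timedelta

# Hypothetical daily event counts loaded from the pipeline.
daily_rows = {
    date(2022, 11, 1): 1_204_331,
    date(2022, 11, 2): 1_198_874,
    # 2022-11-03 is missing entirely
    date(2022, 11, 4): 240_112,   # partial load?
}

def audit(counts, start, end, drop_threshold=0.5):
    """Flag missing days and days that fall below drop_threshold of the previous day."""
    issues = []
    day, prev = start, None
    while day <= end:
        n = counts.get(day)
        if n is None:
            issues.append((day, "missing partition"))
        elif prev and n < prev * drop_threshold:
            issues.append((day, f"dropped to {n / prev:.0%} of previous day"))
        if n is not None:
            prev = n
        day += timedelta(days=1)
    return issues

for day, problem in audit(daily_rows, date(2022, 11, 1), date(2022, 11, 4)):
    print(day, problem)
```

Only after this kind of audit comes back clean is it worth treating the drop as a real user-behavior signal.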


Data Engineer, Data Product Manager, Data Solution Architect