Sampling Bias When Using Alternative Data


Key Takeaways
Unique Sampling Bias of Alternative Data
Users generating data with sensors are unlikely to represent the population. Social media users are usually young and tech-savvy, representing a rather specific demographic group. Search data, on the other hand, can be used in econometric modeling more straightforwardly.
Tools to Control Bias
Academics and practitioners have been trying different tools to control sampling bias when using alternative data. The nature of certain types of alternative data can itself be advantageous, while reweighting and careful panel construction can also be used.
Case Study: EA Twitter Panel
Eagle Alpha’s panel was composed of almost 3,000 individuals on Twitter who had confirmed a successful pre-order of the Apple Watch. We identified these individuals by analyzing 3 million tweets using various data science techniques. The panel reported faster than expected delivery times, with 24% of those who originally had a June delivery time already receiving their watch.
Introduction
Alternative data, including social media and web traffic data, can be used for a variety of purposes: opinions, reviews, sentiment. However, selection bias can be a real issue, as these sources and their users might not represent the overall population. This article discusses selection bias when using alternative data, including social media and search data.
Compared to traditional data, alternative sources allow us to integrate much greater volumes of data into our analysis. More data to work with can lead to better insights. However, when using alternative data, it is important to control for selection bias and to report findings in an appropriate way.
Marketing professionals have been enjoying the Big Data revolution with new tools allowing them to target audiences much more efficiently, but the accuracy of cookie-based targeting still has to be improved (figure 1).
Fig. 1 Percent of Ad Impressions Delivered to Correct Demographic


Demographic characteristics of the targeted population are important whether you are trying to formulate an investment thesis or run a marketing campaign. Figure 2 shows the demographic characteristics of different platforms. This work is still in the early stages of development, but new tools such as machine learning are being increasingly used within the asset management community. Machine learning, in particular, can be used with alternative data sources to automatically identify and cleanse the samples analysts are working with.
Fig. 2 Demographic Characteristics of Social Media Platforms


Alternative Data Characteristics in Relation to Sampling Bias
Let us make the definitions clear. What is meant by selection/sampling bias in general? Martha K. Smith, Mathematics Professor at the University of Texas, explains: “A sampling method is called biased if it systematically favors some outcomes over others. Sampling bias is sometimes called ascertainment bias or systematic bias.” Sampling bias arises from error in how the sample is selected: if a specific group within a sample is over- or underrepresented relative to its share of the population, the outcome will be biased.
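A minimal simulation makes the effect concrete. Assuming a hypothetical population in which 30% of users hold some opinion, and a sampling channel that reaches opinion-holders three times as often (all numbers are illustrative, not from the article), the sampled share lands far from the true 30%:

```python
import random

random.seed(42)  # reproducible draw

# Hypothetical population: 30% of users hold some opinion (1), 70% do not (0).
population = [1] * 300 + [0] * 700

# Biased channel: opinion-holders are three times as likely to be sampled.
weights = [3 if x == 1 else 1 for x in population]
biased_sample = random.choices(population, weights=weights, k=1000)

true_share = sum(population) / len(population)
sampled_share = sum(biased_sample) / len(biased_sample)

print(true_share)               # 0.3
print(round(sampled_share, 2))  # well above 0.3, despite the large sample
```

Note that increasing the sample size does nothing here: the estimate converges to the biased value, not the true one.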
Alternative data sources can help us extract a lot of hidden insights, but the analytical output will not be that useful if the sampling bias cannot be controlled. In their 2015 paper ‘Big Data and Urban Informatics: Innovations and Challenges to Urban Planning and Knowledge Discovery’, Thakuriah, Tilahun and Zellner discussed their methodologies when using Big Data.
Users who generate data using sensors most likely do not represent the population. Social media users are usually young and tech-savvy, representing a rather specific demographic group. Concentration might occur in certain areas, but patterns always change as new technologies and services are adopted by a larger portion of the population. As the researchers put it, “technology changes rapidly and there would always be the issue of the first adopters with specific, non-representative demographics and use patterns.”
Passive users do not generate enough data, and that can also be an issue. Fake accounts and reviews designed to imitate positive or negative sentiment are becoming prominent. Herding and a lack of independent behavior could also affect users’ decisions. It is therefore often necessary to reweight samples, with the research scope determining the specific weights.
Search data, on the other hand, can be used in econometric modeling more straightforwardly. Nick McLaren of the Bank of England presented the benefits of this type of data and described how it could be used to build economic indicators. This type of data is timely and can cover massive samples. The biggest differentiating characteristic compared to traditional surveying is that search data is a by-product of other activity. Inaccurate responses or non-responses to survey questions do not apply in this case. There is also the added flexibility to analyze unexpectedly arising issues rather than relying only on pre-determined survey questions.
There are some disadvantages to search data too. The data has a short backrun, since widespread internet use is a relatively recent phenomenon. High correlations are observed between internet use and factors like income and age. These issues might affect whether the sample is representative. There is a further problem in how different users go about searching: they might enter completely different queries when they actually want to find out about the same topic, while others use the same query with different intentions. It is up to data scientists to develop a specific methodology for each particular case that deals with these issues, cuts through the white noise and delivers actionable insights.
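One common first step toward such a methodology is to canonicalize queries before counting them, so that different phrasings of the same intent map to one topic. A minimal sketch; the patterns and topic labels below are purely illustrative, not from any cited methodology:

```python
import re

# Hypothetical canonicalization table: regex pattern -> topic label.
CANONICAL_TOPICS = {
    r"\b(jobless|unemploy\w*)": "unemployment",
    r"\b(mortgage|home loan)": "housing_finance",
}

def canonical_topic(query):
    """Map a raw search query to a canonical topic, or 'other'."""
    q = query.lower()
    for pattern, topic in CANONICAL_TOPICS.items():
        if re.search(pattern, q):
            return topic
    return "other"

print(canonical_topic("Jobless benefits application"))  # unemployment
print(canonical_topic("best mortgage rates"))           # housing_finance
print(canonical_topic("weather tomorrow"))              # other
```

The reverse ambiguity (one query, many intents) cannot be resolved by a lookup table alone; it typically requires context such as follow-up queries or click behavior.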
Academic Studies on Big Data and Sampling
Academic researchers are constantly trying to work with new types of data, establishing correlations and testing predictive power. Data science and machine learning techniques have been used in recent years to analyze massive amounts of data and extract actionable insights. Pedro Domingos, Professor of Computer Science & Engineering at the University of Washington, summarizes key practical lessons learned from practitioners and researchers using machine learning. In particular, he mentions the importance of controlling bias and variance (figure 3).
Fig. 3 Bias and Variance in Dart-Throwing


In their 2014 paper ‘Data Mining with Big Data’, Wu et al. propose the HACE Theorem: Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships among data. The researchers mention that data mining should be performed with care, as biased data at any point can lead to biased models. They advise establishing an information exchange when working with a wide variety of data sources so that a global optimization can be achieved.
In his 2014 paper ‘Reducing Sampling Bias in Social Media Data for County Health Inference’, Culotta discusses the use of social media data for health monitoring of issues like influenza, depression, dental pain and insomnia. The large sample sizes that can be collected from social media are what appeals to researchers, especially for monitoring diseases that can spread quickly.
Culotta used lexical patterns in tweets with geo-locations to estimate county health statistics. He used LIWC (The Linguistic Inquiry and Word Count) and PERMA (Positive emotion, Engagement, Relationships, Meaning, and Accomplishment) lexicons. These lexicons were chosen for this particular case to answer the research question as the categories within the lexicons relate to health and personality.
These lexical patterns helped identify specific county characteristics, for example: “Counties that use more positive emotional terms (“happy”, “best”) tend to report greater socioemotional support on government surveys; counties that use more profanity and more frequently discuss sports and television tend to have higher obesity rates.” Statistics such as obesity, access to healthy foods, and diabetes were estimated based on Twitter activity. Figure 4 shows the whole list of dependent variables used in regression models.
Fig. 4 Dependent Variables in Regression Models for Health Monitoring


Culotta used reweighting approaches based on race and gender to reduce the selection bias. For example, if a county population is 60% female but the Twitter estimate of the study shows only 30%, then tweets from each female user were counted twice. Machine learning techniques were used to infer the race and gender characteristics of the samples, and each county’s data was then compared with U.S. Census demographics. Survey reweighting was employed to adjust predictions. The approach yielded impressive results: held-out prediction error was reduced by 4.3% on average, with improvements for 20 out of the 27 variables (figure 5).
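The core of this reweighting step is post-stratification: each group's weight is its population share divided by its sample share. A minimal sketch using the 60%/30% female example above; the per-user signal values are hypothetical:

```python
def poststratification_weights(pop_shares, sample_shares):
    """Per-group weight = population share / sample share."""
    return {g: pop_shares[g] / sample_shares[g] for g in pop_shares}

# The example above: the county is 60% female, but the Twitter
# sample is only 30% female (shares given as integer percentages).
pop = {"female": 60, "male": 40}
sample = {"female": 30, "male": 70}
w = poststratification_weights(pop, sample)
print(w["female"])  # 2.0 -> each female-authored tweet counts twice

# Applying the weights to a per-user signal (values are hypothetical):
signal = [(1.0, "female"), (0.0, "male"), (0.5, "male")]
weighted = sum(v * w[g] for v, g in signal) / sum(w[g] for _, g in signal)
print(round(weighted, 3))  # 0.727
```

In practice the group memberships themselves are inferred (here, by Culotta's machine learning classifiers), so classification error feeds directly into the weights.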
Fig. 5 Symmetric Mean Absolute Percentage Error Improvement


Case Study: Eagle Alpha’s Apple Watch Panel
Our tools enable us to identify a relevant group of the population that we can draw insights from. Eagle Alpha’s panel for the Apple Watch reported faster than expected delivery times, with 24% of those who originally had a June delivery time already receiving their watch.
This group was composed of almost 3,000 individuals on Twitter who had confirmed a successful pre-order of the Apple Watch. We identified these individuals by analyzing 3 million tweets using various data science techniques. From their public discussions on Twitter, we analyzed their Apple Watch-related tweets to derive the key topics and views on the watch (figure 6).
Fig. 6 Topics from Eagle Alpha’s Apple Watch Panel


Our Apple Watch panel was, unsurprisingly, focused on the delivery status of their pre-ordered watches. Fortunately for them, 65% had received their Apple Watch, while 25% indicated that they would still need to wait until June 2015 for shipment.
The topic modeling tool is based on state-of-the-art algorithms that our in-house data science team has fine-tuned to obtain the most accurate results for each of our research domains. It enables Eagle Alpha research analysts to determine sets of topics from long-term datasets and short-term streams of breaking content.
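As an illustration only (this is not Eagle Alpha's actual tool), the basic idea of surfacing dominant terms from a tweet stream can be sketched with simple frequency counting; the tweets and stopword list below are hypothetical:

```python
import re
from collections import Counter

# Hypothetical stopword list; production systems use much larger ones.
STOPWORDS = {"the", "a", "an", "my", "of", "to", "on", "i", "is", "it", "still", "up"}

def top_terms(tweets, n=3):
    """Return the n most frequent non-stopword terms across the tweets."""
    counts = Counter()
    for tweet in tweets:
        for token in re.findall(r"[a-z']+", tweet.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return [term for term, _ in counts.most_common(n)]

# Hypothetical panel tweets: delivery chatter dominates.
tweets = [
    "My Apple Watch shipped early, delivery tomorrow!",
    "Still waiting on delivery of the Apple Watch",
    "Apple Watch delivery date moved up to April",
]
print(top_terms(tweets))
```

Real topic modeling goes well beyond term frequency, clustering co-occurring terms into coherent themes, but the pipeline shape (tokenize, filter, aggregate, rank) is the same.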
Conclusion
Alternative data sources allow us to vastly increase the volume of data to be worked with. As Herbert I. Weisberg, the author of ‘Bias and Causation: Models and Judgment for Valid Comparisons’, says: “Unlike error related to random variability, bias cannot be assessed without external knowledge of the world.” Alternative data has its own unique sampling biases, and new techniques are being developed to tackle this problem.
Our goal is to use alternative data to construct models that deliver a bigger-picture perspective, control for sampling biases and avoid the trap of a localized view (figure 7).
Fig. 7 The Blind Men and The Giant Elephant: The Localized (Limited) View of Each Blind Man Leads to a Biased Conclusion


Asset managers are increasingly incorporating alternative data into their investment process. Eagle Alpha enables asset managers to obtain alpha from alternative data. A subscription to Eagle Alpha provides asset managers with research, analytics and data. Get in touch: [email protected]