The data myths and simplified problems: thoughts.

9 min read · Jun 24, 2019

After several years of cautious enthusiasm, the marketing and advertising technology sector is now embracing data in a big way. That’s the good news. The obstacle is that most companies and brands still lack the expertise needed to analyze huge amounts of data and make it actionable. The worst danger is that the buzzword is, not surprisingly, making room for people and new companies that lack the capacity or knowledge. The result: bad decisions, wrong investments and a lack of results.

Given this situation, this text addresses some of these problems and offers some ideas to consider when talking about the magical benefits of platforms, data, and algorithms. The topics are: why you should not take the data approach for granted, and how far to trust data and gurus.

Peanuts and beans

A modern data-driven culture is a complicated entity. Thousands of entities engage in producing millions of different values. Many millions of people interact at all sorts of points and make decisions about which goods to buy, click or share. Let’s use peanuts as an example.

Peanuts must be harvested at the right time and shipped to processors who turn them into peanut butter, peanut oil, peanut brittle, and numerous other peanut products. These processors, in turn, must make certain that their products arrive at thousands of retail outlets in the proper quantities to meet demand. Then people must select one product among many and complete the buying process.

Because it would be impossible to describe the features of even this peanut market in complete detail, marketers have chosen to abstract from the complexities of the real world and develop rather simple models that capture the “essentials”.

Just as a road map is helpful even though it does not record every house or every store, models of the market for peanuts are also useful even though they do not record every minute feature of the peanut economy.

Even though these models often make heroic abstractions from the complexities of the real world, they nonetheless capture essential features that are common to all economic activities.

The use of models is widespread in the physical and social sciences. In physics, the notion of a “perfect” vacuum or an “ideal” gas is an abstraction that permits scientists to study real-world phenomena in simplified settings. In chemistry, the idea of an atom or a molecule is actually a simplified model of the structure of matter. Architects use mock-up models to plan buildings. Television technicians refer to wiring diagrams to locate problems. Marketing models perform similar functions. They provide simplified portraits of the way individuals make decisions, the way buyers behave, and the way in which these two groups interact to establish markets.

In the real world, for instance, it would be nearly impossible to determine the causal relationship between an increase in the price of a good (the dependent variable) and the number of units demanded (the independent variable) while also taking into account the other variables that affect price. For example, the price of peanuts may rise if more people are willing to purchase them, and producers may sell them for a lower price if fewer people want them. But the price of peanuts may also drop if, for instance, the price of farmland drops, which makes it difficult to assume that demand alone caused the price change.

That is why marketers make assumptions. In economics, this is called the “ceteris paribus assumption”, from a Latin phrase meaning “with other things the same” or “other things being equal or held constant”. It helps to isolate the effect of one independent variable when several of them affect the dependent variable. If the other variables, such as prices of related goods, production costs and labor costs, are held constant under the ceteris paribus assumption, it becomes much simpler to describe the relationship between price and demand.
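To make the idea concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the function names `simulate_market` and `ols_slope`, the coefficients, and the invented rule that land is cheaper when demand is high. It compares the slope of price on demand across all observations with the slope measured only where land cost is roughly constant.

```python
import random

def simulate_market(n=5000, seed=0):
    """Generate hypothetical (demand, land_cost, price) observations."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        demand = rng.uniform(50, 150)
        # Invented assumption: land cost tends to be lower when demand is high.
        land_cost = 40 - 0.2 * demand + rng.gauss(0, 3)
        # Assumed "true" model: demand adds 0.05 per unit, land cost adds 0.30.
        price = 2.0 + 0.05 * demand + 0.30 * land_cost + rng.gauss(0, 0.5)
        rows.append((demand, land_cost, price))
    return rows

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

rows = simulate_market()

# Naive view: price against demand, ignoring land cost entirely.
naive = ols_slope([d for d, _, _ in rows], [p for _, _, p in rows])

# Ceteris paribus view: only observations where land cost is nearly constant.
held = [(d, p) for d, lc, p in rows if 19 <= lc <= 21]
ceteris_paribus = ols_slope([d for d, _ in held], [p for _, p in held])

print(f"slope ignoring land cost:    {naive:.3f}")
print(f"slope with land cost ~fixed: {ceteris_paribus:.3f}")
```

Under these assumptions the naive slope is pulled down by the land-cost effect and can even turn negative, while the restricted slope should land near the assumed 0.05. Real markets rarely allow such a clean restriction, which is exactly why the assumption deserves scrutiny.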

Types of data

Digital advertising has made it easier for us to engage potential buyers but it is still an imperfect discipline. After all, digital is far from being the perfect channel.

The problem is that a lot of advertisers aren’t creating the kind of content their audience wants to see. There is a simple reason for this: they don’t know what it looks like. They may have access to the best market research and a wealth of CRM data, but that only gives them an impression of what their audience wants, and this impression is based mainly on what they know about their existing customers. If you want to create personalized and truly relevant ads, you need to know what everyone wants, both your prospects and your existing customers. In reality, most independent variables in the market are dynamic: they change over time.

Thus, the challenge of marketing is obvious. It can be visualized as a two-sided model. First, since all the variables in the market are dynamic, a marketer has to understand the direction and magnitude of all the market forces that affect an organization’s results. Second, it is not enough to understand how these forces interact; the marketer also has to intervene with management tools and techniques to manage that interaction.

And to achieve this level of insight you need a whole lot of data. Today, in digital advertising, basically we have four different types of data:

The information in the CRM (email addresses, phone numbers, names etc).
Data your customers and prospects generate online (hashed device IDs, cookies etc).
Reports about sales, distribution and performance from POS.
Analysis and market research (Qualitative and Quantitative).

Bring these datasets together and you can build a view of your addressable market. And this enables you to do all sorts of interesting things.

For example, you can:

Match device IDs to existing customers.
Identify patterns in how unknown prospects are interacting with your apps, content and website.
Map how different segments of your audience move between devices and platforms.
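As a rough sketch of the first item, matching device IDs to existing customers, the snippet below joins CRM records to online events on a hashed device ID. The field names (`device_id`, `hashed_device_id`, `customer_id`), the helper names, and the choice of SHA-256 are all assumptions for illustration. Real identity matching depends on how each platform hashes and normalizes identifiers, and on whether the CRM stores a device ID at all.

```python
import hashlib

def hash_id(raw_id: str) -> str:
    """Hash an identifier with SHA-256 after normalizing case and whitespace."""
    return hashlib.sha256(raw_id.strip().lower().encode("utf-8")).hexdigest()

def match_events_to_customers(crm_rows, events):
    """Split events into known-customer events and unknown-prospect events."""
    # Lookup table: hashed device ID -> customer ID (hypothetical CRM fields).
    known = {hash_id(row["device_id"]): row["customer_id"] for row in crm_rows}
    matched, unknown = [], []
    for event in events:
        customer_id = known.get(event["hashed_device_id"])
        if customer_id is None:
            unknown.append(event)
        else:
            matched.append({**event, "customer_id": customer_id})
    return matched, unknown

crm = [{"customer_id": "C-1", "device_id": "ABC-123"}]
events = [
    {"hashed_device_id": hash_id("abc-123"), "page": "/pricing"},
    {"hashed_device_id": hash_id("zzz-999"), "page": "/blog"},
]
matched, unknown = match_events_to_customers(crm, events)
print(len(matched), "matched,", len(unknown), "unknown")
```

The unmatched events are the "unknown prospects" from the second item: their behavior can still be analyzed, just without a customer record attached.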

In short, you can see how everyone who interacts with your online advertising behaves both before and after the point of sale. It turns out that the key fight in marketing centers on how effectively these apparently independent variables can be influenced or tempered.

Be aware that even with all of this data and processes we are making an assumption of ceteris paribus.

Using Data to Optimize Advertising

The big promise of data to advertising is to improve accuracy of communication. Advertising is expected to become more relevant and less expensive as a result of less wastage. Different data is necessary depending on individual advertising goals. Basically, advertising activities can either be performance related or support the brand image. The greater the focus on immediate sales success, the more data is needed to promote individual customer contact and retargeting. But if increasing brand recognition is what matters, the focus will be more on general interest data and nonspecific messages.

Branding campaigns frequently aim to improve brand image or recognition. This is traditionally the domain of TV advertising. Therefore, the online advertising world has adopted indicators like net reach or gross rating points from TV advertising. The success of a branding campaign is judged by maximum contact with a given target group. In many cases, sociodemographics like age and gender determine the relevant segments. Data mining is used to make a valid prediction of these characteristics for as many online users as possible. Usually, the greater the reach, the less precise the forecasting of characteristics, and this is a trade-off that must be considered. If we assume that the data is valid, an advertiser can significantly reduce its media costs this way. Advertising is delivered only to its target group, driving down wastage significantly.

The validity of the user data is another fundamental problem. Verifying whether the cookies of online users describe them accurately is a service provided by third-party companies like Nielsen or Comscore. However, these companies use proprietary metrics, meaning that their measurements are not always consistent with one another. Tests in the USA and UK have shown that different validating companies assign different genders to one and the same online user. Some may categorize a user as male, while another categorizes the same person as female. As a result, even the data provider cannot be sure of the actual quality of the data in the lead up to an advertising campaign. The same applies to the data user. As long as there is no validation standard, the user cannot know which provider supplies good data, which means that they are taking a risk.
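One simple way to put a number on this disagreement is to compare the labels two providers assign to the same users. The sketch below assumes each provider's output is available as a mapping from user ID to a gender label. The function name `agreement_rate` and the sample data are hypothetical, and a high agreement rate still would not prove that either provider is right.

```python
def agreement_rate(labels_a, labels_b):
    """Share of users, among those labeled by both providers, with equal labels."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        return None  # No overlap, so nothing can be compared.
    agree = sum(1 for user_id in shared if labels_a[user_id] == labels_b[user_id])
    return agree / len(shared)

# Made-up labels from two hypothetical validation providers.
provider_a = {"u1": "male", "u2": "female", "u3": "female", "u4": "male"}
provider_b = {"u1": "male", "u2": "male", "u3": "female", "u5": "female"}

rate = agreement_rate(provider_a, provider_b)
print(f"providers agree on {rate:.0%} of shared users")
```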

A simplified view of the issues and challenges in ad algorithms

The other face of digital advertising is distribution. The challenge of effective web advertising primarily involves placing relevant ads on the web pages users request. Those ads must be relevant to the person receiving the page, that is, relevant to the page context and/or directly to the user. Web media have the advantage of holding some information about the user, which enables them to choose the “right” ad for “the user”. Statistically and cumulatively, one may determine a visitor’s interest in certain things from, for example, search queries, social networks and related groups, email content, bookmarks and backlinks.

Of course, many privacy issues are involved here; we will leave those for others to discuss.

Don’t take numbers for granted, don’t trust humans.

When talking about data, organizations have difficulty evaluating its quality and reliability, which raises a big question for stakeholders: “can you trust your data?”. People are worried about the authenticity of their organizational data or of the data they intend to use. Now that everyone has realized that human judgement in a business context is poor, organizations are increasingly basing decisions on data-driven facts. Given the blind trust that reigns today over data and big data solutions, businesses must not take those numbers for granted.

Believe it or not, a lot of things can go wrong. Even Google Analytics is prone to mistakes. Anything and everything, from data collection to data integration and from data interpretation to data reporting, should be questioned rigorously. For example, events that are not named can lead analysts to make errors when calculating results.

The numbers can be misinterpreted if the context is not fully understood. For example, the sales department might agonize over why the conversion rate was not going up even after improvements to the purchase funnel, until they remember that the marketing department also started an acquisition campaign, which brought in a higher volume of visitors who were “less qualified” than before, and hence a lower conversion rate.

In an opposite situation, if the conversion rate had skyrocketed, no one would have questioned the positive numbers and the sales manager would have taken pride in the hike of the conversion rate.
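The conversion-rate example is easy to reproduce with a few lines of arithmetic. The visitor and conversion counts below are invented for illustration, and `conversion_rate` is a hypothetical helper. The point is only that a blended rate can fall while the original segment improves.

```python
def conversion_rate(segments):
    """Blended conversion rate over (visitors, conversions) segments."""
    visitors = sum(v for v, _ in segments)
    conversions = sum(c for _, c in segments)
    return conversions / visitors

# Before: only the usual, well-qualified traffic (400 of 10,000 convert).
before = [(10_000, 400)]

# After: the funnel improvement lifts qualified traffic to 500 of 10,000,
# but the acquisition campaign adds 10,000 visitors who convert at 1%.
after = [(10_000, 500), (10_000, 100)]

print(f"before: {conversion_rate(before):.1%}")  # 4.0%
print(f"after:  {conversion_rate(after):.1%}")   # 3.0%
```

Looking at each segment separately, rather than at the blended figure, is what reveals that the funnel change actually worked.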

Another factor to consider is the limited amount of data organizations commonly have. If we are already making the assumption of ceteris paribus and then make decisions based on small samples (hundreds or thousands of values), the margin of error is huge, and the decision is basically based on “guts” while passing as an “informed decision”.

Spreadsheets fundamentally lack the properties essential to modern data work. To do good data work today, you need to use a system that is:
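How large that margin of error is can be estimated with the standard normal approximation for a proportion. This is a sketch under assumptions: the sample counts are invented, `margin_of_error` is a hypothetical helper, and the approximation itself becomes unreliable for very small samples or rates close to 0 or 1.

```python
import math

def margin_of_error(conversions, visitors, z=1.96):
    """Approximate 95% margin of error for a conversion rate (normal approx.)."""
    p = conversions / visitors
    return z * math.sqrt(p * (1 - p) / visitors)

# The same 3% observed rate, measured on a small and on a large sample.
small = margin_of_error(6, 200)
large = margin_of_error(600, 20_000)

print(f"200 visitors:    3.0% +/- {small:.1%}")
print(f"20,000 visitors: 3.0% +/- {large:.1%}")
```

With 200 visitors the uncertainty is roughly as large as the rate being measured, so a change from 3% to 4% cannot be distinguished from noise. With 100 times more data the margin shrinks by a factor of 10.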

Reproducible

Don’t trust any number that comes without supporting code. That code should take you from the raw data to the conclusions. Most analyses contain too many important detailed steps to communicate in an email or during a meeting. Worse yet, it is impossible to remember exactly what you’ve done in a point-and-click environment, so doing it the same way again next time is a coin flip. Reproducible also means efficient. When an input or an assumption changes, updating the results should be as easy as re-running the analysis.

Versionable

Code versioning frameworks, such as git, are now a must in the workflow of most technical teams. Teams without versioning are constantly asking questions like, “Did you send the latest file?”, “Can I be sure that my teammate selected all columns when he sorted?”, or “The bottom line numbers are different in this report; what exactly has changed since the first draft?”

These inefficiencies in collaboration and uncertainties about the calculations can be deadly to a data team. Sharing code in a common environment also enables the reuse of modular analysis components. Instead of four analysts all inventing their own method for loading and cleaning a table of users, you can share as a group the utils/LoadUsers() function and ensure you are talking about the same people at every meeting.
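A shared loader in the spirit of utils/LoadUsers() might look like the sketch below. The column names (`user_id`, `email`) and the cleaning rules are assumptions, since the article does not say what the table contains. The value lies in every analyst calling the same function rather than in these particular rules.

```python
import csv
import io

def load_users(source):
    """Load and clean a users table from an open CSV file object.

    Assumed rules: trim whitespace, lowercase emails, drop rows missing an
    ID or email, and keep the last row seen for each user_id.
    """
    users = {}
    for row in csv.DictReader(source):
        user_id = (row.get("user_id") or "").strip()
        email = (row.get("email") or "").strip().lower()
        if not user_id or not email:
            continue  # Skip incomplete rows instead of guessing values.
        users[user_id] = {"user_id": user_id, "email": email}
    return list(users.values())

# Small in-memory example; in practice this would be an opened file.
raw = io.StringIO(
    "user_id,email\n"
    "1, Ana@Example.com \n"
    "2,\n"
    "1,ana@example.com\n"
    "3,bob@example.com\n"
)
users = load_users(raw)
print(users)
```

Because the function lives in version control, a change to the cleaning rules shows up as a reviewable diff instead of a silent difference between two spreadsheets.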

Scalable

There are hard technical limits to how large an analysis you can do in a spreadsheet. Excel 2013 is capped at just over 1 million rows (1,048,576). It doesn’t take a very large business these days to collect more than 1 million observations of customer interactions or transactions. There are also feasibility limits. How long does it take your computer to open a million-row spreadsheet? How likely is it that you will spot a copy-paste error at row 403,658? Ideally, the same tools you build to understand your data when you’re at 10 employees should scale and evolve through your IPO.
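By contrast, code can process a file one row at a time, so the row count is bounded by disk space and patience rather than by what fits on screen. The sketch below is one illustrative approach, with assumed column names (`customer_id`, `amount`) and a hypothetical helper name. It keeps only the running totals in memory.

```python
import csv
import io
from collections import Counter

def revenue_by_customer(source):
    """Sum transaction amounts per customer, reading one row at a time."""
    totals = Counter()
    for row in csv.DictReader(source):
        totals[row["customer_id"]] += float(row["amount"])
    return totals

# Tiny in-memory stand-in for a transactions file with millions of rows.
transactions = io.StringIO(
    "customer_id,amount\n"
    "C-1,10.50\n"
    "C-2,4.00\n"
    "C-1,2.25\n"
)
totals = revenue_by_customer(transactions)
print(totals.most_common(1))
```

The same function works unchanged on an opened file of any length, because `csv.DictReader` never loads the whole file at once.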

Written by Andrés de la O