PopStats Part I: bastard introverted stats

How do we make stats talk to each other and interbreed?

Shamik Sharma
Aug 21, 2017

I have been going over a lot of data about India. It is incredible how much data there is.

Census statistics contain a bounty of information about where Indians live, their gender distribution, their marital status etc. Similarly, there are other extensive datasets about India for labor, education, nutrition, health, transport, finance etc. Besides, there are numerous market-research datasets about how Indians spend their money, how they use their mobile devices etc. Most large companies also have their own privately commissioned surveys about how Indians use products/services for their markets (e.g. how many people eat biscuits).

There are stats for everything.

Stats are introverted

Each statistical dataset comes from its own (vertical) point of view. The gender dataset talks about gender ratios in various states of India. The population density dataset describes the distribution of population across different settlements (tier-1 metros vs. small tribal settlements). The marketing datasets talk about the distribution of household consumption levels (SEC-A, SEC-B etc.). Telecom datasets talk about mobile penetration in rural vs. urban households. Each of these statistics analyzes the same Indian population along a different dimension.

But they don’t talk to each other. Each of them stands alone, gloriously segmenting the same Indian population along its own dimension of focus.

Why are these stats not combinable?

Why can’t we merge datasets so that we can answer questions across dimensions? For example, can we tell how many SEC-A women mobile users above 45 there are in tier-3 towns?

Bastard data

There are two problems with merging statistics. First, any survey can focus on only a few dimensions, so you have a basic data-capture problem. Second, mathematically, one cannot combine distributions that are along different dimensions. For example, ~30% of Indians regularly wear sarees and ~50% of Indians are men. This does not mean that 15% of Indians are men who regularly wear sarees. That’s bastard data: it comes from mingling two stats that were not meant to be married.
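To see why multiplying the two marginals goes wrong, here is a minimal Python sketch. The joint table is a made-up assumption for illustration (it pretends no men wear sarees at all); only the ~30% and ~50% figures come from the text.

```python
# Marginal shares of the whole population, as quoted in the text.
p_saree = 0.30
p_male = 0.50

# Naive merge: pretend the two dimensions are independent and multiply.
naive_saree_men = p_saree * p_male

# Hypothetical joint distribution (an assumption, not survey data):
# every saree wearer is a woman. Shares are of the whole population.
joint = {
    ("male", "saree"): 0.00,
    ("male", "no_saree"): 0.50,
    ("female", "saree"): 0.30,
    ("female", "no_saree"): 0.20,
}

# The joint table reproduces both marginals exactly...
male_share = sum(v for (gender, _), v in joint.items() if gender == "male")
saree_share = sum(v for (_, dress), v in joint.items() if dress == "saree")

# ...yet the cell we care about is nowhere near the naive product.
actual_saree_men = joint[("male", "saree")]

print(f"naive estimate of saree-wearing men: {naive_saree_men:.0%}")
print(f"share under the hypothetical joint:  {actual_saree_men:.0%}")
```

Both marginals are honoured by the joint table, so nothing in the two published stats could tell you the naive product is wrong. The information simply lives in the joint distribution, which neither survey captured.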

Let’s take a more realistic example. Let’s say a telco wants to determine the number of rural men with mobile access. They have the following stats (I’ve made these up for illustration):

  1. There are 1 billion Indians
  2. 70% of Indians live in villages
  3. 50% of Indians are men
  4. Mobile penetration is 30% in rural areas.
  5. Mobile penetration is 40% amongst men.

Here is one way to determine the number of rural, male mobile users: 70% of Indians are rural (700M), of whom 30% have mobile access (210M), of whom 50% are men, so the answer is 105M. Here is another way: there are 500M men in India (50%), of whom 40% have mobiles (200M), of whom 70% live in villages, so 140M. The two estimates do not agree, and neither is mathematically sound, because we are trying to extrapolate information that is just not there in the original dataset. The first path assumes that half of rural mobile users are men; the second assumes that 70% of male mobile users are rural. The stats we have are for all men and for all rural residents, but not for rural men.
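The two chains of multiplication can be written out as a small Python sketch. The `chain` helper and the variable names are hypothetical, and the numbers are the made-up ones from the list of stats.

```python
def chain(total, *shares):
    """Apply a sequence of percentage cuts to a starting headcount."""
    count = total
    for share in shares:
        count *= share
    return count

TOTAL_POPULATION = 1_000_000_000  # stat 1 (made up for illustration)
RURAL_SHARE = 0.70                # stat 2
MALE_SHARE = 0.50                 # stat 3
MOBILE_GIVEN_RURAL = 0.30         # stat 4
MOBILE_GIVEN_MALE = 0.40          # stat 5

# Path A: rural -> rural with mobile -> assume half of those are men.
rural_first = chain(TOTAL_POPULATION, RURAL_SHARE, MOBILE_GIVEN_RURAL, MALE_SHARE)

# Path B: men -> men with mobile -> assume 70% of those are rural.
male_first = chain(TOTAL_POPULATION, MALE_SHARE, MOBILE_GIVEN_MALE, RURAL_SHARE)

print(f"rural-first estimate: {rural_first / 1e6:.0f}M")
print(f"male-first estimate:  {male_first / 1e6:.0f}M")
print(f"gap between the two:  {(male_first - rural_first) / 1e6:.0f}M")
```

In each path the last multiplication is the suspect one: it borrows a population-wide share and applies it to a subgroup for which that share was never measured.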

So we can all agree that bastard data is bad. Kill it! Leave the corpse on a table.

But... but... you know, on the other hand, 105–140M (say, 125M) is not only better than nothing; it actually sounds like a pretty reasonable estimate of rural Indian men with mobiles. Telcos would probably be happy to use that number to plan and invest accordingly. So let’s resurrect that corpse and work with what we have.

Statisticians, sociologists and even computer scientists hate this kind of bastard data. Business folks and product managers, on the other hand, love it, and use it all the time. As long as the dimensions you are combining are reasonably independent of each other (that is, one does not vary wildly across the values of the other), you will be “good enough” and won’t end up with lots of saree-wearing men.

Because of these differences in perception, bastard data stays hidden in corporate Excel files in walled locations and never appears in papers, open-source datasets and computing systems. It never gets the opportunity to prove itself and become the King.

But what if all the bastard stats could just be put into one big pile and interact and fight each other in a battle of the Bastards? The good ones would win out, the bad ones would get fed to the dogs, and we would end up with much better approximations.

It’s like the blind men touching an elephant and guessing what it is: they just need to talk to each other to figure it out.

A common stats language

Before the web there was FTP. Your friends would tell you, usually over Usenet groups, about some public folder where there was a treasure trove of great data (psst, it was always jpegs). You could just open up an FTP/Gopher client and go retrieve and enjoy. Who needed HTTP/HTML and links?

We seem to be in a similar place with statistics today. We are all analyzing the same set of people (there are fewer than 8B of us, after all) in a thousand different ways and not talking to each other about it. It’s getting worse. Once we slice humans along even just 13 dimensions, each with just 6 possible values, we end up with 6¹³ segments (about 13B). That’s more segments than there are people. [Related aside: there are now more indexes/funds than underlying stocks in the US stock market].
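A quick check of that arithmetic in Python. The 8B figure is the rough upper bound on population used in the paragraph, and the variable names are just for illustration.

```python
POPULATION = 8_000_000_000   # rough upper bound on the number of humans
VALUES_PER_DIMENSION = 6     # possible values along each dimension

# Number of distinct segments when slicing along 13 dimensions.
segments_13 = VALUES_PER_DIMENSION ** 13

# Smallest number of dimensions at which segments outnumber people.
dims = 0
while VALUES_PER_DIMENSION ** dims <= POPULATION:
    dims += 1

print(f"segments with 13 dimensions: {segments_13:,}")
print(f"segments first exceed the population at {dims} dimensions")
```

With 12 dimensions there are still fewer segments than people; the thirteenth is the one that tips it over.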

Why not have a common stats language, a lingua franca, a taxonomy, by which you can refer to the common set of all people and specify how you have segmented the population (or some subset of it), and how it can be merged with other segmentation schemes that have already been done?

PopStats

I have a very, very simplistic first cut at this problem, which I call PopStats (for Population Stats).

Read about PopStats in Part II of this series.

Shamik Sharma

TechExec @ Bangalore/BayArea. Ex-CPO/CTO Myntra. Built cool products/teams/biz at Lytro, StumbleUpon, RockYou, Yahoo! Co-founded Confluent (acq. Oracle).