PopStats Part II — hook up, stat !

4 min read · Aug 21, 2017

In Part I of this series, I described the problem I have with population statistics about India: stats about different dimensions (age, gender, location, income, etc.) are all disparate and can’t be combined with each other.

What I wanted

The idea is to have a single collection of data-sets that offers:

1. Cross-dimension query-ability: the ability to ask any query that spans multiple dimensions.

Age vs Gender for Indians living in Tier-1 (metros)

2. Extensibility: people should be able to add their own custom dimensions. Queries that combine the newly added dimension with previously existing dimensions should just work.

Number of Indians in each household income group (hhi) in Tier-1 (metros)

3. Reference-ability: there should be a standard taxonomy (aka namespace) for common dimensions (e.g. age, gender, location) and their possible units (e.g. gender = male|female). Custom segments should be able to refer to these taxonomies and segment them further.

What I got

What I built is laughably simple.

I just make a list of every imaginary Indian (1.35B rows), and then start ascribing attributes to them, based on the distribution that a particular statistical dataset specifies. It's like your own little virtual Aadhaar system.

Census 2011 says that 51% of Indians are male and 49% are female. Hence, I randomly pick 688.5M entries from the list and mark them as male, and mark the other 661.5M entries as female. I repeat the same thing for each distribution. The “built-in” ones are gender, age and location.
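Here is a minimal sketch of that assignment step. It assumes a scaled-down population of 10,000 rows instead of 1.35B, and the function name `assign_by_shares` is hypothetical; it is not taken from the PopStats codebase.

```python
import random

def assign_by_shares(n_people, shares, seed=0):
    """Return one label per person, with label counts matching the given shares.

    Hypothetical helper for illustration; not the actual PopStats API.
    """
    labels = []
    items = list(shares.items())
    for label, share in items[:-1]:
        labels.extend([label] * round(n_people * share))
    # The last label absorbs any rounding remainder so the total is exact.
    last_label = items[-1][0]
    labels.extend([last_label] * (n_people - len(labels)))
    # Shuffle so that which rows get which label is random.
    random.Random(seed).shuffle(labels)
    return labels

N = 10_000  # assumption: a small stand-in for the 1.35B rows
gender = assign_by_shares(N, {"male": 0.51, "female": 0.49})
print(gender.count("male"), gender.count("female"))
```

Repeating the call with a different set of shares (age groups, location tiers) would fill in the other built-in dimensions in the same way.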

For any query, all that is needed is to scan all the 1.35B Indians one by one, and count how many of them match the given criteria. For example, the number of women, age 15–24, living in Tier-3 towns is just a scan resulting in 11M matches.

Young women living in Tier-3 towns
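The scan itself can be sketched in a few lines. This assumes each person is stored as a plain dictionary of dimension values; the representation, the `count_matching` name and the label strings are illustrative guesses, not the real PopStats data model.

```python
def count_matching(people, **criteria):
    """Scan every person and count those matching all the given criteria."""
    return sum(
        1
        for person in people
        if all(person.get(dim) == value for dim, value in criteria.items())
    )

# A tiny hand-made population, for illustration only.
people = [
    {"gender": "female", "age": "15-24", "tier": "t3"},
    {"gender": "female", "age": "25-34", "tier": "t3"},
    {"gender": "male", "age": "15-24", "tier": "t3"},
    {"gender": "female", "age": "15-24", "tier": "t1"},
    {"gender": "female", "age": "15-24", "tier": "t3"},
]

matches = count_matching(people, gender="female", age="15-24", tier="t3")
print(matches)  # 2
```

Because every attribute lives on the same row, any combination of dimensions can be queried without the data-sets needing to know about each other.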

Linking stats together

The complex and interesting part comes when a custom segment needs to refer to other, previously specified dimensions.

As we keep adding distributions, the data-set gets richer and richer. This additive ability is the key aspect.

For example, household-income (HHI) levels (or mobile-penetration) tend to be differently distributed across urban vs rural populations. Ketchup consumption is based on household income levels. And so on.

In such cases, the segment can specify a conditional distribution — i.e. use distribution X only if a prior condition is true about this person (e.g. this person lives in an urban location i.e. tier = t1, t2, t3 or t4 ). Conditions allow us to refer to and thereby link together different segments.

Distribution of people by household income-groups (hhi). Different distributions in Urban & Rural populations
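A conditional distribution could be sketched as an ordered list of (condition, distribution) rules, where the first rule whose condition holds decides how a person's new attribute is drawn. Everything here is an assumption made for illustration: the rule format, the `apply_segment` name, the `"rural"` tier label, and above all the income shares, which are made-up placeholders and not real statistics. For brevity the sketch samples each person independently rather than marking exact counts of rows.

```python
import random

URBAN_TIERS = {"t1", "t2", "t3", "t4"}

# Placeholder shares, invented for this example; NOT real HHI statistics.
HHI_RULES = [
    (lambda p: p["tier"] in URBAN_TIERS, {"low": 0.3, "mid": 0.5, "high": 0.2}),
    (lambda p: True, {"low": 0.6, "mid": 0.3, "high": 0.1}),  # fallback: everyone else
]

def apply_segment(people, dimension, rules, seed=0):
    """Assign `dimension` to each person using the first rule whose condition matches."""
    rng = random.Random(seed)
    for person in people:
        for condition, shares in rules:
            if condition(person):
                labels = list(shares)
                weights = list(shares.values())
                person[dimension] = rng.choices(labels, weights=weights, k=1)[0]
                break

# Assumption: half the toy population is urban (t1), half is rural.
people = [{"tier": "t1"} for _ in range(5000)] + [{"tier": "rural"} for _ in range(5000)]
apply_segment(people, "hhi", HHI_RULES)

urban_high = sum(1 for p in people if p["tier"] == "t1" and p["hhi"] == "high")
rural_high = sum(1 for p in people if p["tier"] == "rural" and p["hhi"] == "high")
print(urban_high, rural_high)
```

The condition only reads attributes that earlier segments have already assigned, which is what links the new segment to the existing ones.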

PopStats

PopStats is not just an idea about how to link stats together. It's an actual codebase, albeit a very basic one: only a couple of hundred lines of Python. It comes with a data-set of the Indian population distributed by age, gender and location to start with.

Using the provided data-set you can run queries like the ones mentioned in this article. You can also extend it with your own segments.

My hope is that people will use PopStats to add their own segments and see if the resulting data makes sense, especially when it is combined with other previously defined segments. If it doesn’t, some conditional distributions probably need to be added. Tools in the future could help segment-creators identify inconsistencies and suggest alternate facts.

Once the segment makes sense, people can publish their own segment-files to some central public location, and others should be able to use their segments too. Good segments will get used more often and become “part of the platform”.

Everyone becomes a market-researcher ! Software eats sociology ! [Editor : Add some random other proclamation !]

The next part, Part III, covers how to get PopStats and use it, along with some tech nuances about it.


Written by Shamik Sharma