PopStats Part III — Try it out

Shamik Sharma
2 min read · Aug 23, 2017

In Part 1 of this article series, I described some ideas on how statistical data sets about populations could be made more extensible, so that each data set can build on the others, resulting in a single, ever-improving data-set platform. In Part 2, I described an approach to doing this, called PopStats.

Pick it up

You can get PopStats from its GitHub repository.

The README file contains instructions on how to install and use PopStats.

StatsML

PopStats segments the population based on dimensions (e.g. age, gender, marital status) specified by the user. The user also has to specify how the population is distributed along each dimension (e.g. male: 51%, female: 49%).

The language used to specify this (really just a configuration file format) is called StatsML. It also lets a dimension refer to previously specified dimensions and define distributions that are conditional on them.
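To make the idea of plain and conditional distributions concrete, here is a minimal Python sketch of the arithmetic involved. It is not PopStats code and it does not use StatsML syntax; see the README for both. The function names, the two-tier location split, the two income bands, and every share except the 51%/49% gender split are made-up values for illustration, and the sketch assumes gender and location are independent of each other.

```python
import math


def check_distribution(dist):
    """Raise ValueError if the shares in a distribution do not sum to 1."""
    if not math.isclose(sum(dist.values()), 1.0):
        raise ValueError(f"shares must sum to 1, got {sum(dist.values())}")


def segment_population(total, gender, location, income_given_location):
    """Split `total` people into (gender, location, income band) segments.

    `gender` and `location` are plain distributions, assumed independent.
    `income_given_location` is a conditional distribution: for each
    location it gives the share of every income band in that location.
    """
    check_distribution(gender)
    check_distribution(location)
    segments = {}
    for g, g_share in gender.items():
        for loc, loc_share in location.items():
            bands = income_given_location[loc]
            check_distribution(bands)
            for band, band_share in bands.items():
                # Segment size = total * P(gender) * P(location) * P(band | location)
                segments[(g, loc, band)] = total * g_share * loc_share * band_share
    return segments


# Illustrative inputs only; these are NOT the figures shipped with PopStats.
gender = {"male": 0.51, "female": 0.49}
location = {"tier-1": 0.3, "tier-2": 0.7}
income_given_location = {
    "tier-1": {"sec-a": 0.4, "sec-e": 0.6},
    "tier-2": {"sec-a": 0.1, "sec-e": 0.9},
}

segments = segment_population(1000, gender, location, income_given_location)
for key, size in sorted(segments.items()):
    print(key, round(size, 1))
```

Each new dimension multiplies the number of segments, and a conditional distribution only changes which share gets applied within each existing segment. That is why a dimension defined later can refine the segments produced by earlier ones.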

PopStats comes with four StatsML files:

  1. age.sml: segments by age group, namely child (0–14), youth (15–24), adult (25–44), mature (45–59) and old (60+)
  2. gender.sml: segments by gender (male and female)
  3. location.sml: segments by type of location (tier-1 to tier-8)
  4. hhi.sml: segments by household income (HHI) band (sec_a0, sec_a1, sec-a, ... sec-e)

These four dimensions and the distributions within them are self-explanatory. The last one (HHI) uses conditional distributions; you can study it as a template for creating your own dimensions and adding them to the population.

The repository also comes with a pre-segmented database, india.dat, which segments the Indian population using the four dimensions above.

Extending PopStats

PopStats is a very basic tool, meant only to illustrate one way that population statistics can be combined. A lot more is needed to make it practical. Here are some immediate thoughts on how it could be improved:

  1. Web UI: make it accessible over the web, with visual charting.
  2. Sharing: allow users to share stats to build up a common data set, with tools to analyze the stats for inconsistencies.
  3. StatsML: improve the way stats (*.sml) are specified and how they refer to each other. Allow richer conditional expressions that span multiple dimensions.
  4. Time: distributions will change over time (e.g. household incomes increase). StatsML should be able to specify how things will change, and PopStats should be able to show changes in each segment across time.
  5. Multiple populations: model multiple population groups and allow comparisons across corresponding demographic segments (e.g. mobile access in India vs China vs US).
  6. Market analysis: there are many ways this could be used for corporate market analysis.

If you are interested in this topic and have ideas on the above, reach out to me via comments, GitHub or Twitter.

While you are here, you may want to read some observations I made about the Indian population while looking through the stats. Read my post on that topic.

Shamik Sharma

TechExec @ Bangalore/BayArea. Ex-CPO/CTO Myntra. Built cool products/teams/biz at Lytro, StumbleUpon, RockYou, Yahoo! Co-founded Confluent (acq. Oracle).