A May 2016 report by ProPublica revealed a stark finding: a statistical tool commonly used by criminal justice professionals to predict criminal recidivism and determine sentencing guidelines was inherently biased.

The program, Compas, supposedly slanted its predictions to assume African-Americans were more likely to commit another crime than Caucasians. Not just slightly, but substantially so. ProPublica’s analysis accuses the software of labeling black defendants as future criminals at twice the rate of whites, and of labeling whites as low risk more often than black defendants.

But those statistics do not appear to be based on the output of Compas’ model. In the notebook for ProPublica’s calculations (cells 50–55), the percentages are calculated from several values, including whether the defendant reoffended, how much time they spent in jail, and whether their score was valid, but not the actual value of their Compas score. …

# Coronavirus and the Fallacy of Increasing Sample Sizes, or Incidence Versus Prevalence

A growing total number of infected persons might seem like a sign of an epidemic, and it could be, but it could also just reflect the growing sample size.

For example, if a disease affects 10 percent of a uniformly distributed population, then a sample of 1,000 people might show around 100 people infected.

If the number of people being tested increases, then the number of infected persons in the sample is likely to increase at the same rate as the sample size.

From that same 10% infected population, a larger sample of 10,000 people would likely collect about 1,000 infected. A sample of 50,000 would collect 5,000 infected. So the number of infected increases in lock-step with the sample size. That applies whether the population sample size increases linearly or exponentially or any other way. …
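The arithmetic above can be checked with a minimal simulation sketch. The function name `count_infected`, the fixed seed, and the choice to model each tested person as an independent draw at a constant 10 percent prevalence are assumptions made for illustration only.

```python
import random


def count_infected(sample_size, prevalence=0.10, seed=0):
    """Count positives in a random sample drawn at a fixed prevalence.

    Assumption: each tested person is infected independently with
    probability `prevalence`, mirroring the uniform population above.
    """
    rng = random.Random(seed)
    return sum(rng.random() < prevalence for _ in range(sample_size))


# The infected count grows with the sample, while the rate stays near 10%.
for n in (1_000, 10_000, 50_000):
    infected = count_infected(n)
    print(n, infected, round(infected / n, 3))
```

The count of infected rises roughly in proportion to the sample size, while the share of infected in each sample stays close to the underlying prevalence.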

# What Everybody Gets (Slightly) Wrong About the Monty Hall Problem

Admittedly, when I first heard the Monty Hall problem, I didn’t believe it.

Why would switching your decision help your odds if your choice is random? Should I be constantly switching lanes at the grocery store? What did this mean for game shows like Deal or No Deal with dozens of doors?

At the time, I wasn’t able to explain it well. Even as others explained the Bayesian logic behind it, something didn’t seem right.

When I tried to replicate it in code, which others said should prove the theory, I wasn’t able to.

Eventually I realized it: The Monty Hall problem assumes that when a door is revealed the first time, it is a loser every time. It’s always farm animals rather than a sports car or a bag of money. …
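A minimal sketch of that assumption, not the original code from my attempts: the names `play` and `win_rate` and the `host_knows` flag are hypothetical. With `host_knows=True` the revealed door is always a loser. With `host_knows=False` the host opens one of the other two doors at random, and rounds where the prize is revealed are thrown out.

```python
import random


def play(switch, host_knows, rng):
    """Play one round. Returns True/False for a win/loss,
    or None if a random host accidentally revealed the prize."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    if host_knows:
        # The host only ever opens a losing door the player did not pick.
        opened = rng.choice([d for d in doors if d != pick and d != car])
    else:
        # The host opens either unpicked door, with no knowledge of the prize.
        opened = rng.choice([d for d in doors if d != pick])
        if opened == car:
            return None
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, host_knows, trials=100_000, seed=0):
    """Share of wins among rounds where a losing door was revealed."""
    rng = random.Random(seed)
    results = [play(switch, host_knows, rng) for _ in range(trials)]
    valid = [r for r in results if r is not None]
    return sum(valid) / len(valid)


print("switch, host always reveals a loser:", win_rate(True, True))
print("switch, host reveals at random:     ", win_rate(True, False))
```

Under these assumptions, switching should win about two-thirds of the time when the host always reveals a loser, and only about half the time when the reveal is random.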

# Improved Simulation Performance with Recursive Vectorization in Python

Running a step-by-step simulation can be a time-intensive process if you’re dealing with large datasets, complex modeling logic, slow infrastructure, or all three. Compared to forecasting with regressions or classification, a simulation can be orders of magnitude slower.

But that’s just how it is. Simulations are necessary for modeling complex logical steps. And maybe that means switching to faster languages (C++, C#) or throwing hardware at the problem to cut down on processing time.

Yet a more approachable programming language like Python can still deliver decent performance with a trick I came across that leverages recursion.

Usually recursion is slower than looping in Python because of how it references the call stack. It’s usually only meant for problems where the functional logic is inherently recursive, like a Fibonacci sequence, where it keeps the code simple at the expense of performance. Otherwise recursion makes the code more complex. …
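As a rough illustration of that baseline only, and not of the trick itself, here is a sketch comparing a naive recursive Fibonacci with a plain loop. The function names are hypothetical, and the gap it shows comes from both function-call overhead and the repeated subproblems that naive recursion recomputes.

```python
import timeit


def fib_recursive(n):
    """Naive recursion: short to write, but it recomputes subproblems."""
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_loop(n):
    """Iterative version: the same result with a single pass."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Timings vary by machine, so no expected numbers are given here.
n = 20
print("recursive:", timeit.timeit(lambda: fib_recursive(n), number=10))
print("loop:     ", timeit.timeit(lambda: fib_loop(n), number=10))
```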

# Analyzing Executive Compensation with Distributions and Execucomp Data

There are plenty of stories these days about fatcat CEOs and the ridiculous salaries they get paid. Carlos Ghosn, the ex-head of Nissan currently on trial, is not the only one.

But something was always a little odd about this. At least for public companies, executives can’t really get away with robbing the till in plain sight. At least not easily. Investors have a say in how much CEOs get paid, and why would any investor put up with paying the CEO millions that would otherwise go to the company’s operations? Or stock price?

Maybe well-paid CEOs really do generate more profits? And maybe nobody cares? Either way, it was an idle curiosity of mine. …

# Predicting the Mega Millions with Gaussian Naïve Bayes

Some time ago, when I was poor, had too much spare time, and the Mega Millions lottery payout was at some historic high, I tried looking at whether there were any trends to be gleaned.

After scraping the site, lo and behold, there was something interesting going on. Not enough to give up work and become a professional gambler, but something worth looking into.

Well, Data.gov has a data set of Mega Millions winning numbers from the New York State Lottery going back almost a decade, and the oddities are still there.

For example, have a look at the plot of just one ball (of…

# Why Learn Data Science?

To answer this question — to paraphrase Connections host James Burke — involves a bit of a detour.

Before Flatiron, I was employed as a data journalist. Data journalism shares a substantial overlap with data science, although the two certainly aren’t the same.

Data journalism doesn’t commonly have the same depth of analysis or prediction found in data science, and data science often doesn’t really get into the subjective nature of policy that’s not in the data — the nuance of events that only comes from research and reporting: talking to people who are intimately familiar with the subject and can pinpoint the real cause of the spike in hog imports that won’t be found in the data. …

# Trump Administration Drug Price Decline Began Before Generic Approvals

The Trump administration’s Council of Economic Advisers (CEA), headed by Kevin Hassett, published a report in October of 2018 highlighting the substantial decline in drug prices because of new policies to approve more generic drugs.

The report cites free market entry and robust competition as a pathway to lower drug prices.

While drug prices have seen a decline reflected in consumer price indexes and other measurements, the CEA’s own analysis shows that the drug price decline doesn’t appear directly related to increased approval of generics.

# Glass-Steagall’s Relevance: The Deregulation that Drove the Financial Crisis

The story of the 2007–2008 financial crisis has left out the role of deregulation.

In the wake of the crisis, financial experts in media outlets like NPR and the Financial Times dismissed the effect of eliminating banking regulations like Glass-Steagall. They argued that the law was mainly aimed at preventing banks from getting too large or “too big to fail,” which wasn’t essential to causing the crisis.

According to former Treasury secretary and chief economist for the World Bank Larry Summers, “virtually everything that contributed to the crisis was not affected by Glass-Steagall even in its purest form.”

But Glass-Steagall was not intended to prevent banks from becoming too large. It was originally created to prevent conflicts of interest between commercial and investment banking. …

# Many Generic Drug Prices Moving in Lock-Step

The Centers for Medicare & Medicaid Services’ (CMS) National Average Drug Acquisition Cost (NADAC) data set shows that prices for many generic drugs from different manufacturers synchronized at about the same time in 2015.

Costs for equivalent drugs from different manufacturers suddenly became the same price at the exact same time.

Many of the drugs in the NADAC set, which surveys drug prices at the time of purchase from pharmacies, then moved in lock-step up or down in price by the exact same amount, down to the hundredth of a cent, multiple times.

The majority of these movements started occurring in September of 2015. Before then, prices from one competitor to the next could differ substantially and price changes would happen at different times. …