Part II: The math doesn’t make sense either

Samuel Witherspoon
8 min read · Apr 23, 2018


In my previous post I addressed why humans make prediction hard — especially in highly irrational situations. In this part I will dive into why the math driving these predictions is broken.

Before we go any further on the math/computer science piece we need to agree on a few simple rules:

1. I won’t point to ‘algorithms’ to make my points. You shouldn’t either. Anyone who does is full of shit. It’s an unfair attempt to scare you into submission. It’s an appeal to authority. But most importantly it’s gross.

A sample algorithm.

2. Basic statistics can be easy and approachable! The law of large numbers is imperfect but nonetheless a useful concept. The basic idea is simple — the larger the sample size, the more likely it is to be representative of the behaviour in the universe. Another way to think of it: if you could observe every potential scenario (infinite observations), your model would know exactly how each scenario plays out!

This is a lot of books. It is not anywhere near an infinite amount of information. It is a really small number of books.

3. You didn’t look at all the data. Law is a game of incomplete information. And it is dramatically more complex than no-limit poker. “But we can beat humans at no-limit poker by a wide margin” shout the predictors! I guess the game of law has a complete rule set and only 52 cards as well. I guess we don’t need lawyers anymore.

If you accept these three rules read on!
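Rule 2 is easy to see for yourself. A quick simulation (a hypothetical coin-flip example, not anything law-specific) shows the observed rate drifting toward the true rate as the sample grows:

```python
import random

random.seed(42)

def sample_mean(n):
    """Average of n simulated fair coin flips (1 = heads, 0 = tails)."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

# The law of large numbers in action: the bigger the sample,
# the closer the observed rate tends to sit to the true rate of 0.5.
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```

With ten flips the observed rate can land almost anywhere; with a hundred thousand it hugs 0.5. The catch, as the rest of this post argues, is getting anywhere near a hundred thousand representative examples.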

Definitions

We need to agree on a few terms before we go further (hey, I was a lawyer at some point in the past).

Expert system — a model constructed using relevant features specified by an expert. These systems are used with little data.

Features — the things you specify in your model as points of data that assist in making a prediction, e.g. the number of people you killed is a feature you would use to predict sentence length in a murder trial.

Little data — less than lots of data.

Lots of data — millions and millions of examples for a model to learn from.

Supervised (end-to-end) learning — an approach in machine learning where you provide a complex input and a simplified, labelled output (i.e. win/loss) and allow the math to sort out the middle piece (the features). This approach needs lots of data.

Math isn’t hard.

When you set out to make a prediction there are a lot of ways to fail before you even begin. Many factors limit how meaningful your prediction will be. I’m going to focus on three:
- Data quantity
- Representativeness of the data
- End prediction objective

There is also a theme that I will attempt to ham-fistedly weave throughout this discussion: the common fallacy that past events are predictive of future events (or, put another way, that the rules of the system remain constant).

Data quantity

You need lots of information to make predictions about complex events. How much data, you may ask? There are rules of thumb you can live with. For a good expert system with five features in a little-data world, you probably need a few hundred examples to start making accurate predictions. What is an example? One case that confronts the same issue. You need a few hundred as a starting point, and those few hundred should be representative (see below).

For each new issue you need a few hundred. If the law changes (for example, the Supreme Court changes the law or there is a regulatory change), just turf out all the previous examples and start over.

The place old examples go when they are no longer relevant.

The volume of information that meets these criteria is not nearly large enough. Even in a high-volume area like family law in Ontario there are at best ~6,000 cases. In employment law in Ontario there are ~18,000. There are only ~1,200 that cite Bardal, the leading case on calculating termination pay, since 1960. It is extremely likely there are a bunch more cases (double or triple?) that cite cases that cite Bardal. It’s still just not enough. Now for some simple math!!

First our assumptions:

  • An employee can work from 0–40 years (41 total configurations)
  • An employee can be aged 18–65 (47 total configurations)
  • Their job is one of four types: high demand, med-high demand, med-low demand, low demand (4 total configurations)
  • They live in a Canadian province or territory (13 total configurations)*

*Note: for the purposes of this thought experiment every province and territory is treated as unique but equal. I have ignored that we have two official languages, as it makes the problem harder to solve. I have also intentionally ignored the civil law tradition of the province of Quebec because that’s even more complexity.

In order to have one sample for each possible configuration I would need a minimum of 100,204 cases (the math is 41 × 47 × 4 × 13). That’s a four-feature model with limited complexity that ignores many relevant features. Can I get away without an example for each configuration? SURE! But the confidence in my predictions will necessarily suffer. Is a model based on 20% of the total possible configurations really predictive of anything? Is a single data point per configuration proof that the trend will continue?
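The configuration count is just the product of the feature value counts. A quick sanity check, using the numbers from the thought experiment:

```python
from math import prod

# Feature value counts from the thought experiment above
configurations = prod([41, 47, 4, 13])  # service years x ages x job types x jurisdictions
print(configurations)  # 100204

# The cruder version after the age/service objection:
# halve the first two counts, rounding down to 20 and 23
reduced = prod([20, 23, 4, 13])
print(reduced)  # 23920
```

Every feature you add multiplies the count again, which is why even a toy four-feature model outruns the available case law.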

A turfed trend line.

Shrewd statisticians will recognize that there is one logical jump in my above calculation that unfairly inflates the possible configurations — the age/time-worked problem. In reality it is impossible for an 18 year old to have 20 years of service at a company. To head that objection off with really crude math, let’s halve 41 and 47 (rounding down to 20 and 23). We’re now at a much more attainable 23,920 examples required (20 × 23 × 4 × 13). Except there still aren’t that many examples.

To even get close to enough examples we have to decide that the law never changes. Which isn’t true. Past events are not predictive of future results. Don’t believe me? Ask patent unreasonableness. Or Lehman Brothers. Or literally every mutual fund.

delicious looking ham fists.

Listen. I’m done arguing with you. There just isn’t enough data. But in case that isn’t reason enough, let’s move along!

Representativeness of the data

The data is not representative of the real world. Full stop.

How am I so confident about this? Well, for starters, the number of long-term unemployed people in Ontario was 64,100. To be unemployed would seem to imply they were employed previously. So at least 64,100 people lost their jobs in the last 18 weeks.

There are about 2.89 eighteen-week periods per year (52 ÷ 18).

That works out to roughly 185,000 people with the potential to lose their job in Ontario annually. This seems really low to me. In the last year there were 1,083 employment cases in Ontario referencing the Employment Standards Act. Those cases represent about 0.6% of what amounts to a conservative estimate of employment churn. There is a simple message here:
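The churn arithmetic above, spelled out (using the post’s own figures, which are illustrative rather than authoritative):

```python
# Figures from the post above; treat them as illustrative, not authoritative
long_term_unemployed = 64_100   # Ontario, over an 18-week window
litigated_cases = 1_083         # Ontario cases citing the Employment Standards Act

periods_per_year = 52 / 18                                   # ~2.89
annual_job_losses = long_term_unemployed * periods_per_year  # ~185,000
share_litigated = litigated_cases / annual_job_losses

print(round(annual_job_losses))   # 185178
print(f"{share_litigated:.1%}")   # 0.6%
```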

The cases that end up in litigation are the anomaly.

This is a cow with horns. I am unwilling to predict that this is a bull.

End prediction objective

Finally it is really important to have perspective about what you are trying to predict. For example if your goal is to tell someone whether or not they were terminated you’re going to do a really good job! Of course! They’re on your god damn site!

100% accuracy in predicting future treats. Also an expert in reinforcement learning. Some dog somewhere.

Are you trying to tell someone the nearest case match to their scenario? We did this, and we used the k-nearest neighbour approach. It works OK because you aren’t trying to predict anything. You just measure how similar the person’s situation is based on a set of expert-defined features. It isn’t predictive of shit. At best it is a helpful research tool.
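A minimal sketch of that kind of nearest-case matching, assuming hand-picked expert features; the feature set, case names, and numbers here are hypothetical and not taken from any real system:

```python
# Hypothetical feature vectors: (years of service, age, job-demand score).
# None of these names or numbers come from a real system.

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_cases(query, cases, k=3):
    """Return the k past cases most similar to the query.

    This measures similarity only -- it predicts nothing about outcomes.
    """
    return sorted(cases, key=lambda c: distance(query, c["features"]))[:k]

past_cases = [
    {"name": "Case A", "features": (5, 40, 2)},
    {"name": "Case B", "features": (20, 55, 1)},
    {"name": "Case C", "features": (6, 38, 2)},
]

print(nearest_cases((6, 38, 2), past_cases, k=1)[0]["name"])  # Case C
```

Note what this does and doesn’t do: it ranks past cases by similarity to yours, which is useful for research, but it says nothing about how your case will resolve.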

If your goal is to tell someone how their situation will resolve you’re way off base. It’s just not possible. You can’t account for crazy. If the other side is absolutely nuts no amount of ‘algorithms’ or ‘all the data’ will tell you what will happen. To pretend otherwise makes you the crazy one.

“Honey I want a divorce”

For anyone reading this who disagrees I would love to be wrong. The legal system is messed up. It serves far too few people and consumes far too many resources. The average person can’t access legal services.

I still think it is dishonest to pretend that the anomalies are predictive of anything. And if prediction in the legal system is the end goal, then let’s just move to decision-making algorithms and get rid of those pesky humans. But let’s acknowledge that algorithmic prediction and decision making necessarily steamrolls nuance*.

*The best example of poor prediction I experienced in my brief legal career was when two parties (one self-represented, I might add) were fighting over a measly Canada Pension Plan pension. The law was on the side of the Department of Justice; emotion was on the side of the self-represented pensioner. The judge was acutely aware of the problems a decision would cause for the Canadian government but was also sympathetic to the pensioner.
The judge’s solution to the problem was to storm into the room, stare down the centre of the room, and loudly proclaim to no one in particular: “before you sit down I want you to know that if I have to come in here and make a decision you aren’t going to like it. I would suggest you try to settle this matter. I will return in ten minutes and I hope to hear that it is settled”. Then he stormed out, slamming the door.
When he returned ten minutes later the Department of Justice lawyer had settled with the pensioner, who had a smile on his face. I am certain the DoJ lawyer thought that the judge was speaking to him. He wasn’t.
You can’t predict this shit.
