How the FBI could review 650,000 emails in 8 days

Bob Arens
5 min read · Nov 7, 2016


This story is right up my alley. I do text processing for a living and have managed data for large-scale annotation operations, so I can say from experience that this is absolutely doable, even without whatever cool new technology the FBI has access to.

Tangent: how could someone get that many emails?

There’s an idea flying around that no one could possibly have accumulated that many emails in five years of working with someone; the rate of messaging would just be too high.

Well, let’s remember a few things. First, these emails came from three different accounts, not just a work account. Second, we don’t know anything about the emails themselves, up to and including whether or not spam was counted. Third, Huma Abedin was considered Clinton’s “gatekeeper”, so she’d be CC’ed on just about everything sent to Clinton, and most of that is routine: classified email makes up a really small percentage of what goes on at the State Department. Anyone who’s worked a government job, even at the highest levels, knows how much crap gets sent around. If you haven’t, ask someone who has; it’s ridiculous. Fourth, because Abedin was the gatekeeper, people would email her to get to Clinton, Abedin would email Clinton, and Clinton would respond, tripling the traffic for what would be a normal exchange for anyone else. Finally, a ton of what appears in those emails is forwards from Abedin to herself; apparently Clinton prefers to read off of paper, so Abedin would forward stuff to her Yahoo account to print off. That might be its own issue, but it’d be on Abedin’s head, not Clinton’s.

In any case, I think we should expect a Congressional inquiry into this stuff, at which point we’ll know more about these numbers. I find them credible, as I’ve personally worked with sets this large for single individuals.

Duplicates

The FBI said a bunch of these were duplicates. Snowden is right on when he says deduplication takes minutes to hours, tops. As I said, we know that a lot of these emails were forwards from Abedin to herself. Furthermore, we know that a whole lot of these emails went to Clinton as well as Abedin (gatekeeper, remember?). This means they were already reviewed during the original investigation and don’t need to be reviewed again. The end product of this step is a vastly reduced document set, the “novel set” of emails.
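To give a feel for why this step is fast, here is a minimal sketch of hash-based deduplication. It is not the FBI’s actual tooling; the field names (`sender`, `subject`, `body`) and the function names are my own assumptions for illustration. Each email is reduced to a fingerprint, and anything whose fingerprint has already been seen, either elsewhere on the laptop or in the previously reviewed set, is dropped.

```python
import hashlib


def fingerprint(email: dict) -> str:
    """Hash the identifying fields, ignoring case and surrounding whitespace.

    The field names are hypothetical; a real pipeline would also normalize
    things like "Fwd:" prefixes and quoted text before hashing.
    """
    parts = (email["sender"], email["subject"], email["body"])
    normalized = "\n".join(p.strip().lower() for p in parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def novel_set(emails, already_reviewed=()):
    """Return emails that duplicate neither each other nor the reviewed set."""
    seen = {fingerprint(e) for e in already_reviewed}
    novel = []
    for e in emails:
        fp = fingerprint(e)
        if fp not in seen:
            seen.add(fp)
            novel.append(e)
    return novel


reviewed = [{"sender": "a@example.com", "subject": "Hi", "body": "x"}]
laptop = [
    {"sender": "A@example.com", "subject": "hi ", "body": "x"},  # already reviewed
    {"sender": "b@example.com", "subject": "New", "body": "y"},  # novel
    {"sender": "b@example.com", "subject": "New", "body": "y"},  # repeat on laptop
]
print(len(novel_set(laptop, reviewed)))
```

This is a single pass with one set lookup per email, which is why a few hundred thousand messages is a minutes-to-hours job rather than a days-long one.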

Targeting

Next up, you need to target your search. Remember, we’re dealing with three email accounts here, only one of which was a “work” account. The FBI already knows whose correspondence they’re looking for and the kind of thing they’re looking for, since they would have had to provide this information to the judge who issued the search warrant. This is going to filter out all of Abedin’s personal correspondence, her FitBit updates, her Amazon purchases, etc., etc. The set gets much smaller there too.
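As a rough illustration of targeting, the sketch below keeps only emails where at least one participant is on a list of addresses of interest. The actual criteria in the warrant aren’t public as far as I know, so the addresses, field names (`sender`, `recipients`), and function names here are all hypothetical.

```python
def is_in_scope(email: dict, targets: set) -> bool:
    """True if the sender or any recipient is an address of interest."""
    participants = {email["sender"], *email["recipients"]}
    return any(p.lower() in targets for p in participants)


def target(emails, target_addresses):
    """Filter a collection down to correspondence involving the targets."""
    targets = {a.lower() for a in target_addresses}
    return [e for e in emails if is_in_scope(e, targets)]


# Hypothetical data: one work exchange, two automated personal messages.
emails = [
    {"sender": "boss@example.gov", "recipients": ["aide@example.com"]},
    {"sender": "orders@shop.example", "recipients": ["aide@example.com"]},
    {"sender": "updates@fitness.example", "recipients": ["aide@example.com"]},
]
in_scope = target(emails, ["Boss@example.gov"])
print(len(in_scope))
```

A real review would likely combine this with date ranges and keyword searches, but even a participant filter alone throws out the shopping receipts and fitness updates.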

Manual review

So now we’re down to manual review — the FBI had eight days to conduct a review of these emails, so they must have been in a panicked frenzy, right?

Not really. We know they’ve got an email review infrastructure in place, since they already did this with Clinton’s emails, so there’s not a lot of administrative or infrastructural overhead. Now, let’s look at the math.

In grad school, I got a bunch of friends to help me annotate some emails from the Enron dataset with the promise of free pizza. They would read the email and classify it into one of a few categories; the FBI is probably just flagging them for followup or not. These people, twentysomethings who were just helping out a friend as opposed to trained FBI agents, got through about three emails a minute. Let’s set that as the lower bound for what we’d expect the FBI to get through.

When I was producing metrics for annotation projects, I assumed that an annotator would be annotating for 6 hours in an 8 hour workday. Let’s say that the FBI put ten people on this (reasonable, given that they were less than two weeks away from the election and Comey had already dropped his bomb), so we have 60 person-hours per day.

Now, let’s assume the FBI isn’t in a horrendous rush. Of the eight days the FBI had to get this done, let’s stipulate that fully two of those days are lost to bureaucracy, report writing, or whatever. Either way, we’re going to just casually throw away 25% of the time the FBI has been allocated.

The math

Ten people, working six hours a day for six days, at a rate of three emails per minute, will get through 64,800 emails. That’s 10% of the 650,000 emails reported to have been on that laptop. In my experience, the FBI would likely be looking at a far smaller set, but that’s largely immaterial. What I’ve shown here is that, even if fully 10% of the emails were new, the FBI would have been able to review them in a quite leisurely manner.
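If you want to check the arithmetic or plug in your own assumptions, here it is as a few lines of Python. The function and parameter names are mine; the inputs are the ones stated above.

```python
def emails_reviewed(reviewers, hours_per_day, days, emails_per_minute):
    """Total emails a team gets through at a constant per-person rate."""
    minutes_per_person = hours_per_day * days * 60
    return reviewers * minutes_per_person * emails_per_minute


total = emails_reviewed(reviewers=10, hours_per_day=6, days=6, emails_per_minute=3)
share = total / 650_000

print(total)            # 64800
print(f"{share:.1%}")   # 10.0%
```

Change any one input, such as a slower reading rate or fewer reviewers, and you can see how much slack there is in the estimate.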

Now, Comey said his staff had been working “around the clock” to review these emails. Let’s look at the math again, presuming that there are ten agents available at any one time during all 24 hours in a day. Let’s continue to just throw away two of our days, leaving us with six instead of eight. This quadruples volume to 259,200 emails, or about 40% of the initial set of 650,000. Again, that percentage is incredibly high considering the filtering, but still, the FBI can do it.
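Here is the same calculation for the round-the-clock scenario, side by side with the six-hour-day one. The variable names are my own; the only thing that changes between the two scenarios is the number of staffed hours per day.

```python
REVIEWERS = 10          # people reviewing at any one time
DAYS = 6                # eight days minus the two we threw away
EMAILS_PER_MINUTE = 3   # the pizza-fueled grad student rate
TOTAL_EMAILS = 650_000

office_hours = REVIEWERS * 6 * DAYS * 60 * EMAILS_PER_MINUTE
round_the_clock = REVIEWERS * 24 * DAYS * 60 * EMAILS_PER_MINUTE

print(round_the_clock)                          # 259200
print(round_the_clock // office_hours)          # 4
print(f"{round_the_clock / TOTAL_EMAILS:.1%}")  # 39.9%
```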

But the first investigation took a year!

Of course it did. They were trying to build a case to bring criminal charges against somebody, and the FBI does not half-ass these things. Whether or not the New York field office is “Trumpland”, our justice system is adversarial, and it’s the government’s job to build as strong a case as they can. This takes time. They had to look at all of Clinton’s relevant emails, identify the players involved, construct timelines, follow networks of communicators… it goes on and on.

When they got to Abedin’s emails, they already knew what the Clinton world looked like. They didn’t have to do any of that legwork — all they needed to do was flag new information and follow up on it. They didn’t need to worry about theories of evidence, they didn’t need to think about how to build a case, there was none of the overhead of the original investigation.

In short: totally doable in eight days.
