What does bad data look like?

Bertil Hatt
30 min read · Aug 31, 2018


Earlier this week, PyData Manchester organised a panel on how to extract value from data. I was kindly invited to give my opinion.

The debate was lively, thanks to the Slack channel.

Every panelist agreed that the main obstacle to doing useful data science was “bad data”. The idea that data could be bad took the audience aback. Most of them had significant training in machine learning, academic or on Kaggle, but less experience doing data science for clients. One participant asked for common examples of bad data.

I thought it was a very important question. Bad data comes in many shapes and forms; it’s hard to pin down exactly. After almost 20 years of dealing with it, I’m starting to see patterns. The question deserved at least 40 minutes of answers. I couldn’t improvise and steal all that time from the organisers. This is why I started penning a draft of this post.

Please don’t be indulgent: point out all my mistakes. The document has minimal rights reserved. Feel free to copy, steal, take out of context and distort my position as much as you like.

This presentation progresses from the most glaring issue, no data at all, to irrelevant data. Then we talk about various formatting issues and miscommunications. We distinguish real errors from representation errors. Finally, we talk about fraudulent data.

No data

The most common type of bad data is no data at all. If you read for a Master’s in machine learning, you expect data science to be about collinearity correction and neural-network convergence. It is far more common to have someone ask for help when they don’t have any data at all. It’s not “bad” data per se, rather “no” data.

Truth is: the word is ambiguous. “Data” is a Latin word that means “things given, gifts”, but nowhere does it say who gives what to whom. Most people working with data, but who are not analysts or data engineers, receive reports. They might not realise what steps happen upstream. Imagine a baker who asks you for some wheat flour. I mean, they must get it somewhere…

That absence of data comes in many forms.

“I have a new idea: how can Data Science help from day 1?”

The most emblematic form of that is the self-described “co-founder” of a future company. They expect that a data scientist will magically provide insights and guarantee success.

In most cases, a data scientist is not relevant until after series B. First, an innovator creates a concept and tests it. Then they raise a series A to test the market fit, their ability to get someone else to provide a service of equivalent quality, etc. Once the innovator has proven the fit, they will want to scale. Not everyone will sit in the same room anymore. To keep control, the entrepreneur will need data processes, audits, reporting. Once you identify possible automation, data science becomes relevant, but not really before.

One common exception is companies providing data science as a service. In this case, how you are going to get that data matters: clients, innovative collection methods, etc. There is data there whose quality one should assess.

Don’t involve data science too early.

There is an internal version of that request. A product owner has a great, rather left-field idea. They want to make sure it will work, receive support and scale. They come to the data science team, hoping for fairy dust. First, they need to prove the idea fits, with examples. Then they need to turn it into a process. Data science will find a way to automate that process once it is at scale and collects data.

Internal or external project: Data science comes third. First, try your idea. Then start scaling a little, learning about your users. Once some process becomes inefficient, then data science can help.

Server disconnected

Managers of an on-line business hate downtime. The sunk-cost fallacy, or the fear of missing out, makes that conversation uncomfortable. Stakeholders will demand to understand what happened. They will want a solution. (Pro-tip: it’s technical debt. A non-technical manager can and should help address it.)

A very common move is to ask the data team if they can help. The conversation goes:
« The server went down last night at 21:48, and… »
Actually, I don’t know what the rest of that sentence is because I stopped listening. I interrupt:
« Was any data logged during the incident?
— No, of course, the… »
That’s when I stand up, thank them for the very short meeting, shake their hand and walk toward the door.

I will admit that this is not the nicest reaction. It is, however, a good way and a relevant opportunity to clarify: without data, there is no data science. You could interpolate the activity before the downtime, but that’s speculative. Actual sales lost (which you can’t list) and inferences (about custom you could have lost, customers who could have come back later, or who might have gone to your competitor) are not the same. That’s a key concept in statistics, and it’s not always clear to everyone. You need to make a bold point when differentiating the two. I’ve learnt that the hard way.

Data, but not related or relevant

People with an idea often come with unrelated data. They have a product that works well, say, hairdressing, and want to expand into something related, like offering hairdressing at home. Nowhere in their data is any trace of their customers’ addresses, their price elasticity for the convenience, or any correlation between neighbourhoods and the times of the week when customers are at home. I’m sure it’s a great idea, but there is nothing data science can help with.

A common real example of that is churn prevention: people leave your service and you want to know what to do about it. My first question is: do you ask them why they are leaving? If not, well, start collecting that. If there is an obvious problem, or a common complaint, you don’t need data science to deal with it. Understanding several distinct issues and improving through successive solutions is a lot of effort. Go do that first. No data science so far.

After thousands of complaints, you have matched customers’ context (country, device, previous purchases, used a voucher) and attempts at different, possibly contradictory solutions—for instance: offer a rebate or offer a premium service. That should represent enough data to make an informed guess. Can an experienced human predict which option would work best before they ask the customer? Think bold stereotypes: “Look, she’s spending a lot, she clearly wants an upgrade” vs. “People from that country always go for free shipping.” If a human can, then data science might be able to as well. With a prediction, you can automate the process. That’s rarely before the process has grown tedious.
Once again: data science comes third. Test the market fit first; scale and collect data second; data science comes once you have enough.

The previous examples are common situations before working for a client or an employer. They are glaring enough that you notice and postpone. But if a prospect shows you metrics and reports, it looks like they have enough data. As a data scientist, the offer to work with them is tempting. You might decide to check the data later, once you have full access. There are more big surprises waiting there, including more missing data.

Large chunks missing, no one knows why

A common occurrence is large missing sections of data. You can often recognise patterns that explain why, but with data missing, that is hard by construction. The first analyst or data engineer to join will see a lot of that, and years later you can often tell when they joined. Improvements to code-release practices are also very obvious to analysts looking at archives: fewer chunks missing here and there.
In high-growth companies, those improvements can be only weeks old. Working on short datasets can be a problem, especially if you want to predict seasonality.

All the previous examples are surprising but easy to resolve. Data isn’t there, so you have to look to replace it or find a proxy. If you can’t, you are out of luck, but at least it’s obvious.
Truth is: you will spend most of your time with data that is there but, more often than not, written wrong. Let’s explore bad data when you have data.

Date and time

Anyone who has manipulated actual data hates dates. Problems with formatting are common, but they are not the only problem. If you want to make friends with data scientists fast, throw in a joke about how frustrating timestamps are. You’ll get instant sympathy and probably a couple of veteran stories.

Formatting

Should you put the day of the month before or after the month number? To a non-analyst, who cares?
That specific convention is a common and infuriating way to lose an hour. Americans (who are wrong about every metric and standard) use MM-DD-YYYY. Most other countries prefer DD/MM/YYYY. The trick to tell them apart is generally to look for a January 13th. There is more to it than the inversion: the separator, ‘/’ vs. ‘-’, can also differ. Check for a leading 0 on the first nine months or the first nine days of the month. If you can’t find one, you are dealing with D/M/YYYY, which is different. It won’t break the data, but it will make alphabetical order wrong.
Some will argue that MMM, Mmm or MMMM, i.e. using the month name, abbreviated to three letters or not, is more legible to humans. That’s true but not relevant, because you will always prefer the option that lets you leverage alphabetical order. It will also fail to parse localised data (from non-English-speaking countries): 02-Fev-2016 and 15-Okt-2013 are generally fine, but 27-Gui-2012 and 09-Lui-2001 might not be. Also make sure you agree on two or four characters for the year. You don’t want to attribute your user growth to early Roman emperors.
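Here is a minimal sketch of that “look for a January 13th” trick, using only the Python standard library. The column name raw_dates is hypothetical, and the sketch assumes one consistent separator and that at least one date falls past the 12th:

    from datetime import datetime

    raw_dates = ["03/04/2018", "13/04/2018", "22/04/2018"]   # hypothetical input column

    def looks_day_first(dates):
        """Heuristic from above: if the first field ever exceeds 12, it must be a day."""
        return any(int(d.split("/")[0]) > 12 for d in dates)

    fmt = "%d/%m/%Y" if looks_day_first(raw_dates) else "%m/%d/%Y"
    parsed = [datetime.strptime(d, fmt).date().isoformat() for d in raw_dates]
    print(parsed)   # ['2018-04-03', '2018-04-13', '2018-04-22'], reported back as YYYY-MM-DD

If no value ever goes past the 12th, the heuristic cannot decide, and you are back to asking the source which convention they used.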

You can also add the time of day to that. Most people agree to use colons to separate hours from minutes and seconds. Some use 24-hour time and others specify AM/PM.

In summary: Every single character in a date can and will hurt, even those not on the screen. Anyone sensible knows that the only proper way to format a date and time (a timestamp) is YYYY-MM-DD hh:mm:ss with a 24h standard. If they say otherwise, they are wrong and they should not be your friend.

For most of the strangeness there, you have Excel to blame. The spreadsheet converts any date-looking text that it sees, without warning. That includes fractions (1/12), differences (3-4) and intervals (12:35). It applies a local standard (US, UK, etc.) without asking for clarification. It doesn’t let the user correct or amend any conversion. When working in Excel, the conversion is transparent thanks to editable formats. Those formats don’t apply to the source data stored as a .csv. This leads to massive headaches.
Friends don’t let friends handle data processed through Excel.

The problem is not that data formats are odd. You can correct for that; most tools do so automatically. The biggest pain point is inconsistency. One service will send data in one way, another in another. Both send information to the same flow, into the same table. The resulting hodgepodge might become impossible to untangle.

One surprising problem is microseconds. Most computers like to count and include those. It can be helpful when dealing with very fast (non-human-triggered) events. But date-handling software often doesn’t expect them. Some ignore the information. Some assume that the period in 2018-02-24 12:35:45.783 is a separator and treat the trailing digits as another column. A friend uses that problem in their interview process, to test people’s experience in handling data. They give candidates a dataset where some but not all timestamps include microseconds. The interviewer does not underline the problem. The goal is to count the number of events of a certain type per day. It sounds trivial. Yet, people can struggle for an hour with odd error messages about the number of columns and about finding text in an integer field if they don’t expect the inconsistency.
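A minimal sketch of that counting exercise, assuming a hypothetical events list where the fractional seconds are sometimes present and sometimes not; datetime.fromisoformat copes with both flavours:

    from collections import Counter
    from datetime import datetime

    events = [                          # hypothetical sample, mixed precision
        "2018-02-24 12:35:45.783",
        "2018-02-24 18:02:11",
        "2018-02-25 09:15:30.002",
    ]

    # Parse both flavours into the same type, then count events per calendar day.
    per_day = Counter(datetime.fromisoformat(ts).date() for ts in events)
    print(per_day)   # {2018-02-24: 2, 2018-02-25: 1}

The point is to normalise the timestamps before counting, rather than splitting on the period and hoping for a fixed number of columns.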

There are more nuances. A common surprise is that some software converts dates to timestamps (date and time, typically midnight). It can do so assuming a time zone, without saying that it did. This isn’t exotic, but it can take aback even experienced data wranglers who assumed that two events recorded on the same date, even in different locations, would keep the same date… The main source of pain there, and in general, is actually a treasure for geography geeks: time zones.

Time zones and daylight saving time

Things happen at a time and date, somewhere. Time and date are not the same around the world. I was born in Tahiti, which explains my interest in oddities like the International Date Line. (It is a real thing; it has the wonkiest shape; it doesn’t move; it does allow for time travel.) Any serious developer will record everything using Coordinated Universal Time (UTC). If you want to know whether something happened in the morning or at night, you have to convert that to local time. For any behaviour-based analysis, you want to make the conversion; in particular, if you want to assign any human action before 4 am local time to the day prior.
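A minimal sketch of that UTC-to-local conversion plus the 4 am convention, using the standard library’s zoneinfo (Python 3.9+); the function name and the example timestamps are hypothetical:

    from datetime import date, datetime, timedelta, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    def business_day(utc_timestamp: datetime, tz_name: str) -> date:
        """Convert a UTC event to local time, then assign anything before 4 am
        local time to the previous day (the convention mentioned above)."""
        local = utc_timestamp.astimezone(ZoneInfo(tz_name))
        if local.hour < 4:
            local -= timedelta(days=1)
        return local.date()

    event = datetime(2018, 8, 31, 2, 30, tzinfo=timezone.utc)    # 02:30 UTC
    print(business_day(event, "Europe/London"))    # 2018-08-30 (03:30 BST, before 4 am)
    print(business_day(event, "Pacific/Tahiti"))   # 2018-08-30 (16:30 local, the day before)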

25 countries have more than one time zone, often in overseas territories. Because of that, time zones can’t be looked up by country. They extend over regions, but they are referenced by the closest city. To tell the time zone of any given address, you have to find the nearest metropolis (large enough to have its own time zone referenced). Localisation problems complicate that further: a computer tends to think that “München”, “Muenchen” and “Munich” are three different words. International standards reference the city’s English name; the dataset might be using the (or one of the) local languages.

It gets even more confusing with daylight saving time! One hour of the year doesn’t exist in local time. One hour exists twice, consecutively. All time intervals across the change are off too: how many hours between Saturday noon and Sunday noon across the spring change? 23 hours. If you want to compare day-to-day activities, that kind of kerfuffle gets annoying really fast.
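A minimal sketch of that 23-hour day, using a real transition (UK clocks went forward at 01:00 on 25 March 2018). It also shows a Python subtlety: subtracting two datetimes in the same zone does wall-clock arithmetic, so you have to go through UTC to get the true elapsed time:

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    london = ZoneInfo("Europe/London")
    saturday_noon = datetime(2018, 3, 24, 12, 0, tzinfo=london)
    sunday_noon = datetime(2018, 3, 25, 12, 0, tzinfo=london)

    # Same-zone subtraction is wall-clock arithmetic: it claims a full day passed.
    print(sunday_noon - saturday_noon)                      # 1 day, 0:00:00

    # Converting both to UTC first gives the real elapsed time: 23 hours.
    print(sunday_noon.astimezone(timezone.utc)
          - saturday_noon.astimezone(timezone.utc))         # 23:00:00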

A final blow to sanity: formatting. Do you remember YYYY-MM-DD hh:mm:ss? There isn’t a consistent standard for writing a time zone into a timestamp. You can use a three-letter code, one of the 25 letters of the military alphabet (pronounced using the NATO alphabet), or the name of a city. My personal peeve? “BST” could mean either British Standard Time or British Summer Time, and those are the same half the year, so the error drops on you after months of summery, careless oversight, during which you’ve been copying your code everywhere.

Future, Jan 1st 1970, time travellers

Let’s assume that you have a dataset with events and their dates properly formatted. You can count per date and find the number of a given event per day. All seems good, until you realise that a lot of things happened… ten years in the future? Or on January 1st 1970, which is surprising if you work for a company started in early 2004. Both the distant future and the ancient past are a common way to deal with events that didn’t happen or have no start: employees whose last day is in 2128? They are likely still working for the company. Some events don’t have a date: say, the end of an unfinished process, or the creation event for system users. If the schema won’t let you put an empty value there, you have to find a replacement. Or the data could be corrupt; that happens too.
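A minimal sketch of flagging those placeholder dates, on a hypothetical orders table in pandas; the sentinel values and the plausible window are assumptions you would adapt to your own service:

    import pandas as pd

    orders = pd.DataFrame({"order_id": [1, 2, 3, 4],
                           "delivered_at": pd.to_datetime([
                               "2018-05-02", "1970-01-01", "2128-12-31", "2018-06-11"])})

    SENTINELS = pd.to_datetime(["1970-01-01", "2128-12-31"])
    lower, upper = pd.Timestamp("2004-01-01"), pd.Timestamp("2019-01-01")  # plausible window

    mask = orders["delivered_at"].isin(SENTINELS) | ~orders["delivered_at"].between(lower, upper)
    orders.loc[mask, "delivered_at"] = pd.NaT   # make the "no value" explicit
    print(orders)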

You might also encounter time travellers. Some users received a parcel days before they ordered it. Some installed your game after they had reached Level 5. Some left your service before they created an account. Every company that I have worked for had at least one instance of those. There was a good reason every time (like a de-activated, then reactivated, account). But expect a lot of surprises when averaging durations of intervals that might include absurdly large negative values.

Time of day, no dates

The worst that I have seen: the database had the time of day of an event without the date. It sounds simple but it is the wrong approach. If an event happened at a time on a date, store it as a timestamp, no matter what. Hours without a date are relevant to represent opening hours, so you can’t avoid the case entirely, but minimise it. If your service is open across midnight, that convention can get hairy.

In a case that I worked on (and over which I grew white hair), sessions started at a given time and day (recorded in two separate columns); the developer was stingy and wanted to save space, so he recorded only the time of the end of the event. He didn’t want the end of the session to happen after midnight UTC. To enforce that, the server sent a command at 23:59 UTC every night to log off the remaining sessions, then logged everybody back in a minute later, invisibly. Sessions were neatly contained within the same UTC date.

When the server handling that got overwhelmed, those log-off/log-on messages got queued for a couple of minutes, up to five. When that started, a few sessions had negative durations: enough to lower the overall total, but nothing heart-stopping. Soon it got more inconsistent, and we discovered the problem far too late. And when I say “every night”: did I mention that Australians hated that feature? They would get ten minutes of downtime around their lunch break, which wasn’t ideal. New developers scrapped the process. The stinginess of the original developer remains a cautionary tale at that company.

Numbers

A number is the most expected data type in a dataset. Yet, I haven’t flagged any problems with numbers so far. That’s because they are less often bad. Rest assured: they are far from perfect.

As a string

The most common problem is number fields interpreted as character strings.

Datasets are often stored as files, generally comma-separated values (.csv). When copying and loading, numbers can be interpreted as lines of text rather than numeric values, a.k.a. ‘strings of characters’, or strings. That’s easy to notice and fix, and generally not the fault of the source.

Categories

One common reason for numbers-as-strings is that they are not quantities but categories. One example is the number of doors in a car. The hatchback counts as an extra door, written “4+1”. Other types of door count as half, hence “2½”, with a strange, non-numeric “½” oddity. The result is best dealt with as categories. Another example is a poll about your number of dependents. To cap the answers, the last option is often “at least” some number of children, and the information reads “5+”. If you look at credit ratings, your assets could be negative, written “< $0”. Those are separate groups: categories, not quantities.
They can be tricky because the non-numeric cases are often rare and you don’t see them in the first 10 lines. That’s one of many reasons why full-table description tools like Python’s pandas.DataFrame.describe() are more helpful than opening the file.
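A minimal sketch of how to surface those rare non-numeric values rather than stumble on them later; the doors column is hypothetical:

    import pandas as pd

    # A column that looks numeric on the first ten rows.
    doors = pd.Series(["5", "5", "3", "3", "4+1", "2½", "5"], name="doors")

    as_numbers = pd.to_numeric(doors, errors="coerce")     # non-numbers become NaN
    print(doors[as_numbers.isna()].value_counts())         # surfaces the "4+1" and "2½" oddities
    print(doors.describe())                                # count / unique / top / freq, not a mean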

Currencies

Quantities, if they are monetary amounts, can include a currency code. Those also come with a lack of formatting standards. The currency can come before or after the amount, with or without a space. Some write large numbers with multiples (k for 1,000; M for 1,000,000; G for 1,000,000,000). Currencies can even replace the decimal comma, as in 1$50, or include a subdivision, as in 1$50¢, or more often “¢70”, or even inconsistencies like “¢.70”, where context helps decide whether this is a fraction of a cent or a fraction of a dollar.
Currencies can be written with their full name, a partial name, a single typographic symbol (£) or a three-letter code (GBP). “$” can be ambiguous when there are about 20 different international dollars. Once again, Excel is often the cause. I’ve even seen someone decide to represent negative numbers in red. The colour, and therefore the sign, wasn’t stored in the .csv version of the file they shared.

Good practice dictates that the currency is stored in a separate column, with a consistent code, and the amount stored as a numeric value. Experienced developers prefer financial amounts stored as an integer number of cents. This enforces the proper rounding rules. Of course, that leads to analysts overlooking it and reporting numbers inflated by 100.
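A minimal sketch of that integer-cents convention, with the conversion back to a display amount kept explicit so the famous factor of 100 is harder to miss; the helper names are hypothetical:

    from decimal import Decimal, ROUND_HALF_UP

    def to_cents(amount: str) -> int:
        """Store money as an integer number of cents; Decimal avoids float rounding."""
        return int(Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) * 100)

    def to_display(cents: int, currency: str) -> str:
        """Only convert back to a decimal amount at presentation time."""
        return f"{cents / 100:.2f} {currency}"

    print(to_cents("19.995"))          # 2000, with an explicit rounding rule
    print(to_display(2000, "GBP"))     # '20.00 GBP' — the division by 100 analysts forget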

I once had a dataset with the country of the operation but not the currency of the transaction. You would assume they were related, and they often are, but modern banking allows the two to be decorrelated. The Indonesian Rupiah and the Japanese Yen are two of the lowest-denomination currencies. Amounts in either country should be very large to reflect that. This also makes it easier to check that the foreign-exchange rates are not flipped. In that set, some transactions in Tokyo had a low amount, low enough that it would make a lot more sense if it were dollars. When one had a client named John who paid with a US-based AmEx, I suspected we had inconsistencies. Other transactions were obviously in Yen, but many were in between. I reached out to the developers and asked for clarification. They rapidly agreed to postpone my project and include the currency in the data pipeline, rather than have me pull price lists and six-month-old foreign-exchange rates.

Commas and periods

Some countries conventionally use a comma to separate thousands and a period to separate decimals. Others use a comma to separate decimals and a period to separate thousands. Some prefer a non-breaking space for thousands. I don’t think it’s possible to parse a dataset that uses both conventions. Not every form takes that into account and produces consistent output from local formatting.

In the same vein, the CSV (comma-separated values) file format is incompatible with using commas as numeric separators (for decimals or thousands). At this point, it should feel obvious. Painfully obvious issues like this do happen, though, and often. If you see it, ask to get the data as tab-separated values. Some people do type tabs into forms, but that is a lot harder to do by accident in a web form, and a stray tab carries less meaning than the difference between 1. and 1,000. dollars.
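If you do receive a file written with European conventions, a minimal sketch of parsing it in pandas (the column names and the semicolon separator are hypothetical):

    import io
    import pandas as pd

    # ';' as field separator, ',' as the decimal mark, '.' as the thousands separator.
    raw = io.StringIO("amount;country\n1.234,50;DE\n12,99;FR\n")

    df = pd.read_csv(raw, sep=";", decimal=",", thousands=".")
    print(df["amount"].sum())   # 1247.49, parsed as proper floats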

Partial counts

One last surprise happens when counting things that are whole (customers, orders, sessions, etc.): those quantities might have decimals. You end up asking about non-integer events and customers painfully cut in thirds. What actually happens is that, if users can cancel, one can estimate the likelihood of any cancellation happening; same for agents with a predictable unscheduled-absence rate. Analytics reports and capacity planning use that speculation as a weight on individual future orders, attendance, etc.
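A minimal sketch of where those fractional counts come from, with hypothetical cancellation probabilities: the “2.6 orders” in a capacity report is just a sum of survival probabilities.

    # Hypothetical open orders, each with an estimated probability of being cancelled.
    open_orders = [
        {"order_id": "A", "p_cancel": 0.10},
        {"order_id": "B", "p_cancel": 0.25},
        {"order_id": "C", "p_cancel": 0.05},
    ]

    # Expected completed orders: weight each whole order by its chance of surviving.
    expected_completed = sum(1 - o["p_cancel"] for o in open_orders)
    print(round(expected_completed, 2))   # 2.6 — a sensible non-integer count of whole things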

Please note that, so far, nothing that I have mentioned is a real data problem. Data sources that don’t follow conventions, or don’t log enough, are inconvenient. Yet, if you pay attention, you won’t get wrong results from them. Most “bad data” is rarely wrong in a complicated way. More often, someone was careless, and you must check and clean.

Categories

The real problems and costly mistakes start with categorical information.

Unclear grouping, refactors

Those sound easy but they represent the biggest frustration when exploiting datasets.

Possible categories are often not listed independently from the fact tables. You can aggregate your dataset to find the categorical options mentioned. Three types of problems stem from there:

  • The categorisation has changed while the data was being collected. You end up with events in Czechoslovakia, Slovakia, Czech Republic, and Czechia to sort. Some are the same entity under a new name; not all of them (see the sketch after this list).
  • The categorisation is incomplete. Rare events might not have happened in your sample. You risk discovering new categories as you go. That’s not a problem for reporting. For modelling, you end up with a cold-start problem. To solve it, you end up with extensive rules about whether to consider the Isle of Wight, Gibraltar and the Isle of Man as part of either the UK or GB.
  • There are too many categories to be usable. You need to regroup them into larger, sensible groups. For business-trend forecasts, sales to San Marino, Andorra and Monte-Carlo go with the Italian, Spanish and French totals. If you care about taxes, not so.
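A minimal sketch of the kind of mapping this forces you to maintain; the table is hypothetical and deliberately leaves Czechoslovakia out, because it is not the same entity as either successor:

    # Renames of the same entity are safe to merge; regroupings depend on the question
    # (trend forecasts here, not taxes).
    CANONICAL = {
        "Czech Republic": "Czechia",     # same country, new short name
        "Czechia": "Czechia",
        "Slovakia": "Slovakia",
        "San Marino": "Italy",
        "Monaco": "France",
        "Andorra": "Spain",
    }

    def normalise(country: str) -> str:
        """Fall back to an explicit bucket so new, unseen categories stay visible."""
        return CANONICAL.get(country, "UNMAPPED: " + country)

    for raw in ["Czech Republic", "San Marino", "Czechoslovakia"]:
        print(raw, "->", normalise(raw))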

So far, I’ve used geographic examples. Every business has its own entities, with types and exceptions. A car-related business has car types small and big, but also lorries, heavy-capacity vehicles, probably motorbikes, possibly quads, forklifts, boats. Which ones go into “Other”? Food (in restaurants, supermarkets) comes in many shapes. Is alcohol a food or a drink? Is wine that needs refrigerating “fresh produce”? For the logistician, essentially yes; for marketing, no. If you compare content types on social media: are .gifs an image, a video or an embedded link?

List of countries

Geographic entities are a particularly fun type of category: cities, neighbourhoods, regions. Do you know the difference between a country, a territory, a sovereign state and a polity? You will need to know enough to classify activity. That includes, if you work for a web-based business, IP (Internet Protocol) addresses from the Åland Islands, Antarctica, Null Island or the Space Station. For more business-relevant areas, check how the British Isles are different from the United Kingdom or England.

Every territory comes with oddities, edge cases like the Faroe Islands, Bavaria, Zanzibar, Taiwan, Tobago, the Cook Islands, Curaçao, Sardinia, New Caledonia, Hong Kong, Palestine, Guam, Mauritius, Palau, Christmas Island or Antarctica. Each of those has a unique tax, political or geographic quirk.

Localised or user-typed information about cities or regions comes with a variety of spellings. Remember to merge Munich, Muenchen and München. Note that South Sudan is not the same as the south of Sudan (anymore; it used to be). It can get trickier if you need to get the time zone in Сиби́рь. (There are several.) Are 深圳 and 盐田 the same thing? (Essentially, yes.) What is the country of नई दिल्ली? Even that can change. Did you know that Kazakhstan changed its official alphabet from Cyrillic to Latin?

Classifying traffic into types: “mobile” vs. browser, tablet (screen/window size, etc.)

The most difficult categories to handle are not placid geography but device types. Technological advances have given browsers to cellphones and tablets. Smartphones now have apps that themselves have browsers. Phablets confuse all that. Each device and screen size can be emulated. Most services are available as an app, as a different app that is actually a webpage, and as an actual webpage whose properties change with its dimensions. How do you simplify all that into three types of interface? On a given dataset, is it consistent that the only client types are “mobile”, “browser” or “GoogleAgent”? Which one are clients using the app on a large-screen tablet? Technology will change and all of this will be proven inaccurate soon: I can already plug my phone into a portable screen with more pixels than my high-end laptop.

Assigning a device to any web traffic is hard, and you are better off expecting surprises there. Another exercise: imagine that each session has a device category (mobile vs. desktop). You want to assign each user to a single category. What do you do with people who use both? I’d recommend having three groups: mobile, desktop and both, “both” being customers who have used neither type more than twice as often as the other over the last 28 days. Or whatever definition doesn’t make users change categories too often. Every detail of this is up for debate; it could change, and if it does, you could face large inconsistencies.
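A minimal sketch of that rule, with the threshold exposed as a parameter precisely because every detail is up for debate; the function name and inputs are hypothetical:

    from collections import Counter

    def user_device_group(session_devices, ratio=2.0):
        """Label a user 'mobile', 'desktop' or 'both' from their recent sessions
        (say, the last 28 days). One side wins only if it is used more than
        `ratio` times as often as the other; everything else is 'both'."""
        counts = Counter(session_devices)
        mobile, desktop = counts["mobile"], counts["desktop"]
        if mobile > ratio * desktop:
            return "mobile"
        if desktop > ratio * mobile:
            return "desktop"
        return "both"

    print(user_device_group(["mobile"] * 9 + ["desktop"] * 2))   # mobile
    print(user_device_group(["mobile"] * 5 + ["desktop"] * 4))   # both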

Tracking users is getting harder because users are unhappy with it. There is too much tracking and not enough visibility on how it benefits users. They delete cookies, spoof user-agents, send deep links between their devices. None of that is likely to stop anytime soon. Expect more messiness and innovation in inferring who customers are, merging seemingly different profiles, “soft” logins. Those can lead to changes halfway through a dataset, often silent. For the sake of your sanity, document those changes.

Exceptions: -1, “Null”, Not-a-number

A very common way of expressing an inconsistent value is to use an absurd number. Imagine a dataset about tax reporting. One question asks for your “number of dependents (including yourself)”. How many children does a judicial person have? What is the date of your death? The mailing address of a Navy ship? Some people decide to assign absurd values to those nonsensical questions, like “-1”, “01-01-1970” or “00000”. Those absurd values can contaminate your work if you are not careful and don’t sanitise your data. You can average -1 with other integer numbers of dependents. You can calculate the (negative) age of someone. There are institutions with that postcode, and a naive join will find them.

Sometimes a price reads “13£ * nb of items”. This is not exactly false, but it is not a number that you can add up without processing the text first.

Even sneakier: some software, given no value, will force one, be it an empty string “” or “NaN”, meaning Not-a-Number. In context, those are meaningful missing values. Taken out of that context, you will misinterpret them. When you extract, convert or adapt a dataset carelessly, things can fail. I’ve worked on a pipeline that passed from PostgreSQL through Python through another type of SQL. The initial result included both “NULL” and “NaN” as four- and three-character strings. Not what I expected, and it took us a while to notice.
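A minimal sketch of declaring those sentinels up front when loading, on a hypothetical extract; treating “-1” as missing is only safe if you know it is a placeholder in that column:

    import io
    import pandas as pd

    # Hypothetical extract where missing values leaked through as literal text.
    raw = io.StringIO("user_id,dependents\n1,2\n2,NULL\n3,NaN\n4,-1\n")

    df = pd.read_csv(raw, na_values=["NULL", "NaN", "-1"])   # treat the sentinels as missing
    print(df["dependents"].mean())   # 2.0, instead of an average polluted by -1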

Filtering those exceptions is important. Noticing them, i.e. knowing that they can come about, is essential.

Unstructured

Most of the data that you want to get is structured. It fits into columns, each column has a clear type, a legible name. Columns connect between tables with consistent IDs. That’s not always possible.

Event type | Event object | Event detail

Teams who work on different aspects might handle very different interaction types. They still need to write to the same event table. They need to agree on the columns of that table. A classic solution is a generic event structure:

  • Each event has clear features: an author, a service, a timestamp. Those are well-defined, structured columns.
  • They also have a more generic event type. Those need a generic typology with a couple more details, like a status. A common solution is a pseudo-hierarchy like ‘event type|event object|event detail’.

The latter hardly constrains the pre-existing lack of coordination. Because of that, teams will have widely different interpretations of what should go in the last columns. You can end up with inconsistent typologies; new types might appear and be missed because the intermediary classification has a catch-all clause, etc.
For instance, the team handling financial transactions has clear providers, payment types, success or error types, and fits that constrained structure quite well. The team handling customer contacts or complaints wants to store the text of the message, maybe the type of call, or the notes by the agent, probably even the whole conversation with when each message was sent and the number of interactions. That will push the limits of the “event detail” column. Drama ensues. The teams can’t decide between shoving details into that last column or using a separate table with a proper structure. One thing they will agree on: not telling the analyst who joins three months later.
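A minimal sketch of what the analyst ends up doing with that pseudo-hierarchy; the events table and its values are hypothetical:

    import pandas as pd

    events = pd.DataFrame({"event": [
        "payment|card|success",
        "payment|card|declined",
        "contact|call",                 # a team that never fills the third slot
    ]})

    parts = events["event"].str.split("|", expand=True)
    parts.columns = ["event_type", "event_object", "event_detail"]
    print(parts)   # the missing third field shows up as None — worth monitoring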

JSon blobs & free text

More often than not, long conversations should be stored not as strings of limited length but in a more flexible format, often JSON. JSON is semi-structured, which means you can assign meaning, with quotes and brackets, and turn a text into arrays of values. This is perfect for situations where you can’t predict how long or how much detail will be attached to an object, like a conversation. If you want to implement in-game exchanges within a conversation, JSON lets you simply add that information to each message. The flexibility is very tempting, very convenient; however, processing that dataset after the fact, and after many changes in how things are implemented (because nothing constrains developers to respect a previous convention), can be a horrible experience.
This is not exactly bad data, as in “wrong”, but it can be tedious.
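A minimal sketch of flattening such blobs after the fact; the column and its keys are hypothetical, and the point is that keys added later simply become sparse columns you have to notice:

    import json
    import pandas as pd

    blobs = pd.Series([
        '{"messages": 3, "agent": "agent_42"}',
        '{"messages": 5, "agent": "agent_17", "refund_offered": true}',
    ])

    parsed = pd.json_normalize(blobs.apply(json.loads).tolist())
    print(parsed)   # new keys appear as new, mostly empty columns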

Free text, images and video files

Even more tedious is data stored in an unusable format. The classic example is information stored as an image of a text. You would need to do a lot of optical character recognition (OCR) and parsing (like removing the column names at the top of every page, matching pages, etc.). You will probably need entity recognition if the information is in full sentences. The “information” is there, but it is very hard to explore and impossible to search without spectacular effort. A lot of non-profit and charity data work actually does exactly that: transform documents into databases. Once the information is structured, the rest of the work is far easier.

I can give two examples of a similar situation in my personal life.

Some years ago a black-cab driver was very offended that I was riding my bike in London, near St-Pancras; he told me so loudly, demanding that I “go **** [myself]”, something for which, honestly, I don’t have the flexibility. Exasperated, he decided to take the matter into his own hands and run me over: he turned his wheel in my direction and accelerated to make sure I couldn’t ride a bike ever, anywhere. My bike didn’t like it. Thankfully for me, it was one of those bikes you can rent by the minute, so I only had to push it to the nearest holding gate and apologise to their helpline for the damage. (Without a working knee, it was still rough.) The person on the phone was not very surprised, but empathic, and suggested that I call the Metropolitan Police to let them know about an attempted murderer on the loose. It was clear from the lack of reaction that they didn’t take note of the damage to track it further. They didn’t need proof that I was innocent, but what about the Police: surely, they can’t send someone to jail for ten years on my word alone?

London is known for its extensive security-camera network. There were half a dozen cameras pointed at the intersection where I almost lost my limbs. I had the exact minute, the exact location, several options: I called the Police, confident that I would take a dangerous murdering fanatic off the streets.
If you ever have the displeasure of having someone try to kill you, or anything of the sort, I can save you 10 minutes of your life: the information is probably there, but “not really”. At least, that is what I was told on the phone, and later in an email. Whatever extensive surveillance system they have, the Met was unable to find the feed from a given camera at a given time. I can’t say that I encourage you to go on a rampage, but I have no argument to convince you that you would be caught: that criminal is still roaming the streets. Recent statistics on road fatalities seem to have noticed.

To summarise: even if privacy advocates are very concerned about the surveillance state, they shouldn’t be: video surveillance over London is, or rather was at the time, bad data.
What would it take to make it good? It really depends on your use: indexing it by time and date is an obvious one. Mapping where the cameras are, and knowing which camera sees to the left or right of each other camera, would be an interesting geometric problem to solve. Interpreting motions and situations, and detecting problems, might help index it further; recognising faces would get us closer to an Orwellian situation, but it would be good, in the sense of usable, data.

Another, more recent personal example: I recently started renting a flat. The realtor demanded that I provide three months of bank statements, not just pay slips, but the entirety of my spending, detailed, over 90 days. That sounded shocking, but I don’t like sleeping rough, so I obliged.
My bank has a very convenient interface: I can classify all my spending by merchant, type, currency, make detailed statistics; they even provide a forecast! When I saw my “statement”, a document listing the same information, I doubted that my realtor would enjoy all those features. For every transaction, the statement had a date, an amount and a description, often truncated to 16 characters. Most of those didn’t make sense to me: the first letters of the official name of the franchisee operating my supermarket; the reference of my letting contract for the realtor fee; the name of the company operating the train-ticket app that I use; etc. A lot of it was confidential in a way, but hardly any of it seemed usable.
My bank makes a lot of effort to associate every new transaction description with what it means for us, customers: supermarkets are given their brand name; bar tabs are tagged as ‘going out’; the electricity bill as ‘utilities’. The data on my raw statement isn’t usable. Like surveillance-camera footage, it isn’t bad. The data that my bank collected in order to convert my statement into a usable, legible dataset — that is distinct, external, third-party information, and it is valuable.

We have two lessons here:

  • data isn’t good in itself: what counts as better depends on your intention for it; verifying claims requires little extra information; detecting anti-social behaviour without citizen prompts, more so;
  • good data is often also based on external, relevant sources: weather, holidays, competitive information, economic maps, categorisations all help make sense of raw data.

To summarise so far, none of the above is exactly “wrong” data: it could be useful as it is. However, all those formats and partial information can be painful to work with and might hide bigger problems.

All those errors, missing data or bad formats, are rather obvious on the screen once you see the right line. Sometimes, the issue is a bit more discreet and you need more context to catch it. There is bad data that you won’t catch on your first day of looking at it.

Lies

One possibility is that someone is misrepresenting who they are on the internet. It’s often what people expect when they imagine bad data, but it is not as overwhelming as formatting issues. Fraud has direct economic consequences; therefore service operators actively detect it and enforce against it; proper formatting is less beneficial in itself. Fraud doesn’t block analytical exploration either, because you can analyse fraud.

Robot traffic and spam

The largest source of data for any service is traffic data. The largest part of that traffic is not commercially relevant: spam traffic, crawlers (often from competitors), DDoS attacks, bots trying to guess passwords repeatedly, etc.
Filtering those out is a whole data science project in itself. You should see spam detection as part of the data processing rather than a failure from your provider, but good data often is spam-filtered. Working on unfiltered data without expecting noise is prone to terrible interpretations.
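As a deliberately naive first pass, here is a sketch that drops sessions whose user-agent string looks automated; the sessions list and marker list are hypothetical, and real spam detection is far more involved than substring matching:

    BOT_MARKERS = ("bot", "crawler", "spider", "curl", "python-requests")

    def looks_like_a_robot(user_agent: str) -> bool:
        ua = (user_agent or "").lower()
        return any(marker in ua for marker in BOT_MARKERS)

    sessions = [
        {"user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X)"},
        {"user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    ]
    human_sessions = [s for s in sessions if not looks_like_a_robot(s["user_agent"])]
    print(len(human_sessions))   # 1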

Fraud, false transactions, fake accounts

Obviously non-human traffic is problematic, but the more expensive issues come from humans trying to cheat: they can use someone else’s credit card or identity; get deliveries at an address that isn’t theirs to avoid detection; spoof their location; create fake accounts to evade bans or inflate their number of friends; lie about their name, age or gender; upload profile photos that are not of them; tell a sad story that isn’t true; etc.
Those efforts can be more or less transparent, visible, industrialised, and detecting them is important, unless you want to build models that conclude that the most reliable sign of a very valuable customer is that they use several dozen credit cards, each with a different name, or that the safest way to get someone to re-order is to not send them their parcel.

You typically find out about those in two ways:

  • you understand the service, and how its incentives encourage misbehaviour; the ways around expected use should become obvious to people with the right mindset, and you can check whether anyone takes them;
  • you build a report, or a model, and notice absurd results: the best customers cancelling most of their orders, or ordering delivery to consistently different addresses; partners who disappear after a few minutes, processing revenue larger than the GDP of their home country; etc.

Because there is no limit to human creativity, there is no limit to how nuanced those can be, but most are rather obvious to notice and to detect once you know what they are.

Inconsistent structure

Most of the time you spend dealing with bad data isn’t when data is missing, or hard to parse, or even when someone intentionally misled you, but when it looks mostly consistent and something subtle and undocumented went wrong.

Wrong ID, wrong match, different namespace

There can be a mismatch between what people thought were equivalent concepts: say, a customer and an account; an order and a delivery for e-commerce; or a session and a search result. Both concepts have IDs, and they might be similar but not equal if the service lets users order without creating an account, have a failed order re-delivered, or search several times in the same session.
If you assumed that the two were the same, you end up with very confusing results:

  • either locally absurd results: e.g. an Argentinian customer ordering from South Africa and paying in Rubles (but many times over); deliveries without a matching customer anywhere in the table (see the sketch after this list); agents with negative work hours; or
  • overall totals that do not add up: cancelled orders that represent 20% more than what the reimbursements in the books say; more successful orders delivered than customers who received one; VAT higher than your revenue; etc.
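A minimal sketch of that kind of reconciliation check, an anti-join between two hypothetical tables whose IDs everyone assumed shared a namespace:

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3]})
    deliveries = pd.DataFrame({"delivery_id": [10, 11, 12], "customer_id": [1, 2, 7]})

    # Anti-join: deliveries whose customer_id matches nothing in the customers table.
    orphans = deliveries.merge(customers, on="customer_id", how="left", indicator=True)
    print(orphans[orphans["_merge"] == "left_only"])   # customer_id 7 should not exist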

That’s where you want to ask product engineering for a detailed schema and grab a large pot of coffee. You might earn senior engineers’ temporary hatred, and possibly their admiration, for finding some very tough bugs.

No one knows

The real problem with a lot of those issues is when there is no one able to walk you through how the information is matched in the first place. Not having extensive documentation on a dataset is a problem too.

This is why I typically consider that a dataset is bad unless it comes with a Story. A story is a two-to-ten-page document (depending on the complexity of your service, roughly the age of the company in years) explaining what people do. An essential part of that document is keywords: the entities that are structural parts of your service schema, the words that you use daily, team names. Also extremely instructive are the subtle distinctions: what is a basket, an order, a delivery; what does inactive mean; how do you classify partners; and what are the key ratios between each of those: average orders per month per customer, 80/20 distributions of revenue, etc.

I would strongly recommend, when given a dataset, to start writing that story and to expect to flag data errors whenever you can’t write it consistently. As a data scientist, you should easily be able to write that story from the dataset and some explanations; your clients might not have the analytical skillset, but they can give feedback through the story that you are drafting and its holes.

My friend Ian Ozsvald recommends writing that document in a Jupyter notebook. I would concur: a notebook lets you easily mix formatted text with queries and distributions. If not everyone in your organisation knows how to read Python, SQL or R, I would recommend that you copy and export it to a larger audience: you want feedback, notably on organisational changes far from your day-to-day interest. I actively try to stay away from anyone who understands VAT because they are nerdier than me and I don’t know how to cope with that, but I have to admit: they have a lot of insight into how the company reclaims it, and whether it affects the numbers that I’m looking at.

Those problems can actually pass a lot of common sniff tests: data tables with those problems load without missing values, they don’t have text where numbers should be, and the first lines on your screen make sense. But further down the line, you might see odd things.
How do you make sure not to miss those?

Auditing and monitoring

The best way to detect those is to have extensive detection systems. There are two types of controls that you want to set up, as soon as you have the data:

  • auditing makes sure that you understand the data structure and its constraints; it is typically made of obvious checks, a bit like unit tests in software development, or a financial audit in accounting: the sum of revenue per user is the same as per country and overall (a minimal version is sketched after this list); users active equals users joined minus users turned inactive; all that sounds simplistic and ridiculous, but you will learn a lot about the actual schema of your service, and other business rules: when to apply foreign-exchange rates, how to separate by dates, whether a user belongs to a country exclusively, etc.
  • monitoring: another very helpful way to understand a service is to forecast activity levels on every relevant metric and detect outlying events; first in the past, which should help you understand what happened and ask questions about outages; then, once monitoring is automated, you can learn how things interact: if a major provider in a country fails, that country’s activity will also drop.
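A minimal sketch of the first kind of audit check, on a hypothetical revenue table; in practice you would compare with a tolerance rather than strict equality, and run it on every refresh:

    import pandas as pd

    revenue = pd.DataFrame({
        "user_id": [1, 1, 2, 3],
        "country": ["UK", "UK", "FR", "FR"],
        "amount": [10.0, 5.0, 7.5, 2.5],
    })

    per_user = revenue.groupby("user_id")["amount"].sum().sum()
    per_country = revenue.groupby("country")["amount"].sum().sum()
    overall = revenue["amount"].sum()

    # The totals should agree whichever way you slice them.
    assert per_user == per_country == overall, "revenue does not reconcile across breakdowns"
    print(overall)   # 25.0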

Both are continuous processes: you can’t really evaluate the quality of a dataset fully on the first day. A lot of the more obvious issues (those higher up in this long list) are easier to notice, but you won’t find that the conversion rate between the Danish Krone and the Euro (two notoriously close currencies) is out of whack until it meaningfully changes, and it rarely does.

Conclusion

There is a joke that I like to tell about data quality. A friend of mine was once asked if he spoke Arabic, to which he smugly responded: “I don’t know: I never tried.” Truth is: he was half-serious. But no, he actually didn’t.

Data quality is the same: if you are not sure your data is great, I can confidently tell you that it’s not. All data is bad, and that’s OK. At least, until partially proven otherwise. Proving that, understanding the data, is not the most glamorous or rewarding part of the job, but it is the most instructive, so be happy that you get to do it, every time you get the chance.
