What does bad data look like?
Earlier this week, PyData Manchester organised a panel on how to extract value from data. I was kindly invited to give my opinion.
The debate was lively, thanks to the Slack channel.
Every panelist agreed that the main problem they faced in doing useful data science was “bad data”. The idea that data could be bad took the audience aback. Most of them had significant training in machine learning, academic or on Kaggle. They had less experience doing data science for clients. One participant asked for common examples of bad data.
I thought it was a very important question. Bad data comes in many shapes and forms; it’s hard to pin down exactly. After almost 20 years of dealing with it, I’m starting to see patterns. The question deserved at least 40 minutes of answer. I couldn’t improvise and steal all that time from the organisers. This is why I started penning a first draft.
Please don’t be indulgent and point out all my mistakes. The document has minimal rights reserved. Feel free to copy, steal, take out of context and deform my position as much as you like.
This presentation evolves from the most glaring issue, no data at all, to irrelevant data. Then we talk about various formatting issues and miscommunications. From those, we move on to real errors and representation errors. Finally, we talk about fraudulent data.
The most common type of bad data is no data at all. If you did an MSc in machine learning, you expect data science to be about collinearity correction and neural network convergence. It is far more common to have someone ask for help when they don’t have any data at all. It’s not “bad” data per se, rather “no” data.
Truth is: the word is ambiguous. “Data” is a Latin word that means “things given, gifts”, but nowhere does it say who gives what to whom. Most people who work with data, but are not analysts or data engineers, receive reports. They might not realise what steps happen upstream. Imagine a baker who asks you for wheat flour.
That absence of data comes in many forms.
“I have a new idea: how can Data Science help from day 1?”
The most emblematic form of that is the self-described “co-founder” of a future company. They expect that a data scientist will magically provide insights and guarantee success.
In most cases, a data scientist is not relevant until after series B. First, an innovator creates a concept and tests it. Then they raise a series A to test the market fit. Once you have proven the fit, you will want to scale. Not everyone will sit in the same room anymore. To keep control, you will need data processes, audits, reporting. Once you identify possible automation, data science becomes relevant.
One common exception is companies providing data science as a service. In this case, how you are going to get that data matters: clients, an innovative collection method. There is data whose quality one should assess.
Don’t involve data science too early.
There is an internal version of that request. A product owner has a great, rather left-field idea. They want to make sure it will work, receive support and scale. They come to the data science team, hoping for fairy dust. First, they need to prove the idea works with examples. Then they need to turn it into a process. Data science will find a way to automate that process, once it is at scale and collects data.
Internal or external project: Data science comes third. First, try your idea. Then start scaling a little, learning about your users. Once some process becomes inefficient, then data science can help.
Managers of an on-line business hate down-time. The sunk-cost fallacy, or the fear of missing out, makes it uncomfortable. Stakeholders will demand to understand what happened. They will want a solution. (Pro-tip: it’s technical debt. A non-technical manager can and should help address it.)
One very common request is to ask the data team if they can help. The conversation goes: “The server went down last night at 21:48, and…” Actually, I don’t know the rest of that sentence because I stopped listening. I interrupt: “Was any data logged during the incident? — No, of course, the…” That’s when I stand up, thank them for the very short meeting, shake their hand and walk toward the door.
I will admit that this is not the nicest reaction. It is a good opportunity to clarify: without data, no data science. You could interpolate the activity before the down-time, but that’s speculative. Actual sales lost (that you can’t list) and inferences about custom you could have lost are not the same. That’s a key concept in statistics. It’s not always clear to everyone. You need to make a bold point when differentiating the two. I’ve learnt that the hard way.
Data, but not related or relevant
People with an idea often come with unrelated data. They have a product that works well, say, shoes for one gender, and want to expand to the other. Nowhere in their data is any trace of gender, such as first names.
A common real example of that is churn prevention: people leave your service, you want to know what to do about it. My first question is: do you ask them about the reasons why they leave? If not, well, start collecting that. If there is an obvious problem, or a common complaint, you don’t need data science to deal with it. Understanding successive issues and improving several solutions is a lot of effort. After thousands of complaints, you have matched customers’ contexts and attempts at a solution. That should represent enough data to make informed guesses. Can an experienced human predict what could work before they ask the customer? If they can, then data science might be able to, too. With a prediction, you can automate the process. That’s rarely before the process has grown tedious. Once again: data science comes third. Test the market fit first; scale and collect data second; data science comes once you have enough.
The previous examples are common situations before working for a client or an employer. They are glaring enough that you notice and postpone. But if a prospect shows you metrics and reports, it looks like they have enough data. As a data scientist, the offer to work with them is tempting. You might decide to check the data later, once you have full access. There are more big surprises there, including more missing data.
Large chunks missing, no one knows why
A common occurrence is large missing sections of data. You can often recognise patterns. The first analyst or data engineer joining will have a visible impact. So will improvements in the code release practice. In high-growth companies, those improvements can be just weeks old. Working on short datasets can be a problem, especially if you want to predict seasonality.
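A quick way to surface those holes is to compare the log’s active days against each other. A minimal sketch in pandas; the column name, the sample dates and the seven-day threshold are my assumptions, to adapt per dataset:

```python
# Hypothetical sketch: finding large gaps in an event log's daily coverage.
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-03",
        "2021-03-01", "2021-03-02",  # January 4th to February 28th is missing
    ])
})

days = events["timestamp"].dt.normalize().drop_duplicates().sort_values()
gaps = days.diff()                      # time since the previous active day
suspicious = gaps[gaps > pd.Timedelta(days=7)]
print(suspicious)                       # one 57-day hole, worth asking about
```

Anything this flags deserves a conversation with whoever owns the pipeline, not a silent imputation.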
All the previous examples are surprising but easy to resolve. Data isn’t there, so you have to look for another proxy. If you can’t, you are out of luck. There isn’t much to do. You will spend most of your time with data that exists but, more often than not, is written wrong.
Date and time
Anyone who has manipulated actual data hates date formats. Problems with formatting are common. If you want to make friends with data scientists, joke about anything frustrating about timestamps. You’ll get instant veteran sympathy.
Should you put the day of the month before or after the month number? That specific convention is a common and frustrating way to lose an hour. Americans, who are wrong about every metric and standard, use MM-DD-YYYY. Some other countries prefer DD/MM/YYYY. The trick is generally to look for a January 13th. There is more to it than the inversion: the separators (‘/’ vs. ‘-’) are also different. Check for a leading 0 for the first nine months or the first nine days of the month. If not, you are dealing with D/M/YYYY. It won’t break the data, but it will make alphabetic order wrong. Some will argue that MMM or MMMM, i.e. using the month name, abbreviated to three letters or not, is more human-legible. That’s true but not relevant, because you will always want to leverage the alphabetic order. It will also fail to parse localised data (from non-English-speaking countries). 02-Fev-2016 and 15-Okt-2013 are generally fine, but 27-Gui-2012 and 09-Lui-2001 might not be. Also make sure you agree on two-character or four-character years. You don’t want to attribute your user growth to Roman emperors.
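The “look for a January 13th” trick can be automated: if any first field exceeds 12, the file is day-first. A minimal sketch, assuming ‘/’-separated dates in a single column; the sample values are invented:

```python
# Heuristic, not a full parser: detect day-first vs month-first ordering.
import pandas as pd

raw = ["03/04/2021", "13/01/2021", "25/12/2020"]   # a "January 13th" hides here
first_fields = [int(s.split("/")[0]) for s in raw]
dayfirst = any(f > 12 for f in first_fields)       # 13 gives the game away

fmt = "%d/%m/%Y" if dayfirst else "%m/%d/%Y"
dates = pd.to_datetime(raw, format=fmt)
print(dates[1])   # 2021-01-13, not a crash on month 13
```

Note the heuristic can only prove day-first, never month-first: a file with no day above 12 stays ambiguous, and you have to ask the source.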
You can also add the time of day to that. Most people agree to use colons to separate hours from minutes and seconds. Some use 24 hours and others specify AM/PM. Every single character in a date can and will hurt. Anyone sensible knows that the only proper way to format a date is YYYY-MM-DD hh:mm:ss with a 24h standard. If they say otherwise, they are wrong and they should not be your friend.
For most of the strangeness there, you have Excel to blame. The spreadsheet converts any date-looking text that it sees, without warning. That includes fractions (1/12), differences (3–04), intervals (12:35). It applies a local standard (US, UK, etc.) without asking for clarification. It doesn’t let the user correct or amend any conversion. When working inside Excel, this is transparent thanks to editable formats. Those formats don’t apply to the source data stored as .csv. This leads to massive headaches. Friends don’t let friends handle data processed through Excel.
The problem is not that data formats are odd. You can correct for that; most tools do so automatically. The biggest pain point is inconsistency. One service will send data in one way; another in another. Both send information to the same flow, into the same table. The inconsistent hodgepodge might become impossible to untangle.
One surprising problem is microseconds. Most computers like to count and include those. It can be helpful when dealing with very fast (non-human-triggered) events. But date-handling software often doesn’t expect them. Some tools ignore the information. Some assume the period is a separator and treat the trailing digits as another column. A friend uses that problem in interviews, to test people’s experience in handling data. They give candidates a dataset where some, but not all, timestamps include microseconds. The interviewer does not point it out. The goal is to count the number of events per day. It sounds elementary. Yet people can struggle for an hour if they don’t expect the inconsistency.
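One defensive way through that exercise is to parse each timestamp on its own, so stray microseconds can’t break a single-format parser. The sample values are invented:

```python
# Count events per day when some, but not all, timestamps carry microseconds.
import pandas as pd

raw = [
    "2021-05-01 10:00:00",
    "2021-05-01 10:00:00.123456",   # microseconds sneak in
    "2021-05-02 09:30:00",
]
# pd.Timestamp parses each value individually, whatever its precision
ts = pd.Series([pd.Timestamp(s) for s in raw])
per_day = ts.dt.normalize().value_counts().sort_index()
print(per_day)   # 2 events on May 1st, 1 on May 2nd
```

Per-element parsing is slower than a vectorised single-format parse, but it survives the inconsistency that the interview is really testing for.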
There are more nuances. A common surprise is that some software converts dates to timestamps (date and time). They can do so assuming a time zone, without specifying they did. This isn’t exotic, but it can take aback even experienced data wranglers. The main source of pain there, and in general, is actually a treasure for geography geeks: time zones.
Time zones and daylight saving time
Things happen at a time and date, somewhere. Time and date are not the same around the world. I was born in Tahiti, which explains my interest in oddities like the International Date Line. It is a real thing; it has the wonkiest shape; it doesn’t move; it does allow for time travel. Any serious developer will record everything using Coordinated Universal Time (UTC). If you want to know whether something happened in the morning or at night, you have to convert to local time. For any behaviour-based analysis, you want to do it. You also want to assign any human action before 4am local time to the day prior.
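The 4am rule can be sketched with Python’s standard zoneinfo. The cutoff hour and the London time zone are my assumptions for the example, not a universal rule:

```python
# Sketch of the "before 4am belongs to yesterday" convention.
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

utc_event = datetime(2021, 6, 12, 1, 30, tzinfo=timezone.utc)  # stored in UTC
local = utc_event.astimezone(ZoneInfo("Europe/London"))         # 02:30 BST

# A 2:30am session counts as activity of the previous calendar day
business_day = (local - timedelta(hours=4)).date()
print(local.hour, business_day)   # 2 2021-06-11
```

The subtraction trick keeps the rule in one place instead of scattering if-before-4am conditions across queries.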
25 countries have more than one time zone, often in overseas territories. Because of that, time zones can’t be resolved by country. They extend to regions, but they are referenced by city. To tell the time zone of any given address, you have to find the nearest city (large enough to have its own time zone referenced). Localisation problems complicate that further. International standards reference the city’s English name. The dataset might be using the local language.
It gets even more confusing with daylight saving time! One hour of the year doesn’t exist in local time. One hour exists twice. All time intervals across the change are off too. Final blow to sanity: formatting. Do you remember YYYY-MM-DD hh:mm:ss? There isn’t a consistent standard for appending a time zone to a timestamp. You can use three letters, or one letter of the alphabet (no one uses that), or the name of a city. My personal peeve: BST could mean either British Standard Time or British Summer Time.
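Python’s zoneinfo makes the twice-existing hour tangible. Here is the October 2021 change in London, where clocks went back and 01:30 local time happened twice that night:

```python
# Demonstrating the hour that exists twice when clocks fall back.
from datetime import datetime
from zoneinfo import ZoneInfo

london = ZoneInfo("Europe/London")
first = datetime(2021, 10, 31, 1, 30, fold=0, tzinfo=london)   # still BST
second = datetime(2021, 10, 31, 1, 30, fold=1, tzinfo=london)  # back to GMT

# Same wall-clock time, one hour apart in reality
print(first.utcoffset(), second.utcoffset())   # 1:00:00 vs 0:00:00
```

The `fold` attribute (PEP 495) is how the standard library disambiguates; most datasets carry no such flag, which is exactly the problem.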
Future, Jan 1st 1970, time travellers
Let’s assume that you have a dataset with events and their proper dates. You can aggregate per day and find the number of events per day. All is good until you realise that a lot of things happened… ten years in the future? Or on January 1st 1970, which is surprising if you work for a company started in early 2004. Both far future and far past are common ways to deal with events that didn’t happen. Some events don’t have a date: say, the end of an unfinished process, or the confirmation event of system users. If the schema won’t let you put an empty value, you have to find a replacement. Or the data could be corrupt.
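Before aggregating, it pays to quarantine those placeholders. A minimal sketch; the plausible window (and the frozen “today”) are assumptions to adjust for each dataset:

```python
# Sketch: set aside epoch-zero and far-future placeholder dates.
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "1970-01-01",          # "no date" shoved into a non-nullable column
    "2019-06-01",
    "2031-02-14",          # scheduled in the future, or corrupt
    "2019-06-02",
]))
company_founded = pd.Timestamp("2004-01-01")
today = pd.Timestamp("2021-01-01")   # frozen so the example stays stable

plausible = ts.between(company_founded, today)
clean, quarantined = ts[plausible], ts[~plausible]
print(len(clean), len(quarantined))   # 2 2
```

The quarantined rows are not garbage to drop silently: each sentinel value usually encodes a meaning ("never happened", "not yet") worth recovering.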
You might also encounter time travellers. Some users received a parcel days before they ordered it. Some installed your game after they had achieved Level 5. Some left your service before they created an account. Every company that I have worked for had at least one instance of those. There was a good reason every time (a de-activated, then reactivated account). But expect a lot of surprises when averaging absurdly large negative intervals.
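A sketch of that sanity check, with invented column names: compute the intervals, then set the negative ones aside before averaging:

```python
# Flag "time travellers" before averaging durations.
import pandas as pd

orders = pd.DataFrame({
    "ordered_at":   pd.to_datetime(["2021-03-01", "2021-03-05"]),
    "delivered_at": pd.to_datetime(["2021-03-03", "2021-02-20"]),  # delivered before ordered!
})
orders["delay"] = orders["delivered_at"] - orders["ordered_at"]

travellers = orders[orders["delay"] < pd.Timedelta(0)]
honest_mean = orders.loc[orders["delay"] >= pd.Timedelta(0), "delay"].mean()
print(len(travellers), honest_mean)   # 1 traveller; mean of 2 days without them
```

Listing the travellers, rather than just dropping them, is what surfaces the reactivated-account stories behind them.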
Time of day, no dates
The worst that I have seen: the database had the time of day of an event, without the date. It sounds simple but it is the wrong approach. If a time happened on a date, store it as a timestamp. Hours without dates are relevant to represent opening hours. If your service is open across midnight, even that can get hairy.
In my case, sessions started at a given time and day (recorded in two separate columns). The developer was frugal and wanted to save space. So he recorded only the time of the end of the event. He didn’t want the end of the session to happen after midnight UTC. To enforce that, the server sent a command at 23:59 UTC every night to log off remaining sessions. When said server got overwhelmed, those messages got queued for a couple of minutes. Bad things happened when we tried to measure the duration of those sessions. And when I say “every night”: did I mention that Australians hated that feature? New developers scrapped that process. The frugality of the original developer remains an inspiring company legend.
Numbers are the most expected type of data in a dataset. Yet, I haven’t flagged any problem with numbers so far. That’s because they are less often bad. Rest assured: they are far from perfect.
As a string
The most common problem is numbers interpreted as strings.
Datasets are often stored as a file, generally comma-separated values (.csv). When loading, numbers can be interpreted as lines of text rather than numeric values, aka ‘strings of characters’, or strings. That’s an easy issue to notice and fix, and generally not the fault of the source.
One common reason for numbers-as-strings is that they are not quantities but categories. One example is the number of doors in a car. The hatchback counts as an extra door. Other types of doors count as half. The result is best dealt with as categories. Another example is polls about your dependents. To cap the answers, one option is often “at least” some larger number of children. The information reads “5+”. If you look at credit ratings, your assets could be negative, written “< $0”. Those are separate groups; categories, not quantities.
They can be tricky because the non-numeric cases are often rare and you don’t see them in the first 10 lines. That’s one of many reasons why Python’s pandas.DataFrame.describe() is more helpful than opening the file.
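A related trick: coerce the column to numbers and look at what refuses to convert. The door-count values below are invented:

```python
# Surface the rare non-numeric values hiding past line 10.
import pandas as pd

doors = pd.Series(["4", "2", "4", "5", "4", "2", "4", "5", "4", "2", "3.5", "5+"])
as_numbers = pd.to_numeric(doors, errors="coerce")   # non-numbers become NaN

oddballs = doors[as_numbers.isna()]
print(oddballs.tolist())   # ['5+'] — a category pretending to be a quantity
```

`errors="coerce"` is the polite version of a crash: you get the full list of offenders in one pass instead of one exception at a time.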
Quantities, if they are monetary amounts, can include a currency code. Those also come with a lack of formatting standards. The currency can be before or after the amount. Some write large numbers with multiples (k for 1,000; M; G). Currencies can come before or after that. Currencies can even replace the decimal comma.
Currencies can be a single typographic symbol (£) or a three-letter code (GBP). $ can be ambiguous where there are different international dollars. What matters is that the currency is in a separate column, and the amount stored as a numeric value. Experienced developers prefer financial amounts stored as an integer number of cents. This enforces rounding rules. Once again, Excel is often the cause there.
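A minimal sketch of that preference, splitting a formatted amount into a currency code and integer cents. The symbol map and the regex are assumptions covering only the simplest cases:

```python
# Split "£12.50"-style strings into (currency code, integer cents).
import re

SYMBOLS = {"£": "GBP", "€": "EUR", "$": "USD"}   # "$" is genuinely ambiguous

def parse_amount(text):
    match = re.fullmatch(r"([£€$])\s*(\d+(?:\.\d{1,2})?)", text.strip())
    if not match:
        raise ValueError(f"unparseable amount: {text!r}")
    symbol, amount = match.groups()
    return SYMBOLS[symbol], round(float(amount) * 100)   # integer cents

print(parse_amount("£12.50"))   # ('GBP', 1250)
```

Raising on anything unexpected, rather than guessing, is the point: every new format should be a deliberate addition to the parser.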
I once had a dataset with the country of the operation but not the currency of the transaction. Indonesian Rupiah and Japanese Yen are the two lowest-denomination currencies. Amounts there should be very large to reflect that. This makes it easier to check that foreign-exchange rates are not flipped. In that set, some transactions in Tokyo had a low amount, low enough that it would make a lot more sense if it were dollars. As the client’s name was John, and he paid with a US-based AmEx, I suspected we had inconsistencies. I reached out to the developers; I used that line to ask them to include the currency in the data pipeline.
Commas and periods
Some countries use a comma to separate thousands, and a period to separate decimals. Others use a comma to separate decimals, and a period to separate thousands. Some prefer a non-breaking space for thousands. I don’t think it’s possible to parse a dataset using both conventions.
In the same vein, the CSV (comma-separated values) file format is incompatible with using commas as numeric separators (for decimals or thousands). At this point, it should feel obvious. It does happen, though. If you see it, ask to get the data tab-separated.
One last surprise happens with counts of things that are whole (customers, orders, sessions, etc.). Those quantities can have decimals. So you ask about the non-integer events. If users can cancel, one can estimate the likelihood of that happening. Analytics reports and capacity planning use that speculation as a weight.
Please note that, so far, nothing that I have mentioned is a real data problem. Data sources that don’t follow conventions, or don’t log enough, are inconvenient. Yet, if you pay attention, you won’t get to wrong results with them. Most “bad data” is rarely wrong in complicated ways. It’s more often that someone was careless, and you must check and edit.
The real problems and costly mistakes start with categorical information.
Unclear grouping, refactors
Those sound easy but they represent the biggest frustration when exploiting datasets.
Possible categories are often not listed independently from the fact tables. You can aggregate your dataset to find the categorical options mentioned. Three types of problems stem from there:
- The categorisation has changed while the data was being collected. You end up with events in Czechoslovakia, Slovakia, Czech Republic, and Czechia to sort.
- The categorisation is incomplete. Rare events might not have happened in your sample. You risk discovering new categories as you go. It’s not a problem for reporting. For modelling, you end up with a cold-start problem. To solve that, you end up with extensive rules about the Isle of Wight, Gibraltar and the Isle of Man.
- There are too many categories to be usable. You need to regroup them into larger, sensible groups. For business trend forecasts, sales to San Marino, Andorra and Monte-Carlo go with the Italian, Spanish and French totals. If you care about taxes, not so.
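In practice, this sorting often ends up as an explicit recoding table maintained by hand. A tiny, made-up sketch; the mapping and its choices (where do Czechoslovak events really belong?) are assumptions a real project would debate:

```python
# Fold historical and renamed labels into one current category.
RECODE = {
    "Czechoslovakia": "Czechia",    # debatable! some events belong to Slovakia
    "Czech Republic": "Czechia",
    "Czechia": "Czechia",
    "Slovakia": "Slovakia",
}

events = ["Czech Republic", "Czechoslovakia", "Slovakia", "Czechia"]
recoded = [RECODE.get(country, "UNKNOWN") for country in events]
print(recoded)   # ['Czechia', 'Czechia', 'Slovakia', 'Czechia']
```

The explicit "UNKNOWN" fallback is deliberate: it makes newly-appearing categories visible instead of silently swallowing them into a catch-all.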
So far, I’ve used geographic examples. Every business has its own entities, with types and exceptions. A car-related business has car types, small and big, but also lorries, heavy-capacity vehicles, probably motorbikes, and possibly quads, forklifts, boats. Which ones go into “Other”? Food (in restaurants, supermarkets) comes in many shapes. Is alcohol a food, or a drink? Is wine (that needs to be refrigerated) fresh produce? If you compare content types on social media: are GIFs an image, a video or an embedded link?
List of countries
Geographic entities are a particularly fun type of category: cities, neighbourhoods, regions. Do you know the difference between a country, a territory and a sovereign state? You will need to know enough to classify activity. That includes IPs from the Åland Islands, Antarctica, Null Island or the Space Station. For more business-relevant areas, check how the British Isles are different from the United Kingdom or England.
Every territory comes with oddities, edge cases like the Faroe Islands, Bavaria, Zanzibar, Taiwan, Tobago, the Cook Islands, Curaçao, Sardinia, New Caledonia, Hong Kong, Palestine, Guam, Mauritius, Palau, Christmas Island or Antarctica. Each of those has unique tax, political or geographic quirks.
Localised or user-typed information about cities or regions has a variety of spellings. Remember to merge Munich, Muenchen and München. Note that South Sudan is not the same as the south of Sudan (anymore). It gets trickier if you need to get the time zone in Сиби́рь (Siberia). (There are several.) Are 深圳 (Shenzhen) and 盐田 (Yantian) the same thing? (Essentially, yes.) What is the country of नई दिल्ली (New Delhi)? (India.) Even that can change. Did you know that Kazakhstan changed its official alphabet from Cyrillic to Latin?
Classifying traffic into types: “mobile” vs. browser, tablet (screen/window size, etc.)
The most difficult categories to handle are not placid geography, but device types. Technological advances have given browsers to cellphones and tablets. Smartphones now have apps that themselves have browsers. Phablets confuse all that. Each device and screen size can be emulated. Most services are available in an app, in a different app that is actually a webpage, and on an actual webpage whose properties change with its dimensions. How do you simplify all that into three types of interface? On a given dataset, is it consistent that the only client types are “mobile”, “browser” or “GoogleAgent”? Which one covers clients using the app on a large-screen tablet? Technology will change, and any typology will be proven inaccurate soon. Assigning a device to any web traffic is hard, and you are better off expecting surprises there. Another exercise: imagine that each session has a device category (mobile vs. desktop). You want to assign each user to a single category. What do you do with people who use both? I’d recommend having three groups: mobile, desktop and both, “both” being customers who have used neither type more than twice as much as the other over the last 28 days. Every detail of this is up for debate; it could change, and if it does, you could face inconsistencies.
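That three-group rule can be written down explicitly, which at least makes the debatable thresholds visible. A sketch where the session counts stand in for 28 days of activity; the factor of two is the assumption to argue over:

```python
# Assign a user to "mobile", "desktop" or "both" from their session counts.
def classify(mobile_sessions, desktop_sessions):
    if mobile_sessions > 2 * desktop_sessions:
        return "mobile"
    if desktop_sessions > 2 * mobile_sessions:
        return "desktop"
    return "both"

print(classify(10, 1))   # mobile
print(classify(3, 2))    # both
print(classify(0, 5))    # desktop
```

Keeping the rule in one small function means that when the definition inevitably changes, there is one place to change it, and one place to version.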
Tracking users is getting harder because users are unhappy with it. There is too much tracking and not enough visibility on how it benefits users. They delete cookies, spoof user agents, send deep links between their devices. None of that is likely to stop anytime soon. Expect more messiness and innovation in inferring who customers are. That can lead to changes halfway through a dataset, often silent.
Exceptions: -1, “Null”, Not-a-number
A very common way of expressing an impossible value is to use an absurd number. Imagine a dataset about tax reporting. One question asks for your “number of dependents (including yourself)”. How many children does a legal entity have? What is the date of your death? The mailing address of a Navy ship? Some people decide to assign absurd values to those nonsensical questions, like “-1”, “01–01–1970” or “00000”. Those absurd values can contaminate your work if you are not careful and don’t sanitise your data. You can average -1 with other integer numbers of dependents. You can calculate the (negative) age of someone. There are institutions with that postcode, and a naive join will find them.
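Sanitising means turning those sentinels into explicit missing values before any averaging or joining. A sketch in pandas; the column names and sentinel lists are assumptions to extend per dataset:

```python
# Turn sentinel values into real missing values before computing anything.
import pandas as pd

taxpayers = pd.DataFrame({
    "dependents": [2, 1, -1, 3],                 # -1 means "not applicable"
    "postcode":   ["SW1A 1AA", "00000", "M1 1AE", "00000"],
})
SENTINELS = {"dependents": [-1], "postcode": ["00000"]}

for column, bad_values in SENTINELS.items():
    # mask() replaces matching cells with NaN, which aggregations then skip
    taxpayers[column] = taxpayers[column].mask(taxpayers[column].isin(bad_values))

print(taxpayers["dependents"].mean())   # 2.0, not the -1-contaminated 1.25
```

The difference between 2.0 and 1.25 is exactly the kind of silent corruption the paragraph above warns about.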
Sometimes a price reads “13£ * nb of items”. This is not exactly false, but it is not a number that you can add up without processing it first.
Even sneakier: some software, given no value, will force one, be it an empty string “”, or “NaN”, meaning Not-a-Number. In context, those are meaningful missing values. Taken out of that context, you will misinterpret the same values. When you extract, convert or adapt a dataset carelessly, things can fail. I’ve worked on a pipeline that passed from PostgreSQL through Python to another dialect of SQL. The initial result included both “NULL” and “NaN” as strings.
Filtering those exceptions is important. Noticing them, i.e. knowing that they can come about, is essential.
Most of the data that you want to get is structured. It fits into columns; each column has a clear type and a legible name. Columns connect between tables with consistent IDs. That’s not always possible.
Event type | Event object | Event detail
Teams who work on different aspects might handle very different interaction types. They still need to write to the same event table. They need to agree on the columns on that table. A classic solution is a generic event structure:
- Each event has clear features: an author, a service, a timestamp. Those are well-defined, structured columns.
- They also have a more generic event type. Those need a generic typology with a couple more details, like a status. A common solution is a pseudo-hierarchy like ‘event type|event object|event detail’.
Because the latter hardly constrains the pre-existing lack of coordination, teams will have widely different interpretations of what should go in the last columns. You can end up with an inconsistent typology; new types might appear and be missed because the intermediary classification has a catch-all clause; etc.
For instance, the team handling financial transactions has clear providers, payment types, success or error types, and fits that constrained structure quite well. The team handling customer contacts or complaints wants to store the text of the message, maybe the type of call, or the notes by the agent, probably even the whole conversation, with when each message was sent and the number of interactions. That will push the limits of the “event detail” column. Drama ensues. The team can’t decide between shoving details into that last column, or using a separate table with a proper structure.
JSon blobs & free text
More often than not, long conversations should be stored not as strings with limited length, but in a more flexible format, often JSON. JSON is semi-structured, which means you can assign meaning, with quotes and brackets, and turn a text into arrays of values. This is perfect for situations where you can’t predict how long or how much detail will be attached to an object, like a conversation. If you want to implement in-game exchanges in a conversation, JSON lets you simply add that information to each message. The flexibility is very tempting, very convenient. However, processing that dataset after the fact, and after many changes in how things are implemented (because nothing constrains developers to respect a previous convention), can be a horrible experience.
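For what it’s worth, reading those blobs back is straightforward as long as the convention held; the pain starts when it didn’t. A minimal sketch with invented field names:

```python
# Flatten a conversation stored as a JSON blob into countable facts.
import json

blob = ('{"conversation": ['
        '{"at": "2021-01-05T10:00:00", "from": "customer", "text": "Where is my parcel?"},'
        '{"at": "2021-01-05T10:02:11", "from": "agent", "text": "Let me check."}]}')

messages = json.loads(blob)["conversation"]
n_interactions = len(messages)          # the metric the event table couldn't hold
first_sender = messages[0]["from"]
print(n_interactions, first_sender)     # 2 customer
```

In real pipelines, every key access here is a bet that the convention never changed; defensive `.get()` calls and schema validation are where the actual work goes.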
This is not exactly bad data, as in “wrong”, but it can be tedious.
Free text, images and video files
Even more tedious is data stored in an unusable format. The classic example is information stored as an image of a text. You would need to do a lot of optical character recognition (OCR) and parsing (like removing the column names at the top of every page, matching pages, etc.). You will probably need entity recognition if the information is in full sentences. The “information” is there, but it is very hard to explore and impossible to search without spectacular effort. A lot of non-profit and charity data-related work actually does that: transform documents into databases. Once information is structured, the rest of the work is far easier.
I can give two examples of similar situations from my personal life.
Some years ago, a black cab was very offended that I was riding my bike in London, near St Pancras; the driver told me so, loudly, demanding that I “go **** [myself]”, something for which I don’t have the flexibility. Exasperated, he decided to take the matter into his own hands and ran me over. My bike didn’t like it, but it was one of those you can rent by the minute, so I only had to push it to the nearest holding gate and apologise to their help line for the damage. The person on the phone was not very surprised, but empathic, and suggested that I call the Metropolitan Police. It was clear from the lack of reaction that they didn’t take note of the damage to track it further. They didn’t need proof I was innocent, but what about the Police: surely, they can’t send someone to jail for ten years on my word alone?
London is known for its extensive security camera network. There were half a dozen pointed at the intersection where I almost lost my limbs. I had the exact minute, the exact location, several options: I called the Police, confident that I would take a dangerous murdering fanatic off the streets.
If you ever have the displeasure of having someone try to kill you, or anything of the sort, I can save you 10 minutes of your life: the information is probably there, but “not really”. At least, that is what I was told on the phone, and later in an email. Whatever extensive surveillance system they have, the Met was unable to find the feed from a given camera at a given time. I can’t say that I encourage you to go on a rampage, but I have no “you’ll be caught” argument to convince you not to: that murderer is still roaming the streets, with which recent statistics on road fatalities seem to concur.
I would say that, even if privacy advocates are very concerned over the surveillance state, they shouldn’t be: video surveillance over London is, or rather was at the time, bad data.
What would it take to make it good? It really depends on your use: indexing it by time and date is an obvious one. Mapping where the cameras are, knowing which camera sees to the left or right of each other camera, would be an interesting geometric problem to solve. Interpreting motions and situations, detecting problems, might help index it further; recognising faces would get us closer to an Orwellian situation, but would be good, in the sense of usable, data.
Another, more recent personal example: I recently started to rent a flat. The realtor demanded that I provide three months of bank statements: not just pay slips, but the entirety of my spending, itemised, over 90 days. That sounded shocking, but I don’t like sleeping rough, so I obliged.
My bank has a very convenient interface: I can classify all my spending by merchant, type or currency, and make detailed statistics; they even provide a forecast! When I saw my “statement”, a document supposedly listing the same information, I doubted that my realtor would enjoy all those features. For every transaction, the statement had a date, an amount and a description, often truncated to 16 characters. Most of those didn’t make sense to me: the first letters of the official name of the franchisee operating my supermarket; the reference of my letting contract for the realtor fee; the name of the company operating the train ticket app that I use; etc. A lot of it was confidential in a way, but hardly any of it seemed usable.
My bank makes a lot of effort to associate every new transaction description with what it means for us, customers: supermarkets are given their brand name; bar tabs are tagged as ‘going-out’; electricity bills as ‘utilities’. The data on my raw statement isn’t usable. Like surveillance camera footage, it isn’t bad. The data that my bank collected in order to convert my statement into a usable, legible dataset is distinct, external, third-party information, and valuable.
We have two lessons here:
- data isn’t good in itself: what you consider better depends on your intention for it; verifying claims requires little extra information; detecting anti-social behaviour without citizen prompts, more so;
- good data is often also based on external, relevant sources: weather, holidays, competitive information, economic maps and categorisations all help make sense of raw data.
To summarise so far: none of the above is exactly wrong data; it could be useful as it is. However, all those formats and partial information can be painful to work with, and might hide bigger problems.
All those errors, missing data or bad formats, are rather obvious on screen once you look at the right line. Sometimes, the issue is more discreet and you need more context to catch it. There is bad data that you won’t catch on your first day of looking at it.
One possibility is that someone is misrepresenting who they are on the internet. It’s often what people expect when they imagine bad data, but it is not as overwhelming as formatting issues. Fraud has direct economic consequences; service operators therefore actively detect and enforce against it, while proper formatting is less beneficial in itself. Fraud doesn’t block analytical exploration either, because you can analyse the fraud itself.
Robot traffic and spam
The largest source of data for any service is traffic data. The largest part of that traffic is not commercially relevant: spam traffic, crawlers (often from competitors), DDoS attacks, bots trying to guess passwords repeatedly, etc.
Filtering those out is a whole data science project in itself, and you should see spam detection as part of the data processing rather than a failure from your provider; still, good data often is spam-filtered. Working on non-filtered data without expecting noise leads to terrible misinterpretations.
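A first pass at that filtering can be surprisingly crude. Here is a minimal sketch over web-log rows, assuming invented field names and thresholds; real spam detection is far more involved than this:

```python
# Toy heuristic bot filter over web-log rows: a sketch, not a production
# spam detector. Field names and thresholds are invented for illustration.
BOT_MARKERS = ("bot", "crawler", "spider", "curl")

def looks_like_bot(row):
    """Flag a log row as non-human on crude user-agent and rate heuristics."""
    ua = row.get("user_agent", "").lower()
    if any(marker in ua for marker in BOT_MARKERS):
        return True
    # Hundreds of requests per minute from one client is rarely a human.
    return row.get("requests_per_minute", 0) > 300

traffic = [
    {"user_agent": "Mozilla/5.0", "requests_per_minute": 4},
    {"user_agent": "AcmeCrawler/2.1 (bot)", "requests_per_minute": 900},
]
humans = [row for row in traffic if not looks_like_bot(row)]
print(len(humans))  # 1
```

Even a filter this naive changes every ratio you compute downstream, which is exactly why analysing unfiltered traffic without expecting noise is dangerous.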
Fraud, false transactions, fake accounts
Obviously non-human traffic is problematic, but the more expensive issues come from humans trying to cheat: they can use someone else’s credit card or identity; get deliveries at an address that isn’t theirs to avoid detection; spoof their location; create fake accounts to evade bans or inflate their number of friends; lie about their name, age or gender; upload profile photos that are not of them; tell a sad story that isn’t true; etc.
Those efforts can be more or less transparent, visible, industrialised — and detecting them is important, unless you want to build models that conclude that the most reliable sign that a customer is very valuable is that they use several dozen credit cards each with a different name, or that the safest way to get someone to re-order is to not send them their parcel.
You typically find out about those in two ways:
- you understand the service and how its incentives encourage misbehaviour; how to get around expected use should become obvious to people with the right mindset, and you can check whether anyone does that;
- you build a report, or a model, and notice absurd results: the best customers cancelling most of their orders, or ordering delivery to consistently different addresses; partners who disappear after a few minutes, processing revenue larger than the GDP of their home country; etc.
Because there is no limit to human creativity, there is no limit to how nuanced those can be, but most are rather obvious to notice and to detect once you know what they are.
The most time you spend dealing with bad data isn’t when data is missing, or hard to parse, or even when someone intentionally misled you, but when it looks mostly consistent, yet something subtle and undocumented quietly went wrong.
Wrong ID, wrong match, different namespace
Typically, that happens when there was a mismatch between what people thought were equivalent concepts: say, a customer and an account; an order and a delivery in e-commerce; or a session and a search result. Both concepts have ids, and those might be similar but not equal if the service lets users order without creating an account, have a failed order re-delivered, or search several times in the same session.
If you assume that the two are the same, you end up with very confusing results:
- either locally absurd results: e.g. an Argentinian customer ordering from South Africa and paying in roubles (but many times over); deliveries without a matching customer anywhere in the table; agents with negative work hours; or
- overall totals that do not add up: cancelled orders that represent 20% more than what the reimbursements in the books say; more successful orders delivered than customers who received one; VAT higher than your revenue; etc.
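A quick way to surface that first kind of mismatch is an anti-join: rows on one side whose id has no match on the other. A minimal sketch, with invented table contents and column names:

```python
# Namespace check: find deliveries whose customer_id has no match in the
# customers table. Table contents and column names are invented.
customers = [{"customer_id": 1}, {"customer_id": 2}]
deliveries = [
    {"delivery_id": "a", "customer_id": 1},
    {"delivery_id": "b", "customer_id": 7},  # no such customer: guest checkout?
]

known_ids = {c["customer_id"] for c in customers}
orphans = [d for d in deliveries if d["customer_id"] not in known_ids]
print([d["delivery_id"] for d in orphans])  # ['b']
```

Every orphan row is a question for the engineering team: a guest checkout, a deleted account, or two id namespaces that were never the same to begin with.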
That’s where you want to ask product engineering for a detailed schema, and grab a large pot of coffee. You might earn senior engineers’ temporary hatred, and possibly their admiration, for finding some very tough bugs.
No one knows
The real problem with a lot of those issues is when you don’t have someone able to walk you through how the information is matched in the first place. Not having extensive documentation on a dataset is a problem too.
This is why I typically consider that a dataset is bad unless it comes with a Story. A story is a two-to-ten-page document (depending on the complexity of your service, roughly the age of the company in years) explaining what people do. An essential part of that document is the keywords: the entities that are structural parts of your service schema, the words that you use daily, team names. Also extremely instructive are distinctions: what is a basket, an order, a delivery; what does inactive mean; how do you classify partners; and the key ratios between each of those: average orders per month per customer, 80/20 distributions of revenue, etc.
I would strongly recommend, when given a dataset, to start writing that story and to expect to flag data errors whenever you can’t write it consistently. As a data scientist, you should easily be able to write that story from the dataset and explanations; your clients might not have the analytical skillset, but they can give feedback on the story that you are drafting and its holes.
Those can actually pass a lot of common sniff tests: tables with those problems can load without missing values, they don’t have text where numbers should be, and the first lines on your screen make sense — but further down, you might see odd things. How do you make sure not to miss those?
Auditing and monitoring
The best way to detect those is to have extensive detection systems. There are two types of controls that you want to set up, as soon as you have the data:
- auditing makes sure that you understand the data structure and its constraints; these are typically obvious checks, a bit like unit tests in software development, or a financial audit in accounting: the sum of revenue over users is the same as the sum per country and the overall total; active users equal users who joined minus users who turned inactive; all that sounds simplistic and ridiculous, but you will learn a lot about the actual schema of your service and other business rules: when to apply foreign exchange rates, how to separate by dates, whether a user belongs to a country exclusively, etc.
- monitoring: another very helpful way to understand a service is to forecast activity levels on every relevant metric and detect outlying events; first in the past, which should help you understand what happened and ask questions about outages; then, once monitoring is automated, you can learn how things interact: if a major provider in a country fails, that country’s activity will drop too.
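The auditing side lends itself to unit-test-style assertions: compute the same total along two different axes and demand that they agree. A sketch with invented numbers and names:

```python
# Audit-style checks: totals computed along two different axes of the
# same data must agree. All figures and names are invented.
revenue_by_user = {"alice": 120.0, "bob": 80.0, "carol": 50.0}
user_country = {"alice": "UK", "bob": "UK", "carol": "FR"}
revenue_by_country = {"UK": 200.0, "FR": 50.0}

total_from_users = sum(revenue_by_user.values())
total_from_countries = sum(revenue_by_country.values())
assert abs(total_from_users - total_from_countries) < 1e-9, "totals disagree"

# Reconcile each country against the per-user breakdown.
for country, expected in revenue_by_country.items():
    computed = sum(v for u, v in revenue_by_user.items() if user_country[u] == country)
    assert abs(computed - expected) < 1e-9, f"{country} does not reconcile"
print("audits passed")
```

When one of these assertions fails, the failure itself is the lesson: it usually points at an undocumented business rule, such as a user counted in two countries or revenue booked at a different exchange-rate date.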
Both are continuous processes: you can’t really evaluate the quality of a dataset fully on the first day. A lot of the more obvious issues (those higher up in this long list) are easier to notice, but you won’t find that the conversion rate between the Danish krone and the euro (two notoriously close currencies) is out of whack until it meaningfully changes, and it rarely does.
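The monitoring side can start equally small: compare each day’s activity to a trailing window and flag large deviations. A naive sketch, assuming invented data and an arbitrary three-sigma threshold:

```python
import statistics

# Naive monitoring sketch: flag points that deviate from the trailing
# window by more than `threshold` standard deviations. Data is invented.
def outliers(series, window=7, threshold=3.0):
    flagged = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mean, stdev = statistics.mean(past), statistics.pstdev(past)
        if stdev and abs(series[i] - mean) > threshold * stdev:
            flagged.append(i)
    return flagged

daily_orders = [100, 102, 98, 101, 99, 103, 100, 12, 101]  # day 7: outage?
print(outliers(daily_orders))  # [7]
```

Run on historical data first, every flagged index is a question to ask: an outage, a holiday, or a silent change in how the metric is computed.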
There is a joke that I like to tell about data quality. A friend of mine was once asked if he spoke Arabic, and he responded: “I don’t know: I never tried.” Truth is: he was half-serious.
Data quality is the same: if you are not sure your data is great, I can confidently tell you that it’s not. All data is bad, and that’s ok, at least until partially proven otherwise. Proving otherwise, understanding the data, is not the most glamorous or rewarding part of the job, but it is the most instructive, so be happy that you get to do it, every time you get the chance.