Stories by Noah Burbank on Medium

A Deep Dive on Text Classification at Salesforce

Noah Burbank — Wed, 09 Jun 2021 19:39:16 GMT

published on Towards Data Science

Putting from a Sand Trap (Image by Author)

We’re excited to announce that Noah Burbank, a Principal Data Scientist in Sales Cloud, has recently published a deep dive into text classification at Salesforce on Towards Data Science. The article, How to choose the right model for text classification in an organizational setting, applies counterintuitive advice from golf to Data Science: you might be better off skipping the sand wedge and using a putter to get out of a sand trap.

In the article Noah goes through the technical trade-offs of three different text classification approaches: Regular Expressions, Machine Learning, and Deep Learning. But after the technical deep dive he reveals his key takeaway: the best model isn’t the one with the highest F1-score, the best model is

one that you can deploy with the resources you have, trained with the data you can get, in the amount of time that your boss has patience for

Read the full post on Towards Data Science

If you enjoy the post, you might want to read some of Noah’s other blog posts on the Salesforce Engineering Blog about designing NLP systems or about using open source data to preserve customer privacy.

A Deep Dive on Text Classification at Salesforce was originally published in Salesforce Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to choose the right model for text classification in an organizational setting

Noah Burbank — Wed, 09 Jun 2021 13:55:45 GMT

Notes from Industry

On the importance of understanding and disregarding technical considerations in applied machine learning, or, When to use your putter to get out of a sand trap.

Putting from a Sand Trap (Image by Author)

The typical response to an errant golf ball landing in a sand trap is to swear, sigh, and then trudge over to the sand trap with a sand wedge. It’s not a tough decision. Every golf bag has a sand wedge, and unlike all of the other decisions about which club to use (should I use a 5 iron? a hybrid? ironwood? a 6 iron?), the sand trap has a club designed, and named, specifically for it. So you might be surprised to learn that there are serious golf professionals who advise that, whenever possible, you should use a putter in a sand trap. Why? If there is a better tool, why recommend a putter? Because the best tool is not the one that most perfectly fits the job, it’s the tool that you’re best at using for the job.

If you’ve worked on the technical side of software, you’ve probably experienced an older engineer giddily telling you about some tool they wrote in awk 15 years ago that perfectly solves some problem you have. Knowing the right tool for the right job is why management consultants get paid so much to come up with plans. Management consultants aren’t valuable because they’re the only people smart enough to solve hard problems; they’re valuable because they’ve seen a lot of similar problems, and they are smart enough to have learned from them. Your company might get acquired once, but the Big Three have literally centuries of combined M&A experience. The old engineer’s (the awktegenarian’s) advantage is not that he’s fluent in command line tools, or that he’s smarter than a junior engineer; it is that he has seen this problem before.

But is using awk a good solution? The answer depends little on the performance of the awktegenarian’s program, and more on current organizational needs and capacities. If he leaves, or somebody else needs to maintain or modify the program, or the program needs to be explained to an executive, using awk (an ancient command line language) is worse than a spreadsheet, even if the spreadsheet is slower, cannot handle as much data, and requires that you manually edit the data for twenty minutes each time before you can use it.

At Salesforce my team develops and deploys models for a variety of NLP tasks, but our bread-and-butter work is text classification models for emails. Fitting a model is twofold: you fit your model to your data and you fit your model to your organization. In this post I’m going to walk through an example text classification problem to illustrate some of the insights we’ve learned in choosing the right model for the problem, from a technical data science perspective. And then I’m going to show how you throw that all out the window to fit your organizational constraints.

From a technical perspective, what model we choose ultimately depends on how well our training data captures the diversity of the data it’ll score in production. Salesforce has a massive diversity of customer organizations. There are organizations with two people and organizations with tens of thousands. They sell units, packages, licenses, and engagements. Some use a single, simple sales process, others have dozens of different pipelines for a diversity of products. Some companies are more likely than other companies to allow us to train models on their data. So when we create datasets we always know a priori that our training data is both incomplete and biased, which is the primary thing we think about in choosing a model. Regardless of the model we choose for the task at hand, it needs to be deployed, scaled up, and maintained for the lifetime of the product at a reasonable cost. Salespeople need to be able to demo it to customers, and product managers need to be able to demo it to executives. These organizational needs typically outweigh technical ones.

Task and Corpora:

The example task for this blog post is to classify a sentence as a question/not a question. I chose this because it is a simple, well-defined task and we mostly agree about what a question is. Most interesting text classification tasks will involve a great deal of effort haggling about the definition, such as whether “Let’s get lunch sometime” is or is not a scheduling request.

For a corpus, I’m using a dataset prepared by Jack Kingsman. This dataset is particularly useful for our purposes because it contains a lot of questions (33.81% of sentences), the questions are easy to label (none are missing the question mark), and many of the questions (71.94% of questions) are ungrammatical, rhetorical questions, which gives this problem its substance. (Since I’m using this corpus for research purposes, the specific contents of the dataset do not matter beyond there being numerous questions with peculiar grammar.) There are three basic types of questions. First are wh-questions that use the Five Ws (19.55%), for example “What really happened when the wizards and warlocks revealed what they had?” Next are yes-no questions that yield a yes or no response (8.51%), like “Was it really from Yemen?” or “WILL YOU ANSWER?” Finally we have the ungrammatical-rhetorical questions (71.94%), like “Date posted (early)?” or “Coord w/ foreign actors for payment/money disperse?”

Since all questions in this dataset contain a question mark, I label each sentence as a question/not a question by simple presence of a terminal question mark. I then split the dataset 80/20 for training and testing. Since the primary distinction for determining what modeling approach best fits the data boils down to “is your problem complicated and how complete is your training data?” I manipulate the dataset to create different but analogous contexts in which to see how well the training data captures the diversity of the test data. In the first dataset, the test set is untouched. In the second dataset, I remove all of the question marks from the test data. In the third dataset, I remove the question marks and mask the Five W’s with an unknown token (“XXXXX”) in the test data. In the fourth dataset, I test only on yes-no questions (and non-questions), and train without any yes-no questions. I also remove the question marks from the test data. In the fifth dataset, I take the fourth dataset, but remove the question marks from the training set as well. The table below describes the datasets and their differences.
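The labeling rule and the first three test-set manipulations described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the post: the function names are hypothetical, and the exact word list behind the Five W’s (here who, what, when, where, why) is an assumption.

```python
import re

# Assumption: the "Five Ws" are who/what/when/where/why; the post does not list them.
FIVE_WS = re.compile(r"\b(who|what|when|where|why)\b", re.IGNORECASE)


def label(sentence: str) -> int:
    """Return 1 if the sentence ends in a question mark, else 0."""
    return int(sentence.rstrip().endswith("?"))


def strip_question_marks(sentence: str) -> str:
    """Remove every question mark from the sentence."""
    return sentence.replace("?", "")


def mask_five_ws(sentence: str) -> str:
    """Replace each Five-W word with the unknown token used in the post."""
    return FIVE_WS.sub("XXXXX", sentence)


def make_test_variants(test_sentences):
    """Label the sentences first, then derive the test sets for Datasets 1-3."""
    labeled = [(s, label(s)) for s in test_sentences]
    return {
        "dataset_1": labeled,
        "dataset_2": [(strip_question_marks(s), y) for s, y in labeled],
        "dataset_3": [(mask_five_ws(strip_question_marks(s)), y) for s, y in labeled],
    }
```

Labels are assigned before any question marks are removed, so the manipulated test sets keep the original ground truth while hiding the feature that produced it.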

Bottom line up front:

My final recommendation for fitting a model to your data is pretty simple. Use Regular Expressions if your classes are tightly coupled to a small set of features. If separating your classes requires more complicated rules and your training data captures the diversity of your test data, then use traditional machine learning, and as a last resort, use deep learning. Of course, “capturing the diversity” of the data isn’t a precise or quantifiable metric, so I’ll walk through the different datasets and show where a Random Forest is close enough (Datasets 1, 2, 3, and 5), and one where it isn’t (Dataset 4). This is summed up in the following flowchart:
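In code form, the decision order described in this section might look like the sketch below. The function and argument names are hypothetical, and the two yes/no inputs stand in for judgments that, in practice, you can only make after exploring the data.

```python
def choose_model(few_features_separate_classes: bool,
                 training_covers_production: bool) -> str:
    """Encode the recommendation order: RegEx, then ML, then deep learning."""
    if few_features_separate_classes:
        return "regular expressions"
    if training_covers_production:
        return "traditional machine learning"
    # Last resort: neither simple features nor representative training data.
    return "deep learning"
```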

It’s hard (read: impossible) to know, a priori, the answers to these questions without exploring the data. In a sense you use Regular Expressions to answer if your classes can be divided with a small set of features and so should use Regular Expressions, and you use machine learning to figure out if your training data captures the diversity of the data in the wild and so should use machine learning. The challenge here is not technical, but emotional. Once you create a model, especially one you worked hard on, it is difficult to delete it. But getting precious about your models prevents you from doing the right thing. Be willing to create models, and to do so carefully and thoughtfully, and then to throw them out. You must be willing to murder your darlings.

I’m in a way short-circuiting the need to explore the data, because we already know a lot about questions without exploring this data: wh-question words form a closed class, and the syntax of questions rarely appears outside of questions. This is a bit like starting a project by finding out that it is surprisingly well behaved after exploring it for a month, but it makes for a better blog post because we can go straight to the meat of the problem.

Start with Regular Expressions

If you can divide the classes with a small set of features, RegEx is the best way forward. If every question contains a terminal question mark, and terminal question marks appear nowhere else in the text, then using RegEx will be perfect. I could have just looked for the presence of a terminal question mark, but that’s an unrealistically bad strawman. Remember, we already know a lot about questions, and if we didn’t we would have explored the data until we did, and we know that people don’t always use question marks (e.g., in casual conversation, chat, grammatical errors, etc.) and that questions tend to use words from the Five W’s. So I made two RegEx classifiers: one that just looks for terminal question marks and the Five W’s, which I call the Brittle RegEx Classifier, and one that does the same but additionally looks for the sentence-starting patterns of yes-no questions, which I call the Robust RegEx Classifier. The additional patterns beyond the presence of a terminal question mark improve recall at the expense of precision. For example, the sentence “We know what this means.” is a positive match for the Five W’s pattern, but is not a question, and “Is this about the virus or something else” will produce a positive match for the Robust RegEx Classifier but not the Brittle one. The two RegEx classifiers illustrate the value of understanding your data.
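To make the two classifiers concrete, here is a minimal sketch of what they might look like. The patterns are assumptions for illustration (the post does not publish its exact expressions), including the Five W’s word list and the list of auxiliary verbs that start a yes-no question.

```python
import re

# Assumed patterns; the original expressions are not given in the post.
TERMINAL_QMARK = re.compile(r"\?\s*$")
FIVE_WS = re.compile(r"\b(who|what|when|where|why)\b", re.IGNORECASE)
YES_NO_START = re.compile(
    r"^\s*(is|are|was|were|do|does|did|can|could|will|would|should|has|have|had)\b",
    re.IGNORECASE,
)


def brittle_is_question(sentence: str) -> bool:
    """Terminal question mark or any Five-W word."""
    return bool(TERMINAL_QMARK.search(sentence) or FIVE_WS.search(sentence))


def robust_is_question(sentence: str) -> bool:
    """Everything the brittle classifier matches, plus yes-no sentence openers."""
    return brittle_is_question(sentence) or bool(YES_NO_START.search(sentence))
```

With these patterns, “We know what this means.” is flagged by both classifiers (a false positive from the Five W’s pattern), while “Is this about the virus or something else” is caught only by the robust one.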

On Dataset 1, where the test data still has question marks, both RegEx classifiers perform extremely well. On Dataset 2, where test cases are missing question marks, the recall and precision for both classifiers drop, but not to zero, thanks to the additional features beyond a terminal question mark. On Dataset 3, where test cases are missing question marks and question words are masked, the two models diverge in performance. The Robust RegEx Classifier’s recall takes a further drop, since only the yes-no question pattern will find questions now, and yes-no questions are a minority of questions, but the Brittle RegEx Classifier’s performance drops all the way to zero. On Datasets 4 and 5 the Robust RegEx Classifier has excellent performance again.
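For readers who want to reproduce comparisons like these, precision, recall, and F1 can be computed from labels and predictions with a few lines of plain Python. This is a generic sketch using the standard definitions, with a hypothetical function name; it is not the evaluation code behind the numbers in this post.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = question)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators, e.g. a classifier that predicts no positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

A classifier whose patterns never fire on a dataset, like the Brittle RegEx Classifier on Dataset 3, returns zero for all three metrics under these guards.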

How could we improve the Robust RegEx Classifier’s performance on Datasets 1, 4, and 5, where it did well? Since the recall is nearly perfect, we want to improve precision, which is to say, find a way to reduce false positives. The first approach to doing so is to add exceptions to our pattern. I advise against this.

As you add exceptions, you’ll start to notice that some exceptions require further exceptions, and so forth. If you keep going down this road, you’re going to accumulate crippling technical debt. One of my favorite papers is a dissertation primarily about memory allocation for a program that does automatic hyphenation. The algorithm hyphenates via rules, exceptions, and exceptions-to-exceptions, about five layers deep. This kind of nested if-statement algorithm can be very powerful, and very fast, but it is notoriously difficult to debug or understand. Moreover, it’s easy to add rules for new cases and inadvertently make your performance worse without understanding why. RegExes have the added bonus of being significantly easier to write than they are to read, so as you create more exceptions, and more nuanced and detailed RegExes, your system becomes harder to understand, harder to debug, and harder to adapt to new cases.

The RegEx classifier’s performance plummeted on Datasets 2 and 3, and for good reason: the features picked out by our RegEx no longer correlated with the classes we wish to separate. So what is there to do about this? We would need to look at the data, and create more patterns that we think capture the difference, but once we start going too far and the patterns become unmanageable (a subjective assessment), it’s time to label data and let the data determine how to map features to classes. Even the best Regular Expressions sometimes need to be deleted. This brings us to the next approach: machine learning.

Develop an ML Model

Once you start down either the rules and exceptions and exceptions-to-your-exceptions path or the hand-tuned-features-and-class-separation path, it is time for a machine learned approach. If you find yourself trying to implement three-level lexicographical ordering rules, you have made a mistake much earlier that led you to this point in your life. You need to have labeled data for a machine learned approach, and if you started making sufficiently complicated RegEx rules, you probably had a decent set of sample data anyway. Let’s see how a Random Forest model performed on our five datasets.

The intuition for the ML model’s performance is much simpler than that for the RegEx classifiers: the model performs better when the training and test data are more similar. Datasets 1, 2, and 3 all actually have the same training data, and deviate only on their test data, which sequentially moves further and further from the training set.

Removing question marks from the test set (Dataset 2) causes an appreciable dip in performance because the primary feature the model focuses on is the question mark. In fact, the question mark and the period are the two most important features, by an order of magnitude. I chose a Random Forest model for two reasons: it performs well in a variety of contexts, and it provides feature importances by measuring the out-of-bag error for each feature. Feature importances are, like many things, “kind of like a probability” but not quite: for each feature the importance is a number between zero and one, and they all add up to one. For a feature to be important, it must be commonly chosen to branch the little decision trees in our random forest, which means that it both powerfully separates the classes, and occurs relatively frequently. If your most powerful features don’t occur frequently (or at all) in your test set, performance drops. Below is a table of feature importances for a Random Forest model trained on each of the five datasets.

All models need the training data to resemble the data in the wild in some important way, and in this case important means in the way in which they are featurized. There are two ways to visualize this relationship. The first is to look at how deep down the list of less important features you have to go until you have features that appear in the majority of your test cases.

This graph is easier to read, but it disguises that as you dig deeper into the features, they’re each less important; that is, they appear in more of your test cases, but they’re not very helpful features for separating your classes. Our features are subject to Zipf’s Law, which means that a term’s frequency is roughly inversely proportional to its rank in a frequency table. Discriminative features are very likely to be uncommon. The question identification task sidesteps some of this, since the Five W’s and question marks are a closed set and appear relatively frequently, but nonetheless the lower ranked features are significantly less frequent. To try to capture this phenomenon, I’ve made another, harder to read plot, below:

The x-axis is the cumulative feature importance, and the y-axis is the percentage of test cases that contain any of the n-grams. For example, in Dataset 1 the first two features (‘?’ and ‘.’) account for a little over 40% of the importance, and appear in almost every single test case. The ML model does well on Dataset 1 because the most important features frequently appear in test cases. Dataset 4, on the other hand, illustrates nearly the opposite: the most important features do not show up in nearly as many of the test cases, and that percentage starts to climb only as we get into increasingly less useful features.
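The points on a plot like this can be computed directly from a list of feature importances and the test sentences. The sketch below is an illustration under stated assumptions: the function name is hypothetical, features are treated as lowercase n-gram strings, and simple substring matching stands in for whatever tokenization the real featurizer uses.

```python
def coverage_curve(importances, test_sentences):
    """Return (cumulative importance, fraction of test cases covered) per feature.

    importances: list of (ngram, importance) pairs, e.g. from a trained model.
    A test case counts as covered once it contains any n-gram seen so far.
    """
    if not test_sentences:
        return []
    ranked = sorted(importances, key=lambda pair: pair[1], reverse=True)
    covered = [False] * len(test_sentences)
    cumulative = 0.0
    curve = []
    for ngram, importance in ranked:
        cumulative += importance
        for i, sentence in enumerate(test_sentences):
            # Assumption: substring match approximates n-gram presence.
            if ngram.lower() in sentence.lower():
                covered[i] = True
        curve.append((cumulative, sum(covered) / len(test_sentences)))
    return curve
```

A curve that climbs quickly on the y-axis while the x-axis is still small corresponds to the Dataset 1 situation; a curve that stays flat until the low-importance features arrive corresponds to Dataset 4.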

Is this anything you could use, a priori, to determine that your training and test data are not similar enough? Probably not. You had to develop the model carefully and earnestly to determine that the model won’t do, and then you have to murder your darling model. Every time you have to get into the data. Hopefully this section helps build your intuition for understanding the interaction between what the model is learning from the featurization of the training data, and how readily it can apply that understanding to the test data.

One way to close the gap is to remove information from your training data to prevent overfitting on certain features that appear infrequently (or never) in your scoring data. Dataset 5 is identical to Dataset 4, except without any question marks in the training data. Performance improves dramatically and question marks disappear from the important features list. If you cannot modify your data, try to find more data sources, ideally ones that are directly from, or at least represent different dimensions of, the data you’ll actually be scoring. Reach out to other people on your team or nearby teams and see if they can think up any examples that might be tricky for your model to handle. People on other teams love coming up with examples that show that they are smarter than your models. I found a surprising depth of counterexamples that my coworkers could think up when I made our models available in a SlackBot (more about that later). Before you dust off TensorFlow, try to find more training data. If this is all you can do, then it’s time for deep learning.

Deep Learning as Last Resort

You’ve concluded that your training data doesn’t capture the data that you’ll see in the wild. There are a variety of real, excusable reasons as to why you’ve ended up in this position. Perhaps mistakes were made gathering the data. Perhaps your MSA only covers some of your users, so you get a biased sample from opt-in data. Perhaps you purchased an analogous dataset, but one that is outdated or unworkably idiosyncratic. Whatever the reason for the belief that your training data do not cover the data in the wild, and that you cannot augment or fix your datasets, it’s time for deep learning.

Deep learning is very much the rage, and I am far from an expert on it. Deep learning seems to handle a variety of tasks robustly, granted that you have enough data to allow it to optimize a bazillion parameters. I tried a simple Bidirectional LSTM model, and it does a pretty good job.

More importantly, it performs robustly across all five datasets. So why don’t I recommend that you just use deep learning every time and call it a day?

If you’ll recall, this blog post is about how to choose the right model. Part of that means choosing the model that fits your data. The other part of that is choosing a model that fits your organization. Regular Expressions have a lot of advantages in an organizational setting. They don’t require GPUs to train and they can be deployed as part of whatever code you’re already deploying. If your executives or managers insist that a specific example works, you can make sure it works. And if we’re being brutally honest, Regular Expressions need only a handful of positive examples for labeled data. Deep learning models need GPUs, tons of labeled data, and ample time to explore an enormous hyperparameter space. If you train a deep learning model, but your company has never deployed one before, or you yourself are going to be the one to deploy it, this might cost your project months or years of time. You might be better off with a garden variety ML model with a worse F1 score, but which can be deployed and maintained by other people in a timely manner.

If I’m going to tell you that the right model is one that you can deploy with the resources you have, trained with the data you can get, in the amount of time that your boss has patience for, why did I go through all of this detailed technical analysis? If you try to fit a model to the organization without understanding how to first fit it to the data, you will not succeed. I’ll share some examples from when I designed the signature parser model for Salesforce.

I had three organizational constraints that I needed my model to fit. The first is that the team that deployed the model wrote jobs for Kafka Streams in Kotlin. So I needed to package my model as a library deployable as a JAR. This meant that the two machine learning frameworks my team had previously used (MLlib and TensorFlow) were out. I ended up using SMILE (written by Haifeng Li) because it provided the minimum machine learning functionality in Scala, and implemented the missing tools for my data science workflow myself. The second constraint was that a signature parser needs to be trained on Personally Identifiable Information (PII), but we do not use PII in customer data at Salesforce. Since a signature is PII, I ended up fabricating training and test data to have statistical resemblance to our customers’ data, but without actually using their data. I wrote about data fabrication here. Finally, I made the title scoring model internally available as a Slack Bot, and discovered that the product managers had a couple of titles that they just needed to see it score correctly. Theoretically, the correct way to solve this is to go through the laborious process of labeling more data and retraining the model, but that’s time consuming and there were a dozen other projects that more badly needed my attention, so I created the Otherwise Model. I put a simple RegEx in front of the model, which handled the special cases they needed; otherwise my ML model did the work.
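The Otherwise Model pattern is simple enough to sketch. The version below is an illustrative Python stand-in, not the production Scala code: the class name mirrors the post, but the constructor arguments, the example pattern, and the labels are all hypothetical.

```python
import re
from typing import Callable, List, Tuple


class OtherwiseModel:
    """Check hand-written RegEx overrides first; otherwise defer to the ML model."""

    def __init__(self, overrides: List[Tuple[str, str]],
                 fallback: Callable[[str], str]):
        # Each override is a (pattern, label) pair for a must-work special case.
        self.overrides = [(re.compile(p, re.IGNORECASE), label)
                          for p, label in overrides]
        self.fallback = fallback  # any callable that maps text to a label

    def predict(self, text: str) -> str:
        for pattern, label in self.overrides:
            if pattern.search(text):
                return label
        return self.fallback(text)
```

A usage example with a stubbed-out model: `OtherwiseModel([(r"^chief happiness officer$", "title")], fallback=my_model_predict)`, where `my_model_predict` is whatever scoring function the trained model exposes. Keeping the override list short is the point; once it grows layers of exceptions, the earlier advice about labeling more data applies.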

PS If you’re wondering, here is a graph comparing the F1 scores of each model against each dataset. Notice that nowhere here does it mention that spinning up new infrastructure and components to support TensorFlow at large scale is non-trivial.

How to choose the right model for text classification in an organizational setting was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

From Gall’s Law to Gaul’s Law

Noah Burbank — Tue, 31 Mar 2020 15:16:01 GMT

“A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.”

- John Gall

“Gallia est omnia divisa in partes tres”

-Julius Caesar, Commentarii de Bello Gallico

If you’re going to build a complex system, you should start with a simple system. Seems obvious. And yet, if you’re building machine learning (ML) systems in a high pressure environment, you’re subjected to various pressures that can push you to act against this sage advice. Students of Latin will remember with awe (or gritted teeth) Julius Caesar’s habit of not only dividing Gaul into three parts, but nearly everything else into lists of three (think: “I came, I saw, I conquered”), a kind of “Gaul’s Law,” if you will. Inspired by a mishearing of Gall’s law (see above) as “Gaul’s Law,” I started breaking my natural language processing (NLP) projects down into three phases: end-to-end heuristics and labeling, some ML components, and systematic ML models.

In the first phase, I build a parameterized end-to-end system with heuristics and I label data. The entire system is parameterized. At first this will feel like a waste of time, because you only have one implementation for every component, but starting with this modularity makes subsequent phases much easier to implement, and, more importantly, easier to instrument and evaluate. Label data relevant to each component, write unit tests, and establish accuracy expectations. Labeled data are represented as little burlap sacks because they are precious, like burlap bags of gold.

In the second phase, I replace the worst performing heuristics with ML models.

In the final phase I replace subsystems (up to and including the entire system) with more complex models.
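As a rough illustration of what “parameterized” means across these three phases, the sketch below wires a pipeline from a configuration dictionary so that any component can be swapped without touching the rest. All names and the toy heuristics are hypothetical; the real system was a JVM library, not Python.

```python
from typing import Callable, Dict, List, Tuple


def heuristic_detector(text: str) -> List[str]:
    """Toy heuristic: treat each non-empty line as a candidate span."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def heuristic_tagger(span: str) -> str:
    """Toy heuristic: anything containing '@' is an email, else 'other'."""
    return "email" if "@" in span else "other"


def build_pipeline(config: Dict[str, Callable]) -> Callable[[str], List[Tuple[str, str]]]:
    """Assemble an end-to-end pipeline from named, interchangeable components."""
    detector = config["detector"]
    tagger = config["tagger"]

    def run(text: str) -> List[Tuple[str, str]]:
        return [(span, tagger(span)) for span in detector(text)]

    return run


# Phase one: every component is a heuristic.
phase_one = build_pipeline({"detector": heuristic_detector,
                            "tagger": heuristic_tagger})
```

In phase two, replacing a weak heuristic means changing one entry in the config, for example passing a trained model’s predict function as `"tagger"`, while the detector, the labeled data, and the unit tests stay as they are.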

The foundation of this development process is the data and not the model. When you start with the data, you start with facts, and they will not become wrong and you won’t have to redo them. When you start with the model, you put together the first model much faster, but you run the risk of building the wrong model and needing to redo it, only now with more pressure because you wasted time.

To make this more concrete, I’ll show how I followed Gaul’s Law when building Salesforce’s Contact Information Parser. The Contact Information Parser is a library that detects, extracts, parses, and enhances unstructured contact data to create high quality Salesforce contact objects. Its uses include automatically creating contacts from email signatures and OCR business card data.

End-to-end Heuristics and Labeling

The initial version of the system was entirely built with heuristics. It detected spans of text that might be a contact, validated that contact candidate, and then applied a sequence of heuristics that detected individual contact fields like name, title, and company. A resolver handled ties when a text span received multiple tags.

This phase takes the most time and effort because you need to build all of the scaffolding that holds the interchangeable parts together and you need to label a bunch of data, but it is also the one with the most learning. This is where you find out that your input has a lot of characters in the high UTF-8 register or has had the punctuation stripped or that what you thought were spaces were actually non-breaking spaces.
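A quick character audit is one cheap way to surface surprises like these early. The helper below is a hypothetical sketch using only Python’s standard library; it counts every non-ASCII character in a string and reports its Unicode code point and name.

```python
import unicodedata
from collections import Counter


def audit_characters(text: str) -> dict:
    """Count non-ASCII characters, keyed by code point and Unicode name."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}": n
        for ch, n in counts.items()
    }
```

Run over a sample of raw inputs, a report like this makes a non-breaking space (U+00A0) posing as an ordinary space visible before it silently breaks a whitespace-based heuristic.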

Your components are easy to understand and the entire system is more or less interpretable. You can still reason about the individual components and the system in general, and you can hand-tune everything. Because you can reason about your model, you can develop intuitions and theories about why it works and what it needs. This is information that you did not have when you started. Imagine debugging a deep-learning model if you hadn’t gone through this process.

There are non-technical benefits to starting with heuristics. They are relatively easy to write, so you can move quickly. The sooner you understand your data, have a working prototype, and get baseline accuracy and performance metrics, the less stressed your manager is going to be and the more time you can spend thinking instead of responding to bureaucratic pressure. Although your boss wants to hear “it’s done” more than anything else, telling your boss “We have data and know where to invest our effort going forward” is better than “I’m using the latest technology and this model is going to be awesome, but I can’t quantify any of this.”

At some point, however, your heuristic models are going to drive you crazy. Rules tend to be high precision, low recall, and, as they start to overlap each other, you have to install layers of exceptions and resolvers. Machine learning is now an easier and more sustainable way to improve performance. It’s time to use some ML components.

Use Some ML Components

Your labeled data allows you to identify which heuristics are the weakest and should be replaced with ML models. The machine learned models might have worse performance than the heuristics, but since the pipeline is parameterized, you can keep developing them before you swap them in. Because of all the work in the first phase, this phase is actually pretty easy because you get to start with some of your featurization already done: heuristics make for excellent features. Your recall will most likely improve, although precision will probably get worse. Because you labeled the data, you’ll know exactly by how much and you’ll have a confusion matrix for each model the moment you swap it in.

For the Contact Information Parser, we swapped out the heuristic contact detector with a CRF model and the title heuristics module with a logistic regression model. Both used features from their respective heuristic versions. They were the weakest components in the pipeline and replacing them with models significantly improved performance.
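The idea that heuristics make good features can be sketched as follows: each rule that used to vote directly on the answer becomes one binary feature that a model can weigh. The rules and names below are invented for illustration and are not the features used in the Contact Information Parser.

```python
import re

# Hypothetical title heuristics, each returning True/False for a text span.
HEURISTICS = {
    "has_title_keyword": lambda s: bool(
        re.search(r"\b(manager|director|engineer|officer)\b", s, re.IGNORECASE)),
    "is_short": lambda s: len(s.split()) <= 5,
    "has_digit": lambda s: any(c.isdigit() for c in s),
}


def featurize(span: str) -> dict:
    """Turn every heuristic's verdict into a 0/1 feature for an ML model."""
    return {name: int(rule(span)) for name, rule in HEURISTICS.items()}
```

Feature dictionaries like these can be fed to a logistic regression or CRF through whatever vectorizer your framework provides, so the model learns how much to trust each rule instead of relying on a hand-written resolver.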

Once you’ve replaced enough components, you can set your parameters to choose the ML components over the heuristic components, and you can look at system and subsystem level performance. The worst part of the Contact Information Parser at this point was the Resolver for the field taggers. We built a recursive function to split/retag and installed multiple layers of rules to resolve spans that persisted with multiple tags. Understanding, maintaining, and improving it became very difficult and unpleasant. We thought that the system would perform better if it had information about all of the spans at once. This brought us to the final phase.

Systematic ML Models

You are now able to make informed decisions about how big of a piece is worth replacing with an ML model, how far away your system is from “good enough,” and what resources are necessary to finish the project to varying degrees of completion. You can take your entire system, with all of its labeled data and performance data, and replace it end-to-end with a gigantic deep learning model, or you can replace smaller subsystems with ML models. For this reason, this stage requires the most judgment. Luckily for you, this is also the phase in which you have the most information about and experience in solving this problem.

In the Contact Information Parser, we replaced the sequence of models and heuristics used to assign text spans to contact fields with a single CRF model, which we called the CRF Field Model. Building the model at this point in our development was simple because of all the work we had already done. We used all of the heuristics as features, added in n-grams, and produced a model that improved performance across all classes. It was easier to understand, and it was easier to improve. Instead of having to think up new rules and exceptions to those rules, we simply labeled more data. Building the CRF model from the very beginning would have been much harder and it is unlikely that we would have had ideas for features that were as good.

Practical Consequences

On a final note, you might notice that earlier I used the bizarre phrase “varying degrees of completion.” This brings me to the final, and probably most important, reason why you should follow Gaul’s Law. As early as possible, you had a system that completed the required task. It was probably unsophisticated and made harsh tradeoffs between precision and recall, but it provided end-to-end functionality. Things change quickly in the AI world. You might work in an organization that boasts being “agile” and only looks two weeks into the future. The odds that you will have enough resources to complete your project to the highest standards are nearly zero. If you plan your work so that nothing is done until everything is done, you are unlikely to ever finish. Following Gaul’s Law allows you to ship something when organizational priorities and resources shift away from your project, and gives you incremental performance data to help keep the resources around for as long as possible.

From Gall’s Law to Gaul’s Law was originally published in Salesforce Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to use open source data to solve customer problems

Noah Burbank — Tue, 10 Mar 2020 17:21:14 GMT

Imagine a Data Scientist on a life raft adrift in the middle of the ocean, lips chapped, sunburned, wearing a tattered shirt from some startup that doesn’t exist anymore. The Data Scientist manages to hoarsely say, “Water! Please Water!” and stares deliriously at the ocean and starts to weep. Alas, you know the saying, “Water, water, every where, / Nor any drop to drink.” While we gather giga- and terabytes of data every hour of every day, until it gets labeled, that data is as useful as the horizon-to-horizon saltwater is to our poor, stranded nerd.

Labeling data is not easy. It requires judgment. It’s slow. Labelers disagree with each other. There are many tools and solutions for these problems. This article is not about those problems; it is about another set of problems for labeling data: privacy, trust, and the General Data Protection Regulation (GDPR). In order to label data, you need to be legally and ethically permitted to look at it. In this post I’m going to show you how to respect the trust, privacy, and dignity of your customers by modifying open source datasets to mimic key features of their data.

This problem reared its head when we were developing the Contact Information Parser (CIP). Contact information is, by definition and on purpose, personally identifiable information (PII). We needed PII to make the CIP, but our Master Service Agreement (MSA) requires that we scrub PII from customers’ data before we label it. So here we are, sitting on a sea (ok, “lake”) of data, none of which we can drink. There are three solutions to this problem: get permission to use customers’ PII data, mask all of the PII before using their data to make the CIP, or find an open source dataset.

The first option would require that we renegotiate the MSA with customers coming up for renewal. It would take a long time, it would cost us lawyer fees and contract concessions, and some customers (think healthcare and finance industries) could not agree to it even if they wanted to. The second option, building the CIP from data with the PII masked, guarantees both that we don’t know how good our models are (you can’t calculate recall without a labeled dataset) and that they would probably be quite bad because of a lack of data. So the only real solution was to find an open source dataset.

As it turns out, there are a lot of open source email datasets. The most famous is the Enron corpus. There are other datasets of emails available, typically from government. Some states, like Florida, have “Sunshine Laws” that mandate emails be published within a certain number of years after a governor leaves office. Jeb Bush’s emails are publicly available (although when first released he forgot to remove social security numbers from donors). John Podesta’s emails from when he was the chair of Hillary Clinton’s 2016 presidential campaign are publicly available.

The problem with using these corpora is that they are dissimilar to a corpus constructed out of my customers’ data. Many of the emails are not business emails. Twitter, Facebook, Instagram, and YouTube did not exist in 2002 and cannot be found in the Enron dataset. The datasets include extremely idiosyncratic phenomena unknown in your customers’ data. Enron, for example, had this weird internal phone network system where everybody used a stupid five-digit extension code and put that in their signatures.

By going with open source emails, we have replaced the legal problem of labeling PII data with a technical problem: how do we make an open source corpus look like our customers’ data? Let’s explore the simple example of how we did this for the address parser component of our Contact Information Parser (CIP). Addresses are a very common part of contact information, so not only did we need to be able to identify them, we needed to parse the individual parts of the address. The process generalizes to four components:

Corpus profiling method: Corpus => Profile
Profile comparison metric: (Profile, Profile) => Double
Corpus change method: (Corpus, Profile) => Corpus
Non-convex minimization of the profile comparison metric across the parameter space for the corpus change method

Start by making profiles for your open source corpus and your customer corpus. Then run the corpus change method in all its variations on the open source corpus. Then select the modified corpus whose profile has the smallest value of the profile comparison metric against the customer profile. The visualization below illustrates this in 2D space.

Suppose that you want a lion, but you have a Maine Coon. If you’re careful, you can adjust the Maine Coon until it has a profile that looks a lot like a lion. It won’t be a lion, but it also won’t cause you any of the problems that a lion causes you.

Let’s get started!

The Corpus Profiling Method: Corpus => Profile

The hard part of writing the profile method is figuring out what makes for a meaningful profile. I was going to make a sequential model, so my profile consisted of a couple of metrics, along with the most common state transitions and their probabilities. We’re able to produce the profile for the same reason that we’re able to train models on customer data: we can use their data in aggregate for unsupervised models. A profile is an unsupervised model.

Most address parsers rely on a combination of large vocabularies, regular expressions, and rules to capture edge cases. So I started by building a heuristic signature parser based on, you guessed it: regular expressions, word lists, and rules. This allowed me to take a given address string and turn it into a sequence of state transitions.

This works well because of the quality of the lists of addresses that you can get, but if you don’t want to keep massive resource files around, and you want to make a model that can handle unknown input by making use of the sequential information, a sequential ML model, like Conditional Random Fields (CRF), is the way to go. Given that the model was going to make use of the sequential information, the profile that was going to be the most meaningful for my use would be the distribution of the most common state transitions. I found that collapsing transitions that maintain the current state improved the quality of my results. For example:

https://medium.com/media/e9108601d85662904711a61d7fdc1440/href

Count the collapsed patterns with a ReduceByKey, and for any address corpus we can produce a profile like the following:

35% HOUSE_NO — STREET
19% HOUSE_NO — STREET — CITY — STATE — ZIP_CODE — COUNTRY
5% COUNTRY
5% HOUSE_NO — CITY — STREET
3% HOUSE_NO — STREET — SUITE — CITY — STATE — ZIP_CODE — COUNTRY
2% HOUSE_NO — STREET — CITY — STREET
2% HOUSE_NO — STATE — STREET
1% SUITE — HOUSE_NO — STREET — CITY — STATE — ZIP_CODE — COUNTRY
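The collapsing and counting steps described above can be sketched in a few lines. This is a minimal single-machine Python sketch, not the Spark job from the post: it assumes each address has already been turned into a list of tags by the heuristic parser, and the function names `collapse` and `profile` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def collapse(tags):
    """Collapse consecutive repeats, i.e. transitions that stay in the same state."""
    return tuple(tag for tag, _ in groupby(tags))


def profile(tagged_corpus, top_n=None):
    """Map each collapsed tag pattern to its relative frequency in the corpus.

    tagged_corpus is an iterable of tag sequences, one per address.
    top_n keeps only the most common patterns (None keeps all of them).
    """
    counts = Counter(collapse(tags) for tags in tagged_corpus)
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.most_common(top_n)}
```

For instance, the tag sequence `HOUSE_NO, STREET, STREET, STREET, CITY` collapses to `HOUSE_NO, STREET, CITY`, so a three-word street name and a one-word street name land in the same pattern. The counting step plays the role that ReduceByKey plays on a distributed corpus.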

The Profile Comparison Metric: (Profile, Profile) => Double

If we’re going to modify the open source addresses to look more like our customers’ addresses, we need to measure how far apart the two are. In a certain sense, it doesn’t matter too much how you go about this. We’re going to end up using it as an objective function for the optimization that comes up later, so as long as more different is a bigger number, you’re good. I opted for an asymmetrical score. In the following equation, you have the top n patterns, and you sum one less the probability-weighted absolute value of the difference between that pattern’s frequency in the customer profile and in the modified open source profile.

I chose this for two reasons. First, it is weighted by the target set. Second, it deals with cases where some pattern is totally missing from the target dataset.
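As an illustration, here is a minimal Python sketch of one asymmetric score that fits this description. It is an assumption, not necessarily the exact formula used: it sums the target-weighted absolute difference directly, so that a larger value means the profiles are more different, and it treats a pattern absent from the candidate profile as having frequency zero. Profiles are assumed to be dictionaries from pattern to relative frequency, and `profile_distance` is a hypothetical name.

```python
def profile_distance(target, candidate, top_n=10):
    """Asymmetric distance between two profiles, weighted by the target.

    Only the top_n patterns of the target (customer) profile are considered,
    so patterns that never occur in the target contribute nothing.
    """
    top = sorted(target.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    # Weight each pattern's frequency gap by how common it is in the target.
    return sum(p * abs(p - candidate.get(pattern, 0.0)) for pattern, p in top)
```

The asymmetry is deliberate: `profile_distance(customer, open_source)` asks how well the open source corpus covers what customers actually write, and ignores open source patterns that customers never use.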

Corpus Change Method: (Corpus, Profile) => Corpus

We can’t use our customers’ data to change the open source corpora, but we can use the profile of our customers’ data. This method is the most complex and interesting part of the shaving-a-Maine-Coon-to-look-like-a-lion process. It is likely that your method will involve a couple of parameters, so prepare to run some kind of search after you write this method to find the parameters that deliver the lowest value of the profile comparison metric.

I tried three strategies in writing this method. First: filter out addresses from the public dataset that did not match any of the top state transition patterns in the customers’ address profile. This was a great idea, but it took our original corpus of roughly 200,000 addresses and left us with fewer than five hundred. The second idea was to create an address generator. Take the distribution of sequences from the customers’ data, and then generate examples of that sequence with tokens from the open source data. The problem is that none of these addresses would be real, with combinations of streets, cities, and states that do not even kind of exist, e.g. San Francisco, Utah and New York City, New Mexico. It also would take a kind of ouroboros logic: the heuristic parser would be used to create data for the machine learned model, somehow with an expectation of higher accuracy. It was unlikely that the model would learn more about the real logic of addresses than it would about the logic of my quick-n-dirty heuristic parser.

I then thought of a third way (which we ultimately ended up using): mildly destructive filtering. If an address fits one of the most common patterns [this is parameterized] in the customer data, keep it (so that those [354] perfect addresses don’t go to waste). Otherwise, try dropping all combinations of one or two tokens, and then randomly sample from those valid sequences according to their distribution in our customers’ dataset. Let’s consider the example of the White House address: List("1600", "Pennsylvania", "Ave", "NW", "Washington", "DC", "20500") which has the collapsed tag pattern: List(HouseNo, Street, City, State, ZipCode). If the pattern List(HouseNo, Street, City, State, ZipCode) is one of the top n patterns in your customers’ profile, then keep it in. It probably is. Now consider another address: List("1600", "Pennsylvania", "Ave", "NW", "Suite", "2A", "Washington", "DC", "20500", "USA") with the collapsed tag pattern: List(HouseNo, Street, Suite, City, State, ZipCode, Country). People rarely include the country code in their business address, so this is not one of the top n patterns. So we expand out all of the variations that drop one and two elements. I show the first three elements of this list to illustrate:

List("Pennsylvania", "Ave", "NW", "Suite", "2A", "Washington", "DC", "20500")
List("1600", "Ave", "NW", "Suite", "2A", "Washington", "DC", "20500")
List("1600", "Pennsylvania", "NW", "Suite", "2A", "Washington", "DC", "20500")

Each of those variations gets represented as its collapsed tag sequence. Invalid state transitions are removed, and then we randomly select a pattern based on its frequency from the customers’ profile. This produces an imbalance toward shorter sequences, so we also up- and downsample to help smooth this out.
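The keep-or-drop logic above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the production code: it treats any collapsed pattern that does not appear in the customer profile as an invalid transition, it leaves out the up- and downsampling step, and the names `drop_variations` and `mildly_destructive_filter` are hypothetical.

```python
import random
from itertools import combinations, groupby


def collapse(tags):
    """Collapse consecutive repeats of the same tag."""
    return tuple(tag for tag, _ in groupby(tags))


def drop_variations(tokens, tags, max_drop=2):
    """Yield (tokens, tags) with every combination of 1..max_drop positions removed."""
    n = len(tokens)
    for k in range(1, max_drop + 1):
        for dropped in combinations(range(n), k):
            keep = [i for i in range(n) if i not in dropped]
            yield [tokens[i] for i in keep], [tags[i] for i in keep]


def mildly_destructive_filter(tokens, tags, customer_profile, keep_top_n, rng=random):
    """Return an address whose pattern occurs in the customer profile, or None."""
    ranked = sorted(customer_profile, key=customer_profile.get, reverse=True)
    if collapse(tags) in ranked[:keep_top_n]:
        return tokens  # already one of the top patterns: pass through untouched
    # Group the surviving variations by their collapsed pattern.
    by_pattern = {}
    for toks, tgs in drop_variations(tokens, tags):
        pattern = collapse(tgs)
        if pattern in customer_profile:  # assumption: unseen patterns are invalid
            by_pattern.setdefault(pattern, []).append(toks)
    if not by_pattern:
        return None
    # Pick a pattern in proportion to its customer frequency, then one variant of it.
    patterns = list(by_pattern)
    weights = [customer_profile[p] for p in patterns]
    chosen = rng.choices(patterns, weights=weights, k=1)[0]
    return rng.choice(by_pattern[chosen])
```

Sampling the pattern first, and only then a concrete variant, keeps a pattern from being over-represented just because many different token drops happen to produce it.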

Minimization

My corpus change method had three parameters: whether to collapse sequences, how many of the top sequences to allow to pass through untouched, and how many of the top sequences to use to characterize a corpus. An exhaustive search of all reasonable values of this space covers at most a couple thousand combinations. I wrote a Spark job to spread out all of the parameter combinations, process them, and collect each combination with its profile comparison metric. We then used the best one.
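The exhaustive search can be sketched as a plain grid search. This is a minimal single-process Python sketch rather than a Spark job, and the names are assumptions: `score_fn` is a hypothetical callable that would run the corpus change method with the given parameters and return the profile comparison metric against the customer profile.

```python
from itertools import product


def grid_search(score_fn, collapse_options, pass_through_options, profile_size_options):
    """Score every parameter combination and return them sorted, lowest score first."""
    results = []
    for collapse_flag, pass_through_n, profile_n in product(
        collapse_options, pass_through_options, profile_size_options
    ):
        params = {
            "collapse": collapse_flag,         # whether to collapse sequences
            "pass_through_n": pass_through_n,  # top patterns kept untouched
            "profile_n": profile_n,            # top patterns used in the profile
        }
        results.append((score_fn(**params), params))
    # Sort on the score only, since the parameter dicts are not orderable.
    return sorted(results, key=lambda pair: pair[0])
```

Each combination is scored independently of the others, which is what makes the job easy to spread across a cluster. Keeping every (score, parameters) pair, rather than only the winner, also lets you see how sensitive the result is to each parameter.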

I hope that this post helps illustrate how to build the products our customers want while still treating them and their data with respect. Part of what makes for a great data scientist is a genuine excitement about data, about what it tells you about the world, about how it was made, how it was gathered, and what biases it carries within it. This subjective and qualitative experience is typically subsumed under some boring corporate heading like “domain expertise,” but there’s so much more to it. What I hope this approach gives you is a way to keep feelings around, a way to keep around that subjectivity and idiosyncrasy of each dataset, without bearing the ethical and legal costs that come with using private, customer data.

Please feel free to reach out to me, either through email (nburbank [at] salesforce [dot] com) or on my Instagram page (@nburbank) where you can see portraits of the excellent colleagues I get the privilege of working with every day.

How to use open source data to solve customer problems was originally published in Salesforce Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.