Demystifying Data Science: Why Profanity Isn’t Always Profane

Published in Radius-Engineering, Sep 22, 2017

As a data scientist, I sit at the nexus of data, software, and business users, often bridging the gap between all three to ensure businesses find success with their data. One of the key responsibilities in this role is cleaning and filtering data so that quality standards are met before the data makes its way into official database records. Some of this filtering is relatively straightforward: information needs to be checked for accuracy and deliverability, and duplicates need to be removed.

Other aspects of the process, however, are less obvious and are sometimes only noticed when they show up in the data and make their presence blatantly known. Among these is the important task of removing potentially offensive company names from the data set, an issue you might not have thought about until such a name is suddenly staring at you from the data you are working with. At Radius, our entire team works to offer the latest and most accurate B2B business and contact data. But, as with any other data set, we need to exclude offensive, obscene, or inappropriate companies that should not make their way into our records or be presented to our customers.

Why profanity isn’t always profane

While this may seem like a rather straightforward task, it becomes much more difficult when you realize that a term that is vulgar to one person may have an entirely different and more benign meaning to another. Additionally, many offensive terms happen to be legitimate words that are not inappropriate or crude in other contexts. In fact, we find that in the process of removing “bad” words, we may also be removing legitimate, non-obscene company names. Finding the right balance is not always an easy task.

This problem is by no means confined to Radius data. In fact, it’s so prevalent across industries that it has a name: the “Scunthorpe problem,” so named after AOL’s profanity filter in 1996 prevented residents of the town of Scunthorpe, England from creating accounts with their product.

In case you’re wondering why that particular town’s name was flagged, try to identify the string of letters within the name that spells an obscene word.

Google apparently didn’t learn from AOL’s mistake, as their filters encountered the same issue with the same town. Similarly, the towns of Lightwater and Clitheroe were flagged by profanity filters for obscene strings of letters within their names.
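To make the failure mode concrete, here is a minimal sketch of the difference between matching a blocked term anywhere in a name and matching it only as a whole word. The blocklist, the function names, and the company names are all made up for illustration (the terms are mild ones discussed later in this post), and this is not a description of how AOL’s, Google’s, or Radius’ filter is actually implemented.

```python
import re

# Hypothetical blocklist, using mild terms that come up later in this post.
BLOCKLIST = {"hate", "laid"}

def substring_filter(name):
    """Flag a name if any blocked term appears anywhere inside it."""
    lowered = name.lower()
    return any(term in lowered for term in BLOCKLIST)

def whole_word_filter(name):
    """Flag a name only if a blocked term appears as a complete word."""
    words = re.findall(r"[a-z]+", name.lower())
    return any(word in BLOCKLIST for word in words)

# Made-up company names for illustration.
for name in ["Whatever Works Plumbing", "Plaid Shirt Co", "Laid Back Lodge"]:
    print(name, substring_filter(name), whole_word_filter(name))
```

The substring version flags all three names, because “whatever” contains “hate” and “plaid” contains “laid”. The whole-word version clears the first two but still flags “Laid Back Lodge”, which is why word boundaries alone do not settle the question of context.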

Because of this, I’ve decided to investigate Radius’ profanity filtering process to:

Identify how many companies are being removed from the data that are in fact legitimate
Examine whether any patterns emerge from the analysis

So, how does Radius’ filter fare?

In total, 1,263 companies within The Network of Record™ were filtered out for profanity, making up only 0.02% of the approximately 5.8 million companies filtered out as ‘invalid’ and a minuscule portion of the almost 40 million valid and invalid companies that we track. Surprisingly, profanity doesn’t appear to be as prevalent as we might have assumed at the outset of this process.

Next, I plotted the locations of all the companies the algorithm labeled as “profane.” I was interested to see whether some locations were more prone to having vulgarly named companies. Each state is colored by the number of profane-labeled companies divided by the number of companies labeled invalid in that state. Red stars indicate a higher percentage of profane company names in that state; blue means it’s lower.

It appears that Washington state and DC both have a high proportion of “profane” companies. The reasons for this are not immediately clear, although before we jump to conclusions, we should confirm that the companies filtered out are indeed profane and not a mistake made by the algorithm. Taking the earlier AOL and Google stories as examples, we want to ensure that these locations don’t have specific terms that are used in different ways or contexts than conventional profanity.

Upon further analysis, we found that many companies were deemed “profane” because their names contained the term ‘Jerk’ or fell into one of the categories illustrated in the above chart.
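The normalization described above can be sketched in a few lines. The counts below are invented purely to show the calculation and are not Radius figures; the function and variable names are likewise hypothetical.

```python
# Made-up counts purely for illustration; these are not Radius figures.
profane_by_state = {"WA": 30, "DC": 12, "CA": 40}
invalid_by_state = {"WA": 100_000, "DC": 20_000, "CA": 800_000}

def profane_rate(profane, invalid):
    """Return {state: profane / invalid}, skipping states with no invalid companies."""
    return {state: profane.get(state, 0) / n for state, n in invalid.items() if n > 0}

rates = profane_rate(profane_by_state, invalid_by_state)

# List states from the highest rate to the lowest.
for state, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state}: {rate:.4%}")
```

In this toy data, CA has the largest raw count but the lowest rate, which is the reason for dividing by the number of invalid companies per state rather than plotting raw counts.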

In fact, 63.74% of all the “profane” companies fall into one of these categories. Let’s dive a little deeper into each of these categories.

Why chicken isn’t profane

The most common reason companies are filtered out for profanity is the term “Jerk” in the name (18.61% of the “profane” companies contained “jerk” in the name). However, on closer inspection, the word “jerk” is not being used as an insult or as profanity. Instead, it is food-related, e.g. ‘Jerk Chicken’, ‘Jamaican Jerk’, ‘Caribbean Jerk’, etc.

We found that 60.43% of the company names flagged for “Jerk” were related to food. The rest were unclear, including: ‘Jerk Chateau’, ‘The Jerk House Reggae Sessions’, ‘Dat Jerk’, ‘The Jerk House’, ‘Yardies Jerk’.

Why context plays a big role with LGBT terms

Next, we found that companies were getting filtered out if they had common LGBT terms in their names. While these terms may at times be used in a derogatory way, they are often legitimate terms used within the LGBT community. Adding these terms to the blacklist turns out to be highly problematic, because it filters out LGBT community centers, LGBT health clinics, and any other organizations focused on LGBT issues.

Many of them are important companies that customers might be trying to reach. The challenge here is to ensure that profanity is filtered out without excluding LGBT organizations and companies.

Why sex isn’t always profane

Similarly, the term “sex” is also filtered, and companies with that term in their name were removed from our dataset. However, as with LGBT terms, most of these companies were organizations focused on sex education, sexual health, and sex therapy, as opposed to inappropriate content.

This also appears to be a case where looking at context matters more than banning the term outright, so that we don’t exclude valuable companies.

Why countries and animals aren’t profane

The fourth category of filtered companies consisted of names containing terms with racist connotations.

The problem with this is that many racist terms happen to have entirely different, benign meanings in other contexts. Often, these “racist” companies were not racist at all.

For example, “Spic and Span Laundry Services,” “Spic & Span Cleaning,” and all other company brands related to cleaning and laundry services were removed. “Bee and Wasp Removal” services were also out. Even embassies and government-related organizations including “Papua New Guinea Embassy,” “Papua New Guinea Mission To The U N”, and “Papua New Guinea Tourism” were filtered. Along with companies related to the country, the algorithm also removed “Guinea Pig Farm,” “Texas Rustlers Guinea Pig Rescue,” and the like from the data.

As you can surmise, removing all companies related to an entire country isn’t the best method to limit the use of a potentially profane term. But more on how we solve that later in this post.

Other profane terms that didn’t make the cut

All companies that were “against hate” or “silencing hate” or “preventing hate” were filtered out. The term “hate” was a no-no. Similar to the LGBT-conscious companies, it’s crucial to discern the context in which particular terms or phrases are being used prior to them being filtered out.

We also found that ‘Hasting Dwight Custom Laid Tile’ and ‘Laid Right Flooring’ were removed. ‘Laid Back Gardens’, ‘Laid Back Lodge’, ‘Laid Back Festival’, ‘Laid Back Services’, ‘Laid Back Fishing Innovations, LLC’, ‘Laid Back Larrys,’ and any other “Laid Back” company was out. As you can guess, the term “laid” is being filtered out, but in this case it’s not being used in a profane context.

In fact, most “profane” company names belonged to legitimate companies. Below you can see the number of legitimate, unclear, and actually profane company names by category of profanity.

So looking back at our geographical map, why do Washington and DC have so many “profane” companies? To help answer this, I plotted the locations by category of profanity.

Looking more granularly at the DC region, I found that there were many businesses with LGBT terms in their names, which explains why many companies were labeled ‘profane.’ 6.6% of all the company names which had those terms in the name were located in DC, despite the fact that only 0.3% of US businesses within our data are located there.
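One way to read those two percentages together is as an over-representation ratio: DC’s share of the flagged names divided by DC’s share of all US businesses in the data. The short calculation below uses only the two figures quoted above; the function name is a hypothetical helper.

```python
def over_representation(share_in_group, share_overall):
    """How many times larger a region's share is within a group than overall."""
    return share_in_group / share_overall

# 6.6% of names with these terms are in DC vs. 0.3% of all US businesses in the data.
ratio = over_representation(0.066, 0.003)
print(f"DC is over-represented by a factor of about {ratio:.0f}")
```

In other words, names with these terms are roughly 22 times more concentrated in DC than businesses in general.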

As for Washington, the high profanity count appears to be due to companies such as ‘Franklin PUD’, ‘SNOHOMISH COUNTY PUD’, ‘Saddlecreek Pud Partners’, and ‘PUD Federal Credit Union’ being located in that region. About 69% of all companies with “pud” in the name are located in Washington. After some investigation, I found that PUD is an acronym for Washington’s Public Utility Districts, which are not-for-profit, community-owned utilities that provide electric, water, sewer and wholesale telecommunications services. That solves the mystery of all the “profane” companies coming from Washington state, and it further illustrates how an obscene term for one person can be an innocuous acronym for another.

So, how can we improve our profanity analysis?

These examples show that a word’s meaning depends on the context of the entire phrase surrounding it. As AOL and Google showed, banning terms outright, even when they appear inside other words or phrases, leads to valid town names being inadvertently blocked from their products. Similarly, filtering out certain terms from The Network of Record leads to valid companies being filtered out.

More importantly, in some cases we are filtering out socially-conscious companies that are fighting against derogatory and offensive issues.

Clearly, a profanity filter is not easy to implement correctly: there is a necessary tradeoff between limiting profanity and removing legitimate companies. It appears that some manual oversight might be necessary in cases like this.

After conducting this analysis, we removed several of the most common LGBT terms from the list. Given that the number of actually profane company names with those terms was extremely low, we wanted to allow more socially-conscious businesses to make it into the Network of Record.

Instead of banning certain words outright, the team is going to work on building a more complex profanity model that takes the context of the phrase into account. If there is evidence that the term in question is not being used in a derogatory manner, the business should remain in the data set.

For example, ‘hate’ should be OK if it is preceded by ‘silencing,’ ‘preventing,’ or ‘against.’ ‘Laid’ should be fine if it’s followed by ‘back.’
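As a rough illustration, the two rules just mentioned could be encoded as lookups on the neighboring word. This is only a sketch of the idea, not the model the team is building: the dictionaries, the function name, and the two-term blocklist are assumptions made for the example.

```python
import re

# Blocked terms mapped to neighboring words that make them acceptable.
# Only the rules named in this post are encoded; the structure is an assumption.
ALLOWED_BEFORE = {"hate": {"silencing", "preventing", "against"}}
ALLOWED_AFTER = {"laid": {"back"}}
BLOCKLIST = set(ALLOWED_BEFORE) | set(ALLOWED_AFTER)

def is_flagged(name):
    """Flag a name if it has a blocked word that no context rule excuses."""
    words = re.findall(r"[a-z]+", name.lower())
    for i, word in enumerate(words):
        if word not in BLOCKLIST:
            continue
        prev_word = words[i - 1] if i > 0 else None
        next_word = words[i + 1] if i + 1 < len(words) else None
        if prev_word in ALLOWED_BEFORE.get(word, set()):
            continue  # e.g. "against hate"
        if next_word in ALLOWED_AFTER.get(word, set()):
            continue  # e.g. "laid back"
        return True
    return False

for name in ["Laid Back Lodge", "Citizens Against Hate", "Laid Right Flooring"]:
    print(name, is_flagged(name))
```

With these rules, ‘Laid Back Lodge’ and the made-up ‘Citizens Against Hate’ pass, but ‘Laid Right Flooring’ is still flagged. Hand-written rules only cover the contexts someone thought to list.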

A bit more complexity and more human oversight may be necessary in truly separating profane companies from legitimate businesses.