How we built a custom field type identifier at Formaloo

Tadeh Alexani · Published in Formaloo · Aug 23, 2020

[Image: Formaloo Form Builder]

First, let’s introduce you to Formaloo:
Formaloo Customer Data Platform (CDP) collects, analyzes, and unifies data from all of your data sources in order to grow customer loyalty.

One of Formaloo’s main services is its drag & drop Form Builder (alongside the CDP, WordPress plugin, Invoice Builder, etc.). Our form builder helps users create contact forms, online surveys, invitations, and any other form they can imagine, so they can collect the data, registrations, and payments they need. Currently, 90,000+ forms and quizzes are active on Formaloo.

In order to process the data submitted on the forms, we need to extract some data from each submitted row. Email, phone number, and name are the main fields we use to identify the customer who submitted the form.

But we faced a problem: many of our users use non-email, non-phone field types, such as short-text or long-text fields, as email or phone fields (so they can avoid validation and related issues). We also don’t have a specific field type for names, so we wanted to know which fields our users usually use as a name field (perhaps short-text, among others).

As you can guess from the description above, this is a “text classification” problem. So let’s talk about how we handled it using machine learning.

Preprocessing Data

The dataset we had consisted of several arrays containing the email, phone, and name field titles used by our users, extracted using common keywords for name field titles and using the field types for phone and email, in two languages (English and Persian).

We created a data frame for each type separately using pandas. For example, for email:

import pandas as pd

emails_df = pd.DataFrame({'field_title': emails_en + emails_fa, 'category': 'email'})

As you can see, we created a field_title column consisting of the English and Persian titles concatenated, plus a category column that differs for each type (‘email’, ‘phone’, ‘name’).

For data cleaning purposes, we removed field titles with fewer than 4 characters from the email and phone data frames, and fewer than 3 characters from the name data frame. For example, for the name:

names_df = names_df[names_df['field_title'].str.len() >= 3]

We then concatenated all these data frames into one:

df = pd.concat([emails_df, phones_df, names_df], ignore_index=True)

To avoid mistakes, we did some cleaning on the concatenated data frame:

  1. Lowercased the field_title column values:
    df['field_title'] = df['field_title'].str.lower()
  2. Removed “None” values from df:
    df = df.dropna()
  3. Found and removed outliers (part 1):
    For example, we found that “Social Security ID” was used 37 times as an email, phone, or name field title. Other examples are the default “title” (left unchanged by users), used 1,990 times, and “Postal code”, used 8 times, as field titles. We created a list of words to ignore and cleansed our data frame of these outliers:
    df = df[~df['field_title'].str.contains('|'.join(words_to_ignore))]
  4. Found and removed outliers (part 2):
    To prevent outliers from affecting our classifier, we found values with fewer than 2 occurrences in the field_title column and removed them:
    df = df[df['field_title'].map(df['field_title'].value_counts()) >= 2]
  5. Dropped duplicate rows:
    df = df.drop_duplicates()

Data Exploration

We added a column encoding each category as an integer, because categorical variables are often better represented by integers than strings.
We also created a couple of dictionaries for future use.

# factorize() encodes the object as an enumerated type or categorical variable
df['category_id'] = df['category'].factorize()[0]
category_id_df = df[['category', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'category']].values)

We plotted the number of values in each category to see if they are balanced or not:
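A minimal sketch of how such a bar chart can be produced (assuming matplotlib is available; the exact plotting code here is an illustration, not our original):

import matplotlib.pyplot as plt

# Count the number of field titles in each category and draw a bar chart
df.groupby('category').field_title.count().plot.bar(ylim=0)
plt.show()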

[Figure: Imbalanced Classes]

This means that we need to resample our data. A widely adopted technique for dealing with highly imbalanced datasets is resampling: removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling). Despite the advantage of balancing classes, these techniques also have their weaknesses (there is no free lunch).

The technique we used is over-sampling followed by under-sampling, combining the SMOTE and Tomek links techniques:

from imblearn.combine import SMOTETomek

# X is the (numerical) feature matrix and y the labels to resample
smt = SMOTETomek(sampling_strategy='auto')
X_smt, y_smt = smt.fit_resample(X, y)

'auto': Equivalent to 'not majority'. Resample all classes but the majority class.
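To double-check the result, the class counts can be inspected after resampling (a quick sketch; y_smt comes from the call above):

import pandas as pd

# Each class should now have roughly the same number of samples
print(pd.Series(y_smt).value_counts())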

[Figure: Balanced Classes]

Text Representation

Classifiers and learning algorithms cannot directly process text documents in their original form, as most of them expect numerical feature vectors of a fixed size rather than raw text documents of variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

One common approach for extracting features from text is to use the bag of words model: a model where for each document the frequency of words is taken into consideration, but the order in which they occur is ignored.

We used scikit-learn’s TfidfVectorizer to compute a tf-idf score for each term:

  • sublinear_tf is set to True to use a logarithmic form for frequency.
  • min_df is the minimum number of documents a word must be present in to be kept.
  • norm is set to l2, to ensure all our feature vectors have a Euclidean norm of 1.
  • ngram_range is set to (1, 2) to indicate that we want to consider both unigrams and bigrams.
  • stop_words is set to remove common stop words and reduce the number of noisy features.

from numpy import loadtxt
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

persian_stop_words = loadtxt('stopwords.dat', dtype=str, delimiter='\n')
stop_words = text.ENGLISH_STOP_WORDS.union(persian_stop_words)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=3, norm='l2', encoding='utf-8', ngram_range=(1, 2), stop_words=stop_words)
features = tfidf.fit_transform(df['field_title']).toarray()
labels = df.category_id

Persian Stop Words List: https://github.com/kharazi/persian-stopwords

Now each of the 4,327 field titles is represented by 1,116 features, representing the tf-idf scores of different unigrams and bigrams.
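These numbers are simply the shape of the resulting feature matrix:

# Each row is a field title, each column the tf-idf score of a unigram or bigram
print(features.shape)  # (4327, 1116)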

Unigram vs Bigram

We are trying to teach a machine how to do natural language processing. We humans can understand language easily, but machines cannot, so we try to teach them specific patterns of language. A single word has a meaning on its own, but combining words into groups is often more helpful for understanding the meaning.

An n-gram is basically a sequence of words that occur together within a given window, so when

  • n=1 it is a unigram
  • n=2 it is a bigram
  • n=3 it is a trigram, and so on

Now suppose a machine tries to understand the meaning of the sentence “Enter your email address”; it will split the sentence into specific chunks:

  1. Considering words one by one (unigrams), each word is a gram:
    “Enter”, “your”, “email”, “address”
  2. Considering two adjacent words at a time (bigrams):
    “Enter your”, “your email”, “email address”

In this way, a machine splits sentences into small groups of words to understand their meaning.
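As an illustration, scikit-learn’s CountVectorizer can show exactly which unigrams and bigrams it would extract from that sentence (a small standalone sketch; get_feature_names_out requires scikit-learn ≥ 1.0):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) extracts both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(['Enter your email address'])
print(vectorizer.get_feature_names_out())
# ['address' 'email' 'email address' 'enter' 'enter your' 'your' 'your email']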

Playing with Models

Now that we have a numerical feature matrix, preprocessed and ready, it’s time to try some of the most popular multi-class classification machine learning algorithms.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(random_state=0),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]

We first prepared a list of model instances, then ran 5-fold cross-validation to measure each model’s accuracy on our features and labels.

from sklearn.model_selection import cross_val_score

CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

We see that LogisticRegression has the best median accuracy: ~88%.
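The medians shown below were computed by grouping the cross-validation results (a one-line sketch):

# Median accuracy per model across the 5 folds
print(cv_df.groupby('model_name').accuracy.median())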

model_name
LinearSVC 0.876553
LogisticRegression 0.883989
MultinomialNB 0.857310
RandomForestClassifier 0.839944
Name: accuracy, dtype: float64

We then continue with our best model:

from sklearn.model_selection import train_test_split

model = LogisticRegression()
# We also split the data frame indices, so we can look up misclassified rows later
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

We split our features and labels using scikit-learn’s train_test_split function, then called fit and predict on the resulting sets.

Confusion Matrix

A confusion matrix is a tabular way of visualizing the performance of your prediction model. Each entry in the matrix denotes the number of predictions the model made for a given pair of actual and predicted classes. Unlike binary classification, there are no positive or negative classes here. At first, it might seem difficult to find TP, TN, FP, and FN without positive or negative classes, but it’s actually straightforward: we find TP, TN, FP, and FN for each individual class.

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test, y_pred)
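A heatmap makes the matrix much easier to read (a sketch assuming seaborn is installed):

import seaborn as sns
import matplotlib.pyplot as plt

# Rows are actual labels, columns are predicted labels
fig, ax = plt.subplots(figsize=(6, 6))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=category_id_df.category.values,
            yticklabels=category_id_df.category.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()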

The vast majority of the predictions end up on the diagonal (predicted label = actual label), where we want them to be. However, there are a number of misclassifications:

'phone' predicted as 'email': 25 examples.
'name' predicted as 'email': 14 examples.
'email' predicted as 'phone': 12 examples.

These numbers are within a reasonable range. We wrote the snippet below to check which rows caused these misclassifications:

for predicted in category_id_df.category_id:
    for actual in category_id_df.category_id:
        if predicted != actual and conf_mat[actual, predicted] >= 10:
            display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][['category', 'field_title']])
            print('')

90% of them were outliers, so these misclassifications can be waived: they are either outliers we knew our data could contain but couldn’t find manually, or outliers whose value count is 2 or more (raising the minimum-count threshold decreased the accuracy, so we avoided it).

We then printed the classification report for each class to see how well our model works beyond the accuracy score (which was ~89%):

from sklearn import metrics

print(metrics.classification_report(y_test, y_pred, target_names=df['category'].unique()))

  • Macro F1: calculates the metric for each class individually and then takes the unweighted mean of the measures.
  • Weighted F1: unlike macro F1, it takes a weighted mean of the measures; the weight of each class is its total number of samples.
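Both averages can also be computed directly with scikit-learn (a small sketch):

from sklearn.metrics import f1_score

# Unweighted mean of the per-class F1 scores
print('Macro F1:   ', f1_score(y_test, y_pred, average='macro'))
# Mean of the per-class F1 scores, weighted by class support
print('Weighted F1:', f1_score(y_test, y_pred, average='weighted'))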

Time to Predict!

import time

start = time.process_time()
key = model.predict(tfidf.transform(['enter your email:']))
print(id_to_category.get(key[0]))
print('Predict runtime:', time.process_time() - start)

'email'
Predict runtime: 0.0062239999999960105

Good!😎

Deployment

After the initial steps of creating our model were done, it was time to ask our back-end developers how they wanted to use the model. After a short talk, we agreed on implementing two functions in the service, as below:

  1. A function that takes a string and decides which category it belongs to.
  2. A function that takes a string and a category and returns the comparison result as True or False.

So the first step was to export our model. We used Joblib for this purpose:

import joblib

joblib.dump(model, 'field_type_identifier.pkl', compress=9)
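Note that the fitted TfidfVectorizer is needed at prediction time as well, so it is worth persisting it the same way (a suggestion; the functions below assume tfidf is available in scope):

# The fitted vectorizer must be reused at prediction time;
# persisting it alongside the model avoids refitting it on every deployment
joblib.dump(tfidf, 'tfidf_vectorizer.pkl', compress=9)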

We then wrote the functions that we and the back-end team had agreed on:

# 1
def detectWithoutType(field_title):
    model_clone = joblib.load('field_type_identifier.pkl')
    key = model_clone.predict(tfidf.transform([field_title]))
    id_to_category = {0: 'email', 1: 'phone', 2: 'name'}
    return id_to_category.get(key[0])

# 2
def detectWithType(field_title, field_type):
    model_clone = joblib.load('field_type_identifier.pkl')
    key = model_clone.predict(tfidf.transform([field_title]))
    id_to_category = {0: 'email', 1: 'phone', 2: 'name'}
    return id_to_category.get(key[0]) == field_type

Testing the functions:

detectWithoutType('enter your email:')
'email'

and

detectWithType('enter your email:', 'email')
True

Thank you for following this article! We look forward to hearing any feedback or questions from you.
