Can we determine whether a user inputs a real or a fake email address by analysing their typing behaviour?

Vlad
Published in kappa.london
May 14, 2020

Living the start-up life definitely has its perks. And we surely live it to the fullest here at Kappa. It’s the freedom, the people, the very dynamic workflow peppered with really exciting achievements.

Yeah, we’re working from dawn ’til dusk and yes, we dive into every challenge head-first, but hey! That’s the path to accomplishment, to self-fulfillment. At least for me, that is.

Someone around here always has a quirky idea for an app or a tool, and we encourage our mates to blare out their crazy creativity. One of these moments happened a few weeks ago, and I was instantly hooked. Klauss came up with an overnight idea for an email validator: “Let’s make a tool that detects whether a person inputs their real email or just gibberish on a website.” It sounded interesting and really fun to think through. Any such experiment really gets me going.

Since I’m a practical guy, I was swamped with questions about relevance, efficiency and scale. So I had to take a step back and focus on the problem at hand: how on Earth am I going to put this together? That dilemma crushed my practical uncertainties and plunged me into an ocean of curiosity and excitement. The right kind of ‘pond’ to find myself in at that point.

Now let’s get down to the technical stuff. Non-engineers — look away! Just kidding, I’ll make this understandable even to the non-initiated.

I wanted to build this validator solely by taking into consideration the user’s typing data and behavior, without resorting to other methods such as text analysis based on dictionaries. Hence, it was clear I first had to build a little tool that could record my typing of email addresses, where I could also label them as valid or fake. This could help me generate a small initial training and testing set.

The fastest way to set such a tool up was with HTML and JavaScript. At this point, I began thinking about exactly what data it should record and how that could help me extract relevant features. The most important thing, obviously, was to record every action the user performs on the email input field. I started by attaching three event listeners to the input field:

1. The user focuses on the input field (the “focus” event)
2. The user presses down on a key (the “keydown” event)
3. The user releases a key (the “keyup” event)

For each of these events, I simply recorded the time when it happened (down to the millisecond) and the key that was pressed (for the “keydown” and “keyup” events). This can be taken much further, but I figured it should be enough for our little experiment.
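To make that concrete, here is a minimal sketch (in Python, purely for illustration) of what one recorded session might look like once it reaches the classifier side. The field names are my own assumptions, not the recorder’s exact format:

    # A hypothetical shape for one recorded session coming out of the recorder:
    # a list of timestamped events plus the label I assigned by hand.
    recorded_session = {
        "email": "john.doe@gmail.com",   # what was typed (kept for labelling only)
        "valid": 1,                      # 1 = real address, 0 = gibberish
        "events": [
            {"type": "focus",   "time": 1589450000123},
            {"type": "keydown", "time": 1589450000911, "key": "j"},
            {"type": "keyup",   "time": 1589450000987, "key": "j"},
            {"type": "keydown", "time": 1589450001102, "key": "o"},
            # ... one entry per focus / keydown / keyup event
        ],
        # ...plus the precomputed numeric features mentioned further below
    }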

From this data I could extract the following features:

- The time the user took to type his first character from the moment he focused on the input
- The overall typing speed
- The number of deletions the user made
- The distribution of time taken between all sequentially pressed keys
- The distribution of distances on a QWERTY keyboard between all sequentially pressed keys (obviously, not all users use this keyboard layout, but again, it should suffice for our experiment)
- The distribution of sums of distances on a QWERTY keyboard between tuples of 2, 3, 4 and 5 sequential keys pressed
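To give a feel for how a few of these might be computed, here is a rough sketch based on the recorded events shown earlier. The QWERTY grid coordinates and the helper names are my own simplifications, not the exact code from the tool:

    import numpy as np

    # Approximate row/column positions of keys on a QWERTY layout. Assumption:
    # only lowercase letters are handled; the real tool would cover more keys.
    QWERTY = {k: (0, i) for i, k in enumerate("qwertyuiop")}
    QWERTY.update({k: (1, i) for i, k in enumerate("asdfghjkl")})
    QWERTY.update({k: (2, i) for i, k in enumerate("zxcvbnm")})

    def key_distance(a, b):
        """Euclidean distance between two keys on the QWERTY grid (0 if unknown)."""
        if a not in QWERTY or b not in QWERTY:
            return 0.0
        (r1, c1), (r2, c2) = QWERTY[a], QWERTY[b]
        return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

    def raw_typing_features(session):
        """Turn one recorded session into the raw ingredients for the features above."""
        events = session["events"]
        focus_time = next(e["time"] for e in events if e["type"] == "focus")
        downs = [e for e in events if e["type"] == "keydown"]

        keys = [e["key"].lower() for e in downs]
        times = [e["time"] for e in downs]

        first_key_delay = times[0] - focus_time                  # ms until first key
        total_seconds = max(times[-1] - times[0], 1) / 1000.0
        typing_speed = len(downs) / total_seconds                # keys per second
        deletions = sum(1 for k in keys if k in ("backspace", "delete"))

        inter_key_times = np.diff(times)                         # ms between keydowns
        inter_key_dists = [key_distance(a, b) for a, b in zip(keys, keys[1:])]
        return first_key_delay, typing_speed, deletions, inter_key_times, inter_key_dists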

The features above are, of course, just my best guesses on what could potentially be useful for the classifier to reliably predict whether the email address is real or fake. Generally, I would expect to see a higher typing speed, few to no deletions and smaller distances between keys or tuples of keys for obviously fake email addresses like “asasflakflask@gmail.com”.

At this point, I realised that, based on these features alone, there might be a lot of misfires, as there are many valid email addresses that can be typed fast and whose characters are close to one another on the keyboard. Hell, even “asfasfas@gmail.com” could be a valid email address, but the point is: would you want to talk to that person? Probably not. Anyhow, once again I tried to leave the practical side out and continue with the experiment to see what results I would get.

I quickly finished the recorder tool and I started inputting some email addresses in order to populate my dataset. It looked like this:

I recorded around 200 email addresses and decided it was enough for the time being. Another thought I had at this stage was that the training set was going to be heavily biased towards my own typing behaviour and the way I approached the recording process when deciding to input valid versus fake email addresses. It was starting to look like it wouldn’t work at all, but I figured that maybe it wouldn’t be a problem if we had a big and diverse enough training set, gathered from a lot of different people. Eager to see it done and play with the prediction algorithm, I decided to move on to the next stage.

As I mentioned above, I’m not particularly knowledgeable in AI or machine learning. I knew that a classifier of this sort could easily be built at a high level with Python, using the scikit-learn module. My environment had everything installed, except scikit-learn. Once I set that up, I was good to go.
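If you want to follow along, getting that last dependency is a one-liner (joblib, used later to save the classifier, comes along with it):

    pip install scikit-learn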

The first step was to retrieve all of the data I recorded in the JavaScript tool, in order to feed it to the Python script. I simply exported it to JSON and placed 70% of the entries in the file “data_train.json”, which would be used to train the classifier. I placed the rest in “data_test.json”, which would be used for testing. I decided to build two Python scripts: train.py and predict.py. Everything could have been contained within one script, but I wanted to keep it clean, save the classifier to a file from train.py and then load it in predict.py to check that this would work as well. That’s because I knew the next step would be to put it online somewhere for my colleagues to play with.
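The split itself could look something like this sketch. The source file name and the shuffling are my own assumptions; only the 70/30 ratio and the two output file names come from what I actually did:

    import json
    import random

    # Load everything exported from the recorder tool (file name is assumed)
    with open("recordings.json") as f:
        sessions = json.load(f)

    random.shuffle(sessions)                  # avoid any ordering bias
    split = int(len(sessions) * 0.7)          # 70% train / 30% test

    with open("data_train.json", "w") as f:
        json.dump(sessions[:split], f)
    with open("data_test.json", "w") as f:
        json.dump(sessions[split:], f)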

For training, I loaded the JSON data first.
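In train.py, that step is nothing more than something like this:

    import json

    # Read the training entries exported earlier
    with open("data_train.json") as f:
        train_entries = json.load(f)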

Then, for each entry, after validating it, I had to create and add its features to the dataset. I started with the features whose values were just plain numbers. I had already generated these in the JavaScript tool, so it was just a matter of adding them.
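A sketch of that step, assuming each JSON entry already carries the numeric values under the names I use below (the is_valid_entry check is a hypothetical placeholder for the validation mentioned above):

    dataset = []   # one feature vector per recorded email address
    target = []    # 1 = valid, 0 = fake

    for entry in train_entries:
        if not is_valid_entry(entry):        # hypothetical sanity check
            continue
        features = [
            entry["first_key_delay"],        # ms from focus to first keydown
            entry["typing_speed"],           # keys per second
            entry["deletions"],              # number of backspace/delete presses
        ]
        dataset.append(features)
        target.append(entry["valid"])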

After that, I had to deal with the other features, which represented certain distributions of times and distances. The underlying data that I saved in the JSON file consisted of the actual lists of values from which I would have to compute those distributions. For example, for the distribution of time taken between all sequentially pressed keys, I had the list of values representing all those times. I realised I couldn’t just add the arrays as features, so I began thinking about which properties would best describe the distribution behind each list of values. I figured that the minimum, the maximum, the average and the standard deviation would paint a clear enough picture of such a distribution. These 4 values would become the actual features for each of the distributions.
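For each list, those four summary values could be computed along these lines (a sketch; the entry field names in the comments are assumptions):

    import numpy as np

    def summarize(values):
        """Collapse a list of values into the 4 numbers describing its distribution."""
        values = np.asarray(values, dtype=float)
        if values.size == 0:
            return [0.0, 0.0, 0.0, 0.0]
        return [values.min(), values.max(), values.mean(), values.std()]

    # Inside the loop above, each distribution then contributes 4 more columns:
    #   features.extend(summarize(entry["inter_key_times"]))
    #   features.extend(summarize(entry["inter_key_distances"]))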

The next step was to create the classifier. After researching a bit and playing around with different types of classifiers and kernels, I chose to use an SVM classifier with a linear kernel. The last steps were to train the classifier and save it to a file. The target variable below is just a list of 1s and 0s that represent the label (valid or fake) for the email address at the same index in the dataset.
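The training boils down to a handful of standard scikit-learn calls; a sketch, with the file name matching the one predict.py loads later:

    from sklearn import svm
    from joblib import dump

    # dataset and target were built in the loop above
    classifier = svm.SVC(kernel="linear")
    classifier.fit(dataset, target)

    # Persist the trained model so predict.py can load it later
    dump(classifier, "email_classifier.joblib")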

Moving on to the prediction side, the predict.py file looks almost the same, with a few differences. First, it loads the “data_test.json” file, instead of “data_train.json”. Then, it loads the classifier from the “email_classifier.joblib” file, instead of creating it. The last step is to generate a prediction on the test data and print the results.
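Put together, predict.py ends up looking roughly like this sketch. The build_features helper stands in for the same feature-building code as in train.py, and printing the metrics via classification_report is my assumption about how the results were produced:

    import json
    from joblib import load
    from sklearn.metrics import classification_report

    with open("data_test.json") as f:
        test_entries = json.load(f)

    # Same feature extraction as in train.py (hypothetical shared helper)
    test_dataset, test_target = build_features(test_entries)

    classifier = load("email_classifier.joblib")
    predictions = classifier.predict(test_dataset)

    print(classification_report(test_target, predictions, target_names=["fake", "valid"]))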

Finally, I ran the scripts for the first time and got the results.

I was stunned to see these results. It couldn’t be. I added some more test entries and the stats didn’t change. As I said before, I figured the classifier was just too biased because of my typing style, so I asked my friends to input some email addresses into my recorder tool, in order to feed them to the prediction script. They inputted another hundred addresses and then I ran the scripts again. These were the results:

As I imagined, the accuracy dropped significantly, but it was still a good sign. Arguably, the most important aspect was that the Recall was still high. In this context, the Recall tells us that very few valid email addresses were being marked as fake by the classifier. This is essential: letting the occasional fake email address slip through your website’s validation is not a big deal (that already happens 100% of the time without this kind of check), but you would never want to block a real person by mistake.
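To be precise, with ‘valid’ treated as the positive class, Recall is just the share of genuinely valid addresses that the classifier also labels as valid; in scikit-learn terms:

    from sklearn.metrics import recall_score

    # recall = true positives / (true positives + false negatives), i.e. the
    # fraction of valid addresses that were not wrongly flagged as fake
    recall = recall_score(test_target, predictions, pos_label=1)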

The results above were based on the classifier that was trained only by my typing behaviour. I wanted to see what would happen if I trained it with some of my friends’ email addresses as well. Thus, I asked them to input a few more and added some to the training set and some to the test set as well. These were the results:

This was a huge improvement. Now it had ~94% accuracy, and most importantly 100% Recall. This means that not even a single valid email was marked as fake by the classifier.

After a bit more testing and playing around with it, I uploaded the recorder tool and the classifier file on a server. I adapted the files to create a simple app that checks whether you input a valid or fake email address and passed it to other people to play with it. Here’s how it looked.

After observing what the users did, I noticed that the 94% accuracy wasn’t holding up. It misfired too often. I imagined this was the same problem as before: the classifier was trained on just a few specific typing behaviours. Maybe with a big enough dataset, gathered from a large and diverse group of people, the classifier would be able to learn enough to offer reliable results for most people and their typing patterns.

At the moment, in terms of its practicality, we don’t think this type of email address validation can be reliable enough to be used in the wild, even if these first results look promising. We will continue to research and experiment with this.

Some next steps for this experiment would be to dig deeper into different types of classifiers and kernels to understand which ones work best in this context, and to research better, more relevant ways of generating features.
