Malicious Content Identification

A Hackathon Success!

Xandr Engineering
Xandr-Tech
Aug 10, 2018 · 18 min read


This post is written by Eliot Huh, Keith Pallo, Chenny Ren, and Hartek Sabharwal. Each was a member of Team Early Bird — the submission that won the AppNexus Challenge at the 2018 New York DeveloperWeek Hackathon. DeveloperWeek Hackathons represent the nation’s largest challenge-driven hackathon series and are hosted several times throughout the year around the country. The 2018 NY Hackathon hosted over 35 different teams with 150+ hackers. In addition to winning the AppNexus challenge, Team Early Bird also placed in the Top 5 Overall Teams at the Hackathon.

Forming the Team

If you have been to any widely “open” hackathon before, you know that the first hour at these events is one of the most interesting sights to behold. Given that directions and challenges are usually posted ahead of time, the environment created by the diversity of participants is truly amazing. You are likely to find a group of individuals who just met standing adjacent to a dedicated team planted at a workstation. The members of the first group may be trying to determine whether an idea could work, while the others are ceremoniously licking their chops as they begin to implement a project that has been in ideation for days.

At 9:00 a.m. on Saturday, June 19th, the scene at the Brooklyn Expo Center was no different. This was New York’s much anticipated DeveloperWeek Hackathon, and the chaos created by hundreds of hackers trying to either find a project or feverishly start making progress was coming to a peak. This particular hackathon offered several different challenges that hackers could enter, which is not always the case, so conversations focused on which challenge each of us wanted to tackle. From these conversations our team quickly emerged — we were all extremely interested in working on the Malicious Content challenge presented by AppNexus.

The prompt was relatively straightforward, but by no means simple: given a set of malicious creatives (image advertisements with dubious intentions), find a way to determine whether any newly presented advertisement is malicious or not. With two CS undergrads entering their second year at university, one member entering a Master’s Program in Cybersecurity, and another entering a Master’s Program in Artificial Intelligence, we were all very excited for a variety of reasons. Not only was there a clearly defined and objectively measurable goal, but the latitude of approaches we could take was huge! The challenge also sat at the intersection of machine learning and cybersecurity, and determining “bad content” is one of the most pervasive problems facing several leading technology companies today. We were also proud to be working with an innovative and invested partner: while researching the challenge we found several blog posts and white papers published by AppNexus on this topic, including one specifically on fraudulent ads. Brought together by our enthusiasm, we forged ahead.

Challenge Overview

Our first step was to examine the data. The challenge presented two distinct types of malicious content — Misleading Claims and Tabloid Advertorials. Their definitions and an example of each are as follows:

Tabloid Advertorial: Images where the content of the creative does not match that of the landing page. In other words, it would be used to trick the user into clicking something.

Misleading Claims: The claim purported by the creative is unsubstantiated or provably false. In other words, claims that simply aren’t true. The most classic example would be Snake Oil — a magical mixture that can cure all ailments.

Given a set of about two hundred ads labeled as either Tabloid Advertorial or Misleading Claim, we began to research quantitative methods that could create representations of each advertisement. Almost immediately, it became clear that we would need to use machine learning. Any representation we could define ourselves simply wouldn’t be good enough to achieve our desired accuracy in finding novel malicious advertisements. Machine learning would alleviate this challenge — instead of having to explicitly tell our algorithm what to analyze, our program would learn what to look for by examining the dataset. The methods we researched seemed promising and readily available. However, a few questions quickly arose no matter which method we considered. First, regardless of the ML algorithm we implemented, almost all of them would need a comparable set of non-malicious ads, which we would come to call “quality advertisements”. These ads would not only help in training our system, but would also be useful in testing. Additionally, each model we researched came with several drawbacks and watch-outs. Which one would be best? Why would we use one over the other for this specific purpose?

Project Ideation

With these core questions floating in our minds, we decided to take action and find some “quality ads”, or novel advertisements that were not malicious. This seemingly easy task proved rather difficult! Almost every way we tried, it seemed that querying for “Real Ads” actually produced image results of “Fake Ads”, given the recent controversy around this area. This led us to think: what exactly is malicious content? Our team tried a traditional brainstorm, and the word cloud resulting from our conversation looked like someone had randomly picked terms from a dictionary. Phrases like “Big Text”, “Theft”, and “Large percentages” were mixed together in a seemingly unrelated jumble. Stumped, we thought hard and came up with two distinct methods for achieving our goal.

First, we noticed that querying for ads from Fortune 500 companies produced a large number of high quality advertisements. However, we did not want to introduce bias into our training data by only selecting content from companies that we knew. So we turned to a random walk of sorts: we used an online company name generator to produce a list of major corporations, then used a Google Chrome extension to download a large sample of each company’s advertisements, removing any ads that still seemed potentially malicious. We repeated this process for several companies.

Additionally, we thought about mediums that were unlikely to produce malicious content by their very nature. So we sourced images from a medium with a much smaller volume of advertisements and far higher scrutiny — print advertising! This idea came to us while reminiscing about some of our favorite shows, in particular the opening episode of the hit “Mad Men”. Throughout the episode, Don Draper is stumped trying to create a new advertising campaign for a cigarette company, a task made particularly challenging by recent research revealing the negative effects of smoking. If an advertiser such as Mr. Draper was going to think that hard about one ad, then surely ads like those would fall within our quality bucket. Using the same method as above, we sourced an additional set of images, further diversifying our “quality ads”.

Now, the next step was to partition our data into training and test sets and apply a machine learning algorithm. We thought that a concrete use case might help us determine which ML algorithm, or set of algorithms, to use. After searching for some time, we realized that we had just personally encountered a problem worth solving — creating a custom definition of malicious content! Since malicious content can mean so many different things in different contexts, we wanted to create an intuitive system that would allow any user to “tune” their own definition.

This system would have several potential use cases. One example would be an application, such as Pinterest, where users upload large numbers of images, some of which could be undesirable. Another use case would be a tool that an ad provider, someone like AppNexus, could offer to the sources that display ad content (publishers). These sources could then tune their own definition of malicious content to their individual business needs.

For example, an independent news outlet may want to rely on an entirely different definition of malicious content than a retail clothing startup simply due to the vastly different audiences that each is trying to serve.

In order to create this tool, our approach would be to combine two different ML systems to detect malicious content. First, we would train two distinct ML algorithms, each independently determining whether new content was quality or malicious. Then we would run both algorithms on each new image and combine the results. The system would fail content if either algorithm marked an image as malicious — effectively creating the most conservative content administrator based on both sets of results (set-theoretically, the union of the two flagged sets). This would be the default definition of malicious content. From here, a user would be able to select which images they wanted to allow from the set of advertisements that had failed on only one of our ML algorithms (the symmetric difference of the two sets). Finally, both ML algorithms would independently retrain themselves with the new user input, creating a definition of undesirable content tailored to that specific user’s choices.

Hence, our system would intuitively figure out what advertisements the end user would want to catch — and it would do so extremely quickly.

Step 0: Create two distinct ML algorithms to determine malicious content. Then, test new creatives using both algorithms independently and combine them into the most conservative result

Step 1: Allow users to decide which photos are not malicious among those that failed on only one ML system

Step 2: Retrain original ML systems based on user input
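
To make the loop concrete, here is a minimal sketch of the idea in Python. The `text_clf` and `image_clf` objects, their `predict`/`retrain` methods, and the `label` attribute are all illustrative placeholders standing in for our two trained systems, not actual code from the project.

```python
# Minimal sketch of the loop described above. `text_clf`, `image_clf`, their
# predict()/retrain() methods, and the `ad.label` attribute are illustrative
# placeholders for our two trained systems, not actual project code.

def is_malicious(ad, text_clf, image_clf):
    """Step 0 default: flag the ad if either system flags it (the union)."""
    return text_clf.predict(ad) == "malicious" or image_clf.predict(ad) == "malicious"

def review_queue(ads, text_clf, image_clf):
    """Step 1: ads flagged by exactly one system (the symmetric difference)
    are sent to the user for a keep/block decision."""
    queue = []
    for ad in ads:
        flags = (text_clf.predict(ad) == "malicious",
                 image_clf.predict(ad) == "malicious")
        if sum(flags) == 1:
            queue.append(ad)
    return queue

def apply_feedback(ads, user_allowed, text_clf, image_clf):
    """Step 2: relabel user-approved ads as quality and retrain both systems."""
    for ad in ads:
        if ad in user_allowed:
            ad.label = "quality"
    text_clf.retrain(ads)
    image_clf.retrain(ads)
```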

We called ourselves Team Early Bird — we wanted not only to help those using our tool catch worms, which in our case were unwanted advertisements, but to help users catch the RIGHT worms based on their own individual goals.

With this in mind, we turned our focus to creating the ML algorithms that would independently determine malicious content. After weighing the different pros and cons, we decided on a text-based system and a purely image-based classifier to create a strong mix of results. The thinking was that our intuition around a text-based system was relatively strong — the ads were likely to include some “look here” words. However, we did not have a good sense of what an image classifier would actually use to distinguish these ads. Obviously the system would develop some form of pattern detection, but we scratched our heads when we really thought hard about it. We hoped that combining these two vastly different methods would lead us to some interesting results — so we pressed onward.

Fundamental System 1 — Text Analysis

As humans, our first impression of malicious intent generally came from the generic, spammy text in the ads, but we were not sure how this would come into play in the analysis. First, we extracted the text from the images using Tesseract in Python, one of the best open source optical character recognition (OCR) engines available. On its own, however, it failed to read the ads accurately enough to generate worthwhile data. The strings were not only unreliable, but many images did not produce a single fully formed word. Huddled around one laptop, we struggled to see how any analysis could use sparse phrases such as “Look,,, “ and “ st0Ck going upwrd,” almost reminiscent of early 2000’s texting. To fix this, we programmatically resized and enhanced the images before processing. Along with some additional text cleaning, such as only allowing ASCII characters, our output improved substantially.
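
As an illustration of the kind of preprocessing we mean, here is a minimal sketch using pytesseract and Pillow. The exact resize factor, contrast setting, and filter are assumptions for the example rather than our precise pipeline.

```python
import re
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

def extract_text(path):
    """OCR an ad image after upscaling and enhancing it."""
    img = Image.open(path).convert("L")  # grayscale
    img = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)  # upscale
    img = ImageEnhance.Contrast(img).enhance(2.0)  # boost contrast
    img = img.filter(ImageFilter.SHARPEN)
    text = pytesseract.image_to_string(img)
    text = text.encode("ascii", errors="ignore").decode()  # ASCII-only cleanup
    return re.sub(r"\s+", " ", text).strip()
```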

Our next move was to extract useful features from the text using natural language processing (NLP), a rapidly developing field in CS which allows computers to understand, on some level, human language. One technique in NLP is sentiment analysis, which takes a sentence and outputs whether it is mostly positive, negative, or neutral. Generally, such a program uses a database of positive and negative words and phrases, while also keeping track of sentence structure and negations. A staggeringly effective implementation is included in the Stanford CoreNLP library, and it is the one deployed in our model. Our hypothesis was that malicious ads would use stronger, more forceful language to compel users to click, while non-malicious ads would use more neutral language to better deliver a message and avoid negative language to preserve positive associations with the brand.
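
Our model used the sentiment annotator from Stanford CoreNLP. As a rough, self-contained illustration of the same idea in Python, the sketch below swaps in NLTK's VADER analyzer instead; the neutral band threshold is an assumption made for the example.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def sentiment_bucket(ad_text, neutral_band=0.2):
    """Bucket an ad's OCR'd text as positive, negative, or neutral."""
    score = sia.polarity_scores(ad_text)["compound"]  # compound score in [-1, 1]
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

# e.g. sentiment_bucket("One simple trick doctors don't want you to know!")
```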

Having spent hours setting up our development environments, fiddling with data, and reading blog posts and research articles, we were all forced to go home. Standing on the R train platform, we overheard two separate teams mention Tesseract and NLP; clearly, the best team was going to be the one with the smartest implementation and deepest optimization, not the one that threw around the most buzzwords and library names. Our subterranean reconnaissance further confirmed that we would all be working late into the night.

The pure sentiment analysis turned out to be fairly effective. The data showed that 62% of non-malicious ads were neutral in sentiment, compared to just 22% of tabloid and misleading ads. We were afraid that the program would get the sentiments wrong since the core methodology of NLP is far from perfect and often misses the nuances of real human language, like sarcasm. However, the claims in our set of malicious ads were simple enough that this did not happen. An ML model using the polarity of sentiment to classify ads as either malicious or non-malicious achieved about 65% accuracy. However, sentiment analysis was less effective in differentiating the two types of malicious ads, an additional goal of the challenge, so finer-grained features were needed.

Looking at the results, we asked ourselves: if we were the computer, what else would we want? All of the analysis we had done reduced each ad to one basic number, while any robust ML model would need lots of data and lots of columns. We needed metrics that were just as substantive but with more granularity. After reviewing the diverse set of coursework accumulated between us, we landed on the following: “How about some basic statistics?” Beyond sentiment, it would likely be useful to know things like the most common words and how often they appeared compared to their peers in different ad types. This led us to a simple n-gram analysis. First, we calculated the most common words present in both types of malicious ads. This required tokenizing the text and removing stop words (words such as “the” or “is” that are common in advertisements of all kinds but carry virtually no meaning on their own). We did this with NLTK, another NLP library, an approach we had learned about from an online Kaggle competition dealing with a similar task. We then repeated the process for tabloid and misleading ads separately.
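
A minimal sketch of this step in Python using NLTK; the helper names, the top-20 cutoff, and the `tabloid_texts`/`misleading_texts` lists are our own illustrative choices.

```python
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def content_words(text):
    """Lowercase, tokenize, and drop stop words and non-alphabetic tokens."""
    return [w for w in word_tokenize(text.lower()) if w.isalpha() and w not in STOP]

def top_words(ad_texts, n=20):
    """Most common content words across a list of OCR'd ad texts (one class)."""
    counts = Counter()
    for text in ad_texts:
        counts.update(content_words(text))
    return counts.most_common(n)

# e.g. compare top_words(tabloid_texts) with top_words(misleading_texts)
```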

This was an effective strategy because although both types of ads were malicious, the language each used to draw in users was distinct. Tabloid ads commonly used words like “rumors,” “reason,” and “confirms”, while misleading ads used words like “trick,” “free,” and “millions.” For each ad, we also computed two scores based on the frequency of keywords in tabloid and misleading ads, with more weight given to more popular words. An ML clustering model using sentiments and these scores achieved about 80% accuracy in differentiating all three types of ads.
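
The keyword scores can be sketched roughly as follows; the top-50 cutoff and frequency-proportional weights are assumptions for illustration rather than our exact weighting, and the class text lists are placeholders.

```python
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def _tokens(text):
    return [w for w in word_tokenize(text.lower()) if w.isalpha() and w not in STOP]

def keyword_weights(class_texts, n=50):
    """Frequency-proportional weights for the n most common words in one ad class."""
    counts = Counter()
    for text in class_texts:
        counts.update(_tokens(text))
    top = counts.most_common(n)
    total = sum(c for _, c in top)
    return {w: c / total for w, c in top}

def keyword_score(ad_text, weights):
    """Score a single ad by the weighted class keywords it contains."""
    return sum(weights.get(w, 0.0) for w in _tokens(ad_text))

# tabloid_score    = keyword_score(text, keyword_weights(tabloid_texts))
# misleading_score = keyword_score(text, keyword_weights(misleading_texts))
```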

Finally, we repeated the preceding analysis with bigrams, which is NLP nomenclature for two consecutive words. This was meaningful because certain groups of words become much more useful next to others than on their own. The most common bigrams in malicious ads included phrases like “read more,” “simple trick,” “discover how,” and “hollywood’s elite” — all clearly resembling a spammy ad.
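
And the bigram version of the same counting, again as an illustrative sketch built on NLTK's `bigrams` helper; `malicious_texts` is a placeholder list of OCR'd ad texts.

```python
from collections import Counter
import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)

def top_bigrams(ad_texts, n=15):
    """Most common consecutive word pairs across one class of ads."""
    counts = Counter()
    for text in ad_texts:
        words = [w for w in word_tokenize(text.lower()) if w.isalpha()]
        counts.update(bigrams(words))
    return counts.most_common(n)

# e.g. top_bigrams(malicious_texts) might surface pairs like ("read", "more")
```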

This brought us to our second day at the expo center, and the pace was feverish. Our only real breaks were taken during lengthy software downloads and compilation of code. Some of us attended conferences about the latest developments in the industry, like “Blockchain Framework Exonum 0.8 for Developers.” Some of us, the ones who didn’t believe in the blockchain hype, collected company merchandise and dined on AppNexus’s platter of fine cheeses. Some of us, the ones repulsed by cheese and blockchain, continued to look at code, perhaps finding an errant comma half an hour into code compilation. And Eliot won an RC helicopter.

For the final machine learning component tying it all together, we experimented with both a random forest model and a support vector machine (SVM). A random forest tends to minimize overfitting and requires relatively little hyperparameter tuning, and it showed promising cross-validation scores. However, we ultimately went with an SVM because of our small number of features (7) and small sample size (398). It gave us a staggering 90% accuracy in categorizing tabloid, misleading, and non-malicious ads, and 95% accuracy in categorizing ads as malicious or non-malicious. Below is an image representing our text classifier system.
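
In scikit-learn terms, the final classifier looks roughly like the sketch below. `X` (the seven text features per ad) and `y` (the three labels) are placeholders for our actual feature matrix, and the kernel and split settings are illustrative defaults rather than our exact configuration.

```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_text_classifier(X, y):
    """X: one row per ad with the text features described above (sentiment,
    keyword and bigram scores, ...); y: labels in {tabloid, misleading, quality}."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    print("5-fold CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())
    model.fit(X_train, y_train)
    print("Held-out accuracy:", model.score(X_test, y_test))
    return model
```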

Our code and sets of images for this fundamental system can be found on GitHub.

Fundamental System 2 — Image Classification

Given that pure image classification is a very hard thing to reason about intuitively, we were very interested to see how our results would turn out and what the ML process would really entail. We wanted to do a fairly simple image classification using a Convolutional Neural Network (CNN), a deep learning model that we could implement using Python. We learned that building a CNN from scratch would require training on several thousand different types of images to be accurate enough for production. Our team quickly realized that the combined compute power of our four laptops would not be enough! Desperately, we started looking into cloud computing services like AWS and Azure. We dove into the documentation, but it became apparent that actually getting access to the significant resources we needed was not going to be easy. Would this be impossible to do? Should we look into a different, simpler system? These were the questions on our minds, but thankfully we found a potential solution — modifying a pre-trained model! To understand this and the advantages it provides, we can examine a very simplified version of a CNN.

Essentially, a CNN borrows the concept of neurons from biology to very effectively detect patterns. This is likely no shock, as the idea has been displayed prominently in major news outlets for the past few years. You often hear that these deep learning models have many different layers. But what does that mean, and how does it relate to the brain? These layers can intuitively be thought of as different systems that each detect a very specific type of small pattern. For example, one layer may look to identify a specific feature in someone’s face, such as the eyes. In actuality, these layers detect patterns that are much smaller, such as one tiny portion of a letter, but the idea is the same — small groups of pattern detectors. These layers are then grouped together, often using simplification processes like pooling (summarizing subsets of the data to reduce computation), in order to detect something much larger and more difficult. Continuing our example, a CNN may try to detect whether an image contains a human face. This is called the final classification because it combines results from the previous groups of small pattern detectors to determine one final attribute, which may have a significant amount of variance. This is essentially what your own brain does — you look for small patterns, like hair, wrinkles, ears, and so on, and from the aggregation of all this data you are able to make a judgement call. No single feature, or set of features, defines a face, but intuitively you just sort of “know”.

Figure: Simplified CNN Example

(Source: mathworks.com)
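
For readers who prefer code to diagrams, here is a toy CNN in tf.keras along the lines of the figure above: stacks of convolutional pattern detectors with pooling, ending in a classification layer over our three ad categories. This is purely illustrative and is not the network we ultimately used; the layer sizes are arbitrary.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A toy CNN: small pattern detectors (Conv2D) with pooling, followed by a
# final classification layer over the three ad categories.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),  # tabloid / misleading / quality
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```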

The way these layers (more simply, small pattern detectors) are formed is by looking at thousands of images and tuning the system through a process known as backpropagation. This was troubling, as our newly created database contained only a fraction of that many images. So we turned toward pre-trained models and determined that TensorFlow from Google would greatly help us here.

TensorFlow is an open source library for numerical computation, specializing in machine learning applications. Its image recognition tutorial ships with a pre-trained model (Inception) for recognizing images. The model we would import had already been trained on the ImageNet Large Scale Visual Recognition Challenge dataset, a massive database. These pre-trained models are powerful: they can differentiate among 1,000 different classes, ranging from Dalmatians to dishwashers, and are very sophisticated, yet they take only a few minutes to download. We would be able to take advantage of TensorFlow by retraining just the final step of one specific model, simply telling the chosen algorithm to classify images into tabloid advertorials, misleading claims, or quality ads. This way we would get almost all of the advantages of a CNN without having to build out all of the smaller pattern detectors contained in the earlier layers!

So, we installed Tensorflow and retrained a MobileNet (a small efficient convolutional neural network) to fit our needs. First, we configured the MobileNet by setting up the input resolution and relative size of the model as a fraction of the largest possible MobileNet.

Before starting the training, we also installed a module called TensorBoard, a graphical monitoring and inspection tool included with TensorFlow that would allow us to watch the training progress. With our environment set up, a few simple commands retrained the imported model to classify images solely into our three distinct categories.
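
We followed the command-line flow from Google's TensorFlow retraining tutorial rather than writing this ourselves, but for readers who want a self-contained picture, here is a rough modern tf.keras equivalent of the idea: a frozen, reduced-width MobileNet feature extractor with a new three-class head trained on a folder of labeled ads. The `ads/` directory path, alpha value, epoch count, and other settings are illustrative assumptions, not our exact configuration.

```python
import tensorflow as tf

# Datasets from a folder with one subdirectory per class
# ("tabloid", "misleading", "quality"); the path is a placeholder.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "ads/", image_size=(224, 224), validation_split=0.2, subset="training", seed=42)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "ads/", image_size=(224, 224), validation_split=0.2, subset="validation", seed=42)

# Frozen MobileNet feature extractor (alpha=0.5 is the "relative size" knob);
# only the new classification head is trained on our three categories.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), alpha=0.5, include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNet preprocessing
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5,
          callbacks=[tf.keras.callbacks.TensorBoard(log_dir="logs")])
```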

After all the training steps were completed, we ran a final script to test evaluation accuracy on a set of images that had been kept separate from the training set. After running the script on several images, our results were relatively positive. Depending on how the images were partitioned in the training process and the particular photos being tested, we achieved between 45% and 55% overall accuracy in placing advertisements into the correct group. These lower results were likely due to the small sample of images to test on, but they still represented a massive improvement over the baseline of random guessing (~33%). Our second ML algorithm concept was proven, and thus our two core systems had been developed. Since virtually all of the development process takes place on the command line, the code to implement this system has not been included in the GitHub repo. However, this Google TensorFlow tutorial was the overall basis that we used to train our ML system.

Laying out Next Steps

Once these two systems were trained, the code freeze was approaching. Unfortunately, we did not have an opportunity to link our results together — but what was unique about our group was that we had already created the use cases. Our last step was to showcase our work: we crafted a simple PowerPoint, viewable on GitHub, that would allow us to walk the judges through our project ideation, our core systems, and an idea for a tool that could use the classification algorithms we had implemented.

Before we knew it, the first round of evaluation was upon us! Teams of judges quickly swarmed the entire hackathon, systematically selecting groups to present their work. Almost instantly our turn came — we had about two minutes to present all that we had done. Scrolling through the presentation while displaying our code side-by-side on another laptop, we confidently introduced our challenge, the specific problem we would address, and then dove deep into our progress thus far. It ended up taking around ten minutes, but we suppose the project was compelling enough to warrant quintupling the time limit. This much was confirmed when one judge came to us privately afterwards, dropping the dour poker face. Later, we presented directly to AppNexus, which was delightful, as the judge was not only engaged with our solution to a prompt that he actually worked on, but also posed some thoughtful questions. How would our system affect AppNexus’ clients? What would we do for an override if a significant portion of quality ads were being blocked?

After this experience, we returned to our station. The time had gone by in a whirlwind, and for the first time in roughly 36 hours we could take a breath of fresh air. While pondering those questions, our team received a phone call: “You have been selected as a top 5 team in the entire hackathon.” Thrilled, we immediately began to clean up our code and presentation, and before we knew it, we were on stage showcasing our work to the entire event. Thanks to serious rehearsal, we exceeded our time limit by a slightly smaller factor.

We sat back down to watch the other four teams. Since their projects were inspiringly creative and technically sophisticated, we talked about how this had been a good experience even without the win. (Admittedly, we regretted not having cooler visuals.) A random audience member said he liked us the best, which was recognition enough at that point. But astonishingly, he wasn’t the only one: a half hour after our presentation, we were also selected as the AppNexus challenge winner! Overall, the experience was not only fast-paced and exhilarating, but we could not have asked for a more invested and innovative partner than AppNexus.

With more time, we would have liked to use more NLP techniques to get a better grasp of the meaning of the text in the advertisements. Examples include combining different forms of words, using word2vec to consider the similarity of words, using a parser to analyze sentence structure, and applying more advanced NLP to identify the specific emotions crafted into an advertisement. We would also have performed a grid search to optimize the hyperparameters of the SVC, and further processed the images for better OCR text extraction (binarization, noise removal). Also, we would have liked to create a much larger database of images for both testing and training our image classifier. Finally, we would have loved to combine the results from both systems and actually test our own inputs to create a Team Early Bird version of malicious content.
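
As an example of what that grid search might look like (purely a sketch: `X` and `y` are the same placeholder feature matrix and labels from the text classifier sketch, and the parameter grid is an assumption):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def tune_svc(X, y):
    """Grid-search the usual RBF-kernel knobs over the text-feature matrix X."""
    pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
    grid = GridSearchCV(
        pipe,
        param_grid={"svc__C": [0.1, 1, 10, 100],
                    "svc__gamma": ["scale", 0.01, 0.1, 1]},
        cv=5,
    )
    grid.fit(X, y)
    return grid.best_params_, grid.best_score_
```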

Thank you AppNexus for presenting the challenge at the NYC DeveloperWeek Hackathon and for allowing us to be part of this blog!
