On Our Take-Home Data Science Assignment

An in-depth analysis of our last internship assignment about Aspect Extraction and Insight Presentation.

Fernando Tadao Ito
birdie.ai
12 min read · Jul 29, 2020


Here at Birdie, we try to arm everyone who goes through our internship hiring process with feedback and an opportunity to add real-data analyses to their portfolios. In this article, we discuss our latest take-home task and the solutions applicants proposed, focusing on the best traits of each thought process.

The Task

One of the steps in our data enrichment process is a specialized Named Entity Recognition task called Aspect Extraction: a way to pinpoint the exact object of an opinion inside a text.

This phone has a slick design, no doubt. But the battery life is short and replacement parts are expensive.

With this information, we can deduce trends in user opinions centered on specific brands, retailers, and products. These trends can be used to compose KPIs of User Experience and Satisfaction with passive data sources (not directly collected by surveys and consumer support channels).

The proposed task was, given a partial dataset of refrigerator user reviews collected between 2019 and 2020:

  • Find a way to extract those aspects;
  • Discover insights based on the aspect exploration;
  • Present those insights in a Jupyter Notebook report.

The goal of the task was not to reach a working solution: it was to analyze the thought process behind the required research. How did they explore the data, fiddle around with the main libraries for text manipulation, organize their pre-processing steps, explain their reasoning… This is more important than any model or correct answer. We are scientists; we sit on data and think about it for days to reach a good conclusion and devise a proper presentation. No amount of interview coding questions will measure your curiosity!

The Exploration

The very first thing we evaluated was how the applicants manually analyzed the data before any classification effort. This is the most important step for any prospective data scientist, and one that does not require ML knowledge or experience: domain knowledge may be repurposed from similar areas, but it can never truly be replaced.

You need to explore your data and delve into its intricacies. Check its source and structure, and posit the processing steps it went through before reaching your hands. Develop a healthy distrust of hand-me-down data… Knowing is half the battle.

Now, this dataset comes from raw data, straight out of our crawlers. It is rife with common issues in unstructured datasets: duplicate data, missing field values, faulty grammar, and crawler mistakes.

The first four reviews in the dataset: in red, duplicated reviews from the same product in different colors; in yellow, a blurb that indicates syndication; in green, textual errors.

Data Issues

How you deal with these issues depends on how you will direct your analysis. On the matter of duplication:

  • Let’s say you wish to analyze all the reviews of a single product. You will need every bit of information you can get, because there is not enough volume; duplication of reviews between different products is not an issue at all.
  • But if you wish to analyze all the reviews of a brand, retailer, or year, you will be adding a lot of redundancy. While this may not be extremely noticeable, it can lead to erroneous insights in some corner cases (e.g., one refrigerator sold in 10 different colors may have the same review duplicated across all 10 SKUs); see the deduplication sketch after this list.
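For the brand-level scenario, a minimal deduplication sketch in Pandas (the column names here are assumptions, not the dataset’s real schema):

    import pandas as pd

    df = pd.read_csv("reviews.csv")  # hypothetical file and column names

    # The same review text is often syndicated across the SKUs of one product
    # (e.g., its colors); keep one copy per (product, review text) pair.
    deduped = df.drop_duplicates(subset=["product_name", "review_body"])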

What about that promotion blurb? It’s about how you present your assumptions, and whether or not you want the words review and promotion in your analysis.

  • Either you assume we do not want those repeated words that convey no intrinsic information about the product and filter them out with a simple regex, as sketched after this list…
  • Or you can assume that we need to know when the word promotion comes up regardless of context, as it may be important internally. It’s important to explain this design decision in your code or the post-assignment interview.
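If you go with the first option, the filter can be as small as one compiled regex. A sketch, assuming the blurb’s wording (check the exact phrasing in the dataset before relying on it):

    import re

    # Assumed wording of the syndication blurb.
    PROMO_RE = re.compile(
        r"this review was collected as part of a promotion\.?",
        flags=re.IGNORECASE,
    )

    def strip_promo(text):
        """Remove the syndication blurb, keeping the rest of the review."""
        return PROMO_RE.sub("", text).strip()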

Dealing with grammar/crawler errors is pretty straightforward: most approaches to aspect extraction rely on word frequency, so errors are naturally filtered out because their frequency is extremely low. You can try to salvage them, though: an out-of-the-box stemmer, lemmatizer, or spell-checker will fix the most common errors in these reviews.
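For instance, a quick (and admittedly blunt) salvage pass could use TextBlob’s built-in spell-checker:

    from textblob import TextBlob

    def quick_fix(text):
        # Best-effort typo correction: TextBlob's checker is frequency-based
        # and will miss, or even mangle, domain-specific terms.
        return str(TextBlob(text).correct())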

Variable Selection

Our dataset has many variables: 25 fields extracted from a shard of our raw review database. Let’s review the ones we consider the most important:

  • Review Title & Review Body: These are the main sources of our textual data. We can extract aspect information from here.
  • Offer Retailer/Title/SKU, Specs: These are offer-specific pieces of information and can be used as discrete filters for specific analysis.
  • Review Score: One of the most interesting stats for a review is the absolute score the user gives to that product. This can be leveraged for a sentiment score of sorts for each aspect.
  • Post Date: With a sufficient amount of aspect information, you can create a time series on how an aspect changes throughout a select period.

The Extraction Method

Having pre-processed and explored the data, we can start extracting aspects from these reviews. But what is an aspect?

In short, an aspect is an opinionated term in a sentence that qualifies a feature of the product. It is the target of a reviewer’s hatred, love, or indifference.

I hate the screen size of my smartwatch.

I love the battery life of my smartphone!

There are other apps available for this TV.

Word Frequency for Aspect Vocabulary Creation

The simplest way of finding those aspects is a plain word frequency method: surface the most frequent words in our dataset. This often comes coupled with heavy preprocessing steps, such as stopword removal, Part-of-Speech filtering, and manual blacklisting of common terms.
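A minimal sketch of this approach with spaCy and a Counter, assuming `reviews` is an iterable of review strings:

    from collections import Counter
    import spacy

    nlp = spacy.load("en_core_web_sm")

    counter = Counter()
    for doc in nlp.pipe(reviews):
        # Keep only non-stopword, alphabetic nouns, lemmatized and lowercased.
        counter.update(
            tok.lemma_.lower()
            for tok in doc
            if tok.pos_ == "NOUN" and not tok.is_stop and tok.is_alpha
        )

    candidate_aspects = [word for word, _ in counter.most_common(25)]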

A sample of possible aspects extracted this way could be:

[‘fridge’, ‘ice’, ‘part’, ‘review’, ‘door’, ‘promotion’, ‘refrigerator’, ‘freezer’, ‘space’, ‘water’, ‘side’, ‘room’, ‘maker’, ‘drawer’, ‘size’, ‘month’, ‘get’, ‘lot’, ‘shelf’, ‘everything’, ‘itâ’, ‘food’, ‘storage’, ‘unit’, ‘time’]

These are the most frequent nouns in our dataset. Notice that the aspects review and promotion correspond to that blurb we mentioned earlier: explaining why they could be noise during our technical interview would greatly increase your standing among the other applicants!

While there are some useful aspects in this list (door, space, shelf), many of those words are just noise (everything, time, get). If you choose to stick with this approach, be sure to also delineate a possible filtering technique as an enhancement to your solution.

To further increase the value of these extracted aspects with just a few lines of code, you could integrate bigram/trigram phrase detection and find multi-word aspects: constructs such as ice maker, freezer space, side door, and shelf size are more informative than just a single word.
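One way to do this is gensim’s collocation detector. A sketch, assuming `tokenized_reviews` is a list of token lists:

    from gensim.models.phrases import Phrases, Phraser

    bigram = Phraser(Phrases(tokenized_reviews, min_count=5, threshold=10.0))
    # Run the detector again over bigrammed text to pick up trigrams.
    trigram = Phraser(Phrases(bigram[tokenized_reviews], min_count=5, threshold=10.0))

    print(trigram[bigram[["the", "ice", "maker", "stopped", "working"]]])
    # -> ['the', 'ice_maker', 'stopped', 'working'] (if frequent enough in the corpus)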

Or, instead of going for singular nouns, you could go straight to noun phrases: nouns coupled with their direct modifiers. This yields constructs such as great ice maker, loud freezer noises, and awful delivery, with the bonus of being structures robust to small changes on syntax and phrase composition. You can also run those bigram/trigram finders on these noun phrases and extract only what is deemed a multi-word expression.
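spaCy exposes noun phrases directly as noun chunks; a quick illustration (the exact chunks depend on the parser’s output):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Great ice maker, but the freezer makes loud noises.")

    print([chunk.text for chunk in doc.noun_chunks])
    # e.g. ['Great ice maker', 'the freezer', 'loud noises']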

Topic Modeling

Moving up in complexity, we can build a topic model to find relevant word groupings in our dataset. In short, a topic model creates contextually related groups of words and represents each document as a mixture of these groups.

You can assume that the words deemed part of those topics are relevant enough to be considered aspects. If you have a topic that is primarily represented by the words ice maker, ice dispenser and ice tray, chances are that those words are simultaneously statistically relevant to this dataset and have some kind of semantic relationship.
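A minimal topic-model sketch using scikit-learn’s LDA implementation, again assuming `reviews` is an iterable of review strings:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    vec = CountVectorizer(stop_words="english", min_df=5, max_df=0.5)
    X = vec.fit_transform(reviews)

    lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X)

    # Print the eight highest-weighted words of each topic.
    terms = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top_words = [terms[j] for j in topic.argsort()[-8:][::-1]]
        print(f"topic {i}: {', '.join(top_words)}")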

The thing is, most topic models will also jumble important keywords together with noise. This is a pretty good start for an aspect vocabulary generator, but it should be coupled with a filtering strategy or a heavy pre-processing routine.

Custom Named Entity Recognition

If you had time, you could invest it in a customized Named Entity Recognizer: create a manually tagged corpus of sentences with their corresponding aspects and train a NER model on it. Spacy has a great framework for training customized entity recognizers using word embeddings.
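A bare-bones training loop, assuming spaCy 2.x (current when this was written; the v3 API differs) and a made-up ASPECT label:

    import random
    import spacy

    # A real corpus needs hundreds of examples; spans are (start, end, label).
    TRAIN_DATA = [
        ("The ice maker broke after a month.",
         {"entities": [(4, 13, "ASPECT")]}),
    ]

    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("ASPECT")

    optimizer = nlp.begin_training()
    for epoch in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.3)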

But this takes time. These models usually need a good amount of tagged data to be generic enough to capture unseen entities. Just explaining your reasoning would be enough to demonstrate this possible solution and show off your research. If you could source a pre-trained model for this specific domain of refrigerator reviews or even a generic aspect extractor, even better!

Semantic Rules

Armed with knowledge about English, you can create rules that identify combinations of Part-of-Speech tags and dependency relations between tokens in a sentence. This is a pretty standard method, used in most papers about aspect extraction. It’s simple, fast, and effective.

Dependency parse of a sentence like “This phone has a long battery life.” Source: displaCy demo

In the example above, we can identify some prospective rules (a rough implementation follows the list):

  • Composite nouns (battery life) are usually aspects;
  • Nouns that are the subjects or direct objects of a phrase (phone, [battery] life) are good candidates;
  • Adjectival modifiers can be joined with their affected nouns (long [battery] life) to qualify an aspect.
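A rough implementation of these rules on top of spaCy’s dependency parser (the rules here are illustrative, not exhaustive):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_aspects(text):
        """Nouns acting as subject or direct object, joined with their
        compound and adjectival modifiers."""
        aspects = []
        for tok in nlp(text):
            if tok.pos_ == "NOUN" and tok.dep_ in ("nsubj", "dobj"):
                mods = [c.text for c in tok.lefts if c.dep_ in ("compound", "amod")]
                aspects.append(" ".join(mods + [tok.text]))
        return aspects

    print(extract_aspects("This phone has a long battery life."))
    # e.g. ['phone', 'long battery life']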

But there is a catch: this requires some experience in NLP and prior knowledge of syntactic and semantic analysis. It’s off the beaten path for most prospective data scientists unless they have specifically researched these kinds of methods. It also takes a while to tune the rules to filter out noise if you want to build them from the ground up.

Heavyweight Models

This is where the out-of-the-box library implementations end and the academic papers begin. There are plenty of Sequence Networks that tackle the task of contextual token classification, CRF models that automatically create rules from a sequence of feature-rich inputs, deep sentence representations such as BERT that can be combined with any ML method…

It’s a plus if you go after one of those techniques in a short timeframe, of course. But it’s always a big gamble to pursue these kinds of complex modeling strategies: make sure you timebox your efforts appropriately and always have a fallback method up your sleeve.

The Presentation

Half the flavor in a meal is in its presentation. It doesn’t matter how delicious it is: if it’s slapped together haphazardly, your mind will dull the senses with distrust. Even the pickiest eaters can be persuaded with a nice plate and tasteful cuts and slices.

Think about how to make people digest the most intractable math with gusto and you will win over any interviewer.

We asked applicants to present their work in a Jupyter Notebook. It is one of the best ways to present Python code: it allows the use of many graphical libraries and makes it easy to generate HTML pages with the results.

Now, when we analyze the tastefulness of an application, we go over three main points:

  • The code structure: documentation, variable nomenclature, syntactic sugar;
  • The thought process explanation: hypotheses to test, why a pre-processing method is being used, where are the flaws in your modeling and possible corrections;
  • The insight presentation: what can you infer from your results, why are these insights interesting, how to prove your assumptions.

Code Structure

There is no better indicator of Python familiarity than the coding idiosyncrasies in your Notebook. We like the Google Style Guide and the Google Docstring format, but any convention is fine as long as it is consistent throughout your code. Your two biggest allies are a consistent style and well-documented functions.

We don’t require superstar-level proficiency with Python’s functionalities (list comprehensions, lambdas, decorators). We require only readability and a modicum of comments to explain the gist of any convoluted function.

We are also especially fond of proficiency with some popular Python modules: NLTK, Spacy, Scikit-learn, and Pandas are some big names that we invariably use in our daily work.

Example of applying a Lambda function to extract polarity and subjectivity from a text snippet.
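The snippet itself was shared as an image; a reconstruction of the general pattern, assuming TextBlob and a hypothetical review_body column, would be:

    from textblob import TextBlob

    # `df`: the reviews DataFrame. Each lambda re-parses the text,
    # so every review is processed twice.
    df["polarity"] = df["review_body"].apply(
        lambda t: TextBlob(t).sentiment.polarity)
    df["subjectivity"] = df["review_body"].apply(
        lambda t: TextBlob(t).sentiment.subjectivity)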

Now, while the example above demonstrates good use of Pandas’ .apply(), it also processes the same text twice. That isn’t a problem for such a small dataset, but it could hinder a large-scale implementation in the long run.

Thought Process

Data Science project development is not straightforward. It is mostly trial and error in the beginning; it then evolves into educated guesses based on prior knowledge of the domain and your toolkit, and finally consolidates into working baseline solutions for generic use cases.

Describing your flow of hypotheses and tests is essential to impart your knowledge to others. Accordingly, we value presentations that strive to explain each step of the modeling process in detail.

Example of a successful step-by-step explanation.

A Data Science assignment is not unlike a scientific paper: present problem, formulate some hypotheses for resolution, explain the methodology, describe the execution, show results, then draw conclusions. Just follow those steps and try to be as didactic and concise as possible.

We cannot stress enough the importance of showing your work. Even if there were failed iterations of your models, describe what you tried and what you would do with more time. Leave that code in a separate file, away from your main report.

Insight Presentation

This is the most important step of them all. No amount of good coding practices or modeling proficiency will make up for missing conclusions about your results. We can sum this step up as a simple question: why is this interesting, and how can I use it?

After extracting all those aspects, we can use the other variables in our dataset to discover interesting relationships:

  • What are the most frequent aspects of positive and negative reviews?
  • What are the aspects that most often co-occur?
  • What are the main aspects of the analyzed category?
  • Are there differences between brand aspects?
  • Can we analyze singular products?
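A sketch of how some of these questions translate into Pandas, assuming a hypothetical long-format aspects_df with one row per (review, aspect) pair and columns aspect, review_score, and brand:

    import pandas as pd

    # Most frequent aspects in negative (1-2 star) reviews.
    negative = aspects_df[aspects_df["review_score"] <= 2]
    print(negative["aspect"].value_counts().head(10))

    # Aspect frequency per brand, normalized so brands are comparable.
    per_brand = pd.crosstab(aspects_df["aspect"], aspects_df["brand"],
                            normalize="columns")
    print(per_brand.sort_values("Samsung", ascending=False).head(10))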

What we were looking for was a big-picture view of the dataset: how the candidate would correlate their extracted aspects with all the other information available. For instance, we gave a simple insight away in the task proposal: are there differences between the aspects of competing brands?

It's pretty easy to identify some aspects that stand out:

  • “Ice maker” is a pretty sore spot on Samsung refrigerators. It is one of the most frequent aspects in negative reviews (1–2 stars), and it co-occurs with issues and recall.
  • “Doors” are the aspect most often flagged as faulty by Whirlpool consumers. “Shelves” and “bins” are also frequently cited in negative reviews.
  • “Lighting” is a recurring problem in GE refrigerators: people complained about weak or failing lights in several models.

There are several ways you can build conclusions on top of this analysis: graphs of aspect perception over time, comparisons between brands, aspect frequencies across retailers… Every graphical element you add to your report makes it easier to understand and helps you convey your point of view. But this is in no way a requirement for a good report: a succinct, didactic explanation is more often than not a good replacement for a subpar graph.

An example of a great textual conclusion supported by statistical data (in Portuguese).
An example of a conclusion based on word vector distances between positive adjectives and the aspect space, listing the best-rated brands (in Portuguese).

Write your conclusions in a way that anyone reading your report can grasp what is described in your analysis. Treat anyone who reads your code/report as a layman: for all intents and purposes, the final insights gained through your code should be understandable to a stakeholder or a project manager.

We are extremely thankful to everyone who submitted an assignment, even an incomplete one. We wanted an internship test that would double as a portfolio addendum for everyone interested in our selection process. If you want to do your own analysis, feel free to use our partial dataset and this article for inspiration!
