How we built an NLP-flavored Data Science curriculum and hackathon course

Catarina Cruz Silva
Sep 26, 2018 · 13 min read

It was a few months back that my friend Sam, working in Lisbon’s Data Science Academy, challenged me to build a team and take on the responsibility of creating the Natural Language Processing (NLP) component of their 6-month program.

Consisting of three weeks of learning materials, the module builds towards a 1-day hackathon where participants compete to solve a challenging NLP task.

At the end of July, we hosted the final Text Classification competition at Unbabel. It was a fantastic ride, and I came away from it with a new opinion on hackathons, a better understanding of language in data science, and some different notions of the field itself.

But before delving into the most important skills to learn when entering the field of data science (with an extra focus on NLP) and the process we took to build our hackathon, allow me to present The Academy.

The Academy

Data is one of the highest-valued assets nowadays, and companies are desperately searching for data scientists, data analysts, data engineers, data modelers, and data architects [1][2]. There are still nowhere near as many candidates as there are open positions, so naturally, a lot of courses and materials have emerged claiming to be able to turn you into a data scientist.

So while you might be rolling your eyes and thinking yet another data science hyped course (and indeed it is yet another data science course), it’s a good one, and here’s why.

The Lisbon Data Science Academy (LDSA) is a 6-month program aiming at “helping people become entry-level data scientists by teaching introductory material”. With a group of amazing people in connected fields, it goes through the basics of data science, teaching how to gather, analyze, visualize and present data in many different contexts. It divides the workload into big groups, called specializations, each consisting of:

  • weekly notebooks of learning materials, complemented by graded exercises
  • a hackathon on the subject

The learning materials may consist of more than one notebook, but they should go over most of the skills required to learn the topic. Although the academy is paid (mostly to support operational costs associated with the hackathons), all the materials are made available to anyone, including the hackathon challenge. I find this extremely valuable, since it allows anyone to get started in the field, or simply to access some good materials, at any time.

The community also adds huge value to the Academy. Behind it all, you have data scientists working on all sorts of things, in all sorts of companies, and you have eager students with all these different backgrounds, each bringing something new to this network. For each hackathon, the students are randomly shuffled into groups, stimulating not only the ability to work with different teams, but also allowing them to better connect with their peers.

There is one additional thing that I really want to praise about this group. LDSA has a policy to keep waste — plastic, in particular — to a minimum. As such, they try to find providers that don’t rely on plastic packaging, choosing, for example, a traditional coffee machine that uses ground coffee over capsule machines (you can argue capsules are recyclable, but the process is not as environmentally friendly as simply not generating that extra trash). Even though they are a small organization, the commitment they make towards this initiative is inspiring, and I do hope that each and every person coming across the academy takes this mindset into their own companies once the course is done.

Let’s move on to more practical things.

Part 1 — Getting Started with Data Science and NLP

What can you do to learn data science? And what is a data scientist, really?

Although there are a lot of possible answers, in my mind a data scientist should know the basics of all stages of data, from its extraction to all the required processing until the actual analysis phase. This can also entail creating visualizations and exposing condensed and useful information from that data.

To learn this, you need solid skills in many fields ranging from mathematics to programming, together with good analytical and presentation skills. All of these are achievable, regardless of background, and the academy aims to arm you with the following curriculum:

  • Initial bootcamp on basic statistics, programming, and classification
  • 1st hackathon on Binary Classification
  • 2nd hackathon on Time Series
  • 3rd hackathon on Data Wrangling
  • 4th hackathon on Text Classification
  • 5th hackathon on Recommendation Systems
  • 6th hackathon on Model Deployment

The bootcamp introduces an overview of useful frameworks, such as pandas, scikit-learn, matplotlib, and numpy. It assumes a basic knowledge of the Python language, but I should mention that most of it is quite transferable to R, one of the languages of choice for some in the field.

The first units also cover the basics of statistics, from which students learn how to analyze data properly, get basic metrics out of it, do statistical inference and understand what correlation actually means — spoiler: it’s not the same as causation.

Quoting Darrell Huff, “The secret language of statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse, and oversimplify.” [3]. With the main foundations of statistics, not only do you learn how to analyze data, but you also become more aware of what is happening behind the curtain, and are better at avoiding the pitfalls of data misrepresentation.

From these basics, the bootcamp builds from simple classification algorithms, linear regression and logistic regression, to more powerful ones like K-nearest neighbours, Gaussian Naive Bayes, Decision Trees, and Random Forests, and even includes unsupervised learning methods like K-means for clustering problems — this post has a nice, short description of each.

Hyperparameter tuning, validation, and evaluation techniques are also presented — like some of those described here — and by the end of the bootcamp, all students are able to create basic scikit-learn workflows and apply them to simple problems. The first hackathon immediately follows, challenging them to tackle a Binary Classification problem.
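To make this concrete, here is a minimal sketch of the kind of scikit-learn workflow the bootcamp builds towards, with cross-validated hyperparameter tuning. The dataset and parameter grid are illustrative choices of mine, not the academy's actual exercises:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A built-in binary classification dataset, just for demonstration
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A basic workflow: scale the features, then fit a linear classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hyperparameter tuning: 5-fold cross-validated search over regularization strengths
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print("best C:", grid.best_params_["logisticregression__C"])
print("held-out accuracy:", round(grid.score(X_test, y_test), 3))
```

Wrapping the scaler and classifier in a pipeline keeps the cross-validation honest: each fold is scaled using only its own training data.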

The next week is dedicated to time series, the second specialization of the academy. It goes over some particularities of working with data that involves time, introducing important concepts like trend, cyclical, seasonal and irregular components. The students learn to identify these components and to forecast in this context. They also learn how to recast these problems as non-time-series problems, skills they’ll be able to apply in the Time Series Regression hackathon.
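As a small illustration of those components (the synthetic series below is my own invention, not an academy exercise), a trend and a seasonal component can be teased apart with little more than pandas:

```python
import numpy as np
import pandas as pd

# Synthetic monthly data: linear trend + yearly seasonality + irregular noise
rng = np.random.default_rng(0)
t = np.arange(120)
series = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120))

# Estimate the trend component with a centered 12-month rolling mean
trend = series.rolling(window=12, center=True).mean()

# Remove the trend, then average by month-of-year to estimate the seasonal component
detrended = series - trend
seasonal = detrended.groupby(t % 12).mean()

# Whatever is left after removing trend and seasonality is the irregular component
print(seasonal.round(1))
```

The recovered seasonal profile follows the sine wave that generated the data, peaking near +10 and dipping near -10.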

By this point, they’ve seen a great many problems, but a very important piece is still missing.

As someone who actually handles data, I can tell you that the biggest chunk of work in solving data science problems is just getting the data and processing it so it is ready to be analyzed. For this purpose, there is a need to process different formats — JSON, HTML, XML, XLS, CSV, TSV, among others — and handle different encodings. Beyond files, databases and APIs are also big sources of data, so SQL, HTTP requests and the basics of web scraping are taught to introduce the required concepts. The specialization is completed with cleaning methods to get the data into the same formats and prepare it for processing. Although this can be a painful experience, it is also extremely useful.
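Here is a toy sketch of that kind of wrangling (the tiny inline CSV and JSON records are made up purely for illustration): load two formats into one pandas DataFrame and clean it up:

```python
import io

import pandas as pd

# Two made-up sources of the "same" records, in different formats
csv_data = io.StringIO("id,name,signup_date\n1, Alice ,2018-01-05\n2,Bob,not available\n")
json_data = io.StringIO('[{"id": 3, "name": "carol", "signup_date": "2018-02-17"}]')

# Load both formats and stack them into a single DataFrame
df = pd.concat(
    [pd.read_csv(csv_data), pd.read_json(json_data, convert_dates=False)],
    ignore_index=True,
)

# Typical cleaning: strip stray whitespace, normalize casing, coerce bad dates to NaT
df["name"] = df["name"].str.strip().str.title()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

print(df)
```

Even in this tiny example, the unparseable "not available" date and the inconsistent casing have to be dealt with before any analysis can happen — a miniature version of the pain described above.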

The fourth specialization, and the one that brought us here, is the Text Classification specialization. It extends a lot of the previous concepts — previously dealt with through numeric data — to text. Thus, it focuses initially on techniques used in text processing, such as tokenization, regular expressions, stemming, and stopword removal.
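As one possible sketch of those preprocessing steps (not the academy's exact notebook code), here is a regex tokenizer combined with a stopword filter — scikit-learn's built-in English list, used here for convenience — and NLTK's Porter stemmer:

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

text = "The hackathon participants were classifying texts and building classifiers."

# Tokenize with a regular expression: keep lowercase word characters only
tokens = re.findall(r"[a-z']+", text.lower())

# Drop stopwords (scikit-learn ships a built-in English stopword list)
content_tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]

# Reduce each remaining token to its stem with NLTK's Porter stemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content_tokens]

print(stems)
```

Note how stemming collapses related word forms — "classifying" and "classifiers" end up sharing a stem — which shrinks the vocabulary a classifier has to learn.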

The students learn how to extract features from raw text, which differs from the previous problems where the features were readily available. They learn how to assess the usefulness of these features through methods like the Chi-squared test and perform some initial analysis on the text. The specialization also introduces high-dimensionality problems and dimensionality reduction techniques like SVD and PCA. Although Neural Networks were left out, it closes with word embeddings, a major concept in most NLP problems nowadays, and one that allows you to extract interesting relations like “king - man + woman = queen”.

The NLP frameworks and libraries presented included NLTK, SciPy, FastText, and some Stanford libraries. The examples shown covered topic classification and sentiment analysis, and the hackathon focused on intent classification, providing a wide overview of real text classification problems.

I should say, though, that NLP is a much bigger field than just text classification. If you want to fully understand it, there are plenty of resources out there. From a more academic perspective, I’ve worked with some colleagues to pull together the following reading list for a deeper dive into NLP.

(Disclaimer: I would not advise you to start from these if you don’t have the basic mathematical or statistical background.)

The two last specializations are Recommendation Systems and Deployment. The first deals with a type of system that is present almost everywhere nowadays — think of any online platform, like Amazon, Airbnb, or even Facebook and Google, providing you with “suggestions” (a.k.a. recommendations) of what to buy, what to consume, and where to go next. The second teaches students how to deploy their models, so they can actually be put to good use. After all, students should be prepared to apply data science to the real world — this is the end goal!

So, if you want to become a data scientist, and/or these topics interest you, you can either spend all that time looking them up and trying to understand them by yourself or just follow the academy notebooks. Who knows, you might even want to enroll in the next one!

But if you can’t wait until the next signup date, there are a lot more resources that you can access in the meantime to start learning data science or just practice your skills — a decent analysis of available courses is presented here. And if you get stuck trying to solve or learn something, remember there is always a community ready to help out.


Part 2 — Building a learning hackathon in 5 easy steps

So how did we go about creating our hackathon?

Along with my Unbabel colleagues Gonçalo Correia and Pedro Ferreira, and Pedro Balage from Priberam, we sat down together and came up with a plan. Of course, there are tons of ways to think about and structure it — there is no one recipe for anything — and keep in mind that our context was the academy, not just a standalone hackathon. So, despite the clickbait-y title, here is the actual thought process we followed to build our hackathon.

Step 1 — Define your curriculum

The first thing we needed to do was to define a curriculum. We thought about the skills that should be taught in this kind of academy, and came up with the following requirements:

  • The skills should be applicable to other problems, not specific to the particular challenge
  • There should already be a framework or library with a simple implementation of the methods taught

Luckily, since this was the second edition, there was already a curriculum from the previous year, so instead of reinventing the wheel, we reviewed the existing one and reorganized it so that some skills built on previous ones.

We ended up keeping the three major groups of the previous year: the first consisted mostly of data processing for text, which is useful in any NLP task and has a ton of libraries available. The second and third focused on dimensionality reduction and feature unions, both of which are not only important pieces of text classification problems, but can be extended to any complex data science problem, and are included in the toolkits used.

We then broke these big topics down into the more meaningful methods and tools and created small subsections that could be learned separately.

Step 2 — Structure and design your materials

After setting the curriculum, we needed to get our hands dirty and actually write the learning materials. Following the academy structure, our rationale was that each section or subsection should allow the students to understand why the topic is meaningful, what it consists of, and how the methods work. Additionally, there should be real-life examples demonstrating the performance and use of each tool. Each section ended up containing a structure similar to the following:

1 — Motivation of tool/method

2 — Description of when and how to use the tool, by introducing real-life examples

3 — Background of the method, and basic implementation, if applicable

4 — Additional analysis and visualizations on performance and results

One thing that I found extremely useful when creating these materials was to read other available materials on the subject and analyze how they present or tackle it. If I found materials that I thought were very well explained, I would add them as a reference so the students could also benefit from them. However, if I found a topic with a difficult background and no good resources easily available, I would try to come up with additional materials — provided as annexes — with demonstrations or extra explanations on these topics.

Last, but not least, we tried to include as many visualizations as possible, because these are retained much better than text alone: it is not without reason that people say a picture is worth a thousand words.

Step 3 — Pick a task

With our curriculum defined and the materials more or less completed, we started to think about which task to pick for the hackathon. We arrived at some key points:

  • It should be an up-to-date problem, with real-world applications
  • It should allow the students to apply most of the skill set they had built up in the field

We gathered a bunch of candidate tasks and looked into each to see if it fit the requirements. Our final candidates were the following:

  • Sentiment analysis
  • Intent classification
  • Spam detection
  • Topic classification

Although we felt all of these could satisfy the requirements, we ended up picking intent classification. In the chatbot and automation era, it seemed the most aligned with the first requirement, while still allowing a lot of the skills to be applied. So we defined our task as intent classification and built our story around it — the setting was a company asking the students to build an AI assistant, whose first task would be to detect the intent in users’ requests.

Step 4 — Choose your dataset

Now that you have a task, pick your dataset. This can be done in parallel with step 3 if you know datasets for your proposed tasks are scarce. In our particular case, we had decent enough datasets available, so we picked the one that best fit the task.

However, there were a few caveats. First, the dataset was easy to find by putting its sentences on Google. Second, you could achieve a pretty good baseline with very simple methods, leaving very little room for improvement. In real-life situations this can actually be the case: a lot of the time you just move within that small improvement space. But in this learning context, we wanted the improvements to be clearer. Finally, it contained a very specific set of actions and was not representative of our story.

So we set out to fix that:

  • We translated the sentences into a different language (Portuguese), making it very hard to trace a sentence back to the online dataset — since translating back from Portuguese to English does not typically match the original sentence — and slightly reducing the baseline performance;
  • We added a category of sentences with no intent, bringing the problem closer to a real-life situation, where captured sentences might not always carry an intent, and reducing the baseline performance a bit more.

With these changes, the dataset turned out to be very good for the particular task. I’ve mentioned baselines a few times here, and they are also an important factor when picking a dataset. You should know the steps you expect most students to take and the respective baseline scores they can achieve with them. This ties in not only with the quality of the dataset, but also gives you some idea of the scores and performance to expect at each level.
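To illustrate what such a baseline ladder can look like (with a made-up toy intent dataset, not the one we used), compare a majority-class baseline against a simple TF-IDF model:

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# An invented toy intent dataset (repeated to give cross-validation enough data)
sentences = [
    "play some jazz music", "play the next song", "put on rock music",
    "what is the weather today", "will it rain tomorrow", "weather forecast please",
    "set an alarm for seven", "wake me up at six", "set a timer for ten minutes",
] * 3
intents = (["play_music"] * 3 + ["get_weather"] * 3 + ["set_alarm"] * 3) * 3

# Baseline 0: always predict the most frequent intent
dummy = DummyClassifier(strategy="most_frequent")
dummy_acc = cross_val_score(dummy, sentences, intents, cv=3).mean()

# Baseline 1: TF-IDF features plus a linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model_acc = cross_val_score(model, sentences, intents, cv=3).mean()

print(f"majority-class accuracy: {dummy_acc:.2f}")
print(f"tf-idf + logistic regression accuracy: {model_acc:.2f}")
```

Knowing roughly where these rungs sit for your dataset tells you whether the task leaves students enough headroom to show real improvement.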

Step 5 — Review and refine your materials

Last, but not least, review all your materials over and over again. Seriously. Review it. Again. And again. Ask someone to review it and try to run the examples and the exercises. Get that second set of eyes to detect mistakes or unclear parts. Refine it with their feedback and take another look at the materials as a whole. Ask yourself a few questions:

  • Are the materials complete?
  • Are the materials clear?
  • Are they engaging and consistent?
  • Do they build towards the final task?
  • Do they contain extra pointers to allow the students to go further?

If you can answer yes to all of these and are finally happy with your learning materials, it is time to share them with the students. And a word of advice: as you keep getting feedback, try to continuously integrate it, to make this source of learning as good as possible, for as long as possible.

And that is it. That is how we built and hosted this data science hackathon. It was a lot of work — for us and for the students — but in the end it completely paid off. The students rose to the challenge and came up with amazing solutions, gave some incredible presentations, and even had some fun at our headquarters!

For us, teaching is indeed the best way to learn. We came out of this with an even better sense of a lot of topics in data science and NLP, and we had the opportunity to meet some amazing people in the field. And you know what? I can’t wait for next year’s challenge.

References

Unbabel R&D

A collection of articles from the Unbabel Research & Development Team.

Written by Catarina Cruz Silva

Senior Research Engineer at Unbabel. Also known as mother of GPUs, freer of CPUs and rightful queen of NMT. Star Wars fan, Engineer, Developer — in that order.