Peak impostor syndrome: me as a lightning speaker on ML in PL 2019 (photo by Karol Kaczorek)

How I didn’t become a Data Scientist in 6 months…

…but learned a lot, so read on!

Marek K. Cichy
Apr 1, 2020


tl;dr: One linguist’s path into data science. First, my background and motivation, then a list of books and courses that worked for me, and a few of my projects.

Why

I’m not really into New Year’s resolutions, but on the 2nd of January 2019 I started a U-turn in my professional life.

I had spent the first 8 years after getting my philology degree working with languages. I’d earned my bread as a freelance translator and interpreter, helped a startup enter Portuguese and Spanish-speaking countries and co-founded a boutique comics publisher bringing the best Latin American graphic novels to Poland.

My translation niche seemed increasingly unreliable, not to mention indie comic book publishing — although it was very rewarding in a creative sense. As for BizDev, to my surprise, it involved meetings. Lots of meetings. Even in a startup.

In 2018 I got involved in an NLP project as a Portuguese and Spanish linguist (preparing datasets, annotation, translation evaluation, etc.). My tasks, while fairly repetitive, nudged me to jump the fence and try my luck on the tech side of this field. In other words, I decided to learn to program, with a focus on NLP in particular and AI/ML/DS in general.

The effect? After one year, I… won’t call myself a data scientist (nor do I feel the need to). I do, however, participate in Omdena’s (collaborative AI for good) projects, currently in the role of a Machine Learning Engineer. I gave my first talk at an ML conference last autumn, and I’m working on filling the remaining knowledge gaps.

How did I get here?

Photo by Stephen Monroe on Unsplash

How

I started with the following principles:

  • I need to dedicate nearly all of my working time to it. I was lucky to have a way to support our family financially during the transition;
  • I’ll learn on my own, since that’s the way I like to work. Besides, the Internet is (Stack)overflowing with learning material;
  • That said, I need to find a mentor to guide me through the process;
  • I’ll spend some 3–6 months solely learning stuff, then find a company for a trainee post.

After some time, reality put these assumptions to the test:

About two months into the process, the company I had worked for the year before approached me with some new projects. I was more than happy to take them on. I had realized how hard it is for me to stay motivated with only a distant goal and no gratification in between, even when my financial needs are covered for the time being.

As for the abundance of learning material, it is a mixed blessing. I was diving head first into a new area, so every time I concentrated on a particular topic, references to several other ideas sprang up. Reading about NLP leads to word vectors, which are hard to understand without linear algebra. To get any grasp of Machine Learning solutions, I needed to revise and expand my formerly meager knowledge of statistics. And so on, and so on…
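To make the word-vector example concrete, here is a minimal sketch of the linear algebra involved. The three-dimensional vectors are invented purely for illustration (real embeddings are learned from text and have hundreds of dimensions), and `cosine_similarity` is just a helper name chosen for this snippet:

```python
import math

# Toy "word vectors", made up for illustration only.
# Real embeddings are learned from large text corpora.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Words with similar (toy) vectors score closer to 1.
print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["banana"]))
```

Dot products and vector norms are about as far as this goes, but without them even a simple "how similar are these two words?" question is hard to follow.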

That meant the need for a mentor was even more pressing, especially since I’d heard tales of how difficult it was to find someone like that in Poland in 2019. However, in my case, it wasn’t. One of the first people I spoke to while deciding to take the leap into the area was a tech educator. She introduced me to Piotr Migdał, now responsible for some of the more uncommon projects of my life.

I cannot overstate the importance of having a mentor, but I’d say the crucial support that Piotr gave me was nudging me in the right directions (materials, projects, events, people) and steering me away from paths that didn’t offer much promise.

Photo by Eugenio Mazzone on Unsplash

What

As you may expect from my mini-bio above, there were several areas I had to catch up with. Below, I list the materials I used in each of these:

Programming in Python

With no prior experience, I started with Python for Everybody by Dr. Charles R. Severance. It’s a great introduction to data-oriented Python programming: witty, informative and careful to round off every section of theory with some exercises.

For pure coding skills, I practised using Codility’s Programmers’ Home. They provide a list of lessons with exercises, with automatic tests assessing your code’s performance. (Shh… if you get stuck on any exercise, Codesays has some solutions, but promise me you’ll try all your ideas first ;)

Theory: statistics + linear algebra

My knowledge of these subjects ended at Polish high school some 15 years ago, so there was a lot to catch up on here as well. Thankfully, access to quality materials is much better than it was then. Piotr recommended (and I wholeheartedly agree!) Seeing Theory as a visual introduction to probability and statistics, as well as Introduction to Linear Algebra by Immersive Math. Videos by 3Blue1Brown and StatQuest also helped me a lot.

Machine learning

As a theoretical foundation, An Introduction to Statistical Learning worked for me. The practical part of the book is designed for R users, but in my case it proved quite straightforward to do the same exercises with the relevant Python libraries. Also, some good people have done the work already.

Apart from that, there are countless articles I’ve googled over the course of this year, or read casually thanks to recommender systems, here on Medium and elsewhere. If I am to name one author, please check out Jay Alammar’s blog. The way he explains various concepts from the field was really eye-opening for me.

DataCamp

Why did it earn its own section? I have a complex relationship with DataCamp. I’ve completed 19 courses on the platform: they are accessible and attractive, and the course authors are fun to watch. I particularly liked Justin Bois’s Statistical Thinking and Kirill Smirnov’s Practicing Coding Interviews, as they provided a wider perspective than just introducing the student to a given library.

That was exactly the problem with the other courses: my role was often reduced to filling some gaps in 6-line chunks of code, which was neither particularly challenging nor moving me forward in my education. Even the Projects, supposedly designed to help you tackle a particular problem from start to end, left little room for one’s own experiments, trial and error.

Bonus! For Polish speakers…

… I have an additional resource to recommend. Last year, a group of Data Science self-learners created a Machine Learning Study Group. They go through successive books on weekly Zoom calls, each participant responsible for a particular bit of the book. All previous calls are available on YouTube.

Dear reader, if you know of any similar initiative in other languages, I’d be happy to add a link below, with due credits to you. :)

What for?

As a final thought: all of the readings and courses above clicked into place once I put them into practice. As soon as I could, I started private projects, both inspired by my mentor and ones I came up with myself.

An article on “AI’s dirty mind” evolved into @IsitNSFW?, a tweetbot you can check for kinky content (for some background, hop on here). I also bridged my past professional life with my future one by creating a Portuguese dialect classifier and deploying it online.

The value of the work I’m doing now as part of Omdena is also hard to overstate: I learn how to cooperate in a data science project and I mingle (remotely) with a global community of like-minded individuals. Most importantly, it’s a huge reality check: real-world data are much messier than the pretty datasets you receive in any course.

I hope the above serves as inspiration. Most of all, remember to have fun doing stuff that matters to you!

Marek K. Cichy

PL PT ES Linguist turned NLP/ML rookie. Striving to bridge various worlds.