Jalaj Thanaki is an experienced Data Scientist with a demonstrated history of working in the information technology, publishing, and finance industries. She is the author of the books Python Natural Language Processing and Machine Learning Solutions. Her research interests lie in using Natural Language Processing and Machine Learning to solve complex problems. Besides being a Data Scientist, Jalaj is also a social activist, traveler, and nature-lover. I happened to find Jalaj at a good time, just as she launched her second book. Jalaj has been kind enough to do an e-book giveaway of a few copies of her books. Please find the details at the end of the interview.

Vimarsh Karbhari (VK): What top three books about AI/ML/DS have you liked the most? What books have had the most impact on your career?

Jalaj Thanaki (JT): Many books have really helped me learn NLP, ML and AI in my career, but my personal favorites are:

There is a lot happening in the DS/AI space. So, apart from reading books, I’m a big fan of reading research papers, which help me keep up to date.

VK: What tool or tools (software/hardware/habit) that you use as a Data Scientist have the most impact on your work?

JT: In the software category:

  • I just love the Ubuntu operating system.
  • I use Python as my primary coding language for data science projects.
  • I prefer the PyCharm IDE for coding because of its debugging features.
  • Pandas library: it helps me deal with a variety of file formats, and I can perform Exploratory Data Analysis (EDA) easily using its APIs.
  • PyTorch: Currently I’m trying to switch from TensorFlow to PyTorch. Learning is always fun for me. In PyTorch, we can define, change and execute the nodes of a neural network as we go, and there is no need for a special session interface. I like these features. The framework is more tightly integrated with the Python language and feels more native to me.
  • GitHub helps me manage and maintain my projects.

In the hardware category, I prefer to use a GPU-enabled system to train Deep Learning models. You can read more about my custom-built computer system here.

VK: Can you share the Data Science related failures/projects/experiments that you have learned the most from?

JT: Back in my graduate study days, when I was working on my thesis, I had not properly understood when to use which Machine Learning algorithm, which led me to perform a lot of iterations. During these iterations, I learnt a lot from my mistakes.

During the implementation of a multi-class classification model for an NLP application, data samples for some class labels were not generated as expected, so my model suffered from overfitting. I discovered this issue while testing the model on a hold-out test corpus. I resolved it by applying various subsampling techniques.
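The interview does not say which subsampling techniques were used. As one illustration, here is a minimal sketch of random undersampling, which shrinks every class to the size of the rarest one so that no class label dominates training. The function name `undersample` and the toy labels are hypothetical, chosen only for this example.

```python
import random
from collections import Counter, defaultdict


def undersample(samples, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    Illustrative helper only; not the author's actual method.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    # Every class is reduced to the size of the rarest class.
    target = min(len(group) for group in by_label.values())

    balanced = []
    for label, group in by_label.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))

    rng.shuffle(balanced)
    return balanced


# Toy, made-up corpus with a heavily skewed label distribution.
texts = [f"doc{i}" for i in range(10)]
labels = ["sports"] * 7 + ["politics"] * 2 + ["tech"] * 1

balanced = undersample(texts, labels)
print(Counter(label for _, label in balanced))
```

Undersampling throws data away, so with very rare classes it may be worth comparing it against oversampling or class weighting before settling on one approach.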

I will be doing a book review for the book on Acing AI so stay tuned and subscribe to our newsletter to not miss it.

JT: Fortunately, I got the chance to write two books on the topics I wanted to write about after my graduation. The first is “Python Natural Language Processing”, which helps beginners learn NLP from scratch, and the second is “Machine Learning Solutions”, a practical guide that helps readers build and optimize various Machine Learning applications. It includes applications from the Natural Language Processing (NLP), computer vision and Reinforcement Learning domains.

Given a chance, I want to write a book entitled “Reusable Architecture for ML Applications”. Nowadays companies have various types of projects, and some features of one project overlap with those of another. How can we build a reusable architecture that helps developers build the common features easily across multiple products? If such a reusable architecture were achieved, it would save Data Scientists a lot of development time and energy, and it would be an efficient solution for many companies.

VK: In terms of time, money or energy what are the best investments you have made which have given you compounded rewards in your career?

JT: After completing my undergraduate studies, I had to choose between a job and graduate study. I’m glad that I chose graduate study, because as part of my thesis work I came to know the NLP domain. That thesis work was my starting point in the Data Science field.

I believe that “Simplicity is the glory of expression”. I have always tried to make content as easy as possible for my readers so that they can understand complex data science concepts really well. I prefer to write in simple language. As a result, my first book, “Python Natural Language Processing”, was considered as a textbook by Weill Cornell Medicine, Department of Healthcare Policy & Research, for their Natural Language Processing in Health course. I also learnt a lot about marketing, pre-sales, post-sales and so on after publishing my first book.

Networking with students, researchers, industry experts and entrepreneurs is always a delightful experience. People are ready to share their ideas and knowledge with others. I really enjoy that positive spirit.

VK: What are some absurd ideas around data science experiments/projects that are not intuitive to people looking from outside in?

JT: People who are unfamiliar with the data science field think that machines can learn or think by themselves and that the job of Data Scientists/Machine Learning Engineers is just to monitor the process.

VK: In the last year, what has improved your work life which could benefit others?

JT: For the last year, apart from being a Data Scientist, I was writing my books. To manage these two projects, I had to have good time management skills. Initially it was really challenging for me, but eventually I learnt how to manage my time efficiently. To do so, I prepared weekly as well as daily to-do lists so that I knew how much time I needed to spend on each task. I usually try to make a realistic plan which I can follow, and I always leave room for watching YouTube videos.☺

Nowadays, I have cut down the time I spend on unnecessary meetings and communications. I don’t spend too much time on social media (but I always answer the questions coming from my followers). All these steps help me be more productive.

VK: What advice would you give to someone starting in this field? What advice should they ignore?

JT: I have a number of things that I want to share with newbies and job seekers.

  • There are many subdomains in data science, such as analytics, NLP, computer vision, speech and so on. Please don’t pursue a subdomain of data science just because others are pursuing it. Take your time. Try to understand your interest area. Getting confused during this process is normal. Unlock yourself. Don’t be afraid of failing in your experiments. Try to clear your vision; to do so, first of all you need to read a lot and start implementing a number of small applications for each of the individual domains. Run this exercise for a week or two for each domain. Check what kind of work you enjoy the most, and this way you can decide your interest area(s). Start acquiring domain-specific skills after deciding your interest area(s). Learn concepts by practically implementing them. Don’t try to attempt everything at once or acquire all the skills at the same time. Give yourself proper time to learn.

Always remember that your data science skills, your projects and your contributions remain with you forever, so focus on them rather than on anything else. Remember: “Acquire skills in such a way that you can be a technology creator instead of a technology user.”

  • To those who are trying to get a job in the data science field, I would say: please focus on the projects and domain of the company that is hiring you. Don’t focus on the size of the company. In the long run, your projects and work portfolio will speak on your behalf. If given a chance, you should work on big projects at a small or medium-sized company aligned with your interest area, so that you can learn about various aspects of data science, rather than working at a big company on a small chunk of a project.
  • When you try to change jobs, you will be asked what kind of work you have done so far at your current or previous company. Your potential employers won’t be much interested in your previous or current company’s profile; they are interested in your profile and your skills, and want to learn more about you and your projects. Make sure you have a great work portfolio so that you can impress them.

Advice that you should ignore is:

  • I would ignore those who say that some kind of certification in data science is mandatory to prove your skills. I have a different opinion. If you learn the skills without certification and prove them by completing some cool data science projects, then there is no need for certification. You can also enrich your skills by participating in various hackathons.

VK: What are bad recommendations given in data science, in your opinion?

JT: In my view, there are no bad recommendations. It is a very subjective matter and it varies from person to person. You need to decide which recommendation is best suited to you.

That said, during the initial days of my career, I got advice from some source that you have to know all the advanced concepts of linguistics if you want to learn NLP, but in reality I just needed to know the basic concepts of linguistics that could help me in my project. I don’t like it when people treat the roles of linguists and NLP engineers as the same. In reality, they serve different purposes and have different skill sets. I also don’t like it when people treat data science and data analytics as the same thing and advise people based on this interpretation, whereas in reality they are different terms that cover different sets of sub-domains/fields.

VK: How do you decide when to say no to experiments/projects?

JT: I always choose the project which can be helpful for the company and takes less time to develop. Projects that need more data and extensive time go on my wish list.

VK: Do you ever feel overwhelmed by the amount of data or size of the experiment or a data problem? If yes what do you do to clear your mind?

JT: I usually get overwhelmed when I deal with any new dataset. To clear my mind, I usually start with the following steps.

  • As a first step, I try to understand the problem statement really well.
  • I check what type of dataset I have: whether it is structured or unstructured.
  • If the dataset is structured, I take one table at a time and try to understand the meaning of each column. I also check how important each data column is for building the data science application.
  • If the dataset is unstructured, I take a small chunk of the dataset, analyze it, and list my findings. I then repeat the process a couple of times. Each time, the chunk should be drawn randomly from the main source of the dataset so that I can generalize my findings.
  • If possible, try to understand how the data has been collected.
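The random-chunk step for unstructured data can be sketched roughly as follows. This is only an illustration under assumptions: the dataset is treated as an in-memory list of text records, and the helper name `random_chunks`, the chunk size and the number of rounds are all hypothetical choices rather than anything stated in the interview.

```python
import random


def random_chunks(documents, chunk_size, rounds, seed=0):
    """Yield `rounds` independent random chunks drawn from the full dataset.

    Each chunk is sampled without replacement from the whole corpus, so
    findings from one chunk can be cross-checked against the others.
    """
    rng = random.Random(seed)
    size = min(chunk_size, len(documents))
    for _ in range(rounds):
        yield rng.sample(documents, size)


# Made-up stand-in for an unstructured corpus.
corpus = [f"raw text record {i}" for i in range(1000)]

for round_no, chunk in enumerate(random_chunks(corpus, chunk_size=20, rounds=3), start=1):
    # Replace this summary with whatever manual or scripted analysis fits the data.
    mean_length = sum(len(doc) for doc in chunk) / len(chunk)
    print(f"round {round_no}: {len(chunk)} records, mean length {mean_length:.1f}")
```

If the findings differ a lot between rounds, that is a hint that the chunks are too small to generalize from.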

VK: How do you think about presenting your hypothesis/outcomes once you have reached a solution/finding?

JT: This is the challenging part for me, especially when the person to whom I need to explain my results is not from a technical or data science background. In that case I follow these steps.

  • I list the important points of the outcome (findings, advantages and disadvantages). I also try to cover all the minute but important details about the result/outcome for different types of stakeholders.
  • I always keep things simple (minimal technical words, more layman’s terms) so that people can understand the outcome easily.
  • I usually prepare a list of potential questions that I may be asked so that I can answer them with a proper logical explanation.

VK: What is the role of intuition in your day to day job and in making big decisions at work?

JT: Intuition helps in deriving the basic features or choosing hyper-parameters for a data science project. It also helps you build a baseline model for the project. If you have deep knowledge of the domain, then your intuition really helps in making big decisions.

VK: In your opinion, what is the ideal organizational placement for a data team?

JT: In my opinion, every company and team has its own choices and hierarchies when it comes to the placement of the data science team.

  • Type 1: If the Data Scientists focus more on the software engineering part of data science projects, then they should report to Engineering.
  • Type 2: If the Data Scientists are building new products, then they should report to the product team or the CEO, because the features of the new products should be aligned with the overall vision of the company.

VK: If you could redo your career today, what would you do?

JT: There is not much that I want to change in my professional journey, but I really wish I had started hosting ML projects on GitHub earlier. Still, better late than never.

VK: What are your filters to reduce bias in an experiment?

JT: I usually use cross-validation techniques to handle bias-related issues.

  • If you have an adequate number of data samples and you want to use all the data samples present in the dataset, then use K-fold cross-validation.
  • Random subsampling is preferable when the dataset you are considering is either undersampled or oversampled. Also, if you don’t want to use all the data samples in the K-1 training folds, then random subsampling is the way to go.
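To make the difference between the two schemes concrete, here is a minimal sketch that only generates train/test index splits. It assumes samples are addressed by integer index; the function names `k_fold_splits` and `random_subsampling_splits` and the 20% test fraction are hypothetical choices for this example. In K-fold, every sample lands in a test fold exactly once; in random subsampling (repeated hold-out), each round draws a fresh random test set, so a sample may be tested several times or never.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def random_subsampling_splits(n_samples, n_rounds, test_fraction=0.2, seed=0):
    """Yield (train, test) index lists from independent random hold-outs."""
    rng = random.Random(seed)
    n_test = max(1, int(n_samples * test_fraction))
    for _ in range(n_rounds):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


for train, test in k_fold_splits(n_samples=10, k=5):
    print("k-fold   train:", sorted(train), "test:", sorted(test))

for train, test in random_subsampling_splits(n_samples=10, n_rounds=3):
    print("hold-out train:", sorted(train), "test:", sorted(test))
```

In either scheme, the model is trained and scored once per split and the scores are averaged; random subsampling lets the number of rounds and the test size be chosen independently of each other.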

VK: When you hire Data Scientists, Data Engineers or ML Engineers, what are the top three technical/non-technical skills you look for?

JT: If I’m hiring a Data Scientist who will be building data science products, then the following are the key skills.

  • Strong knowledge of ML/DL
  • Good coding skills
  • Great learner
  • Good communication skills

VK: What online blogs/people do you follow for getting advice/learning more about DS?

JT: The Machine Learning subreddit is one of the resources from which I get an idea of what is currently happening in the AI/ML industry.

Here are some of the blogs, and YouTube channels which I follow:

Twitter and LinkedIn work for you if you know whom to follow. I like to follow top researchers from academia and industry experts on Twitter and LinkedIn so that I get the best of both worlds. My Twitter handle is @jalajthanaki.

