30 Days of Data Science Career Tips- Part 1
Recently, I ran a 30-day series on interview questions and answers asked at top tech companies. For the interview Q&As, I maintained a blog post that collects all the questions in one article:
30 Days of DSML Interview Questions
Back in 2021, I ran a similar series focused on data science career tips. Many readers asked whether I had collected all those tips in one article. Sadly, I never started one back then. But thanks to LinkedIn, I was able to download all those posts. To keep things organized and avoid one bulky read, I have divided the tips into two parts. This is Part 1, covering tips 1–15.
Tip 1: Data scientist is not the sexiest job of the 21st century. Sorry to start with a bitter reality. People come to the data science field with very high expectations. While demand for data science jobs grew rapidly in the past, job seekers have now realized that companies were (and are) slapping the data scientist title on any data job. Most of the time you take a job and wonder to yourself: what happened to the sexiest job? You will not only build models; you will also collect, ingest, and clean the data. You will be lucky if you find a well-defined data engineering team to help you build pipelines. Therefore, my first tip is: lower your expectations and learn some data engineering skills.
Tip 2: During the early stage of a data science career, it is very easy to become a victim of the Dunning-Kruger effect. As Wikipedia says: The Dunning-Kruger effect is a hypothetical cognitive bias stating that people with low ability at a task overestimate their ability. I have added a diagram below with my comments in red. Try to find the best route, as I have pointed out on the graph, because a sudden rise in confidence with limited knowledge, followed by a sharp fall, can be harsh enough to hurt your learning experience. Identify where you are on the curve; once you pass its lowest point, confidence and knowledge can grow together.
Tip 3: Ask yourself if you are ready for a data scientist role. If the answer is maybe, then you are already doubting yourself. These days the Data Scientist job sounds like a ‘jack of all trades, master of none’. Companies add every data-related keyword to the job description (JD). In response, job seekers tweak their resumes by adding the same keywords to beat automated resume filters. This keyword stuffing is a never-ending cycle. Early in your career, it is sometimes better to look for Data Analyst or Business Intelligence Analyst jobs, where you can focus your learning time on a few core skills rather than trying to learn everything the job market demands instead of what interests you. Most of the time you will end up working on only about 25% of the JD. For analyst roles, I suggest spending some time on data ingestion/validation, ETL, SQL, descriptive statistics, and data visualization. These are good skills for a Data Scientist as well. Once you have gathered some experience, you can focus your self-learning on DS/ML skills and build domain expertise to help your next career move toward the Data Scientist role.
Tip 4: It is very easy to get pulled in multiple directions while you are trying to learn and ramp up your DS skills. It has happened to me. I consider myself a person who buys more books than I read. To overcome this, lay out a clear study plan, break it down into subsets, and make sure that each subset covers an important area of DS. Based on my learning experience, I suggest the following areas:
- Programming skills
- Math (including Statistics & Probability)
- Data Wrangling
- Machine Learning
- Data Visualization
- Big data transformation
- Product sense
- Communication skills
Going forward, the tips will cover each of the above subsets in turn.
Tip 5: I am covering the subset of ‘Programming Skills’. In my experience, it is better to focus on one analytics language than to try to learn many. I prefer Python because it is easy to learn, is open source, has lots of great packages, and can be used well beyond analytics. Google Colab is great, since you can start coding in Python right away without the hassle of setting up an environment. Once you are familiar with it, you can set up a conda environment on your own. One suggestion: while you learn Python, think from a data structures perspective; it will help with interview prep later. One more language to ramp up on is SQL. You will be writing queries for sure. Spend some time on advanced aggregations, CASE statements, subqueries, JOINs (including self-joins), CTEs, and window functions.
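To make a few of those SQL topics concrete, here is a minimal sketch that runs a query from Python using the built-in sqlite3 module. The orders table, its columns, and the 100 cutoff for the 'high' segment are all invented for illustration, and the window function assumes your Python ships with SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical orders table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", 1, 120.0),
        ("ann", 2, 30.0),
        ("bob", 1, 45.0),
        ("bob", 3, 80.0),
        ("cat", 2, 15.0),
    ],
)

query = """
WITH customer_totals AS (              -- CTE: one row per customer
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT
    customer,
    total,
    CASE WHEN total >= 100 THEN 'high' ELSE 'low' END AS segment,  -- CASE
    RANK() OVER (ORDER BY total DESC) AS spend_rank                -- window function
FROM customer_totals
ORDER BY spend_rank
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

The same query pattern (aggregate in a CTE, then label and rank the result) comes up often in analyst interviews, so it is worth practicing on your own tables.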
Tip 6: I am covering the subset of ‘Math (Stats & Probability)’. This is one of the heaviest learning paths for DS. For Math, I suggest starting with a refresher on linear algebra (vectors, matrix multiplication) and calculus. Statistics & Probability (S/P) can go to any depth. The best strategy is to decide early what kind of role you prefer. Most DS roles align with one of the following:
- Applied Scientist: High bar on S/P and some coding
- Data Scientist (General): Medium bar on S/P, and data analytics
- Research Scientist: High bar on S/P, research oriented (PhDs)
- ML Engineer: Low bar on S/P, very heavy on coding
Tip 7: I am going to cover “Data Wrangling” here. In short, data wrangling is the process of handling, cleaning, validating, and transforming raw data into processed data that is ready for analysis. It covers data preprocessing and exploratory data analysis (EDA). EDA is sometimes viewed as a separate component, but I think it belongs under the same umbrella if we include variable analysis (both univariate and multivariate), correlation analysis, and some plots to visualize distributions. Most of the tools and packages I talk about here are Python based. My top 5 packages for data wrangling plus EDA are pandas, numpy, scipy, statsmodels, and matplotlib. Of course there are many more; that’s the beauty of Python. One more package worth mentioning is pandas_profiling (since renamed ydata-profiling). I suggest using it if you want to create a nice HTML report.
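To show what the cleaning and validation steps look like in practice, here is a small sketch. I have deliberately kept it dependency-free using only the standard library; in real work you would more likely reach for pandas (methods such as dropna, astype, and describe do the same job in fewer lines). The raw data and column names are made up for illustration.

```python
import csv
import io
import statistics

# Hypothetical raw export, invented for illustration: note the blank age,
# the non-numeric income, and the stray whitespace around a name.
RAW = """name,age,income
 Ann ,34,52000
Bob,,48000
Cat,29,n/a
Dan,41,61000
"""

def to_float(text):
    """Return a float, or None when the field is blank or not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None

# Handling and cleaning: strip whitespace and coerce types.
records = []
for row in csv.DictReader(io.StringIO(RAW)):
    records.append({
        "name": row["name"].strip(),
        "age": to_float(row["age"]),
        "income": to_float(row["income"]),
    })

# Validation: keep only rows where every numeric field parsed.
clean = [r for r in records if r["age"] is not None and r["income"] is not None]

# A first univariate EDA step: a summary statistic for one column.
ages = [r["age"] for r in clean]
print(len(records), "rows read,", len(clean), "rows kept")
print("mean age:", statistics.mean(ages))
```

Dropping incomplete rows is only one possible policy; depending on the problem, imputing the missing values may be the better choice.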
Tip 8: Today I will be covering ‘Machine Learning’. It is the biggest area of the DS learning path, so I will have to divide and conquer. I will discuss ML in three sub-sections: Feature Engineering (FE), Modeling, and Deployment.
Feature Engineering is an art; it requires both domain knowledge and creative thinking. Raw data by itself rarely gives the model its predictive power. Too many features can overfit the model, redundant or unimportant features can degrade its performance, and too few features will leave it underpowered. I suggest spending a significant amount of time learning various techniques for feature creation and selection. In a business setting, this requires discussion with subject matter experts and stakeholders. It is also important to understand the difference between correlation and causation (causal inference). This is where hypothesis testing comes into the picture.
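As one simple example of feature selection, the sketch below flags pairs of features that are so strongly correlated that one of them is probably redundant. The feature names, the values, and the 0.95 threshold are all assumptions I made up for illustration; this is just one of many selection techniques, not a complete recipe.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical features: height appears twice, in different units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41, 23, 35, 52, 30],
}

THRESHOLD = 0.95  # assumed cutoff; tune it for your own data

# Compare every pair of features once and keep the highly correlated ones.
names = list(features)
redundant_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(features[a], features[b])) > THRESHOLD
]
print(redundant_pairs)
```

Which feature of a redundant pair to drop is a judgment call, and that is exactly where the domain knowledge comes in.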
Tip 9: I am covering the subsection of Modeling under the main topic of Machine Learning. I always used to confuse algorithm and model. And yes, there is a difference. Here is an easy way to differentiate: Model = train(algorithm + data).
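One way to see the distinction in code: in the sketch below, the least_squares function plays the role of the algorithm, and the function it returns is the model. I picked simple linear regression only because it fits in a few lines; the toy data is invented for illustration.

```python
def least_squares(xs, ys):
    """The algorithm: a recipe that turns data into fitted parameters."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x

    def model(x):
        """The model: the trained artifact, ready to make predictions."""
        return slope * x + intercept

    return model

# Toy data invented for illustration; it follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

model = least_squares(xs, ys)  # Model = train(algorithm + data)
print(model(10))  # 21.0
```

Feed the same algorithm different data and you get a different model, which is why the two words are not interchangeable.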
Training an ML algorithm on data to learn its weights, and tuning its hyper-parameters, is called modeling. A few important things to focus on during modeling: try multiple algorithms (always start simple), keep a validation set in your data split and use it for hyper-parameter tuning, pick your evaluation metrics wisely (e.g. if you have class imbalance, use a metric such as weighted AUC instead of accuracy), and learn about overfitting, ways to handle it, and the bias-variance tradeoff. Modeling is vast; it is not just about calling the fit and predict functions. I suggest spending some time understanding how some of the popular algorithms work.
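To see why accuracy can mislead under class imbalance, consider the sketch below. It scores a lazy "model" that always predicts the majority class. Note that I compute plain ROC AUC here as a stand-in for the weighted variant mentioned above, and the labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
lazy_scores = [0.0] * 100
lazy_preds = [0] * 100

print(accuracy(y_true, lazy_preds))  # 0.95
print(roc_auc(y_true, lazy_scores))  # 0.5
```

An accuracy of 95% looks impressive, but the AUC of 0.5 reveals that this model is no better than a coin flip at separating the two classes.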
Tip 10: Let’s talk about model deployment. This is one of the most important parts of the ML life cycle, yet it is often overlooked by data scientists. One of the main reasons is how we learn: in a typical ML project we do data preprocessing, EDA, and feature engineering, train multiple algorithms, cross-validate, and pick the best model, and that’s it. We get a great feeling of achievement and think that data science ain’t too bad. I was one of them until I worked in a real job as a Data Scientist and quickly realized that the best model, if it lives only in notebooks and slides, will never provide any business value. Most online blog posts, Kaggle projects, and even DS community learning resources are geared towards model development, not deployment. I suggest putting extra time and effort into learning model deployment. If you are building a Git portfolio, take your ML project the extra mile by serving your model via a web service so viewers can interact with it. Think about different ways to stand out and be competitive in the job market. MLflow is a great open-source tool for end-to-end ML lifecycle management. Look into Flask, Streamlit, container services, and Kubeflow. Do at least one end-to-end project.
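To give a feel for what "serving a model via a web service" means, here is a bare-bones sketch using only Python's standard library (wsgiref) rather than Flask or Streamlit, which would be the more practical choices. The predict function and its weights are placeholders I made up; in a real project you would load your trained model there. The port number and the JSON request shape are assumptions as well.

```python
import json
from wsgiref.simple_server import make_server

def predict(features):
    """Stand-in for a trained model; the weights here are made up."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))

def handle(raw_body):
    """Turn a raw JSON request body into (status, JSON response text)."""
    try:
        payload = json.loads(raw_body)
        return "200 OK", json.dumps({"prediction": predict(payload["features"])})
    except (ValueError, KeyError, TypeError):
        return "400 Bad Request", json.dumps({"error": "expected JSON with a 'features' list"})

def app(environ, start_response):
    """Minimal WSGI app: POST a JSON body like {"features": [4, 2]}."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    status, body = handle(environ["wsgi.input"].read(length))
    data = body.encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]

def serve(port=8000):
    """Run the service on localhost until interrupted (port is an assumption)."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling serve() starts the service locally; you can then POST a JSON body to it and get a prediction back. A portfolio project would add input validation, logging, and a container image on top of this.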
Tip 11: A picture is worth a thousand words. Yes, we are talking about data visualization today. To get the message across, I spent a few minutes plotting two visualizations of total post views on my tips so far. V1 on the left is a bar chart of post views. It is a little crowded, and it is hard to draw insights from it. V2 on the right is a time chart. It looks much cleaner, and I can draw some insights from it: the overall trend has slowed down, one data point really sticks out (tip 6 about Statistics was a popular one), and it seems I missed posting on 6th March. It’s amazing what a difference a well-structured, well-chosen chart can make. Data visualization is an art. It comes in handy during EDA, and also when you are presenting your findings to leadership and stakeholders. I suggest being tool agnostic when it comes to data visualization. Companies may use a variety of tools; what’s important is how you tell your story with data. Tableau and Power BI are the leaders according to the Gartner report. Popular Python packages include matplotlib, seaborn, bokeh, and plotly.
Tip 12: Today I will be covering the importance of big data processing for DS. The definition of big data started with 3 Vs (volume, velocity, variety), but more Vs keep getting added to capture the broader meaning of big data. For data science, big data is good news. Machine learning models, and especially deep learning models, crave big data. The challenge is how to handle and process it efficiently. This is where data scientists need to wear their data engineering hat and collaborate with their engineering team to build an effective data pipeline. Spark is one of the most popular frameworks for big data processing. You can use it through its Python API (PySpark) or through its native language, Scala. The learning curve with Python is gentler, but Scala can be faster, mainly for workloads that rely on RDD operations or Python user-defined functions; for DataFrame operations the gap is usually small. I suggest starting with PySpark and then leveling up with Scala. Recently, I also felt that I needed to ramp up my data engineering skills. Follow the hashtag #dataengineering on LinkedIn; there is tons of good information there.
Tip 13: Today I will be talking about ‘product sense’ from a DS perspective. I have often found Data Scientists doubtful about whether product skills are required to be a good DS. I disagree with that doubt. While you don’t need the full skill set of a product manager, you have to understand the business value you are providing to the product. Directly or indirectly, you are supporting some product and serving customers (internal or external). Many tech companies now have specific DS roles geared towards product-oriented analysis, and their interviews are structured to include a product case study. For those roles, hypothesis testing and experimental design skills are a must-have. At the end of the day, your stakeholders or customers don’t care how many layers you added to your neural net if you are not delivering any business value.
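Since hypothesis testing comes up so often in product interviews, here is a small sketch of a two-proportion z-test, one common way to analyze an A/B experiment on conversion rates. The experiment numbers are hypothetical, and a real analysis would also need a sample size decided in advance and a check of the test's assumptions.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 1,000 users per arm, 100 vs. 130 conversions.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

In a product case study, the statistics are only half the answer; be ready to explain what the result means for the launch decision.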
Tip 14: Today I will be covering communication skills. Most of the time we focus on hard (technical) skills, and soft skills remain underrated. For DS roles, communication is crucial, as you will be supporting multiple functions within the organization: meeting with business folks to gather requirements while identifying and framing the problem, then sharing your analytical findings back with them. During the project itself, it is also important to collaborate with data engineers, software developers, product managers, and other key business stakeholders. One of the challenges I came across was communicating data/model findings to non-technical audiences. In one interview, I was asked to explain the p-value to a business leader from a non-statistics background, and I struggled for a minute. The overall idea is how well you can break down your analysis in terms of business OKRs (Objectives and Key Results), and how you measure it with KPIs (Key Performance Indicators). I can suggest a great book to read if you are interested. Here is a summary of it:
https://strategyfieldguide.com/articles/measure-what-matters-john-doerr/
Tip 15: Today’s post is about the importance of domain knowledge. If you have already been doing DS for some time, then you should already have a domain of interest to align your career with. If you are new and trying to break into DS, you might be unsure which domain will interest you. Still, I suggest picking at least 2–3 domains of interest and building your portfolio to include 1–2 projects from each. For example, it does not make sense to apply for a job at an e-commerce company if your portfolio doesn’t have any projects in that domain. Once you have gained some experience, you will realize which industry interests you for a long-term career, but it’s worth having multiple options in the beginning. Some of the popular industries fostering DS are finance, healthcare, social networking, e-commerce and retail, cyber security, gaming, and telecom.
I hope these tips were helpful to you. I will be sharing Part 2 soon.