So, you want a career in data science?
A few months ago, I participated in a panel discussion called “So, you want a career in data?” with Vivian Li and Pooja Sund. It was my first experience as a panelist, and it was so much fun! One of the questions asked by an audience member was “So, how do I start?” and I realized I had much more to say than I could cover in that forum.
The result is this article. But before I begin, here are two things to note if you feel the urge to sigh and ask yourself, “Yet another article on starting a career in data science?”
- First, I am acutely aware that there are a lot of materials on how to approach a career in data. Additionally, many of those articles are based on personal experience and trajectory, just like this one. But I think what makes all of them instructive and helpful is that each person takes a different path across a common data science terrain. As the saying goes, “you’re unique, just like everybody else.”
- Second, in this article I am choosing not to touch on any soft skills such as communication or working in a team, among other similar topics. While these are equally important considerations, they are already covered by others extensively elsewhere.
In this article I outline some thoughts on data querying and preparation, statistics (descriptive and inferential), data science programming languages, data visualization, Machine Learning, and data engineering. I think that having familiarity with these is a great starting place for beginning a career in data science.
Data querying and preparation
Number one in my list of skills for a data scientist to have, and thus my starting point for this article, is data querying and preparation. (Caveat: I assume that somebody in your team has already created a data pipeline and organized data storage so that you have actual data to work with as I describe here, which I also touch on in the last section on data engineering.)
It’s impossible to overstate how much time data scientists spend extracting and cleaning data, and if we do it incorrectly, inaccuracy is the likely result (also known as GIGO: garbage in, garbage out). Such mistakes are easy to make by accident: I have seen wrong conclusions reached simply because someone used an inner join when a left join was called for, among many other missteps. I know I have made similar mistakes myself, and hopefully I have discovered all of them.
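To make that join pitfall concrete, here is a minimal sketch using Python’s built-in sqlite3 module with a made-up customers-and-orders schema (the table and column names are purely illustrative): an inner join silently drops a customer who has no orders yet, while a left join keeps them.

```python
import sqlite3

# Made-up customers-and-orders schema, just for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER,
                     amount REAL);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Caro');
INSERT INTO orders VALUES (1, 1, 50), (2, 1, 20), (3, 2, 70);
""")

# INNER JOIN silently drops Caro, who has no orders yet...
print(con.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
""").fetchall())  # [('Ana', 70.0), ('Ben', 70.0)]

# ...while LEFT JOIN keeps every customer, with NULL totals where
# there are no matching orders.
print(con.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
""").fetchall())  # [('Ana', 70.0), ('Ben', 70.0), ('Caro', None)]
```

Whether Caro belongs in the result depends entirely on the question you are answering, which is exactly why the choice of join deserves a second look every time.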
The language you are likely to use for querying is SQL (the Microsoft implementation is T-SQL). But it could also be a proprietary one or even one of the less common query languages, though fortunately it’s unlikely that you would be expected to know those to get a job. Either way, you cannot go wrong by starting with SQL, as it is an industry standard. Here are two free online courses on SQL from Microsoft:
Both courses include access to a free lab for running the exercises. No sign-up is required to simply read through the courses; to access the lab, you need to create a profile and log in, but it’s still free. In my view, these two courses provide a good high-level structure for what the SQL language (or more precisely, T-SQL) is about, and the free exercises are very useful. Another benefit is that even if you use other implementations of SQL (such as Oracle or MySQL, among others), the fundamental concepts are the same, even if there are some nuances in syntax.
Another great aspect of studying SQL is that whenever you feel stuck, there will likely be an answer on Stack Overflow, in other free trainings, in blogs, and so on. And if you interview with a team that happens to use a data query language you don’t know, being able to demonstrate your SQL skills is likely enough, unless that language is explicitly listed in the job description as a mandatory skill. For example, in most cases a team that uses Kusto would be satisfied if you explain your data query logic in SQL rather than Kusto if you’re not familiar with the latter.
If you plan to delve even more deeply into developing SQL skills, you can install Microsoft SQL Server Developer Edition for Windows 11 along with the AdventureWorks sample database (both are free downloads from Microsoft.com). In this way you can continue learning on your own as you will have what you need on your workstation (and no, Windows Server isn’t required for this).
I also want to mention, but without covering here, the topics of database architecture and data schemas, both of which I recommend for you to look into as well.
Statistics
Number two among skills for data scientists is knowledge of statistics (both descriptive and inferential). This is because after you have your data (queried, cleaned, and “massaged” overall), your next step is to review and describe its features, hence the label descriptive statistics. The step after that is to draw conclusions and make deductions, which is the task of inferential statistics.
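As a tiny illustration of the difference, here is a hedged Python sketch with invented samples: the first half describes the data, the second half asks whether an observed difference is likely real. (The design_a/design_b names and numbers are made up for the example.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Made-up samples: task completion times (seconds) for two page designs.
design_a = rng.normal(loc=30, scale=5, size=200)
design_b = rng.normal(loc=28, scale=5, size=200)

# Descriptive statistics: summarize what the data looks like.
print(f"A: mean={design_a.mean():.1f}, std={design_a.std(ddof=1):.1f}")
print(f"B: mean={design_b.mean():.1f}, std={design_b.std(ddof=1):.1f}")

# Inferential statistics: is the observed difference likely real,
# or could it just be noise? A two-sample t-test gives one answer.
t_stat, p_value = stats.ttest_ind(design_a, design_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p hints at a real effect
```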
Regarding online training in statistics, much of what I have encountered is either basic or advanced. In terms of the former, if statistics is a completely new area for you, there is nothing wrong with starting with content along the lines of “statistics for dummies” or anything else that provides a gentle introduction to the topic. Regarding the latter, much of what I have seen is too lengthy and perhaps even confusing (for my personal taste, at least). That said, I have found the following learning paths helpful:
- The Data Science Specialization on Coursera is a well-structured course for what one needs to understand about data science, including the basics of statistics.
- For someone more like myself with a STEM background (I have a math and computer science degree), the Statistical Inference course on Coursera may be more refreshing and relatable. That said, I did notice some not-very-positive comments about the statistical inference part of this track, and the course is based on the R programming language, while (as a colleague has put it) “all the cool kids now use Python.” (More on data science programming language choices in the next section.)
Does this mean that studying a data science programming language should come earlier in the list of things to learn to start a career as a data scientist? In my view, no. Even if you were to pursue these two subjects (statistics and a programming language) in parallel, I would still say that statistics should come first, because it concerns more of the “what,” while a programming language is more of a tool to get there; in other words, the “how.”
One question asked during the panel was whether it’s enough to know Excel. I would say that if you know 80 percent of what Excel can do, wow, that’s fantastic (I use Excel a lot and I think I know only about 15 percent). As an example, Excel lets you add a linear regression line to a time series chart with one click, and add the coefficients and R-squared with one more. This means that you should know the statistical concepts behind linear regression and what R-squared and the coefficients are, but it doesn’t really matter whether you use Excel or other tools to calculate them.
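For instance, here is a minimal Python sketch (with made-up monthly sales numbers) that computes the same slope, intercept, and R-squared that Excel shows when you add a trendline; the tool differs, but the concepts don’t.

```python
import numpy as np
from scipy import stats

# Made-up monthly sales figures with a rough upward trend.
months = np.arange(1, 13)
sales = np.array([10, 12, 11, 14, 15, 14, 17, 18, 17, 20, 21, 22])

# The same numbers Excel shows when you add a trendline to a chart:
result = stats.linregress(months, sales)
print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}")
print(f"R-squared = {result.rvalue ** 2:.3f}")
```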
That said, although Excel is a very powerful tool that allows you to run many tasks for descriptive and inferential statistics, at some point I think you may realize that Excel might not be the best tool for some of those tasks, leading you to start looking for other tools, and that’s where programming languages come in.
Programming languages
Number three among skills for data scientists is knowledge of a programming language, which you use for further data wrangling (see the additional note in the third paragraph below), for further in-depth analysis, and for Machine Learning (ML) work. Usually, the choice is between R and Python. Note that I am making a distinction here between a programming language and a data query language (which I covered in the first section above on data querying and preparation): although querying data and using the results in Excel is a powerful combination, when you are ready to move to statistical inference analysis or ML, knowledge of a programming language is a must.
Unless you are engaged in Deep Learning, there is essentially parity between Python and R. Python is much more popular now and is more beneficial if you work with Spark for large-scale data analytics, while R is traditionally considered better for visualization (though Python is catching up). This makes the choice primarily a matter of personal preference. As with SQL, both Python and R have strong communities on Stack Overflow, so whatever question or problem you may have, you can very likely find an answer there.
Digressing to data wrangling for a moment: whether to run some data manipulation during the actual query or to do it afterward in R, Python, or even Excel is also a matter of preference and query performance (or perhaps other considerations), depending on the problem you are trying to solve.
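Here is a minimal sketch of those two options, using an in-memory SQLite table with invented data just to keep the example self-contained:

```python
import sqlite3
import pandas as pd

# An in-memory table with made-up data, so the example runs on its own.
con = sqlite3.connect(":memory:")
pd.DataFrame({"region": ["East", "East", "West"],
              "amount": [100.0, 200.0, 50.0]}).to_sql("sales", con, index=False)

# Option 1: push the manipulation into the query itself...
in_query = pd.read_sql_query(
    "SELECT region, AVG(amount) AS avg_amount FROM sales GROUP BY region", con)

# Option 2: ...or pull the raw rows and do the same work in pandas.
raw = pd.read_sql_query("SELECT region, amount FROM sales", con)
in_pandas = (raw.groupby("region", as_index=False)["amount"]
                .mean()
                .rename(columns={"amount": "avg_amount"}))

print(in_query)
print(in_pandas)  # same result; where you do the work is up to you
```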
For myself, I am an R person (call me old fashioned). I have recently started using Python more intensively but when it comes to exploratory data analysis (EDA) — an essential part of data preparation — I still prefer R, especially for data visualization (more on this in the next section below).
So, how to learn these languages? Once again, there are numerous resources, both free and paid. I used Coursera to start learning R (at the time it was free unless you wanted a certificate) and then continued on Datacamp (which requires a paid license, but I would say it’s worth it, and they run promotions from time to time; and no, they didn’t pay me to advertise them, either). Datacamp has a lot of courses, and you can combine studying a language (let’s say Python) with statistics and/or Machine Learning, for example.
Within the languages themselves, the key elements (in my view) are (a) to read and transform data and (b) to run your models. As for data, you are most likely to get data in the form of a table as the result of a query (a “data frame”), so it’s likely you will need to do the following:
- Clean up wrong or missing data, including handling outliers or imbalanced data.
- Group the data (for example, if you have height data for women and men, you can group by gender) and then summarize it for the various groups (for example, sum, average, and standard deviation).
- Join some of the tables to make them more useful, similar to how you would join them in SQL, but using R or Python frameworks instead.
- Visualize the data — you might be surprised at how much you can learn about what the data represents just by plotting it in various ways.
This is not a complete list, just a few examples. But a good starting point for any of it is understanding how to work with data frames (using pandas in Python and dplyr in R).
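To tie those examples together, here is a short pandas sketch with made-up data covering cleaning, grouping, joining, and a quick plot (the tables and values are invented for illustration, and the plot assumes matplotlib is installed):

```python
import pandas as pd

# A made-up heights table with a missing value and an obvious outlier.
people = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "gender": ["F", "M", "F", "M", "F"],
    "height_cm": [165.0, 180.0, None, 999.0, 170.0],
})

# Clean: drop missing values and implausible outliers.
clean = people.dropna(subset=["height_cm"])
clean = clean[clean["height_cm"].between(100, 250)]

# Group and summarize: average, spread, and count per gender.
print(clean.groupby("gender")["height_cm"].agg(["mean", "std", "count"]))

# Join: attach another (made-up) table, much as you would in SQL.
cities = pd.DataFrame({"person_id": [1, 2, 3, 4, 5],
                       "city": ["Oslo", "Lyon", "Kyiv", "Pune", "Lima"]})
enriched = clean.merge(cities, on="person_id", how="left")

# Visualize: even a quick histogram can be revealing.
enriched["height_cm"].plot(kind="hist", title="Heights")
```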
When it comes to ML models, in my personal view it’s impractical to try to learn all the model parameters and functions. Instead, work to understand the main frameworks (train, fit, predict), but also understand that each ML model has its own nuances, parameters to tune, and so on. Unfortunately, you cannot learn these up front; instead, you must learn them as you go through your work. Also, note that here I’m talking only about the language part of Machine Learning, and not about the models themselves (which is a separate story for a different article!).
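As a minimal illustration of that train/fit/predict rhythm, here is a scikit-learn sketch on one of its built-in data sets; you could swap the logistic regression for almost any other estimator and the surrounding code would stay the same:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# The surrounding framework stays the same whichever model you pick.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)  # swap in almost any estimator here
model.fit(X_train, y_train)                # train / fit
predictions = model.predict(X_test)        # predict
print(accuracy_score(y_test, predictions))
```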
Data visualization
I consider data visualization fourth on the list for data scientists because it might not be the skill that helps you land a job, but once you are on the job, it is invaluable in helping you stand out and deliver your message. Visualization is part of a larger skill set around “telling the story” (a cliché, perhaps, but a true one). This matters because, while data science is a technical discipline, it stands out among technical disciplines in how often its results must be explained to an audience that doesn’t have the same level of technical knowledge as the data scientists who produced them.
Telling the story with data involves discussing (a) what’s happening, (b) what it means, and (c) what you should do about it (for more information, read this article from my colleague Casey Doyle). If you are familiar with Gartner’s four stages of Business Intelligence (BI) — descriptive, diagnostic, predictive, and prescriptive — the “what’s happening” involves descriptive and diagnostic, “what it means” entails prediction, and “what you should do about it” is about prescription. So, how you present (or visualize) the “what’s happening” has a great impact on the conclusions and resulting actions for your audience.
At a more granular level, data visualization has its share of common best practices (such as “do not omit the axis”), but these are generally not as precise or scientific as other aspects of data science. An entry point is this old yet still useful article on visual encoding. (Before you begin reading it, ask yourself how many dimensions you have at your service for visualizing data. If you answered “two” or “three,” consider the question again after reading the article.)
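To make the dimensions question concrete, here is a small matplotlib sketch (with invented data) that encodes four variables in one flat chart: horizontal position, vertical position, color, and marker size.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 40

# Four invented variables, four visual encodings in one flat chart.
x = rng.uniform(0, 10, n)           # dimension 1: horizontal position
y = 2 * x + rng.normal(0, 3, n)     # dimension 2: vertical position
group = rng.integers(0, 3, n)       # dimension 3: color
weight = rng.uniform(10, 200, n)    # dimension 4: marker size

plt.scatter(x, y, c=group, s=weight, alpha=0.6, cmap="viridis")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Position, color, and size as separate encodings")
plt.show()
```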
I liked the following two-part course on visualization in R from Datacamp. Not only does it help you learn the technical skills for making data visualizations with the ggplot2 library, it also provides a structured way to approach visualization, as the courses gradually introduce additional dimensions as described in the visual encoding article mentioned above:
As for a third part of the course, it might be interesting to explore advanced methods of visualization, but I recommend first considering their practicality. For example, before you decide to apply advanced charts, think twice about whether your audience knows how to interpret them. One time when I was working on an update for our regular internal newsletter, I planned to use box plots, but our chief editor recommended that I choose another chart more suitable for a broad audience.
Another approach would be to read about the work of some experts in visualization such as Edward Tufte, or simply do a web search for “best samples of visualization” to get some ideas about how powerful visualizations can be (but alas, it is unlikely you will find a precise user’s guide to creating them).
I end this section with two final but important points: Don’t forget to consider color-blind readers when you choose your palette of colors, and remember that pie charts aren’t a favorite among data scientists. (A piece of trivia: Did you know that in French a pie chart is called a “Camembert chart”?)
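As a small illustration of both points, the sketch below uses seaborn’s built-in colorblind-friendly palette and draws some made-up shares as bars rather than as a pie:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn ships a palette designed with color-vision deficiency in mind.
categories = ["A", "B", "C", "D"]   # made-up categories
counts = [40, 25, 20, 15]           # made-up shares

colors = sns.color_palette("colorblind", n_colors=len(categories))
plt.bar(categories, counts, color=colors)
plt.ylabel("Share (%)")
plt.title("Bars are easier to compare than pie slices")
plt.show()
```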
Machine Learning
I now come to the fifth item on my list, and what many consider the coolest part of data science work: Machine Learning. Unfortunately, there is no easy way to learn ML skills quickly, but you might consider two approaches:
- For in-depth knowledge, take two or more years of classes on data science that cover the leading algorithms, and then use these skills to look for a job. For example, several universities offer online degree programs in data science, though other programs are also available for those who aren’t looking to earn a degree.
- Use available online resources (such as Coursera or Datacamp) to learn the basics (both theoretical and practical), use these skills to get a job, and then continue learning as you go through the business cases you encounter. If you work with other data scientists, be sure to learn from them using real-life scenarios, or ask them for advice or help with brainstorming. And even if your job doesn’t have an ML component, work on some cases from Kaggle or read ML blogs on Medium.com, which is worth doing anyway to stay in touch with the data science universe.
As for keeping up with ML developments through blogs, they might seem intimidating, especially in the beginning, as some authors juggle a variety of abbreviations you might not have heard before. But keep reading, and you will get there. Many bloggers use examples to illustrate their points, and these come with a lot of free data sets that you can use to play with data and learn ML algorithms. You will also encounter a plethora of recommendations for what to learn first, how to prepare for interviews, and more. Instead of giving my own ideas of what the ML basics are, here is an article published by current and former colleagues that includes learning resources for ML (among other data science topics), and here is an article explaining the 11 most common ML algorithms. And, of course, you can search on your own for “main ML algorithms” or “how to start Machine Learning.”
Furthermore, I recommend understanding not only the algorithms, but also the main categories of tasks (such as regression versus classification), the main categories of ML (supervised versus unsupervised), and other concepts applicable to any algorithm (such as train-test-validate, how to normalize and scale your data, bootstrapping, overfitting, and the bias-variance tradeoff, to name only a few).
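As one hedged example of these concepts in practice, the scikit-learn sketch below combines scaling and cross-validation in a pipeline, so the scaler is fit only on the training folds (fitting it on all the data first is a classic way to leak information):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Putting the scaler inside the pipeline means it is fit on the training
# folds only, so the held-out folds never leak into the preprocessing.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean(), scores.std())  # average skill and how much it varies
```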
And remember that in real work, simplicity is often king. If a problem can be solved with a simpler approach (or even without ML at all), that might be best for the situation. At the very least, try to start simple, which is easier if you did preliminary EDA before jumping into an ML algorithm.
Data engineering
While this last item on my list might not be critical to getting a job, understanding data engineering tasks is becoming more important. It’s almost like BYOD, but where the D is for data: In a large company you might need to pull data from various sources and formats so that you can work with it in your capacity as a data scientist.
In my view, this domain consists of several tasks, including data modeling, the ETL process, data architecture, and ML Ops, each of which I summarize below.
Data modeling
Data modeling involves deciding how to store data. For example, should it be stored as one wide table or as a few tables in a snowflake design? Should it be pre-aggregated or raw? Here it is useful to understand concepts such as normalization and primary and foreign keys, all of which are relevant even if you don’t make data modeling decisions yourself (in other words, even if you only use the data). Knowing how the data you work with is organized helps you with data extraction (as described in the first section above), while knowing how to design it yourself is critical for a DBA (database administrator) role. I started with An Introduction to Database Systems by Christopher Date. As I found it tough reading, I’m not sure I can say I recommend it, but I learned a lot from it.
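For a concrete taste of primary and foreign keys, here is a minimal sketch using Python’s sqlite3 module with an invented dimension/fact design; the foreign key stops the fact table from referencing a product that doesn’t exist:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in

# An invented dimension/fact design: each product lives in exactly one
# place, and the fact table points at it through a foreign key.
con.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,  -- primary key: a unique identifier
    name       TEXT NOT NULL
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL
               REFERENCES dim_product(product_id),  -- foreign key
    quantity   INTEGER NOT NULL
);
""")

con.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
con.execute("INSERT INTO fact_sales VALUES (100, 1, 3)")

# A sale that points at a non-existent product is rejected outright:
try:
    con.execute("INSERT INTO fact_sales VALUES (101, 99, 5)")
except sqlite3.IntegrityError as error:
    print("Rejected:", error)
```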
The ETL process
ETL stands for extract, transform, load: the process of creating data workflows, a.k.a. pipelines, and it is at the core of a data engineer’s role. It’s a practical task that depends on the technologies chosen in your organization, but it can also be versatile, as it’s dictated by your scenarios and needs. If you use Azure, there are lots of materials available on Microsoft Learn under the “Data Engineer” role. It also helps if you can work with experienced data engineers and learn practical things from them.
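In miniature, an ETL pipeline can be as simple as the hedged Python sketch below; the file, table, and column names are hypothetical stand-ins for whatever your organization’s sources and targets happen to be.

```python
import sqlite3
import pandas as pd

# Extract: read raw data from a source system. A tiny made-up CSV stands
# in here for whatever export your organization actually produces.
with open("raw_events.csv", "w") as f:
    f.write("user_id,event_time\n"
            "1,2024-01-01 09:00\n"
            ",2024-01-01 09:05\n"      # a row with a missing user_id
            "2,2024-01-02 14:30\n")
raw = pd.read_csv("raw_events.csv")

# Transform: clean the rows and reshape into the form analysts need.
raw["event_time"] = pd.to_datetime(raw["event_time"])
clean = raw.dropna(subset=["user_id"])
daily = (clean.groupby(clean["event_time"].dt.date)
              .size()
              .rename("event_count")
              .reset_index())
daily["event_time"] = daily["event_time"].astype(str)  # dates as ISO strings

# Load: write the result into the (hypothetical) analytical store.
con = sqlite3.connect("warehouse.db")
daily.to_sql("daily_events", con, if_exists="replace", index=False)
```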
Data architecture
Data architecture is the realm of someone who decides on the technology stack for the organization (starting with whether it should be on premises or in the cloud), which database platform to use, which tools to use for the ETL pipeline and compute tasks, and more. This might be the same person who creates the ETL pipelines or the database design, or it might be a separate role, such as a data architect.
Someone in this role is typically quite experienced, and so as a beginner it might be best to think of this area as a way to expand your horizons. I am afraid that I am not personally familiar with courses that can help build this skill, but you might try looking for courses for a data architect role.
ML Ops
ML Ops is short for Machine Learning Operations, a relatively recent skill within data engineering work. It pertains to questions such as where your model should run automatically on a regular basis once it goes into production, how you monitor it, and how you retrain and redeploy it. This work is rather specific, as it combines the skills of a data scientist who is very familiar with models with those of a DevOps engineer, because you need to be familiar with tasks related to deployment into production. Unless you are looking for this specific role, it’s not the first skill you need to start a data science career, but as with the other skills I’ve mentioned, it’s one you will have to learn at some point.
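As one tiny, hedged slice of that workflow, the sketch below retrains a model and saves it as an artifact that a scheduled serving job can load; a real setup would add data versioning, monitoring, and an orchestrator, and the names here are invented for illustration.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def retrain_and_persist(path="model.joblib"):
    """Retrain on the latest data and save the artifact for serving."""
    X, y = load_iris(return_X_y=True)  # a stand-in for your real, fresh data
    model = LogisticRegression(max_iter=1000).fit(X, y)
    joblib.dump(model, path)  # "redeploy" = replace the file the job loads

def serve(path="model.joblib"):
    """A scheduled production job loads the current artifact and scores."""
    model = joblib.load(path)
    return model.predict([[5.1, 3.5, 1.4, 0.2]])  # made-up incoming record

retrain_and_persist()
print(serve())
```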
Conclusion
There are many trajectories you can follow with data science work, and in this article I have tried to summarize my own personal experience in the hope that you might find it useful. I finish with this quote from Chanin Nantasenamat:
“The best way to learn data science is to do data science.”
Good luck!
Katya (Катя) Lazhintseva is on LinkedIn.