The data scientist toolbelt

Published in

Data Science at Microsoft

19 min read · Sep 1, 2020

One question I often hear is: “What skills should I learn to be an effective data scientist?” This comes up in mentoring sessions, 1:1s with team members, Q&A sessions with students, and more. Whether you’re looking to get into the field, or are a data scientist already, it’s a relevant topic, as we all need to continue advancing our skills as we grow in our careers. But what areas should you focus on? While the field of data science has been continuously changing, we’ve put together a framework that has withstood the test of time. In this post, we’ll walk through three key areas to continue advancing your skills. Here is an outline of the sections:

The data science Venn diagram

Technical skills
Business context
Soft skills

Bringing it all together (the intangibles, the unicorn, the pep talk, and a plan)

Frequently asked questions

The data science Venn diagram

A plethora of Venn diagrams have been used to describe the field of data science. Of course, there’s the original Conway Venn diagram that started it all. Then there’s the “Battle of the data science Venn diagrams,” with a number of variations that others have created over time. In addition to being entertaining, each of these variations offers good points and perspectives. And the variety of perspectives makes sense, since the field of data science is evolving, and it’s also diverse.

We like to summarize the many skills a data scientist needs in three main categories: technical skills, business context, and soft skills.

*Credit: Matt Storey, who developed this durable framework for our team in 2015.*

I haven’t drawn this diagram to any scale, so you could debate the relative sizes of these three categories. However, “technical” is likely one of the largest categories, given the many facets it entails, so we’ll start there.

Technical skills

“Technical” covers a broad set of capabilities. In this section, we’ll walk through what each of these technical skill areas means, and how to apply them to your work.

Analytical problem solving: Before you get your hands on any data, you first must understand the problem you’re trying to solve. That understanding is what lets you chart a path toward a solution. Furthermore, sound judgment in evaluating approaches, choosing the correct formula, and applying the appropriate data points is key to a successful result. Without it, you might inadvertently show a ratio in the reverse direction, plug in the wrong data points, or otherwise manipulate the data in a way that isn’t sound. Data science roles often specify quantitative fields of study as preferred academic degrees to encourage this kind of critical thinking.

Statistical concepts and techniques: Another important skill to ensure data is handled correctly is statistics. This includes concepts such as probability distributions, confidence intervals, regression analysis, and hypothesis testing. Statistics is one of the many educational backgrounds that aligns well with data science. It’s important for all data scientists to have enough of a foundation in statistics to develop an intuition regarding sound approaches, as well as awareness of the key concepts. The danger is that if you don’t know that additional considerations need to be checked (like sample sizes, distributions, and statistical significance, among others), you might share unverified analysis with a stakeholder and mislead them with the results.
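To make the point about sample sizes concrete, here is a minimal sketch of a confidence interval for a mean, using only the Python standard library. The session durations are made-up numbers, the function name is hypothetical, and the normal approximation is an assumption for brevity (for very small samples, a t-distribution would be the more careful choice).

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    center = mean(values)
    standard_error = stdev(values) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error

# Hypothetical session durations (minutes), made up for illustration.
small_sample = [12.0, 15.5, 9.8, 14.2, 11.1]
large_sample = small_sample * 20  # roughly the same spread, 20x the observations

low_s, high_s = mean_confidence_interval(small_sample)
low_l, high_l = mean_confidence_interval(large_sample)
print(f"n={len(small_sample)}: ({low_s:.2f}, {high_s:.2f})")
print(f"n={len(large_sample)}: ({low_l:.2f}, {high_l:.2f})")
```

Both samples have the same mean, but the interval from the larger sample is much narrower. This is the kind of check worth doing before sharing a number with a stakeholder.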

Languages (SQL, Python, R, and Kusto): Having fluency in some programming languages and the ability to learn others is key to ensure that you’re efficient and productive in your role. These languages are used to manipulate data, implement machine learning models, and develop programmatic solutions. In any data science interview, at some point you will be asked questions to ensure you’re familiar with common querying concepts such as joins, as well as programming syntax in the language of your choice. It’s helpful to also have experience with tools like Jupyter notebooks and R Studio as effective environments where you can create and share documents with live code.
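As an illustration of the join concepts that tend to come up in interviews, here is a small sketch that runs SQL from Python using the built-in sqlite3 module. The tables, column names, and rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with two tiny, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# INNER JOIN keeps only customers that have at least one order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# LEFT JOIN keeps every customer, including those with no orders.
left_rows = conn.execute("""
    SELECT c.name, COUNT(o.order_id) AS order_count
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The customer without orders disappears from the inner join but shows up with a count of zero in the left join. Picking the wrong join type is a common way to silently drop records from an analysis.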

Machine learning modeling: Machine learning is one of the areas you hear about most in data science and is certainly exciting and powerful. There are a variety of modeling techniques, such as supervised and unsupervised learning, classification and regression, clustering, deep learning, reinforcement learning, and more. These are used in a variety of applications, such as recommender models, natural language processing, segmentation models, forecasting models, and propensity models. These models help us understand current dynamics, predict future outcomes, and recommend user actions. It’s good to gain experience with the various enterprise scenarios that arise, such as handling noisy and/or sparse data, providing users with model explainability, running models in production with ML Ops, retraining the models over time, incorporating user feedback into the model, and tracking model performance.
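Tracking model performance can start very simply. The sketch below computes precision and recall for a binary classifier over two scoring periods. The labels, predictions, and week names are made up for illustration, and a production setup would typically rely on an ML library and a monitoring service rather than hand-rolled functions.

```python
def precision_recall(actual, predicted):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    # Guard against division by zero when there are no positives.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# Hypothetical (actual, predicted) labels for two scoring periods.
weekly_batches = {
    "week_1": ([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 1, 1, 0, 1, 1, 0]),
    "week_2": ([1, 1, 0, 1, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 1, 0]),
}
for week, (actual, predicted) in weekly_batches.items():
    precision, recall = precision_recall(actual, predicted)
    print(f"{week}: precision={precision:.2f} recall={recall:.2f}")
```

A drop in these metrics from one period to the next, as in this toy data, is one signal that a model may need retraining.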

Data preparation and pipeline management: In order to get started, there is a nontrivial first step to prepare the data. Often we’re working with “big data” in real-world enterprise scenarios. We need to identify the relevant datasets, gain permissions, extract the relevant records, and join with the appropriate identifiers to create meaningful connections. Then we pursue our initial data exploration. This includes checking the data quality and completeness, as well as handling outliers and other data cleaning needs. If it’s your first time working with the dataset, you may need to read internal documentation or test out the scenario to verify the data it generates. There may also be cases where the telemetry isn’t available and you must design new instrumentation. If you’re setting up a production model to run on this dataset, you’ll want to architect the pipelines in a way that allows for an automated refresh schedule and live site management. To ensure a stable service, this involves DevOps concepts including anomaly detection, reliability, uptime, and service level agreements. Finally, you must be familiar with policies around data privacy, GDPR, ethics, and security to ensure that the data is handled appropriately.
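The data quality checks described above can be sketched in a few lines. This example measures completeness for one field and flags outliers with the interquartile range (Tukey's rule). The field name, the latency values, and the 1.5 multiplier are illustrative assumptions, not a prescribed method, and the right outlier rule depends on the dataset.

```python
from statistics import quantiles

def completeness(records, field):
    """Share of records where `field` is present and not None."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical telemetry: one missing value and one suspicious spike.
raw_latencies = [120, 135, None, 128, 131, 5000, 125, 122, 138, 129]
records = [{"user_id": i, "latency_ms": v} for i, v in enumerate(raw_latencies, start=1)]

latencies = [r["latency_ms"] for r in records if r["latency_ms"] is not None]
print(f"latency_ms completeness: {completeness(records, 'latency_ms'):.0%}")
print(f"flagged outliers: {iqr_outliers(latencies)}")
```

Flagged values are candidates for investigation rather than automatic deletion, since a spike may be a real event rather than bad data.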

Experimentation: Running experiments is key for driving innovation in a data-driven culture. Successive randomized controlled trials can help the team discover drivers and learn which ones are having a material impact on business goals. In order to effectively lead these activities, you must be able to design proper experiments and analyze the results. This includes forming hypotheses, constructing proper control groups, accounting for biases, running statistical tests, and drawing conclusions. Finally, running experimentation review councils, applying ethical practices, developing and using experimentation frameworks, managing intersecting experiments with multi-attribution, and reporting experiment results at scale are all relevant on-the-job skills. Experience with causal inference techniques can be useful for impact analysis, as well.
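As one example of the statistical tests involved in analyzing an experiment, here is a minimal sketch of a two-sided, two-proportion z-test comparing conversion rates between a control group and a treatment group. The counts are made up and the function name is hypothetical. Real experimentation frameworks handle more (multiple comparisons, sequential testing, and so on).

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: control (A) vs. treatment (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the observed difference is unlikely under the null hypothesis, but whether the lift matters for the business is a separate judgment call.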

Data visualization: When it’s time to present our results, visuals help tell the story. It’s important to communicate to stakeholders in a concise way so that points are consumed and remembered, and data visualization is an effective tool toward this goal. Several best practices help land the message with visuals. These include selecting the optimal chart type, simplifying (by removing unnecessary lines or datapoints), reducing colors, focusing attention on the key points, making text large and readable, aligning to a grid, and more. Here are a few books that we like, which provide practical tips: Storytelling with Data, The Wall Street Journal Guide to Information Graphics, Information Dashboard Design, and The Visual Display of Quantitative Information.

Business context

One of the most exciting aspects of data science is the opportunity to apply data to business scenarios. This includes using data to inform business decisions and developing AI services to enhance the customer experience. But to be successful at these enterprise data science projects, you must understand the business and customer scenarios. In fact, this understanding is useful for any industry job. That is why Starbucks employees do taste tests, Airbnb employees host guests and stay at properties, and Amazon employees visit fulfillment centers. Similarly, in Azure, we sit in with our support team on customer calls, dogfood our product, and visit data centers. Putting yourself closer to the customer creates a new level of awareness and empathy. It also provides context for how different departments need to work together to enable the end-to-end experience. Taking a customer-centric mindset is a good compass to help guide you on any decision, based on what is best for the customer. Finally, this background can help spark ideas for how to make an impact from your role.

While business context is critical for any business role, in this section we’ll cover why it’s particularly important for data science. For a data scientist, business context includes understanding user scenarios, having a close connection with your business stakeholder, and being a subject matter expert in the dataset. Data can be misleading if misused, and one of the many ways to misuse data is to misinterpret the meaning of a field in the database. Therefore, it’s very important to maintain good documentation and understand what the data represents. This context also helps identify data quality issues (by giving a sense of expected bounds), uncover findings (by recognizing interesting trends for the business), and inspire new ideas for what to explore or model next. Understanding the goals for the analysis is often the key differentiator for turning data points into insights, by framing the results in an actionable way.

While your particular application domain will vary, in this section we’ll provide examples from our experience with Microsoft products, to help spark ideas for how to grow the business understanding in your company as well. In Azure, we need to have a technical understanding of the Azure services that customers use (shown below) and the solutions that they build:

https://docs.microsoft.com/en-us/learn/modules/welcome-to-azure/3-tour-of-azure-services

We also need to understand the experience that different audiences have when engaging with Microsoft sites, programs, and services:

So, how do you build that context? Here are a few approaches:

Leverage training materials: Your company probably has plenty of resources for users to learn how to use its products. Those can be great resources for you to study, too. For Azure, this includes Azure.com, Docs, MS Learn, Channel9, Webinars, Quickstart templates, Knowledge center, and more. If you’re learning as a team, schedule a recurring brownbag and nominate team members to research and present topics at each session. You can also watch the executive keynotes and session demos from external tradeshows to stay well-versed with customer scenarios. Internal all-hands meetings, earnings reports, and “ask anything” sessions are additional opportunities to hear from executives on the company and product direction.

Develop a project: Find a project you’re motivated to complete and that involves using the product. This can be a personal project or an initiative that you volunteer to help with at work. Having a specific end goal in mind will force you to work through scenarios and learn more in the end, compared to simply browsing through the learning materials above. To keep yourself accountable, you can also commit to a deadline, such as an event presentation, as a forcing function to prioritize this learning activity.

Listen to customers: Join support calls, events, or message boards to hear what’s top of mind for customers. If you don’t have access to these in your data science organization, ask your business stakeholders about opportunities you can join. The competition and market direction are good aspects to be aware of, too. Think of new ideas and approaches that the team can take to accomplish the strategic goals.

Document your understanding: As you learn, document your understanding so others can also benefit. Creating documentation for user flows and experiences can be a powerful step to align different parts of the organization. It also helps ensure that you’re interpreting the data correctly by clarifying the business process that it represents. Each time you share the draft with another person, you’ll learn a bit more about the way things are actually occurring, and you’ll end up with an artifact that is an accurate representation of the truth. (This improved cross-group understanding is beneficial for your stakeholders as well.) As the size of your company grows and the numbers of teams contributing to the customer experience increases, shared understanding and written artifacts become even more important. Below is an example from the Azure Marketplace user flow.

Soft skills

Traditionally, the training materials for data science have focused on technical skills. However, at any given point in time, I find that the areas my team members are prioritizing for their career development are pretty evenly split among the three categories introduced earlier (technical, business, and soft skills). More and more, I also see these topics coming up in industry conference sessions on “tips to be a successful data scientist in the enterprise.” It’s great to see the growing acknowledgment for this. I do find that soft skills are a key aspect of an individual’s ability to have a strong impact and growing career path in the organization. So, what are the top soft skills for data scientists?

Communication: As data scientists, if we develop the most innovative solution but no one knows about it, how much impact can it truly have? Scientists must speak, and data scientists are no exception. When we speak, we also need to make sure our message comes across. To land takeaways with a busy executive, the communication needs to be clear and concise. It’s good to have additional details in your “back pocket,” but many can be saved for the Q&A session. We also need to share the facts in a way that accurately conveys the information, why it matters, and what action to take. Data storytelling is a key skill to land this story arc. To learn more, see LinkedIn trainings on Presentation Skills and Public Speaking, enroll in the Coursera course on scientific writing, join Toastmasters, hire a speaking coach (we’ve had a good experience with Richard Klees), and most of all: Practice, practice, practice. For fast results, identify a learning buddy (or your manager) to give feedback after each presentation, including what went well and what you can improve.

Influence: Related to communication is the ability to influence. The data scientist must be able to stand by the numbers, whether they represent “good” or “bad” news. At its best, data science is a close partnership with stakeholder teams. Rather than merely serving data points, the data scientist should bring ideas (based on data insights) regarding the strategic initiative to take on next. Finally, the data scientist should be able to say “no” to lower priority asks and curiosity questions, in order to focus on projects with maximum business impact.

Collaboration: At the same time, data scientists need to be incredibly collaborative, both with business stakeholders and with fellow data science team members. While there are opportunities to deliver results independently, there are also many team projects to partner on together. Given our diverse backgrounds, we tend to create environments for team members to gather ideas and perspectives from the broader group. This helps keep our work consistent while ensuring that we’re applying common best practice approaches.

Organization: Good organizational skills are important for everyone’s effectiveness, but in particular for data scientists. There is no shortage of questions we want to answer with data, so it’s important to prioritize one’s backlog. Data scientists need to “cost” and plan their projects, so that others can depend on them to deliver. Documenting project requirements, leveraging work item tracking, and publishing results are all great best practices.

Bringing it all together

While the primary focus of this article is on data scientist “tools,” or skill sets, it’s worth taking a moment to discuss what data scientists do with all these tools. For example, if we defined a ceramic pottery artist by their tools (potter’s wheel, wire cutter, angled knives, shaping tools, sponge, brushes, calipers, kiln, and so on), we would be missing the essence of what they do (create art!), and therefore lack the context of what they use these tools for.

In Ron Sielinski’s earlier post “The role of data scientist,” he quotes the following definition from Jeannette M. Wing: “Data science is the study of extracting value from data.” This “value” (i.e., what data scientists “do”) comes in the form of analytical insights, machine learning models, experimentation results, and more. For more details, see the article for specific data science deliverables by role.

Like the potter who needs to combine learned skills and tools with their own innate artistic style, the data scientist also brings together both art and science.

The intangibles

This brings me to the “intangibles” — the more innate characteristics of effective data scientists that are not listed in any data science master’s curriculum and yet are core traits for those successful in the field.

Curiosity: One of the most fun parts of being a data scientist is uncovering surprises. While we design a product or program with a particular use case in mind, users might find novel ways to take advantage of it, and telemetry data is a path toward discovering the truth. In looking at the data for one thing, we might also notice another trend that turns out to be a powerful insight. But without curiosity, this insight might go unnoticed. Another key trend that curious data scientists notice pertains to data quality, which is key to being able to deliver high quality analysis.

Creativity: In the previous article, Ron noted the importance of creativity for data science as “one of the most valuable skills of a data scientist, but often the least emphasized, and certainly the most difficult to cultivate.” A creative data scientist will come up with ideas for new AI services to improve the customer experience, by “connecting the dots” from the data. Creativity also helps with navigating the inevitable blockers that come up along the way and inspires ways to work through them.

Grit: Having a strong determination and drive for results helps the data scientist to work through challenges. (Some also call this “stick-to-it-iveness.”) For a data scientist, this may include working through data access permissions, finding ways to join disparate datasets, handling noisy data with outliers, driving model performance, navigating experimentation limitations, and handling resource constraints.

Growth mindset: A growth mindset is the belief that you can learn anything you set your mind to. It’s about facing challenges with excitement for the learning opportunity that they provide, rather than being discouraged by the risk of failure. For a data scientist, a growth mindset (or “learn-it-all” approach) is key to learning the many skills discussed in this post. It also means you’ll be more open to feedback, which will make your impact that much greater.

Passion: Of course, the more passionate someone is about their work, the better job they’ll do at it. Passionate data scientists are excited about the application of science to business, and want to operate in a data-driven culture (versus relying on opinion). I often see candidates applying to our team who have experienced other methods of decision-making and want to be part of a more structured approach.

The unicorn

After reading this long list, you may feel a bit overwhelmed. If so, you’re not alone. In fact, the term “data scientist unicorn” was coined because it’s so rare to find someone who satisfies all the criteria. Among professions, the data scientist skill set is one of the more diverse. Some joke that the data scientist job description is truly a “wish list.” While each skill is useful, it’s possible to start with a subset, while continuing to develop the rest.

In fact, another way to interpret the data science Venn diagram is that it represents a data science team rather than a data science individual. That is, even if each individual doesn’t fulfill all the areas, we can address them by hiring a team with complementary strengths. In this way, putting together a team is like assembling an orchestra. Not every individual needs to be an expert in everything — they just need to work together well.

I recommend doing your own skill assessment. Reflect on what your true superpowers are and find roles that leverage them. At the same time, be self-aware and understand your development opportunities. Then prioritize learning plans based on what will help you be self-sufficient and have the biggest impact with your work.

The pep talk

As a result of having such a long list of requirements, data scientists often experience “imposter syndrome.” This is a feeling that even if you have a data science job or data science achievements, you fear you’re a fraud — that you “got lucky” with those accomplishments — and you don’t truly belong. Imposter syndrome occurs frequently in the tech industry.

Personally, I like to turn this “upside down” and maintain a healthy outlook by remembering that no one knows everything in the world of technology. In fact, the more you know, the more you actually realize that you don’t know. And that’s the beauty of tech. It’s constantly evolving, so we get to continuously research and contribute new ideas. If you’re excited about the opportunity to grow and experience lifelong learning, you will find this motivating, and never be bored doing the same thing over and over again.

Of course, our inner critic is useful at times. It pushes us to achieve more and do better. However, when it reaches too high a level, it can be debilitating. “Inner Critic Inner Success” offers a “devil’s advocate” technique for this. If you think you don’t know about a particular topic, you can use reverse psychology to eke out all the things you do actually know, and then build on that. In the end, the goal is to end up with a healthy level of self-awareness.

Plan your work and work your plan

Now that you know where you stand, take some time to reflect on where you want to go next. One tool that you can use to figure this out is a career plan template. There are many versions available online, so feel free to pick one that works for you. The key is to make time for this reflection, and to consider your values, skills, and passions. Think back on an awesome day, and then figure out what excited you most.

Assuming this still leads you down a data science career path, the next step is to put together a learning plan. Pick a few areas to pursue and activities that will get you there. Below are some examples. Pick two to three activities to focus on every three to four months, and then schedule a check-in with your manager to keep yourself accountable. You may find it helpful to designate a specific time in your schedule to tackle these trainings. In our team, we carve out Thursday afternoons for learning and development.

Frequently asked questions

This framework is a useful guide both for individuals interested in entering data science and for individuals interested in growing within their data science career, at any level. Here are a few questions I often receive:

What are the different roles available in the field of data science?

This article builds upon our earlier articles, which cover this topic:

Designing a data science organization (which describes the types of organizational structures you can join), as well as
The role of a data scientist (which describes the types of data science roles within the organization)

What are some recommended training resources for technical skills?

There are many ways to acquire data science technical skills, ranging from a boot camp to a formal degree (bachelor’s, master’s, or Ph.D.). Of course, the more thorough the program, the more comprehensive the training you will receive. A growing number of data science programs have become available over the past decade. If you’re learning these skills while working (which also gives you the opportunity to practice them on the job), you can take advantage of evening and weekend programs, as well as books and MOOCs (“massive open online courses”).

What is the ideal background for a data science role?

Data science attracts people with a wide variety of backgrounds. While the majority of software engineers have a computer science background, the educational background for data scientists is more evenly divided among math, statistics, physics, economics, engineering, and other applied sciences. Data scientists may have work experience in finance, consulting, database administration, business planning, software engineering, and more. One benefit of bringing together diverse backgrounds is that we often receive open-ended questions, which gives us an opportunity to consider multiple perspectives as we determine our approach. Given that most data science university programs came into being in the past five to ten years, it’s more common to see data science-specific degrees among recent graduates. If you’re interested in learning about the career paths of our current team members, please see our “Faces of data science” series.

I’m interested in switching careers into a data science role. (Or, I’m interested in shifting focus within data science career paths, from analytics to machine learning.) How should I proceed?

My top three tips are to learn the skills in this article, get a mentor in the field, and gain experience with “hands on” projects. Developing a project has the benefit that you’ll learn more about the capabilities and limitations of the tools and techniques you’re studying by pursuing a specific end goal (as opposed to more theoretical learning). In the process, you’ll gain a better sense of what you know and what you need to learn. Very importantly, you’ll also confirm whether you actually enjoy this kind of work! Finally, you’ll have experience to reference and draw from as you interview for future data science roles.

There are a few ways to kick off a project. One option is to find one at work. You can volunteer to help the data science team (and get mentorship from them in the process). Another option is to start a new project, within your current role, that will benefit the business. This provides a way to continue contributing to your team while learning a new skill and starting to position yourself in a new way. Finally, you can also develop a personal project, and share the code on GitHub.

What is the interview process like?

Our interview process has a few stages, including a resume review, initial phone call with a hiring manager or HR recruiter, technical screen, and finally the “in-person” interview. (Note: The in-person interview is currently held remotely, as of March 2020.) In an article in our “Faces of data science” interview series, three members of our team who joined in early 2020 speak about their interview experiences at Microsoft and offer some tips and perspectives.

Where should I apply if I’m interested in a data science role at Microsoft?

Please visit our careers site at https://careers.microsoft.com/.

Lisa Cohen is on LinkedIn.
