How to Think Like a Data Scientist in 12 Steps

James Le
Aug 3, 2018 · 33 min read

Introduction

At the moment, data scientists are getting a lot of attention, and as a result, books about data science are proliferating. While searching for good books about the space, I noticed that most of them focus more on tools and techniques than on the nuanced, problem-solving nature of the data science process. That was true until I encountered Brian Godsey’s “Think Like a Data Scientist,” which leads aspiring data scientists through the process as a path with many forks and potentially unknown destinations. It discusses which tools might be the most useful, and why, but its main objective is to navigate the path, the data science process itself, intelligently, efficiently, and successfully, and to arrive at practical solutions to real-life data-centric problems.

Lifecycle of a data science project

  • The 1st phase is preparing: setting goals, exploring the available data, wrangling it into usable form, and assessing what you have and what you still need.
  • The 2nd phase is building the product, from planning through execution, using what you learned during the preparation phase and all the tools that statistics and software can provide.
  • The 3rd and final phase is finishing — delivering the product, getting feedback, making revisions, supporting the product, and wrapping up the project.

Phase I — Preparing

The process of data science begins with preparation. You need to establish what you know, what you have, what you can get, where you are, and where you would like to be. This last one is of utmost importance; a project in data science needs to have a purpose and corresponding goals. Only when you have well-defined goals can you begin to survey the available resources and all the possibilities for moving toward those goals.

1 — Setting Goals

2 — Exploring Data

3 — Wrangling Data

4 — Assessing Data

  • Descriptive statistics asks, “What do I have?”
  • Inferential statistics asks, “What can I conclude?” (see the sketch below)
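
To make the contrast concrete, here is a minimal Python sketch, assuming NumPy and SciPy are available and using made-up measurements: descriptive statistics summarize the sample you have, while the inferential step, a confidence interval, makes a claim about the unseen population.

```python
import numpy as np
from scipy import stats

# Illustrative sample: eight made-up measurements of a daily error rate
sample = np.array([0.021, 0.018, 0.025, 0.030, 0.022, 0.019, 0.027, 0.024])

# Descriptive statistics: "What do I have?" -- summarize the data in hand
print("sample mean:", sample.mean())
print("sample std: ", sample.std(ddof=1))

# Inferential statistics: "What can I conclude?" -- estimate the unseen
# population mean with a 95% confidence interval based on the t distribution
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=sample.mean(), scale=stats.sem(sample))
print("95% CI for the true mean:", (ci_low, ci_high))
```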

Phase II — Building

After asking some questions and setting some goals, you surveyed the world of data, wrangled some specific data, and got to know that data. In each step, you learned something, and now you may already be able to answer some of the questions that you posed at the beginning of the project. Let’s now move to the building phase.

5 — Developing Plan

6 — Analyzing Data

  • Inferential statistics is inherently one or more steps removed from the data. Inference is the process of estimating unknown quantities based on measurable, related quantities. Typically, inferential statistics involves a statistical model that defines quantities, measurable and unmeasurable, and their relationships to each other. Methods from inferential statistics can range from quite simple to wildly complex, varying also in their precision, abstractness, and interpretability.
  • Latent variables.
  • Quantifying uncertainty: randomness, variance and error terms.
  • Fitting a model: maximum likelihood estimation, maximum a posteriori estimation, expectation maximization, variational Bayes, Markov chain Monte Carlo, overfitting (see the fitting sketch after this list).
  • Bayesian vs. frequentist statistics.
  • Hypothesis testing.
  • Clustering.
  • Component analysis.
  • Support vector machines.
  • Boosting.
  • Neural networks.
  • Deep learning.
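
As one concrete illustration of the “fitting a model” item above, the following sketch runs maximum likelihood estimation for a simple normal model on simulated data. It assumes SciPy is available and is only an example of the general technique, not code from the book.

```python
import numpy as np
from scipy import optimize, stats

# Simulated observations from a normal distribution with "unknown" parameters
rng = np.random.default_rng(seed=42)
data = rng.normal(loc=5.0, scale=2.0, size=500)

def negative_log_likelihood(params, observations):
    """Negative log-likelihood of a normal model with parameters (mu, sigma)."""
    mu, sigma = params
    if sigma <= 0:                      # keep the scale parameter in a valid range
        return np.inf
    return -np.sum(stats.norm.logpdf(observations, loc=mu, scale=sigma))

# Maximum likelihood estimation: choose the parameters that make the data most probable
result = optimize.minimize(negative_log_likelihood, x0=[0.0, 1.0],
                           args=(data,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print("MLE estimates: mu =", mu_hat, "sigma =", sigma_hat)  # roughly 5.0 and 2.0
```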

7 — Engineering Product

  • Flexibility: In addition to being able to perform the main statistical analysis that you want, it’s often helpful if a statistical tool can perform some related methods. Often you’ll find that the method you chose doesn’t quite work as well as you had hoped, and what you’ve learned in the process leads you to believe that a different method might work better. If your software tool doesn’t have any alternatives, then you’re either stuck with the first choice or you’ll have to switch to another tool.
  • Informative: Some statistical tools, particularly higher-level ones like statistical programming languages, offer the capability to see inside nearly every statistical method and result, even black-box methods like machine learning. These internals aren’t always user friendly, but at least they’re available.
  • Commonality: With software, more people using a tool means more people have tried it, gotten results, examined the results, and probably reported the problems they had, if any. In that way, software, notably open-source software, has a feedback loop that fixes mistakes and problems in a reasonably timely fashion. The more people participating in this feedback loop, the more likely it is that a piece of software is relatively bug free and otherwise robust.
  • Good documentation: In addition to being in common use, a statistical software tool should have comprehensive and helpful documentation. It’s a bad sign if you can’t find answers to some big questions, such as how to configure inputs for linear regression or how to format features for machine learning. If the answers to big questions aren’t in the documentation, then it’s going to be even harder to find answers to the more particular questions you’ll inevitably run into later.
  • Purpose-built: Some software tools or their packages were built for a specific purpose, and then other functionality was added on later. For example, the matrix algebra routines in MATLAB and R were of primary concern when the languages were built, so it’s safe to assume that they’re comprehensive and robust. In contrast, matrix algebra wasn’t of primary concern in the initial versions of Python and Java, and so these capabilities were added later in the form of packages and libraries.
  • Interoperability: If you’re working with a database, it can be helpful to use a tool that can interact with the database directly. If you’re going to build a web application based on your results, you might want to choose a tool that supports web frameworks, or at least one that can export data in JSON or some other web-friendly format. And if you’ll use your statistical tool on various types of computers, you’ll want the software to run on the various operating systems. It’s not uncommon to integrate a statistical method into a completely different language or tool (see the JSON-export sketch after this list).
  • Permissive licenses: If you’re using commercial software for commercial purposes, it can be legally risky to be doing so with an academic or student license. It can also be dangerous to sell commercial software, modified or not, to someone else without confirming that the license doesn’t prohibit this.
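
To illustrate the interoperability point, the sketch below exports hypothetical analysis results in a web-friendly JSON format using pandas; the column names and the results.json path are assumptions made for the example.

```python
import json
import pandas as pd

# Hypothetical analysis results; the column names and values are illustrative
results = pd.DataFrame({
    "segment": ["A", "B", "C"],
    "conversion_rate": [0.12, 0.09, 0.17],
})

# Export in a web-friendly format that another tool or a web app can consume
results.to_json("results.json", orient="records")

# The same payload built with the standard library, e.g. for an HTTP response
payload = json.dumps(results.to_dict(orient="records"), indent=2)
print(payload)
```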

8 — Optimizing Data

9 — Executing Plan

  • If you’re a software engineer, you know what a development lifecycle looks like, and you know how to test software before deployment and delivery. But you may not have much experience with data, and no matter how good you are at software design and development, data will eventually break your application in ways that had never occurred to you. This requires new patterns of thought when building software and a new level of tolerance for errors and bugs, because they’ll happen that much more often. Consult statisticians who are well versed in foreseeing and handling problematic data such as outliers, missing values, and corrupted values (see the defensive-cleaning sketch after this list).
  • If you’re starting out in data science, without much experience in statistics or software engineering, anyone with some experience can probably give you some solid advice if you can explain your project and your goals to them. As a beginner, you have double duty at this stage of the process to make up for lack of experience.
  • If you’re merely one member of a team for the purposes of this project, communication and coordination are paramount. It isn’t necessary that you know everything that’s going on within the team, but it is necessary that goals and expectations are clear and that someone is managing the team as a whole.
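
As a sketch of the defensive handling mentioned in the first point above, the example below (assuming pandas, with a hypothetical “price” column) coerces corrupted values, fills missing ones, and flags outliers before the data reaches the rest of the application.

```python
import pandas as pd

# Illustrative raw input: 'price' is a hypothetical column containing a missing
# value, a corrupted value, and an extreme outlier
raw = pd.DataFrame({"price": ["10.5", "12.0", None, "not_a_number", "9999", "11.2"]})

clean = raw.copy()

# Coerce corrupted values to NaN instead of letting them crash downstream code
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")

# Fill missing values with the median, one simple and explicit policy among many
clean["price"] = clean["price"].fillna(clean["price"].median())

# Flag outliers far from the bulk of the data using a robust, median-based rule
median = clean["price"].median()
mad = (clean["price"] - median).abs().median()
clean["is_outlier"] = (clean["price"] - median).abs() > 3 * 1.4826 * mad

print(clean)
```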

Phase III — Finishing

Once a product is built, you still have a few things left to do to make the project more successful and to make your future life easier. So how can we finish our data science project?

10 — Delivering Product

  • In some data science projects, the analyses and results from the data set can also be applied to data outside the original scope of the project: data generated after the original data, similar data from a different source, or other data that hasn’t been analyzed yet for one reason or another. In these cases, it can be helpful to create an analytical tool for the customer that can perform these analyses and generate results on new data sets. If the customer can use this analytical tool effectively, they can generate any number of results and continue to answer their primary questions well into the future, on various (but similar) data sets (see the sketch after this list).
  • If you want to deliver a product that’s a step more toward active than an analytical tool, you’ll likely need to build a full-fledged application of some sort. The most important thing to remember about interactive graphical applications, if you’re considering delivering one, is that you have to design, build, and deploy it. Often, none of these is a small task. If you want the application to have many capabilities and be flexible, designing it and building it become even more difficult.
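
One lightweight way to deliver an analytical tool rather than a one-off result is to wrap the analysis in a small script the customer can rerun on new, similarly shaped data. The sketch below is a hypothetical command-line wrapper; the column names and interface are assumptions for illustration, not a recommendation from the book.

```python
import argparse
import pandas as pd

def analyze(df: pd.DataFrame) -> pd.DataFrame:
    """Re-runnable analysis: summary statistics per group (illustrative logic)."""
    return df.groupby("category")["value"].agg(["count", "mean", "std"])

def main() -> None:
    # Let the customer point the same analysis at any new, similarly shaped CSV
    parser = argparse.ArgumentParser(description="Rerun the project analysis on a new data set")
    parser.add_argument("input_csv", help="CSV with 'category' and 'value' columns")
    parser.add_argument("output_csv", help="Where to write the summary table")
    args = parser.parse_args()

    new_data = pd.read_csv(args.input_csv)
    analyze(new_data).to_csv(args.output_csv)

if __name__ == "__main__":
    main()
```

The customer could then run something like python rerun_analysis.py new_quarter.csv summary.csv whenever new data arrives (the file names here are placeholders).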

11 — Making Revisions

12 — Wrapping Up Project

Conclusion

Data science still carries the aura of a new field. Most of its components — statistics, software development, evidence-based problem solving, and so on — descend directly from well-established, even old fields, but data science seems to be a fresh assemblage of these pieces into something that is new. The core of data science doesn’t concern itself with specific database implementations or programming languages, even if these are indispensable to practitioners. The core is the interplay between data content, the goals of a given project, and the data-analytic methods used to achieve those goals.

