Becoming a better data scientist: Lessons from academia and industry

Marrit Zuure
Published in Orikami blog
Jan 4, 2022

Two years ago, I pivoted from a PhD in neuroscience to a career in industry data science. What followed was mild culture shock. Industry and academia can be very different beasts. Work suddenly culminated in a release in a matter of weeks instead of years — fantastic! But being unable to fit every interesting analysis in the time budget was a new reality to adjust to.

I’ve since concluded that academic data science could take some cues from industry data science, and vice versa. Academia tends to value thoroughness and rigor, whereas industry tends to value pragmatism. This article is a guide for academics and industry data scientists alike, to encourage not getting stuck at either extreme. It touches on each stage of developing a research project, from its initial conception to the sharing of results.

Not all lessons listed in this article are applicable at the same time. In fact, some are mutually exclusive. Getting you to apply them all is not the goal. Instead, this article invites you to think about where you are, where you’d like to be, and which approaches you could use to get there.

The start, scope and design of your project

Any good data science project starts with the end goal in mind. The number of possible analyses for a given data set is potentially limitless, and without a clear goal, you’re at risk of getting lost in the woods. Academia and industry have different goals and different takes on what the purpose of your research should be. Both takes can help you clarify the scope and direction of your research.

The scientific method, summarized. (Image by Efbrazil)

Lesson from academia: Formulate a hypothesis before getting started

The goal of academic science is to increase knowledge by establishing statements as fact. To achieve this, every research project is guided by a carefully formulated hypothesis. Put very briefly, you start with a theory (how do you expect the world to work?) and operationalize it (what would this look like in observable data?), with a check for falsifiability (could the observable data also discredit this idea?). Hypotheses are great in part because they set a very clear goal for your work. A well-crafted hypothesis can go a long way towards informing your experimental design.

Not all industry data science projects lend themselves to formulating a hypothesis. For those that do, it can be a great tool to determine scope. A good hypothesis will naturally inform (and constrain!) the type of data to collect and the types of analysis to perform.

As an example: “I want to know how high different dog breeds can jump” leaves a lot to the imagination. You’ll want to measure jump height somehow, but do you need to collect data from every breed of dog? Say you rethink this and arrive at the hypothesis that larger dog breeds can jump higher, i.e., you expect size and jump height to be positively correlated. This tells you several things: you’ll need to sample breeds across a range of sizes, you’ll need enough dogs per breed for a representative sample of both size and jump height, and you’ll need enough data points to detect a statistically significant correlation.
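
To make that concrete, here is a minimal sketch of what testing such a correlation could look like. The numbers and variable names are purely illustrative (fabricated data, not from any real study):

```python
# Purely illustrative: fabricate size and jump-height data, then test
# whether the two are positively correlated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
shoulder_height_cm = rng.uniform(20, 80, size=60)                       # breeds across a range of sizes
jump_height_cm = 0.8 * shoulder_height_cm + rng.normal(0, 10, size=60)  # noisy linear relationship

r, p = stats.pearsonr(shoulder_height_cm, jump_height_cm)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
```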

Lesson from industry: Work towards actionable insights

The goal of industry data science is to use data to guide decisions. As a PhD candidate in a fundamental branch of neuroscience, I used to dread being asked about the societal impact of my work. Industry taught me to approach this question from a different angle. I now look for actionable insights: what insights can I produce that will guide others’ decisions?

In industry, actionable insights could look like determining whether an intervention is cost-effective and should continue to be funded, or whether QA needs to be bumped up to meet standards, or whether a campaign should target a specific audience. In academia, actionable insights could be as overt as informing new treatment programs for a certain disease, or they could be as subtle as nudging your field towards using a more accurate analysis method or towards adopting a certain theory.

Knowing the impact you want to have will help you tailor your experimental design, your choice of analyses, and the way you disseminate your results towards that goal. You may also want to limit your scope to the insights with the largest impact: while many insights can be actionable, not everything will be acted upon.

Choosing and collecting your data

Once the scope and general direction of your data science project have been defined, the next step is collecting your data. Data collection is resource-intensive, and it’s worth thinking carefully about the data (type, quality, quantity) that you’ll be working with.

Lesson from academia: Collect data that matches your hypothesis

Quite simply: what kind of data do you need to answer your research question? If you formulated a hypothesis, the type of data to collect will flow naturally from it. (If not, you may want to revisit your hypothesis.)

Don’t be fooled by the apparent simplicity of this step; it’s entirely possible to get stuck trying to squeeze an answer out of irrelevant data. Returning to the example of jumping dog breeds: if your working hypothesis is that breed size is related to jump height, but you only collected data from small dog breeds, you may have under-sampled the data space in a way that is inappropriate for your research question.

Lesson from industry: Use the data you have

As mentioned, data collection is a large resource drain. As a PhD student I was regularly surprised by the availability of diverse, high-quality data sets, many of which could have been easily re-analyzed with a different research question in mind… and yet everyone collected their own data. (For what it’s worth, I worked exclusively with data collected by others, which freed up my time to do much more interesting analyses.)

Take a critical look at what’s already out there and whether it can be used to answer the questions you’re interested in. However, do pay close attention to the fit between your data and your question. Any effort you save on data collection is undone if you can’t use the data for its intended purpose.

The Donders Institute neuroscience data repository; one of many sources of freely available research data.

Lesson from academia: Collect data as cleanly as possible

I learned this when working with brain recordings: there’s no substitute for clean data. No matter how carefully you filter out artifacts afterwards, nothing beats clean recordings in the first place. It’s potentially even more important in industry, where you’re working in a much less controlled environment. Try to get your inputs as clean and constrained as possible at the earliest stage, whether that means checking electrode conductance or validating form inputs on surveys before your participants submit their answers.

Lesson from industry: Learn to work with dirty data

No data set is perfect, regardless of how carefully it has been curated, and real-world data can be especially messy. Data cleaning and validation are necessary first steps regardless of your analysis. Clean your data, be conservative in your assumptions about what the data look like, and check the assumptions that you do make.

Do your best to ensure that your data are clean and of good quality before you start your analysis. If you don’t, you risk p-hacking: going back and forth between data analysis and data cleaning until your data show the expected result.
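
One way to keep yourself honest is to write the checks down as code before analyzing anything. A minimal sketch with pandas, where the file name, column names, and thresholds are all hypothetical assumptions you would state up front:

```python
# Hypothetical file and column names; the plausibility thresholds are
# assumptions to be stated up front, not tuned after peeking at the results.
import pandas as pd

df = pd.read_csv("jump_measurements.csv")

assert df["jump_height_cm"].notna().all(), "missing jump heights"
assert df["jump_height_cm"].between(1, 300).all(), "implausible jump heights (cm)"
assert df["shoulder_height_cm"].between(10, 110).all(), "implausible shoulder heights (cm)"
assert not df.duplicated(subset=["dog_id"]).any(), "duplicate measurements for the same dog"
```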

Choosing and executing your analyses

You have a research question, you have data matching it, and the whole wide world of analysis is open to you. You know to focus on those analyses that answer your question and produce actionable insights, and have perhaps made a data analysis plan, but where do you go from there?

Lesson from academia: Look at your data

It’s easy to get caught up in descriptive statistics of your data, or in analyses building on analyses building on analyses. The best remedy is to visually inspect your data. Plot early, plot often, plot intermediate steps in your analyses. It helps you catch errors (if something looks off, there’s a good chance it is), get a sense of the nature of the data, and stay grounded in understanding what you’re doing.
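
As a small illustration (reusing the fabricated dog data from earlier, so everything here is hypothetical), plotting the raw data next to an intermediate analysis step might look like this:

```python
# Illustrative only: regenerate the hypothetical dog data, plot it raw,
# then plot an intermediate step (residuals after a simple linear fit).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
shoulder_height_cm = rng.uniform(20, 80, size=60)
jump_height_cm = 0.8 * shoulder_height_cm + rng.normal(0, 10, size=60)

fig, (ax_raw, ax_resid) = plt.subplots(1, 2, figsize=(8, 3))

ax_raw.scatter(shoulder_height_cm, jump_height_cm, s=10)
ax_raw.set(xlabel="shoulder height (cm)", ylabel="jump height (cm)", title="raw data")

slope, intercept = np.polyfit(shoulder_height_cm, jump_height_cm, 1)
residuals = jump_height_cm - (slope * shoulder_height_cm + intercept)
ax_resid.scatter(shoulder_height_cm, residuals, s=10)
ax_resid.set(xlabel="shoulder height (cm)", ylabel="residual (cm)", title="after linear fit")

plt.tight_layout()
plt.show()
```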

A word of warning: while looking at your data is exceedingly useful, be careful not to double-dip. Spotting a possible pattern and then applying a significance test to confirm it’s really there compromises the validity of your test.

The Datasaurus Dozen, an argument for visually inspecting your data. (Credit: AutoDesk Research)

Lesson from academia: Explore your data

Performing only hypothesis testing is like looking at your data through a keyhole. Exploratory analysis, on the other hand, can offer you a wealth of insights that you didn’t even know were in the data. Given how resource-intensive data collection can be, I would always recommend exploring your data, even if just to generate the next round of hypotheses.

Lesson from industry: …but know when to stop exploring

The issue with exploration is knowing when to stop. This is a pitfall that I definitely fell into as a PhD student: there was always just one more curiosity to explore, one more ad-hoc question to answer, one extra level of analysis to stack on top of my analyses that were already five steps removed from the original data. Industry gave me much clearer time constraints and priorities, at the cost of occasionally (or frequently) putting aside a potentially interesting question. There’s no single good answer to when to stop exploring; it’s heavily project- and purpose-dependent. Remember that data analysis is a means to an end and keep your purpose in mind.

Lesson from academia: Understand your analyses

The existence of analysis toolboxes and modules has made data science more accessible than ever. You, as a data scientist, have a responsibility to use them wisely.

While analyses aren’t necessarily “right” or “wrong”, they can be appropriate or inappropriate for the data and the research question you have. Every type of analysis introduces some kind of bias, and it’s up to you to understand what that bias is and how to mitigate it. This means that you should have a conceptual understanding of the transformations that are applied to your data and the assumptions and constraints that are in play. One good way (though certainly not the only way) to get to this point is to manually program out the different analysis steps as an exercise in understanding.
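
As one example of this kind of exercise (a sketch, not a prescription), you might hand-roll a Pearson correlation once to see what it actually computes, then compare it against the library implementation:

```python
# Exercise sketch: a hand-rolled Pearson correlation makes its workings
# tangible (centered products, every point weighted equally), compared
# against the library version on toy numbers.
import numpy as np
from scipy import stats

def pearson_by_hand(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])
print(pearson_by_hand(x, y), stats.pearsonr(x, y)[0])  # the two should agree
```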

Lesson from industry: Don’t reinvent the wheel

While personally implementing every step of an analysis is excellent for understanding and offers you unparalleled flexibility, it’s not always the way to go. The academic circles I moved in were somewhat heavy on DIY and light on code sharing and reuse; I’m sure a lot of wheels were reinvented along the way.

Once you understand how and why an analysis works, save yourself some effort and consider using other people’s code. There’s a good chance others have done the heavy lifting for you, and have probably implemented it much more robustly than you have time to do. Any time saved on implementational details can be reallocated to your core business.

Interpreting your results

There’s a difference between knowing the results of an analysis and extracting an insight from your data. That difference is in interpretation: what do these numbers say about reality?

Lesson from academia: Check whether your results are robust

False positives happen. Whether because of a quirk of the data itself, an inappropriately chosen analysis method or parameter, or malicious data massaging, it’s possible to find a result in data that does not correspond to reality.

In academic science, a result that’s not reproducible (in different data, using a different analysis method) is, ultimately, invalid. Industry data science — being a lot leaner, and more focused on business intelligence than on the grand body of truth overall — tends not to pay as much attention to reproducibility or generalizability.

Nevertheless, it’s worth considering whether your results are robust enough that they’re not entirely dependent on the particulars of your data or analysis. You could use a different analysis method to answer the same question, subsample the data and see if results persist, or see if results align with those from a different-but-similar data set.
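
A crude version of the subsampling check could look like the sketch below; the function is generic, and the commented usage line assumes the hypothetical dog variables from earlier:

```python
# Sketch of a simple robustness check: recompute the correlation on repeated
# random subsamples and see whether the effect persists.
import numpy as np
from scipy import stats

def subsample_correlations(x, y, n_repeats=200, frac=0.7, seed=0):
    """Pearson r over repeated random subsamples of the data."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    k = int(frac * len(x))
    rs = []
    for _ in range(n_repeats):
        idx = rng.choice(len(x), size=k, replace=False)
        rs.append(stats.pearsonr(x[idx], y[idx])[0])
    return np.array(rs)

# If the finding is robust, the subsampled correlations should cluster around
# the full-sample value rather than straddle zero:
# rs = subsample_correlations(shoulder_height_cm, jump_height_cm)
# print(rs.mean(), np.percentile(rs, [2.5, 97.5]))
```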

Lesson from academia: Look for alternative explanations

When presented with an analysis outcome, academia goes for the why: what is the reason we are seeing this effect? After theorizing about possible causes, a follow-up analysis can be done to distinguish between them (in effect crafting a new hypothesis and performing a new hypothesis test).

Lesson from industry: Take results at face value

While academia values digging deep into a finding, such depth of analysis is not always feasible or desirable in industry. Sometimes, it pays to just take results at face value. Whether or not to do this depends on what the results will be used for. Will knowing the why affect any decisions that are made? Do you need to know exact numbers for something, or just that one is greater than the other? If an educational campaign is found to be effective, will knowing which parts are most effective actually have bearing on the way the campaign is implemented in the future? If not, it may be enough — and much more budget-friendly, in terms of time/money/effort — to accept being only directionally accurate.

Sharing your work

Knowledge lives when shared. Whether you’re exchanging information with your peers or presenting for executives, there’s value in communicating your results — both for others and for yourself.

Connections being made at a scientific poster session. (Photo by Steven Rose)

Lesson from academia: Share your work with your community

Science is collaborative. As such, sharing data, theories, observations, methods and approaches is par for the course. Any nugget of information could help others, it’s an excellent way to receive feedback, and there’s an expectation of reciprocity: everything you share contributes to a body of resources that you can also benefit from.

Corporate data science is a somewhat different beast. Depending on your field, it may be impossible to share your material publicly, lest you share proprietary information or reveal too much of your internal process and lose your competitive edge. However, you may have a more local community (other data scientists within your organization? A network of fellow learners? Visitors at a conference?) to engage with. I would suggest exchanging what you can: for others to learn from, for yourself to learn from, and to foster a culture of sharing resources. In my experience, there’s no faster way to improve your skill set and refine your methods than by talking to others. It’s the best way to receive feedback and inspiration that is tailor-made to your skill level and to your current activities.

Lesson from industry: Invest in data visualization

Nothing beats a strong, clear visualization for selling your findings and communicating your point. It’s worth investing time in making good, clear figures that are pleasing to the eye. The holy grail is a figure so clear that it’s self-explanatory (not always possible for very complex data or analyses), especially because your figures may go where you don’t, and you won’t always be there to provide an explanation.

There are many excellent guides to data visualization out there; seek them out. Good rules of thumb include: representing your data honestly, thinking about the point you want to make (storytelling with data — perhaps related to your hypothesis or actionable insights), keeping graphs simple, and using a single figure to make a single point.
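
As a small sketch of those rules in practice (the numbers below are invented, purely to show the form): one figure, one message, direct labels, minimal decoration.

```python
# Hypothetical numbers; the point is the form: one figure, one message,
# direct labels, no chart junk.
import matplotlib.pyplot as plt

groups = ["control", "campaign"]
conversion_rate = [0.12, 0.19]

fig, ax = plt.subplots(figsize=(4, 3))
bars = ax.bar(groups, conversion_rate, color=["#b0b0b0", "#2a6fb0"])
ax.bar_label(bars, labels=[f"{r:.0%}" for r in conversion_rate])
ax.set_ylabel("conversion rate")
ax.set_title("The campaign group converted more often")
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
plt.tight_layout()
plt.show()
```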

Conclusion

While there is no single best approach to academic or industry data science, it is possible to distill best practices from either end of the field. I hope this article encourages you to look beyond “the way things have always been done” in your area of work, and to appreciate the value of trying a different approach.

Notably, the distinction between academic and industry data science is not always as strict as I’ve made it out to be here. While this article focuses on the extremes, there’s plenty of mixing and matching. Orikami is a good example of a mixed company: we take an academic approach to digital biomarker development, and a more pragmatic approach to our consultancy projects for various health care businesses.

If you enjoyed this article, would like to talk more about this topic, or are curious about what Orikami could do for you, I would be happy to hear from you! Feel free to contact me at marrit@orikami.nl.
