Intro to Data Science for Academics

I went to Reed College, a notorious feeder school for PhD programs. I regularly talk to former classmates who are looking to transfer from academia to industry. There’s a pretty stark contrast between the current boom in tech and the rather anemic state of the academic job market so this line of thinking makes sense for many erstwhile TAs and professors.

Data science is a good match for many former academics because it leverages some of the math and statistics knowledge that many PhDs learn and use. For a quick visual explanation of what data science is, check out this handy diagram.

Types of problems data scientists may tackle in industry are varied and broad: automated medical diagnosis, analysis of ad auctions, matching drivers of a rideshare service with riders, detection of fake reviews on an ecommerce site, summary analysis of a firm’s customer acquisition pipeline and churn. Some tasks may require learning tools that help manage very large data sets while other problems will involve techniques to wring information from sparse or small data sets.

Some data science tasks will focus on exposition and visualization while others strive to produce automation with machine learning.

Priorities

The first important thing to understand is that moving from academia to the private sector does not mean getting paid to be an academic. There may be cases where research style work or giving papers makes some sense — but in most companies and most roles this will be a very rare occurrence.

When you join a company your priority should be to create the most value you can.

As an academic you thought about success as getting more personal notability in your field, giving talks and publishing peer reviewed papers, etc. In the private sector the measure of success is wildly different.

To understand how to create value you need to understand the business that the company is in: how it makes money, why the customers value the product, how the product works.

You also need to understand the connection between the work you (would) do and the company’s customers being happier, business costs lower or revenue higher.

The more directly you can contribute to growing the business, solving critical problems or lowering costs the more the company will be incentivized to reward you and keep you.

At a company it’s important to create value, an important concept in the business world. Creating value is subtly distinct from other forms of work and has a specific meaning in business; the inline links above may help illuminate this concept.

Tip: If you want to make a positive impression and better understand what you might be working on, ask the hiring manager: “What metrics does your team use to measure success? How do you know if the product changes you ship are good or not?”

Workflow

If you’re in one of the more research oriented roles (say, building machine learning models) you should think of your work breaking down into roughly three piles:

a) Scaffolding

I refer to all the work you have to do prior to what you probably think of as data science as scaffolding. This work involves discussions with other people who know about the data you want to use to make sure you understand the origin, quality and quirks associated with your data set.

Essentially you’re investigating the provenance and veracity of data and doing wrangling and munging.

What do I mean by wrangling and munging? Writing bits of code to pull data out of some other system, cleaning up irregularities in the data and implementing sanity checks, recombining data to create new variables, storing data in a format that is ready for your work, etc.
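To make that concrete, here is a minimal Python sketch of cleaning, sanity checking and deriving a new variable. The rows, field names and rules are all made up for illustration; real scaffolding depends entirely on the system you are pulling from.

```python
from datetime import date

# Hypothetical raw export: every value is a string and some rows are messy.
RAW_ROWS = [
    {"user_id": "1", "signup": "2016-01-04", "last_seen": "2016-02-01", "plan": " Pro "},
    {"user_id": "2", "signup": "2016-01-10", "last_seen": "", "plan": "free"},
    {"user_id": "3", "signup": "2016-02-30", "last_seen": "2016-03-01", "plan": "FREE"},
]


def parse_date(text):
    """Return a date, or None when the field is empty or not a real date."""
    try:
        return date.fromisoformat(text.strip())
    except ValueError:
        return None


def clean_row(row):
    """Clean one raw row; return None if it fails a sanity check."""
    signup = parse_date(row["signup"])
    if signup is None:
        return None  # sanity check: no valid signup date, row is unusable
    # Assumption for this sketch: a blank last_seen means "never came back".
    last_seen = parse_date(row["last_seen"]) or signup
    if last_seen < signup:
        return None  # sanity check: activity cannot precede signup
    return {
        "user_id": int(row["user_id"]),
        "plan": row["plan"].strip().lower(),         # clean up irregularities
        "days_active": (last_seen - signup).days,    # new derived variable
    }


cleaned = [r for r in (clean_row(row) for row in RAW_ROWS) if r is not None]
```

In this sketch the third row is dropped because February 30 is not a real date. In practice you would probably also log or count the rejected rows so you can ask the data's owners about them.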

b) Research/modeling

This is everything you think of as work today: training models, analyzing outliers, validating models, summarizing and reporting on data, models, results, etc.

c) Productization

Ensure your collaborators can understand and read your code and that it’s set up to be maintainable. E.g. variables and functions should be well named, code should be modular and tested, etc.
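As a small illustration of "well named, modular and tested", here is a hedged Python sketch. The function `churn_rate` and its definition of churn are assumptions made for the example, not a standard your employer will necessarily share.

```python
def churn_rate(customers_at_start, customers_lost):
    """Fraction of the starting customers who left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def test_churn_rate():
    # Small, explicit cases that a collaborator can read as documentation.
    assert churn_rate(200, 10) == 0.05
    assert churn_rate(50, 0) == 0.0
```

A test runner such as pytest would pick up `test_churn_rate` automatically; the point is simply that someone else can see what the function promises without reading your notebook history.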

Depending on the size of the company and the philosophy they have about data science you may be expected to productize your own code or that may be done for you by another team. Either way you’ll have to write your code with others in mind.

If you’re in a startup or a more full stack engineering team then you’ll want to make your code fast enough and memory efficient enough to work within the system you’re shipping.

If another team is productionizing your work you’ll need to make sure they can understand how your system works and replicate it with fidelity. The less full stack you are the stronger your communication skills need to be.

Either way a major part of productization is ensuring everyone very clearly understands what the expected behavior of your system is. This means writing technical documentation that can be used to modify, QA+test or debug your system as it rolls into production.

You’ll need to clearly understand and communicate things like “What does it mean for the system to break?” and “This expects values 0–1. What happens if we have a bug upstream and pass in values outside the input range?”
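One way to answer the second question is to make the system fail loudly instead of quietly producing nonsense. The following is a minimal Python sketch under that assumption; `check_score` and `expected_revenue` are hypothetical names, and raising an error is only one possible policy (clipping or logging and skipping are others you would need to document just as clearly).

```python
import math


def check_score(value, low=0.0, high=1.0):
    """Return value as a float, or raise if it is outside [low, high]."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"expected a number, got {type(value).__name__}")
    if math.isnan(value) or not (low <= value <= high):
        raise ValueError(f"expected a value in [{low}, {high}], got {value!r}")
    return float(value)


def expected_revenue(conversion_probability, order_value):
    """Toy downstream calculation that relies on a 0-1 probability."""
    probability = check_score(conversion_probability)
    return probability * order_value
```

With this in place, `expected_revenue(1.5, 40.0)` raises a `ValueError` that points at the upstream bug rather than returning a plausible-looking but wrong number.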

It’s important to realize that often the bulk of your hours will go into items a) and c) — not the things you may currently picture as data science i.e. mostly b).

Employer

There is a huge array of possible employers but I’ll oversimplify things a bit to make an explanation tractable.

i) Established tech companies

Advantages: Many extremely talented collaborators you can learn from, more likely to have infrastructure built that allows you to focus a greater percentage of your time on b), great benefits and compensation, good for resume building with recognizable brands.

Disadvantages: You may learn to do your job dependent on many sophisticated proprietary tools your employer has created limiting the portability of the skills you develop, Your work may feel a bit siloed or small in scope due to the scale of the team/organization/company.

Watch out for: Pick your manager/team as carefully as you pick your company as these will have outsized impact on what you learn. Be sure to ask how you’ll be judged and what success looks like in your role.

What it takes to get in: Passing rigorous technical interviews that will test knowledge of statistics, probability, computer science and software engineering. Different companies and different roles will emphasize different selections from the above-mentioned skill areas, so it can be very tricky to know ahead of time if you’ll pass.

Applicants should take a portfolio strategy and be prepared to fail some interviews and pass others. Don’t get discouraged, some of the very best data scientists I know have failed interviews at great companies but passed with flying colors at others. Technical interviewing is hard and not a solved problem. Accept that it’s a noisy indicator and don’t let the epsilons get you down.

What it takes to succeed: Ability to communicate effectively both 1:1 and in presentation settings, willingness to focus on key problems the company has to solve and become the world expert in this niche domain, a relentless focus on contributing to your team’s top goals.

This will be a good match for you if you want strong mentorship with structured training on the job and won’t mind feeling a bit like a cog in a big machine.

j) Startups

Advantages: There will be more opportunity to impact the business directly, you may learn a wider array of skills due to everyone wearing multiple hats, if a tail event happens and your company is wildly successful it will be a life changing experience, you’ll get to chart your own path.

Disadvantages: The default state of startups is failure and you’ll have to pedal hard against it, It can be hard to learn if there aren’t (m)any other folks with your skillset in the company, less pay and amenities than a big company

Watch out for: Startups reflect their founders and this can result in brilliant success or toxic mess, Startups go out of business all the time, Your title and accomplishments on your startup will matter more/less depending on the success of the company.

What it takes to get in: A can-do attitude to pick up anything and everything that needs to be done, signs that you’re an autodidact.

What it takes to succeed: Startups are like a small row boat in a choppy sea — each person you add can help you row but they also take up space. You have to provide positive return in a startup — there’s no fat to mooch off of. You’ll need a lot of grit and you’ll need to be passionate about the product you’re working on.

There won’t be a lot of hands on management and attention, everyone is stretched thin and super busy. You’re going to need to be able to figure out how to create value and do it quickly and independently.

This is a good role for someone who is very self motivated and has an aptitude for understanding customer problems and is very willing to wear any hat.

k) Established non tech companies

Advantages: You may be one of the few/only technical/data hires so you can chart more of your own path but the business is likely more stable than a startup

Disadvantages: Work culture and perks are often not as good as in i) and j), you’ll have fewer people to learn from, you may pick up bad software habits from other inexperienced and unsocialized software creators.

Watch out for: In some companies the data people may be seen as the “nerds in the basement” trapped under many layers of managers who don’t understand your skills or value. In others you may become a key advisor to the CEO or other senior leaders. Try to pick the latter and understand your position in the organization before joining.

What it takes to get in: a good sales pitch about your abilities, academic credentials, possibly some technical discussions — but unlikely to be as rigorous as in i).

What it takes to succeed: There likely won’t be great systems for getting your new work integrated into the old ways of doing things. You’ll need to overcome this with strong communication and coordination skills.

Putting it all together

There is a large and growing array of opportunities for clever and capable folks with data skills but finding success in these industry positions requires the right mindset going in.

It also requires finding the right job for your personality and your interests. You’ll be infinitely more successful if you’re working in a field where you can bring personal passion, excitement or possibly even special domain knowledge to bear on your work.

It has been a little over three years since Harvard Business Review called data scientist “the sexiest job of the 21st century” in late 2012. So far this appears to be a reasonable call: a few weeks ago The Economist reported that demand for data analysts has grown 372% while demand for data visualization skills (as a specialized field within data analysis) has grown 2,574%.

We will likely see increasing demand for data scientists for years to come as the amount of data available for analysis and the number of automated systems in operation continue to climb.

The biggest winners of the data science bonanza will be technical folks with strong communication skills and a domain of focus.

Good Luck!

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Thank you for reading, please share with anyone you think might benefit from this advice. If you have feedback on the piece please reach out on Twitter.

Thank you to Will and Hillary for reading earlier drafts.