Office Hours

Transitioning from Social Science to Data Science

What You Know and What You Should Know

Danny Kim, PhD

Published in

Towards Data Science

13 min read · Nov 29, 2021

For someone with limited formal training in computer science, engineering, or math, it can be daunting to try to make the leap into data science. It seems like almost every job description calls for someone with a degree in some hardcore field that, while interesting and important, never caught the fancy of those of us in the social sciences who prefer the study of people and societies over the study of cells, chemicals, energy, numbers, or whatnot.

It certainly felt that way to me when I was getting started in data science. After wading through countless job descriptions looking for someone with a degree in something I never studied, it was difficult to not wonder, Is there a place for people with a social science background in data science? or, I didn’t major in a physical science or engineering, can I really make it as a data scientist?

Well, having worked as a data scientist now, I’m happy to say, in response to the above questions, that there is and you can! That said, it can be difficult to grasp at first how your social science background can be useful in data science, and there are definitely some key things that are helpful to be familiar with that you may have never learned in school. I hope this article provides you with insight on how your social science background can help you be a strong data scientist and helps you determine which skills, the ones school never taught you, you should pick up to smooth the transition.

Social Science Background

My own social science training comes through the field of communication, specifically the study of mass communication formats (broadcast media, social media, and so on; entertainment media in my case) and people’s interactions with and responses to them. But the label of social science can be applied to all manner of fields beyond my own, e.g. psychology, sociology, criminology, economics, political science, etc. In fact, graduates of such fields make up a big portion of bachelor’s graduates [1] each year. And among them, you are not the only one looking to break into data science, nor will you be the first to make it, so don’t despair!

Relevance of Social Science in Data Science

It actually shouldn’t be surprising that social science provides an exceptional background for moving into data science. At the core of any social science is investigating how various stimuli or elements of people’s (or groups’) background, thinking, or behavior are associated with those same people’s (or groups’) other thoughts, behaviors, outcomes, or characteristics. Social scientists’ significant experience with such analyses is what gives them their edge: we may take practice in considering the effects of people’s backgrounds, thoughts, and behaviors for granted, but it’s not something most people spend a lot of time doing! Of course, the particulars of the data you work with will depend on the position, but I feel comfortable saying that in many jobs out there, much of what you do will entail investigating and predicting what some type of people do (be it buyers, sellers, viewers, users, interactors, drivers, riders, whomever) and how it varies depending on who they are (e.g. demographics, psychometrics), what they’ve done (e.g. past behavior features), and what they think.

More specifically, experience with data about people, as commonly gained in the social sciences, comes in handy in three particular regards. First and most directly, having thought a lot about demographic and psychological differences in outcomes really helps with designing methodologically rigorous analyses that investigate such differences. This is particularly true of those who received formal training in research design during their schooling and can fluently identify and explain mediators; moderators; threats to internal, external, construct, and criterion validity; and so on, and who can perhaps even design and analyze solid randomized experiments. Your experience considering and accounting for such elements will help you design rigorous analyses that cleanly measure what they claim to measure.

Building directly on this, secondly, coming from social science can be a great help in modeling. Having a sense of the range of demographic and psychometric factors that can affect a particular variable, as well as how such factors can interact with each other, is a valuable skill to have when it comes to constructing regression and classification models. Such intuition can be of use in feature transformation (e.g. knowing certain variables, like income, are usually skewed) and feature selection (when more systematic methods aren’t ideal or feasible), and can also help with specification of more complex models (e.g. hierarchical linear models, structural equation models) as the structure of the data warrants.
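To make the feature transformation point concrete, here is a minimal sketch of what acting on the "income is usually skewed" intuition might look like. The income values are made up for illustration, the skewness helper is my own, and a log transform is only one of several reasonable choices for a right-skewed variable.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score of the values."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return sum(((v - mean) / stdev) ** 3 for v in values) / len(values)


# Hypothetical annual incomes: most are modest, one is very large.
incomes = [28_000, 31_000, 35_000, 42_000, 47_000,
           52_000, 60_000, 75_000, 120_000, 450_000]

# log1p compresses the long right tail while preserving the ordering.
log_incomes = [math.log1p(income) for income in incomes]

print(f"skewness before: {skewness(incomes):.2f}")
print(f"skewness after:  {skewness(log_incomes):.2f}")
```

The transformed variable is still right-skewed in this toy case, just much less so, which is often enough to keep a handful of very high earners from dominating a linear model.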

Lastly, a less obvious way a social science background helps in data science is with regard to feature engineering. Those from social science have no doubt encountered various theories that aim to explain particular phenomena or behavior, often with an accompanying diagram showing how certain constructs directly or indirectly lead to an outcome. Such theories can be useful frameworks for feature engineering, especially if one has experience coming up with clever proxy variables for otherwise difficult-to-measure variables. For example, the integrative model of behavioral prediction [2] posits that intent to perform a behavior is predicted by attitudes (how one feels about the behavior), perceived norms (perceptions of how ‘normal’ a behavior is), and self-efficacy (whether one feels they can perform the behavior). In the context of, say, predicting whether user A will listen to song X, existing consumption data and relevant metadata could be used to generate features that approximate such constructs. For example, user A’s attitude toward song X could be represented by user A’s consumption of songs with high metadata similarity to song X; user A’s norms about listening to song X could be represented by the extent to which users with listening histories similar to user A have listened to song X; and self-efficacy could be represented by the variety of different artists, genres, and nationalities of songs user A has listened to.
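The song example above can be sketched in a few lines of code. Everything here is an assumption for illustration: the toy data, the function names, the use of Jaccard overlap as the similarity measure, and the use of metadata tags as a stand-in for artists, genres, and nationalities. It is one possible reading of the three proxies, not a tested recipe.

```python
def jaccard(a, b):
    """Overlap between two sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: song -> metadata tags, user -> songs played.
song_tags = {
    "song_x": {"indie", "rock", "2010s"},
    "song_1": {"indie", "rock", "2000s"},
    "song_2": {"k-pop", "dance", "2020s"},
    "song_3": {"indie", "folk", "2010s"},
}
listening_history = {
    "user_a": {"song_1", "song_3"},
    "user_b": {"song_1", "song_3", "song_x"},
    "user_c": {"song_2"},
}


def attitude_proxy(user, target):
    """Mean tag similarity between the target and songs the user has played."""
    played = listening_history[user] - {target}
    if not played:
        return 0.0
    total = sum(jaccard(song_tags[s], song_tags[target]) for s in played)
    return total / len(played)


def norm_proxy(user, target):
    """Similarity-weighted share of other users who have played the target."""
    own = listening_history[user] - {target}
    weights, hits = 0.0, 0.0
    for other, songs in listening_history.items():
        if other == user:
            continue
        similarity = jaccard(own, songs - {target})
        weights += similarity
        hits += similarity * (target in songs)
    return hits / weights if weights else 0.0


def self_efficacy_proxy(user):
    """Variety of the user's listening: count of distinct tags encountered."""
    return len(set().union(*(song_tags[s] for s in listening_history[user])))


features = {
    "attitude": attitude_proxy("user_a", "song_x"),
    "norm": norm_proxy("user_a", "song_x"),
    "self_efficacy": self_efficacy_proxy("user_a"),
}
print(features)
```

The target song is excluded from every similarity calculation so that the features do not leak the outcome you are trying to predict.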

All of the above points are especially true if you’re fortunate enough to work in a field where the research is directly applicable to what you do, as is the case for me. Having conducted significant research on media preferences and effects, I feel exceptionally comfortable developing models aiming to predict media consumption at various levels. But even if your field of study doesn’t perfectly project onto your industry, you will over time observe parallels in methodological approaches that you may wish to look into. For example, you might have a lot of experience applying text mining techniques to documents, but then you might realize that such techniques can be reasonably applied in any context where the data can be considered a body of text, not just documents — whether user metadata, song metadata, consumption history, or what not. Keeping an open mind about how certain frameworks from a field you’re familiar with could be applied elsewhere is critical.
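As a rough illustration of the text mining parallel, here is a minimal sketch that treats each user’s consumption history as a ‘document’ and each title as a ‘word’, then applies TF-IDF weighting and cosine similarity. The histories, names, and this particular TF-IDF variant are assumptions chosen for brevity.

```python
import math
from collections import Counter

# Hypothetical viewing histories: one "document" per user, one "word" per title.
histories = {
    "user_a": ["drama_1", "drama_2", "news_1", "drama_1"],
    "user_b": ["drama_1", "drama_3", "news_1"],
    "user_c": ["sports_1", "sports_2", "news_1"],
}


def tf_idf(documents):
    """Weight each item by its in-document frequency and cross-document rarity."""
    n_docs = len(documents)
    doc_freq = Counter(
        item for items in documents.values() for item in set(items)
    )
    weights = {}
    for name, items in documents.items():
        counts = Counter(items)
        weights[name] = {
            item: (count / len(items)) * math.log(n_docs / doc_freq[item])
            for item, count in counts.items()
        }
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(item, 0.0) for item, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tf_idf(histories)
print("a vs b:", cosine(vectors["user_a"], vectors["user_b"]))
print("a vs c:", cosine(vectors["user_a"], vectors["user_c"]))
```

A title that everyone has consumed (news_1 here) gets a weight of zero, just as a word that appears in every document would, so the similarity between users is driven by their more distinctive choices.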

The Nuts and Bolts: Key Technical Skills

But all your schooling and wit won’t take you anywhere in data science if you don’t have a certain set of key technical skills. Especially with all the different ‘specializations’ you see touted on various jobs pages (experimentation, inference, machine learning, visualization, and so on), it can be difficult to get a sense of which tools you really want to have a good grasp of. Certain jobs will definitely call for heavier application of certain techniques, but I do feel there is a ‘core’ toolkit that all data scientists are expected to be familiar with.

Spoiler alert: Practically all such skills involve some element of coding. I understand the idea of coding can be daunting for some, but the bottom line is that if you want to be in data science, you have to get comfortable with coding and the trials and tribulations that come with it. Take a course, get a certificate, do whatever you need to do to get comfortable with the basics. A good place to start might be to Google for a tutorial on a data project that sounds interesting and go from there. No matter where you start, just know that Googling problems and troubleshooting with the help of Stack Overflow and similar sites, where other people have run into the same issue, is a timeless tradition that even professionals practice all the time, and nothing to be frustrated by. So let’s dig into what I think are the key skills and concepts you should be familiar with.

SQL: This really tripped me up when I was on the job market. None of my classes taught it, and I’ve also heard from industry folks that it’s a skill people coming straight from academia often lack. Sure, piecing together and transforming data may sound boring, but the number of data-related job descriptions that call for SQL is really high. Plus, on a day-to-day level, there are often a lot of quick and dirty analyses that can be done directly in SQL to save time. At the same time, SQL is one of those things that’s hard to gain functional practice with unless you have some kind of live database to practice on. If you don’t have one, try to get comfortable with core dataframe operations like selecting, joining, filtering, and aggregating in other contexts like Pandas in Python or dplyr in R. With the right libraries, you may even be able to practice SQL in Python/Pandas [3] or in R [4].
Python: Also, I love R as much as the next person, but Python is really the lingua franca of DS languages as far as I can tell. Python is far more versatile than R outside of the rare analysis where a particular function you need isn’t as robustly implemented in Python (and just forget about SPSS/Stata/SAS). When it comes to serious data science work, especially that which needs to be put into production (i.e. run regularly in an automated, robust, stable manner), Python is second to none at least at time of writing in 2021.
Spark: At least a basic familiarity with the Spark framework — how distributed computing, compute clusters, and so on work — is really critical when it comes to working with big datasets. It’s easy to think 32GB or even 16GB of RAM on some local workstation is enough for chunky datasets, but you will undoubtedly encounter a situation where working on a single machine with a massive dataset loaded dreadfully slowly into memory is simply not feasible. This is another one of those cases where it’s difficult to gain practice with the tool without a live deployment, but at least be familiar with the underlying concepts if you can.
Statistics, hypothesis testing: People love to ask you to compare sets of data and tell them whether they’re significantly different, often without really understanding what statistical significance means. As the data expert, you yourself definitely want to be familiar with hypothesis testing and basic statistics to know which tests are appropriate to use in which scenarios as well as to be able to competently explain the outcomes of statistical tests to others.
Classification/regression: I feel like people often like to focus on classification over regression for some reason. Maybe it sounds a little glitzier when you can talk about precision and recall and some model can identify whether something is X or not, but let’s be real here: a lot of times your model-building will be for the purpose of predicting values, not labels, thus the importance of regression. Relatedly…
‘Core’ machine learning algorithms: No matter the type of data scientist you are, I’d say these days some familiarity with the mechanics of the ‘core canon’ of machine learning algorithms like LASSO, random forest, SVM, boosting, etc. is necessary. Regardless of whether you’re looking to do classification or regression, you will sound outdated if the only techniques in your toolbox are multiple regression and logistic regression. Such algorithms, many of them nonparametric, are great when you don’t care as much about understanding the underlying relationships between the predictors and the outcome and just want to try to maximize predictive performance. That said…
Linear models: Although such black box algorithms are great, when you want to make inferences about patterns in the data, linear models are still arguably the most useful. So having some fundamentals in techniques like multiple regression, multilevel regression, logistic regression, and so on is important. Plus, you’d be surprised just how far relatively simple methods like multiple regression and logistic regression can get you in a lot of cases, at times performing even better than the new-fangled methods mentioned above.
Basic dimension reduction, clustering: Both of these methods are useful for visualization at the very least. Dimension reduction is also useful when you don’t want too many interrelated predictors in a model, and similarly, clustering can help you ascertain patterns in the data so that you can label chunks of data as belonging to the same category. Principal components analysis and exploratory factor analysis are good places to start for dimension reduction, and k-means and DBSCAN for clustering.
Clean coding technique: Lastly, this was another one I struggled with because I never had any proper computer science training or experience: it’s important to be able to write clean code. When you write code for work, you have to remember that you are not the only one who will be reading the code. You can’t just use acronyms and abbreviations that make sense only to you all the time, and you have to have some consistent and logical structure to the way you structure your code, name variables, and space out text. Get used to ordering your analyses in a clean, straightforward flow; naming variables very clearly (as opposed to ‘df,’ ‘test_df,’ ‘df_1,’ or the like); and perhaps checking out code style guides like PEP 8.
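To make the SQL item above a bit more concrete, here is a minimal sketch of practicing the core operations (selecting, joining, filtering, aggregating) without a live deployment, assuming Python’s built-in sqlite3 module and a throwaway in-memory database. The tables, columns, and values are made up for illustration, and SQLite’s dialect will differ in places from whatever database you end up using at work.

```python
import sqlite3

# An in-memory database: nothing is written to disk and no server is required.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE plays (user_id INTEGER, song TEXT, minutes REAL);

    INSERT INTO users VALUES (1, 'US'), (2, 'US'), (3, 'KR');
    INSERT INTO plays VALUES
        (1, 'song_x', 3.5), (1, 'song_y', 4.0),
        (2, 'song_x', 3.0), (3, 'song_z', 5.0), (3, 'song_x', 0.5);
    """
)

# Select, join, filter, and aggregate in a single query:
# listening totals per country, ignoring plays shorter than one minute.
plays_by_country_query = """
    SELECT u.country, COUNT(*) AS n_plays, SUM(p.minutes) AS total_minutes
    FROM plays AS p
    JOIN users AS u ON u.user_id = p.user_id
    WHERE p.minutes >= 1.0
    GROUP BY u.country
    ORDER BY total_minutes DESC
"""
plays_by_country = conn.execute(plays_by_country_query).fetchall()
conn.close()

for country, n_plays, total_minutes in plays_by_country:
    print(country, n_plays, total_minutes)
```

The same four operations map directly onto Pandas (column selection, merge, boolean filtering, groupby), so practicing them in one setting tends to carry over to the other.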

Book Recs

A big part of being a data scientist is constantly learning new concepts and methods. You’d quickly go broke if you took a paid course for everything you have to learn, so oftentimes your go-to resources will be online tutorials and the like. I find books to be a happy medium between courses and tutorials, effectively a set of tutorials structured like a course. Here are a few books relating to the above that I found useful as I prepared to enter data science, both to pick up new material and to brush up on concepts:

SQL in 10 Minutes (Forta): The first book I used to start learning SQL, providing a series of lessons with a nice difficulty progression.
SQL Practice Problems (Vasilik): Once I picked up the basics with the Forta book above, I used this book to practice writing SQL queries.
An Introduction to Statistical Learning with Applications in R (James et al.): Yes, it’s an R book, but I think this book presents some of the most accessible explanations of common machine learning algorithms.
Experimental and Quasi-Experimental Designs for Generalized Causal Inference (Shadish et al.): Not a technical book, but a very good, if dry, tome on designing methodologically rigorous studies.

Some General Parting Advice

Okay, so far, I’ve given you an idea of how your social science background can be valuable in data science, outlined the tools and concepts you want to be familiar with to help make a successful transition to data science, and listed a few books you might find useful as you get started. In closing, I have some general advice for social scientists aspiring to enter data science that felt a little too broad to include in the previous sections.

First, no matter what field you choose to go into as a data scientist, make sure you understand the field as best you can. It can be assumed that many of the candidates for any given data science position have hugely overlapping technical skillsets, and clear domain knowledge pertaining to the industry at hand can push you over the top compared to other candidates. Think about it from the manager’s perspective: when everyone has a roughly similar skillset, wouldn’t you rather hire someone who understands the concepts and metrics to which those skills would be applied?

Second, never, ever be afraid to think ‘weirdly’. This is another area where I think coming from social science really helps, because there are often so many theories that aim to explain the mechanism driving various outcomes, and the process of innovating often entails making a ‘weird’ connection between two things that hasn’t been made before. If you feel like there’s a connection between two things, or a method from one context that you think could be applied in another, don’t hesitate to look into it or bounce the thought off of others. I don’t mind airing the occasional thought that makes me sound like a bit of a lunatic, because I know that every once in a while, one of those comments will get at something really insightful or novel.

And lastly, no matter where you go and what you do, pursue your passions. There’s a Ralph Waldo Emerson quote that’s long been a favorite of mine:

The voyage of the best ship is a zigzag line of a hundred tacks. See the line from a sufficient distance, and it straightens itself to the average tendency. Your genuine action will explain itself, and will explain your other genuine actions. Your conformity explains nothing. Act singly, and what you have already done singly will justify you now. [5]

I never grew up wanting to be a data scientist, but at every step I pursued my passions, drawing the zigs and zags of my ship through my genuine actions. Looking back now, I can see them straightening to their average tendency, enabling me to see how everything I’ve done has brought me to where I am now. I hope that as you progress in your career, you pursue your passions and draw the zig zags of your own best ship.

Signing Off

That’s all from me, folks! I hope this advice is useful to you; I really tried my best to encapsulate here all I wish I’d known when I was on the job market as a social scientist aspiring to be a data scientist. If you’d like to connect and chat more, by all means feel free to add me on LinkedIn. Best of luck in your journey toward becoming a data scientist!

At time of writing, Danny Kim (PhD, University of Pennsylvania) is Senior Data Scientist at Whip Media and Visiting Researcher at the University of Southern California Annenberg School for Communication and Journalism.

References

[1] NCES, Most Popular Majors (2021), https://nces.ed.gov/fastfacts/display.asp?id=37

[2] M. Fishbein, A Reasoned Action Approach to Health Promotion (2008), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2603050/

[3] A. Bivona, How to Use SQL in Pandas (2020), https://towardsdatascience.com/how-to-use-sql-in-pandas-62d8a0f6341

[4] G. Grothendieck, sqldf: Manipulate R Data Frames Using SQL (2017), https://cran.r-project.org/web/packages/sqldf/index.html

[5] R. W. Emerson, Self-Reliance (1841), https://www.gutenberg.org/cache/epub/16643/pg16643-images.html#SELF-RELIANCE