The Plight of the Frustrated Data Scientist
A few weeks ago, we wrote about why we were building and distributing our Data Science Communicator Toolkit. Part of that initiative was collecting information from people who work with data, so we could shape the toolkit to bridge the communication gaps between them and their colleagues. We found some interesting results and are excited to share them here, along with recommendations for clearing the roadblocks your organization faces on its way to becoming more data-driven.
We asked for demographic information about industry, company size, and job title. Out of 226 respondents, over half identified themselves as data analysts or data scientists, but the rest ranged widely, from CEOs to consultants and psychologists. It seems that, nowadays, there’s not a field that data science doesn’t touch.
Respondents were also asked what they wished their managers or leadership knew about data science (note that respondents could select multiple options). Overwhelmingly, people who work with data or use data in their work want their leadership to understand the power and limitations of data science, as well as how they can leverage it to better serve their team.
We received a few “Other” responses, one of which was “Anything dealing with a database or spreadsheet is not Data Science.” This short sentence, tinged with frustration, was a harbinger of responses to come.
We asked an open-ended question: “What is the biggest misconception your manager has about your job?” While we saw some patterns in the answers, we got a range of responses, including:
“That I am an actual magician”
“How time-consuming it is to prepare a data file to begin the analysis to tell the data story”
“That the speeds in which data can be analyzed are comparable to a hundred metre dash”
The “magician” comment was not the only one of its kind: 11 of the 226 responses included the word “magic.” It seems that about 5% of our respondents are office wizards. Most of the open-ended responses centered on one misconception: how long it takes to collect and clean the data, and then run the analysis. More often than not, managers believe that “the data are easily available” and “that data is clean and usable for analysis without scrubbing.” It is a rare case when both of those statements hold true.
When we asked our respondents what percentage of their time they spend explaining their analyses and answering simple data questions, the mean was around 30%, with a median of 25%. Two respondents wrote “999” and “909.” These answers were omitted from the graph below, although I am curious to find out how these analysts managed to manipulate the time-space continuum to allow for such a high number.
Time travel aside, 30% of anyone’s time is a substantial amount to spend on communicating the results of an analysis or helping colleagues understand data. And with a median of 25%, half of our respondents spend at least a quarter of their time this way. In fact, data literacy and data standards are key areas that can accelerate or hinder an organization’s path to becoming data-driven.
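For readers who want to try this on their own survey export, here is a minimal Python sketch of how answers like “999” might be screened out before computing a mean and median. The values in `raw_answers` are invented for illustration and are not our survey data, and the helper name `valid_percentages` is our own. The sketch assumes that only answers between 0 and 100 count as valid.

```python
from statistics import mean, median

# Hypothetical free-text answers (NOT the survey's real data).
raw_answers = ["10", "25", "60", "999", "35", "909", "20"]

def valid_percentages(answers):
    """Keep only answers that parse as a number between 0 and 100."""
    kept = []
    for answer in answers:
        try:
            value = float(answer)
        except ValueError:
            continue  # skip text that is not a number at all
        if 0 <= value <= 100:
            kept.append(value)
    return kept

cleaned = valid_percentages(raw_answers)
print(f"kept {len(cleaned)} of {len(raw_answers)} answers")
print(f"mean = {mean(cleaned):.1f}%, median = {median(cleaned):.1f}%")
```

Reporting both numbers is a deliberate choice: a few very high but valid answers pull the mean upward, while the median tells you what the respondent in the middle reported.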
“Garbage in, garbage out”
The truth about data science is that it is useless without the right data. If you have bad data going into an algorithm, you’ll have bad results on the other side — garbage in, garbage out. In order to reduce the amount of time that data scientists and analysts spend on data cleaning, an organization needs to lay the groundwork before any analyses are performed.
Here are the best practices we recommend for data collection and storage:
1. Define the problem you want to solve or the question you want to answer. By defining what you want your outcome to be, you can focus on which metrics you will need to collect. One of the best ways to avoid future issues is to clearly outline what you’re looking for.
2. Collect and store individual records, not aggregate numbers. This means that each individual sale is one row, rather than several sales bundled into a single total. This will allow you to slice and dice the data at a more granular level and find patterns that you may have otherwise missed.
3. Make sure column names and formats are standardized. In order to combine spreadsheets and find insights from disparate data sources, you have to be able to merge them on a shared column. This could be a user ID, product ID, or something else. Create a data dictionary that records how you name each metric you’re tracking, and make sure the format is the same across the board so that merging is less of a heavy lift.
4. As you’re collecting data, make sure that there is variance in it (A/B tests, for example). If you’re looking to make marketing decisions, make sure to collect data about many products, different customer types, etc. If all the phone calls or e-mails in the study are placed at the same time on the same day of the week, it will be hard to analyze the effect of timing after the fact. Look at the distributions of the data and the correlations between variables the whole time you’re collecting. They are early signals of how your data is behaving.
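To make practices 2 through 4 a little more concrete, here is a small Python sketch using only the standard library. Everything in it is hypothetical: the `sales` and `products` tables, their column names, and the helpers `standardize`, `join_on`, `total_by`, and `constant_columns` are invented for illustration, and a real pipeline would likely use a database or a data-frame library instead.

```python
from collections import defaultdict

# Hypothetical exports from two systems; every name and value is invented.
sales = [  # one row per individual sale, not a pre-summed total
    {"Product ID": "P1", "amount": 20.0, "weekday": "Mon", "channel": "email"},
    {"Product ID": "P2", "amount": 35.0, "weekday": "Mon", "channel": "email"},
    {"Product ID": "P1", "amount": 20.0, "weekday": "Thu", "channel": "email"},
]
products = [
    {"product_id": "P1", "category": "books"},
    {"product_id": "P2", "category": "games"},
]

def standardize(row):
    """Rename columns to one convention: lowercase with underscores."""
    return {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}

def join_on(left, right, key):
    """Attach the matching right-hand row to each left-hand row via a shared key."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]

def total_by(rows, column, value="amount"):
    """Aggregate on demand; the individual records stay untouched."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[column]] += row[value]
    return dict(totals)

def constant_columns(rows):
    """List columns holding a single distinct value, i.e. with no variance."""
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]

sales = [standardize(row) for row in sales]
products = [standardize(row) for row in products]
merged = join_on(sales, products, "product_id")

print(total_by(merged, "category"))  # slice by product category
print(total_by(merged, "weekday"))   # or by day of the week, from the same rows
print(constant_columns(merged))      # columns you cannot analyze later
```

Because each sale is kept as its own row, the same records can be summed by category or by weekday. The last check flags `channel`, which has the same value in every row, so these rows cannot tell you anything about how channel affects sales.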
If you want to learn more about building a data-driven culture in your workplace, you can fill out our 5-minute survey to download our free Data Science Communicator Toolkit and become a data champion (cape not included)!