How not to jeopardize a data scientist’s productivity

Yaekyum Lee · Published in Kaldea · 7 min read · Dec 8, 2022
Where analytics productivity stalls
The battle between docs, processes, and tribal knowledge

As a popular adage has it, data scientists spend 80% of their time preparing data and only 20% building models. If only this were true! In reality, the 80–20 split applies only to time spent productively, while the rest is wasted on repeatedly asking and answering the same questions, hunting for necessary information, or trying to break out of knowledge silos.

The recent Stack Overflow Developer Survey sheds interesting light on productivity frictions among developers, including data scientists, data and business analysts, and scientists. Over 36,000 responses analyzed in the Productivity Impacts part of the survey suggest how to keep data professionals productive. Let’s take a look at the most important conclusions.

More data tools = more silos

Almost 68% of the Stack Overflow survey’s respondents say they encounter a knowledge silo at least once a week, and some run into one more than ten times a week. At the same time, nearly half of all respondents report that knowledge silos prevent them from getting ideas across the organization.

The Cambrian explosion of data tools has advanced every feature set: discovery, observability, modeling, and so on. However, it has also created more knowledge silos than ever before. How many tools do you have to use to get through a single analytics process today?

Today, the only way these silos get bridged is through ‘you’.

Navigating tribal knowledge

Have you ever found yourself looking at a dataset and wondering what the heck these column names mean? Who created it, and why are there four different derivatives of it? If you are lucky, Datahub or your taxonomy page in Notion or a spreadsheet may be up to date.

Sometimes you are seeking a little business context and domain knowledge to understand the dataset further: what your colleagues have discussed about it, what analysis has been done with it before, and so on. What do you do? You have no choice but to talk to the colleague in charge, but what if they are no longer there?

The Stack Overflow survey authors calculated that for a team of 50, the time spent navigating tribal knowledge adds up to between 333 and 651 hours lost per week across the entire team. How much time is tribal data knowledge costing your team’s productivity?
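To make that range concrete, a quick back-of-the-envelope translation into hours per person (our arithmetic, not the survey’s):

```python
# Back-of-the-envelope: hours lost per person per week,
# using the survey's team-wide range for a 50-person team.
team_size = 50
low_hours, high_hours = 333, 651

per_person_low = low_hours / team_size    # ~6.7 hours/week
per_person_high = high_hours / team_size  # ~13.0 hours/week

print(f"{per_person_low:.1f} to {per_person_high:.1f} hours "
      "per person per week navigating tribal knowledge")
```

In other words, every team member loses roughly one to one and a half full working days per week.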

Down the road to too many questions > documentation > process

Most of the time, you do not even know where to search thoroughly, or whether there is anything to be searched at all, so communication becomes inevitable. Questions do not only interrupt workflows; in a remote environment, they are extremely expensive, because a simple question on Slack can cost an organization a full day of turnaround time (i.e., delays).

“Hey, what was this query that you used in last week’s retention numbers?” You have three options here.

  1. Write it yourself.
  2. Ask your colleague.
  3. Look through 1,000 saved queries on Redash, hoping it is there (a small script, sketched below, can take the edge off this).
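If option 3 is where you usually end up, a little tooling helps. Here is a minimal sketch that assumes your team exports saved queries as .sql files into a shared folder; the folder layout and header convention are assumptions, not a Redash feature:

```python
# Sketch: keyword-search a folder of exported .sql query files.
# Assumes queries are dumped to ./queries/ with a descriptive
# comment on the first line, e.g. "-- weekly retention by cohort".
from pathlib import Path

def search_queries(keyword: str, query_dir: str = "queries") -> list[Path]:
    """Return paths of saved queries whose text mentions the keyword."""
    matches = []
    for path in Path(query_dir).glob("*.sql"):
        if keyword.lower() in path.read_text(encoding="utf-8").lower():
            matches.append(path)
    return matches

for hit in search_queries("retention"):
    print(hit)
```

It is crude, but it turns “scroll through 1,000 queries” into a one-line search.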

Imagine your next team meeting: everyone brings up how much time is wasted answering repetitive questions, and someone suggests creating a documentation or record-keeping process to run each month or quarter. We all know where that goes. Documentation is a great way to keep track of things, but it is not a great system, because the documentation process requires people to be the system’s engine.

Dependency disguised as collaboration

Over 90% of the Stack Overflow survey respondents report interacting with people outside their immediate team at least once a week. Over 75% admit they need help from them. There is nothing wrong with getting help or collaborating with different teams. Still, there is a fine line between collaboration and dependency, and it is easily crossed.

Waiting for the other team to finish what they are working on, or to explain to you how to use it, is a significant productivity friction for data scientists. Imagine you’ve come up with a brilliant feature engineering method that should make it much easier for your machine-learning model to learn. You would like to test it in production by directing some users to the new model. To do that, you need the data engineering team to update the production pipeline. Now your workflow has stopped, and to stay productive, you move on to start a different analysis or ticket in parallel.

But this dependency relationship does not impact you alone. On the other end, the same thing is happening: the data engineering team is re-evaluating the priority of your request versus what was being worked on or has to be worked on within their team.

Resolve productivity frictions by relying on an analytics system

Review your analysis workflow

How does your analysis workflow start and end today? When we interviewed 200+ data scientists and analysts across America, we found that most teams go through a process that resembles the following:

  1. Initiation: Receive tickets (process/team-assigned or self-assigned) through Jira, review ticket details, and quickly search for related analysis done previously.
  2. Define the scope: Send a Slack message for clarification, or set up a call, depending on how much clarification the need and the definition of what is to be analyzed require.
  3. Locate the data source: Move to internal documents or catalog solutions (e.g., Datahub, Acryl Data, Data.world) to find the right table.
  4. Verify the data source: Write a simple query to check that the table is good to go (see the sketch after this list).
  5. Check for downstream impact: Move to internal documents or catalog solutions to check on downstream tables.
  6. Avoid reinventing the wheel: Check the team’s query repository (e.g., BigQuery, Redash, docs, Slack) for existing queries that can be reused, and check for consistency.
  7. Port results into Slides, Notion, Looker, Tableau, a data app (Hex, Domo, etc.), spreadsheets, or docs, and share.
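Step 4 is the easiest to standardize. Below is a minimal sketch of such a sanity check; sqlite3 stands in for your warehouse client, and the table, columns, and checks are hypothetical:

```python
# Sketch: sanity-check a candidate table before building on it.
# sqlite3 stands in for your warehouse client; the table and
# checks are hypothetical, so adapt them to your own warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_events (user_id INT, event_date TEXT)")
conn.execute("INSERT INTO user_events VALUES (1, '2022-12-01'), (2, '2022-12-05')")

checks = {
    "row_count": "SELECT COUNT(*) FROM user_events",
    "null_user_ids": "SELECT COUNT(*) FROM user_events WHERE user_id IS NULL",
    "latest_event": "SELECT MAX(event_date) FROM user_events",
}

# A zero row count, unexpected nulls, or a stale latest date all
# mean the table is not good to go.
for name, sql in checks.items():
    print(f"{name}: {conn.execute(sql).fetchone()[0]}")
```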

We heard common difficulties across the interviews:

  1. The current process is complex and difficult to get through, though not impossible.
  2. It requires more verbal and written communication than expected.
  3. Search does not work well.
  4. It is difficult to reuse a colleague’s previous work without having a meeting about it.
  5. Process improvements haven’t worked, and many accept things as they are.

Any documentation or clean-up process?

There were a few common documentation processes we identified across companies:

  1. Taxonomy: On spreadsheets or Notion, where commonly used and highlighted tables are defined and explained. Teams utilize the sheet through Ctrl+F.
  2. Ticket efficiency and accuracy: Iterating to find the minimum required category and content for a ticket submitted to the data teams, so that back and forth is minimized. It also ensures a proper search effort was made before the ticket was submitted.

Common clean-up process

  1. Catalog: New data engineer hires update the table catalog and work as a bridge between the analysts/scientists and data engineering.
  2. Queries: A title-and-description requirement for saving queries in BigQuery, Redash, and other query-sharing tools (an automated version of this check is sketched below).
  3. Reports: Report archiving in shared folders to reduce redundancy.
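These rules only hold while someone enforces them, which is exactly where they tend to break down. If saved queries live in version control, the title-and-description rule from item 2 can be enforced automatically; this is a sketch, and the two-line header convention is an assumption:

```python
# Sketch: reject saved .sql queries that lack a title/description
# header. Assumes version-controlled files whose first two lines are:
#   -- title: weekly retention by cohort
#   -- description: used in the Monday growth report
import sys
from pathlib import Path

def missing_header(path: Path) -> bool:
    head = [l.strip() for l in path.read_text(encoding="utf-8").splitlines()[:2]]
    return not (len(head) == 2
                and head[0].startswith("-- title:")
                and head[1].startswith("-- description:"))

bad = [p for p in Path("queries").glob("*.sql") if missing_header(p)]
for p in bad:
    print(f"missing title/description header: {p}")
sys.exit(1 if bad else 0)
```

Run as a pre-commit hook or CI step, it takes the human out of the enforcement loop.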

The unfortunate truth is that we all know where documentation and process-driven efficiency projects end up. Data is not the only domain where teams have failed to run a system that depends on human diligence.

Connect your core analysis workflow

The best way to protect data scientists’ and analysts’ productivity is to equip a system in which the entire core analysis workflow, from discovery to query to reporting, is connected and available in one place. Such systems go by different names in different places. The internally built versions at companies like Airbnb, LinkedIn, Uber, and Netflix are called ‘data portals’; vendors that provide such systems call them unified analytics platforms. Airbnb and Neo4j have a good Slideshare that details their approach.

The challenge with data portals at Airbnb, Uber, and similar companies is that it is nearly impossible for most companies to adapt, let alone replicate, such a product internally. However, the experience with a connected analytics platform from discovery to reporting is too good to forego.

Two paths we recommend:

  1. Buy: One simple path is to try Kaldea or a similar product that provides a unified analytics platform containing all the core components of an analysis workflow: a catalog, a query editor, a query management system, reporting and visualization, jobs, and ACL.
  2. Build: The other option is to develop internally, component by component, slowly expanding to cover most of the workflow. Be warned that it will be nearly impossible to replicate what the tech giants have done unless you can dedicate a team of 10+ product and engineering folks to develop and maintain the system. Based on what we have seen so far, we recommend starting with the two easier parts: 1) a query management and sharing portal and 2) a table discovery portal (sketched below).
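To make the ‘build’ path concrete, here is a sketch of the heart of a table discovery portal: a searchable index of table and column metadata. In a real build, the rows would come from your warehouse’s information_schema; the sample rows and names below are hypothetical:

```python
# Sketch: a tiny searchable index of warehouse tables and columns.
# In practice, populate `rows` from a query such as:
#   SELECT table_schema, table_name, column_name
#   FROM information_schema.columns
from dataclasses import dataclass, field

@dataclass
class TableInfo:
    schema: str
    name: str
    columns: list[str] = field(default_factory=list)

rows = [  # hypothetical sample metadata
    ("analytics", "user_events", "user_id"),
    ("analytics", "user_events", "event_date"),
    ("analytics", "retention_weekly", "cohort_week"),
]

index: dict[tuple[str, str], TableInfo] = {}
for schema, table, column in rows:
    index.setdefault((schema, table), TableInfo(schema, table)).columns.append(column)

def search(term: str) -> list[TableInfo]:
    """Match tables by table name or by any column name."""
    term = term.lower()
    return [t for t in index.values()
            if term in t.name.lower()
            or any(term in c.lower() for c in t.columns)]

for t in search("retention"):
    print(f"{t.schema}.{t.name}: {', '.join(t.columns)}")
```

Wrap the index behind a small web UI and you already have something better than Ctrl+F on a spreadsheet.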

Curious to find out more about what can be done? Talk to us, and we will be your discussion partner, even on your journey to internally develop your own unified analytics system.
