Sharing & Scaling Knowledge

True or false? Small data science teams struggle to share knowledge and build from each other’s work.

When it comes to data analysis, the knowledge sharing problem is about more than team size and communication skills: it’s nearly impossible to communicate the nuance around data in a concise conversation, or to document it in a way that is comprehensive without being overbearing.

So we believe the answer is true, and a resounding true if the team is large.

The value of knowledge sharing is pretty simple to define: if your team shares knowledge, it makes fewer mistakes, wastes less time, and produces better products, because it can leverage historical work more effectively.

Knowledge breeds productivity.

Let’s look at the questions a team member on a team that doesn’t share knowledge has to ask when given a new task:

  • Has anyone worked on something similar to this before?
  • Where are the steps you used to produce your results?
  • Are these the only data I should be using to find my answer?
  • How did you get access to these data — can I?
  • What do these columns mean?
  • What do these values mean?
  • Where can I find other potential data sources to use?
  • Why did you do it this way?
  • There’s something wrong with these data, who can I tell about it?
  • I need to do some pretty common upstream translation steps; haven’t we done this before?
  • Do you think this algorithm makes sense?
  • I think I did something ok in the time I had, now what?

Now questions from a team with knowledge sharing:

  • [This page intentionally left blank] — actual work happening

Yes, you get the value, but can’t you just keep track of knowledge in git? 
 
Maybe some of it, if you’re diligent and have some very intensive process and review policies. There are things, however, that don’t fit nicely into git, namely the data-specific questions: do they belong in a git project? How would you know what to search for? Git also assumes everyone has access to the data source (or worse, that everyone uses the same connection). Yes, there are data dictionary tools, but they don’t pull the data part together with the code/analytic/problem part.

It doesn’t have to be this hard — at the end of the day, you need a dedicated tool to not only distribute knowledge, but more importantly scale it in an unobtrusive and natural way.

In addition to our Immuta data fabric, we have our data source knowledge viewer. The data fabric makes it easy to create and share data sources and to apply security policies to them. The data source knowledge viewer allows knowledge to be captured easily and, more importantly, discovered quickly.

Here’s a look at some data knowledge sharing within Immuta:

Note the SQL statement on the left: this is literally an entire knowledge page dedicated to a SQL statement, or, as we like to call it, a “data source”. It includes an editable description and a wiki explaining why the data source was created; tags, organization, and category are also exposed for coarse-grained search.
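
To make that concrete, here is a minimal sketch of the kind of metadata such a knowledge page captures. The field names and example values below are illustrative only, not Immuta’s actual schema or API:

    # Illustrative sketch of a "data source" knowledge record.
    # Field names and example values are hypothetical, not Immuta's schema.
    from dataclasses import dataclass, field

    @dataclass
    class DataSourceKnowledge:
        name: str                 # human-readable name of the data source
        sql_statement: str        # the SQL statement the page is dedicated to
        description: str          # editable description of why it was created
        wiki: str                 # longer free-form notes and context
        tags: list = field(default_factory=list)   # fine-grained search terms
        organization: str = ""    # coarse-grained search facet
        category: str = ""        # coarse-grained search facet

    example = DataSourceKnowledge(
        name="weekly_active_users",
        sql_statement="SELECT user_id, week, events FROM analytics.usage",
        description="Built for the Q3 retention analysis.",
        wiki="Counts exclude internal test accounts; see the filter in the SQL.",
        tags=["retention", "usage"],
        organization="Growth",
        category="Product Analytics",
    )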

If we dive into the data dictionary tab:

You can see, at a glance, what columns exist, their types, and details about the distribution of the data. Helpful, but more importantly, each column can be commented on, assigned an issue, or have its sample data exposed. We also have experts: they can be assigned explicitly, but more interestingly, they can be discovered based on users’ activity against these columns. You can also assign experts and issues to the overall data source. In the queries tab you can see the queries run against this data source, including the most popular queries and the users who ran them.
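
As a rough illustration of the per-column information a data dictionary surfaces (name, type, and a few distribution details), here is a sketch in plain pandas against a local DataFrame; it is not Immuta’s implementation, just the same idea:

    # Sketch: profile each column of a DataFrame the way a data dictionary
    # might summarize it (type, null counts, distinct values, value range).
    import pandas as pd

    df = pd.DataFrame({
        "user_id": [1, 2, 3, 4],
        "week": ["2024-01", "2024-01", "2024-02", "2024-02"],
        "events": [10, 3, 7, 12],
    })

    profile = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "distinct": df.nunique(),
        "min": df.min(),
        "max": df.max(),
    })
    print(profile)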

So far we’ve already knocked off all of these questions:

  • Are these the only data I should be using to find my answer?
  • How did you get access to these data — can I? (using our data fabric)
  • What do these columns mean?
  • What do these values mean?
  • Where can I find other potential data sources to use?
  • There’s something wrong with these data, who can I tell about it?

There are more questions to knock off, so let’s keep going!

Now for the analytics, which we haven’t really talked about yet. We have Jupyter notebooks embedded directly in Immuta. When you spin up a new “script”, you can describe what it does and attach it to the data sources it uses. So you can either start from the analytics, searching through already-created “scripts” and working backwards to the relevant data sources, or start from the data sources and work forwards to the relevant scripts. You can even comment on and submit issues against your scripts for others to review; those are retained as the scripts are made public and shared, preserving the “thought process history”.
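
For a sense of what such a “script” might contain, here is a hedged sketch: a short header recording which data sources the notebook depends on, followed by an ordinary SQL query against them. The connection string, table name, and the USED_DATA_SOURCES convention are all hypothetical, not Immuta’s actual interface:

    # Hypothetical notebook "script": declare which data sources it uses,
    # then query them over a standard SQL connection.
    import pandas as pd
    import sqlalchemy

    # The data sources this script depends on, so others can find it later.
    USED_DATA_SOURCES = ["weekly_active_users"]

    # Hypothetical connection string; in practice this would point at wherever
    # your governed data sources are exposed over SQL.
    engine = sqlalchemy.create_engine("postgresql://analytics-host/immuta")

    weekly = pd.read_sql(
        "SELECT week, SUM(events) AS events FROM weekly_active_users GROUP BY week",
        engine,
    )
    print(weekly.sort_values("week"))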

So now we’ve killed:

  • Has anyone worked on something similar to this before?
  • Where are the steps you used to produce your results?
  • Why did you do it this way?
  • I need to do some pretty common upstream translation steps; haven’t we done this before?
  • Do you think this algorithm makes sense?

Now one question remains:

  • I think I did something ok in the time I had, now what?

Well, first, they did something powerful, or they avoided a dead end before it cost too much time, not just something “ok”. That’s the payoff of knowledge sharing: the full allotment of time was spent building the logic of the algorithm, visualization, or predictor, not hunting for knowledge. What was built can also be reused by others and discovered easily. Knowledge shared, knowledge scaled.

But what if what you created needs to be more scalable than a script or notebook? How do you put it into production? That’s what our reactive engine is for, but that’s too much for this blog.