Data Studio @ Pluralsight

Why knowledge sharing matters in data science, and how we get better at it

Our Journey

When I joined the data team at Pluralsight back in 2015, there were only three data scientists, one shared data science server, and one GitHub repo. Knowledge sharing wasn’t a real challenge: we could almost always find what we needed by searching the code base, pinging colleagues on Slack, or browsing our shared Google Drive. PowerPoint decks were still how we communicated with stakeholders outside our discipline, with the occasional static data science notebook if the audience cared about the analysis flow.

The trajectory of our data science team is similar to that of other successful, data-driven companies: we grew dramatically. Over the past years we have added almost 30 data scientists and machine learning engineers, providing both strategic decision support and product support for the entire company. Our data practitioners on the Experience side are embedded with multiple cross-functional product teams, working side by side with product managers and engineers every day.

Our knowledge sharing strategy adapted and improved along the way. In 2016, we adopted Shiny Server Pro, a product made by RStudio to publish and host Shiny applications. Being able to demonstrate ideas and models in an interactive way was a big step forward in sharing and democratizing data science: algorithms became less intimidating and easier to appreciate when presented in a dashboard or web application. Gradually, our Shiny server became a centralized place to store data science outputs, and thanks to its OAuth authentication, we could serve it outside of the data VPN so everyone in the company had access to our work.

We referred to Shiny Server Pro as our data science platform, but it was far from ideal. The UI was pretty shabby (if a blank page with lots of hyperlinks could be called a UI) with almost zero discoverability, the deployment process was clunky, and it had no native support for Python. We worked around the limitations until sharing became too much of a burden: too many publishers had to share the same shiny service account, which caused version conflicts and problems with package management and environment configuration.

In 2019, we upgraded from Shiny Server Pro to RStudio Connect after a thorough evaluation, and later officially branded it as “Pluralsight Data Studio”. Why did we choose RSConnect as our publishing and knowledge sharing platform? Because it has a friendly UI for a wide range of audiences, it is flexible across multiple types of data science output, and it has adequate Python support (“adequate” is no longer the right word: as of today, RSConnect has added support for Dash, Flask, and Bokeh in addition to Jupyter Notebook. It’s a game changer for Python users!). Compared to other products, it makes the process of deploying and sharing much easier for data practitioners at all levels.

To date, Data Studio has served more than a hundred pieces of content, and it welcomes dozens of regular visitors each week. In the journey of sharing data science knowledge among data practitioners and business partners, we have realized that knowledge sharing is much more than making your work public. It is a principle that we practice day to day through discovering, creating, getting feedback on, validating, and evangelizing our work. And by doing that, everybody gains from it, not just the knowledge receivers. If you wonder how knowledge sharing benefits our organization, our data science team, and our own data practitioners, please keep reading.

Left: as of 2018, we used Shiny Server Pro to host data science outputs. Top right: customized landing page for our current data science platform: Data Studio. Bottom right: the content browsing page of Data Studio.

Knowledge sharing breaks down knowledge silos

At Pluralsight, product teams (with embedded data scientists) are designed to operate independently to increase accountability and efficiency. But sometimes, this team structure leads to information silos and duplicated work across teams. True story: almost half of our data scientists have been tasked to develop a model to automatically tag Pluralsight content for their teams.

Data Studio is an environment that makes data insights more accessible and discoverable, in order to improve information flow. One of the most-viewed artifacts on Data Studio is ExHub (short for experiment hub), a one-stop place to browse all ongoing and completed A/B tests from each product team. Each experiment has a one-line description of its purpose, the feature being tested, and statistics that show the impact on key company metrics. Viewers can easily gain context on past experiments from other teams and apply that knowledge to their next design. To learn more about ExHub, please read our previous post.

ExHub on Data Studio: a one-stop place to browse all A/B tests and their results.
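To make "statistics that show the impact" a little more concrete, here is a minimal sketch of one common way to summarize an A/B test on a conversion-style metric: relative lift plus a two-sided two-proportion z-test. This post does not describe how ExHub actually computes its statistics, so the function name, inputs, and choice of test are all assumptions made for illustration.

```python
import math


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and variant (b).

    Hypothetical helper for illustration only; not ExHub's implementation.
    Returns relative lift, z statistic, and two-sided p-value.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one observation")

    p_a = conv_a / n_a
    p_b = conv_b / n_b

    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")

    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Relative lift is undefined when the control rate is zero.
    lift = (p_b - p_a) / p_a if p_a > 0 else None
    return {"lift": lift, "z": z, "p_value": p_value}


result = two_proportion_ztest(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(result)
```

A summary like this, shown next to the one-line description of each experiment, is the kind of context that lets another team judge at a glance whether a past result is worth building on.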

Knowledge sharing demystifies data science

As more and more teams use data-driven approaches to build solutions, democratizing data science becomes mission critical. It’s not an easy task. For data scientists, creating a polished Jupyter or R notebook with narrative sentences, inline code, and visualizations, and committing it to a GitHub repo, is the finish line of their projects. But for non-technical stakeholders, a notebook with chunks of reusable functions and data steps is still unrelatable, not to mention the hurdle of creating a GitHub account and getting permission for access.

We learned from our experience that data science knowledge needs to be shared in a way that is comprehensible and accessible to a wide range of audiences. We, the product-focused data scientists, started to leverage Shiny and Dash as essential tools in our data science workflow, going beyond algorithm development and validation to demonstrate our work in a manner that’s closer to the end product. Thanks to RStudio Connect’s functionality, we can easily publish Shiny and Dash applications to Data Studio, which securely connects to our databases for live data extraction and is open to Pluralsight employees without VPN restriction.

These types of demos have received positive feedback from product leaders, and they always spark lively discussions. When data science is no longer a set of mysterious calculations happening in a black box, and when everyone can play with the data interactively, our work is more likely to earn trust and get buy-in from product stakeholders.

A Shiny application that demonstrates algorithms to generate a customizable learning path. A VAE model and an LSTM model run on the fly to produce content recommendations. The plot in the middle is a visual aid to show content embedding vectors.

Knowledge sharing fosters empathy and product-oriented thinking

As data scientists, we sometimes get obsessed with finding the cleverest way to solve a challenging statistical or machine learning problem and ignore the real struggles of our customers. Knowledge sharing is a good remedy for this problem. Being committed to knowledge sharing means proactively communicating with, and seeking feedback from, all stakeholders and collaborators.

  • Communicate early: knowledge sharing usually starts before a new project does. We work with stakeholders to frame the problem statement collectively, exchanging thoughts and ideas to co-create hypotheses and outcomes. That way, we largely avoid jumping into the data and building a solution without understanding how it will be adopted.
  • Communicate clearly: effective communication requires empathy and the ability to tell the story with data. With the adoption of Data Studio, more and more data scientists have started to use interactive data visualizations, dashboards, or prototype mock-ups to transform results into a format that stakeholders understand and can engage with.
  • Communicate frequently: quick feedback cycles with stakeholders keep our work aligned with the product vision. For example, the most accurate recommender model based on offline data will never be implemented if the product goal evolves to promote fresh and new types of content. Without frequent check-ins with product stakeholders, data scientists could waste weeks trying to optimize the wrong metric.

Having a product mindset is critical not only to product leaders but also to data scientists. When knowledge sharing becomes a part of our workflow, it constantly reminds us to keep our eyes on the practical application of our research, so that the solutions we build are useful and usable to our ultimate customers.

A “straw-man” Dash application that mimics a new browse experience. It effectively presents: 1. The variants of taxonomy and their hierarchies in the dropdown menu; 2. A generic or personalized landing page for a selected topic. Our data scientists built each block incrementally with constant input and feedback from the product manager and designer.

Knowledge sharing helps cultivate a data science community

As a knowledge sharing platform, Data Studio is not only a stage to showcase data science products to stakeholders, but also a playground to learn from and collaborate with other data practitioners. For example, some data scientists wrap their machine learning model in a Flask API, and before the model is fully deployed to production, they publish the API to Data Studio and encourage other data scientists to test and validate it. A soft rule of publishing to Data Studio is to always include a link to the GitHub repo in the content description box. By providing full transparency and openness to our code base, we invite anyone in the data community to learn from, review, and provide feedback on our work.
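For readers who have not wrapped a model this way before, the pattern can be sketched in a few lines of Flask. This is only an illustration under stated assumptions: the keyword lookup stands in for a real trained model, and the route names, the `text` field, and the `predict_tags` helper are hypothetical rather than anything our teams actually ship.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for a trained model: a keyword lookup, not a real tagger.
KEYWORD_TAGS = {
    "python": "programming",
    "pandas": "data-analysis",
    "kubernetes": "devops",
}


def predict_tags(text):
    """Return the sorted, de-duplicated tags whose keywords appear in the text."""
    words = text.lower().split()
    return sorted({KEYWORD_TAGS[w] for w in words if w in KEYWORD_TAGS})


@app.route("/health", methods=["GET"])
def health():
    # Lets reviewers confirm the service is up before sending real requests.
    return jsonify(status="ok")


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or invalid body.
    payload = request.get_json(silent=True)
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return jsonify(error="JSON body must include a non-empty 'text' field"), 400
    return jsonify(tags=predict_tags(text))
```

Keeping the prediction logic in a plain function, separate from the route, means a reviewer can exercise the model directly in a notebook or through the published endpoint, whichever is more convenient for testing and validation.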

We also collectively created a “Handbook for Data Practitioners at Pluralsight”, and published it to Data Studio using the R bookdown package. It centralizes tons of useful data science resources that were previously scattered across Google Drive, the company intranet, Slack messages, etc. The handbook covers onboarding, data sources, data science tooling and utilities, and our principles and best-practice guidelines. Every data practitioner is invited to directly compose or revise the book in the areas where they have domain expertise. Making contributions is easy: people can simply commit changes to a designated GitHub repo, and thanks to the “Git Publishing” function, all changes made in GitHub are pushed to Data Studio instantaneously.

The collective effort we put into building data science assets that can be shared and benefit our peers promotes a culture of contribution to the data science community. The fact that Data Studio supports both R and Python products brings flexibility to our contributors, and keeps the data community diverse and inclusive. And in return, the feeling of being part of a collaborative community boosts enthusiasm and empowers everyone to contribute and exchange knowledge.

Handbook for Data Practitioners at Pluralsight, published on Data Studio

Conclusion

Knowledge sharing is knowledge gaining. It helps break down information silos among teams, democratize data science across the entire organization, strengthen the product mindset of data practitioners, and develop an open and collaborative data science community. Admittedly, a knowledge sharing culture can’t be achieved merely by creating a knowledge repo, but having a data science platform like Data Studio can lower the hurdle to sharing, promote visibility for our research, and help the knowledge sharing culture grow organically.

We still have a long way to go in this journey. Knowledge sharing is not the effort of only one person or one team: it will only succeed when it becomes a default behavior in everybody’s workflow. We believe that regardless of its form, whether a model to improve a product feature, an insight to inform strategic decisions, or a reusable tool to boost productivity, data science work has to be shared to have an impact and produce business value. Having an environment where people clearly see how their knowledge makes a difference and fits into the mission of the organization reinforces the behavior of sharing. We are proud to be advocates pushing to create such an environment. We will continue to explore the untapped potential of Data Studio, and hope that one day it becomes part of a bigger ecosystem of knowledge sharing at Pluralsight.

Shan Huang
Data Science and Machine Learning at Pluralsight

I’m a Principal Data Scientist at Pluralsight, where we are democratizing tech skills.