Data Science Notebooks — A Primer

Astasia Myers · Published in Memory Leak
Apr 28, 2020 · 4 min read

For data scientists, notebooks are a crucial tool. Notebooks are a form of interactive computing in which users write and execute code, visualize the results, and share insights. Typically, data scientists use notebooks for experimentation and exploration. Increasingly, other groups are leveraging the tool as well, including business analysts and analytics engineers. Netflix is a great example of a business that leverages notebooks across different functional units. As notebooks become a more mainstream and critical tool across organizations, their usability and functionality will need to improve. Specifically, going forward we envision users will continue to demand 1) easier set-up and management, 2) improved collaboration, and 3) better visualizations.

Stephen Wolfram, the computer scientist and physicist, introduced Mathematica, the first computational notebook interface, over 30 years ago. Since then, notebooks have proliferated and transitioned from academia to industry. We can categorize notebooks along two vectors. The first is whether they are open source or hosted. Open source notebooks include Jupyter (formerly known as IPython) and Apache Zeppelin. Hosted offerings include Deepnote, Noteable, Databricks Collaborative Notebooks, Google Colab, and Jovian.ml, among others. The second vector is language support. Some notebooks, like Polynote, can even run code from multiple languages in a single notebook. We believe polyglot workflows are an increasing trend: for example, starting a notebook with SQL to query data, then moving to Python or R for exploration, as sketched below. Further on, we highlight 20+ notebook offerings.
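To make that SQL-then-Python handoff concrete, here is a minimal sketch using a single Python kernel with sqlite3 and pandas (true polyglot notebooks like Polynote run the languages in separate cells; the database file and events table below are hypothetical placeholders):

```python
import sqlite3

import pandas as pd

# Hypothetical workflow: open the notebook with SQL to pull the data
# you need, then hand off to Python for exploration.
conn = sqlite3.connect("example.db")
df = pd.read_sql(
    "SELECT user_id, ts, event FROM events WHERE ts >= '2020-01-01'",
    conn,
)

# From here on, explore in pandas as usual.
print(df.groupby("event").size().sort_values(ascending=False).head())
```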

Jupyter is the most popular notebook. According to an estimate published on NBViewer, there are over 7M public Jupyter notebooks on GitHub today. Like most modern notebooks, it has two components. First is the client, a front-end web page where users input programming code or text in cells. Notebooks are represented as JavaScript Object Notation (JSON) documents, so they can interweave code with natural-language markup and HTML. Second, the browser passes the code to a back-end “kernel,” which runs the code and returns the results to the client. The kernel can run locally or in the cloud.

https://nbviewer.jupyter.org/github/parente/nbestimate/blob/master/estimate.ipynb
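To show how simple the on-disk format is, here is a hand-rolled notebook written as JSON from Python. This is a minimal sketch of the nbformat v4 structure, not a full rendering of the schema, which has more fields:

```python
import json

# A notebook on disk is just a JSON document: a list of cells (code or
# markdown) plus metadata telling the client which kernel to use.
minimal_nb = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3"}
    },
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# My analysis"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello, notebook')"],
        },
    ],
}

# Written out, this file opens in Jupyter like any other notebook.
with open("minimal.ipynb", "w") as f:
    json.dump(minimal_nb, f, indent=2)
```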

Data science continues to be a growing, in-demand profession. According to IBM, there will be 3 million data science positions in 2020. Meanwhile, the University of California, Riverside predicts a 60% shortfall in filling data science positions in the U.S. alone by 2020. As a scarce resource, data scientists have leverage to pick their notebook of choice, so we'll continue to see individual users, rather than senior management, drive purchasing decisions.

Going forward, we're keeping an eye on the space, as our research suggests notebook users are hoping for improvements across three vectors: 1) set-up and management, 2) collaboration, and 3) visualizations.

1) Set-up and management. Teams would find value in a solution that easily sets up their environment. They also want the ability to manage consistent environments that are shareable across individuals and teams. Importantly, when sharing notebooks, they want fine-grained control over access to code, data, and infrastructure. Right now, there are gaps: in our conversations, some noted concerns around sensitive data access that can vary from team member to team member. In addition, some stated concerns that individuals could start using Amazon Elastic Compute Cloud (EC2) instances for cryptomining if user authorization were not part of the product.

2) Collaboration. Data scientists and ML engineers share notebooks today, but it isn't easy to do with open source Jupyter. In contrast, Google Colab emphasizes sharing as part of its functionality. Individuals we spoke with thought the opportunity to do “remote pair programming” in a notebook could be useful, especially for senior leaders trying to help junior members of the team.

3) Visualizations. Notebook users told us they want better visualizations. Some use D3 for enhanced visuals, and Observable emphasizes visualizations as part of its solution. Users want to share exhibits and analytics that a viewer can toggle without changing the underlying code base, as sketched below. Streamlit helps build data applications, but it is a separate solution, closer to Plotly's Dash and Shiny for sharing out results. The ask for better visualizations suggests notebook users want to distribute their notebooks to non-technical employees, underscoring the expansion of notebooks into other functional teams, a big area for growth.
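As a rough sketch of what a toggleable exhibit can look like inside a notebook today, ipywidgets can expose a parameter as a slider so a viewer can adjust the chart without editing code (the data below is synthetic, standing in for a real analysis result):

```python
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

# Synthetic data standing in for a real analysis result.
rng = np.random.default_rng(0)
data = rng.normal(size=1_000)

# interact() turns the bins argument into a slider: viewers toggle the
# exhibit in a running notebook without touching the underlying code.
@interact(bins=(5, 50, 5))
def histogram(bins=20):
    plt.hist(data, bins=bins)
    plt.title(f"Histogram with {bins} bins")
    plt.show()
```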

Notebooks are a core piece of the stack for data-driven teams. We are excited to watch as notebooks increase in popularity crossing the chasm towards non-data scientists. If you or someone you know is working on a notebook startup or adjacent offering, it would be great to hear from you. Comment below or email me at amyers@redpoint.com to let us know.

☞ To hear more from me in the future, follow me on Twitter, LinkedIn & Medium as well.

☞ If you liked this post, please tap the clap icon to promote this piece to others or share your thoughts with me in the comments.

