Jupyter notebooks have become incredibly popular among data scientists and general users of Python and R. While the Jupyter framework is liberal and lets you be creative, you, your team, and your readers will benefit if you define a structure and follow it. Based on my experience in developer evangelism and authoring public-facing notebooks over the last three years, here is my take on recommended patterns for writing data science samples using Jupyter Notebooks.
Use headings and markdown lavishly
Start your notebook with Heading level 1 and give it a title. Follow it with a narrative of what the notebook aims to do, where the data is sourced from and what the user can expect by the end of it.
Break down your notebook into smaller parts and use Heading levels 2, 3, 4… for hierarchy of topics and sub-topics. A notebook should ideally have just one Heading level 1, under which multiple levels 2, 3… are nested.
Insert a Table of Contents after your executive summary, so the reader can glance at your work without having to scroll a lot. You can auto-insert a ToC (and keep it up to date) with Jupyter notebook extensions.
Embed images. Use varied typography (bold, italic, code) to highlight pieces of text.
Use LaTeX for equations
Here is a cheat sheet with common LaTeX symbols. Insert them inline between two dollar signs, $…$. Insert multi-line equations between $$…$$ (double dollar signs).
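For instance, a markdown cell mixing inline and display math could read as follows (the regression equation is an arbitrary illustration):

```latex
The line of best fit is $y = \beta_0 + \beta_1 x$, where the
coefficients minimize the sum of squared residuals:

$$
\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
$$
```

Jupyter renders these with MathJax, so no extra installation is needed for the reader.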
Break longer segments of code into multiple cells
Try to keep your code cells as short as possible. Break them up by adding markdown cells in between with explanatory text. A cell with a single line of code is too short; a cell with over 15 lines of code is too long.
Matplotlib is great, but check out higher-level plotting libraries such as Seaborn and `Pandas.DataFrame.plot()` before you settle for matplotlib. Use `plt.tight_layout()` to auto-size your plots to fit the notebook.
Use subplots when you want to show a grid of plots. Finally, ensure your plots have a legend, a title, axis names, and discernible symbols.
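As a sketch of these points together (the data and styling are invented for illustration), a 2×2 grid with titles, axis labels, and legends might look like:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# A 2x2 grid of subplots instead of four separate figures
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

for ax, power in zip(axes.flat, range(1, 5)):
    ax.plot(x, np.sin(x) ** power, label=f"sin(x)^{power}")
    ax.set_title(f"Power {power}")   # every panel gets a title...
    ax.set_xlabel("x")               # ...and named axes
    ax.set_ylabel("y")
    ax.legend()                      # ...and a legend

plt.tight_layout()  # auto-size subplots so labels don't overlap
```

In a notebook you would omit the `Agg` backend line and let the figure render inline.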
Coding standards for your Python snippets
snake_case_your_variables and function names instead of camelCasingThem (use underscores to separate words instead of case changes). The exception is class names, which use CapWords: each word capitalized, starting with a capital letter. Writing Python quite a bit? Invest some time in https://pep8.org/. Your code reviewers and readers will love you.
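The conventions above in a nutshell (all names here are invented for illustration):

```python
# Functions and variables: snake_case
def average_house_price(prices):
    total_price = sum(prices)
    return total_price / len(prices)

# Classes: CapWords (CamelCase starting with a capital letter)
class HousePriceModel:
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

avg = average_house_price([100, 200, 300])
print(avg)  # 200.0
```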
Do all imports at the top of the notebook. This way reader knows what libraries are used and can ensure their environment is ready.
Name variables so that they don't shadow built-ins. For instance, call your map objects `map1`, `map2` instead of `map`, which hides the built-in `map()` function. Don't call your variables `dict` or `list`, which hides the built-in data structures of the same name.
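A minimal demonstration of why this matters:

```python
# Shadowing a built-in: after this assignment, the built-in map()
# function is no longer reachable by its usual name.
map = {"CA": "California"}
# map(lambda x: x * 2, [1, 2, 3])  # would now raise TypeError

map1 = {"CA": "California"}  # a non-clobbering name for the same data

del map  # remove the shadowing variable to restore the built-in
doubled = list(map(lambda x: x * 2, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```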
Round numbers for display purposes. You can quickly round a DataFrame during display by calling its `round()` method. For instance, `usa_house_scaled.describe().round(3)` will display the numeric columns of your DataFrame rounded to 3 decimal digits.
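A small sketch of the same pattern on a made-up dataset (the column names and values are invented; `usa_house_scaled` from the text would work the same way):

```python
import pandas as pd

# Hypothetical housing data, purely for illustration
df = pd.DataFrame({
    "sqft":  [1523.4567, 2310.9876, 1875.1234],
    "price": [312456.789, 450123.456, 389765.432],
})

# describe() produces long floats; round(3) keeps the display readable
summary = df.describe().round(3)
print(summary)
```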
Be explicit about uncommon libraries that you use in the notebook
It is generally good practice to import all your dependencies at the beginning of a script. However, in the notebook medium, you might prefer to import them as and when necessary, in order to explain your work better. This is especially true if you import a lot of dependencies at the function level. If you use a library that does not ship with base Anaconda, the user has to run install steps and relaunch the notebook. Hence, make this explicit at the beginning of the notebook.
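One way to make the requirement explicit is a first code cell like the following (geopandas is just an illustrative example of a library outside base Anaconda; substitute whatever your notebook actually needs):

```python
# Make third-party requirements explicit in the first code cell,
# so the reader can install them before running anything else.
try:
    import geopandas  # example of a library not in base Anaconda
    HAS_GEOPANDAS = True
except ImportError:
    HAS_GEOPANDAS = False
    print("This notebook requires geopandas, which does not ship with "
          "base Anaconda. Install it, for example with:")
    print("    conda install -c conda-forge geopandas")
    print("then restart the kernel and re-run the notebook.")
```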
Example structure of your data science notebook
By and large, structure your notebook as you would a paper for a scientific journal.
Heading 1: Title: Cover the narrative / abstract. Include a ToC
Heading 2: Get data: Import libraries, search for and get required data sets.
Heading 2: Exploratory data analysis: Use maps and charts lavishly to show different aspects of the data.
Heading 2: Feature engineering: Use pandas and other libraries to prepare your data for training. After each significant transformation, show a preview of your data by printing the first 3 to 5 records of your DataFrame.
Heading 2: Analysis: Perform analysis, build and train models.
Heading 3: Evaluation: Evaluate the model. Show whether its assumptions are met, using both charts and metrics. Run predictions and evaluate the results the same way, with more than one metric.
Heading 2: Act on the analysis: Persist the results, either by writing to disk or by publishing them to the web. Elucidate with maps and charts as applicable. If you built a prediction model, publish it as a web tool (REST API). If you built an explanatory notebook, publish it as an article or report.
Heading 2: Conclusion: Summarize your work: restate your problem statement, the approach you followed, and the results you obtained.