The Ultimate Beginner’s Guide to Jupyter Notebooks
Jupyter Notebooks offer a great way to write and iterate on your Python code. They are an incredibly powerful tool for interactively developing and presenting data science projects. A notebook integrates code and its output into a single document that combines visualizations, narrative text, mathematical equations, and other rich media. This intuitive workflow promotes rapid, iterative development, making notebooks an increasingly popular choice at the heart of contemporary data science, analysis, and, increasingly, science at large. Best of all, as part of the open source Project Jupyter, they are completely free.
Project Jupyter is the successor to an earlier project, IPython Notebook, which was first published as a prototype in 2010. Jupyter Notebook is built on top of IPython, an interactive way of running Python code in the terminal using the REPL (Read-Eval-Print Loop) model. The IPython kernel runs the computations and communicates with the Jupyter Notebook front-end interface, and this kernel architecture is what allows Jupyter Notebook to support multiple languages. Jupyter Notebooks extend IPython with additional features, like storing your code and output and letting you keep notes in Markdown.
Although it is possible to use many different programming languages within Jupyter Notebooks, this article will focus on Python as it is the most common use case.
Getting Started with Jupyter Notebooks!
As you may have surmised from the above, you first need Python installed on your machine. Either Python 2.7 or Python 3.3 and above will do.
Install Using Anaconda
The easiest way for a beginner to get started with Jupyter Notebooks is by installing it using Anaconda. Anaconda installs both Python and Jupyter, and also includes quite a lot of packages commonly used in the data science and machine learning community. You can follow the latest installation guidelines on the Anaconda website.
Install Using Pip
If for some reason you decide not to use Anaconda, you can install Jupyter manually using Python's pip package manager with the command below:
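Assuming Python and pip are already on your PATH, the install is a single command:

```shell
pip install jupyter
```

Depending on your setup, you may need `pip3` instead of `pip`, or the equivalent `python -m pip install jupyter`.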
Launching First Notebook
To launch a Jupyter notebook, open your terminal and navigate to the directory where you would like to save your notebook. Then type the command below, and the program will start a local server at http://localhost:8888/tree.
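The launch command is simply:

```shell
jupyter notebook
```

Keep this terminal open while you work; closing it (or pressing Ctrl + C twice) shuts the server down.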
A browser window should immediately pop up with the Jupyter Notebook interface. As you might have already noticed, Jupyter's notebooks and dashboard are web apps, and Jupyter starts up a local Python server to serve these apps to your web browser. This makes Jupyter Notebooks platform independent and easier to share with others.
The Files tab is where all your files are kept, the Running tab keeps track of all your processes and the third tab, Clusters, is provided by IPython parallel, IPython’s parallel computing framework. It allows you to control many individual engines, which are an extended version of the IPython kernel.
Let’s start by making a new notebook. We can easily do this by clicking on the New drop-down list in the top-right corner of the dashboard. You will see that you have the option to create a Python 3 notebook as well as a regular text file, a folder, and a terminal. Please select the Python 3 notebook option.
Your Jupyter Notebook will open in a new tab as shown in the below image.
Now each notebook uses its own tab so that you can open multiple notebooks simultaneously. If you switch back to the dashboard, you will see the new file Untitled.ipynb and you should see some green text that tells you your notebook is running.
Why a .ipynb file?
.ipynb is the standard file format for storing Jupyter Notebooks, hence the file name Untitled.ipynb. Let’s begin by first understanding what an .ipynb file is and what it might contain. Each .ipynb file is a text file that describes the contents of your notebook in a JSON format. Each cell and its contents, whether it be text, code or image attachments that have been converted into strings of text, is listed therein along with some additional metadata. You can edit the metadata yourself by selecting “Edit > Edit Notebook Metadata” from the menu bar in the notebook.
You can also view the contents of your notebook files by selecting “Edit” from the controls on the dashboard, though there's no reason to do so unless you really want to edit the file manually.
Understanding the Notebook Interface
Now that you have an open notebook in front of you, take a look around. Check out the menus to see what options and functions are readily available. In particular, take some time to scroll through the list of commands in the command palette, the small button with the keyboard icon (or just press Ctrl + Shift + P).
There are two terms that you should learn: cells and kernels. They are key both to understanding Jupyter and to what makes it more than just a content writing tool. Fortunately, these concepts are not difficult to understand.
- A kernel is a program that interprets and executes the user’s code. The Jupyter Notebook App has an inbuilt kernel for Python code, but there are also kernels available for other programming languages.
- A cell is a container for text to be displayed in the notebook or code to be executed by the notebook’s kernel.
Cells form the body of a notebook. In the screenshot of a new notebook (Untitled.ipynb) in the section above, the box with the green outline is an empty cell. There are four types of cells:
- Code — This is where you type your code and when executed the kernel will display its output below the cell.
- Markdown — This is where you type your text formatted using Markdown and the output is displayed in place when it is run.
- Raw NBConvert — The contents of a raw cell are not rendered by the notebook; instead, they are passed through unchanged when the notebook is converted to another format (like HTML or PDF) with the nbconvert command-line tool.
- Heading — This is where you add Headings to separate sections and make your notebook look tidy and neat. This has now been merged into the Markdown option itself. Adding a ‘#’ at the beginning ensures that whatever you type after that will be taken as a heading.
Let’s test out how the cells work with a classic hello world example. Type print('Hello World!') into the cell and click the Run button in the toolbar above, or press Ctrl + Enter.
When you run the cell, its output is displayed below and the label to its left changes from In [ ] to In [1]. Moreover, to signify that a cell is still running, Jupyter changes the label to In [*].
Additionally, it is important to note that the output of a code cell comes from any text data specifically printed during the execution of the cell, as well as the value of the last line in the cell, be it a lone variable, a function call, or something else.
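For example, running a cell like this displays both the printed text and the value of the final expression:

```python
print('Hello World!')   # appears in the output area below the cell

# The value of the last line is also displayed, as the cell's Out[] value
answer = 2 + 2
answer
```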
Markdown is a lightweight markup language for formatting plain text, and much of its syntax corresponds directly to HTML tags. As this article was written in a Jupyter notebook, all of the narrative text and images you see were produced with Markdown. Let's cover the basics with a quick example.
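For instance, a cell containing the Markdown below covers most day-to-day formatting:

```markdown
# This is a level 1 heading
## This is a level 2 heading

Plain text with *italics*, **bold**, and `inline code`.

- A bulleted item
1. A numbered item

[A link](https://www.example.com)
```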
When attaching images, you have three options:
- Use a URL to an image on the web.
- Use a local URL to an image that you will keep alongside your notebook, such as in the same git repo.
- Add an attachment via “Edit > Insert Image”; this will convert the image into a string and store it inside your notebook .ipynb file.
Note that adding an image as an attachment will make the .ipynb file much larger because it is stored inside the notebook in a string format.
There are a lot more features available in Markdown. Once you have familiarized yourself with the basics above, you can refer to the official guide from the creator, John Gruber, on his website.
Behind every notebook runs a kernel. When you run a code cell, that code is executed within the kernel and any output is returned to the cell to be displayed. The kernel’s state persists over time and between cells — it pertains to the document as a whole and not individual cells.
For example, if you import libraries or declare variables in one cell, they will be available in another. In this way, you can think of a notebook document as being somewhat comparable to a script file, except that it is multimedia. Let’s try to understand this with the help of an example. First, we’ll import a Python package and define a function.
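The article's original cell isn't reproduced here, so the sketch below is an assumption built around the names referenced next (os, binascii, and the built-in sum); the point is only that whatever a cell defines stays available afterwards:

```python
import os
import binascii

# Toy helper: draw a few random bytes, total their integer values with the
# built-in sum, and show them as a hex string
def byte_total(n_bytes=4):
    data = os.urandom(n_bytes)
    return sum(data), binascii.hexlify(data).decode()
```

Because the kernel's state persists between cells, a later cell can simply call `byte_total()` without repeating the imports.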
Once we’ve executed the cell above, we can reference os, binascii and sum in any other cell.
The output should look something like this:
Most of the time, the flow in your notebook will be top to bottom, but it's common to go back and make changes. In that case, the execution order shown to the left of each cell, such as In [2], will let you know whether any of your cells have stale output. And if you ever wish to reset things, there are several incredibly useful options in the Kernel menu:
- Restart: restarts the kernel, clearing all the variables, imports, etc. that were defined.
- Restart & Clear Output: same as above but will also wipe the output displayed below your code cells.
- Restart & Run All: same as above but will also run all your cells in order from first to last.
- Interrupt: if your kernel is ever stuck on a computation and you wish to stop it, you can choose the Interrupt option.
Naming Your Notebooks
It is always best practice to give your notebooks meaningful names. It may seem confusing, but you cannot name or rename your notebooks from the notebook app itself. You must use either the dashboard or your file browser to rename the .ipynb file. We'll head back to the dashboard to rename the file we created earlier, which will have the default notebook file name Untitled.ipynb.
We cannot rename a notebook while it is running, so let’s first shut it down. The easiest way to do this is to select “File > Close and Halt” from the notebook menu. However, we can also shut down the kernel either by going to “Kernel > Shutdown” from within the notebook app or by selecting the notebook in the dashboard and clicking “Shutdown” (see images below).
Shutdown the kernel from Notebook App:
Shutdown the kernel from Dashboard:
Once the kernel has been shut down, you can then select your notebook and click “Rename” in the dashboard controls.
Sharing Your Notebooks
When we talk about sharing a notebook, two things might come to mind. In most cases, we want to share the end result of the work, i.e. a non-interactive, pre-rendered version of the notebook, much like this article; in other cases, we might want to share the code and collaborate with others on notebooks with the aid of version control systems such as Git, which is also possible.
Before You Start Sharing
A shared notebook will appear exactly in the state it was in when you export or save it, including the output of any code cells. Therefore, to ensure that your notebook is share-ready, so to speak, there are a few steps you should take before sharing:
- Click “Cell > All Output > Clear”
- Click “Kernel > Restart & Run All”
- Wait for your code cells to finish executing and check they did so as expected
This will ensure your notebooks don't contain intermediary output, don't have a stale state, and are executed in order at the time of sharing.
Exporting Your Notebooks
Jupyter has built-in support for exporting to HTML, Markdown and PDF as well as several other formats, which you can find from the menu under “File > Download as”. It is a very convenient way to share the results with others. But if sharing exported files doesn’t cut it for you, there are also some immensely popular methods of sharing .ipynb files more directly on the web.
- Home to over 2 million notebooks, GitHub is the most popular place for sharing Jupyter projects with the world. GitHub has integrated support for rendering .ipynb files directly, both in repositories and gists, on its website.
- You can follow the GitHub guides to get started on your own.
- NBViewer is one of the most popular notebook renderers on the web.
- If you already have somewhere to host your Jupyter Notebooks online, be it GitHub or elsewhere, NBViewer will render your notebook and provide a shareable URL along with it. Provided as a free service as part of Project Jupyter, it is available at nbviewer.jupyter.org.
Data Analysis in a Jupyter Notebook
Now that we've looked at what a Jupyter Notebook is, it's time to look at how they're used in practice, which should give you a clearer understanding of why they are so popular. As we walk through the sample analysis, you will see how the flow of a notebook makes a task intuitive to work through ourselves, and easy for others to understand when we share it with them. We'll also pick up some of Jupyter's more advanced features along the way. So let's get started, shall we?
Analyzing the Revenue and Profit Trends of Fortune 500 US companies from 1955–2013
So, let's say you've been tasked with finding out how the revenues and profits of the largest companies in the US have changed over the past 60 years. We shall begin by gathering the data to analyze.
Gathering the DataSet
The data set that we will be using to analyze the revenue and profit trends of Fortune 500 companies has been sourced from Fortune 500 Archives and Top Foreign Stocks. For your ease, we have compiled the data from both sources and created a CSV for you.
Importing the Required Dependencies
Let’s start off with a code cell specifically for imports and initial setup, so that if we need to add or change anything at a later point in time, we can simply edit and re-run the cell without having to change the other cells. We can start by importing pandas to work with our data, Matplotlib to plot the charts and Seaborn to make our charts prettier.
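That setup cell might look like this (assuming the conventional aliases):

```python
import pandas as pd              # data loading and manipulation
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # nicer chart styling
```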
Set the design styles for the charts
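A single line of Seaborn configuration does the trick; darkgrid here is just one of its built-in styles:

```python
import seaborn as sns

# Apply Seaborn's darkgrid style to all subsequent Matplotlib charts
sns.set(style="darkgrid")
```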
Load the Input Data to be Analyzed
As we plan on using pandas to aid in our analysis, let’s begin by importing our input data set into the most widely used pandas data-structure, DataFrame.
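A sketch of the loading step; the file name fortune500.csv and the column names are assumptions, and the inline sample below merely stands in for the real CSV so the snippet is self-contained:

```python
import io
import pandas as pd

# In the real notebook this is simply:
#     df = pd.read_csv('fortune500.csv')
# Hypothetical sample rows standing in for the compiled CSV:
sample_csv = io.StringIO(
    "year,rank,company,revenue,profit\n"
    "1955,1,General Motors,9823.5,806.0\n"
    "1955,2,Exxon Mobil,5661.4,584.8\n"
    "1955,3,U.S. Steel,3250.4,195.4\n"
)
df = pd.read_csv(sample_csv)
```

Calling `df.head()` then shows the first few rows of the DataFrame.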
Now that we are done loading our input dataset, let us see how it looks like!
Looking good. We have the columns we need, and each row corresponds to a single company in a single year.
Exploring the Dataset
Next, let’s begin by exploring our data set. We will primarily look into the number of records imported and the data types for each of the different columns that were imported.
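Two quick checks cover both questions; the tiny frame below is a hypothetical stand-in for the full df, with an 'N.A.' row mimicking the bad values in the real CSV:

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    "year,rank,company,revenue,profit\n"
    "1955,1,General Motors,9823.5,806.0\n"
    "1955,2,Exxon Mobil,5661.4,584.8\n"
    "1956,1,General Motors,N.A.,N.A.\n"
))

print(len(df))    # total number of records imported
print(df.dtypes)  # revenue and profit come out as 'object', not float
```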
As we have 500 data points per year, and the data set has records from 1955 to 2012, the total number of records in the dataset looks good!
Now let’s move on to the individual data types for each of the columns.
As we can see from the output of the above command, the data types for the revenue and profit columns are shown as object, whereas the expected data type is float. This indicates that there may be some non-numeric values in the revenue and profit columns.
So let’s first look at the details of imported values for revenue.
As the number of non-numeric revenue values is quite small compared to the total size of our data set, the easiest fix is to simply remove those rows.
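With pandas, spotting and dropping those rows might look like this (run here on a stand-in sample; in the notebook, df is the full data set):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    "year,rank,company,revenue,profit\n"
    "1955,1,General Motors,9823.5,806.0\n"
    "1956,1,General Motors,N.A.,N.A.\n"
    "1956,2,Exxon Mobil,5959.1,441.4\n"
))

# Inspect, then drop, the rows whose revenue is not a number
non_numeric_revenues = df.revenue.str.contains('[^0-9.-]')
print(df.loc[non_numeric_revenues])

df = df.loc[~non_numeric_revenues].copy()
df.revenue = pd.to_numeric(df.revenue)
```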
Now that the data type issue for column revenue is resolved, let’s move on to values in column profit.
Although the number of non-numeric profit values is a small fraction of our data set, it is not completely inconsequential, as it is still around 1.5%. If the rows containing N.A. are roughly uniformly distributed over the years, the easiest solution would be to remove them. So let's have a quick look at the distribution.
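A quick way to see the distribution is to count the 'N.A.' values per year and bar-plot them (sketched on a stand-in sample):

```python
import io
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

df = pd.read_csv(io.StringIO(
    "year,rank,company,revenue,profit\n"
    "1955,1,General Motors,9823.5,806.0\n"
    "1955,2,Exxon Mobil,5661.4,N.A.\n"
    "1956,1,General Motors,9998.1,N.A.\n"
))

# Count invalid profit values per year and chart them
non_numeric_profits = df.profit.str.contains('[^0-9.-]')
bin_sizes = df.year[non_numeric_profits].value_counts().sort_index()

plt.bar(bin_sizes.index, bin_sizes.values)
plt.xlabel('Year')
plt.ylabel("Invalid profit values")
```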
At a glance, we can see that the most invalid values in a single year is fewer than 25, and as there are 500 data points per year, removing these values would account for less than 4% of the data for the worst years. Indeed, other than a surge around the 90s, most years have fewer than half the missing values of the peak. For our purposes, let’s say this is acceptable and go ahead and remove these rows.
We should validate if that worked!
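Both the removal and the follow-up check can be sketched together (again on a small stand-in sample):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    "year,rank,company,revenue,profit\n"
    "1955,1,General Motors,9823.5,806.0\n"
    "1955,2,Exxon Mobil,5661.4,N.A.\n"
    "1956,1,General Motors,9998.1,873.6\n"
))

# Drop rows with non-numeric profits, then convert the column to floats
non_numeric_profits = df.profit.str.contains('[^0-9.-]')
df = df.loc[~non_numeric_profits].copy()
df.profit = pd.to_numeric(df.profit)

# Validate the clean-up: the row count and dtypes should now look right
print(len(df))
print(df.dtypes)
```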
Hurray! Our dataset has been cleaned up.
Time to Plot the graphs
Let’s begin with defining a function to plot the graph, set the title and add labels for the x-axis and y-axis.
Let’s get on to plotting the average profit by year and average revenue by year using Matplotlib.
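Together, the plotting helper and the two average plots might look like this sketch; the stand-in sample replaces the cleaned Fortune 500 data so the snippet runs on its own:

```python
import io
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Reusable helper for a titled, labelled line plot
def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_xlabel('Year')
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)

# Stand-in for the cleaned data set
df = pd.read_csv(io.StringIO(
    "year,revenue,profit\n"
    "1955,9823.5,806.0\n"
    "1955,5661.4,584.8\n"
    "1956,9998.1,873.6\n"
))

avgs = df.groupby('year').mean()

fig, ax = plt.subplots()
plot(avgs.index, avgs.profit, ax,
     'Average Fortune 500 company profits by year', 'Profit (millions)')

fig, ax = plt.subplots()
plot(avgs.index, avgs.revenue, ax,
     'Average Fortune 500 company revenues by year', 'Revenue (millions)')
```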
The revenues, on the other hand, grow constantly and are comparatively stable. It also helps us understand how the average profits recovered so quickly after the staggering drops caused by the recessions.
Let’s also take a look at how the average profits and revenues compared to their standard deviations.
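One way to visualize this is to shade a band one standard deviation either side of the mean (sketched on a stand-in sample):

```python
import io
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Stand-in for the cleaned data set
df = pd.read_csv(io.StringIO(
    "year,revenue,profit\n"
    "1955,9823.5,806.0\n"
    "1955,5661.4,-120.0\n"
    "1956,9998.1,873.6\n"
    "1956,200.0,-900.0\n"
))

group_by_year = df.groupby('year')
avgs = group_by_year.mean()
stds = group_by_year.std()

# Mean profit line with a +/- one standard deviation band around it
fig, ax = plt.subplots()
ax.plot(avgs.index, avgs.profit)
ax.fill_between(avgs.index, avgs.profit - stds.profit,
                avgs.profit + stds.profit, alpha=0.2)
ax.set_title('Fortune 500 company profits: mean and standard deviation')
ax.set_ylabel('Profit (millions)')
```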
That's astonishing: the standard deviations are huge. Some companies are making billions while others are losing as much, and the risk has certainly increased along with rising profits and revenues over the years. Although we could keep playing around with our data set and plot plenty more charts to analyze, it is time to draw this article to a close.
As part of this article, we have seen various features of Jupyter notebooks, from basics like installation, creating, and running code cells to more advanced features like plotting graphs. The power of Jupyter Notebooks to promote a productive working experience and their ease of use are evident from the above example, and I hope you feel confident enough to begin using Jupyter Notebooks in your own work and to start exploring more advanced features. You can read more about data analytics using pandas here.
If you'd like to explore further and look at more examples, Jupyter has put together A Gallery of Interesting Jupyter Notebooks that you may find helpful, and the NBViewer homepage links to some really fancy examples of quality notebooks. Find the entire code here on GitHub.
This post was originally published on Velotio Blog.
Velotio Technologies is an outsourced software product development partner for technology startups and enterprises. We specialize in enterprise B2B and SaaS product development with a focus on artificial intelligence and machine learning, DevOps, and test engineering.