Jupyter Notebooks

In case you stumbled on my blog series, here is a brief note about what I write. I do deep dives into the important machine learning algorithms. After an overview of machine learning, I started with linear regression. Before moving on, I want to take a break from that flow and discuss the tooling that a machine learning engineer uses every day. Specifically, I will talk about the tool they are most likely to need in their day-to-day work: Jupyter Notebook. If you have been interested in machine learning for some time, I am sure you have heard of this tool and are probably actively using it. Through this post, I hope to shed light on some interesting ways in which you can use Jupyter notebooks to take your analysis to the next level.

History
But first, let’s look at the origin of this awesome software. In data science, as in the broader scientific community, analysis is largely exploratory. Not only the output matters, but also the methodology by which it was generated. Previously it was difficult to share this line of thinking, so there was always a need for a tool that facilitates it. The IPython Notebook was therefore developed in 2011 by a team of researchers led by Fernando Pérez and Brian Granger. It was called IPython because the kernel could only execute Python code. In 2014, Project Jupyter started as a spin-off from the IPython project. In the new notebooks, the kernels were separated from the frontend, and support for languages other than Python was added.

Just as a side note, the name Jupyter is derived from Julia, Python and R, signifying that the project creators want to encompass the whole sky of machine learning under its umbrella and provide the power of lightning and flash to all its users. :)

Version Controlling
Once you start maintaining Jupyter notebooks for reproducible results, it is important to have an effective way of storing them. GitHub is an excellent platform to store and view notebooks, as it renders a notebook when you open its file link in a web browser. There are two caveats though. First, if you have a lot of JavaScript in your notebook, it is not rendered on GitHub, and you will need to view the notebook on your local machine. Second, changes between two versions of the code are pretty hard to see in a diff, because .ipynb files are stored as JSON with code, outputs and metadata mixed together.

The solution is to save Python files alongside the Jupyter notebook so that differences in the code can be tracked easily. This can be done by editing the config file:

~/.ipython/profile_nbserver/ipython_notebook_config.py

This saves a .py and an .html file every time a notebook is saved. The differences in the code can then be easily shown in the console.
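As a minimal sketch of what could go into such a config file, here is a post-save hook along the lines of the one described in the Jupyter documentation. It assumes that the jupyter command with nbconvert is on the PATH and that your Jupyter version exposes FileContentsManager.post_save_hook. The helper name nbconvert_commands is my own, for illustration only.

```python
import os
from subprocess import check_call


def nbconvert_commands(fname):
    """Build the nbconvert calls that export one notebook as .py and .html."""
    return [
        ["jupyter", "nbconvert", "--to", "script", fname],
        ["jupyter", "nbconvert", "--to", "html", fname],
    ]


def post_save(model, os_path, contents_manager):
    """Post-save hook: export every saved notebook next to its .ipynb file."""
    if model["type"] != "notebook":
        return  # ignore plain files and directories
    folder, fname = os.path.split(os_path)
    for command in nbconvert_commands(fname):
        check_call(command, cwd=folder)


# Inside the config file, where Jupyter provides the `c` config object,
# the hook would be registered with:
# c.FileContentsManager.post_save_hook = post_save
```

The registration line is left as a comment because the c object only exists while Jupyter is loading the config file.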

Saving images as separate files also has a lot of benefits, as you can compare two versions of an image directly. Click on this link and see how the differences in the images can be easily shown.

Two kinds of notebooks
Since Jupyter notebooks can be used both as an exploratory tool and as a final document that is delivered and shown to the customer, it is highly advisable for analytics teams to keep two folders for each use case. One can be named develop, where the exploratory analysis is kept, while the other can be named deliver and houses the final notebooks. You will then have the following folder structure.

– develop
    + analysis1.ipynb
    + analysis1.py
    + somefig.png
– deliver
    + final_analysis.ipynb
    + final_analysis.py
    + fig1.png

Magics

There are a lot of magics that can be used in Jupyter notebooks. Listing all of them would make this post too long, so I will focus only on the ones that I find most useful in my day-to-day work.

The first one is the %%bash magic. It lets you list the files in the current directory, download files and run other simple bash commands without leaving the comfort of the browser. It is also handy for showing the reader the state of a directory, for example that a file sits in a remote location such as S3 and needs to be downloaded first.

Another one I like is for showing math equations. As data scientists we work with a lot of algorithms, and we need to articulate them for the other engineers we work with. The %%latex magic really helps with that.
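To give a rough idea, the comments below show what a bash cell could look like, and the Python code is an approximate equivalent outside a notebook: it hands the same script to bash. This is only a sketch and assumes that bash is available on the PATH. The script itself is a made-up example.

```python
import subprocess

# In a notebook the cell would simply be:
#   %%bash
#   echo "files in the current directory:"
#   ls
#
# Rough plain-Python equivalent: pass the same script to bash.
script = 'echo "files in the current directory:"\nls\n'
result = subprocess.run(
    ["bash", "-c", script],
    capture_output=True,  # collect stdout/stderr instead of printing directly
    text=True,            # decode the output to str
    check=True,           # raise if the script exits with an error
)
print(result.stdout)
```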
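As an illustration, a cell like the following would render the hypothesis and the least-squares cost of linear regression. The symbols are my own choice for this example.

```latex
%%latex
\begin{align}
\hat{y} &= \theta_0 + \theta_1 x \\
J(\theta_0, \theta_1) &= \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
\end{align}
```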

Structuring a Jupyter Notebook
In the beginning, there should be the imports and the dependencies. For Python, imports should be grouped in the usual PEP 8 order, i.e.:
1. standard library imports
2. related third-party imports
3. local application/library/module imports

Next, version information should be given for at least the main dependencies, such as numpy, scipy, pandas etc.
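A first cell following this advice could look roughly like the sketch below. The helper report_versions is a hypothetical name of mine. It uses importlib.metadata from the standard library (Python 3.8+), so it also works when one of the packages is not installed.

```python
# 1. standard library imports
import sys
from importlib import metadata

# 2. related third-party imports would go here (numpy, pandas, ...)
# 3. local application imports would go last


def report_versions(packages):
    """Return a dict mapping each package name to its installed version."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in report_versions(["numpy", "scipy", "pandas"]).items():
    print(f"{name}: {version}")
```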

Once you are near the end of your notebook, you should consider two things.
First, is my notebook too big? If it is, the big picture will probably not be clear to the reader, and you should consider breaking it into separate notebooks. The FileLinks function from IPython.display can help you link them together.

Second, does it have a conclusion? Jupyter notebooks are meant to be stories about data and hence should end with a conclusion summarizing the main points and what has been discussed.
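If you do split a notebook, an index cell could look roughly like this sketch. The file names are made up and a temporary folder is used only so that the example is self-contained. In practice you would point FileLinks at a folder next to your notebook. The helper notebooks_in is my own addition.

```python
import tempfile
from pathlib import Path


def notebooks_in(folder):
    """Return the names of the .ipynb files directly inside folder, sorted."""
    return sorted(p.name for p in Path(folder).glob("*.ipynb"))


# Demo on a throwaway folder with hypothetical file names.
with tempfile.TemporaryDirectory() as folder:
    for name in ["01_load_data.ipynb", "02_train_model.ipynb", "notes.txt"]:
        (Path(folder) / name).touch()

    parts = notebooks_in(folder)
    print(parts)

    try:
        from IPython.display import FileLinks

        # As the last expression of a notebook cell, this object is
        # rendered as a list of clickable links to the notebooks.
        links = FileLinks(folder, included_suffixes=[".ipynb"])
    except ImportError:
        pass  # IPython is not installed outside a Jupyter environment
```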

Debugging

If the code that you are trying to debug is pure Python, you can employ the age-old Python debugger, pdb. The pdb commands and methods work in the Jupyter notebook as well.
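A minimal sketch of how this can look in a cell is below. The function normalize and its debug flag are hypothetical, and the breakpoint is guarded so that the cell runs straight through unless you ask for the debugger.

```python
import pdb


def normalize(values, debug=False):
    """Scale values so that they sum to 1."""
    total = sum(values)
    if debug:
        # Execution pauses here. Useful pdb commands:
        #   p total   print a variable
        #   n         run the next line
        #   c         continue until the next breakpoint
        pdb.set_trace()
    return [v / total for v in values]


print(normalize([1, 2, 5]))
```

Calling normalize([1, 2, 5], debug=True) in a notebook opens the pdb prompt right below the cell.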

In my experience pdb helps a great deal while debugging your models. This assumes that you create your models in Python. If you build your models in some other language, you will need to use the debugging tools of that language, but I am going to assume that Jupyter works the same way when the language is an interpreted one like R or Julia. If you run into errors that look more like weird matrix issues, then the shape of the data that you are passing is probably wrong, and you need to look closer at how you are building your matrices.

Profiling

Since you will be dealing with huge amounts of data, it is good to know things such as memory consumption and the time spent executing the cells, especially the cells that are doing most of the heavy lifting.
%%time, placed on the first line of a cell, will tell you the amount of time needed to run that cell.
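As a sketch, the comment shows where the magic goes, and the code is a comparable wall-clock measurement in plain Python using time.perf_counter. The function heavy_cell is a hypothetical stand-in for your own code.

```python
import time

# In a notebook, the measured cell would start with the line:
#   %%time
# Outside a notebook, a comparable wall-clock measurement looks like this.


def heavy_cell():
    """Hypothetical stand-in for a cell that does the heavy lifting."""
    return sum(i * i for i in range(1_000_000))


start = time.perf_counter()
result = heavy_cell()
elapsed = time.perf_counter() - start
print(f"Wall time: {elapsed:.3f} s")
```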

You can also profile your code line by line or see how much memory the cells are taking. For that, you will need to install line_profiler or memory_profiler, depending on your requirement. Then you can load the corresponding extension in the notebook.
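The comments in the sketch below show how the two extensions are loaded and used. The code itself does not use them: as a stand-in that needs no extra installation, it measures memory with tracemalloc from the standard library, which is a different tool from memory_profiler. The function build_table is a made-up example.

```python
import tracemalloc

# In a notebook, with the packages installed, the extensions are loaded with:
#   %load_ext line_profiler
#   %load_ext memory_profiler
# after which you can run, for example:
#   %lprun -f build_table build_table(100_000)
#   %memit build_table(100_000)


def build_table(n):
    """Hypothetical heavy cell: builds a list of n squared integers."""
    return [i * i for i in range(n)]


# Standard-library alternative: trace allocations made while the code runs.
tracemalloc.start()
table = build_table(100_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
```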

Joydeep Bhattacharjee