Open source, Jupyter, and cloud-based computing in Astronomy
It is no exaggeration to say that the majority of computing today happens in the cloud. Look at Google, Facebook, YouTube, Spotify, or streaming services like Netflix!
Ever wondered how these things work?
What is this cloud and how can it be beneficial when compared to traditional computing practices?
These services aside, when it comes to our actual work, the workflow still follows the traditional patterns: the same text editors, terminals, and IDEs. And when we develop something, how much care do we take to ensure that others can reproduce it and build on it?
The world of astronomers!
In astronomy, the size and scale of data have become a major challenge. Modern telescopes can produce hundreds of gigabytes of data every night, and manually reducing and analyzing that data, and then building your code around it, is increasingly impractical.
“In many cases, it’s much easier to move the computer to the data than the data to the computer.”
I read this statement in a Nature article, and the point is spot on. The explosion of astronomical datasets over the past couple of decades has made astronomy a science that goes hand in hand with data science. With Python emerging as the de facto language of data science, astronomers have, of course, adopted it wholeheartedly for their computing needs as well. The development of the Astropy project has been a remarkable achievement. Actively developed by hundreds of astronomers trying their hands at computer science, and computer science wizards fascinated by astronomy, it has revolutionized the way we compute astronomical quantities. Astropy is just one example: packages such as SciPy, NumPy, pandas, and scikit-learn have become the day-to-day tools of astronomers working with data.
Jupyter Notebooks
The next stage was the development and rapid adoption of the Jupyter project, which made long and complicated code elegant, readable, and easy to document. Jupyter has also enabled us to move the workflow entirely to the cloud, thanks to the cloud computing services offered by a wide variety of companies.
In this article, I explore the current status of ‘general purpose’ computing in astronomy when it comes to cloud-based services, and how early-stage researchers can harness their potential. The focus is on ways in which traditional coding on your desktop or laptop can be replaced by a cloud-based workflow. This is not only possible but easily deployable. The central part of such a deployment is the Jupyter notebook environment, which comes in different varieties and can be deployed on both local and remote machines.
But what do we mean by a “cloud-based” workflow?
Here is what I propose for a workflow, assuming most general programming is done in Python, the de facto language of astronomers. I expect the following:
- The user works actively in the Python language. From execution to the generation of the end results, Python is involved primarily or even as a wrapper.
- The user installs packages from either pip or conda. Building from source via GitHub is also an option if a package is unavailable through the other two channels.
- The end results are saved into a repository, which works as a central hub for the project.
- The user is at least familiar with the Jupyter-notebook environment.
- The user pushes this repository actively to GitHub. Read this article to familiarize yourself with GitHub.
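To make the “end results in a repository” point concrete, here is a minimal sketch (the file names and the target/distance values are hypothetical) that writes results to a CSV file inside the project folder; a `git add`, `git commit`, and `git push` from the terminal or a notebook cell then publishes them to GitHub:

```python
import csv
from pathlib import Path

# Hypothetical end results: (target name, distance in parsecs).
results = [("HD 189733", 19.8), ("TRAPPIST-1", 12.4)]

out = Path("results") / "distances.csv"
out.parent.mkdir(exist_ok=True)  # keep outputs in a dedicated folder
with out.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["target", "distance_pc"])
    writer.writerows(results)

# Afterwards, from a terminal (or a notebook "!" cell):
#   git add results/distances.csv && git commit -m "update results" && git push
```

Keeping outputs in plain-text formats like CSV makes the repository diffs readable for collaborators.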
In order to use Jupyter notebooks locally, we can use the Anaconda installer, which is available here. It provides the necessary Python packages and is super easy to install and use. One can launch Anaconda Navigator and find Jupyter Notebook there. Once launched, the notebook opens in a web browser.
Now, one can simply write code in the cells available there and execute it. A tutorial on using Jupyter notebooks can be found here. While Jupyter is a step forward in the sense that you become OS-independent as long as you have Python installed and a web browser running, you are still bound to your device. Jupyter itself provides options to run notebooks remotely, but there is another free service that makes it even easier. Read on.
Cloud-based computing?
I’m not writing this article to pitch cloud services. Rather, the intention is to provide a glimpse into this world and demonstrate what kinds of services and applications we can utilize, keeping academic research in mind. Once you start working with cloud-based workflows, devices become necessary but less important. With the popularity of Google services (G Suite, now Google Workspace) for documents, spreadsheets, and slides, or Overleaf for LaTeX, we are already working in the cloud to some extent. I would bet that most people involved in research write and submit their papers through Overleaf. The point I am making is that working through cloud-based services is not that uncommon.
When it comes to writing code, however, we still hesitate to adopt a cloud-based workflow. Perhaps we worry about the OS and its version, the necessary packages, and compatibility issues between different package versions! The arrival of Jupyter has definitely changed that.
Think about it: isn’t it more convenient for the user? It makes you platform-independent. Gone are the days when people had to prioritize their OS. Now any OS with a web browser will do.
Google Colaboratory
Google Colaboratory (or Colab, as it is popularly known) is a cloud-based service that provides a computing interface for up to 12 hours at a time, with an ample amount of RAM and storage. Its free tier provides somewhat more resources than similar services such as Kaggle (also owned by Google), Azure Notebooks, or a free-tier Amazon Web Services instance. These specifications are on par with what we would expect from a ‘usual’ computer used for research by individuals. Work done in Colab is saved in the cloud (in your Google Drive), and people can collaborate on code just as they do on Docs!
It is super easy to share code, and anyone can run it at any time, needing just an internet connection. Behind all this wonder is simply the Jupyter environment! A brief overview of the features of Google Colab is presented here; I’d recommend going through it. Familiarity with git usage and the Jupyter environment is expected.
The basic requirements:
- Working knowledge of Python language and git version control.
- Enthusiasm for cloud-based services.
- Openness to learning new things.
Okay, so assuming you tick all these boxes, let’s dive deeper. The first thing that comes to mind is a basic text editor and a terminal. Here, that requirement is covered by notebooks, which replace the traditional editor and terminal. Now, the benefit of using cloud-based notebooks is that we don’t need to worry about the various versions of packages in use. We run a “pip install” or “conda install” in the notebook itself, and the packages are available in an instant!
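In a notebook cell this is usually written as `!pip install <package>`. The equivalent plain-Python form, which works in any environment, looks like the sketch below (the package name in the example is just a hypothetical choice):

```python
import subprocess
import sys

def notebook_pip_install(package: str) -> None:
    """Install a package into the *same* Python that runs this notebook.

    Using sys.executable avoids the classic pitfall of `pip` on the PATH
    pointing at a different interpreter than the notebook kernel.
    """
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Example usage (hypothetical package choice):
# notebook_pip_install("astroquery")
```

Inside Colab or Jupyter the `!pip install` shorthand does essentially the same thing for you.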
Now we are ready to work with the notebook on Colab. Yes, it looks similar to the Jupyter notebooks we used on our local machine, because it is indeed a Jupyter notebook! There are a couple of differences, though:
- This notebook runs on Google’s servers, making your local machine irrelevant. You may run it from any machine you have, anywhere in the world!
- The runtime, and any data and analysis stored on it, is available only for up to 12 hours, which should be sufficient for the majority of tasks.
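Because the runtime is wiped after a session, a common pattern is to detect Colab and mount your Google Drive for persistent storage. `google.colab.drive` is Colab’s own helper; the try/except guard (and the hypothetical folder name) make the same notebook runnable on a local machine too:

```python
try:
    from google.colab import drive  # only importable inside Colab
    drive.mount("/content/drive")   # prompts for Google authentication
    DATA_DIR = "/content/drive/MyDrive/project_data"  # hypothetical folder
    IN_COLAB = True
except ImportError:
    DATA_DIR = "./project_data"  # local fallback
    IN_COLAB = False

print("Running in Colab:", IN_COLAB)
```

Anything written under `DATA_DIR` on Colab then survives the 12-hour runtime limit, because it lives in your Drive rather than on the ephemeral instance.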
Well, Colab is built with machine learning (ML) enthusiasts in mind, so all the relevant ML packages come pre-installed. You need not worry about commonly used packages such as NumPy, SciPy, Matplotlib, pandas, or Keras. Since a fresh instance is launched for every session, you have to install any additional packages for each instance. Your connection to the runtime and the remaining time can be displayed using the commands Colab provides.
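You can also query the resources of the current runtime from Python itself. This stdlib-only sketch works both locally and on a Colab instance:

```python
import os
import shutil

# Free and total disk space of the runtime's filesystem, in gigabytes.
disk = shutil.disk_usage("/")
print(f"Disk: {disk.free / 1e9:.1f} GB free of {disk.total / 1e9:.1f} GB")

# Number of CPU cores the runtime exposes.
print(f"CPU cores: {os.cpu_count()}")

# Total RAM (POSIX systems only, hence the guard).
if hasattr(os, "sysconf") and "SC_PHYS_PAGES" in os.sysconf_names:
    ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    print(f"RAM: {ram / 1e9:.1f} GB")
```

Running this at the start of a session is a quick way to confirm what the free tier has actually allocated to you.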
Colab is good, but it still needs many improvements before it can truly become a desktop replacement. There are other services similar to Google Colab; Kaggle is the most popular among them, if we are willing to compromise a bit on resources.
The usage of Python is growing by leaps and bounds, and with the integration of web-based notebooks built on the Jupyter project, it will only become more popular. More and more people are publishing their code on GitHub in the form of executable Jupyter notebooks, which is indeed a welcome sign of things to come. Gone are the days when people worried about their compilers not working, software versions not being compatible, or their OS not supporting a particular library. With a web-based workflow, life is definitely going to be easier in the times to come!
Some articles that I found online related to this one are attached below for reference:
https://www.nature.com/articles/d41586-018-07196-1
https://www.nature.com/news/interactive-notebooks-sharing-the-code-1.16261
https://www.linuxjournal.com/content/jupyter-future-open-science
https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/