SciPy Stack: Python’s number cruncher

Bernard Brenyah
Published in DS Biz
4 min read · Sep 15, 2017

Python has recently been gaining traction as the go-to programming language for data scientists. There seems to be a never-ending “Python vs. R” debate in the data science community; a quick Google search will show you just how heated it gets. Personally, I am not interested in those debates (nor should you be). I think the focus should be on data science concepts rather than on these ‘tools’. Once the concepts have been mastered, the choice of tools comes down to personal preference.
I chose Python because of its:

  • simple, logical and clean syntax
  • maturing data analytics libraries (SciPy stack, statsmodels, etc.)

SciPy logo. Credit: SciPy Twitter

The bedrock of number crunching and visualization in Python is the SciPy stack. The ability to understand and use its core libraries (NumPy, Pandas and Matplotlib) is crucial to analyzing various kinds of data in Python. Luckily, Anaconda comes bundled with the SciPy stack, so you are good to go after installing this Python distribution.

My Recommendations

Credit: Dataconomy

I will briefly give an overview of these libraries and some recommendations on how to learn to use them in your projects.

NumPy: NumPy is an essential package for data analysis in Python. This library includes most of the linear algebra functions and random number generation capabilities you will need. According to the official documentation:

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- a powerful N-dimensional array object

- sophisticated (broadcasting) functions

- tools for integrating C/C++ and Fortran code

- useful linear algebra, Fourier transform, and random number capabilities
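The features listed above can be tried out in a few lines. The following is a minimal sketch, not an official example: the array values are made up for illustration, and `np.random.default_rng` assumes a reasonably recent NumPy (1.17 or later).

```python
import numpy as np

# N-dimensional array object: here, a 2 x 3 matrix
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Broadcasting: the 1-D row is stretched across both rows of `a`
row = np.array([10.0, 20.0, 30.0])
shifted = a + row

# Linear algebra: solve the system M @ x = b
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(M, b)

# Random numbers: a seeded generator gives reproducible draws
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=5)

print(shifted)
print(x)
print(samples)
```

Notice that no explicit loops are needed; operations apply to whole arrays at once, which is a large part of why NumPy code tends to be both short and fast.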

Pandas: Pandas offers high-level data analysis capabilities and is built on top of NumPy. Think of Pandas as a badass version of Microsoft Excel. You can do amazing things with Pandas! According to the official documentation:

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

Before Pandas was developed, data analysts often had to rely on R, because data analysis and modelling were hard to achieve with Python’s core packages. Pandas has since unlocked much of that capability in Python, so Python data analysts no longer have to disrupt their workflow by switching to another tool (say, R) for a specific computation.
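To make the Excel comparison concrete, here is a minimal sketch of a spreadsheet-style workflow in Pandas. The `sales` table, its column names and its values are all hypothetical, made up purely for illustration.

```python
import pandas as pd

# Hypothetical sales records (columns and values are invented)
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 4, 6, 8],
    "price": [2.5, 3.0, 2.5, 3.0],
})

# A new column computed from existing ones, like a spreadsheet formula
sales["revenue"] = sales["units"] * sales["price"]

# Group rows by label and aggregate, like a pivot table
revenue_by_region = sales.groupby("region")["revenue"].sum()

print(sales)
print(revenue_by_region)
```

The labeled columns are what make this readable: you refer to `"revenue"` and `"region"` by name instead of by cell position.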

Matplotlib: Visualization is an important aspect of data analytics. Matplotlib gives data analysts the ability to create mind-blowing (yet insightful) plots. According to the official documentation:

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.
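As a rough illustration of the description above, the sketch below draws two curves and saves the figure to a file. The output filename `sine_cosine.png` is an arbitrary choice, and the `Agg` backend is selected only so the script also runs on machines without a display; in a Jupyter notebook you can drop that line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Sample 100 points between 0 and 2*pi
x = np.linspace(0, 2 * np.pi, 100)

# One figure with a single set of axes
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")

# Labels, title and legend make the plot self-explanatory
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Sine and cosine")
ax.legend()

# Write the figure to disk (filename is arbitrary)
fig.savefig("sine_cosine.png", dpi=150)
```

Most Matplotlib code follows this same pattern: create a figure and axes, plot onto the axes, label things, then show or save.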

Using these packages in your projects is the tricky bit. Before I proceed with my recommendations, I want to state that you DO NOT have to know ALL the functions in these packages. A good trick is to check the official docs whenever you encounter a new method or function. Over time, your knowledge of these functions will expand, leading to even more creative solutions.
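You do not always have to leave your editor to look something up. Most functions in these libraries carry a docstring that largely mirrors the online reference, and a small sketch of reading one is shown below. `np.linspace` is just an arbitrary example; in an interactive session, `help(np.linspace)` prints the whole thing, and in IPython or Jupyter `np.linspace?` does the same.

```python
import numpy as np

# Every documented function exposes its docstring via __doc__
doc = np.linspace.__doc__

# Show only the first few lines as a quick summary
first_lines = doc.strip().splitlines()[:3]
print("\n".join(first_lines))
```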

How quickly you learn to use the functions and methods in these libraries will understandably depend on your programming level. The official documentation for these packages should be sufficient for a seasoned Python programmer, but then again, if you were a pro, you wouldn’t even be reading this.

Adi Bronshtein has a nice introduction to NumPy and Pandas. After checking that out, I recommend that total newbies visit the official documentation pages, which usually have a quick intro page for visitors. Take a sneak peek at what is ahead. Don’t fret if you don’t get it right away (especially Matplotlib). With time and practice, you will use them with ease!

NumPy: NumPy Quickstart tutorial

Pandas: 10 Minutes to pandas

Matplotlib: Matplotlib Introductory Tutorials

After previewing the official tutorials, head to YouTube, where amazing people have uploaded numerous playlists on these packages and their use cases. I recommend these playlists as video references:

NumPy: numpy tutorial

Pandas: Data analysis in Python with pandas

Matplotlib: Matplotlib Tutorial Series — Graphing in Python

You won’t master these libraries unless you start using them. So go ahead and play with them until the next blog post, where we will finally move on to the practical stuff and start doing some real-life financial analysis with Python!
