Top 15 Python Libraries for Data Science in 2017

Igor Bobriakov
May 9, 2017 · 10 min read

Python has gained a lot of traction in the data science industry in recent years, so I wanted to outline some of its most useful libraries for data scientists and engineers, based on recent experience.

Since all of the libraries are open source, we have added commit counts, contributor counts, and other metrics from GitHub, which can serve as a proxy for library popularity.

Core Libraries.

1. NumPy (Commits: 15980, Contributors: 522)

The most fundamental package, around which the scientific computing stack is built, is NumPy (short for Numerical Python). It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python. The library provides vectorization of mathematical operations on the NumPy array type, which improves performance and speeds up execution accordingly.
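As a rough illustration of what vectorization means in practice, the sketch below compares an explicit Python loop with the equivalent NumPy expression. The data and variable names are made up for the example.

```python
import numpy as np

# Hypothetical data: 100,000 evenly spaced values between 0 and 1.
values = np.linspace(0.0, 1.0, 100_000)

# Pure-Python approach: visit every element in an interpreter-level loop.
squares_loop = [v * v for v in values]

# Vectorized approach: one expression applied to the whole array,
# executed inside NumPy's compiled code.
squares_vec = values * values

# Both approaches produce the same numbers.
print(np.allclose(squares_loop, squares_vec))
```

On arrays of this size the vectorized form is usually much faster, although the exact speedup depends on the machine.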

2. SciPy (Commits: 17213, Contributors: 489)

3. Pandas (Commits: 15089, Contributors: 762)

There are two main data structures in the library:

“Series”: one-dimensional

“DataFrame”: two-dimensional

For example, you can get a new DataFrame by combining these two types of structures: passing a Series to a DataFrame appends it as a single row.
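A minimal sketch of appending a Series as a row might look like the following. The column names and values are invented, and the example uses pd.concat because the older DataFrame.append method was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical two-column DataFrame.
df = pd.DataFrame({"a": [1, 3], "b": [2, 4]})

# A Series whose index matches the DataFrame's column names.
new_row = pd.Series({"a": 5, "b": 6})

# Turn the Series into a one-row DataFrame and concatenate it
# below the existing rows, renumbering the index.
result = pd.concat([df, new_row.to_frame().T], ignore_index=True)

print(result)
```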

Here is just a small list of things that you can do with Pandas:

  • Easily add and delete columns in a DataFrame
  • Convert data structures to DataFrame objects
  • Handle missing data, represented as NaNs
  • Group data with powerful group-by functionality
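The operations in this list can be sketched in a few lines. The city names and sales figures below are made-up sample data, not anything from a real dataset.

```python
import numpy as np
import pandas as pd

# Convert a plain dict of lists into a DataFrame.
df = pd.DataFrame({
    "city": ["Kyiv", "Kyiv", "Lviv", "Lviv"],
    "sales": [10.0, np.nan, 7.0, 5.0],
})

# Add a column, then delete it again.
df["sales_doubled"] = df["sales"] * 2
df = df.drop(columns=["sales_doubled"])

# Handle missing data: replace NaN with 0.
df["sales"] = df["sales"].fillna(0.0)

# Group by a column and aggregate each group.
totals = df.groupby("city")["sales"].sum()

print(totals)
```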

Figure: Google Trends history (source: trends.google.com)

Figure: GitHub pull requests history (source: datascience.com/trends)

Visualization.

4. Matplotlib (Commits: 21754, Contributors: 588)

Matplotlib is fairly low-level, meaning that you will need to write more code to reach advanced levels of visualization and will generally put in more effort than with higher-level tools, but the overall effort is worth it.

With a bit of effort you can make just about any visualization:

  • Line plots;
  • Scatter plots;
  • Bar charts and Histograms;
  • Pie charts;
  • Stem plots;
  • Contour plots;
  • Quiver plots;
  • Spectrograms.

There are also facilities for creating labels, grids, legends, and many other formatting entities with Matplotlib. Basically, everything is customizable.
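To give a feel for that level of control, here is a small sketch of a line plot with a title, axis labels, a grid, and a legend. The output file name is an arbitrary choice, and the Agg backend is selected only so the example runs without a GUI.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)", linestyle="--")

# Formatting entities: title, axis labels, grid, and legend.
ax.set_title("Sine and cosine")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.grid(True)
ax.legend()

# Write the figure to an image file (hypothetical file name).
fig.savefig("sine_cosine.png")
```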

Matplotlib is supported on different platforms and makes use of different GUI toolkits to display the resulting visualizations. Various IDEs and interactive environments (like IPython) support Matplotlib’s functionality.

There are also some additional libraries that can make visualization even easier.

5. Seaborn (Commits: 1699, Contributors: 71)

6. Bokeh (Commits: 15724, Contributors: 223)

7. Plotly (Commits: 2486, Contributors: 33)

Figure: Google Trends history (source: trends.google.com)

Figure: GitHub pull requests history (source: datascience.com/trends)

Machine Learning.

8. SciKit-Learn (Commits: 21793, Contributors: 842)

Scikit-learn exposes a concise and consistent interface to common machine learning algorithms, making it simple to bring ML into production systems. The library combines quality code and good documentation with ease of use and high performance, and it is the de facto industry standard for machine learning with Python.
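The consistency of the interface is easiest to see when two unrelated algorithms are trained with identical code. The sketch below is only an illustration: the choice of the bundled iris dataset, the two classifiers, and the hyperparameters are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small bundled dataset, split into training and test parts.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Different algorithms, same fit / score interface.
models = [LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)]
scores = {}
for model in models:
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

print(scores)
```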

Deep Learning — Keras / TensorFlow / Theano

9. Theano. (Commits: 25870, Contributors: 300)

Theano is a Python package that defines multi-dimensional arrays similar to NumPy’s, along with math operations and expressions. The library compiles these expressions, which lets them run efficiently on a range of architectures. Originally developed by the Machine Learning group of Université de Montréal, it is primarily used for machine learning.

The important thing to note is that Theano integrates tightly with NumPy at the low level of its operations. The library also optimizes the use of GPU and CPU, making data-intensive computation even faster.

Efficiency and stability tweaks allow for much more precise results even with very small values; for example, computing log(1+x) gives accurate results even for the smallest values of x.
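Why log(1+x) needs special care can be shown without Theano at all. The sketch below uses NumPy’s log1p function instead of Theano, purely to illustrate the numerical problem that such stability tweaks address.

```python
import numpy as np

x = 1e-20  # far smaller than double-precision machine epsilon (about 2.2e-16)

# Naive form: 1 + x rounds to exactly 1.0, so the logarithm comes out as 0.
naive = np.log(1.0 + x)

# Stable form: log1p evaluates log(1 + x) without forming 1 + x first.
stable = np.log1p(x)

print(naive, stable)
```

The naive expression loses the value of x entirely, while the stable one returns a result very close to x, which is what the mathematics predicts for tiny x.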

10. TensorFlow. (Commits: 16785, Contributors: 795)

Coming from developers at Google, it is an open-source library for data flow graph computations, tuned for machine learning. It was designed to meet the demanding requirements of Google’s environment for training neural networks and is a successor of DistBelief, a machine learning system based on neural networks. However, TensorFlow isn’t strictly for scientific use within Google’s borders: it is general enough to be used in a variety of real-world applications.

The key feature of TensorFlow is its multi-layered system of nodes, which enables quick training of artificial neural networks on large datasets. This powers Google’s voice recognition and object identification in pictures.

11. Keras. (Commits: 3519, Contributors: 428)

Its minimalistic design is aimed at fast and easy experimentation through the building of compact systems.

Keras is really easy to get started with and to keep going with for quick prototyping. It is written in pure Python and is high-level in nature. It is highly modular and extensible. Notwithstanding its ease, simplicity, and high-level orientation, Keras is still deep and powerful enough for serious modeling.

The general idea of Keras is based on layers, and everything else is built around them. Data is prepared in tensors, the first layer is responsible for input of tensors, the last layer is responsible for output, and the model is built in between.
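The layer-based idea can be sketched as a tiny sequential model. This example assumes Keras is installed as part of TensorFlow (hence the `from tensorflow import keras` import); the layer sizes and the random input data are arbitrary choices for illustration.

```python
import numpy as np
from tensorflow import keras

# Hypothetical data: 32 samples with 4 features each, prepared as a tensor.
features = np.random.rand(32, 4).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(4,)),                      # input: tensors of 4 features
    keras.layers.Dense(8, activation="relu"),     # the model built in between
    keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Run the untrained model once to see the shape of its output.
predictions = model.predict(features, verbose=0)
print(predictions.shape)
```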

Figure: Google Trends history (source: trends.google.com)

Figure: GitHub pull requests history (source: datascience.com/trends)

Natural Language Processing.

12. NLTK (Commits: 12449, Contributors: 196)

NLTK supports a wide range of operations, such as text tagging, classification, tokenizing, named entity identification, building corpus trees that reveal inter- and intra-sentence dependencies, stemming, and semantic reasoning. These building blocks allow for building complex research systems for different tasks, for example sentiment analysis and automatic summarization.
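Two of these building blocks, tokenizing and stemming, can be tried in a few lines. The sketch below deliberately picks a regex-based tokenizer and the Porter stemmer because, as far as I know, neither needs extra corpus downloads; the sample sentence is invented.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "Data scientists are running experiments."

# Split the sentence into word and punctuation tokens.
tokens = wordpunct_tokenize(text)

# Reduce each token to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```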

13. Gensim (Commits: 2878, Contributors: 179)

Gensim is intended for use with raw and unstructured digital texts. It implements algorithms such as hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), as well as tf-idf, random projections, word2vec, and doc2vec, which facilitate examining texts for recurring patterns of words across a set of documents (often referred to as a corpus). All of the algorithms are unsupervised: they need no manual annotation, and the only input is the corpus.

Figure: Google Trends history (source: trends.google.com)

Figure: GitHub pull requests history (source: datascience.com/trends)

Data Mining. Statistics.

14. Scrapy (Commits: 6325, Contributors: 243)

Scrapy is open source and written in Python. It was originally designed strictly for scraping, as its name indicates, but it has evolved into a full-fledged framework with the ability to gather data from APIs and act as a general-purpose crawler.

The library follows the famous Don’t Repeat Yourself principle in its interface design: it prompts its users to write general, universal code that is going to be reusable, thus making it easier to build and scale large crawlers.

The architecture of Scrapy is built around the Spider class, which encapsulates the set of instructions that the crawler follows.

15. Statsmodels (Commits: 8960, Contributors: 119)

Among its many useful features are descriptive statistics and result statistics obtained from linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis models, and various estimators.

The library also provides extensive plotting functions that are designed specifically for use in statistical analysis and tuned for good performance with large statistical data sets.

Conclusions.

And here are the detailed stats of GitHub activity for each of those libraries:

Source: Google Spreadsheet

Of course, this is not a fully exhaustive list, and there are many other libraries and frameworks that are also worthwhile and deserve proper attention for particular tasks. A great example is the various SciKit packages that focus on specific domains, like SciKit-Image for working with images.

So, if you have another useful library in mind, please let our readers know in the comments section.

Thank you very much for your attention.

A short version of this article is available here: https://activewizards.com/blog/top-15-libraries-for-data-science-in-python/

ActiveWizards: machine learning company

Helping organizations to implement AI and data science initiatives
