Instruments for a Data Scientist Toolbox

Lehar Oha
Published in Swedbank AI
5 min read · Jun 30, 2020
CC0 Public Domain

The number of everyday tools for data science keeps growing, covering an ever-increasing spectrum of tasks. We at Swedbank use part of our daily scrum meetings to introduce and discuss some of these new tools. Below, we present a selection of recently “discovered” flavors.

In our team at Swedbank we work according to agile principles, and in our daily stand-up sessions we have introduced something we call “tech tips”. Each team member can share technical tools or information that might be useful for the whole group: a blog post about developments in deep learning, an efficient library for natural language pre-processing, a condensed summary of a video lecture, or anything else. All contributions are welcome, in the format of a three-minute presentation.

Below, help yourself to a selection of tech tips (geared towards tools) that we have gathered over the past months.

Useful Python Tools

Pattern matching with pampy

At the time of writing, Python has no pattern matching statement in its core language. Pattern matching is the act of checking a value against a pattern and performing an action when it matches. Such expressive constructs exist in Scala, Kotlin and a number of other languages. Until PEP 622 (structural pattern matching) lands, libraries like pampy can be used to achieve similar goals. The API is self-explanatory and accessible, and easily worth a try.

https://github.com/santinic/pampy

Derived functions with functools

Python’s functools module provides tools for working with functions and performing operations on them. One useful construct is “partial”, which creates a new function with selected arguments of another function pre-filled. For example, one can derive a purpose-specific function from a more general data-reading function.

Database queries helper with functools.partial
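
A minimal sketch of this idea, assuming an in-memory SQLite database so that the example is self-contained. The `read_rows` function, the `customers` table and its contents are all hypothetical stand-ins for a real data-reading helper.

```python
import sqlite3
from functools import partial


def read_rows(connection, table, columns="*", limit=None):
    """General reader (hypothetical): fetch rows from any table.

    Identifiers are interpolated directly into the query, which is fine
    for a toy sketch but not for untrusted input.
    """
    query = f"SELECT {columns} FROM {table}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return connection.execute(query).fetchall()


# Toy in-memory database, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ann"), (2, "Bo"), (3, "Cy")],
)

# Pre-fill the connection and table name to get a task-specific function.
read_customers = partial(read_rows, conn, "customers")

first_two = read_customers(limit=2)      # remaining arguments still work
names = read_customers(columns="name")
```

The derived `read_customers` behaves like `read_rows` with the first two arguments fixed, so callers only supply what actually varies between calls.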

The module offers other valuable features worth studying, such as “lru_cache”; the details are left to the curious reader!
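
For the curious reader, a minimal sketch of what `lru_cache` does: it remembers the results of earlier calls so that repeated calls with the same arguments skip the function body. The `slow_square` function and the call counter are hypothetical and exist only to make the caching visible.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)  # maxsize=None means the cache never evicts entries
def slow_square(n):
    """Stand-in for an expensive computation (hypothetical)."""
    global call_count
    call_count += 1
    return n * n


# The second call with 4 is served from the cache.
results = [slow_square(4), slow_square(4), slow_square(5)]

# cache_info() reports hits and misses since the cache was created.
info = slow_square.cache_info()
```

Caching only makes sense for functions whose result depends solely on their arguments, and the arguments must be hashable.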

PyAutoGUI for GUI based automation

PyAutoGUI is one of many tools for GUI-based automation. It lets you programmatically control the mouse and keyboard, interact with other applications, display messages and much more. It can be considered part of a robotic process automation (RPA) toolkit, where processes are automated by software robots. Below is an extract from a script that automates email communication with a team member, but the library can also be used for file manipulation, filling in forms or JIRA tickets, and so on. Basically any repetitive and tedious task in a GUI environment can be automated, and made even more efficient when combined with a dose of machine learning.

Outlook automation with PyAutoGUI

IDE- and Presentation-Related Tools

VS Code extensions for productivity

Microsoft Visual Studio Code (VS Code) is one of the leading source code editors and can, together with its extensions, provide a solid efficiency boost for developers. Besides the large number of extensions that add useful Python support, there are a number of tools worth highlighting:

  • autoDocstring generates docstrings for Python functions in different formats (like Google or NumPy).
  • Code Spell Checker spots common spelling errors.
  • Bookmarks is an effective tool for marking and jumping between important points in the code.
  • jumpy serves a similar purpose and helps move the cursor quickly to the desired position.
  • Markdown Preview Enhanced provides a real-time preview when working with markdown files.

VS Code editor, https://code.visualstudio.com/docs/languages/python

Carbon

Carbon is a tool for creating and sharing beautiful images of your source code. Choose a theme and programming language, paste your source code, and export the result for your presentation. As an example, we have used it to create a meetup advertisement; try it out here.

Carbon example

Machine Learning-Related Tools

Augmentor

Training data is scarce, especially labeled data. One way to expand a limited dataset is data augmentation. Augmentor is a helpful tool for this in the image processing domain. It generates artificial data from existing data with the help of an augmentation pipeline, in which a series of operations is performed on images.

In the example below, the first image is the original and the others are samples from a pipeline that applies probabilistic rotation and zoom within certain bounds. The augmented images are used as input to a CNN-like model that detects handwritten characters, and the augmentation helps combat over-fitting.

Augmented images of handwritten “d”

Hummingbird

Hummingbird compiles traditional machine learning models into tensor computations for fast inference/scoring. This tool from Microsoft lets traditional models benefit from hardware accelerators and from the optimizations of deep learning frameworks, without having to rewrite the models. At the time of writing, Hummingbird only supports converting a number of tree-based models (built with scikit-learn, XGBoost, LightGBM) to PyTorch, and tests have shown an average speed-up of 65x from scikit-learn to PyTorch. As seen below, converting is very easy, and it is worth keeping an eye on this library considering the promising roadmap.

Hummingbird Github example

Concluding remarks

Here we end our tour of tools. We hope you found something useful!

Lehar Oha
Swedbank AI

Data Scientist at Analytics & AI @ Swedbank Group