Instruments for a Data Scientist Toolbox

Lehar Oha
Jun 30, 2020 · 5 min read
CC0 Public Domain

The number of everyday tools for data science keeps growing, covering an ever-increasing spectrum of tasks. We at Swedbank use part of our daily scrum meetings to introduce and discuss some of these new tools. Below, we present a selection of recently “discovered” flavors.

Our team at Swedbank follows agile principles, and in our daily stand-up sessions we have introduced something we call “tech tips”. Essentially, each team member can share technical tools or information that might be useful for the whole group. Be it a blog post about developments in deep learning, an efficient library for natural language pre-processing, a condensed summary of a video lecture or anything else, all contributions are welcome, in the format of a three-minute presentation.

Below, help yourself to a selection of tech tips (with a focus on tools) that we have gathered over the past months.

Useful Python Tools

At the time of writing, Python has no pattern matching statement in its core language. Pattern matching means checking a value against a pattern and performing an action when the value matches. Such expressive constructs exist in Scala, Kotlin and a number of other languages. Until PEP 622 (structural pattern matching) lands, libraries like pampy can be used to achieve similar goals. The API is self-explanatory and accessible, and well worth a try.

https://github.com/santinic/pampy

Python’s functools module provides higher-order functions, that is, functions that act on or return other functions. One useful construct is “partial”, which creates a new function with selected arguments of another function already filled in. For example, one can derive a purpose-specific function from a more general data reading function.

Database queries helper with functools.partial
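
As a rough illustration of the idea, the sketch below builds a query helper with functools.partial. It is not the code from the original screenshot: the in-memory SQLite database, the customers table and all function names are assumptions made up for this example.

```python
import sqlite3
from functools import partial


def run_query(connection, query, params=()):
    """General-purpose reader: run `query` on `connection`, return all rows."""
    return connection.execute(query, params).fetchall()


# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Anna", "SE"), ("Mari", "EE"), ("Jonas", "LT")],
)

# Pre-fill the connection once; callers now only pass the query.
query_demo_db = partial(run_query, conn)

# Pre-fill the query as well, leaving only the parameters open.
customers_in = partial(
    query_demo_db, "SELECT name FROM customers WHERE country = ?"
)

print(customers_in(("EE",)))  # [('Mari',)]
```

The general function stays untouched, while each partial gives the rest of the code a shorter call with fewer arguments to get wrong.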

The module offers other valuable functionality worth studying, such as “lru_cache”; the details are left to the curious reader!
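
For the curious reader, here is a minimal sketch of lru_cache in action. The Fibonacci function is our own toy example; the decorator stores results of earlier calls so that repeated calls with the same argument are answered from the cache.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def fib(n):
    """Naive recursive Fibonacci; the cache removes the repeated sub-calls."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


print(fib(80))           # 23416728348467685
print(fib.cache_info())  # shows cache hits, misses and current size
```

Without the decorator the same call would take an impractically long time, because the recursion recomputes the same values over and over.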

Another tool is PyAutoGUI, one of the many tools for GUI-based automation. It allows you to programmatically control the mouse and keyboard, interact with other applications, display messages and much more. It can be considered part of a robotic process automation (RPA) toolkit, where processes are automated by software robots. Below is an extract from a script that automates email communication with a team member, but PyAutoGUI can also be used for file manipulation, filling in forms or JIRA tickets, and so on. Basically, any repetitive and tedious task in a GUI environment can be automated, and made even more efficient when combined with a dose of machine learning.

Outlook automation with PyAutoGUI
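
A minimal sketch of what such a script could look like is shown below. It is not our original script: the function names are made up, and the keyboard shortcuts and tab order are assumptions about the mail client that you would need to adapt. The steps are first collected as plain data, so they can be inspected before anything touches the mouse or keyboard.

```python
def email_steps(recipient, subject, body):
    """Return the GUI actions, in order, as (pyautogui_function, args) pairs."""
    return [
        ("hotkey", ("ctrl", "n")),      # assumed shortcut: open a new message
        ("write", (recipient,)),        # type the address into the "To" field
        ("press", ("tab",)),            # assumed: Tab moves to the next field
        ("write", (subject,)),
        ("press", ("tab",)),
        ("write", (body,)),
        ("hotkey", ("ctrl", "enter")),  # assumed shortcut: send the message
    ]


def run(steps, pause=0.5):
    """Replay the steps with PyAutoGUI; requires a desktop session."""
    # Imported here because importing PyAutoGUI can fail without a display.
    import pyautogui

    pyautogui.PAUSE = pause    # seconds to wait after each PyAutoGUI call
    pyautogui.FAILSAFE = True  # mouse in the upper-left corner aborts the run
    for name, args in steps:
        getattr(pyautogui, name)(*args)


# Dry run: print the plan instead of executing it.
for step in email_steps("colleague@example.com", "Tech tip", "See you at stand-up!"):
    print(step)
```

Calling run(...) with the same list would execute the steps against whichever window has focus, so the mail client must be in the foreground first.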

IDE- and Presentation-Related Tools

Microsoft Visual Studio Code (VS Code) is one of the leading source code editors and can, together with its extensions, provide a solid efficiency boost for developers. Besides the large number of extensions that add useful Python support, there are several tools worth highlighting:

  • autodocstring generates docstrings for Python functions in different formats (such as Google or NumPy).
  • code spell checker spots common spelling errors.
  • bookmarks is an effective tool for marking important points in the code and jumping between them.
  • jumpy serves a similar purpose and helps to move the cursor quickly to the desired position.
  • markdown preview enhanced enables real-time preview of documents when working with Markdown files.

VS Code editor, https://code.visualstudio.com/docs/languages/python

Carbon is a tool for creating and sharing beautiful images of your source code. Paste your source code, select a theme and a programming language, and export the result for your presentation. As an example, we have used it to create a meetup advertisement; it is well worth trying out.

Carbon example

Machine Learning-Related Tools

Training data, especially labeled data, is scarce. One way to expand a limited dataset is data augmentation. Augmentor is a very helpful tool for this in the image processing domain. It generates artificial data from existing data with the help of an augmentation pipeline, in which a series of operations is performed on the images.

In the example below, the first image is the original and the others are samples from a pipeline that applies probabilistic rotation and zoom within certain bounds. The augmented images serve as input to a CNN-like model that detects handwritten characters, and the added variation helps to combat overfitting.

Augmented images of handwritten “d”
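
A pipeline of this kind could be sketched as follows, assuming Augmentor is installed. The probabilities and bounds are illustrative values of our own, not the settings used for the images above, and the folder name in the usage comment is hypothetical.

```python
# Illustrative settings (assumptions, not the values behind the images above).
ROTATION = {"probability": 0.7, "max_left_rotation": 10, "max_right_rotation": 10}
ZOOM = {"probability": 0.5, "min_factor": 1.1, "max_factor": 1.5}


def build_pipeline(image_dir):
    """Create an Augmentor pipeline with probabilistic rotation and zoom."""
    # Imported here so the sketch loads even where Augmentor is not installed.
    import Augmentor

    pipeline = Augmentor.Pipeline(image_dir)
    # Each operation is applied to an image with the given probability.
    pipeline.rotate(**ROTATION)
    pipeline.zoom(**ZOOM)
    return pipeline


# Usage, with a hypothetical folder of source images:
# pipeline = build_pipeline("handwritten_d")
# pipeline.sample(100)  # generates 100 augmented images
```

Because every operation carries its own probability, each sampled image passes through a different random combination of the steps.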

Hummingbird compiles traditional machine learning models into tensor computations for fast inference/scoring. This tool from Microsoft benefits from hardware accelerators and from the optimizations built into deep learning frameworks, and it lets you keep using traditional tools without having to rewrite models. At the time of writing, Hummingbird only supports converting a number of tree-based models (built with scikit-learn, XGBoost or LightGBM) to PyTorch, and tests have shown an average speed-up of 65x from scikit-learn to PyTorch. As seen below, converting is very easy, and the library is worth keeping an eye on considering its promising roadmap.

Hummingbird GitHub example
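
In text form, a conversion could look roughly like the sketch below, assuming hummingbird-ml, scikit-learn, NumPy and PyTorch are installed. The random data, the model settings and the function name demo are our own assumptions for illustration.

```python
def demo(n_samples=200, n_features=10):
    """Train a small random forest and compare it with its PyTorch version."""
    # Imported here so the sketch loads even without the libraries installed.
    import numpy as np
    from hummingbird.ml import convert
    from sklearn.ensemble import RandomForestClassifier

    # Random stand-in data; any numeric feature matrix would do.
    rng = np.random.default_rng(0)
    X = rng.random((n_samples, n_features)).astype(np.float32)
    y = rng.integers(2, size=n_samples)

    skl_model = RandomForestClassifier(n_estimators=10, max_depth=5).fit(X, y)

    # The conversion itself is a single call.
    hb_model = convert(skl_model, "pytorch")
    # hb_model.to("cuda")  # optional: move the model to a GPU if available

    # Fraction of samples on which both models agree.
    return float(np.mean(skl_model.predict(X) == hb_model.predict(X)))


# Usage: print(demo())
```

The converted model keeps the familiar predict call, so the surrounding scoring code does not have to change.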

Concluding Remarks

Here we end our tour of tools. We hope you found something useful!
