Instruments for a Data Scientist Toolbox
The number of everyday tools for data science keeps growing, covering an ever-increasing spectrum of tasks. We at Swedbank use part of our daily scrum meetings to introduce and discuss some of these new tools. Below, we present a selection of recently “discovered” flavors.
In our team at Swedbank we work according to agile principles, and during the daily stand-up sessions we have introduced something we call “tech tips”. Essentially, each team member can share technical tools or information that might be useful for the whole group. Be it a blog post about developments in deep learning, an efficient library for natural language pre-processing, a condensed summary of a video lecture or anything else: all contributions are welcome, in the format of a three-minute presentation.
Below, help yourself to a selection of tech tips (geared towards tools) that we have gathered over the past months.
Useful Python Tools
Pattern matching with pampy
At the time of writing, Python does not have a pattern matching statement in its core language. Pattern matching is the act of checking a value against a pattern and performing an action when it matches. Such expressive constructs exist in Scala, Kotlin and a number of other languages. Until PEP 622 (structural pattern matching) lands, libraries like pampy can be used to achieve similar goals. The API is rather self-explanatory and accessible, and easily worth a try.
Derived functions with functools
Python’s functools module enables working with functions and performing operations on them. One useful construct is “partial”, which creates a new function with selected arguments of another function pre-filled. For example, one can derive a function for a specific need from a more general data reading function.
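The data reading case could look roughly like the sketch below. The reader `read_table`, its parameters and the sample data are hypothetical and only serve to show how `functools.partial` pins arguments.

```python
import csv
import io
from functools import partial


def read_table(source, delimiter=",", skip_rows=0):
    """General reader: parse delimited text into a list of rows."""
    rows = list(csv.reader(source, delimiter=delimiter))
    return rows[skip_rows:]


# Derived reader for semicolon-separated exports that carry one header row.
read_export = partial(read_table, delimiter=";", skip_rows=1)

sample = io.StringIO("id;amount\n1;100\n2;250\n")
rows = read_export(sample)
print(rows)
```

The derived function still accepts keyword overrides, so `read_export(source, skip_rows=0)` would keep the header row.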
The library offers other valuable functionality worth studying, such as “lru_cache”, the details of which are left to the curious reader!
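For the curious, a minimal sketch of `lru_cache`: it memoizes return values per argument, so repeated calls with the same input skip the function body. The function `slow_square` and the call counter are our own illustration.

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=128)
def slow_square(x):
    """Stand-in for an expensive computation; counts how often the body runs."""
    global call_count
    call_count += 1
    return x * x


results = [slow_square(n) for n in [3, 4, 3, 3, 4]]
info = slow_square.cache_info()
print(results, call_count, info)
```

Only two of the five calls execute the function body; the other three are served from the cache, which `cache_info()` reports as hits.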
PyAutoGUI for GUI based automation
Another tool is PyAutoGUI, which is one of the many tools for GUI based automation. It allows for programmatically controlling the mouse and keyboard, interacting with other applications, displaying messages and much more. This can be considered part of a robotic process automation (RPA) toolkit, where some processes are automated by software robots. We have used it in a script to automate email communication with a team member, but it can also be used for file manipulation, filling in forms or JIRA tickets, etc. Basically any repetitive and tedious task in a GUI environment can be automated, and made even more efficient when combined with a dose of machine learning.
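As a rough illustration of the email use case (not our original script), the sketch below types a message into a mail client that is already open and focused. The function name `send_reminder` and all keyboard shortcuts are assumptions that depend on the mail client in use.

```python
def send_reminder(recipient, subject, body):
    """Type and send an email in a mail client that is already open and focused."""
    # Imported here because importing PyAutoGUI can fail without a graphical session.
    import pyautogui

    pyautogui.FAILSAFE = True  # moving the mouse into a screen corner aborts the run
    pyautogui.PAUSE = 0.5      # wait half a second after every PyAutoGUI call

    pyautogui.hotkey("ctrl", "n")              # assumed shortcut: new message
    pyautogui.write(recipient, interval=0.05)  # type the address key by key
    pyautogui.press("tab")                     # assumed: tab moves to the subject field
    pyautogui.write(subject, interval=0.05)
    pyautogui.press("tab")                     # assumed: tab moves to the message body
    pyautogui.write(body, interval=0.05)
    pyautogui.hotkey("ctrl", "enter")          # assumed shortcut: send
```

Calling `send_reminder("colleague@example.com", "Tech tip", "Your turn tomorrow!")` then drives the keyboard exactly as a person would, so it is wise to try it on a draft first.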
IDE- and Presentation-Related Tools
VS Code extensions for productivity
Microsoft Visual Studio Code (VS Code) is one of the leading source code editors and can, together with its extensions, provide a solid efficiency boost for developers. Besides the large number of extensions that add useful Python support, there are a number of tools worth highlighting:
- autodocstring is a tool that generates docstrings for Python functions in different formats (like Google or NumPy).
- code spell checker is a tool that easily spots common spelling errors.
- bookmarks is an effective tool for marking and jumping between important points in the code.
- jumpy is similar to the tool above and helps move the cursor quickly to the desired position.
- markdown preview enhanced enables real-time preview of documents when working with markdown files.
Carbon
Carbon is a tool for creating and sharing beautiful images of your source code. Select a theme and programming language, paste your source code and export the result for your presentation. As an example, we have used it to create a meetup advertisement; try it out here.
Machine Learning-Related Tools
Augmentor
Training data is scarce, especially labeled data. One way to expand a limited dataset is data augmentation. Augmentor is a very helpful tool for this in the image processing domain. It generates artificial data from existing data with the help of an augmentation pipeline, in which a series of operations is performed on the images.
In the example below, the first image is the original and the others are samples from a pipeline that uses probabilistic rotation and zoom within certain bounds. The augmented images serve as input to a CNN-like model for detecting handwritten characters, where they help combat overfitting.
Hummingbird
Hummingbird compiles traditional machine learning models into tensor computations for fast inference/scoring. This tool from Microsoft lets users benefit from hardware accelerators and from the optimizations in deep learning frameworks while still using traditional tools, without having to rewrite their models. At the time of writing, Hummingbird only supports converting a number of tree-based models (built with scikit-learn, XGBoost or LightGBM) to PyTorch, and tests have shown an average speed-up of 65x from scikit-learn to PyTorch. Converting a model is very easy, and it is worth keeping an eye on this library considering the promising roadmap.
Concluding Remarks
Here we end our tour of tools. Hope you found something useful!