Data processing libraries in Python

Using the correct libraries for data types

Adam Oudad
Analytics Vidhya
4 min read · May 13, 2020

With most popular libraries and bleeding-edge techniques available, Python is often recommended as a good choice for machine learning projects. Yet the sheer number of libraries can be daunting, and it is difficult to pick a couple to get started with.

Machine learning is founded on data processing, and the performance of your models will depend heavily on your ability to read data and transform it into a format suitable for the task at hand. In this article, we will go over different libraries, grouped by the type of data they can handle.

Tabular data

Tabular data makes up most of what we sometimes call big data. It comes as loosely organized rows, corresponding to samples, and columns, corresponding to features (most of the time at least).
pandas is the library of choice for such data. Originally created for financial data analysis, which relies heavily on operations on series such as moving averages, it has evolved into a full-featured library that handles tabular data well.

Here is an example.

import pandas as pd
df = pd.read_csv('data.csv')  # Load a CSV file into a DataFrame
df.head()  # Returns the first rows of the table
df.describe()  # Summarizes each numeric column with common statistics
df.prices.plot()  # Plots the column "prices" (requires matplotlib)

As you can see, the syntax is quite easy to understand. A part that is sometimes harder to grasp is the way pandas manages indexing and selecting. It comes down to what you define as your index. If you do not specify one, pandas creates a default integer index that starts at 0 and increments at each row. There are then several ways to select a range of values: by label with .loc, by position with .iloc, or with a boolean condition.
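To make indexing and selecting more concrete, here is a minimal sketch. The table and its column names ("date" and "prices") are made up for illustration and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical table: column names and values are illustrative only
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=5, freq="D"),
    "prices": [10.0, 10.5, 11.0, 10.8, 11.2],
})

# With the default integer index (0 to 4):
by_position = df.iloc[0:3]  # position-based slice, end excluded
by_label = df.loc[0:2]      # label-based slice, end included

# Use the "date" column as the index, then slice by date labels
by_date = df.set_index("date")
jan_2_to_4 = by_date.loc["2020-01-02":"2020-01-04"]  # both ends included

# Select rows with a boolean condition
above_eleven = df[df["prices"] > 11.0]

print(jan_2_to_4)
print(above_eleven)
```

Note the difference between the two slices: .iloc follows the usual Python convention of excluding the end position, while .loc includes the end label.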

Text data

It is first important to note that Python comes with quite a lot of powerful built-in text-processing capabilities.

raw_data = "Some text"
processed_data = raw_data.lower()  # Convert to lowercase
processed_data = processed_data.strip()  # Remove leading and trailing whitespace
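The standard library alone can take you a bit further. As a rough sketch, here is a word count that uses only the built-in re and collections modules. The sample sentence is invented, and the regular expression is a naive stand-in for proper tokenization.

```python
import re
from collections import Counter

# Invented sample text, for illustration only
raw_text = "  The quick brown fox jumps over the lazy dog. The dog sleeps.  "

# Normalize: trim surrounding whitespace and lowercase everything
cleaned = raw_text.strip().lower()

# Naive word splitting: keep runs of ASCII letters, drop punctuation
words = re.findall(r"[a-z]+", cleaned)

# Count how often each word appears
word_counts = Counter(words)
print(word_counts.most_common(2))
```

This kind of splitting ignores accents, contractions and numbers, so it is only suitable for quick exploration.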

Yet natural language processing involves many more processing techniques, such as tokenization and lemmatization, which can be done with NLTK.

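import nltk
nltk.download('punkt')  # Tokenizer data; recent NLTK releases may ask for 'punkt_tab' instead
s = "Some text to tokenize."
tokens = nltk.word_tokenize(s)  # Split the string into word tokens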

For more advanced natural language processing, and if you are aiming for optimized pipelines, spaCy is a solid choice.

Audio and musical data

Audio processing is made possible by libraries such as librosa and essentia. These are very popular in the music information retrieval research community, for example.

For symbolic music, that is when working with MIDI for example, mido and pretty_midi are good choices. For more advanced needs there is music21, a powerful library aimed mainly at musicological analysis. It offers a wide range of abstractions, for example representing a score with objects such as Stream, Part and Measure. It also has a straightforward syntax.

from music21 import converter
from music21 import note, chord
midi = converter.parse('file.mid')
# Let's print out the notes and chords of the MIDI file
# (flatten() needs music21 v7 or later; older versions use the .flat property)
for element in midi.flatten():
    if isinstance(element, note.Note):
        print("We have the note {} of duration {} at offset time {}".format(
            element.pitch,
            element.quarterLength,
            element.offset
        ))
    elif isinstance(element, chord.Chord):
        print("We have the chord {} of duration {} at offset time {}".format(
            element.pitchedCommonName,
            element.quarterLength,
            element.offset
        ))
# music21 has some nice display functions
midi.show("text")  # Print out the whole MIDI file with indentation reflecting hierarchy
midi.measures(5, 10).show("midi")  # Play back measures 5 to 10 as MIDI
midi.measures(5, 10).plot("pianoroll")  # Display the piano roll of measures 5 to 10

Images

Pillow is the standard Python library for handling images. It can do much of what an image editor would do.

from PIL import Image, ImageFilter
im = Image.open('/home/adam/Pictures/test.png')
new_im = im.rotate(90).filter(ImageFilter.GaussianBlur())

scikit-image is also used for image processing, and provides most of the common filters and algorithms. OpenCV is a library aimed at computer vision, and can be used for processing videos or working with data from a camera.
If you work with uncommon or very specific image formats, imageio can provide the image data to your Python script thanks to its wide range of supported formats.

Numerical data

All of the above libraries can read specific data formats. Once the data has been converted into Python objects and data structures, numpy usually comes into play to manipulate the numerical values.

Before getting the deep learning bazookas out, it is recommended to perform some analysis of the data using scikit-learn (sklearn), scipy and/or seaborn.
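As a small illustration, here is a sketch of how plain Python lists can be turned into a numpy array and standardized column by column. The values are hypothetical and simply stand in for data loaded by one of the libraries above.

```python
import numpy as np

# Hypothetical values: 3 samples (rows) with 2 features (columns)
samples = [[1.0, 200.0], [2.0, 220.0], [3.0, 260.0]]
data = np.array(samples)  # shape (3, 2)

# Per-column statistics, computed along the sample axis
col_means = data.mean(axis=0)
col_stds = data.std(axis=0)

# Broadcasting applies the per-column values to every row
standardized = (data - col_means) / col_stds

print(data.shape)
print(standardized)
```

After this step each column has zero mean and unit standard deviation, which is a common preparation before feeding data to a model.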
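What such an analysis looks like depends on the data, but a first pass could resemble the sketch below. It uses scipy only, and the two variables are synthetic, generated here purely for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data: y is roughly linear in x, plus some noise
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200)
y = 2.0 * x + rng.normal(loc=0.0, scale=0.5, size=200)

# Summary statistics: sample count, min/max, mean, variance, skewness, kurtosis
summary = stats.describe(x)

# Pearson correlation between the two variables, with its p-value
r, p_value = stats.pearsonr(x, y)

print(summary)
print("correlation:", r, "p-value:", p_value)
```

A strong correlation like this one suggests that a simple linear model may already go a long way, before anything heavier is considered.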

Conclusion

Once all this data processing and analysis is done, we have enough information about the data to consider which model to choose for our task, and hopefully machine learning techniques can then be used to their full potential.

This article covers the most used libraries for each type of data we may want to process. These libraries are the ones commonly used in machine learning courses, and there is nothing better than building some experience with these tools on practical problems!

Read the article on my website.

Adam Oudad
Analytics Vidhya

(Machine) learning. PhD candidate, Keio University, Japan. I write about machine learning, statistics, computer science and maths.