Collections in Python

To help Java programmers achieve a “Pythonic” state of mind

Rajaram Gurumurthi
The Startup
8 min read · Nov 18, 2019


Leonardo da Vinci [Public domain], via Wikimedia Commons - Image transformed to grayscale using NumPy (gist)

This article is intended to help Java programmers who, on their path to machine-learning glory, must first ease into Python.

We’ll only cover the very basic collection types and their operations. References include more comprehensive tutorials and documentation.

Tuple and List

Tuple

A tuple is an immutable, heterogeneous sequence of values. This is a very useful data structure that has no direct built-in equivalent in Java.

A tuple is heterogeneous because it can hold items of any type: numbers, strings, objects, other tuples, lists, and so on. It is a sequence because it is indexable and iterable.

Items cannot be modified in, added to, or removed from a tuple. However, tuples can be sliced and concatenated to form new tuples.

The most common usage of tuples is to return multiple values from a function.
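
The properties described above can be sketched as follows. This is a minimal illustration rather than the article's original listing; the names `point` and `min_max` are made up for the example.

```python
# A tuple can mix types: here a string, an int, and a nested tuple.
point = ("origin", 0, (0.0, 0.0))

# Indexing and iteration work as for any sequence.
label = point[0]
types = [type(item).__name__ for item in point]

# Tuples are immutable: item assignment raises TypeError.
try:
    point[1] = 1
except TypeError:
    modified = False
else:
    modified = True

# Slicing and concatenation build new tuples.
extended = point[:2] + ("extra",)


def min_max(values):
    """Return two values at once by packing them into a tuple."""
    return min(values), max(values)


# The caller unpacks the returned tuple into separate names.
lowest, highest = min_max([3, 1, 4, 1, 5])
```

Unpacking on the left-hand side is what makes tuple returns feel natural: no wrapper class is needed, as it typically would be in Java.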

List

A list is also a heterogeneous sequence of values, but unlike a tuple it is mutable. Values in a list can be indexed, iterated, and modified.

Even though lists can be heterogeneous, in practice they are used to store similar items. Next, we look at how some common list operations in Java (8) can be performed in Python.
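
A short sketch of list mutability, using made-up values:

```python
# A list may mix types, although that is rare in practice.
mixed = [1, "two", 3.0]

# Lists support indexing, iteration, and in-place modification.
numbers = [10, 20, 30]
numbers[0] = 5        # replace an item by index
numbers.append(40)    # add an item to the end
numbers.remove(20)    # remove the first matching value
total = sum(n for n in numbers)
```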

List Operations

Query/filter
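
One possible way to filter a list, sketched with an illustrative `words` list. A list comprehension plays roughly the role of `stream().filter(...)` followed by `collect(...)` in Java.

```python
words = ["tuple", "list", "dict", "set", "array"]

# List comprehension: keep only the items that satisfy the condition.
short_words = [w for w in words if len(w) <= 4]

# The built-in filter() returns a lazy iterator, so wrap it in list().
long_words = list(filter(lambda w: len(w) > 4, words))
```

The comprehension is usually considered the more idiomatic of the two forms.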

Test for membership
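
Membership tests use the `in` operator, which is comparable to `List.contains` in Java. The built-ins `any` and `all` cover the cases that `anyMatch` and `allMatch` handle in Java streams. The data below is illustrative.

```python
words = ["tuple", "list", "dict"]

# 'in' checks whether a value is present in the list.
has_list = "list" in words
has_map = "map" in words

# any()/all() test a condition across all items.
any_long = any(len(w) > 4 for w in words)
all_lower = all(w.islower() for w in words)
```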

Transform
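
Transforming each item, which is what `stream().map(...)` does in Java, can be sketched with a list comprehension or the built-in `map`. The sample data is made up.

```python
words = ["tuple", "list", "dict"]

# Apply an expression to every item to build a new list.
upper_words = [w.upper() for w in words]

# map() applies a function lazily; list() materializes the result.
lengths = list(map(len, words))
```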

Flatten

This involves creating a new list by extracting all the items that are nested inside the objects of another list.

In Python, one way to do this is with reduce from the standard-library functools module. Other approaches, such as a nested list comprehension or itertools.chain, work as well.
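
A sketch of flattening with `functools.reduce`, shown next to two common alternatives. The `orders` data is hypothetical and only serves the example.

```python
from functools import reduce
from itertools import chain

# Hypothetical data: each order holds a nested list of items.
orders = [
    {"id": 1, "items": ["pen", "ink"]},
    {"id": 2, "items": ["paper"]},
    {"id": 3, "items": []},
]

# functools.reduce concatenates the nested lists one by one.
items_reduce = reduce(lambda acc, order: acc + order["items"], orders, [])

# A nested list comprehension does the same without an import.
items_comprehension = [item for order in orders for item in order["items"]]

# itertools.chain.from_iterable avoids building intermediate lists.
items_chain = list(chain.from_iterable(order["items"] for order in orders))
```

All three produce the same flat list. The `reduce` version creates a new list at every step, so the other two tend to scale better on large inputs.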

Sort
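
A minimal sketch of list sorting. The `key` argument plays a role similar to a `Comparator` in Java; the sample data is illustrative.

```python
words = ["list", "tuple", "set", "dict"]

# sorted() returns a new list and leaves the original untouched.
alphabetical = sorted(words)

# key= selects what to compare; reverse=True sorts in descending order.
# The sort is stable, so "list" stays ahead of "dict" (both length 4).
by_length_desc = sorted(words, key=len, reverse=True)

# A tuple key sorts by several criteria: length first, then alphabetically.
by_length_then_alpha = sorted(words, key=lambda w: (len(w), w))

# list.sort() sorts in place and returns None.
in_place = list(words)
result = in_place.sort()
```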

Dictionary

A dictionary is a container for key-value pairs and is similar to the Map data structure in Java.

The main difference is that Java maps are statically typed through generics, whereas in Python, dictionary keys and values can be heterogeneous (but the keys must still be unique and hashable).
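
A small sketch of a dictionary with mixed key and value types. The `record` contents are made up. One assumption worth stating: keys must be hashable, so mutable objects such as lists cannot serve as keys.

```python
# Keys of different types can coexist in one dictionary.
record = {"name": "Ada", 1: "one", (0, 0): "origin"}

# Keys are unique: assigning to an existing key overwrites its value.
record["name"] = "Grace"

# A list is mutable, hence unhashable, and is rejected as a key.
try:
    record[[1, 2]] = "x"
except TypeError:
    list_key_allowed = False
else:
    list_key_allowed = True
```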

Create a dictionary from a list
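
One way to build a dictionary from a list is a dictionary comprehension, which is comparable to `Collectors.toMap` in Java. The lists below are illustrative.

```python
words = ["tuple", "list", "dict"]

# Dictionary comprehension: map each word to its length.
length_by_word = {w: len(w) for w in words}

# enumerate() supplies positions, giving a word-to-index lookup.
index_by_word = {w: i for i, w in enumerate(words)}

# zip() pairs two parallel lists; dict() turns the pairs into a dictionary.
keys = ["a", "b"]
values = [1, 2]
paired = dict(zip(keys, values))
```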

Iterate using keys
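
Iterating over a dictionary yields its keys, in insertion order on Python 3.7 and later. A minimal sketch with made-up data:

```python
length_by_word = {"tuple": 5, "list": 4, "dict": 4}

# Looping over the dictionary itself iterates over its keys.
visited = []
for word in length_by_word:
    visited.append((word, length_by_word[word]))

# items() yields (key, value) pairs directly, like Map.entrySet() in Java.
pairs = list(length_by_word.items())

# sorted() over a dictionary returns its keys in sorted order.
sorted_keys = sorted(length_by_word)
```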

Having transitioned to a “Pythonic” state of mind, we can now lose our Java crutches and look at some advanced data structures and libraries widely used in machine learning.

Array

NumPy is a popular library for working with scientific and engineering data. Here, we highlight the array manipulation capabilities offered by NumPy.

A NumPy array is an N-dimensional grid of homogeneous values. It can be used to store a single value (scalar), the coordinates of a point in N-dimensional space (vector), a 2D matrix that represents a linear transformation of a vector, or even N-dimensional matrices (not tensors though).
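
A sketch of creating arrays of different dimensions, assuming NumPy is installed and imported under its conventional alias `np`. The values are arbitrary.

```python
import numpy as np

scalar = np.array(5)                  # 0-dimensional array
vector = np.array([1.0, 2.0, 3.0])    # 1-dimensional, shape (3,)
matrix = np.array([[1, 2], [3, 4]])   # 2-dimensional, shape (2, 2)
cube = np.zeros((2, 3, 4))            # 3-dimensional array of zeros

# Arrays are homogeneous: mixing ints and floats upcasts to one dtype.
mixed = np.array([1, 2.5])
```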

Now, let us look at some of the frequently used array operations.

Query/Filter/Mask
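
Indexing, slicing, and boolean masks can be sketched as below. The `data` array is illustrative.

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Query: index by (row, column); slices select whole rows or columns.
element = data[1, 2]
first_column = data[:, 0]

# Filter: a comparison yields a boolean mask of the same shape.
mask = data % 2 == 0
evens = data[mask]            # 1-D array of the selected values

# Mask: np.where keeps matching values and replaces the rest with 0.
masked = np.where(mask, data, 0)
```

Note that indexing with a boolean mask always returns a flattened 1-D result, whereas `np.where` preserves the original shape.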

Reshape

Reshaping simply rearranges the existing items in an array into a new shape.
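
A minimal sketch of reshaping, with a transpose included for contrast. The array contents are arbitrary.

```python
import numpy as np

flat = np.arange(6)            # [0, 1, 2, 3, 4, 5]

# reshape keeps the items in order and only changes the shape.
grid = flat.reshape(2, 3)

# -1 asks NumPy to infer that dimension from the total size.
tall = grid.reshape(-1, 2)

# ravel flattens back to one dimension.
back = grid.ravel()

# Transposing is different: it swaps the axes, so the item order changes.
transposed = grid.T
```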

Transform

All the power of NumPy comes from its ability to efficiently transform large arrays of data for scientific and engineering computations. This is really a vast topic and we will only touch upon a few key transformations here.
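
A few representative transformations, sketched on a small made-up matrix. The selection here is ours and is far from exhaustive.

```python
import numpy as np

matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# Element-wise arithmetic applies to every item without an explicit loop.
doubled = matrix * 2

# Broadcasting stretches the 1-D array across each row of the matrix.
shifted = matrix + np.array([10.0, 20.0])

# Universal functions such as np.sqrt also work element-wise.
roots = np.sqrt(matrix)

# The @ operator performs matrix multiplication.
vector = np.array([1.0, 1.0])
product = matrix @ vector

# Reductions collapse an axis: axis=0 sums down each column.
column_sums = matrix.sum(axis=0)
```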

Sort

Sorting is a bit tricky: sorting a NumPy array does not behave the same way as sorting a Python list, especially when the array has more than one dimension.

Sorting a NumPy array of vectors
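
One way to sort rows as whole vectors is to compute an ordering with `np.argsort` and then index with it. The sketch below contrasts this with `np.sort`, which sorts along a single axis and so rearranges the values inside each vector. The `vectors` data is illustrative.

```python
import numpy as np

vectors = np.array([[3, 1],
                    [1, 2],
                    [2, 0]])

# np.sort works along the last axis by default: each row is sorted
# on its own, which scrambles the vectors.
per_row = np.sort(vectors)

# axis=0 sorts each column independently, which also breaks up rows.
per_column = np.sort(vectors, axis=0)

# To keep rows intact, get the row order from one column with argsort.
order = np.argsort(vectors[:, 0])
by_first = vectors[order]

# The same idea works for a derived key, such as the Euclidean norm.
norms = np.linalg.norm(vectors, axis=1)
by_norm = vectors[np.argsort(norms)]
```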

DataFrame

The pandas library provides functionality to manipulate tabular data.

DataFrames are used in machine learning to load, analyze, process, and feed the input data sets into the model, and then format the fitted and predicted output for presentation.

Similar to spreadsheets and SQL tables, a pandas DataFrame is a 2D structure with named/indexed columns and rows.
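
A sketch of building a small DataFrame from a dictionary of columns, assuming pandas is installed and imported as `pd`. The column names and values are made up for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; index= supplies the row labels.
df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],
        "language": ["Python", "Java", "C"],
        "years": [5, 12, 20],
    },
    index=["a", "b", "c"],
)

shape = df.shape              # (rows, columns)
columns = list(df.columns)    # column labels
row_labels = list(df.index)   # row labels
```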

Query
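
Selecting columns and filtering rows can be sketched as follows. The DataFrame contents are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "language": ["Python", "Java", "Python"],
    "years": [5, 12, 20],
})

# Select a single column (a Series) or a row by position.
names = df["name"]
first_row = df.iloc[0]

# Filter rows with a boolean mask, much like a SQL WHERE clause.
pythonistas = df[df["language"] == "Python"]

# Combine conditions with & and |, wrapping each in parentheses.
senior = df[(df["language"] == "Python") & (df["years"] > 10)]

# query() accepts the condition as a string expression.
experienced = df.query("years > 10")
```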

Transform
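
A few common column transformations, sketched on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["ada", "grace"], "years": [5, 12]})

# Arithmetic on a column is vectorized; assignment adds a new column.
df["months"] = df["years"] * 12

# The .str accessor applies string methods to every value.
df["name"] = df["name"].str.title()

# apply() runs an arbitrary function on each value of a column.
df["level"] = df["years"].apply(lambda y: "senior" if y >= 10 else "junior")

# rename() returns a new DataFrame and leaves df unchanged.
renamed = df.rename(columns={"years": "experience_years"})
```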

Sort
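
Sorting by one or more columns can be sketched with `sort_values`. The data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "language": ["Python", "Java", "Python"],
    "years": [5, 12, 20],
})

# Sort by a single column in descending order.
by_years = df.sort_values("years", ascending=False)

# Sort by several columns, each with its own direction.
by_language_then_years = df.sort_values(
    ["language", "years"], ascending=[True, False]
)

# sort_index() orders rows by their index labels again.
restored = by_years.sort_index()
```

Like `sorted()` on lists, `sort_values` returns a new DataFrame rather than modifying the original.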

Aggregate
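
Aggregation, comparable to SQL's GROUP BY, can be sketched with `groupby`. The data and the output column names `people` and `max_years` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "language": ["Python", "Java", "Python"],
    "years": [5, 12, 20],
})

# Aggregate a whole column to a single value.
mean_years = df["years"].mean()

# Group rows by a column, then aggregate within each group.
total_by_language = df.groupby("language")["years"].sum()

# Named aggregation computes several statistics in one pass.
summary = df.groupby("language").agg(
    people=("name", "count"),
    max_years=("years", "max"),
)
```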

Summary

Below is a summary of basic Python collections and various techniques available to manipulate them:
