Public Interface for DataFrame class

GSoC: Implementing DataFrame in Pharo

Data frames (tabular datasets) are probably the most important and powerful tools for data analysis that a programming language can offer. As part of my GSoC project this year I am implementing the data frame functionality in Pharo.

In this article I propose the public interface for a DataFrame class. It is based on the existing data frames in other languages, such as Python (pandas), R, and Wolfram Language, as well as the tools for selecting and indexing collections in Pharo.

It is very important that the public interface of these data structures does not reveal the internal representation. Hiding the implementation is probably the most important principle of object-oriented programming. Test-driven development is a good way of making sure that my knowledge of the internal design will not affect the public interface. I design the interface and write tests for it before all the actual parts of the project are implemented.

In this article I won’t be describing the whole interface of the project. For example, I will not talk about adding, deleting, or updating the data in the existing data frame. My goal here is to design the basic access functionality, as well as the interface for instance creation, that will allow us to write test and to understand what the end product should look like.

In all the examples I will be using the character-deaths table from the Game of Thrones dataset published on Kaggle.

Instance creation

Most of the time we will be creating data frames by reading the data from files in different formats. I think that the easiest way of doing this would be to call a readCsv:, readExcel: etc. method of DataFrame class:

dataframe := DataFrame readCsv: 'GameOfThrones/deaths.csv'.

Also, regardless of the implementation, it would be good to have an option of constructing a data frame from an array of heterogeneous rows or from a dictionary of homogeneous columns mapped to column names. The following two examples should create equal dataframes

dataframe := DataFrame rows:
#(('Peter' 17 true)
('Sam' 21 false)
('Jason' 27 true)).
dataframe := DataFrames columns: {
'Name' -> #('Peter' 'Sam' 'Jason') .
'Age' -> #(17 21 27) .
'Male' -> #(true false true) } asDictionary.

And of course, sometimes we will need to create an empty data frame in order to fill it later

dataframe := DataFrame new.

Name Indexing

I will start by designing the methods than would allow accessing the rows and columns by their names rather than numbers. Being able to do this makes complex queries much more readable.

First of all we need to clarify what do we mean by these names:

  • Each column in a data frame represents an attribute and has a symbolic name associated with it. For example, ‘Name’, ‘Gender’ etc.
  • Rows do not have symbolic names. But they can (and should) have unique identifiers. These are the values of one or many columns in that row. So when we talk about “row names” we actually mean these IDs. For example, in a table of deaths of the Game of Thrones characters, the name of a character, such as ‘Jon Snow’ or ‘Robb Stark’ are (can be) the unique identifiers. The implementation of these key columns will be discussed later.

Option #1

dataframe at: #name.
dataframe at: #('name' 'age').

Problem: How do we index rows and columns at the same time?

Option #2

dataframe rowNamed: 'Jon Snow'.
dataframe columnNamed: 'Name'.
dataframe rowsNamed: #('Jon Snow' 'Eddard Stark').
dataframe columnsNamed: #('Name' 'Allegiances').

Problem: Long and ugly method names make code less readable

Option #3

dataframe row: 'Jon Snow'.
dataframe column: 'Name'.
dataframe rows: #('Jon Snow' 'Eddard Stark').
dataframe columns: #('Name' 'Allegiances').

I think that option #3 is the best one. We refer to the row by its unique identifier and to column by its name. Its much more logical than using numeric indexes. So I think that simple ‘row:’ and ‘column:’ should be associated with names and keys.

These methods return another DataFrame object, so they can be chained:

dataframe
rows: #('Jon Snow' 'Eddard Stark');
columns: #('Allegiances' 'Death Chapter').

Most often only the columns: method will be used, and the rows will be selected with a condition (logical index). For example

dataframe
select: [ :x | x = 'House Stark']
with: #('Allegiances');
columns: #('Name' 'Death Chapter').

Or we can specify the columns first and then apply the condition

dataframe
columns: #('Name' 'Allegiances' 'Death Chapter');
select: [ :x :y | x = 'House Stark' and: y = 1]
with: #('Allegiances' 'Death Chapter').

Numeric Indexing

Numeric indexing should work the same as the indexing of ordered collections. But because dataset is much more complex and requires more advanced operations than a simple array, numeric indexing also becomes more complex:

  • We might want to get a row or a column from a dataset by its number
  • We might need a whole range of rows or columns (for example, from select all the columns from the third one till the end)
  • We also might want to specify which rows and columns to return (for example, I want rows number 2, 3, and 5)
  • And sometimes it is useful to have negative indexing — it allows us to index elements from the end of our collection (for example, give me the last three columns)
dataframe rowAt: 18.
dataframe columnAt: 1.

Following the conventions introduced by Matrix and PMMatrix, we can access a single cell in a data frame using the at: at: method

dataframe at: 18 at: 1.

Problem: It’s not clear which one is the index of row and which of column

To get a range of rows or columns we can use something similar to the from: to: method

dataframe rowsFrom: 2 to: 32.
dataframe columnsFrom: 3 to: 5.

And we can ask for some specific rows or columns in a specified order by saying something like

dataframe rowsAt: #(1 3 2).
dataframe columnsAt: #(4 5 1).

The four methods described above will return another data frame, so we can chain them to get any slice (we will decide later whether it should be deep or shallow) of our data frame. For example:

dataframe
rowsAt: #(1 2 17 20);
columnsAt: #(2 4 1).
dataframe
columnsAt: #(2 7);
rowsFrom: 1 to: 100.

Problem: What if we want to get one specific column and a range of rows? The question of whether these methods can be chained depends on the implementation of rows and columns. What will be the return type if we ask for a single column? Will that object understand the rowsFrom: to: message? We will answer these questions later.

And to use negative indexes we could implement methods rowAtWrap: and columnAtWrap: that are implemented as an index modulo size, similar to atWrap: in other Smalltalk collections

dataframe rowAtWrap: 0. "Returns the last row"
dataframe columnAtWrap: -10. "The 11th column from the end"

But I think that such queries would not be readable. Especially if we combine negative indexes with all the previous methods (for example ranges). So another approach would be to make this wrap implicitly

dataframe at: 2 at: -3.
dataframe rowsAt: #(3 -1 2) columnsAt: #(-2 5).
dataframe rowsFrom: 5 to: 0. "Till the end"

Problem: It may be slower to check the sign or to apply the modulo every time we request an item by its index. But this is not the time to think about the performance issues. It looks good, so we should keep it. And we can optimize it later.

Logical Indexing

This is probably the most powerful way of selecting the data. It allows us to specify the conditions on which the rows will be included into a slice based on the column values.

What do we have in Smalltalk?

Smalltalk provides a way for selecting or rejecting the values of any collection, similar to the SQL queries (or maybe SQL is similar to Smalltalk)

array := #(7 6 5 4 3 2 1).
array select: [ :x | x > 5 ]. "#(7 6)"
array reject: [ :x | x > 5 ]. "#(5 4 3 2 1)"
array detect: [ :x | x > 3 and: x < 5 ]. "4"
array select: [ :x | x > 10 ]. "#()"
array detect: [ :x | x > 10 ] ifNone: [ NotFoundError signal ].

Problem 1: In case of a data frame it’s not very clear what x is.
Problem 2:
We may want to get the values of one column based on the conditions applied to the other one. In other words, these queries must be applied to the whole data frame.

Option #1

Regardless of the actual implementation, logical queries can treat the data frame as a collection of rows. This way we can use select:, reject:, and detect: in a way we are used to, and assume that x is a single row which can be indexed with column names

dataframe select: [ :x |
(x at: 'Allegiances') = 'House Stark' and:
(x at: 'Death Chapter') = 1 ].

Problem 1: Some more complex queries can become incredibly long 
Problem 2: It is a bit redundant to write (x at: …), (x at: …) every single time. Because x is the only thing we are indexing, and the only way we are indexing it is by columns.

Option #2

One possible solution to the Problem 2 of a previous option would be to pass columns that we want to apply conditions to as parameters of the block closure. We could implement a method that will accept a block closure and an array of column names. Then it would get the actual columns by these names and evaluate the block at these columns. So our query could look like

dataframe
select: [ :x :y | x = 'House Stark' and: y = 1 ]
with: #('Allegiances' 'Death Chapter')

Or it could be something more readable

dataframe
select: [ :allegiances :chapter |
allegiances = 'House Stark' and:
chapter = 1 ]
with: #('Allegiances' 'Death Chapter').

Compared to the previous option, this approach is much better if we are dealing with long logical queries. For example, let’s look at all the noble male Lannisters, who have died in the first or the last (fifth) book

dataframe
select: [ :allegiances :nobility :gender :book |
(allegiances = 'House Lannister' or:
allegiances = 'Lannister') and:
nobility = true and:
gender = true and:
(book = 1 or: book = 5) ]
with: #('Allegiances' 'Nobility' 'Gender' 'Book of Death').

Now imagine writing this query with option #1.

Conclusion: what I propose

"Creating a data frame from an array of rows"
dataframe := DataFrame rows:
#(('Robb Stark' 3 true)
('Jon Snow' 'NaN' true)
('Eddard Stark' 1 true)).
"Creating a data frame from a dictionary of columns mapped to column names"
dataframe := DataFrames columns: {
'Name' -> { 'Robb Stark' . 'Jon Snow' . 'Eddard Stark' } .
'Book of Death' -> { 3 . Float nan . 1 } .
'Nobility' -> { true . true . true } } asDictionary.
"Reading the data frame from a CSV file"
dataframe := DataFrame readCsv: 'GameOfThrones/deaths.csv'.
"Getting and setting the column names"
dataframe columnNames.
dataframe columnNames: #('Name 1' ... 'Name N').
"Getting a single column by its name"
dataframe row: 'Jon Snow'.
dataframe column: 'Name'.
"Getting multiple rows or columns by their names"
dataframe rows: #('Jon Snow' 'Eddard Stark').
dataframe columns: #('Name' 'Allegiances' 'Nobility').
"These methods return data frames, so we can chain them"
dataframe
rows: #('Jon Snow' 'Eddard Stark');
columns: #('Name' 'Allegiances' 'Nobility').
"Getting a single row or column by a numeric index"
dataframe rowAt: 20.
dataframe columnAt: 1.
"A value of a cell specified by row and column indexes"
dataframe at: 7 at: 2.
"Index ranges"
dataframe rowsFrom: 1 to: 10.
dataframe columnsFrom: 2 to: 5.
"Specific rows or columns from the list"
dataframe rowsAt: #(23 42 17).
dataframe columnsAt: #(2 4 1 3).
"Chaining these methods"
dataframe
columnsAt: #(2 4 1 3);
rowsFrom: 1 to: 100.
"Chaining numeric and name indexes"
dataframe
columns: #('Name' 'Allegiances');
rowsAt: #(10 20 30 40).
"A simple logical query"
dataframe select: [ :x :y |
x = 'House Stark' and:
y = 1
] with: #('Allegiances' 'Death Chapter').
"Chaining the name indexes and logical queries. Asking for the names and death chapters of everyone from House Stark"
dataframe
select: [ :x | x = 'House Stark']
with: #('Allegiances');
columns: #('Name' 'Death Chapter').

Literature

  1. Adele Goldberg and David Robson. Smalltalk-80. The Language and its Implementation. Addison-Wesley, 1983 (The Blue Book)
  2. Updated Pharo by Example. Collections Chapter
  3. R Tutorial. Data Frame Row Slice
  4. pandas 0.20.1 documentation: Intro to Data Structures
  5. pandas 0.20.1 documentation: pandas.DataFrame
  6. pandas 0.20.1 documentation: Indexing and Selecting Data
  7. Wolfram Language documentation: Dataset