Reflective Series or The Power of doesNotUnderstand

GSoC: Implementing DataFrame in Pharo

Oleksandr Zaitsev
Aug 23, 2017

Reflection is the ability of a computer program to examine, introspect, and modify its own structure and behavior at runtime.

In this post I will explain how Pharo allows us to create a powerful DataSeries that reflects the behavior of its elements, and demonstrate how one simple reflective message can extend the functionality of DataSeries and make it suitable for working with any type of data you put inside it.

Problem with data types

DataSeries is a one-dimensional container for data of a certain type which allows us to manipulate this data easily and efficiently. The functionality of series is defined by the type of data stored inside it. If all the values are numeric, we can multiply this series by 0.01, find the average value, or calculate the square root of all the values. If the values are strings, all these messages should signal an error. However, series of strings should support the common string operations such as find, replace, asUppercase etc. What about time series? Boolean series? How about user-defined types?

One possible solution would be to create subclasses of DataSeries, such as DataSeriesNumeric or DataSeriesText, and implement all the desired behavior in these subclasses. There are two major problems with this approach:

  1. We might need to write a lot of classes
  2. And a lot of methods

Every string operation, arithmetic operation, or statistical function that we want our series to understand must be implemented in one of the subclasses of DataSeries. What’s worse, if we want to go beyond the 5–10 standard data types and create a series with objects of a different class (for example, graphical elements), we might need to define a new subclass of DataSeries for this type of data.

In the next section I propose a simple solution that allows us to create a single DataSeries class capable of working with any type of data.

The power of doesNotUnderstand

When an object receives a message, the method lookup starts in the method dictionary of the object's class. If no corresponding method exists there, the lookup continues up the class hierarchy until it reaches Object. If still no method is found for that message, the object is sent the message doesNotUnderstand: with a Message object, which holds the selector and the arguments of the original message, as its argument. The lookup then starts all over again for doesNotUnderstand:. Unless some class along the way overrides it, Object>>doesNotUnderstand: is found; it signals a MessageNotUnderstood exception, and the debugger is launched if nothing handles it.
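
The fallback described above can be observed in a Pharo Playground. This is only a minimal sketch: foo is an arbitrary selector chosen because integers do not implement it, and the snippet assumes that the default doesNotUnderstand: signals MessageNotUnderstood and that the exception exposes the original message through its message accessor.

```smalltalk
| selector |
"3 foo finds no method named foo anywhere in the hierarchy of SmallInteger,
 so the lookup falls back to doesNotUnderstand:. We catch the resulting
 exception instead of letting the debugger open."
selector := [ 3 foo ]
    on: MessageNotUnderstood
    do: [ :error | error message selector ].
selector "#foo"
```

The handler receives the same information that a custom doesNotUnderstand: gets as its argument: the selector and the arguments of the message that could not be answered.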

A simple trick is to reimplement the doesNotUnderstand: method in such a way that every time a DataSeries receives a message it doesn’t understand, before any exception is signaled, it sends this message to all its elements. This way we can communicate with DataSeries using the same interface as the data it contains.

doesNotUnderstand: aMessage
    ^ self collect: [ :each |
        each
            perform: aMessage selector
            withArguments: aMessage arguments ].

We are not inventing anything new here. This is just an implicit shortcut for the collect: message, which makes data manipulation much easier and queries far more readable. Just look at the following example. We have a series of booleans and want to invert them:

isMale := DataSeries fromArray: #(true false false true).

Inverting this series with collect: is easy, but not very readable. Imagine having this as a part of some complex query

isFemale := isMale collect: [ :each | each not ].

After reimplementing the doesNotUnderstand: message, the same inversion can be done like this:

isFemale := isMale not.

Looks better, doesn’t it? Being an instance of DataSeries, isMale does not understand the not message, so it sends itself the doesNotUnderstand: message. The implementation of this message is first found in DataSeries: it sends the message not to all its values (which are booleans) and collects the results.
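
To see what this forwarding amounts to for the not message, here is a small sketch that performs the same work by hand on a plain Array, so it runs without DataFrame loaded. The Message object is built manually here only for illustration; in the real case the system creates it and passes it to doesNotUnderstand:.

```smalltalk
| message flags inverted |
"Build the kind of Message object that doesNotUnderstand: would receive for not."
message := Message selector: #not arguments: #().
flags := #(true false false true).
"Forward the message to every element and collect the answers."
inverted := flags collect: [ :each |
    each perform: message selector withArguments: message arguments ].
inverted "#(false true true false)"
```

Because only the selector and the arguments are used, the same forwarding works for unary, binary, and keyword messages alike.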

In the following sections I demonstrate how this new reflective feature of DataSeries can be used for working with different types of data.

Example: String and Number

This example will show you how the reflection described above allows DataSeries to reflect the behavior of its values. Imagine this situation:

The organizers of ESUG have a list of people registered for the conference. They want to grant everyone on this list premium access to the website of a prestigious scientific journal. However, the website allows access only to users with edu email addresses. Having a table with two columns, participant name and email address, we need to add a third column specifying whether or not the email is in the edu domain. Based on this column we will either grant participants premium access or send them emails asking for an edu address.

Here is an example of such a table of participants:

   |  name               email
---+------------------------------------------
1  |  Oleksandr Zaytsev  oleks@ucu.edu.ua
2  |  John Doe           john.doe@gmail.com
3  |  Jane Doe           jane.doe@harvard.edu

Let’s store it inside a data frame by passing the values as an array of rows

participants := DataFrame fromRows: #(
('Oleksandr Zaytsev' 'oleks@ucu.edu.ua')
('John Doe' 'john.doe@gmail.com')
('Jane Doe' 'jane.doe@harvard.edu')).
participants columnNames: #(name email).

We want to grant premium access to Oleksandr and Jane, and ask John to send us his edu address. It can be done with the following query:

participants column: #isEduEmail put:
(((participants column: #email) findString: 'edu') > 0)

Let’s break it down into smaller steps.

Step 1. We ask our data frame for the #email column and get a DataSeries object. Every value inside this series is a ByteString.

emails := participants column: #email.

Step 2. We ask this series to find a string 'edu' in every email address.

eduIndex := emails findString: 'edu'.

DataSeries does not understand the findString: message, so it sends this message to all its elements. We get a series of integers with an index at which the 'edu' substring starts in the corresponding email address, or 0, if it wasn’t found.

Step 3. Now we need to create a series of booleans, having true for each element that’s greater than 0 (email contains 'edu') and false for elements equal to 0 ('edu' was not found).

isEduEmail := eduIndex > 0.

Once again, DataSeries does not understand the > operator, so it is sent to every element. All the elements are integers, which means that all of them can be compared to 0.

Step 4. Now we just have to add this series to our data frame as a new #isEduEmail column

participants column: #isEduEmail put: isEduEmail.
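
For comparison, steps 2 and 3 can be reproduced with explicit collect: calls on a plain Array holding the same three addresses. This is only a sketch of what the forwarding does on our behalf; it uses no DataFrame classes, so the results are Arrays rather than DataSeries.

```smalltalk
| emails eduIndex isEduEmail |
emails := #('oleks@ucu.edu.ua' 'john.doe@gmail.com' 'jane.doe@harvard.edu').
"Step 2 by hand: index of the first occurrence of 'edu', or 0 if it is absent."
eduIndex := emails collect: [ :each | each findString: 'edu' ].
"Step 3 by hand: compare every index with 0."
isEduEmail := eduIndex collect: [ :each | each > 0 ].
isEduEmail "#(true false true)"
```

Here eduIndex is #(11 0 18): the substring starts at position 11 in the first address, is missing from the second, and starts at position 18 in the third.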

Exercises

  1. Try converting a series of string values to uppercase.
  2. Try adding two series of numbers.

Example: Date and Time

As shown before, DataSeries reflects the behavior of its contents. This example demonstrates how this feature allows us to manipulate Date and Time values.

Let’s create a DataSeries with three dates.

threeDates := DataSeries fromArray:
{ Date yesterday . Date today . Date tomorrow }.

Today this series looks like this (of course, your result might be different):

   |       (unnamed)
---+----------------
1 | 22 August 2017
2 | 23 August 2017
3 | 24 August 2017

Now, if we want to convert each element of this series to DateAndTime, we can send the asDateAndTime message directly to the threeDates object. Since it doesn’t understand that message, the doesNotUnderstand: method of DataSeries is executed and the asDateAndTime message is sent to each element of the series (objects of the Date class). The results are collected into a new series.

threeDates asDateAndTime.

The new series contains DateAndTime objects with the time set to the default value provided by the Time class.

   |                  (unnamed) 
---+---------------------------
1 | 2017-08-22T00:00:00+03:00
2 | 2017-08-23T00:00:00+03:00
3 | 2017-08-24T00:00:00+03:00

How about converting all dates to strings using asString? There is a little problem with the asString message: DataSeries does understand it. We might want to write something like this:

threeDates asString.

However, the asString message is implemented in Object, and therefore every object in the system understands it, including DataSeries. This means that doesNotUnderstand: will not be invoked, and thus the message will be applied not to each date separately but to the series as a whole. The result of that line of code would be a string generated by the printOn: method:

'   |       (unnamed)
---+----------------
1 | 22 August 2017
2 | 23 August 2017
3 | 24 August 2017'

In this case we must use the collect: message explicitly

threeDates collect: [ :each | each asString ].

The result will be a new DataSeries with every Date converted to a ByteString:

   |         (unnamed)
---+------------------
1 | '22 August 2017'
2 | '23 August 2017'
3 | '24 August 2017'
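
One way to predict whether a message would be forwarded to the elements is to ask a receiver whether it already responds to the selector. The sketch below uses a plain Object as a stand-in for any receiver that defines nothing of its own; a DataSeries additionally answers the messages of its own class and superclasses, so treat this as a rough check rather than a description of the DataSeries implementation.

```smalltalk
| date plain |
date := Date today.
plain := Object new.
"Dates implement asDateAndTime, so forwarding it to the elements works."
date respondsTo: #asDateAndTime. "true"
"asString is inherited from Object: the receiver answers it itself,
 and doesNotUnderstand: is never reached."
plain respondsTo: #asString. "true"
"findString: is not implemented in Object, so a receiver without its own
 implementation ends up in doesNotUnderstand:."
plain respondsTo: #findString:. "false"
```

When the check answers true for the series itself, an explicit collect: is the way to reach the elements.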

Literature

  1. Stephane Ducasse, Dimitris Chloupis, Nicolai Hess, Dmitri Zagidulin. Pharo by Example. June 11, 2017
