Separating the Internal Representation

GSoC: Implementing DataFrame in Pharo

Oleksandr Zaitsev
Jul 24, 2017

In this post I describe changes I made some time ago to loosen the dependency of DataFrame on its internal representation. DataFrame defines a complex API on top of a data structure that determines how the data is stored in memory. Separating this data structure from the rest of DataFrame and introducing a stable interface to the internal representation means we no longer have to rewrite half of that complex API whenever we take a different approach to representing data in memory.

Problem

In my post about the Mess Inside a DataFrame I described several issues that arose when we implemented DataFrame as a subclass of OrderedDictionary. Storing the data inside a data frame is a rather complicated problem. A data frame has to be suitable for working with big datasets, it has to iterate through the observations (rows) very efficiently, and it must not exceed the available memory. Other well-known data frames (like the one in pandas) use a highly optimized collection of blocks that allows users to work with big data and parallelize time-consuming operations. Designing such a data structure is beyond the scope of my GSoC project, though I intend to do it later. At this stage I am storing the data in existing Pharo collections, like Matrix and PMVector.

Problem: Every time we change something in the internal representation (which is rather simple but needs a lot of optimization), most of the complex DataFrame API breaks, since it depends heavily on the data structure it references. For example, if we decided to switch from Matrix to PMMatrix, we would have to rewrite almost every method written so far. As DataFrame grows, it therefore becomes harder and harder to introduce changes.

Solution

The solution I came up with is similar to the Facade design pattern. It consists of three parts:

  1. DataFrame: defines the advanced user-level functionality. It depends only on the facade of the black box.
  2. DataFrameInternal: acts as a facade for the black box where all the advanced storage logic is implemented. It maintains a stable API.
  3. Classes inside a “black box”: they define a very efficient way of storing and accessing the data.

This way the internal representation can be changed and optimized at any time. As long as the public interface of DataFrameInternal stays the same, DataFrame will not be affected.
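The three-part design can be sketched roughly as follows. This is only a Python analogy, not the actual Pharo code: the storage classes and every method name (cell, shape, row_at, column_at, column, column_means) are hypothetical and were chosen just for illustration. What it tries to show is that DataFrame talks only to DataFrameInternal, so the storage behind the facade can be swapped without touching the user-level methods.

```python
# Python analogy of the three-part design. The real project is written in
# Pharo; all class internals and method names here are hypothetical.


class ListOfRowsStorage:
    """Black box, variant 1: the data is kept row by row."""

    def __init__(self, rows):
        self._rows = [list(row) for row in rows]

    def cell(self, i, j):
        return self._rows[i][j]

    def shape(self):
        n_rows = len(self._rows)
        return (n_rows, len(self._rows[0]) if n_rows else 0)


class ListOfColumnsStorage:
    """Black box, variant 2: the same data, kept column by column."""

    def __init__(self, rows):
        self._columns = [list(col) for col in zip(*rows)]

    def cell(self, i, j):
        return self._columns[j][i]

    def shape(self):
        n_cols = len(self._columns)
        return (len(self._columns[0]) if n_cols else 0, n_cols)


class DataFrameInternal:
    """Facade: the only class that knows which storage is in use."""

    def __init__(self, rows, storage_class=ListOfRowsStorage):
        self._storage = storage_class(rows)

    def number_of_rows(self):
        return self._storage.shape()[0]

    def number_of_columns(self):
        return self._storage.shape()[1]

    def row_at(self, i):
        return [self._storage.cell(i, j) for j in range(self.number_of_columns())]

    def column_at(self, j):
        return [self._storage.cell(i, j) for i in range(self.number_of_rows())]


class DataFrame:
    """User-level API: depends only on the facade, never on the storage."""

    def __init__(self, rows, column_names, storage_class=ListOfRowsStorage):
        self._internal = DataFrameInternal(rows, storage_class)
        self._column_names = list(column_names)

    def column(self, name):
        return self._internal.column_at(self._column_names.index(name))

    def column_means(self):
        return {
            name: sum(self.column(name)) / self._internal.number_of_rows()
            for name in self._column_names
        }


rows = [[1, 10.0], [2, 20.0], [3, 30.0]]
by_rows = DataFrame(rows, ["id", "value"], ListOfRowsStorage)
by_columns = DataFrame(rows, ["id", "value"], ListOfColumnsStorage)

# Swapping the storage does not change what the user-level API returns.
assert by_rows.column("value") == by_columns.column("value")
assert by_rows.column_means() == by_columns.column_means()
```

In this sketch a change of storage touches only the black-box class and possibly the facade, which mirrors the Matrix to PMMatrix scenario described earlier.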

Results

After separating the internal representation, the methods of DataFrame became much more readable and straightforward. It allowed me to clean up all the mess and make all the tests green.

Written by Oleksandr Zaitsev

PhD Student at Inria Lille, RMoD team. Researcher of software evolution at Arolla. Pharo contributor and GSoC org from Pharo Consortium. MSc. in Data Science.
