Reproducible Data Collection: Retrieving Methods from Pharo Projects

Oleksandr Zaitsev
Jan 9

I am working on applications of natural language processing (NLP) to source code. As part of my work I build a dataset from all methods of selected projects written in the Pharo programming language. To make this process reproducible, I load the packages at a specific commit and share my dataset together with a list of projects and their commit SHAs.

In this post I will explain how to load a specific commit from a GitHub repository using Metacello, how to extract all packages that were loaded from a certain repository using Iceberg, how to collect classes, methods, and their source code from each package, and how to store all that data into a CSV file using NeoCSV.

Loading a specific commit with Metacello

To load a specific commit from a GitHub repository we can use the following template of a Metacello script:

Metacello new
    baseline: '{baselineName}';
    repository: 'github://{owner}/{projectName}:{sha}/{subFolder}';
    load.
  • {baselineName}: name of the baseline to be loaded
  • {owner}: name of the user or organization hosting the project
  • {projectName}: name of the project
  • {sha}: SHA of the commit to be loaded
  • {subFolder}: subfolder containing the code. This parameter is optional and can be omitted when the code is at the root of the project.

For example, the following Metacello script will load commit 0724b99 of the Seaside3 project:

Metacello new
    baseline: 'Seaside3';
    repository: 'github://SeasideSt/Seaside:0724b99/repository';
    load.

For more information about working with baselines, visit PharoWiki.
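Since the goal is to share a list of projects with their commit SHAs, the template above can also be filled in programmatically. Here is a minimal sketch, assuming the project list is kept as a literal array. The variable names projects, spec, and url are my own for illustration, and the only entry repeats the Seaside example from this post.

```smalltalk
"Hypothetical project list: baseline, owner, project, commit SHA, subfolder.
 The single entry repeats the Seaside example from this post."
projects := #( #('Seaside3' 'SeasideSt' 'Seaside' '0724b99' 'repository') ).

projects do: [ :spec | | url |
    "Fill the repository template with owner, project, SHA and subfolder"
    url := 'github://{1}/{2}:{3}/{4}' format: spec allButFirst.
    Metacello new
        baseline: spec first;
        repository: url;
        load ].
```

Keeping the list as data means the same array can be published next to the dataset, so anyone can reload exactly the same commits.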

Extracting the list of loaded packages from the repository

To get the list of all loaded repositories we send the message registry to the IceRepository class.

IceRepository registry.

We can see that the repository of the Seaside3 project is called Seaside. We select this repository and store it in a repo variable.

repo := IceRepository registry detect: [ :r | r name = 'Seaside'].
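If no registered repository has that name, detect: raises a generic "not found" error. A slightly more defensive variant, sketched here with the standard detect:ifNone: message, fails with a message that says what is missing. The error text is my own wording.

```smalltalk
"Fail with a readable message when no registered repository has that name"
repo := IceRepository registry
    detect: [ :r | r name = 'Seaside' ]
    ifNone: [ Error signal: 'Repository Seaside is not registered in this image' ].
```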

Now we select the loaded packages from the working copy of that repository:

icePackages := repo workingCopy packages select: #isLoaded.

These are instances of the class IcePackage, which stores information about a package saved in a repository. We need the actual Smalltalk packages, which are instances of the class RPackage. So we extract the package name (as a string) from each object and collect the packages with those names from the image.

packageNames := icePackages collect: #package.
packages := packageNames collect: [ :name | RPackageOrganizer default packageNamed: name ].

This gives us the collection of all packages that were loaded from the Seaside repository.

Collecting classes, methods, and source code

Each package can give us the list of its classes, classes hold references to their methods, and each method can give us its source code as a string.

aPackage := packages atRandom.
aClass := aPackage classes atRandom.
aMethod := aClass methods atRandom.
code := aMethod sourceCode.

We want our dataset to be a table where each row corresponds to a method. So first we collect an array of methods from all packages (we don't lose the package information because each method knows which package it belongs to).

methods := (packages flatCollect: #classes) flatCollect: #methods.
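Note that sending methods to a class answers only its instance-side methods. If class-side methods should be part of the dataset as well (an assumption on my side, the expression above does not include them), a minimal sketch is to also ask the metaclass of each class:

```smalltalk
"Collect instance-side methods together with the class-side methods
 held by the metaclass (aClass class)"
methods := (packages flatCollect: #classes) flatCollect: [ :aClass |
    aClass methods , aClass class methods ].
```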

For each method in our dataset we want to know its project name, package name, class name, method name, and source code.

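methodData := methods collect: [ :each |
    "One row per method: project, package, class, method name, source code"
    { repo name . each package name . each methodClass name . each selector . each sourceCode } ].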

Now we have all the data in a table (array of arrays) which we can save into a CSV file.

Writing to CSV

CSV (comma-separated values) is a file format for tabular data where each row is written on a new line and the values in a row are separated by commas. To make sure that such a file can be parsed unambiguously without quoting, the stored values must not contain line breaks or commas. Source code contains many commas, and we do not want to remove them because they are an important element of the syntax. What we can do instead is separate the values with tabs, and replace both tabs and newline characters in the source code with spaces. Strictly speaking, files that use tabs instead of commas are called TSV (tab-separated values), but most of the time CSV is used as a general term regardless of the actual separator character.

So we replace tabs and line breaks in the source code of all methods with spaces. And we also want to remove all duplicate spaces if there are any. In general, we want to replace all sequences of whitespace characters with a single space. This can be easily done with a regular expression.

regex := '\s+' asRegex.
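To show the replacement in action, here is a minimal sketch that uses copy:replacingMatchesWith: from Pharo's regex package. The sample string is made up for illustration, and the last statement assumes that each row of methodData keeps the source code as its fifth element, following the column order described above.

```smalltalk
regex := '\s+' asRegex.

"Made-up sample: a method whose source contains a line break, a tab and repeated spaces"
sample := 'foo: x' , String cr , String tab , '^ x   +  1'.
cleaned := regex copy: sample replacingMatchesWith: ' '.

"Apply the same cleanup to the source column (fifth element) of every row"
methodData do: [ :row |
    row at: 5 put: (regex copy: (row at: 5) replacingMatchesWith: ' ') ].
```

The rows are updated in place, so methodData can be written to the file afterwards without further changes.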

Now we need to load NeoCSV. This can be done using Catalog Browser (World Menu > Tools > Catalog Browser). Search for NeoCSV and press “Install”.

First we create a writing file stream.

stream := (File named: '/Users/oleks/Desktop/data.csv') writeStream.

We initialize NeoCSVWriter on that stream.

neoCSVWriter := NeoCSVWriter on: stream.

Now we have to customize it to use the tab character as a separator and not to surround values with quotes (quoting is done by default, but we may have quotes in the source code). We do this by setting fieldWriter to #raw (by default its value is #quoted).

neoCSVWriter
    separator: Character tab;
    fieldWriter: #raw.

We write column names as the first row in our file:

neoCSVWriter nextPut: #(project package class method source).

And write all the collected data (it should be presented as an array of arrays):

neoCSVWriter nextPutAll: methodData.
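The steps above leave the stream open, so it still has to be closed before the file is complete. One way to avoid forgetting this is writeStreamDo:, which closes the stream when the block finishes. The following is only a sketch of the same writing steps in that style: the relative file name data.csv and the sample row are placeholders of mine, and ensureDelete is there because I assume we want to start from an empty file rather than write over an old one.

```smalltalk
"Hypothetical sample row standing in for the collected methodData"
methodData := { { 'Seaside' . 'Some-Package' . 'SomeClass' . 'answer' . 'answer ^ 42' } }.

file := 'data.csv' asFileReference.
file ensureDelete.  "start from an empty file"

"The stream is closed automatically when the block returns"
file writeStreamDo: [ :stream |
    (NeoCSVWriter on: stream)
        separator: Character tab;
        fieldWriter: #raw;
        nextPut: #(project package class method source);
        nextPutAll: methodData ].
```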

For more information, read the NeoCSV chapter of the Enterprise Pharo book.


Thanks to Cyril Ferlicot for teaching me how to extract loaded packages from repositories and work with baselines.

Written by Oleksandr Zaitsev

Relais thèse at Inria Lille. Pharo contributor and GSoC org from Pharo Consortium. MSc. in Data Science, BSc. in Informatics.