# Introducing scalaps: Scala-inspired data structures for Python

I’ve found that working on collections of elements by applying functions through well-defined algorithms (e.g., `map`, `filter`, and `reduce`) greatly simplifies my code and removes many sources of errors. I was therefore delighted to discover that Scala pushes this to the next level by building a plethora of such algorithms into its data structures. These concepts share some similarities with Spark RDDs and Java Streams, but I find the Scala approach simpler and more elegant.

As I return to data analysis and machine learning with Python, I’ve found it helpful to port these concepts to Python in a new library, scalaps. You can find the code at github.com/matthagy/scalaps.

In this article, we’ll walk through a few examples of how scalaps can simplify our code. Let’s start with this basic, contrived example.

`ScSeq` is a wrapper around any sequence and it provides numerous methods for operating on its input sequence. Many of these methods return another `ScSeq` instance.

Rather than analyze the contrived example, let’s walk through a more realistic example of analyzing Reddit posts using scalaps. For background, we have a sample of Reddit posts in the following CSV format.

We start by accessing the data and parsing it.

You can see the following `ScSeq` methods used in this example.

• map(func): map a function across the current sequence to create another sequence
• to_frozen_list(): create a frozen, realized list of the current sequence as implemented in `ScFrozenList`.
• take(n): return a sequence that will have at most the first `n` elements of the current sequence
• for_each(func): call the function on every element of the sequence in order

Here’s the equivalent conventional Python for these operations.
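As a minimal sketch of that conventional approach, the following uses only the standard library. The CSV columns (`subreddit`, `title`, `score`), the `Post` type, and the inline sample data are assumptions made for illustration, not the actual dataset.

```python
import csv
import io
from collections import namedtuple
from itertools import islice

# Hypothetical sample data; the real column layout is an assumption.
SAMPLE_CSV = """subreddit,title,score
python,Introducing scalaps,42
scala,Why I like collections,17
python,Lazy sequences explained,8
"""

Post = namedtuple("Post", ["subreddit", "title", "score"])


def parse_post(row):
    """Convert one CSV row (a dict of strings) into a Post."""
    return Post(row["subreddit"], row["title"], int(row["score"]))


# Like map + to_frozen_list: parse every row and realize an immutable tuple.
posts = tuple(parse_post(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Like take(2) + for_each(print): show at most the first two posts.
for post in islice(posts, 2):
    print(post)
```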

That is perfectly reasonable Python, and we haven’t yet seen the strength of scalaps.

Next, let’s look at counting the number of elements that match a criterion.

This introduces two more `ScSeq` methods.

• filter(func): select elements that match a criterion
• count(): count the number of elements in the sequence

Note that `filter` is lazy: rather than evaluating to a realized collection, it returns a lazily computed sequence. The same is true for other methods such as `map`. In fact, an entire `ScSeq` can be lazy when it is built from a lazy source, e.g., reading lines from a file.

In contrast, `count` is a sink. It realizes each element of the sequence by pulling it through the chain of operations, starting from the source. `count` is a constant-memory sink and can therefore consume massive lazy sequences. Other sinks include `to_frozen_list`, which realizes the sequence into an immutable list of type `ScFrozenList`.
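The same lazy-versus-sink distinction exists in plain Python iterators, which makes it easy to see in isolation. This sketch uses only built-ins and is not scalaps code; `is_even` and `evaluated` are hypothetical names used to record when the predicate actually runs.

```python
evaluated = []  # records which elements the predicate has actually seen


def is_even(n):
    evaluated.append(n)
    return n % 2 == 0


# Building the pipeline does no work: filter() returns a lazy iterator.
lazy_evens = filter(is_even, range(10))
assert evaluated == []

# A constant-memory sink: consume the iterator one element at a time.
even_count = sum(1 for _ in lazy_evens)
print(even_count)  # 5
```

Only the sink forces the predicate to run, and it never holds more than one element at a time.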

Once a `ScSeq` has been realized, it cannot run again. Instead, we can reconstruct the sequence to realize it again. It can be useful to have functions that build a sequence from passed-in source(s) so that we can easily reconstruct the sequence as needed. E.g.,
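Plain Python generators behave the same way, so the pattern can be sketched without scalaps. Here `build_long_words` is a hypothetical builder function: calling it again produces a fresh pipeline over the same source.

```python
def build_long_words(lines):
    """Build a fresh lazy pipeline over `lines` each time it is called."""
    words = (word for line in lines for word in line.split())
    return (word for word in words if len(word) > 4)


source = ["lazy sequences are consumed once", "rebuild them when needed"]

pipeline = build_long_words(source)
first_pass = list(pipeline)   # realizes the pipeline
second_pass = list(pipeline)  # already exhausted, so nothing is left

rebuilt = list(build_long_words(source))  # a new pipeline over the same source
```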

Returning to the Reddit post example, let’s compute the most popular subreddits in this sample of posts. This is accomplished with the following code.

This introduces a few new scalaps concepts. First, we’re passing a string to `map`. This is interpreted as “select the attribute of that name for each element”. Similarly, integers are interpreted as integer item lookups in a collection.
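To make the shorthand concrete, here is one way such a dispatch could be written with the standard library. This is an illustrative sketch, not the scalaps implementation; `as_callable` and the two-field `Post` type are hypothetical.

```python
from collections import namedtuple
from operator import attrgetter, itemgetter


def as_callable(key):
    """Hypothetical helper: turn a str/int shorthand into a function."""
    if isinstance(key, str):
        return attrgetter(key)  # "subreddit" behaves like lambda x: x.subreddit
    if isinstance(key, int):
        return itemgetter(key)  # 1 behaves like lambda x: x[1]
    return key                  # assume anything else is already callable


Post = namedtuple("Post", ["subreddit", "score"])
posts = [Post("python", 42), Post("scala", 17)]

subreddits = list(map(as_callable("subreddit"), posts))
scores = list(map(as_callable(1), posts))
```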

Next, we use the method `value_counts()`. This sink computes an `ScDict` in which each key is an element from the sequence and the value is the number of times the key occurred. `ScDict` is an augmented Python dictionary that includes functionality such as returning `ScSeq`s for `keys()`, `values()`, and `items()`. In the example, we use the `items()` method to generate a sequence of key/value tuples.

The sequence is then sorted into a `ScList` using `sort_by(key)`. Note, we’re using the integer `1` as the key so as to select the second element, the count, of each tuple. Hence, the list is now sorted by the number of posts.

`reverse()` is used to generate a `ScSeq` in reversed order so that the subreddits are ordered by descending post count. `take(n)` is used with `n=5` to select the first five subreddits. Lastly, they’re printed through `for_each(print)`.
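For comparison, one plausible conventional Python version of the same steps looks like this. The list of subreddit names is hypothetical sample data, and each line is annotated with the step it mirrors.

```python
from collections import Counter

# Hypothetical subreddit names, one per post.
subreddits = ["python", "scala", "python", "java", "python", "scala"]

counts = Counter(subreddits)                             # like value_counts()
by_count = sorted(counts.items(), key=lambda kv: kv[1])  # like sort_by(1)
top = list(reversed(by_count))[:5]                       # like reverse() then take(5)

for name, n in top:
    print(name, n)
```

`Counter.most_common(5)` would collapse the last three steps into one call, at the cost of hiding the individual stages.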

I find this to be a more elegant description of this algorithm than the comparable Python. Do you agree? If not, in your opinion, what would the comparable Python be and why is it more elegant? Let me know in the comment section below.

Lastly, let me leave you with a more sophisticated use of scalaps that computes the frequency of title words in each subreddit.

I won’t explain the full example. Instead, see if you can reason through what the code is doing based upon the naming of the methods and the names of the functions used with them.

I will point out two interesting methods.

• flat_map(func): takes a function that returns a sequence for each element in the original sequence. Each returned sequence is expanded, in order, within a returned `ScSeq`.
• group_by(func): construct an `ScDict` where the keys are computed by `func` and each value is an `ScList` with all the elements that have that same key.
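If these two operations are unfamiliar, their behavior can be approximated with the standard library. The functions below are stand-alone sketches that return plain iterators and dicts rather than `ScSeq` or `ScDict`; they are not the scalaps implementations.

```python
from collections import defaultdict
from itertools import chain


def flat_map(func, items):
    """Lazily apply func to each item and concatenate the results in order."""
    return chain.from_iterable(map(func, items))


def group_by(func, items):
    """Group items into lists keyed by func(item), preserving input order."""
    groups = defaultdict(list)
    for item in items:
        groups[func(item)].append(item)
    return dict(groups)


titles = ["lazy python", "scala streams"]
words = list(flat_map(str.split, titles))
by_initial = group_by(lambda w: w[0], words)
```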

In reading through this example, what are your thoughts on the legibility of the Python code with scalaps? I personally find this approach easier to reason through than conventional Python. Further, I’ve used such a style in Scala, Java, and (Py)Spark to structure my code as applying functions to collections using built-in algorithms. I’ve come to find this approach simpler, more legible, and less error-prone than conventional imperative programming.

Thanks for considering scalaps! I hope it can help simplify your code, improve the readability, and eliminate errors. Let me know what you think in the comment section below.

Again, you can find the code at https://github.com/matthagy/scalaps. It is a nascent library and very much a work in progress. E.g., it needs tests, and I’ll develop them once I get some feedback on the API. PRs are also welcome.

These examples were derived from a Scala learning resource, Interactively exploring Reddit posts using basic Scala in your browser. Check it out if you’d like to learn more about the elegant and powerful programming language, Scala.


## More from Matt Hagy

Software Engineer and fmr. Data Scientist and Manager. Ph.D. in Computational Statistical Chemistry. (matthagy.com)
