New chapter in Functional Data Structures in R
Over Easter I’ve been writing on a little book on Todoist together with my good friend Amir, and that has been a lot of fun, but I am back from my holiday today and did a little writing on my next R book, Functional Data Structures in R. I’ve finished another chapter, and you can get the current draft on Gumroad by following the link to the book.
The new chapter is on bags, stacks, and queues, and the reason it isn’t highlighted as drafted in the progress figure above is that I am thinking about adding one more implementation of queues to the chapter.
If you implement queues in a language where you can modify pointers, you can get a very efficient version through a double-linked list, but if you are in a language where data is immutable you have to do a little more work. R isn’t entirely pure, as a functional language, but it does have immutable data, so the implementation I present in the chapter is a mix of a functional and an ephemeral data structure. Strictly speaking, you can implement a double-linked list with a little work in R, but you have to abuse environments to do so, and the solution I present does use environments but is less abusive than a real double-linked list. It relies on an amortised analysis that lets you implement queues as two lists where you have to reverse one to pop from the other when the other is empty. You have to think this way: whenever you add an element to a queue, you take twice as much time as you really do, but then the extra time is put in the bank and it will pay for getting the front element of the list later on. The figure below illustrates it, and you can read the explanation in the new chapter.
Now, a nice property about purely functional data structures is that they are persistent. That means that you always have access to older versions of them after you “modify” them — but this property destroys the amortised analysis you can do on them. When we do amortised analysis we think in terms of a sequence of operations we can do on a data structure and do an analysis that shows that on average they have a certain runtime complexity. There are some cheap operations and some expensive operations, but we can make the cheap operations pay for the expensive operations such that the average complexity is bounded. That works great for ephemeral data structures — that is data structures that are modified whenever we work on them — but if we have persistent data structures — those that we can access earlier versions of, this analysis goes out the window.
For queues, it works like this: you can add to a queue with cheap operations, but some dequeueing operations are costly. If you always have a sequence of operations where the next operation operates on the data structure the resulted from the previous operation, then the amortised analysis is correct. But if you have a persistent data structure, then nothing prevents you from doing a number of cheap operations and then a sequence of expensive operations, going back to the same data structure you had after all the cheap operations. This means that you cannot control the balance between cheap and expensive operations. The amortised analysis breaks.
I give an explanation of that in the new chapter.
If you want a guaranteed complexity, and also persistent data structures, you need to work with worst-case complexity, not amortised complexity.
You can get out of this using lazy evaluation, and I have a paper on how to implement queues this way on my desk right now, but I haven’t implemented it yet. R doesn’t really have lazy evaluation — as with much else in R, it is a half-measure thing: R does have lazy evaluation for function arguments but not evaluation of expressions in general. I am pretty sure I can implement lazy evaluation via some non-standard evaluation where I can delay evaluation via thunks, but I need to think about it a little more before I experiment with it.
I will get there, but I wanted to share the current chapter today and leave that update for a later time. I might be able to implement it tomorrow, but if I don’t, I will be away for a seminar until Thursday, so I might as well share what I have right now.
I would really like to add some thoughts on this to the queue chapter, but it depends a little on how difficult it will be to implement. I suspect that the lazy evaluation solution will be a lot slower than the amortised solution for any non-persistent use of the data structure, and depending on how hard it is to implement I might leave it out of the book. I will definitely try to get it into the R package I am working on in parallel with the book, though. You can get the code for that on GitHub anyway.
For the next chapter, on functional heaps, I have a lot of stuff planned. I have only implemented leftist-heaps in the package so far, but I have a stack of papers of data structures that I look forward to implementing and writing about, so that is going to be a lot of fun. It will probably take me a few weeks to get it all done, though. And the coming weeks are a little busy at work, so that might mean a month or so. Anyway, I will keep you guys updated on the progress.
For the chapter after that — search trees and implementations of sets — I have come up with a neat way of using non-standard evaluation to implement the kind of pattern matching you have in Standard ML and Haskell for R. It is inspired by my bindr package I wrote for my Meta-programming book. You can find it in my ralgo package if you are impatient, but otherwise you can read about it when I get around to explaining it. I might use it already in the heaps chapter to implement splay-heaps. What it really is, is a way of both testing the structure of a tree and testing properties about it, so you can implement tree transformations with little code. I want to experiment with it a little more to get it just right, but I am pretty happy with what I have so far.
I am having a lot of fun working with data structures in R right now, after returning from my Easter vacation, and I can’t wait to implement more. It will have to wait, though, as I would really like to finish the book I’m writing with Amir first and then I have to go for a seminar on how to interact with the media on Tuesday and Wednesday — those of you who know me will agree that this is probably a good idea; I need to know how to combat the fake news, after all (and also learn how to not say controversial things on national television).
Anyway, as always, I would love any comments or criticism of my manuscript so far — and for this particular book where I am doing a lot more research and experiments I would love to collaborate on the data structure package I am working on as well. I am playing with the idea of writing a piece for the R journal about the package once I’m done with the book, and I would love to find collaborators on that.
Enough about the update for that book now, though. On to the next post…