Software Engineering and Lego Bricks

Tales of Software Engineering — 1

A team of developers once had to clean up the interface between two components of a database system, the Storage, and the Evaluation components. Evaluation retrieved (compressed) Values from Storage via that interface. To retrieve Values, the Evaluation code had to jump through arcane hoops of obscure LookupInfo and LookupHelper data structures that were as ugly as they were inefficient. They had come up at the top of Profiles again and again. They were not documented, and nobody actually knew what they did anymore…

Here is some of what they had to clean up:

LookupConstraint lookup_constraint(…);
Store *store = db->store->GetStoreType(lookup_constraint).first;
LookupHelper lookup_helper(lookup_constraint);
LookupInfo info;
store->Find(&lookup_helper, &info);
bool needs_filter = 
NeedFilter((&lookup_helper)->lookup_constraint, store->key_order);
bool record_key_lookup = store->IsRecordKeyLookup((&lookup_helper)->lookup_constraint);
Iterator* iter = new Store::Iterator(store->store_.get(), info.page_offset, info.record_offset, info.filter, needs_filter, record_key_lookup);

A Refactoring was embarked upon. Lo and behold, the team missed the idea of Separating and Encapsulating. Here is what they produced:

struct LookupData {
const uint64_t cost_;
const RecordHeader* header;
};
struct Store {
uint64_t CalculateHash(const string& key) const;
void Prefetch(uint64_t hash) const;
LookupData Lookup(const string& key, uint64_t hash) const;
};

That was indeed simpler than the original mess. But they let details of the Storage (precisely, the actual memory layout of objects in Storage), percolate into the Evaluation — that RecordHeader pointer returned in LookupData pointed straight to an address in the Storage memory. They invoked Simplicity and Speed. They were adamant that any kind of further abstraction would necessarily be antithetical to Speed. They even worried about Prefetching…

Evaluation became deeply entangled with the memory layout of the Storage. It became verbose, hard to read and even harder to reason about. It devolved into page after page of nested loops and if-then-else statements, for reading Values from the Storage layout directly was not easy. They did not see that Storage should fully own the memory layout of the Values. That Storage should be responsible for decoding Values for its Clients. They said there was only one Client (the Evaluation), so that an interface was not really needed anyway. A careful interface would be needed only if there were multiple Clients, but there would never be multiple clients, they were sure of that. And the code had to be written somewhere anyway, right? Did it really matter where? They were also sure that they would never have to change the memory layout. So they saw no problem in cutting and pasting the code to iterate over and decompress Values from another part of the system. It was actually “simpler” and more “efficient”… Of course, since then, they had to change the layout of the Storage 3 times.

In a parallel universe, another team of Engineers faced with the same problem produced the following interface:

template <class Key, class Value> 
class Store {
public:
class Iterator { // encapsulates details of store
public:
Value operator*(); // dereference
void operator++(); // increment
friend bool operator==(Iterator& other);
private:
...
};
struct LookupData {
const uint64_t cost_;
Iterator begin, end;
};
LookupData Lookup(const Key& key) const;
private:
...
};

These Engineers in the parallel universe knew about the old STL idea of separating Containers from Algorithms, and mediating their interactions via Iterators. They had thought about this idea long and hard, and found it could be useful in many situations. And they had come to realize through years of hard work that the idea of Separating and Encapsulating responsibilities was a Good Idea. They had analyzed the interface between Storage and Evaluation to tease out the responsibilities and requirements of each. They had concluded that Evaluation needed nothing more and nothing less than an Iterator over the Values in Storage. The Iterator would give Values to the Evaluation while taking care of details such as the memory layout and decompressing. They decided that the Evaluation code should not reach into the internals of the Storage. That seemed simpler and cleaner to them. It also held promises of easier maintenance and evolution, as well as division of labor and easier testing by parts.

They packaged not only the decoding of the Values in the dereference operator*, but also, much later, some prefetching instructions in the increment operator++. This greatly boosted the speed of their Iterator. Their Storage had the clearly defined single role of containing Values. Their Iterator’s single role was to efficiently iterate over Values, and they had made its exact responsibilities explicit by realizing that an Input Iterator was the exact concept they needed. And their Evaluation code ended up being shorter and more readable. It was not encumbered by boring, cut-and-pasted code to decompress Values. It was also easier to reason about, because responsibilities were cleanly separated out and packaged in small units. They reaped bonus points by writing unit tests for the Iterator. That greatly improved their development velocity, because when they delivered the Iterator, it was bug free.

Their interface worked. It was not leaky. They had clear and simple units of reasoning and testing to structure their software, with no implicit assumptions. They could build upon those bricks safely and quickly. They had successfully understood and applied a key Software Engineering concept: separate responsibilities, isolate and package them so that they are easy to understand, test and re-use. Then you can combine these elementary pieces, like Lego bricks.