Box Drive — Undoing a bad engineering decision
The first post in this series talked about the LIS and how, in BoxDrive 2.1, the local LIS was replaced with a Virtual Local LIS that saves startup time and significantly reduces steady-state memory.
Making this change was interesting for a couple of reasons:
- How a questionable engineering decision back in 2013 made this a much harder change than it should have been.
- The mechanics of implementing the rather tricky change.
A Bad Engineering Decision
Back in 2013 I was part of a small group tasked with designing the architecture for Sync4. Shadows, ConsistentShadows, Reorderers, Atomic Save Engines, Monitors, BufferedMonitors, Scanners, Executors, ExecutorActivity, and many other abstractions were being woven together. If you recall (or re-read) my first post in this series, you can see a pretty natural relationship between the Monitor and the LIS. The Monitor gets data from the file system (called the Streem FS), passes it through the LIS, and the flow continues from there. The Monitor uses an LIS. The Monitor Has-A LIS.
Well, that relationship was misunderstood back in 2013! We somehow thought the Monitor Is-A LIS. Here’s what these two relationships look like in UML and Python:
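A minimal Python sketch of the two relationships (class and method names here are illustrative, not the engine's actual API):

```python
class LogicalItemShadow:
    """The LIS: a store of logical item state (illustrative stub)."""
    def get_item(self, path):
        ...


# Is-A: the Monitor inherits from the LIS (what Sync4 actually did).
class MonitorIsA(LogicalItemShadow):
    pass


# Has-A: the Monitor holds a reference to an LIS (the natural relationship).
class MonitorHasA:
    def __init__(self, logical_item_shadow):
        self._logical_item_shadow = logical_item_shadow
```

With Is-A, every LIS method leaks into the Monitor's public surface; with Has-A, the Monitor exposes only its own API and delegates to the LIS it holds.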
What were we thinking? Since early 2015 I’ve been lamenting this decision. And the lament grew stronger with the realization that BoxDrive didn’t need the startup-time-wasting and memory-hogging LIS on the local side, for two reasons:
- The LIS is what allows the Monitor to detect “echoback” notifications. But in BoxDrive, “echobacks” on the local side don’t exist by design. Streem doesn’t generate notifications for changes made by the engine through a private API.
- The engine can have perfect information on the file system because it has the private API into Streem. So having an imperfect data source (a traditional LIS) is superfluous.
Awesome! BoxDrive can eliminate the Local LIS, saving startup time and memory. Let’s go! Unfortunately, it wasn’t a trivial change.
Referring again to the high-level architecture diagram of the engine,
notice the symmetry. That’s a great part of the architecture — regardless of the direction:
- A local change gets propagated to Box.com
- A Box.com change gets propagated to local
A lot of the logic is the same! That was a great engineering insight/decision. One example of this shared logic is the Monitor class hierarchy:
Where all the shared logic is in the common base class, and any specializations are in the subclasses. However, because of our bad decision, the hierarchy actually looks like this:
And now in BoxDrive what we really want is this:
Where the Box monitor has a traditional LIS and the Local monitor has a Virtual Local LIS.
The transformation from what we had to what we wanted presented some interesting challenges:
- Finding some techniques / tricks to aid in the IsA to HasA transformation
- Making the transformation without a feature-flip to hide partial or in-progress work
- Making the transformation in small, hopefully easy-to-understand Pull Requests (commits)
To me this is “engineering”. Real, professional engineering. I enjoy writing new code (who doesn’t?). I also enjoy making existing code better. This engineering challenge was the latter.
Techniques for IsA to HasA change
I did a bit of research looking for tips on doing a large IsA to HasA transformation. Didn’t find anything great. I also quickly scanned a great book, which I highly recommend, Working Effectively with Legacy Code (Michael Feathers), and didn’t find anything there either.
I did come up with one insightful (I think?) idea that helped facilitate the transformation — right away I added a pseudo HasA relationship on top of the real IsA relationship:
```python
class Monitor(LogicalItemShadow):
    def __init__(self, ...):
        ...
        # Define a HasA attribute; just make it self-referential
        self._logical_item_shadow = self
        ...
```
Once this was defined, I could incrementally change code from depending on the IsA relationship to using the HasA self._logical_item_shadow. It’s a pretty nifty trick! So for the 2-3 weeks of the effort, the class diagram was more or less this strange:
And as work progressed more and more code used the HasA relationship. (NOTE: the engine is written in Python, so there’s no such thing as a reference to only the LIS API. It’s just by convention that code restricts itself to calling only APIs implemented in the LIS class.)
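In practice, migrating a call site looked something like this sketch (the method names are hypothetical, not the engine's real API):

```python
class LogicalItemShadow:
    """Illustrative LIS stub."""
    def get_item(self, path):
        return {"path": path}


class Monitor(LogicalItemShadow):
    def __init__(self):
        # The self-referential HasA bridge
        self._logical_item_shadow = self

    def old_style(self, path):
        # Depends on the IsA relationship: calls the inherited LIS method
        return self.get_item(path)

    def new_style(self, path):
        # Goes through the HasA attribute. Behavior is identical today,
        # but this code keeps working once the attribute points at a
        # separate LIS (or Virtual Local LIS) object instead of self.
        return self._logical_item_shadow.get_item(path)
```

Once every call site reads like new_style, the inheritance can be dropped and the attribute pointed at a real LIS instance.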
No Feature Flips allowed
In general I try avoiding Feature Flips, which hide work-in-progress until it’s 100% complete. I find it far more Agile to incrementally push new code (without negatively impacting the product, of course) than to develop some big feature that stays hidden until the FF is turned on, exposing 100’s or 1000’s of new lines of code in Production all at once.
I wanted to get through this change without introducing any regressions into our daily internal builds. I was pretty confident I could. Unfortunately, I didn’t quite make it… 3 times. Each regression was interesting:
- New code triggers a deadlock. Our automated end-to-end test framework started hitting a deadlock, and this coincided with my work. I quickly discovered the problem, which was actually in existing code. In terms of acquiring “locks”, software engineers know 1) always acquire locks in the same relative order and 2) higher-level locks should be acquired before lower-level locks. Well… there was months-old code that was acquiring a high-level lock just after acquiring a low-level lock. Oops… that’s dangerous. The trap was set, and once triggered… BOOM! One of my changes fell into this trap by acquiring the locks in the proper semantic order. Correcting that older code was easy and the deadlock disappeared.
- Performance issue with icons. Here my changes exposed existing code that was querying the LIS 3 times in one method, each time getting more or less the same information. With the in-memory LIS this wasn’t an obvious problem. But with the switch to the VLLIS, which involves communicating with the local file system, such calls were more expensive. A bit of refactoring in that method solved this performance regression.
- Incorrect icon and context menus for Ignored Files. This was a super interesting issue and wholly my fault! It involves a subtle aspect of the old LIS vs the new VLLIS having perfect information. The old LIS didn’t have any knowledge of Ignored Files, as they are “filtered” out by Streem. Only Streem knows about those files. But when the local side switched to the VLLIS, suddenly these Ignorable files were known to the engine. This subtle change in “contract” broke the UX related to Ignorable files. A couple small changes in engine logic and the regression was quickly fixed.
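The lock-ordering trap behind the first regression can be sketched like this (the lock names are illustrative, not the engine's actual locks):

```python
import threading

high_level_lock = threading.Lock()  # e.g. an engine-wide lock
low_level_lock = threading.Lock()   # e.g. a subsystem lock


def months_old_code():
    # Bug: takes the low-level lock first, then the high-level one.
    with low_level_lock:
        with high_level_lock:
            pass  # ... do work ...


def my_new_code():
    # Proper semantic order: high-level lock before low-level lock.
    # Run concurrently with months_old_code(), each thread can end up
    # holding one lock while waiting for the other: deadlock.
    with high_level_lock:
        with low_level_lock:
            pass  # ... do work ...
```

The fix was to flip the acquisition order in the old code, so every path takes the high-level lock first.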
To date those are the 3 regressions, all found before release.
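The refactoring behind the second regression amounted to querying once and reusing the result; a sketch with hypothetical names:

```python
class IconHelper:
    def __init__(self, lis):
        self._lis = lis  # an LIS (or VLLIS) object with a get_item() API

    def badge_state_before(self, path):
        # Before: three lookups for more or less the same information.
        # Cheap against the in-memory LIS, but expensive against the
        # VLLIS, where each call talks to the local file system.
        name = self._lis.get_item(path)["name"]
        synced = self._lis.get_item(path)["synced"]
        shared = self._lis.get_item(path)["shared"]
        return (name, synced, shared)

    def badge_state_after(self, path):
        # After: one lookup, reused three times.
        item = self._lis.get_item(path)
        return (item["name"], item["synced"], item["shared"])
```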
Small incremental changes
I’m a stickler for doing small commits. Small commits make for easier and better code reviews; large commits are hard to process. At Box we use GitHub Enterprise, and one can see the size of a code review (aka Pull Request) here:
This project was going to touch a good number of files and change 100’s of lines of code. I typically want my commits to change no more than a couple hundred lines of code/comments. The VLLIS change was a big one, so I was going to be making lots of small commits (it turned out to be 24). It’s almost a “habit” for me to work like this, making small commits; I don’t really have to think about it. Reflecting, however, on what it takes to avoid large commits, I see several techniques:
- Plan/envision how the change can be made incrementally. Find the “ends” or “beginnings” of interconnected code, like the beginning or end of a strand of spaghetti, and work through that code path in steps.
- Well designed code with clear contracts and solid test coverage is very helpful.
- Learn how to use the tools at your disposal (git and GitHub Enterprise in my case) so that you can stage and organize a dependent series of changes (Pull requests in the parlance of GitHub).
- Communicate with the colleagues reviewing the changes so that they can see the larger picture while reviewing the smaller changes.
Here’s a scatterplot of the 24 commits (over ~21 days of work):
While a few of the commits were over my “soft limit” of ~200 changed lines, even those weren’t monsters. Nothing like the massive commits with >500 line changes that I see all too often. Each commit left the product in working condition (modulo the 3 regressions described earlier). For me, this is solid, professional software engineering. That this was possible also speaks to the quality of the product’s architecture, which allows for such changes, and to the quality of the unit and automated end-to-end tests in our CI process, plus our crack QA team.
It certainly felt great finally making right on that questionable IsA decision from years ago. As it so happens this was my final dance with this architecture, as I’m now off to a different team at Box. I go from intimately knowing the architecture and internals of a product, to knowing next to nothing on my new team! It’s an adventure!
Interested in joining Box? Check out our open positions.