Mitigating Risk: When Engineering Success Is Silent
How we moved our permission checks on files and folders to a new framework without breaking anything
Often, when we write or talk about a project, it's because it's a flashy new feature, uses a new technology, or excites us in some other way. At the other end of the spectrum, when a project causes a big failure, we hear about that too, whether we want to talk about it or not. Sometimes, though, some of the best engineering work isn't flashy and no one even notices it happened. In fact, it's better if no one notices; it means we mitigated the risk.
How you might mitigate risk will vary from project to project — that’s part of what makes it difficult. If we knew exactly how to prevent bugs or breaking something, we would all do it. It’s equally important to understand what risks you’re mitigating. You could spend all year and totally prevent a class of failures, but if those failures don’t really matter, that may not be the best use of time. To that end, there are four steps for this process:
- Understand your appetite for risk. Can you afford to break a few things? Can you afford for things to run a bit slower?
- Determine what sorts of problems you might run into. Will you bring down a service? Will you cause slower page load times? Will you lose data? Will you expose data?
- Find ways to mitigate each of these problems. What can you do that would prevent these from happening? If you can’t prevent, is there a way to add in early detection or minimize the number of people affected?
- For each problem and mitigation, decide if it's worth the trade-off. Does the time needed to put the mitigation in place outweigh what would happen in the worst case? Might mitigating the risk introduce an even bigger risk?
I was recently on a project where we changed the entire framework we were using to check permissions on files and folders, but no one heard much about it because we also managed to not break anything. This was a piece of a much larger project that I’ve talked about before. Like nearly every other company the size and age of Box, we’re working to break up our monolith into microservices. As a part of that, one of the first things we did was build a new permissions service (I touch on that more here and here). Having a new service is great and all and building new things that use it is also great, but until we move existing permissions logic out of our monolith, we haven’t really accomplished our goal. So my team decided to tackle the project of moving out the permissions checking for files and folders from the monolith to the new service.
Understanding Our Risk Appetite and Our Likely Problems
Our permissions framework allows us to check whether the currently logged-in user can perform a given action on a given object. For example, we can check if the logged-in user can VIEW a specified file or can CREATE_A_COMMENT on that file. At Box, files and folders are the heart of what we do; there's very little you can productively do in our product if you can't access either. Likewise, our big selling point is that we're enterprise-grade, providing fine-grained access control and security. You can hopefully see where I'm going with this: if we were to deny access to a file or folder that a user should have access to, that would be a problem, and if we were to grant access to a file or folder that a user should NOT have access to, that would be an even bigger problem. Furthermore, we perform these permissions checks throughout a huge chunk of our code, so if the service responsible for checking them went down, it would effectively bring the whole application down. The only good news is that much of our main application is slow enough that we have a small tolerance for added latency. So, as we moved our logic, we could make things a bit slower, but we absolutely could not change a permissions outcome, and we absolutely could not let the permissions checking go down.
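To make the shape of these checks concrete, here's a minimal sketch of the kind of question the framework answers. All names, rules, and the data model here are hypothetical stand-ins, not Box's actual API:

```python
# Illustrative only: a toy model of "can this user do this action on this item?"
def can_perform(user: str, action: str, item: dict) -> bool:
    """Can `user` perform `action` (e.g. "VIEW", "CREATE_A_COMMENT") on `item`?"""
    if action == "VIEW":
        return user == item["owner"] or user in item["collaborators"]
    if action == "CREATE_A_COMMENT":
        # In this toy model, commenting requires being able to view.
        return can_perform(user, "VIEW", item)
    return False

file = {"owner": "alice", "collaborators": {"bob"}}
print(can_perform("bob", "VIEW", file))                # True
print(can_perform("carol", "CREATE_A_COMMENT", file))  # False
```

The real framework answers the same yes/no question, but the logic behind it spans thousands of lines and many possible reasons for access.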
Our Mitigation Plan
While we would have loved to migrate everything at once, the reality of the situation was that our existing items permissions code is nearly five thousand lines of poorly understood fun. Additionally, as we just established, we could not afford ANY mistakes.
Limit Our Change Set
Our first strategy to limit mistakes was to simply limit the amount we tried to change at once. We decided to start the migration with only a small slice of those five thousand lines of code. The old code is currently split up by the reason permissions might be granted or denied. For example, all of the code deciding whether I can see a file because I've been added as a collaborator on it is together. Likewise, all of the code that decides whether I can see the file because I'm an admin of an enterprise viewing through the admin console is together. This made picking just one of these sections an obvious split point. On top of that, the new policies we are writing are split out by action type — one policy for view, one for edit, and so on. Our old code isn't written that way, but it also made sense to further limit the change space by migrating only one of these actions. With these things in mind, we decided to first migrate the code that determines whether a user can VIEW a file via a shared link.
We picked the shared link code specifically because it is some of the least complex code and didn't have dependencies on other teams needing to make changes to enable the move. That said, it still wasn't simple, and it still wasn't fully understood by our team (or anyone). Also, because we were only migrating a small section, we needed to refactor the existing code to support a partial cut-over instead of a full replace. Even so, we felt it was worth moving a smaller amount of code.
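The difference between the two code organizations might look something like this toy sketch, with made-up function and field names standing in for the real code:

```python
# Old style (illustrative): one block of logic per *reason* access might be
# granted, each handling every action internally.
def allowed_via_collaboration(user, item):
    return user in item.get("collaborators", set())

def allowed_via_admin_console(user, item):
    return user == item.get("enterprise_admin")

def allowed_via_shared_link(user, item):
    return item.get("shared_link_open", False)

# New style (illustrative): one policy per *action*, consulting whichever
# grant reasons apply to that action.
def can_view(user, item):
    return (allowed_via_shared_link(user, item)
            or allowed_via_collaboration(user, item)
            or allowed_via_admin_console(user, item))

item = {"collaborators": {"alice"}, "shared_link_open": False}
print(can_view("alice", item), can_view("bob", item))  # True False
```

Migrating only the shared-link piece of the VIEW policy means carving one grant reason out of one action — a small, well-bounded slice of the whole.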
Add Controls to Detect and Limit Errors
To further limit our risk, and ensure that we weren’t changing behavior or breaking things with the new path, we also added parity checking. The parity checker that we wrote allowed us to run the new code path and the old code path and then compare the result. If the results differed, we logged that and returned the result of the old code path. Because PHP (what our monolith is written in) is single threaded, this has to be done synchronously and therefore means a performance hit. As we established earlier, though, correctness is FAR more important to us than performance, so this was also worth the trade-off.
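A parity checker along these lines can be sketched in a few lines. This is an illustrative Python version (the real implementation lives in the PHP monolith), with hypothetical names throughout:

```python
import logging

logger = logging.getLogger("permissions.parity")

def check_with_parity(user, action, item, old_check, new_check):
    """Run both code paths, log any disagreement, and always return the old result."""
    old_result = old_check(user, action, item)
    try:
        new_result = new_check(user, action, item)
    except Exception:
        # A crash in the new path must never affect the caller.
        logger.exception("new path raised; returning old result")
        return old_result
    if new_result != old_result:
        logger.warning(
            "parity mismatch user=%s action=%s item=%s old=%s new=%s",
            user, action, item, old_result, new_result,
        )
    return old_result

# Usage sketch: even when the paths disagree, callers see the old behavior.
old = lambda u, a, i: True
new = lambda u, a, i: False
print(check_with_parity("u1", "VIEW", "f1", old, new))  # True
```

Because the old result is always the one returned, a bug in the new path shows up as a log line to investigate rather than as a user-facing permissions change.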
The final piece we put in place to limit our risk was adding a feature flip to the whole thing. If the feature flip is off, only the old code path runs; if it's on, we do the parity checking. This let us confirm that the permissions service could handle the added load through a slow rollout, turning the flip on for a small percentage of users and gradually increasing that. It also meant that if there were any errors (hopefully not, given our other precautions), only a few users would experience them. Finally, it gave us a fast off-switch if anything went wrong with correctness or load.
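The percentage-based rollout piece can be sketched like this. Deterministic bucketing (assumed here, via hashing the user ID) is one common way to keep each user's experience consistent as the percentage grows; the function name and approach are illustrative, not Box's actual flag system:

```python
import hashlib

def in_rollout(user_id: int, percent: int) -> bool:
    """Place each user in a stable bucket 0-99 and enable the flag for
    buckets below `percent`. The same user always lands in the same bucket,
    so raising `percent` only ever adds users, never flip-flops them."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# percent=0 is the fast off-switch; percent=100 enables parity checking for everyone.
print(in_rollout(12345, 0), in_rollout(12345, 100))  # False True
```

The key property is that the off-switch is a single config change, not a deploy, so recovery from a correctness or load problem takes minutes rather than hours.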
Test All the Things
Parity checking and feature flipping are good for catching mistakes later, but ideally we could prevent them earlier. To that end, before we did anything else, we made sure that our existing code was fully tested for every scenario we could think of. My teammate spent several full days just cataloging the relevant tests we had and documenting anything she thought we were missing. While digging through the code and cataloging the tests, she also found a number of small inconsistencies that we were able to fix as an added bonus. Once we had a full set of tests, we refactored the existing code to separate the VIEW portion of the shared link permissions from everything else.
When we were confident that we hadn't broken anything, we added a new policy to our new framework and wrote tests on that end. Then we hooked up the code path to call both the new and old code, using the parity check code to fall back to the old result (and logging any disagreement). We could then run the tests through the new code path as well as the old one and verify that nothing had changed.
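Running one scenario catalog through both implementations might look like this toy sketch, where the scenario table and both implementations are made up for illustration:

```python
# Each scenario: (role, expected VIEW result). The same table exercises both
# the old and the new implementation, so any divergence fails loudly.
SCENARIOS = [
    ("owner", True),
    ("collaborator", True),
    ("stranger", False),
]

def old_can_view(role):
    return role in ("owner", "collaborator")

def new_can_view(role):
    return role in ("owner", "collaborator")

for impl in (old_can_view, new_can_view):
    for role, expected in SCENARIOS:
        assert impl(role) == expected, (impl.__name__, role)
print("both paths agree on all scenarios")
```

Cataloging the scenarios first, before writing the new policy, is what makes this comparison meaningful: the table captures the old behavior as the contract the new code must honor.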
As you can imagine, putting all of these safety rails in place took time. However, some of them, like the parity checker, can be re-used as we move future code paths. In the end, we found a few small edge cases we had missed thanks to the parity checker. We did introduce a slight performance hit in the process, but we otherwise did not negatively impact end users in any way, which fully met the goals of our project.
It likely will not make sense to mitigate risk for all projects in the way we did, or to the extent that we did. However, for every project, it should be a conscious decision. The possible risk areas should be understood before starting the project, as should which areas can and cannot tolerate a degradation or problem. While we all want visibility and recognition for our work, we don't want that recognition to come for the wrong reasons, so we should make sure to mitigate all of the important risks.