An engineering quality guide for early stage startups

Peter Lu
Published in Curai Health Tech
Jul 8, 2020 · 10 min read

Here at Curai, we are building a platform that scales high quality medical care for everyone. Every feature that is pushed to production impacts, either directly or indirectly, the care our patients receive. This also means that any bugs we push to production can detract from patients receiving the care they need. At the same time, we are a roughly 20-person engineering team without a dedicated QA engineer, and we frequently find the number of bugs in production rising with feature release velocity. More insidiously, tech debt builds up over time, slowing velocity.

Most startups are extremely resource constrained and face unique code quality challenges. They are constantly in a state of feature development under those constraints, which often leads to the accrual of technical debt. Some of this is unavoidable, especially pre-product market fit, simply because of the level of thrash in the core product. However, another significant portion is due to the Iron Triangle tradeoff between cost, scope, and time. Given constant constraints on cost and time, it stands to reason that code quality projects must be scoped as small as possible in order to be executed effectively, especially on engineering teams without dedicated QA engineers.

Internally, we had many discussions about code quality projects, ranging from comprehensive to trivial. The tipping point for reducing bugs in production and bug fix response time was finding low overhead solutions that, even if they didn’t fully solve the problems at hand, served as guide rails for engineers to incrementally improve the quality of their code. We ended up taking the following five steps:

  1. Identify pain points in quality code development
  2. Automate end to end testing
  3. Provide visual feedback on individual code quality
  4. Set up design review processes
  5. Elevate tech debt to first class work

1. Identify pain points in quality code development

Much of engineering at a startup occurs under the pressure of a tight deadline, so it’s important to make it as easy as possible for engineers to do the right thing with regard to code quality. When attributing the source of code quality issues, it can be easy to stop at surface level problems (e.g., “Feature creep caused us to run out of time on this project”). While it’s important to make sure projects are scoped to account for quality control on code, we found it paid higher dividends to focus on the aspects of development that made producing quality code slow and to remove those blockers. We identified several issues with our past development workflow:

  • We required manual end to end testing before each deployment, a long and tedious process that was rarely executed in its entirety.
  • Bug fixing was not tracked in a visible way, causing projects to be scheduled with a higher priority than bug fixes and leading to slow response times.
  • Architectural decisions were made with varying small groups of people and little documentation, making it difficult for engineers onboarding onto new code to work within the existing design decisions.
  • Tech debt generated during high priority feature pushes was not tracked and often fell by the wayside in favor of new project development.

2. Automate end to end testing

One crucial dimension of code quality tooling is minimizing workflow overhead for engineers. If a tool improves code quality but requires engineers to remember to perform extra steps, chances are it will be neglected during periods of intense feature development. This was especially evident in our deployment process, which involved a manual app testing step. Here is an example of steps we were verifying manually:

  1. When the doctor starts typing, the customer sees a typing message, and vice versa.
  2. When the doctor stops typing, the customer sees the typing message disappear, and vice versa.
  3. The doctor types a paragraph-long message and sees the text box expand accordingly.
  4. The customer types a paragraph-long message and sees the text box expand accordingly.
  5. When the doctor sends a message, the dot in the feed disappears and the customer sees the message, and vice versa.
  6. The doctor types a message in the message box, switches to another room, and comes back in; the typed message is restored.

There were over fifty such steps required for a full test, and even when engineers executed the QA steps manually, they often skipped those that they believed to be unchanged in the deploying change-set. Before each production deployment, one or more engineers were tasked with clicking through the app and verifying core behavior. When deadlines became tight, or a critical production error arose, this step was often skipped to speed up code deployment, leading to more bugs in production.

To ameliorate this issue, we implemented a series of end to end tests with Selenium to verify all core flows. This allowed us to verify behavior simply by executing a script on our local machines prior to deployment. Here is an example of some of that testing code:

Part of a Selenium test for verifying lab ordering flow
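The lab ordering snippet itself isn’t reproduced here, but a minimal sketch of the same pattern, applied instead to the typing-indicator checks from the list above, looks roughly like the following. The URL, credentials, and CSS selectors are illustrative placeholders rather than our real app’s.

    # Hypothetical sketch of a Selenium end to end test; the URL, credentials and
    # selectors are placeholders, not our production values.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    APP_URL = "http://localhost:3000"

    def login(driver, email, password):
        """Log a client in through the standard login form."""
        driver.get(f"{APP_URL}/login")
        driver.find_element(By.NAME, "email").send_keys(email)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

    def test_typing_indicator_and_message_delivery():
        # Two browser windows let us act as two different users at the same time.
        doctor, patient = webdriver.Chrome(), webdriver.Chrome()
        try:
            login(doctor, "doctor@example.com", "test-password")
            login(patient, "patient@example.com", "test-password")

            # When the doctor starts typing, the patient should see a typing message.
            doctor.find_element(By.CSS_SELECTOR, ".message-input").send_keys("Hello")
            WebDriverWait(patient, 10).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, ".typing-indicator"))
            )

            # When the doctor sends the message, the patient should see it in the feed.
            doctor.find_element(By.CSS_SELECTOR, "button.send").click()
            WebDriverWait(patient, 10).until(
                EC.text_to_be_present_in_element((By.CSS_SELECTOR, ".feed"), "Hello")
            )
        finally:
            doctor.quit()
            patient.quit()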

Importantly, Selenium also gave us the ability to create multiple windows and login as multiple clients, which is critical for us to test interactions between multiple users.

However, even though this step was much easier to execute, a few other issues arose. There was still an adherence problem: not all engineers ran the Selenium tests against every push, particularly those they perceived to be small changes. More importantly, running the tests on local machines led to a proliferation of “It works on my machine” issues, because resource constraints on local machines led to test flakiness that was not a result of code breakage.

To remedy this, we integrated the test suite further into the developer workflow by moving it into continuous integration. By running the tests prior to merging, we ensured that our principal branch always contained code which could execute core flows.

End to end tests in continuous integration face an engineering challenge not faced by standard unit testing: a true end to end test requires all relevant services to be running, as if in production, before it can execute. When running end to end tests locally, we could run them against the set of services already on our machines. To ensure maximum reliability in continuous integration, however, each pull request needed to spawn an ephemeral set of containers running our app. To do this, we adopted a docker-in-docker setup, in which we defined a parent docker container that in turn held all of the container resources needed to run our separate services, and spawned that parent container whenever a Github PR was created.

By moving our end to end test suite to continuous integration, we were able to achieve much more stable test runs, as well as codify adherence to good code quality practice. This helped us guarantee a working state in our primary branch without requiring any additional engineering mental overhead.

3. Provide visual feedback on individual code quality

Engineers generally want to push quality code, and reflecting that in a widely viewable dashboard can be all the incentive that is needed to reduce tech debt buildup and improve response times. We wanted to combine this with the above “meet your engineers where they are” principle to create a low mental overhead dashboard for code quality proxy metrics. At Curai, we check in all of our code and report all code issues to Github. Given this workflow, we sought to build dashboards of relevant Github metrics to incentivize and recognize quality engineering behavior.

This led to the development of the following Github issue metrics dashboard inside Google Data Studio, another existing part of our stack.

Example view of our Github issues

At a glance, we provide engineers and product managers with the visibility to see bugs and their SLAs by team, as well as detailed statistics per repository. Below that, we show direct links to bugs that have exceeded their SLA, along with bug fixing metrics by user.

All of this information is available via the Github GraphQL API v4. In order to keep this dashboard automatically updated, we created a python script to call the API and dump data to Google BigQuery with load_table_from_dataframe.
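The script itself isn’t reproduced here, but as a rough sketch it looks something like the following. The repository names, the BigQuery table, and the exact field selection are illustrative rather than our production values.

    # Hypothetical sketch of the daily Github-to-BigQuery sync; names are illustrative.
    import os

    import pandas as pd
    import requests
    from google.cloud import bigquery

    GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

    # A deliberately small query; the real one pulls more fields (see the fragments below).
    QUERY = """
    query ($owner: String!, $name: String!) {
      repository(owner: $owner, name: $name) {
        issues(last: 100, states: [OPEN, CLOSED]) {
          nodes { number title state createdAt closedAt }
        }
      }
    }
    """

    def fetch_issues(owner: str, name: str) -> pd.DataFrame:
        """Call the Github GraphQL API v4 and flatten the issues into a DataFrame."""
        response = requests.post(
            GITHUB_GRAPHQL_URL,
            json={"query": QUERY, "variables": {"owner": owner, "name": name}},
            headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
            timeout=30,
        )
        response.raise_for_status()
        nodes = response.json()["data"]["repository"]["issues"]["nodes"]
        return pd.DataFrame(nodes)

    def upload_to_bigquery(df: pd.DataFrame) -> None:
        """Replace the previous snapshot in BigQuery with today's data."""
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
        client.load_table_from_dataframe(
            df, "our-project.engineering_metrics.github_issues", job_config=job_config
        ).result()  # block until the load job finishes

    if __name__ == "__main__":
        upload_to_bigquery(fetch_issues("our-org", "our-repo"))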

To run this automatically, we set up a Kubernetes workload which runs the python script via bash, plus a scheduler job which instantiates the script runner each day, effectively a cron job on a daily schedule. After this point, the data is refreshed in BigQuery every 24 hours and can be added as a data source into Data Studio for visualization.

This framework can be replicated for whatever metrics an engineering team finds useful to pull from git, which depends strongly on existing processes. At Curai, we use Github labels extensively to categorize issues into bugs or enhancements, as well as to assign priorities. We also wanted to pull bug close times so we could celebrate engineers who closed bugs as well as those who pushed features. This led us to create the following GraphQL query fragment:
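The original fragment isn’t reproduced here, but in spirit it is a reusable selection of issue fields along the lines below, embedded as a string in the sync script. The exact fields and label counts are illustrative.

    # Hypothetical issue fragment; our real field selection differs in detail.
    ISSUE_FRAGMENT = """
    fragment IssueFields on Issue {
      number
      title
      state
      createdAt
      closedAt                                # lets us compute bug close times
      labels(first: 10) { nodes { name } }    # bug / enhancement / priority labels
      assignees(first: 5) { nodes { login } } # who gets credit for the fix
    }
    """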

Github’s API is very flexible, and this is just a small slice of the data it can surface. Another aspect of engineering we wanted to promote was code review, which we roughly measured through both the number of reviews and the number of review comments provided. This led to the following visualization:

This shows a week-by-week tally of reviews and review comments by our team members. Here is an example of a fragment used for that query.

GraphQL query fragment for PR reviews and comments
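Again, the original fragment isn’t reproduced; a sketch with illustrative pagination limits and field choices looks roughly like this:

    # Hypothetical review fragment; limits and fields are illustrative.
    PR_REVIEW_FRAGMENT = """
    fragment ReviewFields on PullRequest {
      number
      reviews(first: 50) {
        nodes {
          author { login }
          state                     # APPROVED, CHANGES_REQUESTED, COMMENTED, ...
          submittedAt               # lets us bucket reviews by week
          comments { totalCount }   # review comments left as part of this review
        }
      }
    }
    """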

It turned out that simply visualizing some proxy metrics of code quality was instrumental in bringing our total number of bugs from 80 concurrently open down to around ten today.

4. Set up design review processes

As mentioned above, ad hoc architectural decisions were slowing down our team’s development. To improve knowledge sharing and general decision quality, we established an internal process for design reviews as follows:

  1. The implementing engineer writes up a document on their planned code design.
  2. The implementing engineer shares the document with the team members they feel are most relevant to the code at hand, and sets up a meeting with these teammates.
  3. Any engineer who is interested can attend the meeting by asking the implementing engineer.
  4. The design receives modification and sign-off at the design review meeting.
  5. The document is stored in an engineering-wide folder for later reference.

In this manner, the implementing engineer gets more perspectives before diving into development, and the history of architectural decisions becomes more visible and stable.

5. Elevate tech debt to first class work

To find the balance between pushing features out in a timely manner without degrading code quality, we had to treat tech debt as first class work, alongside critical bug fixes and feature work. This took place in a few ways.

  1. Project timelines and scoping would, as much as possible, factor in time for writing unit tests and automated end to end tests for their respective features.
  2. In the event that code had to be deployed without a full suite of tests, or with other technical debt, it could be merged alongside the creation of a tech debt Github issue.
  3. One day a week is set aside for engineers to tackle tech debt issues rather than focusing on feature development.

By giving engineers the ability to track tech debt and the space to make progress on it, many tech debt tasks that had sat on the backlog for months were fixed in weeks. A lot of the work here was about perception: tech debt never feels urgent until it’s too late to do anything about it, so it’s important to consider it in its broader context rather than only its direct impact in the moment. Two examples of high impact work that wouldn’t have made the urgency bar before: simplifying the local development environment to reduce the resource load of running the app on our machines, and adding automatic screenshots on end to end test failure, which sped up debugging formerly cryptic errors.
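As an illustration of that last item, a screenshot-on-failure hook takes only a few lines of test runner configuration. The sketch below uses pytest and assumes the end to end tests obtain their Selenium driver from a fixture named driver; both the runner choice and the fixture name are assumptions for illustration.

    # conftest.py (hypothetical sketch): assumes tests request a fixture named "driver".
    import os

    import pytest

    @pytest.hookimpl(hookwrapper=True)
    def pytest_runtest_makereport(item, call):
        outcome = yield
        report = outcome.get_result()
        # Only act on real test failures, not setup or teardown errors.
        if report.when == "call" and report.failed:
            driver = item.funcargs.get("driver")
            if driver is not None:
                os.makedirs("artifacts", exist_ok=True)
                driver.save_screenshot(os.path.join("artifacts", f"{item.name}.png"))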

Takeaways

Martin Fowler’s tech debt quadrant classifies debt along two axes: deliberate versus inadvertent, and reckless versus prudent.

In our experience, these quadrants are not sized equally. At the end of the day, most engineers have the capability and the desire to push clean code, and do so most of the time. It’s largely when deadlines start compressing and production hot-fixes are being written that tech debt and bugs get introduced, corresponding to the deliberate half of the quadrant.

In order to tackle this issue in a scoped and efficient manner, we reflected on the specific difficulties our engineering team faced when trying to adopt good code quality practices, and developed minimal automation and process layers on top of our existing development workflow. These encourage high quality development while still giving engineers the flexibility to apply creative solutions to the problems they face each day. There’s no one-size-fits-all approach here; if there’s one thing to take away from this, it’s that improving an engineering team’s code quality practices requires an honest reflection on what is preventing engineers from pushing quality code, and even more importantly, designing the most lightweight processes possible to work through those issues.
