Dealing with production issues while racing to expand features

brighterlink.io
4 min read · Aug 23, 2016

Our new Control Interface

We finished up “Tank7” and now we’re actively working on our current milestone, “80-acre”. The big push for this release is to clean up our interface and start onboarding our Solar lease customers. We’ve also spent a great deal of time refreshing our website and our marketing and sales collateral, and training the sales team to pitch BrighterLink with all of our energy deals. We’re pretty excited about this: as we demo our product and onboard more and more customers, we’ll be able to gather a whole slew of actionable data. However, as more people depend on the platform, pushing new features and releases gets more complicated. We ran into this in the middle of “Tank7”, and we’re continuing to deal with it as it becomes our new norm.

Stop and smell the Dashboards

We ran into a couple of fairly meaty production issues, the biggest one being missing data for our real-time monitors. It was particularly bad because this went on for days and our end users noticed it first. The cause was really a bit of complacency. We built our initial monitors on Runscope and Ghost Inspector and felt pretty good, but we didn’t keep improving them as we built up our feature set. On top of that, our UI unit test coverage was starting to fall below acceptable levels. Thus, we actually stopped feature development in the middle of Tank7 and dedicated the last two weeks to “production readiness”. A whole bunch of things had been neglected as we raced to build up our features, and we decided to clean them up:

  • Dashboards for our production services (more to come on this topic)
  • Improved Unit Testing Coverage
  • Performance improvements with the API
  • Using React Router
  • Proper DB segmentation and backup strategy
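To make the missing-data problem concrete, here is a minimal sketch of a data-freshness check, the kind of monitor that flags a real-time stream that has gone quiet. It assumes each stream records the timestamp of its most recent data point. The function name `find_stale_streams`, the stream names, and the five-minute threshold are all illustrative assumptions, not BrighterLink's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def find_stale_streams(last_seen, now, max_age=timedelta(minutes=5)):
    """Return the sorted names of streams whose newest data point is older than max_age."""
    return sorted(
        name for name, seen_at in last_seen.items()
        if now - seen_at > max_age
    )


# Hypothetical snapshot: the last data point received for each stream.
now = datetime(2016, 8, 23, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "inverter-a": now - timedelta(minutes=1),  # still reporting
    "inverter-b": now - timedelta(hours=3),    # silent for hours
}

stale = find_stale_streams(last_seen, now)
print(stale)
```

Running a check like this on a schedule and alerting whenever the returned list is non-empty would surface a gap within minutes rather than days.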

Renewed focus on Continuous Improvement

Once again, we got a little lackadaisical around the concept of continuous improvement. We had a really good rhythm of building up features and fixing up tests, and felt confident that we were making improvements along the way. However, as we ran into production issues and miscellaneous process issues, we weren’t doing as good a job of tracking and building up the things that needed to be fixed beyond our GitHub code base. As a reminder, we track our high-level milestones in Trello and then track individual GitHub issues in their respective workspaces. However, we were missing a middle layer to track issues that span workspaces, as well as operational and outage information. Thus, we created a dedicated “Continuous Improvement” board and started putting our outage and retro information there. We also improved our meeting schedule so we can go over these items together as a team at the start of each weekly Sprint.

Accounting for a Design Workflow

Another problem we ran into was that changing the look and feel and/or the navigation of our product would impact our users. We hired Amanda, our new product designer (yay), and started introducing new navigation and UIs into our system. However, we didn’t properly account for the existing user experience and didn’t do a good job of hiding unfinished components from our customers. This had us rethink our whole process and how we could make it better. This is an ongoing issue, but some things we have already started doing are:

  • Build up UI functional tests as non-admin users that will automatically run and test specific pages and features
  • Ensure new interfaces or major changes are “hidden” and only navigable by admins and only available when “turned-on”
  • Build a new Design/UI workflow (more to come on this)
  • Implement more design and front-end standards to avoid conflicts
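The "hidden until turned on, admins only" rule from the list above can be sketched as a small gating function, together with the kind of non-admin check a functional test would make. This is a minimal illustration under stated assumptions: the `User` type, the `visible_pages` helper, and the page and flag names are hypothetical and not taken from the BrighterLink code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    is_admin: bool


def visible_pages(user, pages, enabled_flags):
    """Return the pages this user may navigate to.

    `pages` maps a page name to the flag that guards it, or None if the
    page is fully released. A guarded page is shown only to admins, and
    only while its flag is turned on.
    """
    visible = []
    for page, flag in pages.items():
        if flag is None:
            visible.append(page)
        elif user.is_admin and flag in enabled_flags:
            visible.append(page)
    return visible


# Hypothetical pages: one released, one still being redesigned.
pages = {"dashboard": None, "new-navigation": "new_nav"}
admin = User("admin", is_admin=True)
customer = User("customer", is_admin=False)

print(visible_pages(customer, pages, {"new_nav"}))
print(visible_pages(admin, pages, {"new_nav"}))
```

A functional test running as the non-admin user would assert that the unfinished page never appears in its navigation, whatever the flag state.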

Final Thoughts

When you start a new product, there are a lot of things you can do and try, but as it matures and you gain traction and users, you start dealing with real production issues. That’s why it’s so important to gain customer adoption as soon as possible. Not only is it valuable from a use case and feature perspective, but it really solidifies the product and offering, and requires you to expand your processes to take those things into account. It’s a balancing act between new capabilities and fixing existing problems, but it’s important to do both, as it will help everyone (you and your customers) in the long run. By implementing just some minor changes, we’ve been able to really harden our system and continue to improve on it.

Note: we cover more detailed technical solutions on our engineering blog. There you will find things we did to solve some of these challenges, such as handling Erlang stats and integrating with Geckoboard.
