Note: this article is a re-post originally from the Equinox tech blog. If you’re interested in learning React Native or emerging mobile tech and are located in NYC, my team is hiring at Equinox! DM me at twitter.com/seenickcode.
Time and time again, since my first jump into the mobile world five years ago, I've seen the same patterns in engineering teams. After a long stint writing web apps, backends and APIs in Java, .NET, Ruby on Rails and the like, some for multi-million dollar projects, some for startups, I've seen a lot. Then the iPhone came out and I jumped to iOS development like the good Apple fanboy I was back then, and I absolutely did not regret it. But after five years working with mobile teams of all sizes, one starts to notice patterns. I felt like a software chef attempting the same recipe over and over again.
Here’s how it goes when you join a new team: “we’re mobile first, take some time to try out our app and let us know what you think”. As the new mobile engineer, you give a humble yet constructive opinion on why this widget behaves the way it does, why that screen is presented to the user in this fashion, why this box is one color and not another.
Stakeholders, product managers and designers love this feedback, because it seems like the more polish, the better the user experience, right? Then you sift through the app reviews: not bad, 4 out of 5 stars, that’s great, everyone should be proud of that.
Yet for all those 1-star reviews like “screen takes forever to load” and “can’t register”, time and time again I see that a large portion are caused by common-sense steps that engineering teams aren’t taking.
Unless the app in question is backed by a major company that can absorb the damage, these are serious issues, because they affect brand reputation and, most importantly, profitability.
Where do these problems really come from, though? From what I’ve seen, these 1-star reviews come from a lack of quality control and process at the seam between mobile codebases and backend API performance. Far more 1-star reviews can be avoided than one might think; it just takes investment in collaboration between the mobile and backend teams. It’s like anything else: if teams don’t make a proper investment of time, focus and collaboration, problems can snowball out of control.
Now on to what teams can do about this. My advice is part philosophy and part tactics.
The philosophy part is this: mobile engineering teams not only have to own what they release to production, they have to demand that the systems they rely on (backend APIs in this case) are up to proper standards. Moreover, they have to work closely with the teams they depend on to jointly come up with solutions. It’s the difference between saying “yeah, that 3-second response time is a backend issue and we have enough to worry about” and “why don’t we sit down with the backend team and tackle this now, before it spirals out of control”.
In short, this is about taking responsibility for the *end to end* product. You know that saying “if it touches production, it is production”? If the mobile product touches a system, that system is part of the mobile product as well.
Now on to the tactical advice, actions that can be taken to more easily identify, mitigate, monitor and ultimately reduce issues that can cause the 1-star app reviews I’m referring to.
If it’s live, it needs to be monitored…properly. Properly doesn’t mean installing product X, turning on its top three features, adding a dinky little alert, then sitting back. Monitoring is a huge topic and team members need proper training on it. Try James Turnbull’s “The Art of Monitoring” for starters. Read up on this stuff.
Document performance standards that the mobile team doesn’t merely request but *demands*.
Identify the critical features of the app.
– What external systems do these features rely on?
Typically they’re API endpoints. Define these somewhere, because when it comes time to manage performance-monitoring tools, a critical endpoint can easily be overlooked.
– What are the top 5–10 most mission-critical endpoints?
These must be monitored closely and individually, not as a whole. For example, New Relic provides an “Apdex score”, yet on its own this is a poor indicator of system performance for mobile, because it’s an aggregate value that covers other consumers as well (e.g. a web app).
– Have these endpoints been load tested and longevity tested?
Can they handle 10x the traffic? Can they handle 2x the traffic for 24 hours straight? Are the right tools being used to answer these questions? I’m typically a fan of good ol’ SoapUI: I run my tests locally, then upload the project export to something like BlazeMeter, a service that can run the load test in a distributed way.
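For teams that want to start even smaller than SoapUI or BlazeMeter, a rough load test is just concurrent requests plus latency bookkeeping. A minimal Python sketch, using a stand-in function where a real HTTP call against a staging endpoint would go:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(request_fn, total_requests=100, concurrency=10):
    """Fire `total_requests` calls with `concurrency` workers and
    collect per-request latencies in seconds."""
    def timed_call(_):
        start = time.perf_counter()
        request_fn()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))

# Stand-in for a real request (e.g. an HTTP GET to a staging endpoint).
def fake_request():
    time.sleep(0.01)

latencies = load_test(fake_request, total_requests=50, concurrency=5)
print(f"requests={len(latencies)} "
      f"avg={sum(latencies) / len(latencies) * 1000:.1f}ms "
      f"max={max(latencies) * 1000:.1f}ms")
```

This is nowhere near a distributed load test, but it answers the first question cheaply: what happens to latency when concurrency doubles?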
– Is my system resilient to downtime or times of poor performance?
If a cluster node goes down, does the performance spiral out of control? If certain endpoints completely fail, does the app become completely useless? Are simulations like this scheduled in a project and done regularly? I recommend the book “Release It!: Design and Deploy Production-Ready Software” by Michael Nygard.
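One common pattern for surviving a failing endpoint, discussed at length in Nygard’s book, is a circuit breaker: after a few consecutive failures, stop calling the backend for a while and serve a fallback (cached data, an empty state) instead. A minimal Python sketch with hypothetical thresholds:

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive errors, short-circuit calls for
    `reset_after` seconds instead of hammering a struggling backend."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # circuit open: degrade gracefully
            self.opened_at = None      # half-open: probe the backend again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0          # a success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

def flaky_endpoint():
    raise TimeoutError("backend is down")

breaker = CircuitBreaker(max_failures=3)
# After three failures the breaker opens and serves the fallback
# without touching the backend at all.
for _ in range(5):
    result = breaker.call(flaky_endpoint, fallback=lambda: "cached response")
print(result)
```

The app stays usable, serving stale data, while the backend recovers instead of being hit by a retry storm.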
– If applicable, *why* are certain endpoints extremely slow?
Typically the bottleneck is the database. If so, are team members performing basic query analysis to see what can be improved? What about speaking with the consumer of the data at the other end to see what data is *really* needed? Maybe the payload an endpoint returns needs to be broken up, or is needlessly overblown? I can’t tell you how often this happens. Too little communication between consumers and backend teams is usually the cause.
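Trimming an overblown payload often starts with letting the consumer say which fields it actually needs. A toy Python sketch, where the profile payload and field names are made up for illustration:

```python
# Hypothetical overblown profile payload; on this screen the mobile
# client only renders a name and an avatar.
full_profile = {
    "id": 42,
    "name": "Alex",
    "avatar_url": "https://example.com/a.png",
    "bio": "a long biography string",
    "purchase_history": [{"order": n} for n in range(500)],  # heavy, unused
}

def project(payload, fields):
    """Return only the fields the consumer actually asked for."""
    return {key: payload[key] for key in fields if key in payload}

slim = project(full_profile, ["id", "name", "avatar_url"])
print(sorted(slim))
```

Field selection like this (or splitting one heavy endpoint into several lighter ones) is exactly the kind of fix that only surfaces when the mobile and backend teams compare notes on what the screen really renders.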
– Is my staging environment pristine?
If you can’t trust a staging environment, you aren’t mitigating issues in production. This means that the physical system, specs, cluster setup, configuration and, most importantly, data should be as production-like as possible.
– Can my environment be re-created easily?
Tools like Terraform, Ansible and Docker make this easier than ever. This goes for everything from setting up the system on a development machine to spinning up a whole new environment in the cloud. With a larger system this may not be realistic, but *key parts of the system* can at least be covered. Say a mobile engineer wants to play around with a load test or investigate an endpoint closely. Why not give them the tools to set up their own system and break whatever they want? Reproducible, pristine environments are critical to ensuring that deployments and production code run smoothly.
– What does my Site Reliability Engineering (SRE) team look like?
Don’t have one? Then this is your backend team. Mobile engineers rely on these people more than ever, and if a support team with proper processes isn’t in place, the mobile team needs to demand one. The team should have enough members who know the ins and outs of the backend system, and the right playbooks documented for addressing issues.
– How are issues identified, labeled and reported? Is everyone trained in this?
This goes beyond a Slack message saying “hey, somethin’s up with x”. It means that, first off, issues are identified and labeled correctly. A “critical issue with the app” coming from a CEO may really be a minor issue with a non-critical feature; identifying issues cannot be subjective. Therefore, a clearly defined list of what means what should exist: “Critical: a mission-critical feature such as payments is unavailable”, “Major: a highly visible feature isn’t working 100% or behaves intermittently, or a non-mission-critical feature is unavailable”, and so on.
Then, are these issues reported in an efficient way? Posting something manually on Slack doesn’t count, because people may not see the issue immediately, which leads to a very inefficient way of dealing with problems. Ideally, a system should have the right monitoring and alerting in place to report things before a real person even notices. But if someone does notice, they should file a ticket which, based on severity, alerts the right SRE or backend engineer automatically and reliably (i.e. not via Slack): a pager message, SMS, or automated phone call (or all three).
Fallback channels need to be set up as well. What if someone doesn’t respond to an SMS within 10 minutes? Do they get a phone call or a page?
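The escalation-with-fallback idea can be sketched in a few lines: walk an ordered list of channels, wait for an acknowledgment at each step, and move on if none arrives. Everything here (channel names, wait times) is a hypothetical policy, and the notify/acknowledge hooks stand in for whatever paging service you actually use:

```python
import time

# Hypothetical policy: (channel, seconds to wait for an ack before escalating).
ESCALATION_POLICY = [
    ("sms", 10 * 60),
    ("phone_call", 5 * 60),
    ("pager", 5 * 60),
]

def escalate(notify, acknowledged, wait=time.sleep):
    """Walk the fallback channels until someone acknowledges the alert.
    `notify(channel)` sends it; `acknowledged()` checks for an ack;
    `wait` is injectable so tests don't actually sleep."""
    for channel, ack_window in ESCALATION_POLICY:
        notify(channel)
        wait(ack_window)
        if acknowledged():
            return channel
    return None  # nobody answered: time to wake up the whole team

# Simulated run: the on-call misses the SMS but answers the phone call.
sent = []
acks = iter([False, True])
answered_via = escalate(sent.append, lambda: next(acks), wait=lambda _: None)
print(answered_via, sent)
```

Managed tools implement this policy for you; the value of writing it down, even as pseudocode like this, is that the team has agreed on the channels and the windows before an incident, not during one.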
– Who responds to issues during off-hours?
Have a schedule defined for your SRE or backend team to ensure that trained engineers are always on call and close to a computer. During holidays or expected high-traffic periods, there’ll be no need to *manually* ask people to be available; the system will already be in place.
– Do I manage a set of playbooks for addressing production issues in a consistent way?
You will probably have engineers (ideally SREs) with varying skill levels. If a production issue occurs, is there a playbook defining *exactly what steps* should be taken? Airline pilots use checklists extensively for a reason: during times of stress, people make mistakes and don’t think clearly. If engineers have a documented way to address common issues, tackling issues live becomes *much* less stressful and much more efficient at that.
– Is the right logging in place across all systems? Are they structured? Can they be easily searched? Do they have sufficient context information?
Logs tell a story, and teams should be able to understand those stories very easily at every level, from high-level overview to fine detail. Logs should be collected at all levels, device and server. Mobile logs should be reported remotely, and if a crash occurs, those logs should be accessible.
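A cheap way to get searchable, context-rich logs is to emit one JSON object per line, with the same request ID stamped on every entry. A minimal Python sketch; the field names are illustrative, not a prescribed schema:

```python
import json
import sys
import uuid

def log_event(level, message, **context):
    """Emit one JSON object per line. Structured logs can be searched
    by field instead of grepped as free-form strings."""
    record = {"level": level, "message": message, **context}
    print(json.dumps(record), file=sys.stdout)
    return record

request_id = str(uuid.uuid4())  # the same ID travels with every log line
log_event("info", "checkout started", request_id=request_id,
          endpoint="/api/v1/checkout", device="ios", app_state="foreground")
err = log_event("error", "upstream timeout", request_id=request_id,
                endpoint="/api/v1/checkout", upstream="payments",
                latency_ms=5021)
```

Shipped into any log aggregator, a query on that one `request_id` field then pulls back the full story of a single user’s request.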
– Are changes to production automated? Are deployments automated?
Issues occur when team members make *manual* changes to production. Even if those changes are documented, humans aren’t perfect, and changes that aren’t automated cannot be tested and applied consistently, so they eventually cause serious production problems. Tools like Ansible (my favorite), Puppet or Chef are typically used to tackle this.
– What are our uptime goals? Are they part of the team’s KPIs?
A basic uptime goal for critical services should be defined and people need to be held accountable if those aren’t met — period.
– What kind of downtime allowance do we have?
Downtime allowance is the amount of time per year that is “ok” to have some downtime. For example, for certain system upgrades, migrations or major deployments, maybe some downtime is needed. If your overall uptime is good enough that at the end of the year, your allowance is high, you can use that “budget” of downtime allowance to run slightly riskier deployments for example. But if your allowance is almost gone, maybe you want to be super conservative and spend more effort on ensuring deployments for the rest of the year are done without downtime at all.
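The downtime “budget” falls straight out of the uptime goal; here is a quick Python sketch of the arithmetic:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(uptime_slo):
    """Yearly downtime allowance implied by an uptime SLO (e.g. 0.999)."""
    return MINUTES_PER_YEAR * (1 - uptime_slo)

# Each extra "nine" shrinks the allowance by a factor of ten.
for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} uptime allows "
          f"~{downtime_budget_minutes(slo):.0f} min/year of downtime")
```

Whatever hasn’t been spent on incidents by year’s end is the budget available for risky deployments or migrations.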
– Are critical errors in production being logged and is alerting in place? Are all critical errors being addressed immediately?
Every exception, and even every 500 response from an API, needs to be logged and addressed as soon as possible, because issues like these can snowball out of control; during high-traffic times the problem is compounded and can even have a cascading effect on other parts of the system.
Stating that “it’s a known issue” is unacceptable. “It’s a known issue and a hotfix is in progress” is the best response. Addressing 100% of critical production errors takes absolute priority over letting these things linger, even if they are infrequent or not easily reproducible.
– Are issues easy to track down and reproduce, end to end?
Does the mobile app properly send metadata in each request header, such as the device used, session info and the current state of the app? Is a context identifier carried across the full stack of a request, so that issues can be tracked down easily? For example, the mobile device makes request x with a unique context ID, the API logs that same ID, and so does every downstream service, so a single search tells the request’s whole story.
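The context-identifier idea boils down to a tiny helper that attaches a correlation ID to outgoing headers, reusing one if it is already present so every hop logs the same ID. The header name below is a hypothetical choice; the W3C `traceparent` header is the standardized equivalent:

```python
import uuid

CORRELATION_HEADER = "X-Request-ID"  # hypothetical header name

def with_correlation_id(headers=None):
    """Attach a correlation ID to outgoing headers, reusing an existing
    one so the same ID follows the request through every hop."""
    headers = dict(headers or {})
    headers.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    return headers

mobile_request = with_correlation_id()                # the app starts the chain
api_to_service = with_correlation_id(mobile_request)  # the API forwards the ID
print(mobile_request[CORRELATION_HEADER] == api_to_service[CORRELATION_HEADER])
```

With every service logging that header, “reproduce the bug end to end” becomes “search for one ID”.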
– Most importantly, is 20–30% of a team’s time dedicated to system improvement and maintenance?
Great teams invest in a great product rather than just cranking away on new features week to week. 20–30% is the right amount to invest in reliability, performance and resiliency, nothing less.
As you can see, there are a lot of topics here, and I’ve only outlined some very cursory, high-level questions with high-level suggestions. This list is just a sample of what mobile and backend teams should be jointly tackling; each item is worth its own blog post, and dozens of books exist on each of these topics.
I hope these questions can start a healthy conversation for engineering teams, mobile and backend alike so that brand reputation, profitability, developer happiness and most importantly, a great experience for the end user can be at its best.