Using Natural Language Processing Techniques to Identify iOS App Eviction
Nerdiness: 5/5
Topics: iOS, Mobile Analytics, SQL, NLP, n-grams
Apps running in the background on iOS are subject to a set of rules, defined by Apple. In some circumstances, the operating system will force a backgrounded app to be terminated, usually to free up resources for an active app. The specific details of when this happens are not published by Apple. They simply advise developers to be aware their apps may be evicted in this way. In this scenario, the app lifecycle methods are not invoked, so there is no convenient callback we can use to log the eviction. We may receive a memory warning callback, but even that is not guaranteed. Instead, we must rely on our data warehouse and logging infrastructure to provide critical information regarding the state of the app in the wild.
Prerequisites
Before we can discuss the approach in depth, we need a clear picture of the tools required to achieve our goal.
Distributed Logging
This approach is predicated on existing logging infrastructure. We use Amazon Redshift to aggregate logging data. Our logger is built around unique strings representing significant events within the app. We use multiple logging levels to categorize logs, but for the purposes of this piece, we’re focused on the “info” log level, which has the following method signature:
protocol Logger {
    static func info(_ name: String)
}
Each time this method is invoked, we effectively add a new row to a Redshift table. This table is queryable via SQL, which is important because we’ll be using some advanced SQL techniques later.
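To make the SQL later in this piece concrete, here is a minimal sketch of the kind of table those rows could land in. The table and column names are assumptions chosen to mirror the queries that follow; your warehouse schema will almost certainly differ:
-- Hypothetical shape of the Redshift logging table; names are assumptions.
-- "user" is quoted because user is a reserved word in Redshift.
create table logs (
    "user"          varchar(64),   -- identifier for the user emitting the log
    event_type      varchar(256),  -- the unique string passed to Logger.info, e.g. 'app.launched'
    event_timestamp timestamp      -- when the event was recorded
)
Each call to Logger.info then corresponds to a single row in this table, tagged with the user and a timestamp.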
App Lifecycle Logging
At an absolute minimum, we need to track the lifecycle of our app in the wild. To do this, we annotate our AppDelegate with launch and termination logs, like this:
func applicationDidFinishLaunching(…) {
    Logger.info("app.launched")
}

func applicationWillTerminate(…) {
    Logger.info("app.terminated")
}
Of course, other areas of your app will need to have similar logging for this approach to be useful. As long as you include logging for your critical systems, you can probably use this technique.
Methodology
Zooming out a bit, app logging manifests as a collection of streams, one for each user. We can take advantage of this naturally sequential phenomenon to study the flow of execution. If we include sufficient logging in the various components of our system, we can highlight sub-sequences where the normal flow is truncated. These represent moments within a user flow where the user encountered a crash or eviction. We can use SQL tools to identify and characterize these scenarios. Consider the following sequence:
app.launched
app.fetch_started
app.fetch_completed
ui.refresh
app.backgrounded
app.background_update_started
app.background_update_completed
app.foregrounded
ui.refresh
app.terminated
This simple example shows a sequence we might expect to repeat over time as the user interacts with the app. As long as the user remains active, this sequence, or one similar to it, will appear in the data warehouse when querying for the given user and ordering by timestamp. If the stream is interrupted by a crash or eviction, the first message in the sequence, in this case app.launched, will abruptly appear immediately preceded by a message other than app.terminated. The following example shows a truncated sequence:
app.background_update_started
app.launched
In this instance, the last log we see before app.launched indicates the app was processing a background update when it was unexpectedly killed. We can learn quite a lot from analyzing sequences of logs in context with one another. These anomalies help guide diagnostics, avoiding costly exploratory work or full logic audits by prioritizing the highest-value leads first.
Natural Language Processing with N-Grams
There is a rich set of tools available for analyzing sequences. Here, we will employ a powerful tool from natural language processing (NLP) to organize the logging data in our warehouse. Log data is a time-ordered sequence of “words,” represented by individual log messages. In NLP parlance, a contiguous run of n words is known as an n-gram. For example, a trigram is a sequence of three words in a row, like “take the bus.” If we map this technique to our logging data, we will find many unique n-grams. Filtering these down to those ending with the app.launched log, we can characterize the scenarios leading to early termination. The analysis results in a report showing the most common eviction n-grams.
For our purposes, we will focus on trigrams, though the general outline extends to sequences of any length. Let’s dig into the SQL query details.
SQL Lead/Lag
One very powerful pair of SQL window functions makes this problem straightforward to solve. An ordinary query evaluates each row in isolation against its conditions; it tells you nothing about a row’s context relative to its neighbors. With LEAD/LAG, we can pull values from adjacent rows into the result for the current row. Here’s an example of a query that counts the rows matching our first log:
select
    event_type,
    count(*)
from logs
where event_type = 'app.launched'
group by event_type
This query counts every instance in our data warehouse with the matching event type, which by itself tells us nothing useful. However, adding just a little extra to this query adds substantial value. Here’s an expanded example:
select event_type, previous_type
from (
    select
        event_type,
        lag(event_type)
            over (partition by user order by event_timestamp)
            as previous_type
    from logs
) events
where event_type = 'app.launched'
This will result in a list of every app.launched event paired with the log immediately preceding it. Notice that the launch filter lives in the outer query: a WHERE clause is applied before window functions are evaluated, so filtering inside the inner query would make LAG look back only across other app.launched rows. It’s also important to note the options in the OVER clause. These determine the constraints of the LEAD/LAG function. If we do not include the user partitioning, the results will be nonsense, as they will mix records from different users. By partitioning, we’re instructing the database to relate adjacent records only when they share a common user. Now we have a tool to find all the instances of termination, but we’re not quite done. We need one more thing to support trigrams.
The LEAD/LAG functions are not limited to the immediately adjacent row. They also support an offset parameter, which specifies how far from the reference row to look. This allows us to construct a simple query to assemble our trigrams.
select e0, e1, e2
from (
    select
        lag(event_type, 2)
            over (partition by user order by event_timestamp)
            as e0,
        lag(event_type, 1)
            over (partition by user order by event_timestamp)
            as e1,
        event_type as e2
    from logs
) trigrams
where e2 = 'app.launched'
Now we have a result set with e0, e1, and e2 columns, where each row holds the three events leading up to and including an app launch, in chronological order. This collection includes the normal termination scenarios as well, since they also culminate in an app launch. We need to wrap this in a second query to filter these out. This looks something like the following:
select
    e0, e1
from (…)
where e1 != 'app.terminated'
Finally, we modify this to include a count, so we can order our results. Observant readers will note we’ve removed e2 from the result set, as it will always be app.launched. This simplifies the query and keeps the GROUP BY aggregation meaningful, resulting in the following:
select
    e0, e1, count(*)
from (…)
where e1 != 'app.terminated'
group by e0, e1
order by count(*) desc
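For reference, here is how the full query might look with the elided trigram subquery written out, assuming the hypothetical logs schema sketched earlier (user is quoted here because it is a reserved word in Redshift):
-- Assembled eviction report: for every app launch not preceded by a clean
-- termination, the two events that came immediately before it, ranked by frequency.
select
    e0, e1, count(*)
from (
    select
        lag(event_type, 2)
            over (partition by "user" order by event_timestamp)
            as e0,
        lag(event_type, 1)
            over (partition by "user" order by event_timestamp)
            as e1,
        event_type as e2
    from logs
) trigrams
where e2 = 'app.launched'
  and e1 != 'app.terminated'
group by e0, e1
order by count(*) desc
Extending this to 4-grams or beyond is simply a matter of adding further lag offsets (lag(event_type, 3) and so on) and including them in the select, filter, and grouping.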
Now we have a result set showing each eviction trigram and its corresponding frequency. The resulting table is immediately actionable, with the top offenders ranked by how often they occur. The results might look something like this:
e0                e1                             count
app.backgrounded  app.background_update_started  987
app.backgrounded  ui.refresh                     658
app.foregrounded  app.backgrounded               276
In this case, the chief offender is a stalled background update: a process starts but never finishes before the system kills the app. Next appears to be the dreaded UI refresh after backgrounding, causing the app to use too many resources on the main queue while in a background state.
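If you also want each trigram’s share of the overall eviction volume, a small extension of the final query can report it. This is a sketch built on the same elided trigram subquery as above; the percentage column is computed with a standard window function over the aggregated counts and is not part of the original report:
-- Each eviction trigram with its frequency and its share of all evictions.
-- (…) stands for the same trigram subquery used in the final query above.
select
    e0,
    e1,
    count(*) as frequency,
    round(100.0 * count(*) / sum(count(*)) over (), 1) as pct_of_evictions
from (…)
where e1 != 'app.terminated'
group by e0, e1
order by frequency desc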
Accelerated Diagnostics
Using these techniques, you can begin to explore some of the undiscovered sharp edges of your user experience. This targeted approach reduces the total time engineers spend solving hard problems. It also shines a spotlight on a traditionally dark and devious corner of multithreaded systems: deadlocks. When your app deadlocks in the wild, the operating system will most likely evict it quietly in the background. The user may never notice, especially if your app refreshes its state on becoming active. Thankfully, that means no users report problems; tragically, it also means no users report problems, so the bugs go undiagnosed. By applying some creative multidisciplinary thinking with simple tools, we can achieve a powerful result: less time spent resolving challenging bugs.