Framing Data.

Painting a clear picture with statistical analyses.

AJ Pain
Vizzuality Blog

--

“How much deforestation is normal, and what should we consider acceptable?”

It’s kind of a bad question, right? Or at least a vague one. Alone, it feels a bit like one of those “how long is a piece of string?” questions.

When we ask questions of this nature our prior knowledge has a huge impact on our interpretation of the answer we are given and the reaction it evokes in us. For example, what would your reaction be if I said that right now, in Brazil, ‘normal’ is a little over 10 thousand individual deforestation incidents a week?

It certainly feels like a lot, but without deeper context there really is no way of knowing since we have no way of framing this answer.

This is exactly the kind of situation we strive to avoid when we present or visualise data: we want to be sure that the answer we give can be interpreted without confusion or guesswork.

So how can we improve this?

One place to start is with the question itself. I used to teach for a living, and one of the things you notice early on is that the kids who learn the fastest are often the ones who ask intelligent questions, rich questions. Those students gain more understanding because the answer they get is specific and grounded in a familiar context. They understand the answer in relation to existing knowledge.

Let’s try to take a leaf from their book and do that.

Context is key.

How then do we give context to deforestation in this case? Well, deforestation has a lot of external factors acting on it — some human (the kind we want to stop) and some natural.

Setting the human aspect aside for a moment, the largest contributing factors to natural deforestation are forest fires and seasonal canopy change.

Natural fires tend to occur within hot, dry climates or in the hot summer months. And there is also the annual cycle of growth and recession of tree canopy cover as the intensity of light falling on the forest changes due to the Earth’s inclined axis.

Breathing Earth: throughout the year forests naturally cycle through periods of growth and loss as the intensity of sunlight falling on them changes with the Earth’s axial tilt relative to the Sun. This heavily influences the amount of recorded deforestation, even though it is totally natural!

When we ask how much deforestation is normal, we obviously want to factor these aspects into our consideration: we need to think about temporal context as well as constraining the location we are interested in.

So, instead of “How much deforestation is normal?” we should ask “How much human deforestation is normal in Brazil… during summer… compared to previous years?”

Much better. And now that we have a better question, we can answer it!

Answering the question.

When we set out to create the revamped country pages on Global Forest Watch and began drafting the visualisations, we tried to frame each one of them as a question a user might ask… just like the one above.

The data, or visualisation, would then be the answer to that question and would use location and time to provide context. We wanted to make sure that each answer didn’t skimp on the all-important framing that we have been discussing so far in this article. The user should be informed and aware of the whole picture, so that they are able to form an informed opinion on the state of forests around the world.

For me the best example of where we succeeded in achieving this vision is the GLAD alert widget, which answers the question about the amount of deforestation happening in an area of interest.

However, before we get into the details, let me give you a little background information on the data.

The GLAD deforestation alert layer is one of the most used and most viewed features of the Global Forest Watch map. It is an awesome example of how we can use raw satellite data (Landsat in this case) to measure instances of tree cover loss (i.e. deforestation events) down to a 30m resolution.

Users are able to analyse a selected area to determine the amount of deforestation within that shape.

Each day we get an update on that number and show the alerts on the map, in pink for historical events or in yellow for recent events from the last week.

As we discussed earlier, a raw count like this is a poorly framed answer and we will struggle to make sense of it: is this good or bad? What are we comparing this to? No matter how high the quality of data might be, users still have to interpret it, and still have to make sense of it.

Building the visualisation.

Using that analysis tool above we started building the mathematical machinery that would handle the data behind the scenes. In terms of context we opted to provide depth along two axes: temporal, and spatial.

The latter was easy, since the very nature of the Country Pages handles it: the data can simply be filtered by location. A user can select a very specific area of interest, from national level down to small regional areas (called admin-2 regions; counties in the UK/US, for example). This means that they can draw location-specific conclusions, even comparing to other countries or regions with a single click if they want.

However, the temporal aspect was a little less straightforward, and would require us to be able to statistically analyse the data dynamically (i.e. run the calculation on the fly) whenever a user selects an area of interest. In terms of available data, it turned out that we had four years of previous data in this case (and counting!) at our disposal, so it made sense to make comparisons at the same point in time, the week in this case, year on year. We also wanted to highlight the idea of ‘normal’ by painting a baseline for users to compare the levels of deforestation events to.

In terms of statistics this is not too hard to achieve. Essentially, the recipe is:

  1. Bucket the data into weeks.
    i.e. count the total number of alerts that happen in each week of the year.
    You now have 52 data points for each year of historical data. We could have bucketed by day but at that fine-grain level randomness makes it difficult to see meaningful trends above the noise.
  2. Do some stats.
    For the same week in each year, calculate the average and standard deviation.
  3. Rinse and repeat for all 52 weeks.
    You now have 52 data points, each with a mean and standard deviation attributed.
  4. Smooth over these values.
    You do this using a window function to get a sort of moving average. The aim here is to reduce the noise to a point where seasonality dominates.
    We chose to take the average standard deviation over the year for ease of use, but in future we may choose to apply smoothing to it too in order to highlight seasonal volatility (i.e. there may be more randomness in summer than in the winter months).
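
To make the recipe a bit more concrete, here is a minimal Python sketch of the four steps. It is an illustration only, not the code running behind the widget: the function names, the five-week window, the wrap-around at the year boundary, the folding of the last day or two of the year into the final week, and the use of the sample standard deviation are all assumptions on my part.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

WEEKS = 52

def weekly_counts(alert_dates):
    """Step 1: count alerts per (year, week) bucket, with weeks numbered 0..51."""
    counts = defaultdict(int)
    for d in alert_dates:
        # Assumption: the last day or two of the year are folded into week 51.
        week = min((d.timetuple().tm_yday - 1) // 7, WEEKS - 1)
        counts[(d.year, week)] += 1
    return counts

def weekly_stats(counts, years):
    """Steps 2 and 3: mean and standard deviation of each week across years."""
    means, stds = [], []
    for week in range(WEEKS):
        # Weeks with no alerts in a given year count as zero.
        values = [counts.get((year, week), 0) for year in years]
        means.append(mean(values))
        stds.append(stdev(values))  # sample standard deviation, needs >= 2 years
    return means, stds

def smooth(values, window=5):
    """Step 4: centred moving average that wraps around the year boundary."""
    half = window // 2
    n = len(values)
    return [
        mean([values[(i + k) % n] for k in range(-half, half + 1)])
        for i in range(n)
    ]

# Usage with made-up alert dates (not real GLAD data).
alerts = [date(2015, 1, 1)] * 3 + [date(2016, 1, 2)] * 5
means, stds = weekly_stats(weekly_counts(alerts), years=[2015, 2016])
baseline = smooth(means)  # the smoothed 'normal' line
sigma = mean(stds)        # one average standard deviation for the whole year
```

Wrapping the window around the year boundary treats the last week of December and the first week of January as neighbours, which seems a natural choice for seasonal data, but other edge treatments would work too.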

At this stage you have three sets of data. For each week in the last year you have:

  • the mean number of alerts over the last four years,
  • the standard deviation of alerts over the last four years,
  • the actual number for that week.

Plotting the mean as a smooth line against the raw alerts data makes it easy to identify whether the number of alerts is high or low compared to normal. And, to show with minimal effort just how far from normal a particular week’s data is, the standard deviation can be plotted to quantify divergence from the norm.
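
As a rough sketch of what "plotting the standard deviation" could mean in practice, the snippet below turns a smoothed mean line and a single standard deviation into the upper and lower edges of two bands. The helper name `band_edges`, the choice of 2 sigma for the outer band, and the clipping at zero (an alert count cannot be negative) are illustrative assumptions rather than a description of the real widget.

```python
def band_edges(mean_line, sigma, inner=1.0, outer=2.0):
    """Return the band boundaries around the mean line, clipped at zero."""
    def offset(k):
        # Shift every point of the mean line by k standard deviations.
        return [max(0.0, m + k * sigma) for m in mean_line]

    return {
        "outer_low": offset(-outer),
        "inner_low": offset(-inner),
        "inner_high": offset(inner),
        "outer_high": offset(outer),
    }

# Made-up numbers: two weeks with means of 10 and 1 alerts, sigma of 2.
bands = band_edges([10.0, 1.0], sigma=2.0)
```

Each pair of lists can then be handed to whatever charting library you like as a filled area, with the raw weekly counts drawn on top.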

It’s starting to come together, and we have all the tools we need. The final thing to do is to actually plot that data.

Seeing is believing.

And this is what we get. Beautiful, isn’t it? The visualisation is clean and communicates deforestation data without confusion.

Two-fold beauty. Richer information about deforestation than ever before, at a glance!

A location-specific answer, normalised temporally by comparing to historical data; clearly illustrating seasonal effects. We even threw in a dynamically generated sentence to boost readability if graphs aren’t your thing!

“In the last week of August 2017 there were an unusually high number of deforestation incidents compared to recent years.”

So what are we looking at? (My favourite thing about this visualisation is that you probably don’t need an explanation!)

Here the pink jagged line represents the number of deforestation incidents detected by satellite imagery in a given week of the last 12 months, which means the higher the line vertically, the more deforestation events were recorded. Where the magic lies, however, is in the light- and dark-grey bands meandering through the graph, which use the calculated mean and standard deviation values.

That is your context, and the part of the graph you should pay attention to.

To put it technically, they represent statistical standard deviation bounds on how far from normal the pink line is. If you find the pink line within the light-grey inner bounds, then that’s within ‘normal’ bounds (±1 sigma) for the number of deforestation alerts in that week, for that location. If it’s in the dark-grey outer bounds then that can be considered to be unusual; either higher (above) or lower (below) than normal. And beyond those bounds you are into the extremes; either way more deforestation (bad!) or way less deforestation (good!) than normal for that time of year.
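
The band logic, and the kind of sentence we generate from it, can be sketched in a few lines of Python. Treat this as a hypothetical illustration: the function names and the sentence template are made up, and the outer band is assumed to end at ±2 sigma, since only the ±1 sigma inner band is pinned down above.

```python
def classify(actual, expected, sigma, inner=1.0, outer=2.0):
    """Label a week's alert count by how many sigmas it sits from the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (actual - expected) / sigma  # distance from normal, in standard deviations
    direction = "high" if z > 0 else "low"
    if abs(z) <= inner:
        return "normal"                   # inside the light-grey band
    if abs(z) <= outer:
        return f"unusually {direction}"   # inside the dark-grey band (assumed 2 sigma)
    return f"extremely {direction}"       # beyond both bands

def describe(period, label):
    """Build a readable sentence from the label (hypothetical wording)."""
    if label == "normal":
        return (f"In {period} the number of deforestation incidents "
                "was normal compared to recent years.")
    return (f"In {period} there was an {label} number of "
            "deforestation incidents compared to recent years.")

# Made-up numbers: 120 alerts against a mean of 100 and a sigma of 15.
label = classify(120, expected=100, sigma=15)
sentence = describe("the last week of August 2017", label)
```

Because the label depends only on the distance from the seasonal mean, the same count can be 'normal' in one month and 'unusually high' in another, which is exactly the point of the grey bands.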

The really cool thing here is that you can easily see the seasonality at play. Whereas before a user might have come to the conclusion that April was a good month for Brazil, now they would see that this is actually about par for the course.

It displays the seasonality we have been speaking about — so there’s your temporal context.

What we wanted to do was normalise for these factors. We wanted users to recognise cases like the one above (Brazil, April) and say things like “Yes, the number of deforestation events is really low right now… but it usually is at this time of year!”

We can now separate natural and man-made causes of deforestation, and that’s powerful!

Indonesia displaying drastically different seasonal behaviour.

Compare this to Indonesia, a country that lies mostly along the equator compared to the latitude-straddling Brazil, and you can clearly see the stark difference in seasonality. Remember, we said we wanted to calculate this dynamically? So when you switch countries we are quickly re-generating the graph from that country’s raw data and normalising against the local seasonality.

The Future.

We now have a powerful tool that can dynamically give us a statistical analysis of deforestation, allowing users to truly understand the data given to them, but where do we go next?

In the future we want to expand this service to other alerts we have access to. Our next step is to do the same with our fires widget so that we can gain a better understanding of what a ‘normal’ level of fires is for a given area and time. This is particularly important moving forward as it will help us better distinguish between natural fires and the slash-and-burn deforestation used to clear forests for logging roads and farmland.

Once we are at that stage we can start to correlate and normalise for deforestation caused by naturally occurring fires to more accurately measure human impact on our forests.

Even better; each year we gather more and more data, sharpening our tools against deforestation and making the complex understandable with the power of context. Go wield them.

AJ is a Data Scientist and back-end developer at Vizzuality. He loves curating data that’s important for the world and seeks out newer, better ways to visualise it.
