Musings on #FloodHack16
Flood prediction using neural networks and machine learning
I recently attended an event at the ODI Node in Leeds, looking at the issue of flooding and how we might be able to improve aspects of how we predict, plan for, respond to and recover from flooding. Collectively there were several useful outputs from this, ranging from better communication tools for flood wardens and community groups on the ground to high-level overview systems to allow real-time response to unfolding situations.
I approached the FloodHack with no particular goal in mind or area of interest; I love all things data and hacky, and this was no different. Initially I ended up with a group looking at whether it was possible to show potential future flood risk maps accounting for the Government’s recently published climate change allowances for flood risk assessments. This turned out to be a horrific rabbit hole, trying to understand exactly how flood risk maps are generated in the first place, which led to some conclusions that rapidly pushed the idea of accounting for climate change out of the scope of a weekend hack:
- The flood risk maps are generated using a complex computer model by the Environment Agency, then fine-tuned by hand.
- None of the data on individual property flood risk is publicly available since it’s considered sensitive (it may dramatically affect property values).
- The flood risk maps available to the general public reflect 4 of the Environment Agency’s 7 internal flood risk levels.
- The flood risk maps basically reflect one variable: “how likely is this particular 50m x 50m square to flood in the next n years”.
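That single variable is worth unpacking, because "1 in n years" bands are often misread. A sketch of the underlying arithmetic (the band itself is just an annual probability; the thresholds and numbers below are illustrative, not the Environment Agency's actual definitions):

```python
# Sketch: relating an annual flood probability to the chance of at
# least one flood over an n-year horizon. The figures here are
# illustrative, not the Environment Agency's actual band definitions.

def prob_flood_within(annual_prob: float, years: int) -> float:
    """P(at least one flood in `years` years) given an annual probability."""
    return 1 - (1 - annual_prob) ** years

# A "1 in 100 years" square still has roughly a 1-in-4 chance of
# flooding at some point during a 30-year mortgage:
print(round(prob_flood_within(0.01, 30), 2))  # 0.26
```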
These revelations caused me to spin out my first “this should happen” of the weekend:
The Environment Agency should make publicly available the models they use to derive the flood risk maps.
Given it was now apparent that the original idea would require far more resource than we had available, as well as involving a range of data which we didn’t have or which didn’t exist, I began to wonder about exactly how flood warnings were generated, and which data that relied upon.
After a chat to people who knew about this sort of thing, some more interesting facts became apparent:
- The Environment Agency have a series of flood predictors which are distinct from the risk model, which are run at regular intervals to answer the “how likely is something to flood in the next n hours/days” based on the current weather data.
- Several predictions are run, and the ‘most likely’ based on the experiences of the meteorologists and flood warning teams is chosen. The flood warnings are then issued accordingly.
Of these two, the second is by far the more interesting to me. Flood warnings are effectively issued by somebody in a room going “I think that this area will/will not flood”, based on a combination of hard maths, experience and gut instinct. Since I have a background in computer science, and more specifically machine learning, this made me interested in what went on behind the scenes.
“So, how are the computer models trained to figure out what will flood?” I asked. In true BuzzFeed style, What I Learned Will Shock You.
They aren’t; well, at least not in the true sense of machine learning. After a flood, the people who look after the flood prediction models will go back and make some updates until re-running the simulation results in what actually happened, in theory making the model more accurate in future.
Being the kind of person who enjoys challenging the status quo, my immediate response was “that’s nonsense, you should be using a neural network to produce a real-time flood risk predictor, then feeding the actual outcomes back into the model to train it”.
I set about creating a proof of concept, to see if it was even remotely possible to take a set of inputs which seem flood-related (rain, temperature, river levels, groundwater), train a model on them using data about what actually happened (ie “did this bit of ground actually flood?”) and then generate a reasonable prediction. As expected, the plan didn’t survive first contact with implementation, for the following reasons:
- The available data on rainfall and temperature is very sparse; weather stations are surprisingly far apart.
- The available data on rainfall and temperature is very vague; you can’t get higher resolution than monthly historical averages.
- The available data on river levels is surprisingly good, if you’re interested in the last year or so. Beyond that, no dice.
- River level meters are also few and far between.
- Groundwater data is even more sparse.
- Historical information on what actually flooded is for most intents and purposes completely and utterly useless. The data is only available for ‘big floods’, and even then there’s no discrimination between “it was up to my ankles” and “it was up to my nipples”.
These problems led me to my second “this should happen” of the weekend:
We need more data.
This one genuinely surprised me, and seemed to surprise lots of other people at FloodHack who were excited about the amount of data which was available. I think it would be worth future hacks better understanding the differences between:
- Data quality (the data is accurate and reliable)
- Data usefulness (the data is well described and understood)
- Depth of data (how far back it goes)
- Spatial/temporal resolution of data (how many data points per unit distance/time there are)
For the available data, it was high quality and useful, but lacked sufficient depth and resolution for training purposes.
Fortunately, of the groups at FloodHack two stood out as being able to help solve this problem in the future. First up, Flood Network are busy telling people to put river level sensors everywhere (or more technically water level sensors, since they’re also good for pretty much all watercourses). Secondly, Things Network are busy building a network which allows super-low-power transmissions from aptly-named “things”, which would include the aforementioned water level sensors.
This, combined with data from things such as the Weather Underground personal stations network, would give us a pretty good picture of what’s going on in terms of inputs to the system.
In terms of output sensors (the “is this thing actually wet right now” part of the system), it gets a bit more complex since surprisingly this isn’t a piece of data which is currently captured in any useful way. It’s assumed (pretty fairly to my mind) that if a river breaches its banks, surrounding low-lying areas will be flooded, so it should be possible to derive this, but as we learnt during the wrap-up this is often not the case at all. During the Boxing Day floods, bits of Leeds which were in flood risk areas stayed safe and dry whilst places classed as no risk at all were underwater.
Flood extent maps are usually put together after the fact by people looking at satellite images and aerial photography, rather than having any kind of automatic feedback on what is actually happening. I imagine this is down to the fact that most people don’t need a sensor network to tell if something has flooded, they just look and go “yeah, that’s underwater”. Still, thousands of people own weather stations to tell them if it’s raining when a window works just as well.
What’s needed to train the network is a number of sensors (or at least sensor proxies such as regularly updated, decent resolution satellite data) which I’ll codename “floody-meters”, which figure out if any particular bit of land is actually in a state of being flooded, and ideally to what extent. These meters could be small, reasonably cheap and hard-wearing, lasting years on a battery and phoning home via the Things Network. Most crucially, they need to be placed where things actually get wet.
I’m not talking about things like rivers getting high; we can tell that by the Flood Network sensors. I’m talking things like “if there’s a shedload of rain everybody knows this junction gets surface water”, or “when there’s been a wet weekend the village green gets boggy”. Ground floors in riverside properties, basements, underground car parks and so-on would all be great examples of this.
Once all these things are in place, the magic can begin.
Neural networks have been around in various forms for years, but as available computing power has exploded they have recently begun to show exactly what they can do. Neural networks have beaten world champions at games which “computers could never play”, not by a human programming how to respond to every given situation, but by the computer starting with absolutely no idea, ‘watching’ literally millions of games and teaching itself. Self-driving cars do the same; nobody working on a self-driving car has ever sat down and programmed “this is a bicycle, here’s how to not hit it”, instead the car has learnt “here is a thing, here’s how it usually behaves, here’s how I should behave around it”.
The most crucial part of this whole process is that it’s a feedback loop. Every single time a human chess (or most recently go) player behaves differently, or a bicycle takes a junction a different way, the network adds that to its understanding of what happens under a set of circumstances.
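That feedback loop can be sketched in miniature: a model makes a prediction, observes what actually happened, and nudges its weights to shrink the error. Everything here (a single weight, the learning rate, the invented “true” relationship) is illustrative; real networks do exactly this, just at enormous scale.

```python
# Minimal sketch of the feedback loop: predict, observe, adjust.
# One weight, one input; purely illustrative.

weight = 0.0          # the model starts knowing nothing
learning_rate = 0.1

def predict(x: float) -> float:
    return weight * x

def learn(x: float, actual: float) -> None:
    """Nudge the weight toward whatever actually happened."""
    global weight
    error = actual - predict(x)
    weight += learning_rate * error * x

# The invented 'true' relationship is actual = 2 * x; the model
# discovers it purely by comparing predictions with outcomes.
for _ in range(100):
    for x, actual in [(1.0, 2.0), (2.0, 4.0)]:
        learn(x, actual)

print(round(weight, 2))  # 2.0
```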
So, here’s how it works for flooding.
- Start with a set of inputs from the sensors we have which relate to water and how it behaves. Rainfall, temperature, groundwater, river height, river flow, humidity.
- Add these as inputs to the network.
- Add additional inputs for historic values of the various sensors, going back a reasonable amount of time. We know flooding is based not on what is happening right now, but instead on what happened minutes, hours or even days ago somewhere else.
- Add inputs based on computational predictions using things like topography. We know that they won’t be 100% accurate, but things like “this bit of land is lower than the nearby river level” will probably be a good indicator that flooding will happen.
- Set up outputs for ‘floodyness’ of the various sensors (or proxies) for which we will eventually know the hard values, in varying time intervals depending on how far we want to run predictions.
- Add each of the ‘floodyness’ outputs back into the network, since often the ‘floodyness’ of one location will be a good indicator of the state of another.
- Run the network, and get completely and utterly wrong results.
- Don’t panic, because as soon as the first prediction time interval rolls around, the network gets a chance to compare what it predicted with what actually happened, and begin to apply its learning ability by modifying its internal structure so that if it were run again with the same inputs, it would get a more reliable output.
- Sit back and relax. Every time the network runs, and gets to compare what it predicted with what happened it will get a little bit better.
The network will reasonably quickly learn how the ‘floodyness’ outputs behave for the prevailing weather conditions. As soon as more extreme weather hits, the predictions will be completely and utterly wrong because it hasn’t yet learnt what happens. The next time similar weather comes along, it’ll know what to expect and will be able to make an informed guess. Eventually the network will have learnt how the various factors combine to result in an outcome.
I feel it’s important to note that this won’t replace flood risk analysis or prediction models. The network won’t be able to understand situations like “what if we put a flood barrier up here”, since it’s based entirely on data of what has really happened which can’t be generated short of putting up a flood barrier and waiting for heavy rain. Modelling is still critical in figuring out how changes such as buildings might affect a flood, and that’s why the outputs from models should form inputs into the network. A neural network would be best suited to short-to-mid-term flood prediction.
At the moment, we really lack the necessary data and sensor infrastructure to even start to build a network which could have a stab at predicting flooding, but if the work of people like Flood Network can roll out enough sensors (both of river level and actual flood outcomes) then all we’re missing is training, which takes both time and actual flood events.
Based on the evidence, it looks like we should be expecting major flooding more often, so perhaps if we can hurry up and start collecting the data on it then within a couple of years we’ll be able to produce accurate predictions in time.