Hackathon: DACHACK No. 2, Round 1

tl;dr: sometimes you can fail well enough to win.

At the end of January my team, consisting of Pash, Perry, and me, won the first round of the second DAC (Data Analytics Centre) hackathon, with Fire and Rescue NSW being the client this time.

As in most data-centric hackathons, the goal was to take a given dataset and produce something meaningful out of it (in ascending order of importance: insights < model < analytics product). I’ve attempted to document the process and my takeaways from it.

The format for this type of hackathon is three rounds of 24-hour sprints (something something agile) in four teams of three. Each round nominates winning ideas, which are developed further in successive rounds (held fortnightly), with members of the unsuccessful teams being absorbed into the “winning” teams to form larger groups.

The goal this round was to predict whether a fire alarm on the UNSW campus signalled a real fire, given a few features such as time of alarm, whether it was a repeat, the type of alarm, location, etc. Because 97% of their calls are false alarms, Fire and Rescue NSW would like to scale their response according to the likelihood of an alarm being real (legislation states that they still have to respond to all fire alarms). To that end, a dataset with the information above was provided for 2013–2016.

I’ve broken down the major events of the 24-hour period below:

1. Gathering information:

We started off by exploring the dataset a little and playing around with it. The provided dataset contained about 700 entries, with only about 20 real fires, and the conditions reported for real and false alarms were incredibly similar. Some quick Tableau-ing was done to explore the base dataset, but the patterns we saw emerge seemed too weak to use for prediction. The organizers/judges were aware of the limitations of the dataset (they were not able to get the full dataset out in time) but wanted us to proceed anyway to see what we could produce.

Clearly more data was needed (at this point I was slightly panicked). We started by getting weather data to match the dataset (thanks to Pash) while I looked for UNSW key-date info (as a predictor of campus occupancy).

Finally, we attempted to mush the data together into a single CSV for use in modelling.
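For illustration, here’s a minimal sketch of that merge step in pandas. All file and column names here are hypothetical, and the real munging was considerably messier:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
alarms = pd.read_csv("fire_alarms_2013_2016.csv", parse_dates=["alarm_time"])
weather = pd.read_csv("weather_daily.csv", parse_dates=["date"])
key_dates = pd.read_csv("unsw_key_dates.csv", parse_dates=["date"])

# Join each alarm to the weather and campus-calendar info for its day.
alarms["date"] = alarms["alarm_time"].dt.normalize()
merged = (alarms.merge(weather, on="date", how="left")
                .merge(key_dates, on="date", how="left"))

# Crude occupancy proxy: was the campus in a teaching period that day?
merged["in_session"] = merged["period"] == "teaching"

merged.to_csv("modelling_input.csv", index=False)
```

Joining on the calendar day was the simplest way to line up daily weather and the UNSW key dates with individual alarms.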

2. Attempt at using said information:

Even with the extra info we managed to put together, we found only weak correlations. One of the strongest was between cooking-fume false alarms and time of day (which makes intuitive sense). Unfortunately, even this was too weak to actually make any predictions from (tested via ANOVA).
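As a rough illustration of that style of test, here’s a one-way ANOVA sketch in Python with scipy (the column names and the time-of-day bucketing are made up for this example, and Perry’s actual analysis was done in R):

```python
import pandas as pd
from scipy import stats

# Hypothetical column names on the merged file sketched above.
df = pd.read_csv("modelling_input.csv", parse_dates=["alarm_time"])
cooking = df[df["alarm_type"] == "cooking_fumes"].copy()

# Bucket each alarm into a part of the day.
cooking["bucket"] = pd.cut(cooking["alarm_time"].dt.hour,
                           bins=[0, 6, 12, 18, 24], right=False,
                           labels=["night", "morning", "afternoon", "evening"])

# Daily counts per bucket, then a one-way ANOVA on the counts:
# do mean daily alarm counts actually differ across parts of the day?
daily = (cooking.groupby([cooking["alarm_time"].dt.date, "bucket"])
                .size().reset_index(name="n"))
groups = [g["n"].values for _, g in daily.groupby("bucket") if len(g)]
print(stats.f_oneway(*groups))
```

Tests of this shape are how we checked whether an apparent pattern was strong enough to act on; in our case, it wasn’t.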

Since no patterns were emerging, I panicked completely. I decided to use a neural network in Theano (to overfit the values we had) and have a go at making a prediction. This failed miserably due to various technical issues that cropped up when using our new dataset as opposed to one of the default datasets (more info on this in an upcoming post). Unfortunately, this was unfixable by presentation time.
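For context, the shape of what I was attempting looked roughly like this. This is a minimal sketch on toy data, not the code that actually broke; the failure was in getting our real dataset through Theano, not in a snippet this size:

```python
import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(0)

# Toy stand-in for our merged features: n alarms, d features, binary label
# (1 = real fire). The real set had ~700 rows and only ~20 positives.
n, d, h = 700, 10, 32
X_data = rng.randn(n, d).astype(theano.config.floatX)
y_data = (rng.rand(n) < 0.03).astype(theano.config.floatX)

X = T.matrix("X")
y = T.vector("y")

# One hidden layer; with so few positives this will happily overfit.
W1 = theano.shared(rng.randn(d, h).astype(theano.config.floatX) * 0.1)
b1 = theano.shared(np.zeros(h, dtype=theano.config.floatX))
W2 = theano.shared(rng.randn(h).astype(theano.config.floatX) * 0.1)
b2 = theano.shared(np.asarray(0., dtype=theano.config.floatX))

hidden = T.tanh(T.dot(X, W1) + b1)
p = T.nnet.sigmoid(T.dot(hidden, W2) + b2)
loss = T.nnet.binary_crossentropy(p, y).mean()

# Plain gradient descent on all parameters.
params = [W1, b1, W2, b2]
grads = T.grad(loss, params)
updates = [(prm, prm - 0.1 * g) for prm, g in zip(params, grads)]
train = theano.function([X, y], loss, updates=updates)

for epoch in range(200):
    train(X_data, y_data)
```

With ~700 rows and ~20 positives, a network like this memorizes the training set rather than learning anything generalizable, which is what I meant by overfitting with the values we had.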

Perry was remarkably successful in showing the dataset was a poor predictor using kernel smoothing and ANOVA. Importantly, he was able to stick this into a UI with ioslides (which is pretty amazing), which was used for the final presentation.

3. Presenting the model and information:

The entire presentation was done from R, which enabled embedding of the UI (I still can’t get over how amazing this turned out with ioslides).

Importantly, the information was contextualized. Graphing the various types of alarms (thanks, Pash, for the Tableau work) was useful as it helped hammer home our points. Essentially, we talked about the data, some possible patterns, and why, with such a limited set, it would be illogical to make predictions.

The poor mineability of the data was flagged as an issue by all of the teams, and the organizers promised more data for the next round.

Positives:

  • Teamwork: By now we understand our roles in the team and know how to hit the ground running.
  • Exploration: there was very much an attitude of “giving things a crack”, partly due to the sparseness of the data.
  • Communication of ideas: both discussion within our team and pitching have become easier, and the act of pitching now has a bit of confidence/flow to it.
  • Redundancies: after my code failed spectacularly (which I’m a bit salty about), Pash and Perry were already working on their own contributions and provided working code and a damn good presentation.

Negatives:

  • The sparseness of the dataset meant not much interesting data science could be done (probabilistic modelling is prone to failure on this kind of dataset given the low number of real positives: roughly 20 fires in ~700 entries).
  • Technical issues on my end.

Takeaways:

Don’t use a low-level tool for a hackathon unless you know it really well (which should have been obvious). Prior to the hackathon I had experimented with Theano and worked through the examples. In my naivety, I expected I could just re-tool the bits of code I needed and plug in our dataset.

In a similar vein, prototyping tools are extremely helpful in hackathon-type scenarios. They don’t necessarily scale with the size of the problem, but having something interactive to show during a presentation, as opposed to just numbers and graphs, is extremely powerful.

In Hemingway’s The Sun Also Rises, the bullfighter Belmonte is described as doing his best work when he worked unsafely, in the “terrain of the bull”. As he grows older (and returns post-retirement), he abandons this approach and plays it safe in the “terrain of the bullfighter”. In our case, engaging with the “terrain of the data” meant attacking it head on and showing why prediction doesn’t work, which is a fairly unsafe approach (any other team could have made a prediction and blown us out of the water). The book also makes a point of Belmonte trying to live up to the reputation of his youth, which might be an issue for us in the next round.

In short, this was a great hackathon, and while I’m annoyed at my personal lack of success, I’m quite happy that our team did well and that we’ve grown since our first hackathon only six months ago.