PyDataMCR + Open Data Manchester — Data Horror Stories — 29/10/2019
With a tagline of “Be Afraid. Be Very Afraid”, this horror-filled evening was terrifying enough to strike fear into the hearts of the assembled audience. Everyone dealing with data and coding has encountered at least one of these horrors in their careers, but fortunately the storytellers had more imagination than to tell a story about a join gone awry.
Story of horrors with data — Reka(https://twitter.com/r_solymosi)
With Halloween coming to Manchester, a number of monsters have been appearing around iconic locations in the town centre. As you might imagine, the inquisitive part of many people's personalities immediately jumps to … which one is the crowd favourite?
Reka did a fantastic job of using the Instagram API to scrape the #mcrmonsters and #halloweenmcr hashtags (with some manual tagging on top) and ranking the results by number of photos, number of likes and mean likes per photo.
Happily for Reka, her favourite — the “Blob” — won. See more details in the blog post(https://rekadata.site/blog/halloween-mcr/)
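For the curious, the three rankings can be sketched in a few lines of Python. This is only a rough illustration: the records below are made up, the function names are my own, and none of it is taken from Reka's actual code or data.

```python
from collections import defaultdict

# Made-up records: (monster name, likes on one photo).
# These are NOT Reka's data; they only illustrate the three rankings.
posts = [
    ("Blob", 120), ("Blob", 80), ("Blob", 100),
    ("Spider", 150), ("Spider", 70),
    ("Ghost", 60),
]

def summarise(posts):
    """Return {monster: (photo count, total likes, mean likes per photo)}."""
    likes_by_monster = defaultdict(list)
    for monster, likes in posts:
        likes_by_monster[monster].append(likes)
    return {
        monster: (len(likes), sum(likes), sum(likes) / len(likes))
        for monster, likes in likes_by_monster.items()
    }

def rank(summary, column):
    """Rank monsters by one column of the summary tuple, highest first."""
    return sorted(summary, key=lambda m: summary[m][column], reverse=True)

summary = summarise(posts)
by_photos = rank(summary, 0)  # most photographed
by_likes = rank(summary, 1)   # most liked overall
by_mean = rank(summary, 2)    # most liked per photo
```

With these invented numbers the three rankings do not agree, which is exactly why it is worth looking at more than one of them.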
The tasting adventures of a tea fanatic — Adam(https://twitter.com/Adamshackleton)
As a self-professed tea connoisseur, Adam was thrilled to be working with a large tea company: a dream come true, some might say.
The mission: Build a website to capture feedback from sensory trials including features such as “woodiness, smokiness and grassiness” — none of which an uninitiated tea drinker such as myself could hope to identify.
So of course, the question springs to mind. Now that we have all this data, what do we do with it? ANOVA? PCA? Compare against controls?
The answer: we “interpret the data by eyeballing it”, which means we look at the data and find the mode.
There were outliers to exclude as well. The most notable was Janet from accounting, who in the anonymised survey was most easily identified by her inability to taste grassiness. Needless to say, I’m with you on that one, Janet.
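As a small aside, "find the mode and drop the outliers" looks something like the sketch below. The scores and taster names are invented for illustration and the helper is hypothetical; this is not the code behind Adam's website.

```python
from statistics import mean, multimode

# Invented 1-5 panel scores for one attribute of one tea.
grassiness = {
    "taster_a": 4,
    "taster_b": 4,
    "taster_c": 3,
    "taster_d": 4,
    "taster_e": 1,  # the taster who cannot taste grassiness
}

def modal_scores(scores, exclude=()):
    """Return the most common score(s), ignoring any excluded tasters."""
    kept = [score for name, score in scores.items() if name not in exclude]
    return multimode(kept)

mode_all = modal_scores(grassiness)
mode_trimmed = modal_scores(grassiness, exclude=("taster_e",))
mean_all = mean(grassiness.values())
```

In this toy example the mode is the same with or without the outlier, while the mean gets dragged down by the single low score.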
As Julian approached the podium I did muse to myself: at what point does a laptop become more of a vehicle for stickers than a tool serving its intended function?
Spreadsheets against Humanity — Julian(https://twitter.com/Julianlstar) CEO of @opendataMCR
Drawing inspiration from one of the most influential economics papers of recent times, Julian focused on a classic story of David vs Goliath.
A student economist, Thomas Herndon, was tasked with picking a recent economics paper to see if he could replicate the results. So why not choose one published by a collaboration of the best economists, Carmen Reinhart and Kenneth Rogoff?
The paper, “Growth in a Time of Debt”, focused on the key features of achieving growth in a time of recession, a hot topic not so many years ago. One of the notable findings was that countries with a public debt-to-GDP ratio above 90% tend to go into recession, something which was used as a justification for austerity.
The issue — the results were not replicable.
It turns out that several countries (Australia, Canada, New Zealand and Denmark) had been excluded from the calculations. Once they were included, the original estimated growth of -0.1% rose to +2.2%, a drastically different conclusion from the one originally reported.
Moral of the story: Remember to check your spreadsheets thoroughly
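To see how easily this happens, here is a toy sketch of an average taken over a range that misses part of the table. The country names are placeholders and the growth figures are made up; they are not the numbers from the paper or from Herndon's replication.

```python
# Made-up growth rates (percent) keyed by placeholder country names.
# Illustration only: not figures from the paper or the replication.
growth = {
    "Country A": 3.0,
    "Country B": 2.0,
    "Country C": 2.5,
    "Country D": -1.0,
    "Country E": 0.5,
}

def mean_growth(data, rows):
    """Average growth over the named rows, like AVERAGE over a cell range."""
    return sum(data[name] for name in rows) / len(rows)

def missing_rows(data, rows):
    """Return the rows present in the data but absent from the averaged range."""
    return sorted(set(data) - set(rows))

all_rows = list(growth)
short_range = ["Country D", "Country E"]  # a range that misses three rows

full_mean = mean_growth(growth, all_rows)      # every row included
short_mean = mean_growth(growth, short_range)  # three rows silently skipped
skipped = missing_rows(growth, short_range)
```

Here the sign of the average flips depending on which rows the range covers, and a one-line check like `missing_rows` is enough to flag it.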
Funnybones — Ellen(https://twitter.com/Julianlstar)
In the midst of writing her PhD thesis, Ellen braved the autumnal weather to share a candid look into what it means to be an academic working with sensitive customer data. All to the theme of a popular children’s book — Funnybones.
Words fail to do the narration justice, but suffice it to say that being given access to sensitive customer data means going to extreme lengths, including but not limited to: no phones, no notepads, multiple keypads, an otherwise empty computer lab, USB drives and a perennial fear that you would be locked in the lab over the weekend and left for dead.
On a more lighthearted note, Ellen did some digging and found there to be 193 Dark Lanes in the UK. Although sadly no Dark, Dark Lanes.
If you want to learn more, the latest episode of the PyDataMCR podcast has a fantastic interview with Ellen (https://anchor.fm/pydatamcr/episodes/Episode-9---Smart-Meters--more-like-Dumb-Meters-Ft--Ellen-Talbot-e7s6it).
It came from Excel hell — Jon(https://twitter.com/namelessjon)
As a fan of interesting statistics, Jon claims that approximately 20% of gene analysis errors can be mapped to errors in Excel processing.
Jon was also keen to highlight some personal bugbears I believe all of us can get on board with.
The offset table: why start a dataset at cell A1 when you can start it at cell B3? Oh the horror.
Multiple offset tables: evidently willing to one-up the single offset table, why not have several, each with a different offset of course?
Merged cells. Enough said.
Multi-level headers
Colour the cells, and have the colour mean something. This quickly devolved into Jon having to scrape the metadata out of each cell to identify the cells to exclude.
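For the first bugbear on that list, one way to cope is to strip the empty padding before doing anything else. The sketch below assumes the sheet has already been read into a plain list of rows (how you get there is up to you), and `trim_offset_table` is a name I made up for illustration, not something Jon showed.

```python
def trim_offset_table(grid):
    """Drop fully empty rows and columns from a grid of cell values."""
    def is_empty(cell):
        return cell is None or str(cell).strip() == ""

    # Keep only rows that contain at least one non-empty cell.
    rows = [row for row in grid if not all(is_empty(c) for c in row)]
    if not rows:
        return []

    # Pad ragged rows so every row has the same number of columns.
    width = max(len(row) for row in rows)
    padded = [list(row) + [None] * (width - len(row)) for row in rows]

    # Keep only columns that contain at least one non-empty cell.
    keep = [i for i in range(width)
            if not all(is_empty(row[i]) for row in padded)]
    return [[row[i] for i in keep] for row in padded]

# A table that starts at "B3": two blank rows, one blank column.
sheet = [
    [None, None, None],
    [None, None, None],
    [None, "tea", "score"],
    [None, "assam", 4],
]
table = trim_offset_table(sheet)
```

Note that this also removes blank rows and columns in the middle of a table, so it will not help with several offset tables on one sheet, and it does nothing for merged cells or meaningful colours.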
Jon's section ended with a simple request: Please excel responsibly.
Lightning Talks — Hera(https://twitter.com/herahussain)
Government offices don’t have meaningful datasets because they are all Excel.
Whilst growing up in Pakistan, Hera was looking for a way to earn money, something which her friend was also very keen to do. They had the opportunity (200 rupees per 10,000 details of US/UK citizens) and the tools for the job (Excel). All they needed was the imagination to make up 10,000 entries and the reward was theirs!
Lightning Talks — Pete(https://twitter.com/thedatabloke)
Working for a recruitment firm years ago in the early 00s, Pete ran into an interesting issue.
How do you identify whether someone is already in the database? Three ways: same landline, same mobile or same email.
This system seemed perfect. Unfortunately, what the system didn’t anticipate was people sharing a mobile with someone else, leading to a very awkward conversation for a recruiter trying to explain why they believed this person was a member of that household.
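The matching rule is simple enough to sketch, and the sketch shows the failure mode straight away. Everything here is hypothetical: the records, the field names and the `find_matches` helper are mine, not a description of the firm's real system.

```python
def find_matches(candidate, database, fields=("landline", "mobile", "email")):
    """Return (name, shared fields) for records sharing any contact detail."""
    matches = []
    for record in database:
        # A field counts only if the candidate has a value and it is identical.
        shared = [f for f in fields
                  if candidate.get(f) and candidate.get(f) == record.get(f)]
        if shared:
            matches.append((record["name"], shared))
    return matches

# Invented records for illustration.
database = [
    {"name": "Alex Example", "landline": "0161 000 0001",
     "mobile": "07000 000001", "email": "alex@example.com"},
    {"name": "Sam Sample", "landline": None,
     "mobile": "07000 000002", "email": "sam@example.com"},
]

# A different person who happens to share Alex's mobile.
newcomer = {"name": "Jo Example", "landline": None,
            "mobile": "07000 000001", "email": "jo@example.com"}

matches = find_matches(newcomer, database)
```

The rule happily reports Jo as "already in the database" on the strength of one shared number, so a match like this is better treated as a prompt for a human check than as proof of identity.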
18 months of low hanging fruit — Joe(https://twitter.com/JosephAllen1234) / John(https://twitter.com/jaspajjr) https://twitter.com/PyDataMCR
Finally, with a couple of minutes to spare, Joe & John dragged out a talk from the last Data Horror Stories event. The story and slides had not changed. Joe took on the role of a business, getting excited at the potential the “sexiest job of 2019” had to offer. John was more realistic: he was happy to have an opportunity to learn and expand his skills. The slides shared the horror theme of the night and are available here(https://twitter.com/PyDataMCR).
Thank you very much to everybody who made this event possible. This event was one of the best data events I’ve been to and was a lot of fun to organise. Thank you once again to all the speakers as without you we would’ve been silently sat in a room for two hours.
