Premises for Data Science Magical Realism
By Michael Heilman
Magical realism is a genre of storytelling that often involves impossible, absurd, magical things happening in ordinary situations. Some places you might have seen this concept:
- Gabriel García Márquez’s Cien años de soledad (One Hundred Years of Solitude), which follows several generations of the Buendía family and the ill-fated town of Macondo.
- Haruki Murakami’s A Wild Sheep Chase, the story of an ad executive who has to track down a magical sheep in rural Japan at the behest of a mysterious political boss.
- Countless other works including but not limited to the short stories of Jorge Luis Borges, Toni Morrison’s Beloved, Salman Rushdie’s Midnight’s Children, Tim O’Brien’s Going After Cacciato, W.P. Kinsella’s Shoeless Joe (adapted for Field of Dreams), Beasts of the Southern Wild, Birdman, and Kentucky Route Zero.
Data science can sometimes feel like an adventure, particularly when one is trying to track down a software bug or understand the behavior of a machine learning algorithm. After being asked to emcee our bi-weekly staff meeting, which involves picking a theme (that may or may not relate to work), I chose to apply some magical realism to data science situations.
What follows are some premises for data science magical realism stories based (very, very loosely) on experiences I’ve had or heard about — premises, that is, for stories about impossible, absurd, magical things happening to data scientists in ordinary data science situations. Enjoy!
- The weekend before his presentation to Mega Burger Inc., Claude finds duplicate records in his analytics database for people living in Borges, IL. He double-checks his ETL scripts but can’t explain it. He catches the last train out of Union Station to Aurora. The next morning, he finds duplicate people there.
- Ada trains a model to predict whether people buy ABC brand paper clips, and it gets 98.3% ROC AUC in her evaluations. She looks at the coefficients and finds it’s using a combination of the number of letters in people’s names, their distance to the nearest water tower, and how frequently they eat Oreos. She tries these features for models of other variables and finds that they predict just about everything.
- Charles is quality-checking data in the Insights Consumer File and accidentally comes across a record that appears to be about him. To his surprise, it shows him as unemployed and with 2 kids. He laughs it off, but the next morning his office is closed and his key card doesn’t work. He gets a text about picking up his son from daycare.
- Luis is trying to optimize in-app purchases and engagement for his company’s mobile game Troll Legion IV. He accidentally deletes 6 months of logs. He frantically calls the cloud provider to recover the data, but their process might take weeks, so Luis drives out to the data center in western Iowa. Once there, he sneaks in to get physical access to the servers. A security guard stops him, but not before he sees a group of robed figures carrying jars of a shining liquid into a room marked “Data Lake.”
- Sara has just been asked to report some aggregate statistics about a new data set. It sounds simple at first, but she gets an error message about character encodings when she opens the file. She looks for a data dictionary in the README file, but it seems to be instructions for preparing a salad. She asks her manager for more details, only to find out that the documentation was automatically generated by AI and that accessing the data requires a 2-factor authentication device that was lost in transit from Europe.
- Marcus needs to integrate his code with a legacy system maintained by another team. He meets with the other team, but they didn’t actually build the system. In fact, they’re not even sure what language it was written in (their best guess is Fortran 78), and redeploying it on the servers requires a 33-step process involving woodwind music, chanting, and the sacrifice of obsolete MacBook Pros.