Decentralized Data Science and the Ghostbuster of Starbucks
A philosophy and stories of shifting perspective in the Era of Data
My notes, and my post-talk interview with the speaker, on “Why Blockchain is the Future of Data Science” deserved an article.
“Data science” is an overused and ambiguous term.
We’re collecting more information about more of everything than ever before.
To me, data science is the cover we use to ease the pressure and justify our lack of an answer to “What the hell do we do with all this data?” and “What does this all mean?”
At the little conference that could, BlockFiesta, Luciano impressed everyone with wit, theory, and a random Starbucks story to predict the future of data and the journey to understand it.
“You’re all going to be immortal now,” he joked.
“We’ve always used SQL databases; we expect that, given our training on centralized systems. That’s all going to change.”
Luciano then asked, “What happens when data isn’t centralized?”
The question hung heavy in the nerdy air. We all needed a story to imagine what he meant. So he told us about his encounter with a Ghostbuster at a random Starbucks.
The Ghostbuster of Starbucks
One afternoon, Luciano sat in a Starbucks, drinking coffee and being all the things he is. A Ghostbuster entered the shop. He wore a proton pack, carried a laser wand connected to the pack, and began scanning the space. He looked exactly like Egon.
When Luciano asked, the Ghostbuster explained — he was collecting internal maps of every Starbucks in the world. They would all be mapped for accurate upgrades and adjustments. Want to plan a network-wide upgrade of every espresso machine around the world? There’s an accurate count and location of every eligible machine to plan and execute replacement.
The Ghostbuster was “quantifying the world in real time.” He was building the initial, universal blueprint for a Starbucks, updatable as individual layouts changed. The process wasn’t fully decentralized (the scans fed into a single database), but it showed the movement toward a new way of modeling the world, of creating and using data.
“Only the government can afford to collect data” was the old quip. Government satellites, surveys, and tracking have given us immense stores and feeds of “free” information. Now, though, the private sector is exploding, following a decentralized approach to surpass that collection in both volume and quality.
Main points I took from his analysis:
- Centralized systems create single points of failure and authority that everyone must placate.
- 50–80% of data science is spent prepping data.
- Better data is now driving new economic models with previously unseen covariates.
- Decentralization must prove itself — e.g., “Can a group of people working independently outperform Powell as Fed Chair?”
“Trustworthy over trustless” is the message to carry:
- A system that can’t be held hostage by single individuals or institutions.
- There is still huge belief in regulation and in the idea that “centralization is the path to the better world.” How does that compete with a new global, decentralized system with distributed self-governance, clear rules, and none of the enforcement costs?
- $1 trillion in enforcement costs for property crime currently in the US — what a waste.
His interesting philosophy on Citizenship:
- “I believe you should be able to sell your citizenship. The ultimate sign of ownership is the ability to control something.”
- Residency based on token ownership
“Decentralization allows participation without enforcement costs”
- Integration and the sharing economy have decreased global poverty tremendously.
- Stable citizenships, and “runs” on citizenships — under the current system, you can’t opt out of a citizenship.
On “Little Economies”:
- Comprehensive, unified feedback loops are missing from product/commercial ecosystems
- Can track a person through the entire lifecycle with a single identifier
Decentralized Data Teams
- no single person has all the skills, even with a huge toolkit
- “Two-pizza rule” — a team must be small enough to feed with two pizzas.
- Analytics via Bayesian crawlers — the primary tool of decentralized data science.
A short narrative history of data:
- The variance approach was built on probability theory and parlor games/gambling, then moved into accounting. Ronald Fisher published his landmark 1918 paper and is seen as a founder of both modern statistics and the new genetics. He did his research in his backyard, and the approach now has 100 years of momentum behind it.
- Frequentists vs. Bayesians. Freqs (proposal: pronounce the abbreviation “Freaks”?) have confirmation bias — they throw out everything else because of one new piece of knowledge. A Bayesian takes it all into context and uses it to grow.
- Before machine learning, no one talked to Bayesians; they were heretics. Now they are the rock stars — the efficiency is there, and it’s the better approach.
- When the ML era hit, Bayesian systems began to outperform traditional statistical methods across the board.
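The “takes it all into context, uses it to grow” point can be made concrete with the textbook beta-binomial update, where each new batch of evidence is folded into the prior rather than replacing it. This is my own minimal sketch, not anything from the talk; the prior and the data batches are invented for illustration:

```python
def update_beta(alpha, beta, successes, failures):
    """Fold a new batch of evidence into the current belief; nothing is thrown away."""
    return alpha + successes, beta + failures

# Start with a weak, open-minded prior over some unknown success rate.
alpha, beta = 1, 1

# Each batch of observations updates, rather than replaces, the belief.
for successes, failures in [(3, 1), (2, 2), (5, 0)]:
    alpha, beta = update_beta(alpha, beta, successes, failures)

posterior_mean = alpha / (alpha + beta)  # 11 / 15 ≈ 0.733
```

Nothing observed earlier is discarded; the posterior after each batch becomes the prior for the next, which is the “growing” Luciano contrasted with the frequentist habit of starting over.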
His abstract theory:
- It’s better to get to an answer fast than to a more precise answer slowly. The only way to do this is to cut out the centralization layer.
- The system must be real time and must solve the model-iteration problem: a series of models arriving at a final model.
- Uses understanding of the “searcher” or researcher
- Decentralized means crawling a small set to start — Bayesian crawlers spread throughout the blockchain — then scaling up and incorporating more as models fail.
- Cross-sectional vs. time series. The nature of time is so complex that we must view all dimensions, and all timings, of covariates and their effects.
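A sketch of what that crawl-then-scale loop might look like, under my own assumptions — `crawl_and_model` is a hypothetical name, the “model” is just a running mean, and “model failure” is simply high error on a fresh sample; none of this comes from the talk:

```python
import random

def crawl_and_model(nodes, error_threshold=0.5, max_rounds=5):
    """Start with a small sample of nodes; widen the crawl only when the model fails."""
    sample_size = max(1, len(nodes) // 10)
    model = None
    for round_number in range(1, max_rounds + 1):
        # Fit on a small sample: here the "model" is just the sample mean.
        sample = random.sample(nodes, min(sample_size, len(nodes)))
        model = sum(sample) / len(sample)
        # Check the model against fresh data.
        holdout = random.sample(nodes, min(sample_size, len(nodes)))
        error = abs(model - sum(holdout) / len(holdout))
        if error <= error_threshold:
            return model, round_number  # good enough: answer fast, stop crawling
        sample_size *= 2  # model failed: crawl wider and refit
    return model, max_rounds
```

The design choice matches the point above: a series of cheap models, each one triggering a wider crawl only when it fails, rather than centralizing everything up front.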
Supply chain modeling today only collects data at the moment a commodity is delivered. We need a model that picks up and integrates data continuously, and that uses simulations with sampling to predict.
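A continuous, simulation-driven model might look something like this Monte Carlo sketch. The legs of the chain, their distributions, and the function name are all invented for illustration:

```python
import random

def simulate_lead_time(n_sims=10_000, seed=42):
    """Monte Carlo sketch: sample every leg of the chain, not just the delivery event."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        production = rng.gauss(3.0, 0.5)  # days; assumed distribution
        transit = rng.gauss(5.0, 1.5)     # days; assumed distribution
        customs = rng.expovariate(1.0)    # occasional long holds
        totals.append(production + transit + customs)
    totals.sort()
    # Report the median and the 95th-percentile lead time.
    return totals[int(0.5 * n_sims)], totals[int(0.95 * n_sims)]
```

Instead of one number recorded at delivery, sampling each leg yields a full distribution of outcomes, which is what you need to predict rather than merely record.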
Building a Collective Memory
To Luciano, blockchain is the “Collective Memory.” It gets better as more things are quantified and put onto the chain. With current computing, adding big datasets slows the chain down — but this is a limitation people are actively working to solve.
I share the optimism, and I feel a sense of relief and validation every time a chain is used for something novel and functional to society — i.e., something more than settling a few transactions while burning insane amounts of energy to solve algorithms. Current blockchain tech is still oversold (and unproven), despite all the hype. The biggest challenge is getting people to give up their data to prove this decentralized data science can work.
Principle: “Be 100% yourself”
Luciano referenced channeling your inner Kanye, i.e., don’t give a f***. It made me think about how much authenticity and honesty are the long game of real trust. People will need to start out thicker-skinned and less insecure — our insecurity is why we need all this encryption and consensus in the first place.
He gave an interesting example from a community bonded by a health condition:
The Cystic Fibrosis Foundation found clusters of patients in its network who were not attending sessions and appointments. Non-attendance was literally killing people. The organization came to understand the immunocompromised nature of CF through this connection to mortality. For the pattern to be seen, everyone had to communicate it: people with CF had to self-report a check-in. They had to share — and entrust — their data to save lives.
Other notes of interest I never got to fully explore:
- “Agent-based modeling systems”
- Bayesian crawlers need a visualization. We need to use public, open-source crawlers.
- Blockchain’s primary function is to immediately recall and share — to find and share, with visibility built in.
- “A society of data science” — decentralized data science incentivizes quantification of things we’ve never had before.
- “If you don’t believe the world can and should be better, then data science is not for you.” It always looks to the future, and it requires near-perfect information.