Real-World Data Is Surprisingly Hard To Come By

But soon it will be possible to go beyond

By Decision-First AI
Published in Corsair's Business
Nov 16, 2018 · 4 min read

Where to start with this article? I have been staring at the title for days. Do I talk about the underwhelming nature of demo data sets? Do I vent my frustration with security and privacy protocols that often seem to do more to prevent honest access than criminal mischief? Do I toss Kaggle under the bus for data sets that only noobs get excited about? Sorry, some of you will take that last one hard.

Do I talk about the vast analytic divide? Do I try to connect the dots on why universities fail to produce job-ready talent? Do I note that priority and scale prevent corporations from making inroads here? Or should I call out the corporate propensity to advertise entry-level jobs requiring 3–5 years of experience? I suppose I opted for all of the above!

Real-World Data Is Not Easy To Access

Simply put, feedback and incentives suppress, discourage, or actively prevent access to real-world data. It is just too difficult for corporations, universities, or even software companies to provide real-world data to marginal users. By marginal users, I mean those not currently employed by the company creating the data. The security concerns alone are overwhelming, but the P&L and timelines don’t help either.

At this point, some of you are railing about government data sources. It is true, they have come a long way. Sadly, they have a long way to go. I could now go on to lecture about self-reported, mostly anonymous data. I could call out that open source data sets often lack integrity, veracity, and connection. I could note that census data was only really “big data” several decades ago. The FAANG companies generate billions of records per day, and most Fortune 500s come pretty close. Government data sets typically max out in the millions. Many fall short of even that.

Isn’t data just data? Does size really matter?

No and yes. But also, yes and no. Let’s dig in. Data is data when considering formatting and manipulation. And size is only directly meaningful when considering storage and CPU cycles. However, data varies greatly in so many other ways, and size indirectly influences some of the most important elements of any data set.

Real-world data is dirty. Demo data almost never is. The most powerful data is thick; most demo data is thin. This is not to say that all real-world data is thick. Quite the contrary. But leveraging the most robust data to build insight on the rest is part of the hands-on training that most companies hope you got elsewhere.

Real-world data is dynamic; most demo data is static. Unless it isn’t… Truth is, most real-world data is a moving target. Some tables are real-time, others are archived, and still more are batched. Try finding that in your demo database, your Kaggle data set, or your .gov CSV.

Aspiring analysts, students, and those looking to progress their skills need real-world data. They need data that is rich, dirty, varied, and sometimes Null. They need hands-on experience with data that has real-world context. Real-world connection is also critical. Remember, data is a representation of the real world. Data that stretches, thins, or distorts that connection is… well, sadly, the standard.

So where does one go to find real-world data experience? Internships and apprenticeships, especially during high school and undergrad, are a great option, assuming the company actually allows you to touch the data! Many Fortune 500s are more concerned with whether you got an invite to the ice cream social than with whether you got access to real-world data. But it is hard to beat a great internship… and ten times harder to find one.

There is another answer. A much easier one. It is a bit like a data plan, one run by Netflix. Over the coming articles, I will lay out what we have been quietly designing over the past few years. For now, know that a platform we’ve code-named TradeCraft is coming. It is an opportunity for all aspiring analysts to access real-world data. An opportunity you can pre-register for below.

This January (2019), we will begin enrolling a handful of beta testers with the goal of launching our platform in Q2. Everyone who pre-registers will have the opportunity to participate and provide feedback during our testing and launch phases. Early registrants will also receive discounts on all new content. Sign up today.

Corsair's Business (FKA Corsair's Publishing): articles that engage, educate, and entertain through analogies, analytics, and… occasionally, pirates!