Why Data Science?

Christopher Berry
CBC Digital Labs
Nov 16, 2020
A fine selection of the finest Canadian Geese hanging out on the water. Source: pixabay, Benjcoll

Why data science? Because it’s often to your advantage to do data science.

When Mike Loukides told us all about data science in 2010, he highlighted the importance of business, statistical, and technological capability. What was changing at the time was the volume of data being produced and the increasing sophistication of the problems that data was being used to solve. Because not every question is interesting, the data scientist needs to understand business. Because not every answer is simple, the data scientist needs to understand statistics. Because not every solution can be done in Excel, the data scientist needs to understand technology. Data science, from its inception, has always been about the transformation of data into product. We are in the Information Age. It's often to your advantage to do real data science.

In order to make that argument persuasively, you should understand two sets of distinctions. One set of distinctions is about technology — specifically the ability to write instructions to a computer. The other set of distinctions is about your relationship with information.

No Code, Low Code, High Code

When barriers fall, more people can do more things. Pressurized airframes and more fuel-efficient engines lowered the barrier to commercial flight so that more people could fly. Simplified interfaces lowered the barrier to distributing pictures on the Internet.

Barriers fall because technologists make them fall.

There are barriers to collecting, reading, interpreting, and acting on information. One of the greatest barriers is the skill of writing instructions to computers. Computers are fantastically complex machines. And yet, they still don’t understand ambiguity very well. They demand precise, specific, unambiguous instructions. That demand for specificity is difficult to respond to and creates a barrier.

A High Code environment usually involves a Command Line Interface (CLI), a text editor (sometimes with assistive features like linters), and sometimes a more assistive environment (Jupyter, RStudio, BeakerX). Often, the data scientist is polyglot, hiring Python for some jobs, Julia for others, and JavaScript for still others. They may use different querying languages (Pig Latin, SQL, Cypher) depending on where their data resides.
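
To make the High Code register concrete, here is a minimal sketch in Python, pairing a SQL query with a bit of pandas analysis. The store, table, and column names are invented for illustration; a real workflow would point at a real warehouse.

```python
import sqlite3
import pandas as pd

# An in-memory stand-in for a real analytics store; the table and
# column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (date TEXT, visitor_id INTEGER);
    INSERT INTO visits VALUES
        ('2020-11-01', 1), ('2020-11-01', 2),
        ('2020-11-02', 1), ('2020-11-03', 3);
""")

# SQL for retrieval: daily session counts.
query = """
    SELECT date, COUNT(*) AS sessions
    FROM visits
    GROUP BY date
    ORDER BY date
"""
daily = pd.read_sql_query(query, conn, parse_dates=["date"])

# pandas for analysis: a short rolling mean to smooth the series.
daily["sessions_smoothed"] = daily["sessions"].rolling(2, min_periods=1).mean()
print(daily)
```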

A Low Code environment involves typing, in that the fingers need to be engaged to type verbs and nouns inside some kind of larger application. The classic Low Code environment is Excel. Excel has around 500 functions, with around 40 doing the bulk of the heavy lifting. Many Business Intelligence middleware applications, just beneath their No Code front ends, feature Low Code environments. PSPP's syntax editor is in the Low Code category, with around 250 primary commands and a large set of subcommands.
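
To make the "handful of workhorse functions" claim concrete, here is a rough pandas analogue of three of the most common spreadsheet functions. The tables are toy data, invented for illustration.

```python
import pandas as pd

# Toy tables, invented for illustration.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "amount": [100, 250, 175, 90],
})
prices = pd.DataFrame({"sku": ["A", "B"], "price": [9.99, 19.99]})
orders = pd.DataFrame({"sku": ["A", "A", "B"], "qty": [2, 1, 3]})

# Roughly SUMIF: total sales where region is "East".
east_total = sales.loc[sales["region"] == "East", "amount"].sum()

# Roughly AVERAGEIF: average sale where region is "West".
west_average = sales.loc[sales["region"] == "West", "amount"].mean()

# Roughly VLOOKUP: attach each order's price from the price table.
orders = orders.merge(prices, on="sku", how="left")

print(east_total, west_average)
print(orders)
```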

A No Code environment is one where the instructions given to the computer are created by simplified user interface components. They abstract away complexity through the hyperlink, the drop-down, the radio button, and the checkbox. Often, the only time fingers hit the keyboard is for search.

Descriptive, Predictive, Prescriptive

Different people have different relationships with information. The Descriptive, Predictive, Prescriptive structure is a useful way to describe those relationships. The realms are often tied to very specific jobs people are hiring data to do for them.

The Descriptive realm is all about collecting, aggregating, and organizing information about the past. Examples of Descriptive artifacts include the Excel Dashboard, the PDF report, and the PowerPoint presentation summarizing everything that happened. Examples of Descriptive products include the Business Intelligence Performance Dashboard, and pretty much any report in a digital analytics product. All the usual information presentation methods are there: in particular the time series, the cumulative count of metrics that can be accumulated, the time-constrained count of metrics that cannot, the count, the rate, the ratio, the listicle, the table, the bar chart, the scatter plot, and sometimes, even, horror of horrors, the pie chart and the stacked bar chart. Analysis in this realm is restricted to statements about the past and anomaly declarations.
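
Here is a small sketch of the Descriptive realm's staples, the count, the rate, the ratio, and the cumulative count, computed over a hypothetical table of page-view events.

```python
import pandas as pd

# Hypothetical page-view events, one row per view.
views = pd.DataFrame({
    "date": pd.to_datetime(["2020-11-01", "2020-11-01", "2020-11-02",
                            "2020-11-02", "2020-11-03"]),
    "visitor": ["a", "b", "a", "c", "a"],
    "converted": [0, 1, 0, 1, 1],
})

daily_views = views.groupby("date").size()                    # the count
conversion_rate = views.groupby("date")["converted"].mean()   # the rate
unique_visitors = views.groupby("date")["visitor"].nunique()
views_per_visitor = daily_views / unique_visitors             # the ratio
cumulative_views = daily_views.cumsum()                       # the cumulative count

summary = pd.DataFrame({
    "views": daily_views,
    "conversion_rate": conversion_rate,
    "views_per_visitor": views_per_visitor,
    "cumulative_views": cumulative_views,
})
print(summary)
```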

The Predictive realm is all about predicting the future. Examples of Predictive artifacts include the projection and the forecast. Upstream processes create the factor analysis, the error estimates, a few Ordinary Least Squares regressions, the causal random forest, and even, sometimes, insights about cause and effect. Downstream processes create the targets, expectations, and emotions of satisfaction or dissatisfaction.
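
As a minimal sketch of the Predictive realm, here is an Ordinary Least Squares trend fitted to a short, invented weekly series, projected one step ahead with a crude error estimate attached.

```python
import numpy as np

# An invented weekly metric: eight observations of some count.
weeks = np.arange(8)
metric = np.array([120.0, 132.0, 128.0, 141.0, 150.0, 148.0, 160.0, 167.0])

# Ordinary Least Squares on a linear trend (a degree-1 polynomial fit).
slope, intercept = np.polyfit(weeks, metric, deg=1)

# The projection: one week ahead.
forecast = slope * 8 + intercept

# A crude error estimate: the standard deviation of the residuals.
residuals = metric - (slope * weeks + intercept)
error = residuals.std(ddof=2)  # two fitted parameters

print(f"week 8 forecast: {forecast:.1f} +/- {error:.1f}")
```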

The Prescriptive realm is all about creating better futures. Examples of Prescriptive products include the outcome you are trying to achieve, the automated decisions from the recommendation engine, and new experience development. Upstream processes create knowledge from the Predictive realm. Downstream data exhaust includes information from the Descriptive realm about what happened in the past.
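
And a sketch of the Prescriptive realm: an epsilon-greedy decision rule standing in for a recommendation engine, choosing which hypothetical item to surface and learning from the outcome. This illustrates an automated decision loop, not any particular production system.

```python
import random

# Hypothetical items a recommendation engine could surface.
items = ["article_a", "article_b", "article_c"]
shows = {item: 0 for item in items}    # how often each item was recommended
clicks = {item: 0 for item in items}   # how often the recommendation worked
EPSILON = 0.1                          # how often to explore at random

def choose_item():
    """Epsilon-greedy: usually exploit the best-known item, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(items)
    return max(items, key=lambda i: clicks[i] / shows[i] if shows[i] else 0.0)

def record_outcome(item, clicked):
    """Feed the decision's outcome back in: the Descriptive data exhaust."""
    shows[item] += 1
    clicks[item] += int(clicked)

# One turn of the automated loop: decide, act, observe, learn.
item = choose_item()
record_outcome(item, clicked=(random.random() < 0.3))  # simulated response
print(item, shows, clicks)
```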

Putting The Three Together

Putting the three by three together and populating it with a few products and platforms (not exhaustive!), you can see how these niches form. These are products that people use to create the artifacts they want to create.

[Image: a three-by-three grid of example products and platforms at each Code level and Realm intersect]

These are the principal users at these intersects. These are the people who use the products to create the artifacts they want to create.

[Image: a three-by-three grid of the principal users at each intersect]

These are the principal creators or builders of the products themselves at these intersects. These are the people who create the products that enable others, downstream, to create the artifacts they want to create.

[Image: a three-by-three grid of the principal builders at each intersect]

It’s often to your advantage to do data science

When I was about 5, I remember asking why Christopher Columbus didn't just fly to India. The answer came back that they didn't know how to fly back then. That was all rather confusing. Why didn't they know how to fly? Because they didn't know how. Why not? Because they just didn't know any better. But why? Christopher: enough.

It’s kind of funny to think about now, isn’t it? What didn’t Christopher Columbus know that blocked him from building and flying a plane?

The answer of not knowing any better has stuck with me. How much faster is it possible to know better?

Science is the best way we have to know better, faster.

Data science is science. It's kind of funny that data is attached at the front. After all, it's writing, not word writing. It's music, not sound music. It's statistics, not chance statistics. The great thing about science is that the more you know, the more you don't know. And contained within science itself, the scientific method, there's a wonderful little algorithm that tries to make sure that you're always learning. It's just sort of hard to imagine learning without data. I concede that it's possible.

There’s always something more to know. And because there’s always something more to know, there’s a reason why things can get better. There’s always a new way to lower barriers to advance our condition.

In spite of all the data we have, all we know, we can still know better.

Deep, valuable mysteries demand that data scientists use High Code products to solve them. Right now, our greatest, most valuable mysteries are in the Predictive and Prescriptive realms. To be sure, there are still mysteries in the Descriptive realm. Real barriers remain. However, the most valuable mysteries are in the future, about the future.

And there are dozens of incredibly interesting, valuable, and cool mysteries still to solve. We can know a lot better. And it is generally to our advantage to know better. Data scientists feel multiple pulls. Ahead of them, there are choices to pursue solutions in the Prescriptive realm, to be acted upon automatically by silicon. There are also choices to convert that knowledge into No Code environments, to be manually acted upon by carbon. Either way: barriers fall.

No Code and Low Code experiences are fantastic for discovering local optimizations and refining heuristics. They are great for that because they reduce uncertainty in situations where a periodic action is necessary, to say nothing of the sociological use cases we won't cover here. No Code and Low Code experiences lower barriers to information so that people can learn and hone their judgement. They are fantastic because they lower the barriers.

At CBC, there are a few mysteries that are to our advantage to solve.

There's a mystery about the structure of Canadians' attention. It has been known for years that, in aggregate, Canadians' attention to media follows an Elephant Curve when you zoom out to the week or month. It's called an Elephant Curve because it looks like a line of elephants joined trunk to tail. There are a few obvious reasons for the overall structure. There are a few that are less obvious. If we understood why it varies the way it does, we could serve Canadians better.

There's a mystery about the relationship among awareness, interest cycles, and engagement. Canadians frequently open themselves up to serendipity. The vast majority of Canadians weren't born yesterday. And if you were born today or yesterday, welcome! They often have latent, or hidden, preferences about what they find salient, and what they do not. Some cycles are predictable. Some cycles are less so. If we understood more about the interaction among those variables, and made them visible in No Code environments, we could serve Canadians better.

Data Scientists at CBC do Data Science because there are mysteries that it is to all of our advantage to solve. The shameless plug: if you are a Data Scientist, check us out.

Why data science? Because it’s often to your advantage to do data science.
