What is data science, really?

It may be simpler than you think … and it holds power to transform public outcomes

Jacob Rozran
Mar 8, 2021
Image: a long and winding road, like data science and its iterative nature. Photo by Yaroslav Shuraev from Pexels.

If you search online for “data science,” you’ll see snappy graphics showing it’s a blend of statistics, problem solving, computer science and engineering, data visualization, communication, and domain expertise.

You’ll also find dense content about the latest machine learning (ML) or artificial intelligence (AI) tools, along with their custom-fitted hardware. Heck, even Apple’s newest iPhones have custom hardware and AI built into them.

I’ve spent years mastering the skills above — and even embracing the new hotness in ML and AI. But I worry this spin makes data science less approachable. Worse, I think it confuses people about what data science really is.

In the end, data science is about finding data-supported answers to real problems. While it’s enticing to jump straight to the bleeding edge of the spectrum, data science is a long and iterative path.

Data science is rooted in communication. A data scientist must have constant conversations with stakeholders. These can include the executive asking questions, the database administrator who knows the data best, or the customer affected by decisions. A data scientist works for all of these people and needs their input to be successful.

All of these discussions feed into issue discovery. For example, it’s easy to ask: “How many users do we have on our platform?” or “What should we expect in sales this holiday season?”

A data scientist must understand the context of these questions. For the first question posed above: What type of users? Web users? Active users? If it is active users, active since when? Paying customers? Anyone who has ever logged in? You get the point.
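To make that concrete, here is a minimal sketch of how one question can have several honest answers. The user records, the field names (`last_login`, `is_paying`), and the 30-day activity window are all made up for illustration; a real platform would have its own definitions.

```python
from datetime import date, timedelta

# Hypothetical user records; field names and values are assumptions.
users = [
    {"id": 1, "last_login": date(2021, 3, 1), "is_paying": True},
    {"id": 2, "last_login": date(2021, 1, 15), "is_paying": False},
    {"id": 3, "last_login": None, "is_paying": False},  # signed up, never logged in
    {"id": 4, "last_login": date(2020, 6, 30), "is_paying": True},
]


def count_users(users, as_of, active_window_days=30):
    """Count the same user list under several definitions of 'user'."""
    cutoff = as_of - timedelta(days=active_window_days)
    return {
        "all_accounts": len(users),
        "ever_logged_in": sum(u["last_login"] is not None for u in users),
        "active": sum(
            u["last_login"] is not None and u["last_login"] >= cutoff
            for u in users
        ),
        "paying": sum(u["is_paying"] for u in users),
    }


counts = count_users(users, as_of=date(2021, 3, 8))
for definition, n in counts.items():
    print(f"{definition}: {n}")
```

Four accounts produce four different numbers depending on the definition, which is why the clarifying conversation has to come first.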

After clarifying the question, the data scientist zooms out. Why is this person asking that question? What are they going to do with that information? Is it worth spending a ton of effort to provide it? This all informs how to tackle the problem and the lens for exploring the data.

When it’s time to explore the data, it can get messy. There will be nonsensical values, missing values, duplicate values, etc. Data changes over time and is sometimes in weird formats.
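A first pass at that mess can be as simple as counting the problems. The sketch below uses invented rows and an assumed valid range for a hypothetical `age` field; the checks themselves (missing, out of range, exact duplicates) are just one reasonable starting set.

```python
from collections import Counter

# Made-up rows for illustration only.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": -5},    # nonsensical value
    {"id": 3, "age": -5},    # exact duplicate of the previous row
    {"id": 4, "age": 51},
]


def profile(rows, field, valid_range):
    """Count missing, out-of-range, and duplicated entries for one field."""
    low, high = valid_range
    values = [r[field] for r in rows]
    missing = sum(v is None for v in values)
    out_of_range = sum(
        v is not None and not (low <= v <= high) for v in values
    )
    # Treat two rows as duplicates only when every field matches.
    row_counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicate_rows = sum(n - 1 for n in row_counts.values() if n > 1)
    return {
        "rows": len(rows),
        "missing": missing,
        "out_of_range": out_of_range,
        "duplicate_rows": duplicate_rows,
    }


report = profile(rows, field="age", valid_range=(0, 120))
print(report)
```

The counts do not fix anything on their own, but they give you something specific to bring back to the people who know the data.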

Through it all, the goal is to really understand the data. This is a collaborative process that calls for more discussions with stakeholders, which can lead to more questions.

Sometimes, predictive analytics is necessary. This can involve cool stuff like random forests, gradient-boosted machines, and neural networks, but regression is often the simpler choice. It’s usually best to start with small tests. Data scientists use these to see if prediction moves the needle in a meaningful way.
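Here is a rough sketch of what such a small test might look like: fit a straight line on a handful of points, then check whether it beats the naive guess of "always predict the average" on data it has not seen. The numbers are invented (think ad spend versus sales), and the hand-rolled least-squares fit is only there to keep the example dependency-free.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_absolute_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Made-up data, split into a training set and a small holdout set.
train_x, train_y = [1, 2, 3, 4, 5, 6], [12, 15, 19, 22, 24, 29]
test_x, test_y = [7, 8], [31, 35]

slope, intercept = fit_line(train_x, train_y)

# Compare the regression against a naive baseline on the holdout set.
model_error = mean_absolute_error(
    test_y, [slope * x + intercept for x in test_x]
)
baseline_error = mean_absolute_error(
    test_y, [mean(train_y)] * len(test_y)
)

print(f"regression error: {model_error:.2f}")
print(f"baseline error:   {baseline_error:.2f}")
```

If the simple model does not clearly beat the baseline, that is a signal to revisit the question or the data before reaching for anything fancier.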

It’s worth noting that prediction comes in a lot of flavors. In marketing, you’re predicting who will buy your product; in finance, you’re predicting who will (or won’t) pay you back; and in computer vision, you’re predicting what an object is.

Turning raw data into relevant insight is the hallmark of data science done right. But it’s only half the battle. Data scientists still have to communicate their findings to the outside world.

In this regard, data science is about thoughtfully crafting messages. It’s about knowing the best ways to convey them, too. This comes in the form of presentations and dashboards using simple words, tables, charts, and graphs. Data science should boil down complex information into consumable bits.

It’s rare for data scientists to get through the data discovery and storytelling steps without anomalies popping up. Investigating these quirks is often referred to as advanced analytics, and the investigations frequently turn into their own presentations and dashboards.

Also: Data science is not pie charts or 3-D effects. Period.

Data scientists often get access to sensitive information and are asked to make powerful tools that affect real people in real ways.

Knowing what the outputs are being used for, and why, is paramount. Data scientists should recognize when that use crosses an ethical boundary, and speak up when they see it. We all come with our biases and need to make sure we are not propagating them through our work. That can be easy to forget when you are steeped in data.

No doubt, data science is a lot of things. But primarily it is problem solving. It fits within every business and industry — especially government.

In the public sector, data science can empower more informed policies, and even predict their outcomes. It can improve operations on both the strategic and day-to-day fronts. It can bolster transparency and accountability, ultimately building public trust. And it can literally save lives, as evidenced during this pandemic.

With hundreds of data scientists joining the ranks of the federal government, it’s exciting to think about what’s on the horizon. We firmly believe that government is better able to serve the public when data belongs to everyone. After all, public data is a tool we all have a right to leverage. In that regard, data science is for all of us.

CivicActions

Modern and accessible government digital services