Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity

Armand Ruiz
5 min read · Aug 23, 2017


There’s no doubt that data scientists are in high demand. A job that didn’t exist a decade ago topped Glassdoor’s ranking of best roles in America for two years in a row based on salary, job satisfaction, and number of job openings. It was even dubbed the “sexiest job of the 21st century” by Harvard Business Review. So, what is driving the appetite for data scientists and their unique skill sets?

Analytics is changing the world, and applications of analytics are proliferating in every industry. It’s no longer just about canned reports and accounting — the use cases range from cracking down on crime to fighting poverty to lobbying voters in elections to enhancing cancer survival rates. The data revolution is in full swing and the possibilities seem to be endless, so it’s no surprise that data professionals are at the top of recruiters’ “most wanted” lists.

So what’s the problem?

Data scientists are scarce and busy. IBM recently published a study showing that demand for data scientists and analysts is projected to grow by 28 percent by 2020, and data science and analytics job postings already stay open five days longer than the market average. Unless something big changes, the skills gap will continue to widen.

Against this backdrop, helping your data scientists work more productively should be a key priority — which is why the news that data scientists spend only 20 percent of their time on analysis is a problem you need to address (and soon).

The reason you hire data scientists in the first place is to develop algorithms and build machine learning models — which are typically the parts of the job that they enjoy most. Yet in most companies, the so-called “80/20 rule” applies: 80 percent of a data scientist’s valuable time is spent simply finding, cleansing, and organizing data, leaving only 20 percent to actually perform analysis.

Hard work behind the scenes

At the beginning of any new analytics initiative, data scientists must identify relevant data sets, which is no small task. Many organizations’ data lakes have turned into dumping grounds with no easy way to search for data and little incentive to share it. Data scientists may need to contact different departments to beg for the data they need and wait weeks for it to be delivered, only to find that it doesn’t provide the information they need or has serious quality issues. At the same time, responsibility for data governance often falls to them, since corporate-level governance policies are confusing, inconsistent, or difficult to enforce.

Even when they can get their hands on the right data, data scientists need to spend time exploring and understanding it. For example, they might not know what a set of fields in a table is referring to at first glance, or data may be in a format that can’t be easily understood or analyzed. There is usually little to no metadata to help, and they may need to seek advice from the data’s owners to make sense of it.
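To make this concrete, here is a minimal first-pass profiling sketch in Python, assuming pandas is available and the data set has already been located; the file name and column references are hypothetical. A few lines like these are often the only way to learn what fields mean and how much is missing when no metadata exists.

```python
# A minimal first-pass profiling sketch (pandas assumed; the file name
# and any column values shown are hypothetical).
import pandas as pd

df = pd.read_csv("customer_transactions.csv")  # hypothetical data set

# What columns exist, and what types did pandas infer?
print(df.dtypes)

# How much of each column is missing?
print(df.isna().mean().sort_values(ascending=False))

# Quick distribution summary for numeric fields
print(df.describe())

# Cardinality of text-like columns, to spot free text or ID fields
for col in df.select_dtypes(include="object").columns:
    print(col, df[col].nunique())
```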

Once they have wrangled the data, there’s yet another laborious task to perform: preparing it for analysis. This step involves formatting, cleaning, and sometimes sampling the data. In some cases, they may also have to apply scaling, decomposition, or aggregation transformations before they are ready to start training their models.
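As an illustration of those steps, the sketch below builds a small preparation pipeline with pandas and scikit-learn, assuming a tabular data set; the file name, column names, target, and 10 percent sample rate are placeholders rather than recommendations.

```python
# A sketch of the preparation steps described above (scikit-learn assumed;
# all names and the sample rate are illustrative, not prescriptive).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("raw_extract.csv")           # hypothetical raw data set
df = df.drop_duplicates()                     # basic cleaning
sample = df.sample(frac=0.1, random_state=0)  # optional sampling for faster iteration

numeric_cols = ["age", "income"]              # illustrative column names
categorical_cols = ["region", "segment"]

# Impute missing values, scale numeric fields, and one-hot encode categoricals
prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

X = prep.fit_transform(sample.drop(columns=["churned"]))  # "churned" is a hypothetical target
y = sample["churned"]
```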

Why it’s such a conundrum

These processes can be time-consuming and tedious. But it’s crucial to get them right since a model is only as good as the data used to build it. And because models generally improve as they are exposed to increasing amounts of data, it’s in data scientists’ interests to include as much data as they can in their analysis.

However, in the real world, every project has a deadline. Consequently, data scientists can be tempted to make compromises on the data they use, aiming for “good enough” rather than optimal results.

The problem is that with machine learning models, “good enough” often just isn’t good enough. Making hasty decisions or cutting corners during model development and training can lead to widely different outputs and potentially render a model unusable when it’s put into production. Data scientists are constantly making judgment calls on how to approach an analytical problem. Starting out with bad or incomplete data can easily lead them down the wrong path.

Due to the need to balance quality against time constraints, data scientists are generally forced to focus their energies on one model at a time. This means that if they haven’t chosen the right line of inquiry, they are forced to drop everything and start all over again. In effect, they’re obliged to double down on every hand, turning data science into a high-stakes, high-risk game of chance.

Escaping these pitfalls

Data scientists’ time is precious. So how can you help them work to their full potential? The answer is to use automation to give them more time for analysis without compromising the quality of the data they use.

IBM Data Catalog, a new beta solution that’s part of Watson Data Platform, offers tools to automate and simplify data discovery, curation, and governance. Intelligent search capabilities help data scientists find the data they need, while metadata such as tags, comments, and quality metrics help them decide whether a data set will be useful to them and how best to extract value from it. Integrated data governance gives data scientists confidence that they are permitted to use a given data set and that the models and results they produce are used responsibly by others in the organization.
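As a purely illustrative toy, and not the Data Catalog API itself, the snippet below shows how catalog-style metadata such as tags and a quality score can narrow a search to a few promising data sets before anyone opens the underlying files; every record and field name here is hypothetical.

```python
# Hypothetical catalog-style metadata records; this is NOT the IBM Data
# Catalog API, only an illustration of how tags and quality metrics help
# a data scientist shortlist candidate data sets.
catalog = [
    {"name": "claims_2016", "tags": {"insurance", "claims"}, "quality_score": 0.92, "owner": "finance"},
    {"name": "claims_raw_dump", "tags": {"insurance"}, "quality_score": 0.41, "owner": "unknown"},
    {"name": "policy_master", "tags": {"insurance", "policies"}, "quality_score": 0.88, "owner": "underwriting"},
]

def find_candidates(records, required_tags, min_quality):
    """Return data sets that carry the required tags and meet a quality bar."""
    return [r for r in records
            if required_tags <= r["tags"] and r["quality_score"] >= min_quality]

print(find_candidates(catalog, {"insurance", "claims"}, 0.8))
# Only "claims_2016" passes both the tag filter and the quality threshold.
```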

Rather than limiting data scientists to working on one model at a time, the goal is to give them the time they need to build and train multiple models simultaneously. This spreads out the risk of analytics projects, encouraging the experimentation that yields breakthroughs instead of focusing resources on a single approach that may be a dead end.

Making it easy for data scientists to save, access, and extend models allows them to use existing assets as templates for new projects instead of starting from scratch every time. The concept of transfer learning, which preserves the knowledge gained while solving one problem and applies it to a different but related problem, is a hot topic in the machine learning world. By developing visualizations to communicate how models work, solutions like Data Catalog promote re-use, saving time and reducing risk.
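For readers unfamiliar with the idea, here is a minimal transfer-learning sketch in Keras, independent of Data Catalog: a pretrained image model is frozen and reused as a feature extractor, and only a small new head is trained for a related task.

```python
# A minimal transfer-learning sketch (Keras assumed; any pretrained base would do).
# The pretrained model is kept frozen so its learned knowledge is preserved,
# and only the small new head is trained on the related problem.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False  # preserve the knowledge already learned

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new binary task
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(new_task_dataset, epochs=5)  # train only the new head on the related task
```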

Data scientists play an essential role in pushing forward innovation and garnering competitive advantage for companies. Disruptive solutions like Data Catalog give data science teams the tools to transform their workflows, break the 80/20 rule, and reclaim much of the time that they’re currently wasting on discovery and cleansing.

Request access to IBM Data Catalog beta today

Originally published at www.ibm.com on August 23, 2017.

Armand Ruiz

Lead Product Manager, IBM Watson Studio and Machine Learning