Getting Your Data Ready for AI

O'Reilly Media
Sep 24, 2020 · 5 min read

Editor’s Note: Preparing data is a crucial and unavoidable part of any data scientist’s job. In this post, writer Kate Shoup takes a closer look at the data bottleneck that affects so many projects and at how to address it.

Most people enter the field of data science because “they love the challenge of developing algorithms and building machine learning models that turn previously unusable data into valuable insight,” writes IBM’s Sonali Surange in a 2018 blog post. But these days, Surange notes, “most data scientists are spending up to 80 percent of their time sourcing and preparing data, leaving them very little time to focus on the more complex, interesting and valuable parts of their job.” (There’s that 80% figure again!)

This bottleneck in the data-wrangling phase exists for various reasons. One is the sheer volume of data that companies collect, compounded by limited means of locating that data later. As organizations “focus on data capture, storage, and processing,” write Limburn and Taylor, they “have too often overlooked concerns such as data findability, classification and governance.” In this scenario, “data goes in, but there’s no safe, reliable or easy way to find out what you’re looking for and get it out again.” Unfortunately, observes Katharine Jarmul, the burden of sifting through this so-called data lake often falls on the data science team.

Another reason for the data-wrangling bottleneck is the persistence of data silos. Data silos, writes AI expert Edd Wilder-James in a 2016 article for Harvard Business Review, are “isolated islands of data” that make it “prohibitively costly to extract data and put it to other uses.” Some data silos are the result of software incompatibilities — for example, when data for one department is stored on one system, and data for another department is stored on a different and incompatible system. Reconciling and integrating this data can be costly. Other data silos exist for political reasons. “Knowledge is power,” Wilder-James explains, “and groups within an organization become suspicious of others wanting to use their data.” This sense of proprietorship can undermine the interests of the organization as a whole. Finally, silos might develop because of concerns about data governance. For example, suppose that you have a dataset that might be of value to others in your organization but is sensitive in nature. Unless you know exactly who will use that data and for what, you’re more likely to cordon it off than to open it up to potential misuse.

In addition to prolonging the data-wrangling phase, the existence of data lakes and data silos can severely hamper your ability to locate the best possible data for an AI project. This will likely affect the quality of your model and, by extension, the quality of the broader organizational effort that your project is meant to support. For example, suppose that your company’s broader organizational effort is to improve customer engagement, and as part of that effort it has enlisted you to design a chatbot. “If you’ve built a model to power a chatbot and it’s working against data that’s not as good as the data your competitor is able to use in their chatbot,” says Limburn, “then their chatbot — and their customer engagement — is going to be better.”

Solutions

One way to ease the data-wrangling bottleneck is to try to address it up front. Katharine Jarmul champions this approach. “Suppose you have an application,” she explains, “and you’ve decided that you want to use activity on your application to figure out how to build a useful predictive model later on to predict what the user wants to do next. If you already know you’re going to collect this data, and you already know what you might use it for, you could work with your developers to figure out how you can create transformations as you ingest the data.” Jarmul calls this prescriptive data science, which stands in contrast to the much more common approach: reactionary data science.
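As a rough illustration of creating transformations as you ingest the data, the sketch below cleans each activity event and derives a feature at the moment it arrives, rather than leaving that work for a later wrangling pass. The event fields (user, action, ts), the function names, and the in-memory list used as a store are all hypothetical choices made for this example; they are not drawn from Jarmul's description.

```python
from datetime import datetime, timezone


def transform_event(raw: dict) -> dict:
    """Normalize one raw activity event at ingestion time (hypothetical schema)."""
    ts = datetime.fromtimestamp(raw["ts"], tz=timezone.utc)
    return {
        "user_id": str(raw["user"]).strip(),     # consistent string IDs
        "action": raw["action"].strip().lower(),  # consistent action names
        "timestamp": ts.isoformat(),              # one timestamp format, in UTC
        "hour_of_day": ts.hour,                   # a feature a next-action model might use
    }


def ingest(raw_events, store):
    """Transform each event as it arrives and append it to the store."""
    for raw in raw_events:
        store.append(transform_event(raw))
    return store


# Two raw events with inconsistent formatting, as an application might emit them.
events = [
    {"user": " 42 ", "action": " Search ", "ts": 1600948800},
    {"user": 42, "action": "ADD_TO_CART", "ts": 1600952400},
]
store = ingest(events, [])
```

Because every stored record already shares one schema, a later modeling step can read the store directly instead of first reconciling formats.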

Maybe it’s too late in the game for that. In that case, there are any number of data catalogs to help data scientists access and prepare data. A data catalog centralizes information about available data in one location, enabling users to access it in a self-service manner. “A good data catalog,” writes analytics expert Jen Underwood in a 2017 blog post, “serves as a searchable business glossary of data sources and common data definitions gathered from automated data discovery, classification, and cross-data source entity mapping.” According to a 2017 article by Gartner, “demand for data catalogs is soaring as organizations struggle to inventory distributed data assets to facilitate data monetization and conform to regulations.” Examples of data catalogs include the following:

  • Microsoft Azure Data Catalog
  • Alation Catalog
  • Collibra Catalog
  • Smart Data Catalog by Waterline
  • Watson Knowledge Catalog
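To make the idea of a searchable glossary of data sources concrete, here is a toy, in-memory sketch. It is not how any of the products listed above work; the CatalogEntry and DataCatalog classes, their fields, and the sample datasets are invented for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset (hypothetical fields)."""
    name: str
    source: str
    description: str
    tags: set = field(default_factory=set)


class DataCatalog:
    """A toy catalog: one central place to register and search dataset metadata."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list:
        """Return sorted names of datasets whose name, description, or tags match."""
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.description.lower()
            or kw in {t.lower() for t in e.tags}
        )


catalog = DataCatalog()
catalog.register(CatalogEntry("crm_contacts", "sales_db",
                              "Customer contact records", {"customer", "pii"}))
catalog.register(CatalogEntry("web_clickstream", "event_log",
                              "Page views and clicks per session", {"behavior"}))
catalog.register(CatalogEntry("support_tickets", "helpdesk",
                              "Tickets opened by customers", {"support"}))

customer_data = catalog.search("customer")
sensitive_data = catalog.search("PII")
```

The catalog holds only metadata, not the data itself, which is what lets someone discover that a dataset exists, and that it is tagged as sensitive, before requesting access to it.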

Beyond data catalogs, which surface data for AI projects, several kinds of tools support other data-science tasks, including connecting to data sources to access data, labeling data, and transforming data. These include the following:

Database query tools
Data scientists use tools such as SQL, Apache Hive, Apache Pig, Apache Drill, and Presto to access and, in some cases, transform data.

Programming languages and software libraries
To access, label, and transform data, data scientists employ tools like R, Python, Spark, Scala, and pandas.

Notebooks
These programming environments, which include Jupyter, IPython, knitr, RStudio, and R Markdown, also aid data scientists in accessing, labeling, and transforming data.
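As a small, hedged illustration of the access, label, and transform tasks these tools support, the sketch below uses SQL (through Python's built-in sqlite3 module) to access and aggregate data, then plain Python to attach a label. The orders table, its contents, and the 100.0 threshold are made up for the example; a real project would query an existing data source rather than build one in memory.

```python
import sqlite3

# Stand-in data source: an in-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 120.0), ("ben", 35.5), ("ana", 80.0), ("cy", 10.0)],
)

# Access and transform: total spend per customer, computed in SQL.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()


def label(total, threshold=100.0):
    """Label a customer by total spend (the threshold is an arbitrary assumption)."""
    return "high_value" if total >= threshold else "standard"


# Label: attach a class to each aggregated row, ready for a downstream model.
labeled = [(customer, total, label(total)) for customer, total in rows]
```

The same three steps could be run cell by cell in a notebook, which makes it easy to inspect the intermediate rows before labeling them.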

Kate Shoup grew up reading under the covers with a flashlight well past her bedtime. Now, Kate does more than just read books — she edits and writes them, too. For more than 20 years Kate has worked as an independent publishing professional. She has written more than 50 books on a mish-mash of topics and edited hundreds more.
