Effortlessly find the right Data with Semantic Search

Jul 15, 2022

By Jenna Lau-Caruso and Lena Woolf

Imagine if finding the right data were just as easy as finding and listening to music. Most enterprise data consumers struggle with the simple task of finding data. Data analysts and data scientists have a narrow focus on the task at hand when searching for data, and the task description can be very ambiguous.

Finding the right data by browsing data sources is difficult because tables and columns lack business semantics. Plus, data consumers rarely venture outside of their usual data sources to find data in unfamiliar source systems. They need the ability to find and explore data by subject using business semantics and natural language. An intelligent data catalog powered by semantic search can help organizations transform the way data scientists, engineers and stewards find data from internal and external data sources.

Let’s take a look at some examples of how an intelligent search can make a difference. Imagine that you have been asked to find the “most popular contact method for mortgage applicants”. With a purely text-based search, you may not find any suitable data if the columns in the relevant data sources are actually labelled “e-mail address”, “mailing address”, or “phone number”. Gaps like this in the search results could lead to missing critical information or frustrating cycles of searching for data while trying to guess the exact keywords required to find what you need.
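To make the gap concrete, here is a minimal sketch of a word-overlap search run against the three column labels from the example. The function name and matching rule are assumptions chosen for illustration; they are not how any particular catalog implements text search.

```python
# Column labels taken from the example above.
COLUMNS = ["e-mail address", "mailing address", "phone number"]

def keyword_search(query: str, labels: list[str]) -> list[str]:
    """Return the labels that share at least one whole word with the query."""
    query_words = set(query.lower().split())
    return [label for label in labels if query_words & set(label.lower().split())]

hits = keyword_search("most popular contact method for mortgage applicants", COLUMNS)
print(hits)  # prints []
```

None of the query words appear in the column labels, so a purely textual match comes back empty even though all three columns are relevant.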

There is a better way. IBM Watson Knowledge Catalog (WKC) captures the vocabulary and associated semantic meaning of terms used by your business, and classifies data assets with these terms for the purpose of data governance. This semantic layer of knowledge, stored in the form of a Knowledge Graph, can now be used to empower data consumers to find the data they need, when they need it.

So, you head over to IBM Cloud Pak for Data and type the task description into the global search bar:

Search using natural language task description

The search results contain relevant mortgage data sets with columns containing types of contact information about the applicants, such as “e-mail address”, “mailing address”, or “phone number”.

Mortgage assets are found based on types of contact information

Search results are ranked by relevance, ensuring that assets containing contact information related to mortgages are seen first, above records which may have mortgage information without contact details, or contact information for a different context.

In this example, the value of moving beyond a simple text-based search and instead searching by the semantic meaning of a search phrase is immediately clear. This is achieved by leveraging the taxonomy of Business Terms defined using type relationships, as seen in the image below.

Phone Number “Is a type of” “Contact Information”

Based on this knowledge, a search for “contact information” returns not only matches for the words contact and information, but also results for types of contact information, which may include Phone Number as well as types of phone number like Home Phone Number and Cell Phone Number. Notice the ranking in the screenshot below. Immediate types of the business term searched are treated as more relevant than any sub-types.

Found “phone number” business term when searching for “contact information”
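One way to picture this behavior is a walk down the “is a type of” hierarchy, where each level further from the searched term gets a lower score. The sketch below is only an illustration: the in-memory dictionary, the function name, and the decay factor of 0.5 are assumptions, not the ranking formula the product uses.

```python
from collections import deque

# Hypothetical "is a type of" edges: child term -> parent term.
IS_A_TYPE_OF = {
    "Phone Number": "Contact Information",
    "E-mail Address": "Contact Information",
    "Mailing Address": "Contact Information",
    "Home Phone Number": "Phone Number",
    "Cell Phone Number": "Phone Number",
}

def expand_with_types(term: str, decay: float = 0.5) -> dict[str, float]:
    """Breadth-first walk down the type hierarchy, scoring each term by depth."""
    # Invert the edges so we can look up the direct types of a term.
    children: dict[str, list[str]] = {}
    for child, parent in IS_A_TYPE_OF.items():
        children.setdefault(parent, []).append(child)

    scores = {term: 1.0}  # the searched term itself is the best match
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in scores:
                # Each level down is assumed to be less relevant than its parent.
                scores[child] = scores[current] * decay
                queue.append(child)
    return scores

scores = expand_with_types("Contact Information")
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.2f}  {name}")
```

With these assumed weights, Phone Number scores higher than Home Phone Number and Cell Phone Number, mirroring the ordering described above.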

Let’s take a look at another example. Perhaps you have been asked to “identify trends in property values across multiple geographies”. Again, you head over to Cloud Pak for Data to find data to help you achieve this task and type “property values by geo” into the search bar.

The search results bring back a data asset named AREA_AVG_PRICE, which is exactly what you need for your project.

Asset AREA_AVG_PRICE appears at the top of the search results

The AREA_AVG_PRICE asset would never have surfaced in search results using a traditional text-based search because it does not contain any of your search keywords. However, this asset contains columns named CITY and STATE, as well as a column with the business term assignment Home Price. Now it is possible to find this asset by leveraging the semantic knowledge captured by the business term relationships.

To understand the magic taking place here, we can break the search down into two parts: “property values” and “geo”. Behind the scenes, natural language processing helps us parse and understand your search intent each time you search. You may also notice that the results do not include assets simply because the word “by” appears in their name or description. The search understands that this word is not important to the intent of the query.
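As a rough illustration of that first parsing step, the sketch below splits a query into candidate concept phrases wherever a low-information word appears. The stop-word list and function name are hypothetical, and real natural language processing goes well beyond this kind of word list.

```python
# Hypothetical stop-word list; a real NLP pipeline would be far richer.
STOP_WORDS = {"by", "of", "in", "the", "and", "per", "each", "for"}

def split_into_concepts(query: str) -> list[str]:
    """Split a query into candidate concept phrases at stop words."""
    concepts: list[str] = []
    current: list[str] = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            # A stop word closes the phrase collected so far and is dropped.
            if current:
                concepts.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        concepts.append(" ".join(current))
    return concepts

concepts = split_into_concepts("property values by geo")
print(concepts)  # prints ['property values', 'geo']
```

The word “by” never reaches the matching stage, which is why it has no influence on the results.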

The WKC Business Glossary contains the term Property Value with a synonym Home Price and a related term House Selling Price. This synonym relationship satisfies the first part of the search.

“Property Value” is a synonym of “Home Price”

The glossary also contains the business term Geography with an abbreviation geo and a synonym Location. The term Location has several type relationships, including City, State, and Country.

“Geography” is a synonym of “Location”
Location has a type of City, State, and Country.

By understanding that Home Price is synonymous with Property Value and that geo is a colloquial way to refer to Geography, which can be described by types of Locations such as Cities or States, Cloud Pak for Data is able to recognize your intention and find the data you are looking for.
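The chain of reasoning above can be sketched in a few lines. Everything here is an assumption made for illustration: the in-memory glossary dictionaries, the naive plural handling, and matching column names to terms by title-casing them. In the real catalog these relationships live in the Knowledge Graph.

```python
# Hypothetical in-memory glossary mirroring the relationships described above.
SYNONYMS = {"Property Value": {"Home Price"}, "Geography": {"Location"}}
ABBREVIATIONS = {"geo": "Geography"}
TYPES = {"Location": {"City", "State", "Country"}}

# The asset from the example: two columns plus one assigned business term.
ASSET = {
    "name": "AREA_AVG_PRICE",
    "columns": {"CITY", "STATE"},
    "business_terms": {"Home Price"},
}

def resolve(phrase: str) -> set[str]:
    """Expand a query phrase into the glossary terms it may refer to."""
    key = phrase.lower()
    if key in ABBREVIATIONS:
        term = ABBREVIATIONS[key]
    else:
        # Naive normalization (assumption): title-case and drop a trailing "s".
        term = phrase.title().removesuffix("s")
    terms = {term} | SYNONYMS.get(term, set())
    # Add the direct types of every term found so far.
    for known in list(terms):
        terms |= TYPES.get(known, set())
    return terms

def asset_matches(asset: dict, phrases: list[str]) -> bool:
    """True when every query phrase resolves to a term present on the asset."""
    labels = {column.title() for column in asset["columns"]} | asset["business_terms"]
    return all(bool(resolve(phrase) & labels) for phrase in phrases)

print(sorted(resolve("geo")))
print(asset_matches(ASSET, ["property values", "geo"]))  # prints True
```

Neither query phrase appears in the asset name, yet both resolve to terms the asset carries: Home Price for the first part, City and State for the second.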

This level of understanding also enables users to phrase their questions in a way that feels natural to them. In fact, any of the searches below would find the AREA_AVG_PRICE asset ranked near the top in the results:

  • Value of properties by city
  • Home prices and locations
  • Selling price of houses in each state
  • Trends in home prices per geography

Depending on the wording of the search query, results may be ranked differently. Compare our earlier search phrase “property values by geo” with a search for “value of properties by city”.

Targeted semantic search of city properties

The search “property values by geo” prioritized all types of geographies, including both State and City. The search “value of properties by city” is more specific, narrowing the geography down to only “city”. As a result, some assets containing property value and state information will rank lower than in the first search.
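A toy scoring function shows why the two phrasings rank assets differently. The two asset names, the term sets, and the scoring rule are all hypothetical, chosen only to make the effect visible; the actual ranking takes more signals into account.

```python
# Hypothetical type hierarchy and assets, for illustration only.
LOCATION_TYPES = {"City", "State", "Country"}

ASSETS = {
    "PRICES_BY_CITY": {"Home Price", "City"},
    "PRICES_BY_STATE": {"Home Price", "State"},
}

def geography_terms(query_term: str) -> set[str]:
    """A broad term expands to all of its types; a specific type matches only itself."""
    return LOCATION_TYPES if query_term == "Geography" else {query_term}

def rank(query_term: str) -> list[tuple[str, int]]:
    """Score one point for the price term plus one per matching geography term."""
    wanted = geography_terms(query_term)
    scored = [
        (name, int("Home Price" in terms) + len(terms & wanted))
        for name, terms in ASSETS.items()
    ]
    # Highest score first; ties are broken alphabetically by asset name.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

broad = rank("Geography")  # as in "property values by geo"
narrow = rank("City")      # as in "value of properties by city"
print(broad)
print(narrow)
```

Under the broad query both assets tie, while the narrower query leaves the state-level asset with a lower score than the city-level one.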

Our video shows semantic search in action:

IBM Data Fabric Semantic Search powered by Knowledge Graph provides intuitive search for data with natural language and semantic augmentation of user queries, as well as intelligent ranking and enforcement of asset access control.
