Harnessing the Power of Generative AI: Combining Structured and Unstructured Data for Predictive Modeling

Boran Morvaj
bain-inside-advanced-analytics
Jan 4, 2024

Authors: Boran Morvaj, Josef Rieder

In today’s data-driven world, organizations rely on predictive models to make critical business decisions. Prediction models based on structured data, for which good and robust solutions exist, are still the analytics backbone of many industries. Examples of structured data include customer demographics, sales data, and financial records. Classical machine learning models, particularly gradient boosting models, usually perform best, or perform about as well as more complex models at a better effort-to-outcome ratio.

The challenge is how such models can be improved further. As usual in data science, improvements must come on both dimensions: data and analytics. What kind of data would add the most information if we already use all relevant numeric and categorical fields from our database? It is unstructured data that can give prediction models an additional edge. The use of generative AI to incorporate unstructured data and enhance structured data models is becoming more common and can provide significant improvements in accuracy and effectiveness. Generative AI not only unlocks additional potential in predictive modelling but can also transform how companies and their employees interact with machine learning models.

In this article, we will explore:

· how generative AI can be used to enhance prediction models based on structured data by also leveraging unstructured data,

· how generative AI can improve the predictive modelling process for the benefit of companies and employees,

· and how these combined models can be used in different industries.

Adding Contextual Information to Structured Data

One way to improve predictive models is to add contextual information from unstructured text data. It is relatively common for prediction models based on structured, column-wise data to include numerical features extracted from text. For example, one can include financial mood scores in forecasting models for product sales. This vectorization of words (or of sentences or even full paragraphs) is accomplished by word embeddings, a natural language processing technique that captures similarities between words based on how often they appear in similar contexts. Over the last two decades, extensive research has led to improved types of embeddings, which are one of the core components of large language models. Data scientists therefore now have a plethora of options to choose from when they want to compute embeddings (word2vec, GloVe, fastText, large language model embeddings). By including word embeddings in prediction models, companies can extract more meaningful information from text data and improve the accuracy of their models.
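To make the idea of "similar contexts" concrete, here is a minimal sketch that builds simple count-based word vectors from co-occurrences and compares them with cosine similarity. It is a simplified stand-in for the learned, dense embeddings named above (word2vec, GloVe, fastText), not a reimplementation of any of them. The corpus, the window size, and the function names are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
CORPUS = [
    "sales rose after strong demand",
    "revenue rose after strong demand",
    "sales fell after weak demand",
    "revenue fell after weak demand",
    "the lawsuit was filed in court",
    "the complaint was filed in court",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = cooccurrence_vectors(CORPUS)
similar = cosine(vectors["sales"], vectors["revenue"])    # shared contexts
dissimilar = cosine(vectors["sales"], vectors["lawsuit"])  # no shared contexts
print(f"sales vs revenue: {similar:.2f}")
print(f"sales vs lawsuit: {dissimilar:.2f}")
```

In this toy corpus "sales" and "revenue" appear next to the same neighbours, so their vectors point in the same direction, while "sales" and "lawsuit" share no context words at all. Learned embeddings compress the same kind of signal into far fewer dimensions.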

Efficient Handling of High Cardinality Data

Another way in which generative AI can enhance structured data models is by efficiently handling high cardinality data. When dealing with classification systems with thousands of different codes, standard dummy encoding is not practical. Instead, data scientists can use target encoding or algorithms such as LightGBM and CatBoost, which have built-in functionality for high cardinality data. However, these approaches have limitations in size and disregard the interactions between multiple data points per object. Generative AI can be used to create embeddings for high cardinality data, which can then be applied as features in the predictive model. This approach can improve the accuracy of the model by capturing the underlying patterns in the data.
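For reference, the target-encoding baseline mentioned above can be sketched in a few lines. This is one common smoothed variant, written under our own assumptions: the function names, the `smoothing` parameter, and the tiny example data are hypothetical. Each code is replaced by the mean of the target for that code, shrunk toward the global mean so that rare codes do not get extreme values.

```python
from collections import defaultdict


def fit_target_encoding(codes, targets, smoothing=10.0):
    """Return (code -> smoothed target mean, global target mean)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for code, y in zip(codes, targets):
        sums[code] += y
        counts[code] += 1
    # Shrink each per-code mean toward the global mean; the fewer
    # observations a code has, the stronger the shrinkage.
    encoding = {
        code: (sums[code] + smoothing * global_mean) / (counts[code] + smoothing)
        for code in counts
    }
    return encoding, global_mean


def apply_target_encoding(codes, encoding, global_mean):
    """Encode codes; unseen codes fall back to the global mean."""
    return [encoding.get(code, global_mean) for code in codes]


# Made-up example: one diagnosis code per row and a binary target.
train_codes = ["E11", "E11", "I10", "J45"]
train_targets = [1, 1, 0, 0]
encoding, global_mean = fit_target_encoding(train_codes, train_targets, smoothing=2.0)
encoded = apply_target_encoding(["E11", "I10", "Z00"], encoding, global_mean)
print(encoded)
```

In practice the encoding should be fitted on training folds only, otherwise the target leaks into the feature. Note also that the result is a single number per code, which is why this approach cannot express that two different codes mean something similar.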

A variety of industries can benefit from the use of generative AI models for structured data analysis. For example, in the healthcare industry, the use of International Classification of Diseases (ICD) codes and Anatomical Therapeutic Chemical (ATC) codes for drug classification can be a challenge due to the high cardinality of the codes and the presence of multiple codes per patient. Another example is shopper data, where instead of the full product description we only find stock keeping unit (SKU) codes in the data.

Approach

Below we describe two approaches to improving prediction models by leveraging large language model concepts to make use of the full underlying data of complex classification systems. We do not provide performance results for the two approaches in this article, and our own research has not yet shown a clear winner. The purpose of this article is to provide inspiration for combining classic AI with generative AI, and we hope that it will lead to an interesting discussion in the community.

Enriching standard predictive models

Word embeddings can be used to extract contextual information from the full descriptions of complex classification systems to enrich the prediction model.

1. Establish a performance benchmark with a standard predictive model (e.g. Neural Network, LightGBM, CatBoost, Random Forest) using structured data. For high-cardinality data, also perform one-hot encoding for the n most frequent codes.

2. Leverage unstructured or high-cardinality data:

a. High-cardinality data: Treat the raw categories or codes as words. Train a new embedding model or fine-tune an existing embedding model on the specific data and problem. Embeddings ensure that categories or codes with a similar meaning receive a similar encoding.

b. Textual representation of codes: Replace codes with detailed text descriptions, e.g. ICD codes with the corresponding long description or SKU codes with the corresponding product description. Use textual representation to get embeddings for the codes either by leveraging traditional embedding algorithms (e.g. Word2Vec, fastText) or embeddings from pretrained large language models or generative AI models (e.g. variants of BERT, GPT, MPNet).

c. Unstructured data: If it is text data, use traditional embedding algorithms or embeddings from pretrained large language models. If it is multimedia data (e.g. image, sound, video), use pretrained deep learning models to obtain the embeddings (e.g. ResNet, Whisper, multimodal-receiver).

3. Further prepare the embeddings (e.g. patient embedding, shopper embedding, customer embedding) so that they can be used as input features for the prediction model. In the case of shopper embeddings, for example, take the goods purchased over a defined period of time and use the descriptions of these products as input for the embeddings.

4. Train a prediction model with structured and unstructured data (represented as embeddings) as input features.
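The feature preparation in the steps above can be sketched as follows. This is a minimal illustration under our own assumptions, not a definitive implementation: the function names, the two-dimensional embedding lookup, and the example patients are hypothetical, and in a real project the code embeddings would come from step 2 rather than being written by hand. The sketch one-hot encodes the n most frequent codes (step 1), averages the code embeddings per patient (one simple way to do step 3), and concatenates everything into one feature row for the model in step 4.

```python
from collections import Counter


def top_n_codes(records, n):
    """Return the n most frequent codes across all objects (e.g. patients)."""
    counts = Counter(code for codes in records for code in codes)
    return [code for code, _ in counts.most_common(n)]


def one_hot(codes, vocabulary):
    """1.0 for each vocabulary code the object has, else 0.0."""
    present = set(codes)
    return [1.0 if code in present else 0.0 for code in vocabulary]


def mean_pool(codes, code_embeddings, dim):
    """Average the embeddings of all known codes; zeros if none are known."""
    known = [code_embeddings[c] for c in codes if c in code_embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def build_features(structured, codes, vocabulary, code_embeddings, dim):
    """Concatenate structured fields, one-hot codes, and the pooled embedding."""
    return list(structured) + one_hot(codes, vocabulary) + mean_pool(codes, code_embeddings, dim)


# Made-up data: diagnosis codes per patient and a hand-written 2-d lookup.
patients = {
    "p1": ["E11", "I10"],
    "p2": ["E11"],
    "p3": ["E11", "I10", "J45"],
}
code_embeddings = {"E11": [1.0, 0.0], "I10": [0.0, 1.0], "J45": [1.0, 1.0]}

vocabulary = top_n_codes(patients.values(), n=2)
structured_p1 = [45.0, 1.0]  # hypothetical fields: age, high-income flag
features_p1 = build_features(structured_p1, patients["p1"], vocabulary, code_embeddings, dim=2)
print(vocabulary)
print(features_p1)
```

Mean pooling is only one option for turning several code embeddings into a single patient or shopper embedding. It ignores order and frequency, so weighted averages or sequence models may be worth testing against it.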

Using generative AI as the prediction model

Leverage a traditional prediction model to provide relevant information to the generative AI model.

1. Establish a performance benchmark with a standard predictive model (e.g. Neural Network, LightGBM, CatBoost, Random Forest) using structured data. For high-cardinality data, perform one-hot encoding for the n most frequent codes.

2. Analyze the feature importance to understand which features from the structured data are the most important.

3. Create training data:

a. Replace codes with detailed descriptions (e.g. ICD codes with long description or SKU codes with product description) or directly input available unstructured data.

b. Textually transcribe the most important features from the model trained on the structured data (e.g. “Age 45, High Income”).

4. Fine-tune a pretrained generative AI model using the prepared training data with the last layer customized for the specific task.
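Step 3 above can be sketched as a small text-building routine. This is an illustration under our own assumptions: the record layout, the function names, and the "Diagnoses:" wording are hypothetical, and the list of important features stands in for the output of the feature importance analysis in step 2. The sketch replaces codes with their long descriptions and transcribes the most important structured features into text, yielding one text and label pair per record that a fine-tuning job could consume.

```python
def transcribe_features(record, important_features):
    """Render the most important structured features as short text."""
    return ", ".join(f"{name} {record[name]}" for name in important_features)


def build_training_example(record, code_descriptions, important_features):
    """Turn one record into a text/label pair for fine-tuning."""
    # Unknown codes are kept as raw codes rather than dropped.
    descriptions = [code_descriptions.get(code, code) for code in record["codes"]]
    text = (
        transcribe_features(record, important_features)
        + ". Diagnoses: "
        + "; ".join(descriptions)
    )
    return {"text": text, "label": record["label"]}


# Made-up record; the feature names and the label are illustrative only.
code_descriptions = {
    "E11": "Type 2 diabetes mellitus",
    "I10": "Essential (primary) hypertension",
}
record = {"Age": 45, "Income": "High", "codes": ["E11", "I10"], "label": 1}
example = build_training_example(record, code_descriptions, ["Age", "Income"])
print(example["text"])
```

How the text is phrased is a design choice in its own right. Keeping the template fixed across records makes it easier to attribute performance differences to the data rather than to the wording.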

Using Generative AI for Enhancing Predictive Modelling Processes

While predictive models can provide valuable insights, integrating generative AI models can not only enhance these models but also offer recommendations for proactive measures to be taken for a given customer, a combination that is much more actionable than pure predictions. To incorporate generative AI into the predictive process, a company can combine a pretrained generative AI model with industry information, the company’s internal knowledge, and even customer data, such as the customer’s current products, the customer’s history, or any other relevant customer information. For example, plain product recommendations can be complemented with individual messages explaining why the recommended product might be a good choice. Providing such reasons is a powerful way to foster customer engagement, provided that the relevant customer information is available in a timely manner and the algorithm is able to highlight the most important drivers, which is the case for many LLM applications.
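As a rough sketch of how a prediction and its most important drivers could be handed to a generative model, the snippet below assembles a plain-text prompt. Everything in it is an assumption made for illustration: the function names, the prompt wording, the customer identifier, and the per-feature contribution values (which in practice might come from a feature attribution method applied to the prediction model). The call to an actual language model is deliberately left out.

```python
def top_drivers(contributions, k=3):
    """Pick the k features with the largest absolute contribution."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


def build_prompt(customer_id, product, score, contributions, k=3):
    """Assemble a prompt asking for a short explanation of a recommendation."""
    lines = [
        f"Customer: {customer_id}",
        f"Recommended product: {product} (model score {score:.2f})",
        "Most important drivers:",
    ]
    for feature, value in top_drivers(contributions, k):
        direction = "raises" if value > 0 else "lowers"
        lines.append(f"- {feature} ({direction} the score)")
    lines.append("Write a short, friendly message explaining why this product may fit.")
    return "\n".join(lines)


# Made-up contributions for a single customer.
contributions = {"tenure_years": 0.40, "recent_complaints": -0.25, "age": 0.05}
prompt = build_prompt("C-1001", "premium plan", 0.8123, contributions, k=2)
print(prompt)
```

Restricting the prompt to the top drivers keeps the generated message focused and limits how much customer data leaves the prediction system.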

There are many benefits to combining machine learning models and generative AI for business applications. Here are just a few. The AI system reduces the burden on employees by automating many processes. It can quickly analyze relevant data and provide recommendations, saving time and effort for the employee. It also ensures consistency and accuracy in communications, as the generative AI model leverages the company’s internal knowledge and expertise to deliver reliable insights. Finally, it gives employees access to a broader range of information and enables them to make better-informed decisions.

When trying to find a good analytics solution for these and similar applications it is most important to approach this from a problem-solving perspective instead of just always picking the coolest algorithm in town. Analysts need a good overview of the solution space to find the right combination for the given task.

Use cases

In the retail industry, companies often face the challenge of efficiently analyzing large volumes of shopper data to make accurate predictions, e.g. for next-best-offer recommendations. By leveraging generative AI techniques, companies can enrich their prediction models by incorporating unstructured data, such as product descriptions, customer reviews, and social media sentiment analysis. Furthermore, a generative AI model can suggest specific actions to be taken, such as designing targeted marketing campaigns, offering personalized product recommendations, or even increasing supply chain productivity. It can also point to additional information to be gathered from unstructured sources or propose specialized products that align with the customer’s preferences. This empowers retail employees to make better-informed decisions, improve customer experience, and identify business growth opportunities.

In the finance industry, accurate predictive modeling is crucial for making informed investment decisions, managing risks, and detecting fraudulent activities, to name just a few analytics use cases. In addition to classic predictive models that usually use data such as customer demographics, transaction history, and market trends, companies can leverage generative AI to incorporate additional unstructured data from various sources, including news articles, social media sentiment, and expert opinions. By integrating such a generative AI model into its predictive process, a financial institution gains the advantage of not only more accurate predictions but also personalized recommendations. The generative AI model can interpret queries or requests from financial analysts, analyze the available data, and provide insights into specific outcomes or events related to investments, risks, or market conditions. It can tap into its internal knowledge to suggest proactive measures, such as adjusting investment portfolios or identifying potential market trends. In this way, companies can enhance their predictive models, improve risk management strategies, and gain a competitive edge by making more informed and timely decisions.

In the insurance industry, predictive models are used to assess risks, determine premium rates, and optimize claims management. Using generative AI, the prediction models can be enriched with additional data such as accident reports, medical records, and natural language descriptions of incidents. Integrating generative AI into the predictive modeling process can provide several benefits to insurance companies. Risk assessments can be automated, which saves time for insurance agents and underwriters, allowing them to focus on more complex cases and make more accurate risk evaluations. The claims management process can be enhanced so that employees make more informed decisions, leading to faster and fairer settlements while reducing the company’s exposure to fraudulent claims. Generative AI can act as a virtual assistant to employees and can provide personalized policy recommendations or suggest coverage adjustments, additional policy options, or risk mitigation strategies.

Summary and outlook

Predictive modeling based on structured data is a well-established field of data science, and there are many good and robust solutions available. Classical machine learning models, particularly gradient boosting models, are still among the best-performing models for many use cases, and they have a preferable effort-to-outcome ratio. However, there is a continuous need to develop even better models, as economic necessities demand ongoing improvement. The use of generative AI to incorporate unstructured data and enhance structured data models is becoming more common and can provide significant improvements in accuracy and effectiveness. By utilizing embeddings generated from unstructured data, generative AI models can provide valuable insights and identify new features that may be useful for prediction. Going forward, we can expect to see more of a mix of structured and unstructured data being used for predictive modeling as large language models continue to progress.

Generative AI can not only improve prediction models but also transform how companies and employees interact with machine learning models. It can automate certain aspects of the evaluation process, save time and effort for employees, ensure consistency and accuracy in evaluations, and enable proactive decisions based on a comprehensive understanding of the company data. Furthermore, generative AI provides access to a broader range of information and enables employees to make more informed decisions. As technology continues to advance, the potential for generative AI in predictive modeling will only grow. It offers a pathway for organizations to unlock the full potential of their data, improve models, and transform their decision-making processes.

It is important to note that while AI systems can provide valuable suggestions, the final decision-making should still rest with humans. Prediction models and generative AI serve as powerful tools to augment human capabilities and support processes, but in most cases, it is ultimately the employee or the customer who takes the recommended next steps based on their judgment and expertise. While today’s AI applications have some degree of autonomy, they should always work under direct or indirect human control, so that humans stay in charge of human-AI teams and take responsibility for the outcomes.
