Advancing Data Engineering with Generative AI

Zhen Xing
TotalEnergies Digital Factory
7 min read · Apr 15, 2024
Envisioning a future where AI and data dance in digital harmony

Transforming Data Engineering through Generative AI

Generative AI is reshaping the engineering world by automating tasks such as creating documentation, writing code, and generating architecture diagrams, activities that typically require specialized skills and human judgment. AI models can process and produce information aligned with the task at hand, adapt to new requirements with minimal human intervention, and streamline complex processes that were once time-consuming.

This innovation enables not only engineers but anyone to communicate with complex systems in natural language, significantly changing how they work.

We will explore how GenAI can impact the data engineering field and look at the advantages that these initiatives can offer to this profession.

Large Language Model Operations (LLMOps)

Large language models are the heart of GenAI, and as they gradually expand their role in today's data engineering world, data engineers will soon need to work closely with them regularly. For this reason, it is important to adopt LLMOps, which helps manage the challenges, ensure the proper functioning of these AI models, and maximize their potential. LLMOps applies best practices and methods to the lifecycle of GenAI applications, including development, deployment, monitoring, and maintenance.

Check out this article from Databricks to learn more about LLMOps.

Understanding Retrieval-Augmented Generation (RAG)

Within the domain of LLMOps, RAG stands out as an important technique that significantly enhances the capability of AI models. RAG works by retrieving relevant information from external data sources during the generative process, allowing the AI to use that contextual information to create responses that are useful and precise. This method is particularly beneficial for tasks demanding high levels of specificity. For instance, imagine a chatbot that uses Wikipedia articles as context to provide more informed answers.
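To make the idea concrete, here is a minimal toy sketch of the retrieve-then-generate pattern. The corpus, the keyword-overlap scoring, and the prompt template are all illustrative stand-ins: a real RAG system would use vector embeddings for retrieval and pass the augmented prompt to an LLM.

```python
# Toy RAG sketch: retrieve relevant documents, then build an augmented prompt.
# Keyword overlap stands in for embedding similarity in a real system.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Augment the user question with retrieved context before generation."""
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
]
prompt = build_prompt("What is the capital of France?", corpus)
print(prompt)
```

The generator never sees the irrelevant document: retrieval filters the corpus down to the passages most related to the question, which is exactly what grounds the model's answer.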

From @FaresKi

Check out this article from Fares to find out more about RAG and its relationship with LLMOps.

As we continue to explore the applications of GenAI, we will look at practical examples showcasing the transformative impact of RAG: querying databases in natural language, integrating vector search capabilities, and empowering generative AI models to execute functions and perform actions autonomously.

SuperDuperDB: Elevating Data Integration with AI

Does your company work with different kinds of databases? Do you wonder how to integrate them with an LLM? Are you tired of evaluating all the various vector database providers, prepared either to spend heavily on license fees or to settle for a solution that might lose support soon?

SuperDuperDB is an innovative Python framework that revolutionizes the integration of AI models with relational and non-relational databases. With it, you can integrate AI and vector search directly with your database, without complex infrastructure changes or the addition of auxiliary vector databases.

The following code example showcases how to integrate SuperDuperDB to enable advanced vector search capabilities in MongoDB:

Step 1: Set Up Environment and Connect to MongoDB

# Install required libraries (execute in shell/terminal)
!pip install superduperdb
!pip install ipython openai==1.1.2

# Set the OpenAI API key (execute in shell/terminal or script)
import os
os.environ['OPENAI_API_KEY'] = 'sk-your-api-key'

# Connect to MongoDB datastore
from superduperdb import superduper, Collection
mongodb_uri = "mongodb://localhost:27017/mydatabase"
db = superduper(mongodb_uri, artifact_store='filesystem://./data/')
doc_collection = Collection('documents')  # MongoDB collection holding documents

Step 2: Load and Insert Data

# Download the dataset (execute in shell/terminal)
!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json

# Import data into MongoDB
import json
from superduperdb import Document

with open('pymongo.json') as json_file:
    pymongo_data = json.load(json_file)

db.execute(doc_collection.insert_many([Document(doc) for doc in pymongo_data]))

Step 3: Vectorize Text Data

# Perform vectorization using the OpenAI model
from superduperdb.ext.openai.model import OpenAIEmbedding
# Initialize the OpenAI Embedding model
openai_model = OpenAIEmbedding(identifier='text-embedding-ada-002')

Step 4: Set Up Vector Indexing

from superduperdb import Listener, VectorIndex

# Add a VectorIndex with real-time listening and vectorization to the database
db.add(VectorIndex(
    identifier='pymongo-docs-embedding',
    indexing_listener=Listener(
        select=doc_collection.find(),
        key='value',  # Field used for indexing
        model=openai_model,
    ),
))

Step 5: Execute Vector Search

search_term = 'Query the database'
num_results = 5

# Perform vector search and retrieve the most relevant documents
search_results = db.execute(
    doc_collection.like(Document({'value': search_term}),
                        vector_index='pymongo-docs-embedding',
                        n=num_results).find())

# Display search results
for doc in search_results:
    print(f"Document: {doc['value']}")  # Assuming 'value' contains the document text

We generate vectors using the OpenAI embeddings API and use SuperDuperDB to add vector search features to an existing database. The same approach applies to other engines like Snowflake and various other SQL and NoSQL databases.

Vanna: Transforming SQL Query Generation with RAG

Another very inspiring open-source tool for easily interacting with your database using LLM models is Vanna.

Vanna provides a Python framework designed for generating SQL queries from natural language. It uses RAG to translate human-understandable questions into accurate SQL statements, making database querying accessible to those without SQL fluency.

From Vanna.AI website

With Vanna, you start by giving your LLM model context about your database metadata to help it learn your schema. You can then ask questions, and Vanna will generate SQL queries that can be configured to run on your database automatically.

A remarkable feature of Vanna is its self-learning: you can opt to auto-train it on queries that ran successfully, so correct question-SQL pairs are saved for future use and improve the accuracy of later results.

Let’s see a simple example of how to use Vanna:

Step 1: Setup Vanna within the Business Environment

# Import Vanna module and initialize with a connection to the company's SQL database
from vanna import VannaSQL
db_connection_string = "postgresql://user:password@localhost:5432/sales_database"
vanna = VannaSQL(connection_string=db_connection_string)

Step 2: Translate Natural Language to SQL Query

natural_language_question = "What were the average sales by product category last quarter?"
# Use Vanna to convert the natural language question to an SQL query
sql_query = vanna.to_sql(natural_language_question)
print(sql_query)
# Outputs: "SELECT AVG(sales), product_category FROM sales_data WHERE date >= '2023-01-01' GROUP BY product_category"

Step 3: Execute the Generated SQL Query

# With the SQL query generated, Vanna can execute it to retrieve the desired data
result_set = vanna.execute(sql_query)
# Nicely format and display the query results to the user
vanna.display(result_set)

Step 4: Continuous Learning and Adaptation

As mentioned before, each successful interaction with Vanna serves as training input, refining its ability to comprehend a more extensive array of questions related to business needs and domain-specific vocabulary.
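The self-learning loop can be illustrated with a toy sketch (this is not Vanna's actual API; the class and the overlap-based retrieval are made up for illustration): validated question-SQL pairs are stored, and the closest stored pair is retrieved as few-shot context for the next generation.

```python
# Toy sketch of self-learning: store successful question/SQL pairs and
# retrieve the closest match as context for future query generation.
class QuerySQLMemory:
    def __init__(self):
        self.pairs: list[tuple[str, str]] = []

    def record_success(self, question: str, sql: str) -> None:
        """Save a validated question/SQL pair as future training input."""
        self.pairs.append((question, sql))

    def closest_example(self, question: str):
        """Retrieve the stored pair with the highest keyword overlap."""
        if not self.pairs:
            return None
        q = set(question.lower().split())
        return max(self.pairs, key=lambda p: len(q & set(p[0].lower().split())))

memory = QuerySQLMemory()
memory.record_success(
    "What were the average sales by product category last quarter?",
    "SELECT AVG(sales), product_category FROM sales_data GROUP BY product_category",
)
example = memory.closest_example("Show average sales per category")
```

Each confirmed pair enriches the retrieval corpus, which is why accuracy improves over time: a new question similar to a past one is grounded by a known-good SQL example.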

Function Calling

Data engineers do not only work with databases; they also deal with more complex tasks, tools, and systems such as APIs, data pipelines, and transformations that exceed the boundaries of LLMs and RAG alone. In this situation, function calling is a helpful feature.

Function calling turns LLMs from passive responders to active agents that can carry out tasks, call APIs, and interact with databases and external services.

A simple example to illustrate this is the following:

Implementing a data pipeline using OpenAI function calling allows you to create a dynamic and adaptable system. Here’s how you can build such a pipeline:

  1. Define Your Functions: Start by defining custom functions that encapsulate specific data processing tasks. For example:

  • clean_data(raw_data: DataFrame) -> DataFrame: Cleans and standardizes raw data.
  • calculate_metrics(data: DataFrame) -> Dict[str, float]: Computes relevant business metrics.
  • generate_reports(metrics: Dict[str, float]) -> str: Creates summary reports.

2. Chatbot Interaction:

  • Users interact with a chatbot powered by OpenAI.
  • Chatbot receives natural language queries (e.g., “Analyze sales for Q1 2024.”).

3. Function Calling with OpenAI:

  • Call the OpenAI model, providing the user query and a list of defined functions.
  • The model intelligently generates JSON output adhering to your custom schema.

4. Parse and Execute:

  • In your code, parse the JSON response into structured data.
  • If the function exists (e.g., “calculate_metrics”), execute it with the provided arguments. Retrieve results (e.g., { “revenue”: 100000, “profit”: 25000 }).
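The parse-and-execute step above can be sketched as follows. This is a hedged, minimal illustration: the simulated JSON payload, the `calculate_metrics` body, and the sales figures are made up, and a real pipeline would obtain the function name and JSON arguments from the tool-call fields of an OpenAI chat completion response rather than from a hardcoded string.

```python
# Sketch of the dispatch step: parse the model's JSON arguments and execute
# the matching local function from a registry of allowed tools.
import json

def calculate_metrics(sales: list) -> dict:
    """Compute simple illustrative business metrics from raw sales figures."""
    revenue = sum(sales)
    return {"revenue": revenue, "profit": round(revenue * 0.25, 2)}

# Registry of functions the model is allowed to invoke
AVAILABLE_FUNCTIONS = {"calculate_metrics": calculate_metrics}

def dispatch(tool_name: str, arguments_json: str):
    """Look up the requested function and call it with the parsed arguments."""
    func = AVAILABLE_FUNCTIONS.get(tool_name)
    if func is None:
        raise ValueError(f"Unknown function: {tool_name}")
    return func(**json.loads(arguments_json))

# Simulated model output for "Analyze sales for Q1 2024."
result = dispatch("calculate_metrics", '{"sales": [40000.0, 35000.0, 25000.0]}')
print(result)  # {'revenue': 100000.0, 'profit': 25000.0}
```

Keeping an explicit registry means the model can only trigger functions you have whitelisted, which is the safety boundary that makes function calling practical in a pipeline.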

A good tool for exploring a similar pattern to orchestrate more complicated data analytics tasks is TaskWeaver, a code-first agent framework for seamlessly planning and executing data analytics tasks. This innovative framework interprets user requests through code snippets and efficiently coordinates a variety of plugins.

The Future of Data Engineering Shaped by Generative AI

In this article, we have seen some of the powerful things we can do with GenAI for data engineering tasks. We might not use all of them right away, but we can get ready for the future by giving our tables, functions, and other related elements descriptive names, which will help the smart coordination of our future data pipelines.

When a user asks or tells an AI system something, the model must be able to match those words to what the user means or wants. For example, if a database has a table called "MonthlySalesData" and the user asks, "What were the sales figures in March?", a clear name helps the LLM connect "sales figures" with "MonthlySalesData".
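Even a naive matcher, standing in here for an LLM, can make that connection when names are descriptive. The table names below are made up for illustration; the point is that "MonthlySalesData" wins purely because it shares a word with the question, which cryptic names like "tbl_msd" never would.

```python
# Minimal sketch: descriptive table names let a simple word-overlap matcher
# (a stand-in for an LLM) connect a question to the right table.
import re

TABLES = ["MonthlySalesData", "UserAccounts", "InventoryLevels"]

def split_camel_case(name: str) -> set:
    """Break a CamelCase identifier into lowercase words."""
    return {w.lower() for w in re.findall(r"[A-Z][a-z]*", name)}

def best_table(question: str) -> str:
    """Pick the table whose name shares the most words with the question."""
    q_words = set(re.findall(r"[a-z]+", question.lower()))
    return max(TABLES, key=lambda t: len(q_words & split_camel_case(t)))

print(best_table("What were the sales figures in March?"))  # MonthlySalesData
```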

As we move forward, we imagine a future where AI doesn’t just help data engineering, but takes it to new heights. Generative AI opens up a world where data systems are not just places to store information, but are active, helpful partners in managing and exploring the huge amounts of data.
