Utilizing LangChain and OpenAI for Data Processing

Excel/CSV processing using LangChain, OpenAI, and Python

Rajendra Soanwale
Globant
4 min read · Jul 3, 2024

Excel/CSV data processing using OpenAI & LangChain

In our increasingly data-driven world, it’s imperative to find and use tools that streamline data processing tasks. This article focuses on how LangChain, a framework for building applications on top of language models, can be used to work with Excel and CSV data.

Prerequisites

  • Python 3.7 or higher
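  • LangChain library installed (pip install langchain), along with the langchain-openai and langchain-experimental packages that the code in this article imports.
  • OpenAI library installed (pip install openai==1.12.0 or the latest version) and an OpenAI API key.
  • Openpyxl library installed (pandas uses it to read .xlsx files).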
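
Before going further, it can help to confirm that the prerequisites are actually importable. The snippet below is a minimal sketch, not part of the original setup: the helper name missing_modules and the module list are assumptions based on the imports used later in this article.

```python
from importlib.util import find_spec

# Import names used in this article (assumed list; adjust it to your own setup).
REQUIRED_MODULES = [
    "langchain",
    "langchain_openai",
    "langchain_experimental",
    "openai",
    "pandas",
    "openpyxl",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```

If anything is reported as missing, install it with pip before running the later examples.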

Understanding LangChain

LangChain is a framework for building applications around large language models (LLMs). Often used in Natural Language Processing (NLP), it connects a model to your own data and tools, so that questions asked in plain language can be answered from raw data. This makes it particularly useful when dealing with large Excel spreadsheets or CSV files.

Why use LangChain for Excel and CSV data?

Excel and CSV files are common data storage formats, but interpreting and sorting through rows of text and numbers can be daunting. This is where LangChain shines.

  • Streamlining Workflow: LangChain can quickly sift through volumes of data, extract important information, and easily identify trends and relationships. This can completely transform a task that would have otherwise taken hours to complete.
  • Versatility: LangChain is not restricted to text analysis; it can manage a range of data, from simple numerical data to complex strings.
  • Automation: Another distinguishing advantage of LangChain is its ability to automate monotonous tasks such as data entry or cleaning. Automation reduces human error, ensuring more accurate results.

LangChain in Action

The process starts with feeding your Excel or CSV data into LangChain. The model, guided by pre-set rules or ‘prompts’, then identifies and categorizes data based on the instructions provided. For instance, LangChain can classify records based on user-defined rules.
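
To make the idea of rule-driven prompts concrete, here is a minimal sketch in plain Python. It only builds the prompt text; the function name build_classification_prompt, the rule labels, and the sample record are all hypothetical, and the resulting string would still need to be sent to an LLM to get a classification back.

```python
def build_classification_prompt(rules, record):
    """Combine user-defined rules and one data record into a single prompt string."""
    rule_lines = "\n".join(
        f"- {label}: {description}" for label, description in rules.items()
    )
    record_lines = "\n".join(
        f"{column}: {value}" for column, value in record.items()
    )
    return (
        "Classify the record below using these rules.\n"
        f"{rule_lines}\n\n"
        f"Record:\n{record_lines}\n\n"
        "Answer with exactly one label."
    )


# Hypothetical rules and a hypothetical row taken from a spreadsheet.
rules = {
    "high_value": "order total above 1000",
    "standard": "any other order",
}
record = {"order_id": 17, "total": 1250.0}

prompt = build_classification_prompt(rules, record)
print(prompt)
```

Keeping the rules in a dictionary means they can be changed without touching the prompt-building code.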

Integrate OpenAI into LangChain

LangChain, a Python library, lets you work with a range of Large Language Models (LLMs) and integrate them into your own applications and data. It provides a Software Development Kit (SDK) with integrations for numerous LLM providers, including OpenAI.

The code block below integrates Azure OpenAI into LangChain using the langchain_openai library. The OpenAI key and endpoint are passed to LangChain through environment variables.

import os

from langchain_openai import AzureChatOpenAI

# AzureChatOpenAI reads the key, API version, and endpoint from these variables
os.environ["OPENAI_API_KEY"] = "XXXXXXXX"
os.environ["OPENAI_API_VERSION"] = "2023-05-15-pub"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://xxxxxxx.openai.azure.com/"

llm = AzureChatOpenAI(
    azure_deployment="deployment",  # name of your Azure OpenAI deployment
    model_version="0613",  # model version; useful for cost tracking
)
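
In practice you may prefer to set these variables outside the script rather than hard-coding the key. The following is a minimal sketch of a pre-flight check, assuming the same three variable names used above; the helper missing_settings is hypothetical and not part of LangChain.

```python
import os

# The three variables set in the snippet above.
REQUIRED_SETTINGS = (
    "OPENAI_API_KEY",
    "OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
)


def missing_settings(env=None):
    """Return the required variable names that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with a plain dict standing in for the real environment.
example_env = {
    "OPENAI_API_KEY": "XXXXXXXX",
    "AZURE_OPENAI_ENDPOINT": "https://xxxxxxx.openai.azure.com/",
}
print(missing_settings(example_env))  # → ['OPENAI_API_VERSION']
```

Calling missing_settings() with no argument checks the real environment, which gives a clearer error than a failed API call later on.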

Creating Pandas Dataframe Agent

With the setup and LLM creation out of the way, we can create a Pandas Dataframe “Agent” to talk to the Excel/CSV dataset.

First, read Excel/CSV files using the pandas library.

import pandas as pd
#Example with csv
df = pd.read_csv('filename.csv')

#Example with Excel
df = pd.read_excel('filename.xlsx')
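
If the same script has to handle both file types, the reader can be chosen from the file extension. This is a minimal sketch under that assumption; reader_name_for and load_table are hypothetical helper names, and only the .csv and .xlsx extensions shown above are covered.

```python
from pathlib import Path

# Maps a file extension to the name of the pandas reader function used above.
READERS = {".csv": "read_csv", ".xlsx": "read_excel"}


def reader_name_for(path):
    """Return the pandas reader name for `path`, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix or 'no extension'}")
    return READERS[suffix]


def load_table(path):
    """Load a CSV or Excel file into a pandas DataFrame."""
    # Imported here so that reader_name_for can be used without pandas installed.
    import pandas as pd

    return getattr(pd, reader_name_for(path))(path)


print(reader_name_for("filename.csv"))   # → read_csv
print(reader_name_for("filename.xlsx"))  # → read_excel
```

With this in place, df = load_table('filename.csv') and df = load_table('filename.xlsx') both return a data frame.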

Second, create an agent. Once the data frame and the LLM object are available, you can create a LangChain agent that connects to the dataset and answers questions about it.

The code below creates a pandas data frame agent using the langchain_experimental library. This agent accepts the LLM object and a data frame as parameters.

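from langchain_experimental.agents import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(
    llm,
    df,
    verbose=True,  # print the agent's intermediate steps
    # Recent langchain_experimental releases also require allow_dangerous_code=True,
    # because the agent executes model-generated Python code.
)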

Once an agent is available, we’re ready to ask it whatever we can think of about the dataset, and it will give us an answer based on the data.

ans = agent.run("How many rows of data do you have?")
print(ans)

[Image: output from the LangChain agent]

As you can see in the above image, the agent returns the output and displays what action was taken on the dataset.

Feel free to pose any queries concerning your dataset, whether they probe the surface or delve into its intricate depths. This LangChain agent serves you with precise answers, regardless of the complexity of your question. It’s akin to having a data oracle at your disposal. From scanning and analyzing your data’s landscape to performing nuanced operations like aggregation and reduction, this system is as versatile as it is powerful. It’s not just data analysis; it’s an interactive dialogue with your data.
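
Asking several questions in a row can be wrapped in a small helper. The sketch below is an assumption-laden illustration: ask is a hypothetical function, and EchoAgent is a stand-in object so the example runs without an API key. It tries invoke first, which newer LangChain versions favor, and falls back to the run method used earlier.

```python
def ask(agent, question):
    """Send one question to the agent and return the answer text."""
    if hasattr(agent, "invoke"):
        result = agent.invoke({"input": question})
        # An agent executor's invoke() returns a dict; the answer is under "output".
        return result["output"] if isinstance(result, dict) else result
    # Older agents only expose run(), as used earlier in this article.
    return agent.run(question)


class EchoAgent:
    """Hypothetical stand-in for a real agent; it makes no LLM calls."""

    def invoke(self, inputs):
        question = inputs["input"]
        return {"input": question, "output": f"(answer to: {question})"}


questions = [
    "How many rows of data do you have?",
    "What is the average of each numeric column?",
]
for question in questions:
    print(question, "->", ask(EchoAgent(), question))
```

Replacing EchoAgent() with the agent created above would send the same questions to the real dataset.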

Challenges in AI Data Processing

There are three challenges to consider:

  • Data Security and Privacy: Processing sensitive data with AI raises concerns about data security and privacy. Users need to ensure that they are compliant with data protection regulations.
  • Human Oversight: AI is a tool for data processing, but it should not replace human oversight. Critical decisions should be reviewed by a human to ensure the AI’s recommendations are sensible and applicable.
  • Contextual Understanding: AI might struggle with understanding the context of the data, especially if it involves domain-specific knowledge or nuances that the model has not been trained on.
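
One practical way to reduce the data security risk is to keep sensitive columns away from the agent altogether. The following is a minimal sketch of that idea, not a complete privacy solution; safe_columns and the column names are hypothetical.

```python
def safe_columns(columns, sensitive):
    """Return the column names that are not marked as sensitive (case-insensitive)."""
    blocked = {name.lower() for name in sensitive}
    return [col for col in columns if str(col).lower() not in blocked]


# Hypothetical column names from a customer spreadsheet.
columns = ["customer_id", "Email", "order_total", "SSN"]
print(safe_columns(columns, {"email", "ssn"}))  # → ['customer_id', 'order_total']
```

With pandas, df[safe_columns(df.columns, {"email", "ssn"})] gives a reduced data frame that can be passed to the agent instead of the full one.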

Conclusion

LangChain is stirring a revolution in Excel and CSV data processing, bringing efficiencies and automation to your fingertips. Embrace the potential of AI, and explore LangChain for a streamlined, error-free data processing experience.
