Part 1.3: Chat with a CSV / LangChain, ChatGPT
Hi, I am Mine. In case you missed Parts 1-2, here is a little brief about what we have done so far: recently I was working on a project to build a question-answering model that responds to questions over the data we have from one of our internal projects, which we use as a project management tool.
There, we saw how AWS Canvas works, created a forecast model in it, deployed that model along with a Hugging Face model, and tried LangChain to connect the model to our CSV data. Links are below.
Later on, the approach changed: for the solution we will be using GPT-4 as the model. Let me dive directly into the topic.
Agents for OpenAI Functions
If you read the previous post, you will know that we were using csv_agent to create a question-answering model from the CSV data. For docs, check here.
Now we switch to OpenAI models, and we will change our agent type from ZERO_SHOT_REACT_DESCRIPTION to OPENAI_FUNCTIONS. You can read through the differences in the docs, check here. Briefly, the OpenAI Functions API is meant to more reliably return valid and useful function calls for actions; unlike with ZERO_SHOT_REACT_DESCRIPTION, there is no need to describe to the LLM which function to use to complete the action.
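To make the switch concrete, the only thing that really changes on our side is the agent_type argument. Here is a minimal sketch of the two variants; llm and file_path are placeholders for whichever model wrapper and CSV file you used before:
from langchain.agents import create_csv_agent
from langchain.agents.agent_types import AgentType
# Part 2 style: the ReAct agent picks a tool based on textual tool descriptions in the prompt.
react_agent = create_csv_agent(llm, file_path, agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
# Part 3 style: the OpenAI Functions agent lets the model return a structured function call instead.
functions_agent = create_csv_agent(llm, file_path, agent_type=AgentType.OPENAI_FUNCTIONS)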
First, we no longer need SageMaker to run a model; from now on we just use our OpenAI API key. Let's update our code:
pip install langchain
pip install openai
pip install tabulate
pip install pandas
import tabulate
import pandas as pd
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_csv_agent
from langchain.agents.agent_types import AgentType
OPENAI_API_KEY = '{your_api_key}'
file_path= '{your_file_path}'
csv_agent= create_csv_agent(
ChatOpenAI(temperature=0, model="gpt-4", openai_api_key=OPENAI_API_KEY),
file_path,
verbose=True,
stop=["\nObservation:"],
agent_type=AgentType.OPENAI_FUNCTIONS,
handle_parsing_errors=True
)
csv_agent.run("how mant diffrent project_id are there? ")
Prompt, why is it so important?
A prompt is a set of instructions or input provided by a user to guide the model's response, helping it understand the context and generate relevant output, such as answering questions, completing sentences, or engaging in a conversation.
LangChain provides several classes and functions to help construct and work with prompts.
- Prompt templates: Parametrized model inputs
- Example selectors: Dynamically select examples to include in prompts
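For example, a PromptTemplate is just a parametrized string. A quick sketch (the template text and variable name here are only illustrative; we will build the real one in Part 4):
from langchain.prompts import PromptTemplate
# Parametrized model input: the {question} placeholder is filled in at runtime.
template = PromptTemplate(
    input_variables=["question"],
    template="You are a question-answering bot over dataframes.\nQuestion: {question}",
)
print(template.format(question="how many rows are there?"))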
We need a proper prompt for our CSV data, because even though we have a csv_agent, it calls create_pandas_dataframe_agent behind the scenes. That just converts our data to dataframes; the model is not aware of the context of the data. If you run the code above, you will see that the given question is not answered correctly. But if you ask something like this, you will see the correct answer.
agent.run("how many rows are there?")
We should use PromptTemplates to instruct the model on where to find the data and what it represents.
A little trick here: you don't have to use PromptTemplates, you can do prompting like this as well. I am not sure if it makes a difference, but it works :P
question = "which projects Mine Kaya worked on September 2023 ?"
propmt = " You are a question-answering bot over the data you queried from dataframes.
Give responses to the question on the end of the text :
df1 contains data about timesheet, which user worked on a project on that day.
Worked and spending time are representing same thing.
It's a daily record but you will not find weekend records for Turkey's timezone, GMT+03:00.
df1 columns and column descriptions as following:
date: day of record , only weekdays
user ID: unique identifier of the user
name: name of the user
email: email of the user
project_id: unique identifier of the project that user worked
hours : hours of user spend on that project
assignment_start_date : when user started to work on that project
assignment_end_date : when user ended to work on that project
Below is the question
Question:" + question
agent.run(propmt)
Now ChatGPT is aware of the context and ready to answer questions over that CSV. But we have several different topics, and they will be stored in several different CSVs. csv_agent really makes it easier to work with multiple files.
Working with Multiple CSV Files
We will just give all the file paths that we have and update our prompt as well.
OPENAI_API_KEY = '{your_api_key}'
file_path_timesheet = '{your_file_path}'
file_path_user = '{your_file_path}'
file_path_homeoffice = '{your_file_path}'
file_path_projects = '{your_file_path}'
file_path_timeoff = '{your_file_path}'
csv_agent= create_csv_agent(
ChatOpenAI(temperature=0, model="gpt-4", openai_api_key=OPENAI_API_KEY),
[file_path_timesheet,file_path_user,file_path_homeoffice,
file_path_projects,file_path_timeoff],
verbose=True,
stop=["\nObservation:"],
agent_type=AgentType.OPENAI_FUNCTIONS,
handle_parsing_errors=True
)
csv_agent.run("how mant diffrent project_id are there? ")
query = 'How are you?'
prompt = (
"""
You are a question-answering bot over the data you queried from dataframes.
Give responses to the question at the end of the text:
df1 contains data about the timesheet, which user worked on a project on that day.
Worked and spending time represent the same thing.
It's a daily record, but you will not find weekend records for Turkey's timezone, GMT+03:00.
df1 columns and column descriptions are as follows:
date: day of record, only weekdays
user ID: unique identifier of the user
name: name of the user
email: email of the user
project_id: unique identifier of the project that the user worked on
hours: hours the user spent on that project
assignment_start_date: when the user started to work on that project
assignment_end_date: when the user ended work on that project
df2 contains data about user details; if you have a question about a user, check this one.
df2 columns and column descriptions are as follows:
user_id: The unique identifier of the user.
name: The name of the user.
email: The email address of the user.
employment_start_date: The date the user started work.
birthday: The birthday of the user.
.
.
.
Below is the question
Question:
"""
+ query
)
csv_agent.run(prompt)
Don't worry, I will follow best practices later on in Part 4, where we will see how to use the PromptTemplate class properly.
Cool, we built a question-answering model from our CSV data. I should say that when we switched to ChatGPT, the responses were more relevant even before prompting. There can be several reasons for that: the first is that maybe I couldn't choose a good open-source model from Hugging Face for this specific problem; the second could be the change in agent type, since the OpenAI Functions type is specifically developed for OpenAI models, so it can interact with the model in a smoother way.
We reached our first goal. Our second goal will be to successfully add memory to the project, so the conversation can keep going without giving a specific project or user name. Hang in there, see you in Part 4.
Previous posts: