Unlock Powerful CSV Data Insights with Phind-CodeLlama, Together Inference API, and LangChain

3 min read · Feb 20, 2024

In today’s data-driven world, organizations are constantly seeking ways to extract valuable insights from their datasets to drive informed decision-making. Traditional methods of data analysis can be time-consuming and complex, often requiring specialized skills and resources. However, with the advent of new RAG techniques and Large Language Models, unlocking powerful insights from CSV data has become more accessible than ever before.

In this blog post, we will explore how to combine Phind-CodeLlama, the Together Inference API, and LangChain to extract valuable insights from your CSV data. Together, these tools let you explore a dataset by asking questions in plain English, so you can uncover patterns, trends, and correlations quickly and efficiently.

Phind CodeLlama

The Phind models are fine-tuned versions of CodeLlama-34B, trained on a proprietary dataset of programming problems and their solutions. Unlike typical code datasets, Phind's dataset consists of instruction-answer pairs rather than code completion examples. The models were fine-tuned natively, without LoRA, using DeepSpeed ZeRO 3 and Flash Attention 2 for efficient training. Phind-CodeLlama-34B-v2, for instance, was initialized from Phind-CodeLlama-34B-v1 and further trained on an additional 1.5 billion tokens. The dataset was also run through OpenAI's decontamination methodology to keep the reported results valid.

Phind/Phind-CodeLlama-34B-v2 · Hugging Face

Implementation

1. Let us install all the necessary libraries:

langchain_together
pandas
langchain
langchain_experimental
matplotlib
python-dotenv
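Once the packages are installed, a quick sanity check can confirm that they are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the import names are assumptions derived from the package list above (for example, python-dotenv is imported as `dotenv`).

```python
from importlib.util import find_spec

# Import names assumed for the packages listed above;
# python-dotenv is imported as "dotenv".
REQUIRED_MODULES = [
    "langchain_together",
    "pandas",
    "langchain",
    "langchain_experimental",
    "matplotlib",
    "dotenv",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, install it before moving on to the next step.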

2. Now let us create a class that stores the required data and builds the pandas DataFrame agent, which uses chain-of-thought reasoning to execute consecutive Python commands and return the final output

import pandas as pd
from langchain_together import Together
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent


class PandasDFAnalyzer:
    """Answers natural-language questions about a pandas DataFrame."""

    def __init__(self, df: pd.DataFrame, together_api_key: str, llm=None):
        self.df = df
        self.llm = llm
        # Fall back to Phind-CodeLlama on Together when no LLM is supplied
        if self.llm is None:
            self.llm = Together(
                model="Phind/Phind-CodeLlama-34B-v2",
                temperature=0.7,
                max_tokens=128,
                top_k=1,
                together_api_key=together_api_key,
            )

    def analyze_df(self, query: str) -> str:
        # Build an agent that writes and runs pandas code against self.df.
        # Recent langchain_experimental releases may also require
        # allow_dangerous_code=True here.
        agent = create_pandas_dataframe_agent(self.llm, self.df, verbose=True)
        # Newer LangChain releases deprecate run() in favor of invoke()
        response = agent.run(query)
        return response
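To make the "consecutive Python commands" idea more concrete, here is a toy sketch of the loop the agent roughly follows: run a command against the DataFrame, record the observation, and move on to the next command. This is not LangChain's implementation. The helper `run_steps`, the scripted commands, and the sample data are all hypothetical; in the real agent, the LLM chooses each command after reading the previous observation.

```python
import pandas as pd


def run_steps(df, commands):
    """Evaluate each pandas expression in order and collect (command, observation) pairs."""
    namespace = {"df": df, "pd": pd}
    trace = []
    for command in commands:
        # The real agent would ask the LLM for the next command after seeing
        # the previous observation; here the commands are scripted by hand.
        observation = eval(command, namespace)
        trace.append((command, observation))
    return trace


# Hypothetical data, only for illustration
toy_df = pd.DataFrame({"Model": ["A", "B", "C"], "Price": [100.0, 300.0, 200.0]})

trace = run_steps(
    toy_df,
    [
        "df.shape",
        "df['Price'].max()",
        "df.loc[df['Price'].idxmax(), 'Model']",
    ],
)
for command, observation in trace:
    print(f"Action: {command}\nObservation: {observation}\n")
```

Because commands like these are executed as real Python, it is worth running the agent only on data and in environments you trust.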

3. Let us test it with a sample file cars_min.csv

from llmprod.pandas_df.pandas_df_analyzer import PandasDFAnalyzer
import pandas as pd
from dotenv import load_dotenv
import os

# Load TOGETHER_API_KEY from a local .env file
load_dotenv()
TOGETHER_API_KEY = os.getenv('TOGETHER_API_KEY')

df = pd.read_csv('cars_min.csv')

df_analyzer = PandasDFAnalyzer(df=df, together_api_key=TOGETHER_API_KEY)

# Print the agent's final answer
print(df_analyzer.analyze_df(query="What are the best cars"))
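One thing to watch for: `os.getenv` returns `None` when the variable is missing, so a forgotten `.env` entry only shows up later as an authentication failure. A small guard can fail earlier with a clearer message. The helper name `require_env` below is hypothetical, a sketch rather than part of any library.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Example usage (after load_dotenv()):
# TOGETHER_API_KEY = require_env("TOGETHER_API_KEY")
```

With this in place, a missing key stops the script before any request is sent.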

Output

The best cars are the ones with the highest price. Based on the data provided, the top 5 best cars are:

Car_ID 4993, Model X, Production Date: 2022-07-25, Color: SILVER, Price: 149962.15
Car_ID 4724, Model X, Production Date: 2022-03-01, Color: SILVER, Price: 149958
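Since the model interpreted "best" as "most expensive", the answer above should correspond to a plain pandas query that sorts by price. The sketch below shows one way to cross-check it by hand. The column names are assumptions inferred from the output above, the first two rows copy that output, and the third row is made up purely for contrast.

```python
import pandas as pd

# Column names are assumed from the agent output; the last row is hypothetical
cars = pd.DataFrame(
    {
        "Car_ID": [4993, 4724, 17],
        "Model": ["Model X", "Model X", "Model Y"],
        "Production Date": ["2022-07-25", "2022-03-01", "2021-01-15"],
        "Color": ["SILVER", "SILVER", "BLUE"],
        "Price": [149962.15, 149958.0, 45000.0],
    }
)

# Keep the two most expensive cars, highest price first
top = cars.nlargest(2, "Price")
print(top.to_string(index=False))
```

Running the same kind of query on the real `cars_min.csv` is a cheap way to verify that the agent's answer matches the data.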

Conclusion

In conclusion, combining Phind-CodeLlama, the Together Inference API, and LangChain offers a straightforward way to extract valuable insights from CSV data. With a model like Phind-CodeLlama-34B-v2 and an agent that reasons step by step while executing Python commands, you can query a dataset in plain English and quickly get meaningful patterns and correlations back. While other code-generation LLMs such as WizardCoder exist, Phind-CodeLlama with LangChain appeared to perform better out of the box in this setup, giving accurate and relevant answers efficiently.

References

Pandas Dataframe | 🦜️🔗 Langchain

Code

satishgunasekaran/llmprod: Simple Code to access LLMS (github.com)