Extending the Wardley Mapping Body of Knowledge — Part 16: Using Pinecone Serverless for Vector Database

Upserting data into Pinecone Serverless

Mark Craddock
Prompt Engineering

--

Pinecone Serverless

A Game-Changer for Developers

Pinecone’s mission to democratise vector search for AI applications has taken a significant leap with the introduction of its serverless platform. This evolution is particularly relevant in the context of Wardley Mapping, where dynamic, real-time strategic insights are crucial. The serverless architecture of Pinecone allows for seamless scalability and efficient processing of large datasets, making it an ideal companion for strategic mapping efforts.

With serverless, developers and strategists can now build and deploy GenAI applications that are not only fast and accurate but also remarkably cost-effective. The ability to handle massive, ever-growing amounts of vector data with ease translates into more nuanced and informed strategic decisions. More on Pinecone Serverless here.

Transforming Strategic Insights with Unprecedented Efficiency

The integration of Pinecone Serverless into Wardley Mapping methodologies ushers in a new era of strategic planning. The separation of reads, writes, and storage within Pinecone’s architecture significantly reduces costs across all types and sizes of workloads. This cost efficiency, coupled with the innovative indexing and retrieval algorithms, enables businesses to conduct low-latency, vector-based searches over an unlimited number of records without sacrificing quality or performance.

For Wardley Mapping, this means the ability to incorporate vast amounts of real-time data into strategic maps, enhancing the depth and accuracy of insights. The serverless nature of Pinecone eliminates the need for infrastructure management, allowing strategists to focus on deriving actionable insights from their maps.

Empowering Wardley Maps with Knowledge and Agility

Pinecone Serverless stands out not just for its cost savings but also for its effortless scalability and the freshness of its search results. The absence of infrastructure considerations like pods, shards, and replicas means that strategists can scale their applications without complexity, from the initial stages of development to full-scale production.

The impact on Wardley Mapping is profound. As strategic maps become more data-driven and dynamic, businesses can adapt more swiftly to changing market conditions and competitive landscapes. The integration of Pinecone Serverless into this process ensures that strategic decisions are backed by the most relevant and up-to-date information available.

The Code:

Configuration and Google Drive Integration

Let's mount Google Drive to access stored data and configuration, such as the settings.ini file, which holds key settings for the knowledge base.

# Mount Google Drive
try:
    from google.colab import drive
    drive.mount('/content/gdrive')
except Exception as e:
    print(f"Failed to mount Google Drive. Reason: {e}")
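The article doesn't show what settings.ini contains, but reading it typically looks like the sketch below. The section and key names here are hypothetical, chosen only to illustrate the pattern; substitute the keys your own notebook actually uses.

```python
import configparser

# Hypothetical settings.ini contents -- the real file's keys are not
# shown in the article, so these names are illustrative only.
sample_ini = """
[knowledge_base]
index_name = wardleykb
chunk_size = 1000
chunk_overlap = 0
"""

config = configparser.ConfigParser()
config.read_string(sample_ini)
# In the notebook you would instead call:
# config.read('/content/gdrive/MyDrive/settings.ini')

index_name = config.get("knowledge_base", "index_name")
chunk_size = config.getint("knowledge_base", "chunk_size")
```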

Dependency Installation

To leverage Langchain and OpenAI effectively, the notebook installs necessary Python packages, ensuring all tools required for processing and vectorization are available.

Note: These are now the required dependencies for LangChain v0.1.0. Please check any of your previous code, because many capabilities are being deprecated, so you need to move over to version 0.1.0. More details here.

!pip install -q -U langchain
!pip install -q -U langchain-community
!pip install -q -U langchain_openai

OpenAI Model Configuration

The code snippet below sets up the OpenAI model to be used, selecting the latest version for optimal performance and capabilities.

Note: This is the very latest model released as of 27th January 2024. Please check for the latest version if you are looking at this notebook later in the year.

MODEL = "gpt-3.5-turbo-0125"  # Latest model

Pinecone Serverless Initialization

Ensure that the Pinecone API key is set as the environment variable PINECONE_API_KEY. This step is missing from the Pinecone Serverless documentation.

os.environ["PINECONE_API_KEY"] = api_key

Here, the notebook initialises Pinecone Serverless, specifying the vector database type as ‘PineconeServerless’, and sets up the environment and API key for Pinecone.

vs = 'PineconeServerless'  # Vector database type

if vs == 'PineconeServerless':
    !pip install -q -U pinecone-client
    from pinecone import Pinecone, ServerlessSpec
    api_key = userdata.get('PINECONE_API_KEY')  # Fetch key from Colab keystore
    os.environ["PINECONE_API_KEY"] = api_key
    # Note, this is missing from the Pinecone Serverless documentation.

    # Initialize Pinecone
    pc = Pinecone(api_key=api_key)
    spec = ServerlessSpec(cloud='aws', region='us-west-2')
    index_name = 'wardleykb'  # Make sure you use your own index name
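One step the snippet above leaves implicit is creating the index if it doesn't already exist. Here's a hedged sketch of how that looks, assuming the v3 pinecone-client API and the 1536-dimension vectors produced by OpenAI's default embedding model (both assumptions — check them against your own setup; this needs a live PINECONE_API_KEY to run).

```python
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])  # requires a live key
spec = ServerlessSpec(cloud='aws', region='us-west-2')
index_name = 'wardleykb'

# OpenAI's text-embedding-ada-002 produces 1536-dimension vectors;
# adjust the dimension if you use a different embedding model.
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric='cosine',
        spec=spec,
    )
```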

Data Preparation and Vectorization

The notebook uses Langchain’s text splitter and OpenAI embeddings to process and chunk text data, preparing it for vectorization and subsequent storage in Pinecone Serverless.

from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separator="\n")
embeddings = OpenAIEmbeddings()
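As a rough mental model (this is an approximation, not LangChain's actual implementation), a character splitter with separator="\n" breaks the text on newlines and then greedily packs consecutive pieces into chunks of at most chunk_size characters:

```python
def naive_split(text, chunk_size=1000, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept
    whole (LangChain's splitter similarly emits an oversized chunk and warns).
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

With chunk_overlap=0, as in the notebook, no text is shared between adjacent chunks.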

Storing Data in Pinecone Serverless

The final and most crucial step involves splitting the text into chunks, creating metadata for each chunk, and upserting these embeddings into Pinecone Serverless. This process transforms the textual data into searchable vectors within Pinecone’s serverless infrastructure, enabling fast and efficient retrieval of knowledge.

The next step is very important: we now switch back to the LangChain Pinecone integration.

from langchain_community.vectorstores import Pinecone  # in v0.1.0 the vector stores live in langchain_community

Here's a snapshot of the code required to upsert the data into Pinecone Serverless. It's very similar to the code for standard Pinecone pods.

# Setup for processing and vectorizing text
docs = []
metadatas = []
embedding_data = []
unique_video_ids = []
transcriptions = []
counter = 0
texts = []
start_times = []

# Processing and upserting text chunks into Pinecone Serverless
for video_id in unique_video_ids:
    # Load and process video transcript
    # [... transcript processing code ...]

    # Vectorize text and create metadata
    for i, d in enumerate(texts):
        splits = text_splitter.split_text(d)
        docs.extend(splits)
        metadatas.extend([{...}])  # Metadata for each text chunk

# Upsert data into Pinecone Serverless
if vs == 'PineconeServerless':
    try:
        print("Saving data to the serverless vectorstore")
        vector_store = Pinecone.from_texts(docs, embeddings, metadatas=metadatas, index_name=index_name)
        print("Vectorstore save complete")
    except Exception as e:
        print(f"Error upserting data into the vectorstore: {e}")
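The bookkeeping inside that loop — extend docs with the splits and append one metadata dictionary per split, keeping the two lists aligned — can be sketched in isolation. The metadata fields below (video_id, source_chunk) are hypothetical, since the article elides the real dictionary; what matters is that docs[i] and metadatas[i] stay paired, which is what Pinecone.from_texts(docs, embeddings, metadatas=...) expects.

```python
def build_records(video_id, texts, split_fn):
    """Split each transcript chunk and pair every split with a metadata dict.

    Keeps docs and metadatas index-aligned so they can be passed together
    to a vector store's from_texts-style upsert.
    """
    docs, metadatas = [], []
    for i, text in enumerate(texts):
        splits = split_fn(text)
        docs.extend(splits)
        # Hypothetical fields -- the notebook's actual metadata is not shown.
        metadatas.extend(
            {"video_id": video_id, "source_chunk": i} for _ in splits
        )
    return docs, metadatas

docs, metadatas = build_records(
    "video123", ["line one\nline two"], lambda t: t.split("\n")
)
```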

Looking Ahead: The Strategic Advantage of Pinecone Serverless

The advent of Pinecone Serverless represents a significant milestone in the evolution of strategic planning tools and methodologies. By combining the dynamic, visual approach of Wardley Mapping with the scalable, efficient capabilities of Pinecone Serverless, businesses are poised to achieve a level of strategic agility and insight previously unattainable.

As we move forward, the synergy between Wardley Mapping and Pinecone Serverless will undoubtedly continue to evolve, offering businesses new ways to navigate the complexities of the modern strategic landscape. The promise of up to 50x lower costs, combined with the ease of use and exceptional performance of Pinecone Serverless, sets a new standard for strategic planning and analysis. In this new era, businesses equipped with these advanced tools will not just respond to the market but actively shape it, leveraging real-time insights and strategic foresight to carve out a competitive edge in an ever-changing world.

--

Mark Craddock
Prompt Engineering

Techie. Built VH1, G-Cloud, Unified Patent Court, UN Global Platform. Saved UK Economy £12Bn. Now building AI stuff #datascout #promptengineer #MLOps #DataOps