Building an AI Body of Knowledge for MLOps using ChatGPT

Collect, clean and use ChatGPT to build an AI knowledge base for MLOps

Mark Craddock
Prompt Engineering

--

The Code

The provided code offers a comprehensive solution for creating an AI-powered knowledge base focused on Machine Learning Operations (MLOps) by leveraging YouTube content. The code is split into four parts so that a GPU is only used when it is needed.

Here’s a detailed breakdown.

Part 1: Data Collection

We collect the MLOps videos from both LLMs in Production conferences, using their YouTube playlists.

We download all the videos and store them on Google Drive for later processing.

The code is in a Jupyter Notebook developed in Google Colab. The link to the original file is also provided.

1. Google Drive Mounting:

from google.colab import drive
drive.mount('/content/gdrive')

This section mounts Google Drive to the Colab environment, allowing data to be saved directly to Google Drive.

2. Directory Set Up:

KB_FOLDER = "/content/gdrive/Shareddrives/AI/MLOpsKB"  
...
if not os.path.exists(TRANSCRIPTS_WHISPER_FOLDER):
    os.makedirs(TRANSCRIPTS_WHISPER_FOLDER)

Here, various paths for storing data such as YouTube audio, transcripts, and more are set up. If these directories don’t exist, they are created.

3. Scraping YouTube Playlists:

!pip install -q scrapetube
import scrapetube
...
print(unique_video_ids)

The scrapetube library is used to extract video IDs from two YouTube playlists. These IDs are then combined and deduplicated to form a list of unique video IDs.
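As a rough sketch, collecting and deduplicating the video IDs might look like this (the playlist IDs below are placeholders, not the actual conference playlists):

import scrapetube

# Placeholder playlist IDs -- substitute the real conference playlist IDs
playlist_ids = ["PLAYLIST_ID_1", "PLAYLIST_ID_2"]

video_ids = []
for playlist_id in playlist_ids:
    # scrapetube yields a dict per video; the ID is stored under 'videoId'
    for video in scrapetube.get_playlist(playlist_id):
        video_ids.append(video['videoId'])

# Deduplicate while preserving order
unique_video_ids = list(dict.fromkeys(video_ids))
print(unique_video_ids)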

4. Storing Video IDs Locally:

with open(f'{YT_AUDIO_FOLDER}/videos.txt', 'a') as f:
    f.write(f'{video_id}\n')

All the unique video IDs are stored in a local file for further processing.

5. Downloading Audio from YouTube:

!pip install -q yt-dlp
import yt_dlp as yt
...
print(counter, "of", total_videos, ": Existing file: " + path)

This section uses the yt-dlp library to download the audio of each video from YouTube. If the audio for a video already exists, it skips downloading.
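A minimal sketch of the download step (the output template and post-processing options are illustrative):

import os
import yt_dlp as yt

for counter, video_id in enumerate(unique_video_ids, start=1):
    path = f"{YT_AUDIO_FOLDER}/{video_id}.mp3"
    if os.path.exists(path):
        print(counter, "of", len(unique_video_ids), ": Existing file: " + path)
        continue
    ydl_opts = {
        'format': 'bestaudio/best',  # best available audio stream
        'outtmpl': f'{YT_AUDIO_FOLDER}/{video_id}.%(ext)s',
        'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3'}],
    }
    with yt.YoutubeDL(ydl_opts) as ydl:
        ydl.download([f'https://www.youtube.com/watch?v={video_id}'])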

Part 2: Speech to Text

This segment specifically deals with transcribing the previously collected audio files into text. Let’s go through the details:

We now use Whisper JAX to convert the audio files into text; the JAX implementation makes the transcription significantly faster.

1. Initial Set Up and Runtime Checks:

# -*- coding: utf-8 -*-
...
print('You are using a high-RAM runtime!')

Let's check whether the runtime environment utilises a GPU and has high RAM.

2. Google Drive Mounting:

from google.colab import drive
drive.mount('/content/gdrive')

This section mounts Google Drive to the Colab environment, enabling direct saving and retrieval of data from Google Drive.

3. Directory Verification and Creation:

KB_FOLDER = "/content/gdrive/Shareddrives/AI/MLOpsKB"
...
if not os.path.exists(BOOK_FOLDER):
    os.makedirs(BOOK_FOLDER)

The script sets paths for various directories where the knowledge base and associated data will be stored. If these directories don’t already exist, they are created.

4. Transcription Setup:

import jax
jax.devices()
...
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=16)

Here, the required libraries are imported, and the Whisper pipeline is set up. Whisper is OpenAI's automatic speech recognition (ASR) system that will be used to transcribe the audio files. We also use JAX to speed up the ASR pipeline. JAX is a Python library designed for high-performance machine learning research; at its core it is a numerical computing library, much like NumPy, but with key improvements such as just-in-time compilation and native support for accelerators. It was developed by Google and is used internally by both the Google and DeepMind teams.
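Once the pipeline is instantiated, a single call transcribes an audio file; the first call triggers JAX's just-in-time compilation, so subsequent calls run much faster. A minimal sketch (the file path is illustrative):

# The first call compiles the model; later calls reuse the cached computation
outputs = pipeline(f"{YT_AUDIO_FOLDER}/example.mp3", task="transcribe", return_timestamps=True)
print(outputs["text"])        # full transcript
print(outputs["chunks"][:2])  # first couple of timestamped segments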

5. Loading Video IDs:

unique_video_ids = []
...
unique_video_ids.append(curr_place)

Video IDs stored in the previous part are loaded into the unique_video_ids list for processing.

6. Transcribing Audio Files:

import re, json, os

def transcribe_file(filename):
    ...
    return transcription

transcriptions = []
...
with open(transcript_filename, 'w') as f:
    f.write(json.dumps(transcription))

Each audio file corresponding to a video ID is transcribed using the Whisper pipeline. The transcriptions are saved in a JSON format on Google Drive for easy retrieval and further processing.
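As a rough sketch of this loop, assuming the audio was saved as <video_id>.mp3 in Part 1 and transcripts are written as <video_id>_large.json (the file naming is an assumption):

import json, os

for video_id in unique_video_ids:
    audio_path = f"{YT_AUDIO_FOLDER}/{video_id}.mp3"
    transcript_filename = f"{TRANSCRIPTS_WHISPER_FOLDER}/{video_id}_large.json"
    if os.path.exists(transcript_filename):
        continue  # already transcribed
    transcription = pipeline(audio_path, task="transcribe", return_timestamps=True)
    with open(transcript_filename, 'w') as f:
        f.write(json.dumps(transcription))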

Part 3: Upsert Data to Vectorstore

This specific segment is focused on setting up the vector database and upserting the data, which is crucial for efficiently querying and retrieving information. Let’s break down the code:

1. Runtime Checks:

try:
    gpu_info = !nvidia-smi
...
print('You are using a high-RAM runtime!')

The script first checks the runtime environment to determine if it’s connected to a GPU and how much RAM is available. This is important since some operations, especially vector computations, benefit greatly from GPU acceleration.

2. Directory Configuration:

from google.colab import drive
drive.mount('/content/gdrive')
...
if not os.path.exists(TRANSCRIPTS_WHISPER_FOLDER):
    os.makedirs(TRANSCRIPTS_WHISPER_FOLDER)

Here, Google Drive is mounted to access saved data, and paths for various directories related to the knowledge base are established. If certain directories don’t exist, they’re created.

3. Dependency Installation:

!pip install -q langchain
!pip install -q openai
!pip install -q tiktoken

Essential Python packages are installed: langchain (a framework for building applications around large language models), openai (the OpenAI API client), and tiktoken (OpenAI's tokenizer, used for counting tokens).

4. OpenAI Setup:

os.environ["OPENAI_API_KEY"] = "" # Add your OpenAI API key here
MODEL = "gpt-3.5-turbo-16k-0613"

An API key is set for OpenAI, and a specific ChatGPT model is selected for operations.

5. Vector Database Initialisation:

vectorstore = 'FAISS'
...
else:
    !pip install -q faiss-cpu
    from langchain.vectorstores import FAISS

The code supports two vector database options: Pinecone and FAISS. Here, the FAISS vector database is chosen and set up.

6. Text Chunking and Vector Embeddings:

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separator="\n")
...
if os.path.exists(f"{YT_DATASTORE}/index.faiss"):
    ...
else:
    vector_store.save_local(f"{YT_DATASTORE}")

This segment is crucial. The code first splits the text into manageable chunks. Then, for each chunk, vector embeddings are created. These embeddings are then either merged into an existing vector database or saved as a new one. The vector embeddings essentially convert text data into a format that can be efficiently queried and compared.
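A condensed sketch of this step using the classic LangChain API (the metadata fields and variable names are illustrative):

import os
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separator="\n")
embeddings = OpenAIEmbeddings()

# Split one transcript into chunks and tag each chunk with its source video
chunks = text_splitter.split_text(transcript_text)
metadatas = [{"source": f"https://youtu.be/{video_id}"} for _ in chunks]
new_store = FAISS.from_texts(chunks, embeddings, metadatas=metadatas)

if os.path.exists(f"{YT_DATASTORE}/index.faiss"):
    # Merge the new vectors into the existing index
    vector_store = FAISS.load_local(YT_DATASTORE, embeddings)
    vector_store.merge_from(new_store)
    vector_store.save_local(YT_DATASTORE)
else:
    new_store.save_local(YT_DATASTORE)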

7. Storing Chunks for Later Processing:

import json
...
with open(f'{TRANSCRIPTS_TEXT_FOLDER}/' + video_id + '_large.txt', 'w') as output_file:
    output_file.write(text)

This section reads video IDs from a file, loads their transcripts, and then saves each transcript’s text content to a designated folder for later use.

Part 4: Query the Data using ChatGPT

This segment focuses on querying the vector database, specifically aiming to retrieve relevant information using ChatGPT. Let’s dive into the details:

We can now start to query the MLOps data using ChatGPT.

1. Initial Setup:

from google.colab import drive
drive.mount('/content/gdrive')

This section, generated from a Google Colab notebook, prepares the environment and mounts Google Drive to access the stored data.

2. Directory Verification and Creation:

KB_FOLDER = "/content/gdrive/MyDrive/MLOpsKB"
...
if not os.path.exists(TRANSCRIPTS_WHISPER_FOLDER):
    os.makedirs(TRANSCRIPTS_WHISPER_FOLDER)

The script establishes paths for various directories where the knowledge base and relevant data reside. Directories are created if they don’t already exist.

3. Load Dependencies:

!pip install -q langchain
...
!pip install -q cached_property

Here, the necessary Python packages are installed, which are crucial for the subsequent operations, including langchain, openai, tiktoken, and others.

4. Vector Database Setup:

vectorstore = 'FAISS'
...
else:
    print(f"Missing files. Upload index.faiss and index.pkl files to data_store directory first")

The code supports two vector database options: Pinecone and FAISS. In this instance, FAISS is chosen. The code checks if the necessary datastore exists; if not, it prompts the user to upload the required files.
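The loading logic might look roughly like this, assuming the index and docstore were saved under the datastore folder in the previous part:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()

if os.path.exists(f"{YT_DATASTORE}/index.faiss"):
    # Load the FAISS index and the pickled docstore saved during the upsert step
    vector_store = FAISS.load_local(YT_DATASTORE, embeddings)
else:
    print("Missing files. Upload index.faiss and index.pkl files to data_store directory first")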

5. OpenAI Setup and Prompt Configuration:

os.environ["OPENAI_API_KEY"] = "" # Add your OpenAI API key here
...
prompt = ChatPromptTemplate.from_messages(messages)

The OpenAI API key is set, and a specific ChatGPT model (in this case, gpt-3.5-turbo-16k-0613) is selected. A prompt structure is also defined to guide the AI's responses.
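A sketch of the prompt setup with LangChain's chat prompt classes (the system instructions here are illustrative, not the exact wording used in the notebook):

from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_template = """Use the following pieces of context to answer the user's question about MLOps.
If you don't know the answer, say so rather than making one up.
----------------
{summaries}"""

messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)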

6. Initialising the Retrieval Chain:

llm = ChatOpenAI(model_name=MODEL, temperature=0)
...
chain = RetrievalQAWithSourcesChain.from_chain_type(...)

This section initialises the retrieval process by setting up a chain that links the querying process with the vector store and the ChatGPT model. The objective is to enable a user to query the knowledge base and get responses supplemented with sources.
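A minimal sketch of wiring the chain together (the chain_type and retriever settings are assumptions):

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

llm = ChatOpenAI(model_name=MODEL, temperature=0)

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff the retrieved chunks into a single prompt
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,  # keep the matched chunks so sources can be cited
)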

7. Querying the Knowledge Base:

query = "How can I learn about MLOps?"
...
print(f"Content: {document.page_content}")

The chain is utilised to query the knowledge base. The results are printed, supplemented with sources (in this case, YouTube video links and associated metadata). Multiple example queries are given, including asking about learning MLOps, key components of MLOps, and leaders in the MLOps community.
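Running a query and inspecting the answer alongside its sources might look like this (the output keys follow the chain's standard format):

query = "How can I learn about MLOps?"
result = chain(query)

print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")

# With return_source_documents=True, the matched chunks are also available
for document in result['source_documents']:
    print(f"Source: {document.metadata['source']}")
    print(f"Content: {document.page_content}")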

Building a UI for the data

In this section, we explore an AI-driven chat interface built with Streamlit and OpenAI. The interface provides users with insights and recommendations based on the MLOps videos. Let’s delve into the specifics:

1. Importing Necessary Modules:

import os
import re
import openai
import streamlit as st
from langchain.chat_models import ChatOpenAI
...
from streamlit_player import st_player

The core libraries and modules necessary for the application are imported.

2. Configuration:

openai.api_key = st.secrets["OPENAI_API_KEY"]
os.environ["PROMPTLAYER_API_KEY"] = st.secrets["PROMPTLAYER"]
MODEL = "gpt-3.5-turbo-16k-0613"

API keys are configured, and the desired GPT model is selected.

3. Text Cleaning Functions:

def remove_html_tags(text):
...
def remove_markdown(text):
...
def clean_text(text):
...

A suite of utility functions is defined to strip HTML and markdown formatting, ensuring that the video content is presented cleanly.
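An illustrative version of these helpers using simple regular expressions (the exact patterns in the app may differ):

import re

def remove_html_tags(text):
    # Strip anything that looks like an HTML tag
    return re.sub(r'<[^>]+>', '', text)

def remove_markdown(text):
    # Turn [link](url) into plain link text, then drop common markdown markers
    text = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', text)
    return re.sub(r'[*_`#>]+', '', text)

def clean_text(text):
    return remove_markdown(remove_html_tags(text)).strip()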

4. Streamlit UI Configuration:

st.set_page_config(page_title="Chat with MLOps Conference Videos")
st.title("Chat with MLOps Videos")
...
st.sidebar.divider()

The Streamlit UI is tailored with relevant titles, sidebars, and version details.

5. Data Loading:

DATASTORE = "data_store"
if os.path.exists(DATASTORE):
...
else:
st.write(f"Missing files. Upload index.faiss and index.pkl files to {DATA_STORE_DIR} directory first")

The application checks for the presence of a data store and loads associated data or prompts the user to upload necessary files.

6. Chatbot Setup:

system_template="""
As a chatbot, analyse the provided videos on MLOps...
...
prompt = ChatPromptTemplate.from_messages(prompt_messages)
...
chain = RetrievalQAWithSourcesChain.from_chain_type(
...
)

The chatbot is structured with specific instructions and set up to retrieve relevant video sources based on user queries.

7. Streamlit Chat Interface:

if "messages" not in st.session_state:
...
for message in st.session_state.messages:
...
if query := st.chat_input("What question do you have for the videos?"):
...

The chat interface is developed using Streamlit’s chat functionalities. As users input their queries, the assistant responds with AI-driven insights and even provides links to specific video sources.
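A condensed sketch of that chat loop using Streamlit's chat primitives (the message structure and the way sources are rendered are illustrative):

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if query := st.chat_input("What question do you have for the videos?"):
    st.session_state.messages.append({"role": "user", "content": query})
    with st.chat_message("user"):
        st.markdown(query)

    result = chain(query)
    answer = result["answer"]

    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.markdown(answer)
        for document in result.get("source_documents", []):
            st_player(document.metadata["source"])  # embed the cited video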

Want your own version of the Streamlit app?

Just click ‘fork this app’


Mark Craddock
Prompt Engineering

Techie. Built VH1, G-Cloud, Unified Patent Court, UN Global Platform. Saved UK Economy £12Bn. Now building AI stuff #datascout #promptengineer #MLOps #DataOps