Create a personalized ChatGPT question-answer system with Reddit WallStreetBets

Angela Kunanbaeva
4 min read · Apr 7, 2023


Want to create your own specialized ChatGPT question-answer system, with answers tailored to your specific needs? Here’s a quick how-to.

Similarly to this post, we’ll need access to Azure OpenAI services, which can be requested here. The main difference from the customized-search-with-ChatGPT post is that we are not going to use Azure Search services to search over our data. Instead, we will use an embeddings model and calculate cosine similarity to rank search results. To give a quick high-level overview of what embeddings are: an embedding is a vectorization learned through deep learning, represented as a low-dimensional dense vector for fast and efficient computation. This way we can quickly run a cosine similarity function to determine which paragraph or chunk of text is semantically closest to the user query. We will use the OpenAI “text-embedding-ada-002” embeddings model, since it’s readily available from Azure OpenAI services, and we’ll use the latest version in order to take advantage of the latest weights and updates.
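As a quick illustration of the ranking step, cosine similarity between two embedding vectors can be computed with a few lines of NumPy (a minimal sketch; the article doesn’t show its exact implementation):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    1.0 means same direction, 0.0 means orthogonal (unrelated)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because OpenAI embeddings are returned as plain lists of floats, this is all that’s needed to compare a query against every scraped post.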

I find Reddit a fascinating platform; it’s even something of a hack nowadays to add “reddit” to your Google search query to get better results. That said, having Bing chat readily available in the same window in the Edge browser makes it a lot easier to search the internet without leaving the current tab or dealing with myriad ads. I’m even using it right now while writing this article, but that’s certainly another topic for discussion.

Anyhow, ever since all of the AMC/GME excitement a few years back on the WallStreetBets subreddit, I’ve had the idea of scraping this thread for analysis, similar to my other Reddit project. But it is always such a tedious process to analyze textual data, especially data coming from public forums with a lot of noise. So taking advantage of ChatGPT’s natural language understanding is a no-brainer in this situation. Plus, there’s no need to store scraped and processed data in a database and incur additional costs. Imagine: we can scrape it once a day, or however many times a day, store the data in memory to answer our questions, and then dispose of it. Pretty neat, right?

Mine the WallStreetBets subreddit

Firstly, we’ll need to scrape the data like so. Disclaimer: I re-used my old code from the other project I mentioned; there might be a more efficient way.
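The original code snippet isn’t reproduced here, but the scraping step might look something like the following sketch using PRAW (the standard Reddit API wrapper — an assumption, since the article doesn’t name its library). The client is passed in so that credentials, which are placeholders, stay out of the function:

```python
def fetch_posts(reddit, subreddit="wallstreetbets", limit=500):
    """Collect title/selftext pairs from a subreddit's hot listing.

    `reddit` is expected to be a praw.Reddit instance (an assumption --
    any object exposing .subreddit(name).hot(limit=...) works).
    """
    posts = []
    for submission in reddit.subreddit(subreddit).hot(limit=limit):
        posts.append({
            "title": submission.title,
            "selftext": submission.selftext,
        })
    return posts

# Usage (client_id / client_secret are placeholders, not real values):
# import praw
# reddit = praw.Reddit(client_id="...", client_secret="...",
#                      user_agent="wsb-qa-demo")
# posts = fetch_posts(reddit)
```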

After we get the data, around 480 posts in my case, we convert it to a pandas dataframe and concatenate the title and selftext fields, since we don’t need to distinguish the two for our purposes. The next step is to count the number of tokens, so that we don’t exceed the token limit on our calls to the OpenAI GPT model.
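That preparation step could be sketched as follows. The tokenizer is injected as a callable; for “text-embedding-ada-002” that would typically be tiktoken’s `cl100k_base` encoding (an assumption — the article doesn’t say which tokenizer it used):

```python
import pandas as pd

def prepare_frame(posts, encode):
    """Build a dataframe with combined text and a per-row token count.

    `encode` is a tokenizer callable returning a list of tokens, e.g.
    tiktoken.get_encoding("cl100k_base").encode (an assumption).
    """
    df = pd.DataFrame(posts)
    # Title and body are merged: we don't need to distinguish them.
    df["text"] = df["title"].fillna("") + " " + df["selftext"].fillna("")
    df["n_tokens"] = df["text"].apply(lambda t: len(encode(t)))
    return df
```

The `n_tokens` column is what later lets us stay under the model’s context limit when assembling a prompt.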

Get embeddings from OpenAI model

To be clear, I haven’t performed much cleaning of the text field, like excluding URLs, whitespace and whatnot. Not because I was being lazy, but to further evaluate the NLU results ;-)

So what we want to do is deploy two models in our Azure OpenAI service: one to obtain embeddings for our data as well as for the user query (in our case “text-embedding-ada-002”), and another model to answer our questions. We will deploy “gpt-35-turbo”, but if it’s not available in your region yet, “text-davinci-002” will work just as well. The difference between the two is that the former is designed for a chat interface whereas the latter is for text completion, so either will work fine for a question-answer system.

OpenAI Studio Deployments tab

Once our models are deployed, depending on the pricing tier of your service, there’s a proper way to get embeddings and a hackier way. I had to make the call row by row in order to avoid a rate-limit error.
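A row-by-row version with a simple throttle might look like this sketch. The API call itself is passed in as `get_embedding`; the commented example uses the openai 0.x Azure-style call that was current when this was written (an assumption — the deployment name and delay are placeholders):

```python
import time

def embed_rows(texts, get_embedding, delay=0.1):
    """Embed texts one at a time, sleeping between calls to stay
    under the service's rate limit.

    `get_embedding` wraps the actual API call, e.g. (openai 0.x,
    Azure style -- an assumption):
        lambda t: openai.Embedding.create(
            input=t, engine="text-embedding-ada-002"
        )["data"][0]["embedding"]
    """
    embeddings = []
    for text in texts:
        embeddings.append(get_embedding(text))
        time.sleep(delay)  # crude throttle against rate-limit errors
    return embeddings
```

Batching the inputs would be faster, but a fixed per-row delay is the simplest way to avoid the exceeded-quota error mentioned above.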

Create question context

After all of the text embeddings are obtained, the next step is to define a function that creates a context from our scraped data and the user question. For that, we make a call to get an embedding for our question and calculate its cosine similarity with the embeddings from our mined Reddit data.
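The context-building function could be sketched like this: rank rows by similarity to the question embedding, then pack the best-matching texts into one string until a token budget is reached (the budget value and the `###` separator are placeholders, not the article’s exact choices):

```python
import numpy as np
import pandas as pd

def create_context(question_embedding, df, max_tokens=1800):
    """Rank scraped rows by cosine similarity to the question and
    join the most similar texts into a context string, stopping
    before `max_tokens`. Assumes `df` has 'text', 'n_tokens' and
    'embedding' columns."""
    q = np.asarray(question_embedding, dtype=float)

    def sim(e):
        e = np.asarray(e, dtype=float)
        return float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))

    ranked = df.assign(similarity=df["embedding"].apply(sim))
    ranked = ranked.sort_values("similarity", ascending=False)
    context, used = [], 0
    for _, row in ranked.iterrows():
        used += row["n_tokens"]
        if used > max_tokens:
            break
        context.append(row["text"])
    return "\n\n###\n\n".join(context)
```

The token budget keeps the assembled context within the completion model’s prompt limit, which is why we counted tokens per row earlier.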

Create question-answer

Now that we have a context function that can be called to support our query, we can define the question-answer function. Let’s specify which engine should be used to answer the query and give guidelines to the GPT model, so that the response is more concise and to the point. For example, if the requested information is not found in the Reddit data, the GPT model will respond with “I don’t know”. There are other parameters that can be adjusted, like temperature, to make the result either more precise or more creative, but that’s up to you to determine.
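The question-answer function might be sketched as follows. The completion call is injected as `complete`; the commented example shows the openai 0.x Azure-style call (an assumption — the engine name, temperature and max_tokens are placeholders), and the prompt wording is an illustration of the “I don’t know” guideline rather than the article’s exact prompt:

```python
def answer_question(question, context, complete):
    """Answer a question grounded in the retrieved Reddit context.

    `complete` wraps the completion call, e.g. (openai 0.x, Azure
    style -- an assumption):
        lambda p: openai.Completion.create(
            engine="gpt-35-turbo", prompt=p,
            temperature=0, max_tokens=300,
        )["choices"][0]["text"]
    """
    prompt = (
        "Answer the question based on the context below. If the "
        'answer is not in the context, say "I don\'t know".\n\n'
        f"Context: {context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return complete(prompt).strip()
```

Setting temperature to 0 biases the model toward precise, repeatable answers; raising it trades precision for creativity, as mentioned above.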

Result

There you have it: your own personalized question-answer system. This can be scheduled to mine the data (any data; it doesn’t have to be WSB) and maybe packaged into a chatbot, but that’s a topic for another day.

Explore more in the openai-cookbook. The ease of deployment makes it much simpler to experiment with so many applications. Hope this was helpful. Thank you for your time!
