ChatGPT API Magic: Leveraging Frontend Endpoints for Advanced Data Extraction

Rodolflying
6 min read · Mar 27, 2023


Uncovering the Secrets of ChatGPT’s Conversations by Tapping into Frontend API Endpoints and Efficient Data Scraping Techniques

This is the second article in the series on scraping ChatGPT! Check out the summary of the GPT_scraper repository here.

If you already know everything about scraping, you'll be pleased: we'll get straight to the point. Sometimes one GIF says it all without further explanation:

Introduction:

In the world of AI and natural language processing, ChatGPT has emerged as a game-changing technology that is continuously transforming the way we communicate with machines. While many users are exploring its potential, the true power of ChatGPT often lies hidden in the depths of its API endpoints. This article will serve as your guide to understanding and harnessing the potential of ChatGPT’s frontend API endpoints, enabling you to delve deeper into its capabilities and extract valuable insights from your conversations.

I will provide step-by-step instructions on how to inspect elements and exploit the frontend API to your advantage. By the end of this article, you will be well-versed in the art of ChatGPT data extraction, empowering you to unlock the full potential of this revolutionary technology. Whether you are an AI enthusiast, a data scientist, or a developer looking to enhance your ChatGPT knowledge, this comprehensive guide is tailored to suit your needs. So, let’s embark on this exciting adventure and uncover the hidden treasures of ChatGPT’s frontend API endpoints!

Setting up headers:

Before diving into the functions, it’s crucial to set up the headers required for the API requests. Follow these steps to obtain the headers:

I. With ChatGPT open in a Chrome tab, press CTRL+SHIFT+I to inspect the page.

II. Go to “Network” and filter by “Fetch/XHR,” then refresh the page (or F5). Click one of the previous conversations.

III. Look for “conversations?offset=0&limit=20” and for something like “e1dbb0b1-2567-48cd-b2c0-0bcda815d7yd” (these are the two hidden backend API requests whose headers we will use).

IV. Right-click on each request and copy it as cURL (bash).

V. Follow the instructions for your preferred API testing tool (Postman or Insomnia) to import the cURL commands and obtain the headers.

get headers

VI. Rename the “headers” file to “headers.py” and paste the corresponding headers.

ids_header = {copy here the "conversations?offset=0&limit=20" headers from the code provided by Postman/Insomnia}
conversation_header = {copy here the "e1dbb0b1-2567-48cd-b2c0-0bcda815d7yd" headers from the code provided by Postman/Insomnia}
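
For reference, here is a minimal sketch of what headers.py might end up looking like. The exact keys come from whatever Postman/Insomnia generates for your session; the field names and values below are placeholders, not real credentials:

# headers.py -- illustrative placeholders only; paste the headers that
# Postman/Insomnia generated from your own cURL commands instead
ids_header = {
    "accept": "*/*",
    "authorization": "Bearer <your-session-token>",
    "cookie": "<your-session-cookies>",
    "referer": "https://chat.openai.com/chat",
    "user-agent": "Mozilla/5.0 ...",
}

# Usually nearly identical, since both requests come from the same
# logged-in browser session
conversation_header = dict(ids_header)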

Imports:

# well-known Python libraries
import pandas as pd
import json
import random
import requests
from time import sleep, strftime
# the headers live in a separate file, since they are too long
from headers import ids_header, conversation_header

The Script:

The script consists of several functions that work together to scrape and process data from frontend API endpoints. These functions include:

get_response(url, headers, payload): This helper function sends an HTTP GET request using the given URL, headers, and payload, and returns the response. The function is used throughout the script to make API calls.

# Define a helper function to send HTTP GET requests
def get_response(url, headers, payload):
    response = requests.request("GET", url, headers=headers, data=payload)
    return response

get_ids(): This function retrieves conversation IDs, titles, and creation times by iterating through the API response data. It calculates the total number of iterations needed to fetch all conversations and stores them as a list of dictionaries.

# Define a function to retrieve conversation IDs, titles, and creation times
def get_ids():
    # Initialize variables
    payload = {}
    headers = ids_header
    data = {}
    ids, create_time, titles = [], [], []

    # Iterate through offset and limit to get all conversations
    i, offset, total_iterations = 0, 0, 0
    while True:
        try:
            # Build the URL for the API request
            url = ("https://chat.openai.com/backend-api/"
                   f"conversations?offset={str(offset)}&limit=100")
            response = get_response(url, headers, payload)
            data = json.loads(response.text)
            # Loop through the API response data to extract conversation details
            for item in data['items']:
                ids.append(item['id'])
                create_time.append(item['create_time'])
                titles.append(item['title'])
            # On the first pass, work out how many pages of 100 are needed
            if i == 0:
                total_chats = data['total']
                total_iterations = total_chats / 100
                if total_iterations % 1 != 0:
                    total_iterations = int(total_iterations) + 1
            offset = offset + 100
            i += 1
            # Break the loop when all conversations have been processed
            if i == total_iterations:
                break
        except Exception as e:
            print(str(e))
            print('done')
            break

    # Build the data dict id by id with a comprehension
    # (keeping the create_time list as well)
    data = {'conversations': [{'id': id, 'title': title,
                               'create_time': create_time, 'messages': []}
                              for id, title, create_time in zip(ids, titles, create_time)]}
    return data, ids
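
To make the parsing above easier to follow, this is roughly the shape of the JSON that the conversations endpoint returns, trimmed to the fields the loop actually reads (the values are made up for illustration):

# Approximate shape of the /backend-api/conversations response
# (only the fields used above; values are illustrative)
{
    "total": 223,
    "items": [
        {
            "id": "e1dbb0b1-2567-48cd-b2c0-0bcda815d7yd",
            "title": "Some conversation title",
            "create_time": "2023-03-27T14:05:12.000000"
        },
        # ... up to 100 items per page ...
    ]
}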

get_conversations(data, ids): This function extracts conversation details using the conversation IDs fetched by get_ids(). It loops through the IDs, sends a request to the API endpoint for each conversation, and appends the conversation messages to the data dictionary.

# Define a function to retrieve the conversations using the conversation IDs
def get_conversations(data, ids):
    # Initialize variables
    payload = {}
    headers = conversation_header

    # Loop through conversation IDs and fetch the corresponding conversation
    for i, id in enumerate(ids):
        url = f"https://chat.openai.com/backend-api/conversation/{id}"
        response = get_response(url, headers, payload)
        response_json = json.loads(response.text)

        # Loop through the messages in the conversation
        for message_id, message_data in response_json["mapping"].items():
            if "message" in message_data:
                role = message_data["message"]["author"]["role"]

                # Check if the role is "user" or "assistant"
                if role == "user":
                    # Append a new message object for the human's question
                    # to the conversation's list of messages
                    human_message = {
                        'sender': 'human',
                        'text': message_data["message"]["content"]["parts"]
                    }
                    data['conversations'][i]['messages'].append(human_message)

                elif role == "assistant":
                    # Append a new message object for the bot's answer
                    # to the conversation's list of messages
                    bot_message = {
                        'sender': 'bot',
                        'text': message_data["message"]["content"]["parts"]
                    }
                    data['conversations'][i]['messages'].append(bot_message)

        # Sleep for a random time (2~5 seconds) to avoid getting blocked
        # for too many requests
        sleep(random.randint(2, 5))

    return data
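
For context, each conversation response contains a “mapping” of message nodes keyed by message ID, which is what the inner loop walks. Trimmed to the fields read above, a single node looks roughly like this (values are illustrative):

# Approximate shape of one entry in response_json["mapping"]
# (only the fields used above; values are illustrative)
{
    "<message-id>": {
        "message": {
            "author": {"role": "user"},  # or "assistant" / "system"
            "content": {"parts": ["What is the capital of France?"]}
        }
    }
}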

save_json(data, date): This function saves the scraped data in a JSON file. It takes the data dictionary and the current date as input, creates a filename, and writes the JSON file to the 'outputs' folder.

def save_json(data, date):
    filename = f"outputs/API scraped conversations {str(date)}.json"
    with open(filename, "w") as f:
        json.dump(data, f)

save_csv(data, date): This function saves the scraped data in a CSV file. It reads the JSON file created by save_json(), converts it to a DataFrame, and writes the CSV file to the 'outputs' folder.

def save_csv(data, date):
    json_file = f"outputs/API scraped conversations {str(date)}.json"
    df = pd.read_json(json_file)
    df = pd.DataFrame(df['conversations'].values.tolist())
    filename = f"outputs/API scraped conversations {str(date)}.csv"
    df.to_csv(filename, index=False)

main(): This is the main function that ties everything together. It calls the other functions in the proper order, sets the current date and time, and saves the scraped data as JSON and CSV files.

def main():
    data, ids = get_ids()
    data = get_conversations(data, ids)
    date = strftime("%d-%m-%Y %H-%M")
    save_json(data, date)
    save_csv(data, date)


if __name__ == "__main__":
    main()

Running the script:

To run the api_scraper.py script, open a terminal and navigate to the project folder (or open the project in your code editor), then execute the following command:

python api_scraper.py
  1. Waiting for the script to finish: The script sleeps for a random 2~5 seconds between each conversation fetch to avoid getting blocked for making too many requests. Wait for the program to complete.
  2. Checking the results: After the script finishes, find the scraped data in the ‘outputs’ folder as JSON and CSV files (see the sanity-check snippet below).
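
If you want to sanity-check the output without opening the files by hand, a few lines like these will do (a hypothetical helper, not part of the repository, assuming the default filenames produced by save_json()):

# Quick sanity check of the most recently scraped JSON file
import json
import os
from glob import glob

latest = max(glob("outputs/API scraped conversations *.json"), key=os.path.getmtime)
with open(latest) as f:
    data = json.load(f)

print(f"{len(data['conversations'])} conversations scraped")
for conv in data['conversations'][:3]:
    print(conv['title'], '-', len(conv['messages']), 'messages')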

The whole process of running the script and getting the results, all in one GIF:

Conclusion:

In this article, we have provided a comprehensive guide to using a Python script for scraping conversation data from frontend API endpoints. We’ve explained each function in detail, provided the corresponding code snippets, and shared the necessary steps to set up the headers for successful data extraction. With this knowledge, you can confidently extract conversation data for your own projects and purposes.


Rodolflying

Industrial Engineer. I find inspiration in data science and technology to solve real-life problems. https://www.linkedin.com/in/rodolfo-sepulveda-847532135/