Coding a custom wordlist generator with ChatGPT

woFF
6 min read · Jan 5, 2023


[Image: Robot typing in front of a computer. Generated with Midjourney]

Long story short: I wanted a custom wordlist to support my hobby bug bounty activities, and since I was already playing around with ChatGPT, I was curious whether it could actually be a helpful companion. This time my goal was to support some custom directory / file enumeration, but you can see how this could be handy for password bruteforcing as well.

I’m more comfortable scripting in Python, so I decided to go with it. The overall goal was to have a script that:

  • takes a parameter to search for on Google
  • retrieves the first N results’ URLs
  • visits the URLs and extracts the text from them
  • pushes the results out to stdout

Nothing fancy, but I was too lazy to use CeWL (which is pretty cool and way more advanced than the end result here…).

Setting up the OpenAI playground

I also tried out CodeGPT in VSCode (as suggested by The PyCoach in https://medium.com/@frank-andrade/codegpt-the-vscode-extension-with-chatgpt-like-functionalities-783323a916c3), but to be honest, at some point I started to get the responses to other users’ queries, so I immediately uninstalled it. (This was specific to the VSCode extension and might have been a temporary glitch; obviously I would never share any input containing secrets… maybe I will dig into it later.)

So I ended up simply using the OpenAI Playground with the code-davinci-002 model and increased the maximum length to 1024. I won’t share all my attempts here, but I definitely plan to highlight the more interesting ones.
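In case you would rather script this than click around the Playground, a minimal sketch with the (pre-1.0) openai Python package could look like this; the placeholder prompt and the environment-variable key handling are my own assumptions, not part of the original workflow:

import os
import openai

# Minimal sketch, assuming the pre-1.0 openai package and an API key
# exported in the environment; the prompt is just a placeholder.
openai.api_key = os.environ['OPENAI_API_KEY']

response = openai.Completion.create(
    model='code-davinci-002',
    prompt="Write a python program that prints 'hello'",
    max_tokens=1024,  # the "maximum length" setting from the Playground
)
print(response.choices[0].text)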

The naive approach

Let’s see how it handles the full scope at this level of detail:

'''
Write a python program that:
- takes a parameter to search for on google
- retrieves the first N result's URLs
- visits the URLs and extracts the text from it
- pushes out the result to stdout

'''

import sys
import requests
from bs4 import BeautifulSoup

def get_urls(search_term, n):
    '''
    Returns the first n urls from a google search
    '''
    url = 'https://www.google.com/search?q={}'.format(search_term)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = [a.attrs.get('href') for a in soup.select('div.r a')]
    return urls[:n]

def get_text(url):
    '''
    Returns the text from a url
    '''
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    text = soup.get_text()
    return text

def main():
    '''
    Main function
    '''
    search_term = sys.argv[1]
    n = int(sys.argv[2])
    urls = get_urls(search_term, n)
    for url in urls:
        print(get_text(url))

if __name__ == '__main__':
    main()

OK, so it looks decent. It actually extracted the two larger logical chunks into two separate functions and suggested using BeautifulSoup for extracting text from HTML.

Error handling is definitely missing, but the larger issue is: it does not work.

The problem is that if you check the HTML output of a Google search, you will notice that the class names change on every occasion (I guess to prevent this type of simple scraping).
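If you want to verify this yourself, a quick (hypothetical) sanity check is to fetch a results page and count what the generated selector actually finds:

import requests
from bs4 import BeautifulSoup

# Hypothetical check: the 'div.r a' selector from the generated code
# matches nothing, because Google keeps changing its class names.
response = requests.get('https://www.google.com/search?q=test')
soup = BeautifulSoup(response.text, 'html.parser')
print(len(soup.select('div.r a')))  # prints 0 against current Google HTML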

Let’s ask ChatGPT how to search Google programmatically

To be honest, I like to use the plain https://chat.openai.com/chat as a rubber duck, so I fell back to it.

Me: I’m writing a python3 code to get the top 20 results’ URLs from google. How would you do it?

RubberDuck: There are several ways to do this, but one option is to use the Google Search API. Here’s how you can get started:

1. Go to the Google Cloud Console.

2. Click the project drop-down and create a new project.

3. Click the hamburger menu and select APIs & Services > Credentials.

4. On the Credentials page, click Create credentials > API key. The API key created dialog displays your newly created API key.

5. Click Close.

The new API key is listed on the Credentials page under API keys.

To restrict the API key for use only with the Custom Search API, click the Edit button for the API key, navigate to the Key restriction section, and set Application restrictions to HTTP referrers.

Add the referrers. This can be a specific web site or an IP address (like 192.168.1.1).

Click Save.

With the API key, you can then use the Custom Search API to search Google and retrieve the top 20 results. Here’s a sample code that demonstrates how to do this:

import requests

def search(query):
    query = query.replace(' ', '+')
    api_key = 'YOUR_API_KEY'
    cx = 'YOUR_CSE_ID'
    url = f'https://www.googleapis.com/customsearch/v1?q={query}&key={api_key}&cx={cx}&num=20'
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        return [item['link'] for item in data['items']]
    else:
        return None

results = search('your search query')
if results:
    for result in results:
        print(result)
else:
    print('No results')

You will need to replace YOUR_API_KEY and YOUR_CSE_ID with your own API key and CSE ID, respectively. You can obtain your CSE ID by creating a Custom Search Engine at https://cse.google.com/cse/.

I hope this helps! Let me know if you have any questions.

A polite response, that’s for sure. I followed the instructions (except that, when reading about the Custom Search API, I ended up creating the project & API key here) and things seemed to be working. So let’s get back to the Playground and try to get the code into a final state.

Getting a working PoC code

I’m skipping a few extra iterations here, as I had to give ChatGPT a few extra hints:

  • uses the Google Custom Search JSON API: see above
  • uses googleapiclient: I didn’t like the requests-based approach… I was pretty sure there was a dedicated library for interacting with Google’s APIs. I asked RubberDuck, and after getting requests recommended again, I pushed for something else and got to the Google API Client Library for Python
  • takes care of pagination when reading the Google Custom Search JSON API results: I noticed the results were always pretty shitty and pagination wasn’t really handled, so I made this an explicit requirement
  • retrieves the first 100 results’ URLs: again, pagination… the API gives back only the first 10 results, and without making it explicit that we may want more than that, I kept getting subpar results (see the pagination sketch below)
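To make the pagination hint concrete, here is a minimal sketch of how paging works with the API’s 1-based start parameter (the function name and the wanted parameter are my own, not from the generated code):

from googleapiclient.discovery import build

# Minimal pagination sketch: the Custom Search JSON API serves at most
# 10 items per call, so we page through with the 1-based 'start' parameter.
def get_urls(query, api_key, cse_id, wanted=100):
    service = build('customsearch', 'v1', developerKey=api_key)
    urls = []
    for start in range(1, wanted + 1, 10):  # 1, 11, 21, ...
        res = service.cse().list(q=query, cx=cse_id, start=start).execute()
        items = res.get('items', [])
        if not items:  # no more results available
            break
        urls.extend(item['link'] for item in items)
    return urls[:wanted]

With those hints, the prompt became: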
'''
Write a python program that:
- takes a parameter to search for on google
- uses the Google Custom Search JSON API
- uses googleapiclient
- takes care of pagination when reading the Google Custom Search JSON API results
- retrieves the first 100 result's URLs
- visits the URLs and extracts the text from it
- pushes out the result to stdout

'''

import sys
import json
import requests
from bs4 import BeautifulSoup
from googleapiclient.discovery import build

def get_search_results(query, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=query, cx=cse_id, **kwargs).execute()
    return res['items']

def get_text_from_url(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup.get_text()

def main():
    query = sys.argv[1]
    number_of_urls = sys.argv[2]
    api_key = sys.argv[3]
    cse_id = sys.argv[4]
    results = []
    for i in range(0, number_of_urls, 10):
        results.extend(get_search_results(query, api_key, cse_id, start=i))
    for result in results:
        print(get_text_from_url(result['link']))

if __name__ == '__main__':
    main()

And it worked (well, I fine-tuned the for loop to get the required number of results). Only at this point did I realize that I had only asked it to retrieve the text… So after a quick consultation I ended up fixing a few things:

  • extracting the words based on a regexp (illustrated below)
  • sending only those to the output
  • and adding some ugly exception handling around the request
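A quick, hypothetical illustration of what that regexp keeps (capitalized words and all-lowercase words of at least two letters; everything else is dropped):

import re

valid_words = re.compile('^([A-Z]{1}[a-z]{1,}|[a-z]{2,})$')

# Hypothetical samples, not taken from any real page:
samples = ['Hello', 'world', 'a', 'HTML', 'foo123', 'Bar']
print([s.lower() for s in samples if valid_words.match(s)])
# -> ['hello', 'world', 'bar']

And with that in place, here is the final script: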
import sys
import re
import requests
from bs4 import BeautifulSoup
from googleapiclient.discovery import build

# Keep capitalized words and all-lowercase words of at least two letters
valid_words = re.compile('^([A-Z]{1}[a-z]{1,}|[a-z]{2,})$')

def get_search_results(query, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=query, cx=cse_id, **kwargs).execute()
    return res['items']

def get_text_from_url(url):
    # Ugly, but keeps the script going when a URL cannot be fetched
    unique_words = set()
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        unique_text = set(soup.get_text().split(' '))
        unique_words = { s.lower() for s in unique_text if valid_words.match(s) }
    except:
        print(f"Error when reading {url}")
    return unique_words

def main():
    query = sys.argv[1]
    number_of_urls = int(sys.argv[2])
    api_key = sys.argv[3]
    cse_id = sys.argv[4]
    search_results = []
    word_results = set()
    # The API's 'start' parameter is 1-based and serves 10 results per page
    for i in range(1, number_of_urls + 1, 10):
        search_results.extend(get_search_results(query, api_key, cse_id, start=i))
    for search_result in search_results:
        word_results = word_results.union(get_text_from_url(search_result['link']))
    for word in word_results:
        print(word)

if __name__ == '__main__':
    main()
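For reference, a hypothetical invocation (the script name is mine; the arguments are, in order: query, number of URLs, API key, CSE ID) would be python3 wordlist_gen.py 'example corp' 100 <API_KEY> <CSE_ID> > wordlist.txt.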

Not the most beautiful code (for the sake of this post I tried not to rework the results too much), but it does the job. I still don’t think ChatGPT will take away our jobs, as it definitely gave stupid responses that I had to evaluate, but as a rubber duck it does a decent job. Would I have been quicker using Google and simply the API documentation? Maybe… but it was fun :)


woFF

Blue teamer, IT security engineer, ex-pentester, work-life balancing wannabe stoic parent, human being aiming for happiness in life.