Building an Autonomous Twitter Account with LLMs

Max Brodeur-Urbas
Published in Better Programming · 8 min read · May 19, 2023

I created my own Twitter bot using Hacker News posts, the GPT-4 API, and scheduled CRON jobs. Check out its tweets here.

Setting up your own only takes a few minutes. Here’s how I did it and what I learned.

How It Works

At the top of every hour, my bot browses the front pages of Hacker News and picks the best post based on my interests. It then reads the article and comes up with its own opinion, sharing it straight to Twitter alongside the link.

The bot is an LLM-powered pipeline that performs the task above using independent, modular components (operators). Scheduled CRON jobs trigger it, so it can run as often as you’d like. Here’s a rundown of each operator.

Technical Details

Note: Skip this section if you’re not interested in the implementation. I explain how to create your own no-code Twitter bot later on in the article.
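
At a high level, each run chains six operators: scrape Hacker News, pick the best post, ingest the article, index it, generate the tweet text via hybrid search, and post the tweet. Purely as an illustration (AgentHub handles this wiring for you; the function below is not part of the real codebase), the shape of the pipeline is just a sequence of steps sharing a context:

def run_pipeline(steps):
    # Illustrative only: each step reads what earlier steps wrote into the shared
    # context, mirroring how the operators below use ai_context.
    ai_context = {}
    for step in steps:
        step(ai_context)
    return ai_context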

Scraping Hacker News

HN felt like a great place to start sourcing articles because the front pages are curated by a group of harsh but very thoughtful critics.

(Edit: I should have used the Hacker News API. Pure web scraping works fine in this case, but the API would have been easier and more dynamic.)
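
For reference, a minimal sketch of what that API-based approach could look like, using the official Hacker News Firebase endpoints (the helper name and the 30-post limit are just illustrative):

import requests

def top_hn_posts(limit=30):
    # Fetch the IDs of the current top stories, then look up each item
    ids = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json').json()[:limit]
    posts = {}
    for post_id in ids:
        item = requests.get(f'https://hacker-news.firebaseio.com/v0/item/{post_id}.json').json()
        # Discussion-only posts (Ask HN, text-only Show HN, etc.) have no external URL
        if item and item.get('url'):
            posts[item['title']] = item['url']
    return posts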

I used the Beautiful Soup Python library to convert posts from the first n pages into a dictionary of titles and links.

def scrape_hacker_news(self, params, ai_context):
    keywords = params.get('keywords', [])
    num_pages = int(params.get('num_pages', 1))  # Convert to int here
    excluded_words = ['AskHN', 'ShowHN', 'LaunchHN']

    if num_pages > 5:
        ai_context.add_to_log("Maximum limit of 5 pages exceeded. Please provide a number up to 5.")
        return

    title_link_dict = {}

    for page_num in range(1, num_pages + 1):
        response = requests.get(f'https://news.ycombinator.com/?p={page_num}')
        bs = BeautifulSoup(response.text, "html.parser")
        posts = bs.select('tr.athing')  # select each post

        for post in posts:
            title_element = post.select_one('.titleline > a')  # select the title element within the post
            if title_element:
                title = title_element.text
                link = title_element['href']

                # Check if any keyword is in the title (if keywords are provided)
                if keywords and not any(keyword.lower() in title.lower() for keyword in keywords):
                    continue

                # Skip posts containing any of the excluded words
                if any(excluded_word in title for excluded_word in excluded_words):
                    continue

                title_link_dict[title] = link

I filtered out posts with “AskHN”, “ShowHN”, or “LaunchHN” flairs to ignore any discussion threads.

Choosing the best post

Here I wanted to emulate how I actually scroll through posts. It’s pretty shallow but if the title is enough to grab my interest then it should be enough for my bot. I concatenate all the titles into a single string and ask the LLM to pick the most relevant based on my prompt.

For example: “Most interesting article for AI and tech enthusiasts”. You can tailor the prompt to any niche you want to target.

def find_best_post(self, params, ai_context):
    query = params.get('query', '')
    posts = ai_context.get_input('title_link_dict', self)
    title_link_dict = json.loads(posts)

    # Drop any links we've already tweeted about
    used_links = ai_context.memory_get_list('tweeted_links')
    title_link_dict = {title: url for title, url in title_link_dict.items() if url not in used_links}

    # Converting titles into a context string
    context_string = ', '.join(title_link_dict.keys())

    ai_context.add_to_log(f"Analyzing {len(title_link_dict)} potential post(s).", self)

    # Final prompt string
    message = f"From the following post titles: {context_string}, pick the post that most closely reflects this desire: {query}? Return the title of the post selected and nothing else."

    # Here you can pass the prompt to your function
    msgs = [{"role": "user", "content": message}]
    best_post_title = ai_context.run_chat_completion(msgs=msgs)

    # If the model tacked a period onto the title, strip it so the dictionary lookup still works
    if best_post_title not in title_link_dict and best_post_title.endswith('.'):
        best_post_title = best_post_title[:-1]

    best_post_link = title_link_dict.get(best_post_title, '')

    ai_context.add_to_log(f"The most relevant post to your query is titled: {best_post_title}. With Link: {best_post_link}", self)

    ai_context.set_output('best_post_link', best_post_link, self)

Read the article

Now that I’ve identified the article I want to read, I once again use Beautiful Soup to scrape all of the content from the page.

def ingest(self, params, ai_context):
    data_uri = params.get('data_uri', None)
    if not data_uri:
        data_uri = ai_context.get_input('input_url', self)
    ai_context.storage['ingested_url'] = data_uri
    if self.is_url(data_uri):
        text = self.scrape_text(data_uri)
        ai_context.set_output('uri_content', text, self)
        ai_context.add_to_log(f"Content from {data_uri} has been scraped.")

def scrape_text(self, url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text, "html.parser")

    # Drop script and style tags so only the visible text remains
    for script in bs(["script", "style"]):
        script.extract()

    text = bs.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))  # split on double spaces
    text = "\n".join(chunk for chunk in chunks if chunk)

    return text

Index the data

I wanted to make sure that articles of any length can be ingested. If I try to pass an entire 20-page article into my prompt I’ll overshoot the token limit.

I chunked the content into more manageable pieces and used OpenAI’s embedding API. I stored this info in a dictionary with embeddings as keys and chunked text as values. I can then retrieve the most relevant chunk of context for a given prompt if I embed my prompt and calculate the distance between each chunk.

(This embedded data should ideally be stored in a vector database for scalability and speed; a rough sketch of that swap follows the code below. For the sake of the demo I stored it in memory.)

def get_embedding(self, text_or_tokens):
    EMBEDDING_MODEL = 'text-embedding-ada-002'
    return openai.Embedding.create(input=text_or_tokens, model=EMBEDDING_MODEL)["data"][0]["embedding"]

def clean_text(self, text):
    return text.replace("\n", " ")

def batched(self, iterable, n):
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch

def chunked_tokens(self, text, encoding_name, chunk_length):
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    chunks_iterator = self.batched(tokens, chunk_length)
    for chunk in chunks_iterator:
        decoded_chunk = encoding.decode(chunk)  # Decode the chunk
        yield decoded_chunk

def len_safe_get_embedding(self, text, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING):
    chunk_embeddings = {}
    for chunk in self.chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):
        embedding = self.get_embedding(chunk)
        embedding_key = tuple(embedding)  # Convert the embedding to a hashable tuple
        chunk_embeddings[embedding_key] = chunk

    return chunk_embeddings
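
As a sketch of the vector-database swap mentioned above, the same dictionary could be loaded into something like FAISS. The helpers below assume the chunk_embeddings dict returned by len_safe_get_embedding and a standalone get_embedding function like the one above:

import faiss
import numpy as np

def build_index(chunk_embeddings):
    # Stack the stored embedding tuples into a float32 matrix for FAISS
    vectors = np.array([list(key) for key in chunk_embeddings.keys()], dtype='float32')
    chunks = list(chunk_embeddings.values())
    faiss.normalize_L2(vectors)                  # normalize so inner product == cosine similarity
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index, chunks

def most_relevant_chunks(index, chunks, query, k=3):
    # Embed the query the same way and fetch the k nearest chunks
    query_vector = np.array([get_embedding(query)], dtype='float32')
    faiss.normalize_L2(query_vector)
    _, neighbors = index.search(query_vector, k)
    return [chunks[i] for i in neighbors[0]]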

Hybrid search

This is where the tweet is made. I pass in a prompt asking the agent to form a concise and interesting tweet about the article's key concepts. This operator stuffs that LLM query with the most relevant chunks using the embedding approach above.

This approach works great for very pointed questions. If you want to ask your agent about a specific quote or detail in the article, context stuffing will fetch the most relevant chunks and pipe them into your query.

In this case, a summary of the chunks would actually be more useful, since I’m not asking a very specific question. We added a summarization operator after I deployed the demo, which I’ll use in the future; a sketch of that idea follows the hybrid search code below.


def get_embedding(self, text_or_tokens):
    EMBEDDING_MODEL = 'text-embedding-ada-002'
    return openai.Embedding.create(input=text_or_tokens, model=EMBEDDING_MODEL)["data"][0]["embedding"]

def cosine_distance(self, emb1, emb2):
    # Note: despite the name, this returns cosine similarity (higher = more similar)
    return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

def calculate_sorted_similarities(self, query_embedding, embeddings_dict):
    sorted_similarities = []

    for embedding_key, text in embeddings_dict.items():
        embedding = np.array(embedding_key)  # Convert tuple to numpy array
        similarity = self.cosine_distance(query_embedding, embedding)
        heapq.heappush(sorted_similarities, (-similarity, embedding_key, text))

    return sorted_similarities

def num_tokens_from_string(self, string: str, model_name: str = 'gpt-3.5-turbo') -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

def get_max_embeddings_fit(self, sorted_similarities, initial_tokens, ai_context, model='gpt-3.5-turbo'):
    # Since OpenAI applies a limit to the combined tokens in the input prompt and the response,
    # we want to make sure we leave enough space for the output by not overinflating the input prompt.
    MIN_RESPONSE_SIZE = 250

    selected_embeddings = []
    total_tokens = initial_tokens + MIN_RESPONSE_SIZE
    token_limit = get_max_tokens_for_model(model)

    for similarity, embedding_key, text in sorted_similarities:
        text_tokens = self.num_tokens_from_string(text, model)

        if total_tokens + text_tokens < token_limit:
            total_tokens += text_tokens
            selected_embeddings.append(text)
        else:
            break

    ai_context.add_to_log("{} embeddings were fit into the prompt".format(len(selected_embeddings)))
    return selected_embeddings

# Operator body (query and query_embedding are defined earlier in the operator, from the prompt parameter).
# Generate a heap of the most similar embeddings; we're unsure how many we can fit into the prompt at this point.
sorted_embeddings = self.calculate_sorted_similarities(query_embedding, ai_context.get_input('vector_index', self))
# Format the message and determine how many tokens our context-less prompt is
message = f"Given the following context, and any prior knowledge you have, {query}?"
message_tokens = self.num_tokens_from_string(message)
# Fit as many embeddings as we can into the prompt
closest_embeddings = self.get_max_embeddings_fit(
    sorted_similarities=sorted_embeddings,
    initial_tokens=message_tokens,
    ai_context=ai_context
)

# Add context to the message
closest_embeddings_str = str(closest_embeddings)
message += f" context: {closest_embeddings_str}"

msgs = [{"role": "user", "content": message}]
ai_response = ai_context.run_chat_completion(msgs=msgs)
ai_context.set_output('hybrid_search_chatgpt_response', ai_response, self)
ai_context.add_to_log(f'Response from ChatGPT + Hybrid Search: {ai_response}')
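
As for the summarization operator mentioned above, a rough sketch of the idea (not AgentHub's actual implementation) is to summarize each chunk and then merge the partial summaries, which keeps every call under the token limit regardless of article length:

def summarize_chunks(chunks, model='gpt-3.5-turbo'):
    # Hypothetical map-reduce style summarizer: summarize each chunk, then combine the notes
    partial_summaries = []
    for chunk in chunks:
        msgs = [{"role": "user", "content": f"Summarize the following passage in 2-3 sentences:\n{chunk}"}]
        response = openai.ChatCompletion.create(model=model, messages=msgs)
        partial_summaries.append(response["choices"][0]["message"]["content"])

    notes = "\n".join(partial_summaries)
    msgs = [{"role": "user", "content": f"Combine these notes into one short summary of the article:\n{notes}"}]
    response = openai.ChatCompletion.create(model=model, messages=msgs)
    return response["choices"][0]["message"]["content"]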

Tweet

This operator interfaces with the Twitter API using the OAuth access you provided when linking your account.

def send_tweet(self, tweet_text, url, ai_context):
    # Authenticate with the user's OAuth credentials so we can post on their behalf
    client = tweepy.Client(
        consumer_key=self.consumer_key, consumer_secret=self.consumer_secret,
        access_token=self.access_token, access_token_secret=self.access_token_secret
    )
    formatted_tweet_text = f"{tweet_text}\n{url}"

    # If the tweet is too long, remove hashtags
    if self.is_over_twitter_limit(formatted_tweet_text):
        formatted_tweet_text = self.remove_hashtags(formatted_tweet_text)

    ai_context.add_to_log(f'Tweeting: {formatted_tweet_text}')
    try:
        response = client.create_tweet(
            text=formatted_tweet_text
        )
        ai_context.add_to_log(f"Tweet is live at: https://twitter.com/user/status/{response.data['id']}", color='green')

    except tweepy.TweepyException as e:
        ai_context.add_to_log(f"Error sending tweet: {str(e)}")

Scheduling a Run

I use a CRON job to have this pipeline run at the top of every hour.
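
For reference, the crontab entry for an hourly run is a one-liner; the script path here is just a placeholder:

0 * * * * /usr/bin/python3 /path/to/run_twitter_bot.py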

Learnings

  • Be careful what you wish for. Your instructions are taken as gospel, so think twice before asking it to be “funny” or “lighthearted”. It will make everything a joke… If it happens to consume an article about a tragedy, it will definitely share it with crying-laughing emojis and a bad pun.
  • GPT-4 is worth every extra penny for creating content. It can use emojis, crack jokes, and pretend it has experience with the article’s topic, while GPT-3.5 writes rather boring summaries. Prompting it requires a bit more subtlety, however: asking it to be serious will make it boring, and asking it to be funny will make it annoying. I’d recommend testing out individual tweets using the simpler “🐦 Single Article Twitter Bot” on AgentHub to get a feel for your preferences.
  • Anti-bot software throws a bit of a wrench into its process. Some websites, like the NY Times and Bloomberg, have bot detection to prevent scraping. The bot can occasionally run into these blockers and formulate tweets about the fact that it’s been detected. These are probably my favorite tweets to wake up to after a night of it autonomously tweeting.

Deploy Your Own Bot with No Code

To run this bot on your own Twitter account, follow the steps below. The bot is published on AgentHub under “Hands Free Tweeter”.

  • Visit this link to use our OpenAI API tokens with GPT-4 access.
  • Click on the published agent.
  • Click “Run Agent” to have it create a single post for you.
  • Link your Twitter account if you haven’t yet.
  • Click “Schedule Agent” near the bottom of the page.
  • Enter your OpenAI API token (you can skip this step; I put in my own token with GPT-4 access so you can try it out for free).

Contributing

I’m sure the average person reading this article can come up with a more useful agent than this Twitter bot. I’d love to see what others are able to do. If you want to get involved you can:

  1. Join our Discord: This is probably the best way to get in touch with us and hear about daily updates.
  2. Create your own agents: Give AgentHub a try. Once you get one working you can hit the publish button to get your creation posted live on the platform.

DM me on Twitter if you have any questions. Thanks for reading.
