Extending WhatsApp capabilities with ChatGPT and Serverless

Published in CloudEx Cloud Solutions, Mar 29, 2023 (10 min read)

WhatsApp is an amazing messaging app, enabling you to communicate with your peers or family in a simple and easy way. WhatsApp has three main interfaces you can interact with — a mobile app, a web application and a desktop app.

While WhatsApp is easy to use, it lacks some functionality I wish it had, such as the ability to share a group’s chat history with people who joined later, or to cut down the noise in a busy group. This post demonstrates how you can add such functionality by using a bot that interacts directly with WhatsApp Web. I’ll go over the main components, the major architectural decisions I made, and the reasons behind them.

You can follow the code and try it yourself in the aws-hebrew-book/whatsapp-logger GitHub repository (Archiving WhatsApp messages). The repository has full infrastructure-as-code (IaC) support, so you can deploy it to your own AWS account as well.

Foundations

There are two main ways to interact programmatically with WhatsApp — by developing a bot for WhatsApp for Business, or by using a headless browser powered by code that interacts with the groups you are part of.

A WhatsApp for Business bot must be proactively added to each group and costs money. The headless browser solution, by contrast, interacts directly with the WhatsApp Web interface, does not need to be added to any specific group, and is free.

For this project, the headless browser solution was used, and specifically the whatsapp-web.js library, which uses Puppeteer, a JavaScript browser automation library, to interact with the browser and has an extensive API for sending messages, creating groups, and more.

const { Client } = require('whatsapp-web.js');

const client = new Client();

client.on('qr', (qr) => {
    // Generate and scan this code with your phone
    console.log('QR RECEIVED', qr);
});

client.on('ready', () => {
    console.log('Client is ready!');
});

client.on('message', msg => {
    if (msg.body === '!ping') {
        msg.reply('pong');
    }
});

client.initialize();

The term “headless browser” refers to a browser that is used programmatically without a graphical user interface. Essentially, the browser is being run in the background without any visible windows or tabs. This allows the code to interact with the browser without human input.

whatsapp-web.js runs on top of the real WhatsApp Web application. It listens for UI changes, such as a new message being written, and transforms them into code events that I can act on.

In terms of architecture, this project uses Amazon ECS + AWS Fargate as the compute solution, rather than AWS Lambda. While Lambda is often the preferred choice for serverless solutions, it was not suitable here: whatsapp-web.js listens to events arriving from the browser and requires Puppeteer to run at all times. The service therefore has to be running continuously, which is not possible with Lambda’s 15-minute execution limit. ECS + Fargate, in contrast, provides a compute solution that is always “on”.

Overall, this approach enables more functionality in WhatsApp and allows for greater customization and control over group chats. With the ability to interact programmatically, there are endless possibilities for extending the capabilities of WhatsApp beyond its existing features.

Creating Messages

To feed the system with relevant events that arrive from whatsapp-web.js, a message delivery mechanism was needed. Moreover, since there were multiple ways in which the incoming messages needed to be processed, a way to distribute them to different destinations based on their intended function was needed. Therefore, I decided to use SNS for this purpose.

Each event created by whatsapp-web.js is published to SNS. However, rather than subscribing Lambda directly to SNS, it’s considered best practice to place an SQS queue between the two. SQS gives better control over batching and the number of retries, which helps improve the reliability and performance of the system.
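To make the SNS-to-SQS hop concrete, here is a minimal sketch of what a consumer sees. With raw message delivery left disabled (the SNS default), the SQS record body is a JSON envelope, and the publisher’s payload sits in its "Message" field as a string. The event fields below (group, author, text) are hypothetical and only illustrate the shape; the sketch also assumes the publisher sends JSON.

```python
import json


def unwrap_sns_message(sqs_body: str) -> dict:
    """Return the publisher's payload from an SQS record body written by SNS."""
    # The outer JSON is the SNS envelope; "Message" holds the payload as a string.
    envelope = json.loads(sqs_body)
    # Assumption: the publisher serialized its payload as JSON.
    return json.loads(envelope["Message"])


# Hypothetical event, shaped for illustration only.
original = {"group": "family", "author": "dana", "text": "hello"}
sqs_body = json.dumps({"Type": "Notification", "Message": json.dumps(original)})

print(unwrap_sns_message(sqs_body))
```

The double decoding is the part that is easy to miss: one `json.loads` for the envelope and a second one for the payload inside it.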

What does the app do with the different messages?

Saving content to Google Sheets

To save the message content, a Lambda function is used to write the data to Google Sheets. Specifically, the function uses Google’s API exposed by gspread to create a new sheet in a Google Spreadsheet for each group, and appends the message content to the end of the sheet.
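The "one sheet per group, append to the end" flow can be sketched roughly as follows. This is not the repository’s code: the function name, the row layout, and the initial sheet size are assumptions. It relies only on the gspread calls `Spreadsheet.worksheets()`, `Spreadsheet.add_worksheet()`, and `Worksheet.append_row()`.

```python
from typing import Sequence


def append_message(spreadsheet, group_name: str, row: Sequence[str]) -> None:
    """Append one message row to the worksheet named after the group.

    `spreadsheet` is expected to behave like gspread's Spreadsheet object.
    """
    # Look up the existing tabs by title.
    existing = {ws.title: ws for ws in spreadsheet.worksheets()}
    worksheet = existing.get(group_name)
    if worksheet is None:
        # First message from this group: create its sheet.
        # rows/cols are arbitrary starting sizes (an assumption).
        worksheet = spreadsheet.add_worksheet(title=group_name, rows=1000, cols=10)
    # append_row writes the values after the last populated row.
    worksheet.append_row(list(row))
```

Listing the worksheets on every message costs an extra API call, so a real handler would probably remember which sheets it has already seen.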

All Lambda functions in the application that are connected to SQS use the record handler mechanism of the AWS Lambda Powertools for Python batch utility. This mechanism encapsulates record handling in a simple Python function, without the hassle of extracting each record’s content from the event.

def record_handler(record: SQSRecord, sheet: gspread.spreadsheet.Spreadsheet):
    payload: str = record.body
    logger.debug(f"Retrieved {payload}")

    if payload:
        raw_message = json.loads(payload)["Message"]

def lambda_handler(event: dict, context: LambdaContext):
    …
    with processor(
        records=batch,
        handler=lambda record: record_handler(record, google_sheets),
    ):
        processor.process()
…
@functools.cache
def _get_get_google_sheet_object() -> gspread.spreadsheet.Spreadsheet:
    …

External resources, such as boto3 resources, are initialized once and cached in memory, so a warm Lambda invocation does not pay the initialization latency again. This Lambda function initializes the Spreadsheet API only once; the @functools.cache decorator is applied to the functions that create such resources.

The @functools.cache decorator is a built-in feature of Python that caches the return values of a function for subsequent calls with the same arguments, thereby improving the function’s performance and reducing the need to recompute the same values repeatedly.
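A small, self-contained illustration of that behaviour (Python 3.9 or later). The `get_client` function and the dictionary it returns are stand-ins for an expensive initialization, not the repository’s code.

```python
import functools

init_calls = 0


@functools.cache
def get_client() -> dict:
    """Stand-in for an expensive setup step, e.g. authorizing an API client."""
    global init_calls
    init_calls += 1
    return {"connected": True}


first = get_client()
second = get_client()  # served from the cache; the function body does not run again

print(init_calls)       # 1
print(first is second)  # True
```

In a Lambda function the cache lives as long as the execution environment stays warm, so only a cold start pays for the initialization.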

Overall, saving content to a Google Sheet allows message content to be stored in an organized and structured way, and enables easy access and analysis of the data. By using Google Sheets and gspread, the data can be easily shared and collaborated on, which makes it possible to share a group’s chats with people who joined the group at a later stage.

I wanted to include the ability to generate a daily summary of group chats. This would provide a quick overview of what happened in the group the previous day, helping to reduce the amount of time and effort needed to catch up. For this task, there is no tool more suitable than ChatGPT.

Data lake and ChatGPT

To enable more advanced analysis and processing of the data, all chat messages are saved to an S3 bucket. This bucket serves as the source for any questions that may arise about the data, such as summarization, data statistics (e.g. who talks the most, which groups are used the most), and more. Because this data is not needed in real time and has no fixed schema or predefined query pattern, an S3 bucket with ad-hoc querying is the most suitable solution.

To generate daily summaries of group chats, the ChatGPT natural language processing tool is used. First, all chats from a specific day are collected and grouped by WhatsApp group. Next, each chat in a group is appended to a long string, which is sent to ChatGPT to generate a summary. Finally, the summary is sent back to WhatsApp-web.js using an event bus.
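The first two steps, collecting a day’s chats and building one long string per group, might look roughly like this. The message fields (`group`, `author`, `text`, `timestamp`) and the "author: text" line format are assumptions made for the sketch, not the repository’s schema.

```python
from collections import defaultdict
from datetime import date, datetime


def build_daily_transcripts(messages: list[dict], day: date) -> dict[str, str]:
    """Concatenate one day's messages into a single string per WhatsApp group."""
    lines_by_group: dict[str, list[str]] = defaultdict(list)
    # Process messages in chronological order so each transcript reads naturally.
    for msg in sorted(messages, key=lambda m: m["timestamp"]):
        if msg["timestamp"].date() == day:
            lines_by_group[msg["group"]].append(f'{msg["author"]}: {msg["text"]}')
    return {group: "\n".join(lines) for group, lines in lines_by_group.items()}


# Hypothetical sample data.
sample = [
    {"group": "family", "author": "dana", "text": "dinner at 8?",
     "timestamp": datetime(2023, 3, 28, 18, 5)},
    {"group": "family", "author": "avi", "text": "works for me",
     "timestamp": datetime(2023, 3, 28, 18, 7)},
]
transcripts = build_daily_transcripts(sample, date(2023, 3, 28))
print(transcripts["family"])
```

Each value in the returned dictionary is the text that would be handed to the summarization step for that group.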

response = chatgpt.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            # Only the first 3,940 characters of the chats are sent.
            "content": f"The text is a conversation in a forum where people are discussing various topics. Summarize the following text {summary['chats'][:3940]} Divide the summary into the different topics",
        },
    ],
)

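I am using the official OpenAI Python package. The gpt-3.5-turbo model has a limit of 4,096 tokens, shared between the prompt and the generated reply, so long text has to be trimmed before it is sent. The call above does this by keeping only the first 3,940 characters of the chats, which is a character-based approximation rather than an exact token count.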
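A plain slice can cut the last message in half. One possible refinement, shown here as a sketch rather than as what the repository does, is to keep only whole lines until a character budget is used up. The function name is hypothetical, and the budget is still counted in characters, so it remains only a proxy for the token limit.

```python
def trim_to_budget(lines: list[str], max_chars: int) -> str:
    """Keep whole lines from the start until the character budget is exhausted."""
    kept: list[str] = []
    used = 0
    for line in lines:
        # Every line after the first also costs one newline when joined.
        cost = len(line) + (1 if kept else 0)
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


chat_lines = ["dana: dinner at 8?", "avi: works for me", "noa: I'll bring dessert"]
print(trim_to_budget(chat_lines, max_chars=40))
```

With a budget of 40 characters, the first two lines fit (18 + 1 + 17 = 36) and the third is dropped entirely instead of being truncated.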

The application uses AWS Step Functions, with a daily workflow triggered by Amazon EventBridge.

To overcome the 256 KB payload limit, I’m using the Map state to iterate over all the content that needs to be sent to ChatGPT. Since a payload may itself be larger than the limit, AWS recommends passing large payloads through S3, not only in Step Functions but in many other AWS services. Hence, the payload in the state machine is an S3 object path, and the subsequent step loads the object for further processing.
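The "pass a pointer, not the payload" idea can be sketched as below. The storage calls are injected as plain callables so the example runs without AWS; the function names and the pointer fields (`s3_key`, `size_bytes`) are hypothetical.

```python
import json
from typing import Callable


def to_state_payload(data: dict, key: str,
                     put_object: Callable[[str, bytes], None]) -> dict:
    """Store the full payload under `key` and return a small pointer to it."""
    body = json.dumps(data).encode("utf-8")
    put_object(key, body)
    # Only this small dictionary travels between workflow states.
    return {"s3_key": key, "size_bytes": len(body)}


def from_state_payload(pointer: dict, get_object: Callable[[str], bytes]) -> dict:
    """Load the full payload back in the next step."""
    return json.loads(get_object(pointer["s3_key"]))


# In-memory stand-in for a bucket, for illustration only.
bucket: dict[str, bytes] = {}
pointer = to_state_payload({"chats": "x" * 500_000}, "summaries/2023-03-28/family.json",
                           bucket.__setitem__)
restored = from_state_payload(pointer, bucket.__getitem__)
print(pointer["size_bytes"] > 256 * 1024, len(restored["chats"]))
```

In a deployed function the two callables would wrap boto3’s `put_object` and `get_object` on an S3 client; the state machine itself only ever sees the pointer.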

This approach enables more advanced analysis of the data and provides greater insights into the patterns and trends of the group chats. By using ChatGPT, natural language processing is used to generate a summary of each day’s group chats, making it easier to quickly catch up on what was discussed. Overall, this approach enhances the value of the group chat data and enables more advanced analysis and processing.

Up to this point, I have described how whatsapp-web.js feeds events to the system. However, it’s also important to understand how the system sends events back to whatsapp-web.js. To accomplish this, an event bus is used.

Sending events from and to whatsapp-web.js

To notify on different events in the application, the recommended approach is to use EventBridge, a serverless event bus that makes it easy to connect different applications and services. There are different events that the current system produces, such as client connectivity updates and summary events.

An event bus is used to send events back and forth between the system and whatsapp-web.js. On the Lambda side, the functions are connected to the event bus as EventBridge targets, so events are pushed to them, and they can publish events back to the bus. On the whatsapp-web.js side, an SQS queue connected to the event bus is used.

The reason for using SQS on the whatsapp-web.js side is that it provides a long polling feature, which is not available when reading directly from an event bus using code. Long polling is a technique that allows the client to hold a connection open until new data is available, rather than repeatedly polling the server for new data. This approach can reduce the number of requests and improve response times, especially when dealing with low-volume or infrequent updates.
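A minimal sketch of one long-polling cycle follows. The real consumer runs next to whatsapp-web.js in Node.js; Python is used here only to keep the examples in one language. The function name and the `handle` callback are hypothetical, while `receive_message` and `delete_message` with the parameters shown are the standard SQS client calls. When EventBridge delivers to SQS without an input transformer, the message body is the event JSON, including `detail-type` and `detail`.

```python
import json
from typing import Callable


def poll_events(sqs_client, queue_url: str,
                handle: Callable[[str, dict], None]) -> int:
    """Long-poll the queue once and hand each EventBridge event to `handle`."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling: wait up to 20 seconds for messages
    )
    # The "Messages" key is absent when nothing arrived within the wait time.
    messages = response.get("Messages", [])
    for message in messages:
        event = json.loads(message["Body"])
        handle(event["detail-type"], event["detail"])
        # Delete only after successful handling, so failures are retried.
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
    return len(messages)
```

Called in a loop with a boto3 SQS client, each iteration either returns as soon as messages arrive or comes back empty after the wait time, instead of hammering the queue with short requests.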

In contrast, EventBridge pushes events to the Lambda function, making direct connections easier.

After reviewing the different parts of the system, you may be wondering how everything is deployed.

CI, Deployment & Cost

This application is deployed to the cloud using the AWS Cloud Development Kit (CDK), a software development framework for defining cloud infrastructure in code. The CDK project is organized in a directory structure in which each component keeps its own infrastructure-as-code (IaC) definition. This approach allows for easier management of resources and simplifies the deployment process. The directory structure is recommended by AWS, and you can read more about it in the CDK best practices.

To follow best practices, the application’s secrets and configuration are not stored as Lambda environment variables. Instead, AWS Systems Manager (SSM) Parameter Store and AWS Secrets Manager are used for storage, and AWS Lambda Powertools for Python is used to retrieve the values.
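As a rough sketch of that lookup, the function below takes the two retrieval functions as arguments so it can run without AWS. The parameter and secret names are hypothetical and are not taken from the repository.

```python
from typing import Callable


def load_settings(get_parameter: Callable[[str], str],
                  get_secret: Callable[[str], str]) -> dict:
    """Read configuration from Parameter Store and secrets from Secrets Manager."""
    return {
        # Hypothetical names, for illustration only.
        "sheet_url": get_parameter("/whatsapp-logger/google-sheet-url"),
        "openai_api_key": get_secret("whatsapp-logger/openai-api-key"),
    }


# Stand-ins so the sketch runs locally.
fake_parameters = {"/whatsapp-logger/google-sheet-url": "https://example.invalid/sheet"}
fake_secrets = {"whatsapp-logger/openai-api-key": "not-a-real-key"}
settings = load_settings(fake_parameters.__getitem__, fake_secrets.__getitem__)
print(sorted(settings))
```

Inside a Lambda function, the two arguments would be `parameters.get_parameter` and `parameters.get_secret` from `aws_lambda_powertools.utilities`, which fetch the values at runtime instead of baking them into the function’s environment.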

As a continuous integration (CI) tool, GitHub Actions is used to automate the build, test, and deployment process. To avoid storing AWS credentials, the GitHub OIDC provider is used in conjunction with a configured AWS IAM Identity Provider endpoint. This approach ensures that access to AWS resources is granted securely and without the need to store access keys or secrets.

The serverless components in use, such as Lambda, DynamoDB, and SQS, are inexpensive and are fully covered under the free tier. The only elements that cost money are a NAT instance and a single Fargate task. A t3.micro NAT instance costs around $7.40 per month, while a Fargate task with 0.25 vCPU and 1 GB of memory costs approximately $10.40 per month. Running the Fargate task on an ARM architecture may yield some savings. Overall, the total monthly cost is approximately $17.80 ($7.40 + $10.40).

In AWS, a NAT instance is a single EC2 instance that allows instances in private subnets to connect to the internet or other AWS services. A NAT gateway is a fully managed AWS service that performs the same function with built-in redundancy and higher performance, which is also why it is generally more expensive than a NAT instance.

Overall, the use of CDK and GitHub Actions enables fast and efficient deployment of the application, while ensuring security and compliance with best practices. By automating the deployment process and using secure access management, the risk of errors or security breaches is reduced, enabling a more reliable and secure deployment process.

Conclusion

By using a headless browser solution and a combination of AWS services, it’s possible to add more functionality to WhatsApp and enable greater customization and control over group chats. From creating messages to saving content to Google Sheets, and generating daily summaries using natural language processing, this blog post has demonstrated how different components can be integrated to create a powerful and flexible solution.

Using event-driven architecture and serverless technologies, the system can handle high volumes of messages and events, while ensuring reliability and scalability. With AWS CDK and GitHub Actions, deployment is automated and secure, reducing the risk of errors or security breaches.

Open source code — https://github.com/aws-hebrew-book/whatsapp-logger