Automated Sentiment Analysis of Opinions on ChatGPT

Analyzing and visualizing social media opinions on ChatGPT in Google Cloud using Dataflow, BigQuery, and Grafana on Kubernetes

Christian Coello
Slalom Technology
8 min read · Apr 11, 2023



By Christian Coello, Lenny Juarbe, and Diego Bautista

Today, data analytics enabled by the cloud is increasingly popular in many industries. Companies are being introduced to cloud technologies and face challenges identifying business use cases that justify building modern cloud-enabled solutions. In this article, we’ll walk you through how we built a complete data analysis solution to understand the public’s Twitter perspective on the very popular generative AI technology — ChatGPT. To build our scalable, automated solution we leveraged Google Cloud services including Cloud Storage, Dataflow, Google BigQuery, Google Kubernetes Engine (GKE), and Cloud Build, as well as Grafana and Bitbucket.

Solution overview

The first step is to store our ChatGPT Twitter data in Cloud Storage. Dataflow then picks up the data, processes it, and stores it in BigQuery. A Kubernetes cluster hosts Grafana across multiple nodes and pods, with a load balancer distributing traffic. All the Terraform code is stored in Bitbucket and is built and deployed through a CI/CD pipeline using Cloud Build.

A Kubernetes cluster hosts Grafana on multiple nodes within the cluster. Cloud development is hosted on Bitbucket and deployed via Cloud Build.

Data pipeline details

Dataflow and BigQuery

Dataflow is a fully managed, serverless batch and streaming data ingestion and processing service, based on Apache Beam, that seamlessly integrates with BigQuery. Dataflow allows users to develop data pipelines in Python and apply transformations or other analytical work. We can leverage the underlying infrastructure to test and run Python code before moving it to a main pipeline.

For our pipeline, we used Dataflow to ingest a CSV file containing Twitter data stored in Cloud Storage. The Dataflow pipeline transforms the data, making it compatible with our predesigned BigQuery table schema. Wrangling Twitter text data required several transformations, including:

  • Removing symbols and emojis, and
  • Dropping columns containing data we didn’t need.

New to Dataflow and Apache Beam? Fear not! Google Cloud offers several predefined Dataflow templates for ingesting data from a source, processing it, and landing it at its destination. Templates include “Cloud Storage text files to BigQuery,” “Pub/Sub Topic to BigQuery,” and many more. You can also create your own custom template.

Dataflow templates available through the Google Cloud console.

We used the “Text Files on Cloud Storage to BigQuery” template, supplied a JavaScript function and a JSON Schema file to select data from Cloud Storage and insert the transformed data into BigQuery as shown below.

function transform(line) {
  // Split the CSV line into fields.
  var values = line.split(',');
  // Map each field to a named property matching the BigQuery schema.
  var obj = new Object();
  obj.id = values[0];
  obj.name = values[1];
  obj.job = values[2];
  // Emit one JSON string per input line for BigQuery ingestion.
  var jsonString = JSON.stringify(obj);
  return jsonString;
}

We also provided a JSON file that defined our BigQuery table schema. Notice the 1-to-1 match between the parsing JavaScript function (above) and the JSON Schema (below).

{
  "BigQuery Schema": [
    {
      "mode": "NULLABLE",
      "name": "id",
      "type": "STRING"
    },
    {
      "mode": "NULLABLE",
      "name": "name",
      "type": "STRING"
    },
    {
      "mode": "NULLABLE",
      "name": "job",
      "type": "STRING"
    }
  ]
}
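With the transform function and schema files staged in Cloud Storage, a job based on this template can be launched from the command line as well as the console. The sketch below uses the Google-provided GCS_Text_to_BigQuery template; the job name and bucket paths are placeholders, not our actual values.

```shell
gcloud dataflow jobs run chatgpt-tweets-load \
  --gcs-location=gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --region=us-central1 \
  --parameters=\
inputFilePattern=gs://my-bucket/tweets.csv,\
JSONPath=gs://my-bucket/schema.json,\
javascriptTextTransformGcsPath=gs://my-bucket/transform.js,\
javascriptTextTransformFunctionName=transform,\
outputTable=project_id:dataset.table,\
bigQueryLoadingTemporaryDirectory=gs://my-bucket/tmp
```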

Google Cloud Natural Language API

Google’s Natural Language API offers a pretrained machine learning model as a service that can be integrated with any application, providing natural language understanding (NLU) and text analysis capabilities. By leveraging this API, we performed sentiment analysis on all tweets now stored in our BigQuery table.

We used the Dataflow Workbench feature to run Python in a Jupyter notebook, calling the Natural Language API to analyze the sentiment of each tweet in the data set before writing the resulting output to the user_sentiment column in our BigQuery table.

The code samples below illustrate how we called the Natural Language API, queried the first 1,000 rows in our BigQuery table, and selected the tweet text column where the user_sentiment column is blank or null. The pipeline loops through every tweet, analyzing each row of text through the Natural Language API.

from google.cloud import bigquery
from google.cloud import language_v1

# Clients for reading from BigQuery and calling the Natural Language API
client = bigquery.Client()
nlu = language_v1.LanguageServiceClient()
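The rest of the loop can be sketched as follows. This is an illustrative sketch, not our exact notebook code: the table path and the id and tweet_text column names are placeholders, and for simplicity it writes the raw score back to the user_sentiment column.

```python
def build_pending_query(table, limit=1000):
    # Select tweets that have not been scored yet.
    return (
        f"SELECT id, tweet_text FROM `{table}` "
        "WHERE user_sentiment IS NULL OR user_sentiment = '' "
        f"LIMIT {limit}"
    )


def score_pending_tweets(table):
    # Imported here so the sketch can be read without the GCP SDK installed.
    from google.cloud import bigquery
    from google.cloud import language_v1

    client = bigquery.Client()
    nlu = language_v1.LanguageServiceClient()

    for row in client.query(build_pending_query(table)).result():
        document = language_v1.Document(
            content=row.tweet_text,
            type_=language_v1.Document.Type.PLAIN_TEXT,
        )
        sentiment = nlu.analyze_sentiment(
            request={"document": document}
        ).document_sentiment
        # Write the result back to the user_sentiment column.
        client.query(
            f"UPDATE `{table}` SET user_sentiment = @label WHERE id = @id",
            job_config=bigquery.QueryJobConfig(
                query_parameters=[
                    bigquery.ScalarQueryParameter(
                        "label", "STRING", f"{sentiment.score:.2f}"
                    ),
                    bigquery.ScalarQueryParameter("id", "STRING", row.id),
                ]
            ),
        ).result()
```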

The Natural Language API provides two outputs. The first is a sentiment score, ranging from -1 to 1, where -1 is very negative, 0 is neutral, and 1 is very positive. The second is a magnitude score, which indicates the overall strength of emotion in the text and ranges from zero to infinity. The longer the text, the higher the magnitude score can be.

Example Natural Language API output

In the output above, you can see examples of text scored across these ranges, receiving positive, negative, or neutral sentiment results.

After scoring all the tweets in our data set, we updated the user_sentiment column with each tweet’s corresponding sentiment output. Depending on the combination of the sentiment and magnitude scores, these outputs range from very negative to very positive, with negative, neutral, and positive in between.
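As an illustration of how score and magnitude might combine into these labels (the thresholds below are hypothetical, not our exact cutoffs), a mapping function could look like this:

```python
def label_sentiment(score, magnitude):
    """Map a Natural Language API result to a coarse sentiment label.

    Thresholds are hypothetical, for illustration only.
    """
    if magnitude < 0.5:
        # Very little emotion detected; treat as neutral regardless of score.
        return "Neutral"
    if score >= 0.6:
        return "Super Positive"
    if score >= 0.2:
        return "Positive"
    if score > -0.2:
        return "Neutral"
    if score > -0.6:
        return "Negative"
    return "Super Negative"
```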

Visualizing the results

Kubernetes and Grafana

We used Google Kubernetes Engine (GKE) to deploy Grafana and create custom dashboards that visualize data from BigQuery. Grafana is a powerful tool for building attractive, customizable dashboards from a variety of data sources.

For our solution, we used a standard-mode three-node cluster consisting of general-purpose E2 series machines with a single pod in each node, and a load balancer to distribute the traffic.

GKE Cluster Design
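A standard-mode cluster of this shape can be created with a single gcloud command; the cluster name, zone, and E2 machine type below are illustrative, not our exact configuration.

```shell
gcloud container clusters create grafana-cluster \
  --zone=us-central1-a \
  --num-nodes=3 \
  --machine-type=e2-medium
```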

We leveraged Helm, a package manager for Kubernetes, to simplify installing and managing Grafana via a YAML values file. This included user credential setup, plugins, data sources, and predefined dashboards. Below is an example of such a file.

adminUser: admin
# adminPassword: strongpassword
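A fuller values file for the community Grafana Helm chart might look like the sketch below; the plugin ID and data source fields are assumptions based on the chart’s documented structure, not our exact configuration.

```yaml
adminUser: admin
# Prefer an existing Kubernetes secret over a plaintext password.
# adminPassword: strongpassword
plugins:
  - grafana-bigquery-datasource
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: BigQuery
        type: grafana-bigquery-datasource
        jsonData:
          authenticationType: jwt
```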

The Grafana deployment can also be exposed by running the command below, where “grafana” is the name of the deployment and LoadBalancer is the service type.

kubectl expose deployment grafana --type=LoadBalancer --port=80 --target-port=3000

With Grafana, we created visualizations that looked great and effectively communicated key information. The wealth of options the platform offers for data display took our analysis to the next level. Creating our dashboard involved a few steps.

First, we had to install a BigQuery plugin to connect the dashboard to BigQuery.

Grafana’s BigQuery Connector

After installing the plugin, we configured it to securely connect to our Google Cloud project using a service account and key. When creating your service account, remember that it will require the BigQuery Job User role and permissions. After the service account is created, generate a key and download the JSON file. Upload the JSON file to the BigQuery plugin configuration in Grafana as a Google JWT File option as shown in the image below.

Provide Grafana with the service account and key to access data in BigQuery.

With Grafana configured, we created a few visualizations using data in BigQuery. To get started with a graph implementation, do the following:

  1. Click Add New Panel.
  2. On the right side under Visualizations select Table from the drop-down.
  3. Select the BigQuery DataSource previously created.
  4. In Query Builder select Code.
  5. Add your query.
SELECT DISTINCT
  user_sentiment AS Opinion,
  ROUND(SAFE_DIVIDE(
    COUNT(*),
    (SELECT COUNT(*) FROM `project_id.dataset.table` WHERE user_sentiment <> "")
  ) * 100, 2) AS Percentage
FROM `project_id.dataset.table`
WHERE user_sentiment <> ""
  AND user_sentiment = "Super Positive"
GROUP BY user_sentiment

6. Run Query to view results in the table.

7. To complete the pie chart with data points from all other sentiments (in the sample code above we queried only for “Super Positive”), click + Query and use the same query from step 5, adjusting the sentiment type in double quotes.
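As a variant, a single query using an analytic aggregate can return the percentage for every sentiment at once, avoiding one panel query per label (the table path is a placeholder, as above):

```sql
SELECT
  user_sentiment AS Opinion,
  ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS Percentage
FROM `project_id.dataset.table`
WHERE user_sentiment <> ""
GROUP BY user_sentiment
```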


Alternatively, with bar charts you can add a new panel and use a query as shown below.

SELECT COUNT(*) AS Negatives
FROM `project_id.dataset.table`
WHERE user_sentiment = 'Negative'

Grafana is dynamic and enables us to create visualizations that query in real time. Our data set shows that tweets regarding ChatGPT are very mixed and fairly evenly distributed across all sentiment scores and magnitudes.

Pie chart in Grafana illustrating ChatGPT tweet sentiment analysis
Grafana bar chart for a small sample focusing on US States, illustrating the number of negative tweets, by location
Bar chart querying a small sample focused on US states, illustrating neutral sentiments by location

Conclusion

We wanted to build a data-driven pipeline solution that would produce user-friendly, interesting insights about public opinion of ChatGPT, the generative AI that has captured the world’s attention. We leveraged the power and scale of the cloud with Google Cloud services including Dataflow (to clean and prepare data) and BigQuery. BigQuery allowed us to seamlessly query our data set of over 100K rows in milliseconds, enabling us to rapidly gain insights into the opinions of Twitter users worldwide and illustrate the power the cloud can bring to data analytics challenges.

Source of data used in BigQuery: ChatGPT — the tweets.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.


Christian Coello is a Principal, Cloud Enablement at Slalom Florida, specializing in Google Cloud and AWS.