GRAMMY Debates With Watson (Part 1)

From the lab to music’s greatest stage

IBM Developer
IBM Data Science in Practice
11 min read · Mar 12, 2021


By Aaron Baughman, Tony Johnson, Elad Venezian, Yoav Katz

The journey of adapting Project Debater from IBM Research to the GRAMMYs created a resilient, component-based system that summarizes fans' opinions on music topics into fluent narratives.

The Project Debater workflow combines artificial intelligence and natural language processing (NLP) techniques to summarize music fans' points of view on the most important music conversations. The first phase of the workflow is collecting arguments made by fans. These arguments come from two sources: a dedicated web page on GRAMMY.com and arguments mined from Twitter. In the second phase, we run the Speech by Crowd natural language processing pipeline to generate a summary of all the arguments collected, highlighting the key points made by the participants and generating summary narratives for each side of the debate. In this post, we'll walk through the first phase: collecting arguments.

The music debates

Over the two-week period leading up to the 63rd annual GRAMMY Awards show, we are presenting four debatable topics for fans to weigh in on with their own unique insights. The first asks who you think is, or was, the most groundbreaking artist of all time. As expected, we are receiving a wide variety of responses. Next, we wanted to know what music fans and experts thought about mandatory music education. The topic, "Music education should be mandatory in all K-12 schools," is generating strong arguments both for and against the notion. Most of the arguments support making music education mandatory in schools. However, many disagree because they think only the most well-funded schools can afford music education, thus widening the socioeconomic divide.

Another topic we wanted opinions on compares virtual concerts to live shows. This has been the most evenly split topic between support and disagreement. Perhaps people are becoming acclimated to the technology and enjoy watching shows from the comfort of their own homes. The crowd thought that demand for a hybrid of virtual and physical shows will continue to grow.

Finally, we wanted you to weigh in on who you think is the biggest style icon in music. This is your chance to tell us how music and fashion are related.

Figure 1. The great music debates

Now let’s get into the architecture of the system.

GRAMMY Debates with Watson system architecture

Figure 2 shows the overall architecture of the GRAMMY Debates with Watson system. The system runs on a hybrid cloud that consists of IBM Cloud and IBM Cloud Private. Across the cloud components, the system is packaged into container images that can run anywhere on Red Hat OpenShift, which manages the underlying Kubernetes clusters. Most of the OpenShift clusters run on IBM Cloud. There are two OpenShift clusters with nine workers each; each worker has four cores and 16 GB of memory to support our natural language processing workloads. A total of 11 apps are spread evenly across the worker nodes over three regions.

The system has several bare metal machines that support the workflow. The Redis lyric infringement detection app runs on two bare metal machines. Each bare metal machine has 36 GB of memory and 20 cores. The raw compute power can keep up with the pace of argument infringement detection.

Figure 2. The overall architecture of GRAMMY Debates with Watson, spanning IBM Public Cloud, IBM Private Cloud, and Research Private Cloud components

The consumer-facing applications are deployed on a multitenant private cloud. The Debater API interface and messenger Python microservices take all of the consumer traffic; each is scaled out to two pods across four regions. There are test, development, and production environments to ensure that the work passes functional and quality testing before deployment to production. In parallel, the IBM Research private cloud runs on dedicated IBM Kubernetes clusters fronted by several ingress NGINX nodes. The pro/con, argument quality, and key point analysis services are isolated on separate clusters to support the natural language processing scale. Additional worker nodes can be provisioned as required.

To help with the load, we used a combination of GPUs and CPUs. For all of the offline jobs, the services used GPU-based clusters. This enabled us to handle batch jobs quickly. The online services such as the pro/con and argument quality handle real-time user responses with CPU-based clusters. However, the computationally intense online key point analysis service requires GPUs.

Next, we use the Speech by Crowd Debater platform, which is deployed on a Kubernetes cluster. Each of the NLP services used by the Speech by Crowd system runs on dedicated infrastructure. The PostgreSQL and MongoDB databases that support the Speech by Crowd platform are hosted on IBM Cloud.

To manage the data across the entire architecture, we use IBM Cloud Object Storage and IBM Cloudant, both of which use JSON document formats. IBM Cloud Object Storage is the origin for the IBM Content Delivery Network. To handle message passing, we instantiated IBM Event Streams, a managed Apache Kafka service. Submitted arguments queue on topics and are then processed by enrollment threads. IBM Cloud Internet Services provides a set of edge servers so that we can apply edge functions to all incoming traffic. We also use a cloud-hosted Redis instance to support the traffic-throttling solution.
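
As an illustration of this message passing, the following sketch queues a submitted argument on a Kafka topic with the kafka-python client. The broker address, credentials, topic name, and message schema are all illustrative assumptions, not the production configuration.

import json

from kafka import KafkaProducer

# A minimal sketch of enqueueing a submitted argument on an Event Streams
# (Kafka) topic. All connection details below are placeholders.
producer = KafkaProducer(
    bootstrap_servers=["broker.example.com:9093"],  # hypothetical broker
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="token",
    sasl_plain_password="<API_KEY>",  # placeholder credential
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

argument = {
    "text": "Music education should be mandatory in all K-12 schools.",
    "topic": "music-education",
    "source": "grammy.com",
}
producer.send("submitted-arguments", value=argument)  # hypothetical topic name
producer.flush()

An enrollment thread on the consumer side would then read from the same topic and push each argument through the NLP pipeline.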

Now, let’s break this architecture into two pieces.

Phase 1: Argument mining from Twitter

Leading up to GRAMMY night, fans can reply to Tweets about each topic, such as "Virtual concerts are a better experience than live shows." This is the crowd's opportunity to voice their opinions on Twitter. A set of applications searches for all of the relevant responses to a top-level Tweet to include in the corpus of candidate arguments. To optimize quality and language fluency, each Tweet is processed by several algorithms that determine whether it is a fit for speech synthesis. Figure 3 highlights the components of the argument mining process.

Figure 3. The argument mining portion of the architecture: argument mining connects to Twitter, the Cloudant NLP store, Speech by Crowd, extractive summarization, the social transformer and editing expansion (backed by IBM Cloud Object Storage), and copyright infringement detection (backed by the bare metal lyric Redis databases)

The architecture

The argument mining workflow is orchestrated by a Python application that runs on OpenShift. The application queries responses to Tweets from social influencers to find highly relevant and polarized opinions. Hashtags, symbols, and @ mentions are removed from each Tweet. The cleaned text is then posted to an extractive summarization service to compress the text into a single sentence.
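
As a sketch of that cleaning step, the following function strips URLs, hashtags, @ mentions, and symbol markers with regular expressions. The exact patterns used in production are not published, so treat these as assumptions.

import re

def clean_tweet(text: str) -> str:
    # Remove Twitter-specific markup so only plain prose remains.
    # The patterns are illustrative, not the production rules.
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"[@#$]\w+", "", text)      # @ mentions, hashtags, symbols
    text = re.sub(r"\s+", " ", text)          # collapse leftover whitespace
    return text.strip()

print(clean_tweet("@GRAMMYs Virtual concerts rock! #GRAMMYs https://t.co/xyz"))
# -> "Virtual concerts rock!"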

Next, each cleaned and focused sentence is paraphrased by a T5 language model. Candidates are ranked according to surface and semantic forms, producing a quality score. Only the highest-quality sentences, those that pass an experimentally determined threshold, continue to the next step. Each high-quality sentence is then posted to a copyright infringement service. Finally, all of the opinions that pass are batch ingested into the Speech by Crowd IBM Research platform and associated with a particular topic. The entire workflow consists of five steps.

Step 1. Twitter mining

The Twitter mining process is incredibly detailed. At a high level, we first use Twitter's API to build a search dictionary that time-boxes the search for replies to each top-level Tweet. We then look for all direct replies to the original author within a four-hour time window. From here, the process of mining the conversation thread begins. If an eligible Tweet is retrieved, we clean it by extracting its text and applying quality filters.

To walk the reply threads and find relevant Tweets, we use recursion: searching for replies within an arbitrary number of nested threads maps naturally to a recursive search. As shown in the following code example, the _get_replies method first defines an escape condition that limits the search to a fixed conversation depth. If we reach that limit, we return the replies collected so far and process the results. If we do not reach the depth limit, we continue deepening.

from datetime import datetime

import requests
from requests.auth import HTTPBasicAuth


def _get_replies(self, payload, tweet_id, screen_name,
                 tweet_creation_datetime, counter):
    # Search for Tweets sent to the author, starting from the time the
    # parent Tweet was created.
    payload_usr = dict.copy(payload)
    payload_usr['query'] = 'to:' + str(screen_name)
    payload_usr['fromDate'] = tweet_creation_datetime.strftime("%Y%m%d%H%M")
    if 'toDate' in payload_usr:
        del payload_usr['toDate']
    response = requests.post(self._url, json=payload_usr,
                             auth=HTTPBasicAuth(self._user, self._pwd)).json()

    # Escape condition: stop deepening once the conversation depth limit is hit.
    retweets = []
    counter += 1
    if counter > 4:
        return retweets

    if 'results' in response:
        for tweet in response['results']:
            if 'text' in tweet and 'in_reply_to_status_id' in tweet:
                # Keep only direct replies to the Tweet we are expanding.
                if tweet['in_reply_to_status_id'] == tweet_id:
                    if 'extended_tweet' in tweet:
                        tweet_text = tweet['extended_tweet']['full_text']
                    else:
                        tweet_text = tweet['text']
                    retweets.append({"text": tweet_text, "topic": self._topic})
                    # If this reply has replies of its own, recurse into them.
                    if ('reply_count' in tweet and tweet['reply_count'] > 0
                            and 'user' in tweet and 'screen_name' in tweet['user']):
                        pick_tweet_id = tweet['id']
                        pick_screen_name = tweet['user']['screen_name']
                        pick_tweet_creation_datetime = datetime.strptime(
                            tweet['created_at'], '%a %b %d %H:%M:%S %z %Y')
                        retweets = retweets + self._get_replies(
                            payload, pick_tweet_id, pick_screen_name,
                            pick_tweet_creation_datetime, counter)
    return retweets

Step 2. Extractive summarization

After we have a list of Tweets, we enter the second step of the process: summarization. The algorithm we use is extractive, meaning it builds summaries from text fragments that already exist in the original source. The alternative, which we did not use in order to stay true to the source data, is abstractive summarization, where the algorithm produces new, fabricated text. Our extractive algorithm is unsupervised, so we do not need any labeled data or domain knowledge, which makes it much easier to apply to a new problem domain such as the GRAMMYs. The resulting sentences are compressed, focused, and concise.
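
We don't name the exact extractive algorithm here, but a minimal frequency-based extractive summarizer illustrates the idea: score each sentence by how representative its words are of the whole text, and keep only the top-scoring sentence.

import re
from collections import Counter

def extract_top_sentence(text: str) -> str:
    # Minimal extractive summarizer: return the existing sentence whose
    # words best represent the whole text. Illustrative only; the
    # production algorithm differs.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    return max(sentences, key=score)

Because the summary is always a sentence taken verbatim from the source, nothing is fabricated, which is exactly the property we wanted.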

Step 3. Paraphrasing

After we finish the extractive summarization, we post the sentences to a paraphraser and rewriter that attempts to boost the natural language quality of the Tweets. To accomplish this, we use a T5-small model built on PyTorch and the Hugging Face transformers library.

We used an existing transformer model and fine-tuned it for our task of Twitter rewriting. An important ingredient for transfer learning is the unlabeled data set used for pre-training; this is how the model learns an encoding representation of the data. For pre-training, the data set must be high quality, diverse, and large. Unfortunately, Wikipedia does not meet all of these requirements: the corpus is large and high quality but uniform in style. Another common data set is Common Crawl, which consists of scraped web pages. This data set is large in scale and diverse but low in quality. As a result, the team that trained the original transformer created the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia. C4 meets the quality, diversity, and scale requirements and is available through TensorFlow Datasets.

The original model can accept a sentence and rewrite it into a new sentence. We fine-tuned the small T5 model with only 150 labeled exemplars, each pairing an original Tweet with a target rewritten sentence as its label. We collected 150 of these pairs and created our own fine-tuned paraphraser for the GRAMMYs.
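
A minimal sketch of generating paraphrase candidates with T5-small through the Hugging Face transformers library follows. The task prefix, checkpoint name, and decoding parameters are assumptions; the fine-tuned GRAMMYs checkpoint itself is not public.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# "t5-small" stands in for the fine-tuned paraphraser checkpoint.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

# The "paraphrase:" task prefix is an assumption about the fine-tuning setup.
inputs = tokenizer("paraphrase: virtual concerts r way better than live shows",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    output_ids = model.generate(**inputs, num_beams=5, num_return_sequences=5,
                                max_length=64, early_stopping=True)
candidates = [tokenizer.decode(ids, skip_special_tokens=True)
              for ids in output_ids]

Each candidate, along with the original sentence, then flows into the quality ranking described in Step 4.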

The paraphraser application exposes a Swagger endpoint so that a sentence can be posted for paraphrasing. The service encapsulates spell checking and quality measures, and returns the top-ranked sentence from the set of expanded sentences plus the original sentence.

Figure 4. The Swagger interface for text paraphrasing
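
Calling the service is a single POST request. The endpoint path and payload shape below are hypothetical; the Swagger contract in Figure 4 defines the real ones.

import requests

# Hypothetical endpoint and payload shape for the paraphrasing service.
response = requests.post(
    "https://paraphraser.example.com/api/v1/paraphrase",
    json={"sentence": "virtual concerts r way better than live shows"},
)
print(response.json())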

Step 4. Sentence quality

Now that we have rewritten the Tweets into natural language candidates, we apply a quality pipeline. First, we check for any spelling errors and correct them. Next, the clean text is passed to a surface-form quality measure. The en_core_web_lg model from spaCy encodes our words, giving us the token positions and part-of-speech tags for each word. We use the part-of-speech patterns within a sentence to determine the surface-form quality. We apply a set of rules, in the form of case-based reasoning, to produce a foundational score. We also trained an XGBoost tree ensemble on part-of-speech sequences with quality labels of true or false; this model is applied to the extracted part-of-speech sequences to retrieve a machine learning score. We used a list of tags and parts of speech to discern the quality. We trained on 1,897 exemplars, with a 70% train and 30% test split, and achieved an accuracy of approximately 80%.
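
The sketch below shows one way such a surface-form classifier could be wired together, using part-of-speech n-gram counts as features. The actual feature set, labels, and hyperparameters are not published, so treat the details as assumptions.

import spacy
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier

nlp = spacy.load("en_core_web_lg")

def pos_sequence(sentence: str) -> str:
    # Represent a sentence as its part-of-speech tag sequence,
    # for example "Music education matters" -> "NOUN NOUN VERB".
    return " ".join(tok.pos_ for tok in nlp(sentence))

# In production this would be the 1,897 hand-labeled exemplars
# (1 = fluent surface form, 0 = not); these four are placeholders.
sentences = ["Music education should be mandatory in schools.",
             "Virtual concerts are a better experience than live shows.",
             "mandatory music yes education school",
             "concerts virtual live better experience than"]
labels = [1, 1, 0, 0]

# Count part-of-speech n-grams as features for the gradient-boosted trees.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(pos_sequence(s) for s in sentences)

model = XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X, labels)

# The probability of the "fluent" class serves as the machine learning
# score that is averaged with the rule-based score.
ml_scores = model.predict_proba(X)[:, 1]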

We picked the XGBoost algorithm from the lineage of decision trees. Decision trees learn splitting rules for patterns. When we train many trees on bootstrapped samples of the data and ensemble them together, we call this bagging. Going further, a random forest builds each tree from a random subset of predictors before ensembling. Increasing in complexity, boosting learns models sequentially, increasing the influence of the better-performing models after each build. Gradient boosting takes this further by fitting each sequential model to the errors of the previous ones. Finally, we arrive at our selection, XGBoost, which prunes poorly performing trees and adds regularization terms to an optimized gradient-boosting algorithm.

We average the rule-based and model-based scores into an overall quality score, and only the highest-quality surface forms are retained and returned. This is how we pick the best paraphrased sentence.

Next, we check the semantic quality of the sentence. We run a Project Debater polarity detection model and pick sentences that take a clear stance toward a topic. Then, we determine how well the paraphrased sentence is aligned with the meaning of the original sentence. If the best paraphrased sentence survives these quality checks, it moves on to the next step: infringement detection.
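
The polarity model is part of the Project Debater services, but the alignment check can be pictured as a sentence-embedding similarity test. The sketch below uses the sentence-transformers library; the embedding model and threshold are chosen purely for illustration.

from sentence_transformers import SentenceTransformer, util

# A compact general-purpose embedding model; the production alignment
# model is not published, so this choice is an assumption.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

original = "virtual concerts r way better than live shows"
paraphrase = "Virtual concerts are a better experience than live shows."

embeddings = encoder.encode([original, paraphrase], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

# 0.8 is an illustrative threshold, not the experimentally determined one.
if similarity >= 0.8:
    print(f"aligned (cosine similarity {similarity:.2f})")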

Step 5. Lyric infringement detection

A challenging and important problem we had to solve was ensuring that no arguments infringed on music lyrics. If any overlapping text is found within a song, the text is removed from the candidate argument list. To detect infringement, a search payload is sent to two bare metal machines. The query uses the lyric text and, if applicable, the artist name from the topic. If the artist name is available, we search only the song lyric index. We check for verbatim matches, in-order matches, and matches within a slop (word-distance) window. We experimented with different queries for each of the topics to get good infringement detection coverage.
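
To make the match types concrete, here is a minimal pure-Python sketch of verbatim and in-order-with-slop matching against a lyric phrase. The production system runs equivalent queries against the Redis-backed lyric index; the slop window here is illustrative.

def verbatim_match(argument: str, lyric: str) -> bool:
    # The lyric phrase appears word for word in the argument.
    return lyric.lower() in argument.lower()

def in_order_match(argument: str, lyric: str, slop: int = 2) -> bool:
    # Lyric words appear in order, allowing up to `slop` extra words
    # between consecutive lyric words.
    arg_words = argument.lower().split()
    lyric_words = lyric.lower().split()
    position = 0
    for word in lyric_words:
        try:
            found = arg_words.index(word, position)
        except ValueError:
            return False
        if position and found - position > slop:
            return False
        position = found + 1
    return True

print(in_order_match("I really will always truly love you",
                     "I will always love you"))
# -> True: every lyric word appears in order within the slop window.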

The infringement corpus was a 29 GB data set from the lyric licensing company LyricFind. The data was ingested onto the two Redis bare metal machines, with 32 GB of RAM and 2 TB of disk each. The raw storage and 20 cores per machine ensured that we had enough compute and memory capacity for thousands of parallel text searches. Multiple concurrent threads pushed search queries during batch jobs every four hours.

Finally, all of the Twitter arguments that pass all of the filters are stored in a database, together with arguments collected on GRAMMY.com.

To understand how these arguments are summarized and key highlights extracted, check out part 2 in this series.

Music + debating

Music and debating go together just like artificial intelligence and humans. Combining them creates world-class experiences where we can gain a better understanding of all sides of an issue. Learn more at https://www.ibm.com/sports/grammys/.

Join us live!

Join the authors of this post live on March 16, 2021, at 12 p.m. (ET) as we talk about the solution we built for the GRAMMYs and show you how the argument results changed over the course of the two weeks leading up to the show. Sign up at: https://www.crowdcast.io/e/grammys-debates-with-watson-from-the-Lab-to-musics-biggest-night/.

Originally published at https://developer.ibm.com.
