Project Debater Tutorial: Finding Insights in Survey Data

IBM Developer
IBM Data Science in Practice
13 min read · Jun 1, 2021

Project Debater team authors: Yoav Kantor, Yoav Katz, Roy Bar-Haim, Elad Venezian, and Noam Slonim

Photo by Tomek Baginski on Unsplash

When you have a large collection of texts representing people’s opinions, such as product reviews, survey answers or social media, it is difficult to understand the key issues that come up in the data. Going over thousands of comments is prohibitively expensive. Existing automated approaches are often limited to identifying recurring phrases or concepts and the overall sentiment toward them, but do not provide detailed or actionable insights.

In this tutorial, you will gain hands-on experience using Project Debater services to analyze and derive insights from open-ended answers. To learn more about an overall view of Project Debater, please read our introduction piece, or read more about Project Debater in action. If you would like more hands-on experience using it yourself after reading this tutorial, please see the Project Debater Tutorial GitHub page (this post is derived from one of the tutorials there).

The data we will use is a community survey conducted in the City of Austin in the years 2016 and 2017. In this survey, the citizens of Austin were asked “If there was ONE thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?”.

We will analyze their open-ended answers in different ways by using four Debater services: the Argument Quality service, the Key Point Analysis (KPA) service, the Term Wikifier service and the Term Relater service. We will see how these services can be combined into a powerful text analysis tool.

Run Key Point Analysis on 1000 randomly selected sentences from 2016 survey

Read random sample of 1000 sentences from 2016 comments

Let’s take a look at the dataset_austin_sentences.csv file, which holds the Austin survey dataset.

The file contains all the survey answers after they were split into sentences. Each row in the file corresponds to a single sentence and has the following attributes: ['id', 'text', 'district', 'year']. We will first read the csv file into the ‘sentences’ variable.

import csv
import random

with open('./dataset_austin_sentences.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    sentences = list(reader)

Let’s have a look at the content of the sentences variable.

print('There are %d sentences in the dataset' % len(sentences))
print('Each sentence is a dictionary with the following keys: %s' % str(sentences[0].keys()))
There are 6274 sentences in the dataset
Each sentence is a dictionary with the following keys: dict_keys(['id', 'text', 'district', 'year'])

Let’s select only the sentences from the 2016 survey and randomly sample 1000 of them. The Key Point Analysis service can run over hundreds of thousands of sentences; however, since the computation is resource-intensive, particularly on GPUs, the trial version is limited to 1000 sentences.

sentences_2016 = [sentence for sentence in sentences if sentence['year'] == '2016']
print('There are %d sentences in the 2016 survey' % len(sentences_2016))
random.seed(0)
random_sample_sentences_2016 = random.sample(sentences_2016, 1000)
There are 3005 sentences in the 2016 survey

Run Key Point Analysis on the random sample

Key point analysis is a novel and promising approach for summarization, with an important quantitative angle. This service summarizes a collection of comments on a given topic as a small set of key points. The salience of each key point is given by the number of its matching sentences in the given comments.

Before running the Key Point Analysis service, we first need to initialize our client. The DebaterApi object supplies the clients for the various Debater services. The clients print information using the logger, and a suitable verbosity level should be set. The DebaterApi object is configured with an API key, which should be retrieved from the Project Debater Early Access Program site. In this case it is passed via the environment variable DEBATER_API_KEY. We then obtain the keypoint client from the DebaterApi object.

The Key Point Analysis service stores the data (and results cache) in a domain. A user can create several domains, one for each dataset. Domains are only accessible to the user who created them. In this tutorial, we will run all Key Point Analysis jobs in the same domain named ‘austin_demo’.

Full documentation of the Key Point Analysis service can be found here.

from debater_python_api.api.debater_api import DebaterApi
from austin_utils import init_logger
import os
init_logger()
api_key = os.environ['DEBATER_API_KEY']
debater_api = DebaterApi(apikey=api_key)
keypoints_client = debater_api.get_keypoints_client()
domain = 'austin_demo'

Let’s define a method named run_kpa. The method receives a list of sentences, where each sentence is a dictionary with the keys ‘id’ and ’text’, and runs Key Point Analysis on these sentences.

In order to run Key Point Analysis, as in the run_kpa method below we need to:

  1. Upload the comments into a domain using the keypoints_client.upload_comments(domain=domain, comments_ids=sentences_ids, comments_texts=sentences_texts, dont_split=True) method. This method receives the domain, a list of comment_ids and a list of comment_texts. By default, when uploading comments into a domain, the Key Point Analysis service splits the comments into sentences and runs a minor cleansing on the sentences. Since we already split the comments into sentences ourselves and want the Key Point Analysis service to use them as-is, we set the dont_split parameter to True.
  2. Wait until all comments in the domain are processed using the keypoints_client.wait_till_all_comments_are_processed(domain=domain) method.
  3. Start a Key Point Analysis job using the future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids, run_params=run_params) method. This method receives the domain, a list of comment_ids and a run_params. The run_params is a dictionary with various parameters for customizing the job. One of the parameters we can set is n_top_kps, which tells the system how many key points are required. We will set it to 20, using run_params={'n_top_kps': 20}. The job runs asynchronously, so the method returns a future object.
  4. Use the returned future and wait until the results are available using the kpa_result = future.get_result(high_verbosity=True, polling_timout_secs=5) method. The method waits for the job to finish and eventually returns the result.

The result is a data structure containing the key points, sorted in descending order according to number of matched sentences, and for each key point, there is a list of matched sentences, also sorted in descending order according to their match score. An additional ‘none’ key point is added which holds all the sentences that don’t match any key point.
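
The print_results helper used later in this tutorial displays exactly this structure. As a rough sketch of how such a result can be consumed, assuming (as in the debater_python_api examples) that it is a dictionary with a 'keypoint_matchings' list whose entries hold a 'keypoint' string and its 'matching' sentences, one could iterate it as follows:

# A hedged sketch, assuming the result structure described above;
# the field names follow the debater_python_api examples.
def show_result(kpa_result, n_sentences_per_kp=2):
    for matching in kpa_result['keypoint_matchings']:
        kp = matching['keypoint']
        matches = matching['matching']
        print('%s (%d matching sentences)' % (kp, len(matches)))
        for match in matches[:n_sentences_per_kp]:
            print('  - %s (score %.2f)' % (match['sentence_text'], match['score']))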

def run_kpa(sentences):
    sentences_texts = [sentence['text'] for sentence in sentences]
    sentences_ids = [sentence['id'] for sentence in sentences]
    run_params = {'n_top_kps': 20}
    keypoints_client.upload_comments(domain=domain, comments_ids=sentences_ids, comments_texts=sentences_texts, dont_split=True)
    keypoints_client.wait_till_all_comments_are_processed(domain=domain)
    future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids, run_params=run_params)
    kpa_result = future.get_result(high_verbosity=True, polling_timout_secs=5)
    return kpa_result

We will now run the run_kpa method over the random sample and print the result. The austin_utils file is in the full GitHub tutorial.

from austin_utils import print_results

kpa_result_random_1000_2016 = run_kpa(random_sample_sentences_2016)
print_results(kpa_result_random_1000_2016, n_sentences_per_kp=2, title='Random sample 2016')

Below is a summary of the results of the key point analysis.

Random sample 2016 coverage: 28.36%
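
The coverage figure is the percentage of sentences that matched at least one key point; sentences under the ‘none’ key point are unmatched. Under the same assumed result structure as in the sketch above, it could be computed like this:

# A hedged sketch: coverage = matched sentences as a percentage of all
# sentences, assuming each sentence appears exactly once in the result,
# either under its best-matching key point or under 'none'.
def coverage(kpa_result):
    matched = unmatched = 0
    for matching in kpa_result['keypoint_matchings']:
        if matching['keypoint'] == 'none':
            unmatched += len(matching['matching'])
        else:
            matched += len(matching['matching'])
    return 100.0 * matched / (matched + unmatched)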

Run Key Point Analysis on 1000 top quality sentences from 2016 survey

Select top 1000 sentences from 2016 data using the Argument Quality service

Photo by Megan (Markham) Bucknall on Unsplash

The answers in the Austin Survey dataset vary in length, style, and quality. Selecting the sentences randomly may lead to running over many sentences that are not very informative. The analysis above used randomly selected sentences and only reached 28.36% coverage of the dataset, meaning that only 28.36% of the sentences matched a key point. In order to improve the coverage and the quality of our results, we will now run over higher quality sentences, selecting the 1000 sentences with the highest Argument Quality score. The Argument Quality service receives pairs of [sentence, topic] and returns a score indicating whether the sentence is phrased in grammatically correct, clear and concise language. The quality ranking is based on a machine learning model that was trained on human assessments of over 30,000 arguments. Below we show the code for, and the results of, ranking the sentences against the topic “Austin is a great place to live”.

from austin_utils import print_top_and_bottom_k_sentences

def get_top_quality_sentences(sentences, top_k, topic):
    arg_quality_client = debater_api.get_argument_quality_client()
    sentences_topic = [{'sentence': sentence['text'], 'topic': topic} for sentence in sentences]
    arg_quality_scores = arg_quality_client.run(sentences_topic)
    sentences_and_scores = zip(sentences, arg_quality_scores)
    sentences_and_scores_sorted = sorted(sentences_and_scores, key=lambda x: x[1], reverse=True)
    sentences_sorted = [sentence for sentence, _ in sentences_and_scores_sorted]
    print_top_and_bottom_k_sentences(sentences_sorted, 10)
    return sentences_sorted[:top_k]

sentences_2016_top_1000_aq = get_top_quality_sentences(sentences_2016, 1000, 'Austin is a great place to live')

Top quality sentences:
- Affordable housing is essential to keep Austin diverse, welcoming, and growing in the ways that reflect the progressive ideals of this city and the future generations.
- Austin has a unique charm, with high quality customers service, and great quality of life that will quickly deteriorate with this rapid urban sprawl.
- We need to make sure that our city continues to be an example across the country for unwavering progress, with regards to energy, policing, fair housing and employment, non-discrimination policy, and creating a sound infrastructure to accomodate the city’s rapid growth

Bottom quality sentences:
- BRING LYFT UBER BACK.
- Please compost!
- FIX THE WASHBOARD SURFACE OF GREYSTONE BETWEEN MESA NAD BALBURN

Run Key Point Analysis over the selected sentences

We will now run the run_kpa method over the top 1000 quality sentences.

kpa_result_top_aq_1000_2016 = run_kpa(sentences_2016_top_1000_aq)
print_results(kpa_result_top_aq_1000_2016, n_sentences_per_kp=2, title='Top aq 2016')

Here is a summary of that output:

Top argument quality selected sentences 2016 coverage: 41.05%

Increase coverage by decreasing the matching threshold

By running key point analysis over higher quality sentences, we managed to increase our coverage of the dataset to 41.05%. To increase the coverage further, we will add another parameter to the run_params, called mapping_threshold.

We will rerun the run_kpa method, but this time it will receive a threshold parameter, which is used in the run_params in the following way: run_params={'n_top_kps': 20, 'mapping_threshold': threshold}

The mapping_threshold is responsible for deciding whether a sentence matches, or supports, a key point. Therefore, reducing the threshold from its default value of 0.99 makes more sentences match key points and increases the coverage, at the risk of reducing precision.

In addition, the method will now return two values: the result, and the job_id stored in the future (obtained via the future.get_job_id() method). We will need this job_id in the next section.

def run_kpa(sentences, threshold):
    sentences_texts = [sentence['text'] for sentence in sentences]
    sentences_ids = [sentence['id'] for sentence in sentences]
    run_params = {'n_top_kps': 20, 'mapping_threshold': threshold}
    keypoints_client.upload_comments(domain=domain, comments_ids=sentences_ids, comments_texts=sentences_texts, dont_split=True)
    keypoints_client.wait_till_all_comments_are_processed(domain=domain)
    future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids, run_params=run_params)
    kpa_result = future.get_result(high_verbosity=True, polling_timout_secs=5)
    return kpa_result, future.get_job_id()

Let’s now run again over the top 1000 quality sentences, this time with a mapping threshold of 0.95.

kpa_result_top_aq_1000_2016, kpa_top_aq_1000_2016_job_id = run_kpa(sentences_2016_top_1000_aq, 0.95)
print_results(kpa_result_top_aq_1000_2016, n_sentences_per_kp=2, title='Top aq 2016')
Top aq 2016 coverage: 49.89

Now, with the lowered threshold of 0.95, the coverage is increased to 49.89%, and the key point analysis is summarized below.

Let’s examine the top 5 and bottom 5 sentences that were matched to the first key point and make sure that the precision is still high.

from austin_utils import print_top_and_bottom_matches_for_kp
print_top_and_bottom_matches_for_kp(kpa_result_top_aq_1000_2016, 'Traffic congestion needs major improvement', 5, 5)

Top 5 matches:
- Traffic congestion needs major improvement.
- Austin need improved transportation infrastructure to alleviate current traffic and accommodate rapid population growth.
- Fast population growth is the cities biggest problem in areas such as congestion and expensiveness.
- I really wish that city planning would find a way to improve traffic flwo.
- Public transport to improve Mopac Congestion

Bottom 5 matches:
- The biggest problem Austin has is a lack of viable north-south arteries.
- If the train traveled to more locations there would be more people on the train and less traffic.
- Please plan for the future through investments in a comprehensive transit plan rather than quick fixes related to traffic.
- Providing more bike lanes and dashed lanes is adding to the problem because drivers have no idea what the rules are and neither do bicycles.
- Traffic in Austin is a major problem no matter which road you use.

Run Key Point Analysis over 2017 survey using the key points from 2016 survey

Select top 1000 sentences from 2017 data using the Argument Quality service

A much needed capability when analyzing data is comparing different subsets of the data, such as different years or different geographical districts. We will demonstrate how easy it is to compare the 2017 data to the 2016 data. A similar comparison can be done between districts or other subsets.

Let’s first filter the 2017 sentences and take the top 1000 quality sentences, as done for the 2016 sentences.

sentences_2017 = [sentence for sentence in sentences if sentence['year'] == '2017']
sentences_2017_top_1000_aq = get_top_quality_sentences(sentences_2017, 1000, 'Austin is a great place to live')

Top quality sentences:
- Affordable housing is important to promote diversity in Austin.
- More space (and more affordable, quality housing) means more artists, more artwork of all mediums, and a more culturally diverse and enriched city.
- AUSTIN NEEDS A BETTER PUBLIC TRANSPORTATION SYSTEM THAT CAN ACCOMMODATE MORE PEOPLE TO HELP WITH TRAFFIC AND OVERALL GROWTH OF THE CITY.

Bottom quality sentences:
- Well, come on!!!!!!
- I LOVE TREES BUT JEEZ!
- Don’t California our Texas

Run Key Point Analysis over top 1000 quality 2017 sentences using the key points from 2016

In order to compare the 2017 sentences to the 2016 sentences, we want to map the 2017 sentences to the same key points extracted from the 2016 sentences. Otherwise, different key points could be automatically extracted from the 2017 sentences, making it hard to compare between the two years.

To this end, we will reimplement the run_kpa method. This time the method receives a new key_points_by_job_id parameter, which is passed on to the keypoints_client.start_kp_analysis_job method. When key_points_by_job_id is None, key points are automatically extracted; when it is set to the job_id of a previous job, the key points from that job are used and all sentences are matched to them.

def run_kpa(sentences, threshold, key_points_by_job_id=None):
    sentences_texts = [sentence['text'] for sentence in sentences]
    sentences_ids = [sentence['id'] for sentence in sentences]
    run_params = {'n_top_kps': 20, 'mapping_threshold': threshold}
    keypoints_client.upload_comments(domain=domain, comments_ids=sentences_ids, comments_texts=sentences_texts, dont_split=True)
    keypoints_client.wait_till_all_comments_are_processed(domain=domain)
    future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids, run_params=run_params, key_points_by_job_id=key_points_by_job_id)
    kpa_result = future.get_result(high_verbosity=True, polling_timout_secs=5)
    return kpa_result, future.get_job_id()

Let’s use the new run_kpa and provide it with the top 1000 quality sentences from 2017 and the job_id of top 1000 quality sentences from 2016.

kpa_result_top_aq_1000_2017, _ = run_kpa(sentences_2017_top_1000_aq, 0.95, kpa_top_aq_1000_2016_job_id)
print_results(kpa_result_top_aq_1000_2017, n_sentences_per_kp=2, title='Top aq 2017, using 2016 key points')

What follows is a summary of the key point analysis of the 2017 quality sentences using 2016 key points:

Top argument quality sentences from 2017 coverage (using 2016 key points): 49.15%

Since both jobs have the same key points, we can now easily compare the two results.

from austin_utils import compare_results

compare_results(kpa_result_top_aq_1000_2016, '2016', kpa_result_top_aq_1000_2017, '2017')
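
The compare_results helper comes from austin_utils and is not shown here. A plausible, hypothetical sketch of such a comparison, again assuming the 'keypoint_matchings' result structure from earlier, is to count the matched sentences per key point in each result and print them side by side:

# A hypothetical sketch of a per-key-point comparison between two KPA
# results; compare_results in austin_utils may differ in its details.
def compare_kpa_results(result_a, label_a, result_b, label_b):
    def kp_counts(result):
        return {m['keypoint']: len(m['matching'])
                for m in result['keypoint_matchings']
                if m['keypoint'] != 'none'}
    counts_a, counts_b = kp_counts(result_a), kp_counts(result_b)
    for kp in sorted(counts_a, key=counts_a.get, reverse=True):
        print('%-70s %s: %3d, %s: %3d' % (kp, label_a, counts_a[kp], label_b, counts_b.get(kp, 0)))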

Note: This comparison is for illustration only. Since we ran on a subset of the comments, the statistical significance of differences between the years is limited, except for the most recurring key points.

Deep dive into the traffic problem in Austin using the Term Wikifier and Term Relater services

Photo by Carlos Alfonso on Unsplash

As we’ve seen in the 2016 results, the traffic problem in Austin is significant. In this section, we will use the Term Wikifier and Term Relater services to select a subset of the sentences related to the Traffic topic and run Key Point Analysis over them.

The Term Wikifier service runs over sentences and identifies the Wikipedia concepts that are referenced by phrases in the sentence text. Concepts correspond to Wikipedia articles. Each occurrence of a concept in the sentence is called a mention. For example, the sentence “My car insurance went up 20% due to vehicle thefts and burglary” mentions three Wikipedia concepts: the phrase “car insurance” is mapped to the concept Vehicle insurance, the phrase “vehicle thefts” is mapped to the concept Motor vehicle theft, and the phrase “burglary” is mapped to the concept Burglary.
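
To make this concrete, here is a minimal sketch of wikifying the example sentence above, using the same client calls that appear in the code later in this section (the printed concept titles are the expected ones from the example, not captured output):

# A minimal sketch; mentions_lists[0] holds the mentions of the first
# (and only) input sentence, each with a 'concept' and its 'title'.
term_wikifier_client = debater_api.get_term_wikifier_client()
mentions_lists = term_wikifier_client.run(
    ['My car insurance went up 20% due to vehicle thefts and burglary'])
for mention in mentions_lists[0]:
    print(mention['concept']['title'])
# Expected concepts: Vehicle insurance, Motor vehicle theft, Burglary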

The Term Relater service runs over pairs of Wikipedia concepts and scores how closely the concepts in each pair are related. For example, the Car concept is closely related to the Traffic concept, while the Cat concept is not.
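
As a minimal sketch of this (the scores are illustrative, not captured output), the Term Relater client used later in this section scores each pair of concepts:

# A minimal sketch; run() receives a list of concept pairs and returns
# one relatedness score per pair.
term_relater_client = debater_api.get_term_relater_client()
scores = term_relater_client.run([['Car', 'Traffic'], ['Cat', 'Traffic']])
print(scores)  # expect a high score for Car-Traffic, a low one for Cat-Traffic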

We will use the Term Wikifier to extract all mentions in all sentences. We will then use the Term Relater to select the subset of these mentions that are related to the ‘Traffic’ concept. Next, we select all sentences that have mentions related to the ‘Traffic’ concept, and finally run Key Point Analysis over them. Running over these sentences will create key points specific to the traffic problem in Austin and expose insights and suggestions related to it.

Calculate the mentions in the sentences using the Term Wikifier

The get_sentence_to_mentions(sentences_texts) method uses the Term Wikifier service to calculate the mentions for each sentence and stores them in a dictionary named sentence_to_mentions.

The Term Wikifier client runs over the sentences_texts using the mentions_list = term_wikifier_client.run(sentences_texts) method and returns a list of mentions_lists.

def get_sentence_to_mentions(sentences_texts):
    term_wikifier_client = debater_api.get_term_wikifier_client()
    mentions_list = term_wikifier_client.run(sentences_texts)
    sentence_to_mentions = {}
    for sentence_text, mentions in zip(sentences_texts, mentions_list):
        sentence_to_mentions[sentence_text] = set([mention['concept']['title'] for mention in mentions])
    return sentence_to_mentions

Let’s calculate the mentions on all 2016 sentences.

sentences_2016_texts = [sentence['text'] for sentence in sentences_2016]
sentence_to_mentions = get_sentence_to_mentions(sentences_2016_texts)

Find the mentions that relate to the traffic concept using the Term Relater service

Since we’re interested in the Traffic concept, we will now take all mentions and find the ones that are related to that concept. Then we will select all sentences that have at least one mention that is related to the Traffic concept.

all_mentions = set([mention for sentence in sentence_to_mentions
                    for mention in sentence_to_mentions[sentence]])

The get_related_mentions(concept, threshold, all_mentions) method receives a concept, a threshold and all_mentions. It uses the Term Relater service to calculate the relatedness between each mention and the concept, and returns all mentions with a relatedness score above the given threshold. The term_relater_client runs over the pairs using the scores = term_relater_client.run(concept_mention_pairs) method and returns a list of scores.

def get_related_mentions(concept, threshold, all_mentions):
    term_relater_client = debater_api.get_term_relater_client()
    concept_mention_pairs = [[concept, mention] for mention in all_mentions]
    scores = term_relater_client.run(concept_mention_pairs)
    return [mention for mention, score in zip(all_mentions, scores)
            if score > threshold]

We will now use the get_related_mentions method and find the mentions that match the Traffic concept.

matched_mentions = get_related_mentions('Traffic', 0.5, all_mentions)
print(matched_mentions)

This prints the many mentions found to be related to the Traffic concept.

Run Key Point Analysis over the sentences that relate to the Traffic concept

Let’s select the sentences that have mentions related to the Traffic concept and run a key point analysis over them. We need to map the matched sentence texts back to their sentence dictionaries, since our run_kpa method expects dictionaries.

matched_sentences_texts = [sentence for sentence in sentences_2016_texts
                           if len(sentence_to_mentions[sentence].intersection(matched_mentions)) > 0]
matched_sentences = [sentence for sentence in sentences_2016 if sentence['text'] in matched_sentences_texts]
matched_sentences = matched_sentences if len(matched_sentences) <= 1000 else random.sample(matched_sentences, 1000)
print('Running over %d sentences' % len(matched_sentences))
Running over 980 sentences

Finally, let’s run over these sentences and examine the traffic-related key points.

kpa_result_traffic_2016, _ = run_kpa(matched_sentences, 0.99, None)
print_results(kpa_result_traffic_2016, n_sentences_per_kp=2, title='Traffic KPA 2016')
Traffic KPA 2016 coverage: 62.30

Here is a summary of this key point analysis:

Traffic KPA 2016 coverage: 62.30%

Conclusion

In this tutorial, we’ve shown how Key Point Analysis can provide you with detailed insights over survey data right out of the box, significantly reducing the effort required by a data scientist to analyze the data. We also demonstrated how key point analysis over unstructured text can be combined with available structured information to provide new views of the data. Finally, we showed how additional Project Debater text analysis services such as Argument Quality, Term Wikifier, and Term Relater can further improve the quality of the results. Please check out our GitHub tutorial to learn more about how to use the Project Debater services.
