Create a word cloud from Reddit API data using PRAW and spaCy

Daniel Mendoza
MCD-UNISON
4 min read · Sep 18, 2021

A step-by-step tutorial for creating word clouds on any topic with Python, using spaCy for natural language processing and data extracted from Reddit via the PRAW wrapper.

Text scraping is a powerful technique that lets us extract and process large amounts of data from different sources, such as web pages or APIs (Application Programming Interfaces).

In this example, we are going to create a word cloud of what people on Reddit are saying about bitcoin in El Salvador. We will use PRAW (Python Reddit API Wrapper) to extract the data and the natural language processor spaCy to clean it.

These are the steps to build the word cloud:

  1. Log in to Reddit, or create an account if you don’t have one
  2. Register the application to get OAuth 2.0 authorization
  3. Load needed libraries and create a Reddit instance with credentials
  4. Get submissions and their comments related to El Salvador and Bitcoin
  5. Clean and process all text and extract important data using spaCy
  6. Create and plot the word cloud

Step 1: Login or create an account on Reddit https://www.reddit.com/

Step 2: Before starting, please read the API usage guidelines. They place some restrictions on using the API; for example, we may make up to 60 requests per minute.

  • To register the application, click on your user avatar and then click on User Settings.
  • After that, click on Safety & Privacy.
  • Then, click on Manage third-party app authorization.
  • Finally, click on are you a developer? create an app… and fill in the form:
  • Fill in name with your app name
  • There are three types of applications; since we are going to use Python, select script.
  • Enter any description for the app
  • For redirect uri, just write http://localhost:8080
  • Click on the create app button
  • For more information about the options, you can check the documentation

Now we have the authorization credentials:

  • client_id = personal use script
  • client_secret = secret
  • user_agent = app name
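As an alternative to passing these credentials in code, PRAW can also read them from a praw.ini file. A minimal sketch, assuming a file in your working directory (the [bot1] site name is arbitrary; the key names are the ones PRAW expects):

```ini
[bot1]
client_id=fake-id-DKKEKYV1YXHOPQ
client_secret=fake-secret-HNHoqbkl4LDbRMmoQ
user_agent=api-test
```

You would then create the instance with reddit = praw.Reddit("bot1"), keeping secrets out of your script.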

Step 3: Load the required libraries and create a Reddit instance

import praw
import spacy
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

If you don’t have any of these libraries, install them with pip (space-separated, not comma-separated):

pip install praw spacy pandas matplotlib wordcloud

Create the Reddit object with your credentials from Step 2 using the PRAW wrapper.

reddit = praw.Reddit(client_id="fake-id-DKKEKYV1YXHOPQ",
                     client_secret="fake-secret-HNHoqbkl4LDbRMmoQ",
                     user_agent="api-test",
                     username="your_username",
                     password="your_password")
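Hardcoding credentials in the script works for a quick test, but it is safer to read them from environment variables. A minimal sketch; the variable names (REDDIT_CLIENT_ID and so on) are my own convention, not anything PRAW requires:

```python
import os

def load_reddit_credentials(env=os.environ):
    """Collect Reddit API credentials from environment variables.

    Returns a dict whose keys match the praw.Reddit keyword arguments,
    so it can be passed as praw.Reddit(**load_reddit_credentials()).
    """
    return {
        "client_id": env.get("REDDIT_CLIENT_ID"),
        "client_secret": env.get("REDDIT_CLIENT_SECRET"),
        # Fall back to a default app name if the variable is unset.
        "user_agent": env.get("REDDIT_USER_AGENT", "api-test"),
    }
```

Set the variables in your shell before running, and no secret ever lands in version control.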

Step 4: Get submissions and their comments from all subreddits (the search limit is 1000)

sub_reddit = reddit.subreddit("all")

In the for loop, we search subreddits for the word “salvador”. If a submission title also contains any of the bitcoin-related keywords and the submission has at least one comment, we append one row per comment to an array: the submission title, author, and number of comments, plus each comment’s author and text. The results are sorted by newest, but you can also sort by hot, top, random, etc. Finally, we save the array in a pandas data frame.

keywords = ['bitcoin', 'btc', 'chivo']

# Create an empty array
data = []

# Search for the keyword "salvador"
for submission in sub_reddit.search("salvador", limit=1000, sort="new"):
    # Look for any word in the keywords array (match in lowercase)
    if any(keyword in submission.title.lower() for keyword in keywords):
        # If it has at least 1 comment, append its data to the array
        if submission.num_comments > 0:
            # Resolve "load more comments" placeholders in place
            submission.comments.replace_more(limit=0)
            for comment in submission.comments.list():
                data.append([submission.title,
                             submission.author,
                             submission.num_comments,
                             comment.author,
                             comment.body])

# Save data to a dataframe
df = pd.DataFrame(data, columns=['title', 'author', 'num_comments',
                                 'comment_author', 'comment_text'])
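Reddit replaces the body of deleted or removed comments with the markers “[deleted]” and “[removed]”, and those strings would pollute the word cloud. A small sketch of filtering them out before the NLP step, using a toy data frame with hypothetical values in place of the scraped one:

```python
import pandas as pd

# Toy frame standing in for the scraped data (hypothetical comment texts).
df = pd.DataFrame(
    {"comment_text": ["Chivo wallet is live", "[deleted]", "[removed]", "BTC dip"]}
)

# Keep only rows whose text is not a deleted/removed placeholder.
mask = ~df["comment_text"].isin(["[deleted]", "[removed]"])
clean = df[mask].reset_index(drop=True)
print(len(clean))  # 2 rows survive
```

Running the same filter on the real data frame before Step 5 keeps the word cloud free of placeholder noise.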

Step 5: Clean and process the found text. First, install and load a default trained pipeline package from spaCy.

Install trained pipeline package

python -m spacy download en_core_web_md

Load the trained pipeline package and parse the extracted text with spaCy

nlp = spacy.load("en_core_web_md")
words = '\n'.join(df.comment_text)
text = nlp(words)

Clean the data by keeping only adjectives, nouns, and proper nouns, all in lowercase

new_text = ""
for word in text:
    if word.pos_ in ['ADJ', 'NOUN', 'PROPN']:
        new_text = " ".join((new_text, word.text.lower()))
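Joining onto the accumulated string inside the loop copies the whole string on every iteration, which gets slow for large corpora. A sketch of the same filtering that collects tokens in a list and joins once at the end; to stay runnable without a spaCy model installed, it takes (text, pos) pairs as hypothetical stand-ins for spaCy’s Token.text and Token.pos_:

```python
def keep_content_words(tagged_tokens, keep=("ADJ", "NOUN", "PROPN")):
    """Return a lowercase string of content words from (text, pos) pairs.

    Mirrors the spaCy loop above, but builds a list and joins once,
    which is linear in the total text length.
    """
    parts = [text.lower() for text, pos in tagged_tokens if pos in keep]
    return " ".join(parts)
```

With a real spaCy doc you would call it as keep_content_words((t.text, t.pos_) for t in text).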

Step 6: Finally, create and plot the word cloud

wordcloud = WordCloud(stopwords=STOPWORDS, max_words=100, background_color='white', width=800, height=600).generate(new_text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Output

Now try it yourself by playing around with different topics and values.
