FROM THE ARCHIVES OF PRAGPUB MAGAZINE, APRIL 2016
Analyzing Cultural Domains with Python
by Dmitry Zinoviev
Dmitry walks you through how to construct a reasonably quick and not-so-dirty Python data exploration script that is powerful and flexible enough for serious social science research.
Cultural domain analysis (CDA) is the study of how people in groups think about lists of terms that somehow go together and how this thinking differs between groups.
Some people associate the term “candle” with “Christmas,” others with “hurricane” and “blackout,” and yet others with “self-injury.” Anthropologists, ethnographers, psychologists, and sociologists use CDA to understand semantic mindscapes of social, ethnic, religious, professional, and other groups.
Before personal computers and specialized CDA software became available, social scientists used to do CDA essentially by hand. But not anymore!
Thanks to the Anaconda Python distribution generously provided for free by Continuum Analytics, the “paper-and-pencil” CDA approach is forever obsolete. In this article, I am going to show you how to construct a reasonably quick and not-so-dirty Python script for analyzing cultural domains. In fact, the script is flexible enough to support real social science research.
Cultural domain analysis is a special case of data analysis. As such, it consists of the same three major steps:
- Getting data
- Building a data model
- Interpreting results
We’ll look at each of these.
Getting Started
Most of the Python modules that I need for CDA are already included in Anaconda: urllib.request for getting data off the Web; os, os.path, and pickle for caching data; NLTK for preprocessing data; NumPy and Pandas for numerical operations; and networkx for network modeling. However, one of the modules, community, must be installed separately (for network community detection). I import all modules at the beginning of the script.
import urllib.request
import os, os.path, pickle
import nltk
import pandas as pd, numpy as np
import networkx as nx
import community
These modules are so ubiquitous in a data analysis loop that I often import them even before I decide where to get the data and what exactly to do with it.
Getting and Caching Term Lists
A term is a unit of CDA. It can be a word, a word group, a word stem, or even an emoticon (emoji). CDA looks into similarities between terms that are shared by a reasonably homogeneous group of people. So I need to find a reasonably homogeneous group of people, a list of terms, and a way to assess their similarity.
My favorite source of semantic data is LiveJournal — a collection of individual and communal blogs with elements of a massive online social network. LiveJournal (LJ) has an open, easily accessible API, and it encourages using its data for research. True, LJ membership and activity peaked in the early 2000s, but the site still hosts some active blogging communities (such as the celebrity gossip blog Oh No They Didn’t!), and it serves rich layers of historical data.
LJ communities are formed by their members. The members — LJ users — have their personal blogs, profiles, friend lists, and interests (online identity markers). In fact, LJ treats communities and users as same-class citizens: communities, like individuals, have their profiles, interests, and even “friends.” The URLs of user/community profiles, interest lists, and friend lists have a regular and simple structure: if thegoodwife_cbs is the name of a community (I will actually use this one in the rest of the study), then thegoodwife_cbs.livejournal.com/profile is the community profile, www.livejournal.com/misc/fdata.bml?user=thegoodwife_cbs&comm=1 is the friend list, and www.livejournal.com/misc/interestdata.bml?user=thegoodwife_cbs is the interest list.
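Since all three URLs follow the same pattern, they can be generated from the account name alone. Here is a minimal sketch of that idea; the helper name lj_urls is mine and not part of any LiveJournal API:

def lj_urls(name, is_community=True):
    # Build the profile, friend-list, and interest-list URLs for an LJ
    # account; lj_urls is a hypothetical helper, shown only for clarity.
    misc = "http://www.livejournal.com/misc/"
    return {
        "profile": "http://" + name + ".livejournal.com/profile",
        "friends": misc + "fdata.bml?user=" + name
                   + ("&comm=1" if is_community else ""),
        "interests": misc + "interestdata.bml?user=" + name,
    }

lj_urls("thegoodwife_cbs")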
For the purpose of this mini-study, I define two terms to be similar from the perspective of an LJ community if they are consistently listed together on different interest lists of the community members. (Indeed, why would they be co-listed if they were not similar?) My first assignment is to obtain and process the community membership list, which looks like this:
# Note: Polite data miners cache on their end.
# Impolite ones get banned.
# Note: thegoodwife_cbs is a community account
P> emploding
P> poocat
...more members...
P< brooketiffany
P< harperjohnson
The first line of the document makes a very important point: if something needs to be downloaded only once, it should be downloaded only once. I will create a directory named cache and store all downloaded data in it. If I run CDA on the same data again, it will hopefully still be in the cache. (I assume that interest lists and community membership are reasonably stable.)
GROUP = "thegoodwife_cbs"
cache_d = "cache/" + GROUP + ".pickle"

if not os.path.isfile(cache_d):
    # BEGIN
    # prepare the data as variable "domain"
    # END
    if not os.path.isdir("cache"):
        os.mkdir("cache")
    with open(cache_d, "wb") as ofile:
        pickle.dump(domain, ofile)
else:
    with open(cache_d, "rb") as ifile:
        domain = pickle.load(ifile)
In this code fragment, I use the module pickle — a native Python data serializer. If the cache directory and file exist, the function pickle.load deserializes the data object. Otherwise, I create the directory and the file and call the function pickle.dump to serialize the data object domain and save it into the file. Note that the file must be opened in binary mode.
All other code explained in this section goes between the BEGIN and END comments of the previous code fragment.
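As an aside, the load-or-rebuild pattern is generic enough to be captured in a small helper function. The sketch below is only an illustration (the name load_or_build is hypothetical), not part of the script itself:

import os, os.path, pickle

def load_or_build(cache_path, build):
    # Return the cached object if it exists; otherwise build it once,
    # cache it, and return it. Files are opened in binary mode for pickle.
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as ifile:
            return pickle.load(ifile)
    data = build()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as ofile:
        pickle.dump(data, ofile)
    return data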
The rest of the community membership list is a two-column table. The first column represents some subtle aspects of membership (“P” for individual users, “C” for communities, “<” for “friends,” “>” for “friends-of”); the second column has user names. What a perfect case for a Pandas data frame, I exclaim, and call pandas.read_table to convert the table into a two-column DataFrame:
members_url = "http://www.livejournal.com/misc/fdata.bml?user=" \
              + GROUP + "&comm=1"
members = pd.read_table(members_url, sep=" ", comment="#",
                        names=("type", "uid"))
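The type codes also make it easy to slice the membership frame. For example, the following snippet (my illustration, assuming the codes shown above) keeps only personal accounts:

# Personal accounts have type codes "P>" or "P<"; communities start with "C".
people = members[members["type"].str.startswith("P")]
print(len(people), "personal accounts out of", len(members), "rows")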
My next task is to download all interest lists, convert interests into terms (no, they are not always the same), and combine all term lists into a term vector matrix. An interest list looks very similar to the membership list; even the note about caching is the same:
# Note: Polite data miners cache on their end.
# Impolite ones get banned.
# <intid> <intcount> <interest ...>
18576742 1 +5 sexterity
624552 7 a beautiful mess
18576716 1 any more hot chicks?
44870 28 seriously?
1638195 94 shiny!
...more interests...
This is still a table (showing interest ID, number of times the interest is declared system-wide, and the actual interest), but the number of columns differs, depending on how many words are in the interest description. Pandas data frames are poor parsers for irregular texts, and I will have to tackle them essentially by hand, using low-level Python tools.
If the user name is not found or the user has not declared any interests, the content is quite different:
! invalid user, or no interests
I certainly want to read the first line before I attempt any intelligent analysis.
In the next code fragment, I prepare a lemmatizer wnl (explained below) and a list of stop words stop (more about them later, too), set up an empty accumulator dfs, obtain an interest list for each unique community member, append it to the accumulator, and finally merge all lists into one DataFrame.
wnl = nltk.WordNetLemmatizer()
stop = set(nltk.corpus.stopwords.words("english")) | {"&"}
dfs = []

LJ_BASE = "http://www.livejournal.com/"

for uid in members.uid.unique():
    print(uid)
    uid_url = LJ_BASE + "misc/interestdata.bml?user=" + uid
    try:
        with urllib.request.urlopen(uid_url) as source:
            raw_interests = [line.decode().strip()
                             for line in source.readlines()]
    except:
        print("Could not open", uid_url)
        continue

    if raw_interests[0] == "! invalid user, or no interests":
        continue

    interests = [" ".join(wnl.lemmatize(w)
                          for w in nltk.wordpunct_tokenize(line)[2:]
                          if w not in stop)
                 for line in raw_interests
                 if line and line[0] != "#"]
    interests = set(interest for interest in interests if interest)
    dfs.append(pd.Series(index=interests, name=uid).fillna(1))

domain = pd.DataFrame().join(dfs, how="outer").fillna(0)
The try-except block opens an interest list URL and downloads the list as a Python list of strings. Each string is decoded and stripped of trailing white space. LJ interest lists are always in lowercase, but if they were not, a call to lower() would ensure that in the rest of the script I compare apples to apples. If the URL fails to open (for any reason beyond my control), my script does not crash but politely reports the failure and proceeds to the next user. The same thing happens when the user has no interests or does not exist at all.
An interest may consist of several words, and some or all of these words may actually be forms of other words (as in “chicks” — “chick”). Different researchers preach different ways of handling word forms. Some suggest leaving the forms alone and treating them as words in their own right. Some, like me, advocate lemmatizing or stemming: reducing a form to its lemma (the standard representation of the word) or to its stem (the smallest meaningful part of the word to which affixes can be attached). A lemmatizer would reduce “programmers” to “programmer,” and a stemmer, depending on its zeal, would yield “programm” or even “program.” Furthermore, almost everyone agrees that certain words (such as “a,” “the,” and “and”) should never be counted at all.
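To see the difference in practice, here is a small comparison of NLTK’s WordNet lemmatizer and Porter stemmer (the stemmer output may vary slightly between NLTK versions, so the comments are illustrative):

import nltk

nltk.download("wordnet")              # needed once for the lemmatizer

wnl = nltk.WordNetLemmatizer()
porter = nltk.PorterStemmer()
print(wnl.lemmatize("programmers"))   # lemma: "programmer"
print(porter.stem("programmers"))     # stem: typically "programm"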
The code makes clever use of NLTK — the Natural Language Toolkit. The toolkit has tools for word tokenization (splitting a string into a list of words: wordpunct_tokenize), stemming, lemmatization (wnl.lemmatize; I created the lemmatizer wnl earlier), and many more. It also has a list of stop words, which I extended to include “&”. Note that I converted the standard list of stop words to a set for faster lookup.
As a result of lemmatization, stemming, and stop word elimination, a term list may end up having duplicates (e.g., “the chicks” and “a chick” may both become “chick”). I convert each term list to a set, and sets, by definition, have no duplicates.
My goal is to produce a term vector model (TVM) — a table where rows are terms and columns are community members. In the Pandas language, this table is known as a DataFrame, and its columns are known as Series. The code converts each term list to a Series. The individual terms form the Series index, and the values are all 1s, like this:
shiny! 1
+5 sexterity 1
big damn hero 1
...more interests...
Name: twentyplanes, dtype: float64
Finally, I join all Series into a DataFrame. This operation involves the mystery of data alignment, whereby all participating Series are stretched vertically to have their row labels (terms!) aligned. Such stretching almost inevitably produces empty cells in the frame. I fill them with zeros: an empty cell in row A and column B signals that term A was not on list B. The variable domain fully represents my cultural domain — the domain of LiveJournal users interested in the CBS show The Good Wife.
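Data alignment is easier to see than to describe. The toy example below (with made-up user names and terms) mimics the join above on two tiny Series:

# Two made-up interest vectors; the names and terms are illustrative only.
user_a = pd.Series(1, index=["shiny!", "chick"], name="user_a")
user_b = pd.Series(1, index=["chick", "firefly"], name="user_b")

# The outer join aligns the row labels and leaves gaps where a term is
# missing from one of the lists; fillna(0) turns the gaps into zeros.
toy = pd.DataFrame().join([user_a, user_b], how="outer").fillna(0).sort_index()
# toy now has three rows (chick, firefly, shiny!) and two columns; the
# missing combinations (firefly/user_a, shiny!/user_b) become 0.0.
print(toy)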
Building the Term Network
The next CDA step requires that I build a network of terms: a graph where nodes represent terms and (weighted) edges represent their similarities.
I could include all 12,437 discovered terms in the graph, but some of them are mentioned only once or twice (which is expected, given Zipf’s law). Rather than wondering why the less frequently used words are in fact less frequently used, I remove all rows with fewer than 10 occurrences, but leave the option of changing the cut-off value MIN_SUPPORT in the future. At this point, I regret once again that Python does not have first-class constants, but I spell MIN_SUPPORT in all capital letters nonetheless. The data frame limited is a truncated version of domain: it has only 331 rows.
MIN_SUPPORT = 10
sums = domain.sum(axis=1)
limited = domain.ix[sums[sums >= MIN_SUPPORT].index]
There are many ways to define similarities between TVM rows (or columns). Some are very coarse; others are hard to calculate, especially for large domains. Pearson correlation (also known simply as correlation) is well understood and nicely supported by DataFrames. As I go for it, I transpose the matrix to get correlations between rows (rather than columns) and subtract an identity matrix to get rid of self-correlations.
words = limited.T.corr()
words -= np.eye(len(words))
The resulting square data frame words has both negative and positive elements, and I have to make another painful decision: which elements to convert to weighted graph edges? Network analysis, in general, allows edges with negative weights, but most algorithms are very unhappy about them. I will remove all edges that connect somewhat or strongly dissimilar nodes. The choice of the cut-off value SLICING controls the density of the resulting network: if the cut-off is too high, the network falls apart into tiny disjoint fragments; if it is too low, the network becomes one hairball with no analyzable structure.
SLICING = 0.25
words = words[words >= SLICING].fillna(0)
Here enters the module networkx — a toolbox for constructing, analyzing, exporting, importing, and visualizing complex networks. networkx has an elegant interface to Pandas that allows me to convert a data frame into a network in just one function call. The new network object, words_network, is born with no node labels, but I relabel it using the original data frame column labels (I could have used row labels as well).
words_network = nx.from_numpy_matrix(words.values)
nx.relabel_nodes(words_network, dict(enumerate(words.columns)),
copy=False)
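As a side note, newer networkx releases (2.0 and later) include nx.from_pandas_adjacency, which reads the adjacency frame directly and keeps the column labels as node names, so the relabeling step goes away; in networkx 3.x, from_numpy_matrix itself has been replaced by from_numpy_array. A sketch of the one-call route, assuming a recent networkx:

# Builds the same weighted term graph straight from the pandas
# adjacency frame, with the terms as node labels (networkx >= 2.0).
words_network = nx.from_pandas_adjacency(words)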
These are my first really interesting results, and I save them without hesitation into a GraphML file.
if not os.path.isdir("results"):
    os.mkdir("results")

with open("results/" + GROUP + ".graphml", "wb") as ofile:
    nx.write_graphml(words_network, ofile)
If I had no access to non-Anaconda modules, I would abandon Python at this point and switch to Pajek or Gephi for network analysis. But I have community.
Extracting and Naming Term Communities
community uses the Louvain algorithm to extract network communities: groups of nodes that have more connections among themselves (more than expected) than to the rest of the network. The discovered communities are represented as a partition: a dictionary with node labels as keys and integer community identifiers as values. The module also calculates the modularity of the partition. A network with a perfect community structure has a modularity of 1, and a hairball-style network has a modularity around 0. My modularity is not so bad:
partition = community.best_partition(words_network)
print("Modularity:", community.modularity(partition, words_network))
Modularity: 0.8724188314419324
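Before naming the clusters, it may be useful to see how large they are. One quick way (an illustrative aside, not part of the original script) is to count the community identifiers in the partition dictionary:

from collections import Counter

# The partition maps term -> community id; counting the ids gives the
# cluster sizes.
cluster_sizes = Counter(partition.values())
print(cluster_sizes.most_common(5))   # the five largest clusters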
The partition fully defines the word communities, or clusters. But the clusters are nameless. I could have explored them with my own eyes and come up with suitable names, but then I wouldn’t be a hardcore Python programmer. Instead, I put a cherry on the cake and write another seven lines of code that locate the five most frequently used terms per cluster. My hope is that these terms indeed describe the clusters’ content.
I create a little helper function describe_cluster (in fact, the only user-defined function in the project). The function takes a list of terms in one community, extracts the namesake rows from the original domain data frame, calculates their use frequencies, and returns the top HOW_MANY performers.
HOW_MANY = 5
def describe_cluster(x):
    # x is a frame; select the matching rows from "domain"
    rows = domain.ix[x.index]
    # Calculate row sums, sort them, get the last HOW_MANY
    sums = rows.sum(axis=1)
    top_N = sums.sort_values(ascending=False)[:HOW_MANY]
    # What labels do they have?
    return top_N.index.values
I convert the partition into a DataFrame, group the rows by their partition ID, and ask the helper to come up with a name. The last line of the code is ugly and very un-Pythonic (nobody calls print in a list comprehension!).
word_clusters = pd.DataFrame({"part_id": pd.Series(partition)})
results = word_clusters.groupby("part_id").apply(describe_cluster)
_ = [print(" — ", "; ".join(r.tolist())) for r in results]
But it works:
— criminal mind; ncis; cold case; without trace
— london; lady gaga; sherlock holmes; paris; muse
— tennis; tegan sara; pretender; u2; killer
— music; reading; movie; writing; book
— harry potter; jane austen; pride prejudice; colin firth; ...
— gilmore girl; lauren graham; kristen bell; oc; one tree hill
— veronica mar; firefly; glee; friend; buffy vampire slayer
— west wing; true blood; friday night light; mad men; ...
— grey ‘ anatomy; csi; brother sister; private practice; ...
— supernatural; vampire diary; prison break; jensen ackles; ...
— food; yoga
— slash; fan fiction; anime; johnny depp; manga
— good wife; white collar; josh charles; kalinda sharma; ...
— musical; ugly betty; theatre; singing; broadway
— battlestar galactica; x -file; mulder / scully; ...
— lord ring; beatles; disney; queen; monty python
— internet; twilight; romance; family; concert
— office; daily show; jon stewart; stephen colbert; life
— hugh laurie; house md; lisa edelstein; house m . d .; ...
— x -men; fantasy; science fiction
...another 50 descriptions...
Each line shows up to five of the most frequently mentioned terms per cluster (the terms are separated by semicolons). Some terms look strange and barely recognizable (“x -file”), but remember all the harsh transformations they had to go through! Some terms are clearly duplicates (“house md” and “house m . d .”), but this simply means that the transformations were not harsh enough.
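If the leftover duplicates bother you, the normalization can be pushed one step further by stripping punctuation and whitespace before terms are compared. The sketch below illustrates the idea and is not part of the original pipeline; the helper name squash is made up:

import re

def squash(term):
    # Keep only letters and digits, so "house md" and "house m . d ."
    # collapse to the same key, "housemd".
    return re.sub(r"[^a-z0-9]", "", term.lower())

squash("house md"), squash("house m . d .")   # both become "housemd"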
Interpreting the Results
The CDA results can be interpreted at two levels.
At the lower level, I conclude that all 331 terms selected for analysis are important for The Good Wife viewers — otherwise, they wouldn’t have been selected. Moreover, some of the terms go together better than others. I can see that “jensen ackles” has something to do with “supernatural” (in fact, he starred in the namesake series), and “vampire diary” is “supernatural,” too (because it is). “Music,” “reading,” and “writing” are all activities, and “anime” is like “manga.” I can make these mechanistic conclusions without having the slightest idea about the cultural domain.
At the higher level, I might be an ethnographer, anthropologist, psychologist, or sociologist interested in the mindscape of The Good Wife fans. In particular, I might want to compare their mindscape to the mindscapes of, say, Harry Potter or House M.D. fans. But since I am a humble computer scientist, I will entrust the higher-level interpretation to a domain expert.
About the Author
Dmitry Zinoviev is a professor of Computer Science at Suffolk University (Boston, MA). He received an M.S. in Physics from Moscow State University and a Ph.D. from Stony Brook University. He has been teaching introductory programming, user interface design, operating systems, computer graphics, distributed systems, simulation and modeling, complexity theory, and data science at Suffolk since 2001.
Professor Zinoviev’s research interests include network theory, social networks, digital humanities, and computer simulation and modeling. Check his presentations, publications, and other projects at NetworkScienceLab.org.
When you are ready to learn more about Python, network analysis, data science, and code reuse, check out Dmitry’s books.