Introduction to natural language processing with Cortext

Tutorial 06 in a series on Controversy Mapping

Ethnographic Machines
7 min read · Feb 20, 2019

In this tutorial, we will go through the steps of uploading a dataset to Cortext, extracting significant noun phrases, and mapping these noun phrases as networks based on their co-occurrence in documents. Cortext is a web-based platform for natural language processing: computationally reading and analyzing text. It is useful when we do not know in advance which words to search for, but want to discover, bottom-up, what kind of language a corpus of documents is using.

We will use a dataset of full-text content from our 173 circumcision-related articles on Wikipedia, produced with this Python script. Before we start, you will need to run that script and set up a Cortext account. Go to the Cortext Manager to do so.

Preparing and uploading your dataset

The easiest way to get text into Cortext is in a spreadsheet format. Cortext will automatically treat each row as a separate document. In the example below I use the CSV export from the script above. Each row is a Wikipedia article from the ‘Circumcision’ category. Column ‘B’ contains the full text of the article. The other columns contain various kinds of metadata which will be useful for analysis at a later stage.
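
If you want to sanity-check the file before uploading, a quick look in pandas confirms that there is one row per article and which column holds the full text. This is only a sketch: the file name is hypothetical, and apart from ‘Page_text’ (used later in this tutorial) the column names will depend on your own export.

```python
import pandas as pd

# Load the CSV produced by the scraping script (file name is hypothetical).
df = pd.read_csv("circumcision_articles.csv")

# Each row should be one Wikipedia article; one column holds the full text.
print(len(df), "documents")                   # expect 173 rows
print(df.columns.tolist())                    # e.g. ['Page_title', 'Page_text', ...]
print(df["Page_text"].str.len().describe())   # sanity-check text lengths
```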

After dragging and dropping the file into Cortext, I can click “Accept & Upload”.

Cortext will ask you how to parse the uploaded dataset. Choose “csv” from the “Corpus Format” dropdown and tick “standard csv separated by ; and minimal quoting” (in other cases you might have to use “default csv” or “tab separated” depending on how your csv file is separated). Leave the other settings in their default positions and click “start script”.
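
If your export does not match any of the parser options, you can rewrite it yourself before uploading. A minimal sketch with pandas, using hypothetical file names, produces the ‘standard csv separated by ; and minimal quoting’ format:

```python
import csv
import pandas as pd

df = pd.read_csv("circumcision_articles.csv")  # hypothetical input file

# Re-export with ';' as the separator and minimal quoting, matching the
# "standard csv separated by ; and minimal quoting" option in Cortext.
df.to_csv(
    "circumcision_articles_semicolon.csv",
    sep=";",
    index=False,
    quoting=csv.QUOTE_MINIMAL,  # quote only fields containing ';', quotes or newlines
)
```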

The .csv file will now be indexed as a database (.DB) in Cortext. If all goes well, you should see a green flag that says ‘finished’ next to the indexed database.

Terms extraction

We can now proceed to extract noun phrases from the full-text articles in the dataset. To do so, launch the ‘Terms extraction’ script. You should see the column headers from the original spreadsheet listed as textual fields. Select ‘Page_text’ (or whatever your text column is called) to tell Cortext which field contains the text you want to extract terms from. Furthermore:

  • Leave ‘Minimum frequency’ at 3. This sets a minimum threshold for terms to be included (namely terms that are found at least 3 times in the dataset). In very large datasets you may try to increase the minimum frequency further.
  • Leave ‘List length’ at 100. This determines how many terms the algorithm will extract. After applying the minimum frequency criterion, it will select the 100 terms with the highest specificity score. High specificity means that a term has a very uneven distribution across the documents in the corpus.
  • Leave ‘Maximal length’ at 3. This simply means that terms cannot be longer than 3 words (trigrams). You may at a later point decide to go back and increase the length to 4 or 5 if you find that many longer phrases are being cut off into 3-word multi-terms.
  • Finally, select ‘yes’ for ‘Lexical extraction advanced settings’. This opens a sub-menu where you can switch the frequency computation from ‘sentence level’ to ‘document level’. Do so. This ensures that the minimum frequency of 3 counts documents rather than sentences, i.e. a term must occur in at least 3 different documents. A rough sketch of what these settings amount to is given after this list.
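
Cortext’s own extraction pipeline identifies noun phrases via part-of-speech tagging, which is not reproduced here. The minimal sketch below only illustrates what ‘Minimum frequency’ (counted at document level) and ‘Maximal length’ control: candidate terms of up to 3 words are kept only if they appear in at least 3 documents.

```python
import re
from collections import Counter

def candidate_terms(tokens, max_len=3):
    """All 1- to max_len-word sequences in a token list ('Maximal length')."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i : i + n])

def document_frequencies(documents, max_len=3):
    """Count, for each candidate term, the number of documents it occurs in."""
    doc_freq = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        doc_freq.update(set(candidate_terms(tokens, max_len)))  # each document counts once
    return doc_freq

# documents = df["Page_text"].tolist()
# 'Minimum frequency' = 3 at document level: keep terms found in >= 3 documents.
# keep = {t: n for t, n in document_frequencies(documents).items() if n >= 3}
```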

When you are done with the settings, run the script. Again, you should see a green flag when it is finished. You can click the eye icon next to the results file to see the list of extracted terms.

  • The ‘Forms’ column shows you the different versions of a noun phrase that Cortext has decided to consider as one term.
  • The ‘Stem’ column shows you the root of the noun phrase that will be used to recognize different versions of it.
  • The ‘Main form’ column shows you how the noun phrase will be labeled on your maps.
  • The ‘Occurrences’ column shows you how many times a term is found in the dataset. Remember that you asked the script to count occurrences on the document level. The occurrence count, therefore, reflects how many documents a term is found in. You can sort the column by clicking on the header.
  • The ‘Cooccurrences’ column shows you how many times a term is found together with other terms. Cooccurrence is defined as being found in the same document. You can sort the column by clicking on the header.
  • The ‘Specificity chi2’ column shows you the specificity score for each term. The specificity score is a statistical measure of how a term is distributed across the documents in the dataset. If a term is very evenly distributed (i.e. it occurs with roughly the same frequency in almost every document), it gets a low specificity score. If a term is very unevenly distributed (i.e. it occurs frequently in a particular subset of documents but not at all in others), it gets a high specificity score. In the latter case, we say that the term has a biased distribution. Ultimately, Cortext is geared towards selecting these biased terms as important: the assumption is that we already know the theme of our dataset, and thus its generic language, and that the point of semantic analysis is to discover discursive differences below this generic level. If we were to run the ‘Terms extraction’ script again but expand the list of extracted terms to 200 or 300, the expansion would follow the ranking of terms by specificity: rather than the top 100 most specific terms in the dataset, we would get the top 200 or 300. You can sort the column by clicking on the header. A toy version of this kind of calculation is sketched below.
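
Cortext does not spell out its exact chi2 formula in the interface, so take the following only as an illustration of the idea: compare each term’s observed count per document with the count you would expect if the term were spread proportionally to document length.

```python
def chi2_specificity(term_counts, doc_lengths):
    """
    Toy chi-squared statistic for how unevenly a term is spread over documents.

    term_counts -- occurrences of the term in each document
    doc_lengths -- total number of tokens in each document
    """
    total_term = sum(term_counts)
    total_tokens = sum(doc_lengths)
    chi2 = 0.0
    for observed, length in zip(term_counts, doc_lengths):
        expected = total_term * length / total_tokens  # proportional share
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# An evenly spread term scores low, a concentrated term scores high:
even = chi2_specificity([2, 2, 2, 2], [1000, 1000, 1000, 1000])    # 0.0
biased = chi2_specificity([8, 0, 0, 0], [1000, 1000, 1000, 1000])  # 24.0
```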

Map co-occurrence networks

When we have a list of extracted terms, we can map how they occur together in our documents. To do this, we have to produce a co-occurrence network. Launch the “Map Heterogeneous Networks” script and select ‘Terms’ in both ‘First Field’ and ‘Second Field’ under ‘Nodes Selection’. This will produce a network of terms connected to terms through co-occurrence in the same documents. Set the ‘Number of nodes’ to 100, since you have only extracted 100 terms. Run the script.
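
Conceptually, the network Cortext builds here can be approximated in a few lines of networkx (a sketch, not Cortext’s implementation): every pair of terms appearing in the same document gets an edge, and the edge weight counts the number of documents in which the pair co-occurs.

```python
from itertools import combinations
import networkx as nx

def cooccurrence_network(doc_terms):
    """
    Build a term-term network where edge weight = number of documents
    in which the two terms appear together.

    doc_terms -- iterable of sets of terms, one set per document
    """
    G = nx.Graph()
    for terms in doc_terms:
        for a, b in combinations(sorted(terms), 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)
    return G

# Example with made-up term sets, one per document:
# G = cooccurrence_network([{"male circumcision", "foreskin"},
#                           {"female genital mutilation", "foreskin"}])
```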

When the script is finished you can click the eye icon in the “Map explorer” to navigate and interact with the resulting network.

You will notice that the network is very well clustered. This is partly because Cortext has already selected terms with high specificity (as a function of their biased distribution in the set, they co-occur much more with some terms than with others), but also because the network mapping imposes an edge filter that ‘punishes’ weak edges between otherwise well-connected nodes. Every edge between two terms has a weight determined by the number of documents in which the two terms co-occur. This weight is normalized by the degree to which both terms tend to co-occur with all other terms in the dataset (their global co-occurrence). If two terms are in general very likely to co-occur, the edge between them therefore has to be relatively heavier for it to be included in the network. To explore what this means, you can run the network mapping script again, but this time, under “Edges”, select “no” for “Automatically define the edges” and switch “Proximity Measure” from “distributional” to “raw”.
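
Cortext’s ‘distributional’ measure is more elaborate than this, but the principle of normalizing by global co-occurrence can be sketched on the network built above: divide each raw edge weight by the total co-occurrence (weighted degree) of its two endpoints, so that an edge between two heavily co-occurring terms only survives if it is disproportionately strong.

```python
import networkx as nx

def normalise_edges(G):
    """Rescale raw co-occurrence weights by the global co-occurrence of both
    terms (their weighted degree). An illustration of the principle only,
    not Cortext's actual 'distributional' proximity measure."""
    strength = dict(G.degree(weight="weight"))  # total co-occurrence per term
    H = nx.Graph()
    for a, b, data in G.edges(data=True):
        H.add_edge(a, b, weight=data["weight"] / (strength[a] * strength[b]))
    return H

# Keeping only the heaviest normalised edges removes weak links between
# otherwise well-connected terms, which is what produces the clear clusters.
```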

As you can see, the same network without edge filtering is considerably less clustered.

Eventually, you should try to extract more terms and map them as co-occurrence networks (with the distributional edge filtering on) to see when the gradual addition of less specific terms disrupts the clustered nature of the network. Below I try first with 300 terms and then 500 terms. Around 500 terms the clusters are no longer retained.

Co-occurrence: 300 terms
Co-occurrence: 500 terms

We can interpret these clusters as discourses. They effectively mean that a group of articles tends to use combinations of words that other articles are not using together. There is still a male / female distinction between the clusters, but because we extract noun phrases bottom-up from the text rather than searching for a predefined list of words, we can see a more nuanced landscape of sub-topics. In the visualization below we have exported the results from Cortext to Gephi and annotated the resulting network to show this structure:

Example of an annotated co-word network based on semantic analysis of full text from Wikipedia pages.
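
The export itself can be done from Cortext’s interface; if you instead build or post-process a network in Python, networkx can write a GEXF file that Gephi opens directly. A sketch with made-up example edges:

```python
import networkx as nx

# A toy co-occurrence network; in practice, use the one built earlier.
G = nx.Graph()
G.add_edge("male circumcision", "foreskin", weight=12)
G.add_edge("female genital mutilation", "foreskin", weight=3)

# Store weighted degree as a node attribute so it can drive node size in Gephi.
nx.set_node_attributes(G, dict(G.degree(weight="weight")), "weighted_degree")

# Gephi opens GEXF files directly for layout and annotation.
nx.write_gexf(G, "coword_network.gexf")
```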

