CiceroLite: Not Quite an Orator but a Processor of Language

Natalie Richoux
DH Tools for Beginners
Nov 17, 2015

As a current M.A. candidate at Virginia Tech conducting research in linguistics, I have been exploring ways to text mine ephemeral websites such as 4chan. Prior research by Andrew Kulak offers an introduction to text mining through tools such as Project Gutenberg, Voyant, and primarily CATMA. I take a different approach to text mining: my goal is to conduct natural language processing and analysis on my text mining samples.

Natural Language Processing

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. NLP has been a field of study since the 1950s. (Although the exact year is disputed, the field’s origins are traced to when calculators became a factor and scientists and researchers began questioning whether technology could be used to translate natural languages.) In later years, the goal of NLP has been “the computerized approach to analyzing text that is based on both a set of theories and a set of technologies.” (For more information, you can read Elizabeth Liddy.) NLP matters to linguists because it lets them study how people use language in digital spaces and process that information into a form that is understandable and usable for a specific research purpose. However, NLP has been complicated by social media sites with an ephemeral existence, such as 4chan.

Introduction to 4chan

The homepage of 4chan

A quick introduction to 4chan is necessary to understand how the site complicates text mining and research based on collections from the website. 4chan is a simple image-based bulletin board where any internet user can post comments and share images. A variety of boards let users explore and post images on interests such as video games, sports, Japanese anime, photography, and politics. Users do not need to register an account before participating in the 4chan community and can post on message boards anonymously. 4chan thrives on this anonymity: its users interact in an anonymous and ephemeral environment that facilitates rapid generation of new trends, both positive and negative. However, the ephemeral nature of 4chan has proven to be an obstacle for linguists conducting research, because a link to a particular message board or post will stop working in a matter of days or even hours. (The longevity of posts depends on the popularity of the message board; with a ten-page limit for each board, more posts mean each post spends less time on 4chan.) The challenge has been finding a way to text mine 4chan for research purposes. Enter CiceroLite.
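The relationship between a board's popularity and how long a post survives can be sketched with some simple arithmetic. This is only a rough model: the ten-page limit comes from the description above, while the number of posts per page and the posting rates are hypothetical figures chosen for illustration, and the model assumes that each new post simply pushes the oldest one off the board.

```python
def estimated_lifetime_hours(pages, posts_per_page, new_posts_per_hour):
    """Rough time a post survives before newer posts push it off the board."""
    if new_posts_per_hour <= 0:
        raise ValueError("new_posts_per_hour must be positive")
    # Total number of posts the board can hold at once.
    capacity = pages * posts_per_page
    # Once `capacity` newer posts have arrived, the post falls off the last page.
    return capacity / new_posts_per_hour


# Hypothetical figures: 10 pages (from the text), 15 posts per page (assumed).
quiet_board = estimated_lifetime_hours(10, 15, new_posts_per_hour=5)
busy_board = estimated_lifetime_hours(10, 15, new_posts_per_hour=150)
print(quiet_board, busy_board)
```

Under these assumed numbers, the same post lasts thirty times longer on the quiet board than on the busy one, which is why a link collected in the morning may be gone by the afternoon.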

Language Computers: CiceroLite

The Language Computers homepage, which gives a glimpse of what the company offers and its purposes

CiceroLite is natural language processing software offered by Language Computers, a software company that assists researchers “with intelligent, semantically-informed search and discovery software tools which unlock value by actually understanding the information stored in any large collections of text.” The goal of Language Computers is to make sense of unstructured texts, like those found on 4chan, and CiceroLite was developed as a tool that researchers can use for that purpose.

The CiceroLite homepage

CiceroLite is a natural language processing tool that helps researchers collect and study large amounts of text. CiceroLite’s “entity extraction systems provide unsurpassed performance in terms of precision, recall, and processing speed. Designed to provide state-of-the-art performance for large entity type hierarchies, CiceroLite’s robust machine learning framework enables it to be extensible to new languages of interest quickly and easily, given sources of training data.” CiceroLite can process English, Arabic, and Mandarin Chinese. So, what does processing mean?

Processing Texts Using CiceroLite

The CiceroLite window where a user can begin uploading texts or webpages for processing

CiceroLite can be launched in a variety of web browsers by going to the CiceroLite homepage and launching the application. CiceroLite does not require a download; it is a web-based software platform hosted by Language Computers. Once the application is launched, the Cicero server window appears, and a user can begin processing texts. There are two ways a user can process a text. The first option is to upload a text file to the server for processing. Processing directly from a text file is the fastest option and takes anywhere from 30 seconds to two minutes, depending on the size of the file. The second option is to process directly from a webpage, which is why CiceroLite is useful for websites such as 4chan. Processing from a website is more complex than processing from a text file and, depending on the coding and complexity of the website, can take anywhere from two to five minutes. However, at the time of writing, websites such as Facebook and Twitter could not yet be configured because of their changing nature and their use of non-specific web addresses.
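Because a 4chan page may disappear before it can be processed, one option is to save its text to a file first and then use the text-file upload route. The sketch below is not part of CiceroLite; it is a minimal illustration, using only Python's standard library, of how the visible text of a saved webpage could be pulled out of its HTML. The class name `TextExtractor`, the function `html_to_text`, and the sample markup are all made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up markup standing in for a saved board page.
sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>First post</p><script>var x = 1;</script><p>Second post</p>"
    "</body></html>"
)
print(html_to_text(sample))
```

The resulting string could then be written to a `.txt` file with Python's built-in `open` and uploaded like any other text file. Fetching the page itself (for example with `urllib.request.urlopen`) is left out so the example runs without a network connection.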

Processing 4chan

After selecting a text to process (and for the purposes of this tutorial, I am using 4chan), you are directed to a page with new information on it that looks like this:

The information displayed after your text has been fully configured and processed

After successful processing, a user can begin to analyze the processed text. The first area to notice is the entities box in the top left-hand corner. Within this box, a user can see which entities exist within the text and how many of each there are (the categories being person, location, organization, contact information, date-time, numeric, or other). To the right is the entities key, a legend that explains the color coding within the text: each color in the legend corresponds to entities of that kind.
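The per-category counts shown in the entities box amount to a tally over labeled entities. The following is a minimal sketch of that idea, not CiceroLite's own code or output format: the category names come from the list above, while the `(text, category)` pair representation, the function name `count_entities`, and the sample entities are assumptions made for illustration.

```python
from collections import Counter

# Category names as listed in the entities box.
CATEGORIES = [
    "person",
    "location",
    "organization",
    "contact information",
    "date-time",
    "numeric",
    "other",
]


def count_entities(entities):
    """Tally (text, category) pairs; unknown labels are counted as 'other'."""
    counts = Counter()
    for _, category in entities:
        counts[category if category in CATEGORIES else "other"] += 1
    # Report every category, including those with zero entities.
    return {category: counts[category] for category in CATEGORIES}


# Hypothetical entities, standing in for what a processed text might contain.
sample_entities = [
    ("weeks", "date-time"),
    ("seconds", "date-time"),
    ("Virginia Tech", "organization"),
    ("7", "numeric"),
]
entity_counts = count_entities(sample_entities)
print(entity_counts)
```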

Furthermore, you can hover your mouse over a colored entity within the text, and an information box will pop up with more information about that particular entity.

The next area to notice is the entities key in the top right-hand corner and the colored words within the text window in the middle. The entities key tells a user which words belong to which entity type (in the example text window, the words ‘weeks’ and ‘seconds’ appear in green, matching the key’s color for date and time). This feature allows a researcher to begin looking for patterns within a text and making connections among those patterns.
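To make the idea of marking date-time words concrete, here is a toy tagger. It is far simpler than the machine-learning entity extraction CiceroLite describes and is not how the tool works internally: it only matches a short, hand-picked list of time units with a regular expression and wraps each match in a bracketed label instead of a color. The names `TIME_UNITS` and `tag_date_time` and the sample sentence are invented for this sketch.

```python
import re

# A hand-picked list of time units; a real system would cover far more.
TIME_UNITS = r"\b(?:second|minute|hour|day|week|month|year)s?\b"


def tag_date_time(text):
    """Wrap each time-unit word in a [date-time: ...] label."""
    return re.sub(
        TIME_UNITS,
        lambda match: f"[date-time: {match.group(0)}]",
        text,
        flags=re.IGNORECASE,
    )


tagged = tag_date_time("The thread lasted two weeks, then vanished in seconds.")
print(tagged)
```

Even this crude version shows why labeled spans are useful: once the words are marked, a researcher can search, count, or compare them across texts.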

The final area to notice is the box in the bottom right-hand corner labelled ‘details.’ Within this box is some valuable information such as language, document type, and title. Most useful of all, CiceroLite can process a text and inform a researcher of the primary metadata for that text file and what percentage of the text falls under that form of metadata. (Metadata is associative data that makes your text more searchable and accessible to search engines; see the Dublin Core for more information.)

Downfalls of CiceroLite

While CiceroLite has proven to be a valuable tool for natural language processing, it does have its shortcomings. The main shortcoming is that, because the tool is hosted online, there were times it did not work. When the Cicero server was unable to communicate, it could not configure or process the text, and this was the case for both text files and webpages. While more often than not I was able to process a text, the servers were temporarily down on multiple occasions, so this was not an isolated incident.
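For any web-hosted tool, one common way to cope with temporary outages is to retry a failed request a few times, waiting a little longer between attempts. The sketch below shows that general pattern only. It does not call CiceroLite, which is used through the browser in this tutorial: the function `process_with_retry` is hypothetical, and `flaky_process` is a stand-in that simulates a server failing twice before responding.

```python
import time


def process_with_retry(process, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call process(), retrying on ConnectionError with a growing delay."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return process()
        except ConnectionError as error:
            last_error = error
            if attempt < attempts:
                # Wait longer after each failed attempt.
                sleep(delay_seconds * attempt)
    raise RuntimeError(f"server unavailable after {attempts} attempts") from last_error


# Stand-in for a request to a server that is down for the first two tries.
calls = {"count": 0}


def flaky_process():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server down")
    return "processed"


# The sleep function is replaced so the demonstration does not actually wait.
result = process_with_retry(flaky_process, sleep=lambda seconds: None)
print(result)
```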

What a user sees when CiceroLite is down.

CiceroLite as a Digital Humanities Tool

CiceroLite’s processing capability has proven useful for any researcher attempting to understand the complexities of natural language and to process texts so that they become more manageable and understandable. The tool lets a researcher begin identifying patterns within a text, make connections among texts by the same author, and understand differences among authors of the same time period. Together, these abilities allow a digital humanist to annotate and break down large amounts of text in a shorter amount of time.
