A little Linguistic Analysis: Python
A few months ago, I got a job interview and… I was not able to do the challenge they gave me. Now, I have pulled it out of my files and reexamined it.
I am working with Python 3.5, in the PyCharm IDE on a Mac running Sierra.
The challenge goes like this:
- The program will be called from the command line
- When it is called, a text file path will be given as an argument and so will an integer.
- After the program starts running, the user will input two words
- The program will then calculate the cooccurrence of the two words within the range of the aforementioned integer.
Cooccurrence in linguistics is the percent of times that a primary word is in proximity to a secondary word.
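This type of analysis is important for many reasons. A change in cooccurrence can signal a change in diction or word usage (either through time or space). Differences in cooccurrence can also give clues about the source: gender, first language, and origin. A useful way to look at cooccurrence is as a probability: the number of times the primary word appears within k words of the secondary word, divided by the total number of times the primary word appears in the text.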
The first thing I do is read in the text file path and the integer k.
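Here is a minimal sketch of that step using argparse from the standard library. The function name parse_args and the argument names path and k are my own choices for illustration, not part of the original challenge.

```python
import argparse


def parse_args(argv=None):
    """Read the text file path and the window size k from the command line."""
    parser = argparse.ArgumentParser(
        description="Cooccurrence of two words within k tokens of each other."
    )
    # Both argument names are illustrative assumptions.
    parser.add_argument("path", help="path to the text file to analyse")
    parser.add_argument("k", type=int, help="window size, in tokens")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


# Passing a list instead of relying on sys.argv makes this easy to try out.
args = parse_args(["cat_in_the_hat.txt", "3"])
```

Calling parse_args() with no arguments inside a script picks up whatever was typed on the command line.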
Then I clean the text (strip out the punctuation) and split it into words. This makes each word a “token”.
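One possible way to do the cleaning, as a sketch: lowercase everything, delete the ASCII punctuation, and split on whitespace. The function name tokenize is hypothetical, and lowercasing is my assumption so that "Cat" and "cat" count as the same word.

```python
import string


def tokenize(text):
    """Lowercase the text, delete ASCII punctuation, and split on whitespace."""
    # string.punctuation only covers ASCII characters; curly quotes and
    # other typographic marks would need to be handled separately.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


tokens = tokenize("The Cat, in the Hat! Up-up-up.")
# tokens == ["the", "cat", "in", "the", "hat", "upupup"]
```

Because punctuation is deleted rather than replaced with a space, hyphenated words collapse into a single token.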
Afterwards, I create a function to calculate the probability of cooccurrence.
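A sketch of what such a function could look like. I am assuming the window is symmetric (k tokens before and k tokens after each occurrence of the primary word) and that each occurrence of the primary word counts at most once. The name cooccurrence_probability is made up for this example.

```python
def cooccurrence_probability(tokens, primary, secondary, k):
    """Fraction of occurrences of `primary` with `secondary` within k tokens."""
    positions = [i for i, tok in enumerate(tokens) if tok == primary]
    if not positions:
        return 0.0  # the primary word never appears
    hits = 0
    for i in positions:
        # k tokens to the left and k to the right, excluding position i itself.
        window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        if secondary in window:
            hits += 1
    return hits / len(positions)


tokens = "the cat in the hat sat on the hat".split()
prob = cooccurrence_probability(tokens, "hat", "cat", 3)
# "hat" appears twice and only the first one is within 3 tokens of "cat",
# so prob == 0.5
```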
Then I put in prompts for the user to input the primary and secondary words:
The entire thing looks like this:
I am using “Cat in the Hat” as a text source. With primary => hat and secondary => cat, hat shows up within k=3 of cat 10 times. The count of hat in the entire text is 14. This means that the probability of cooccurrence is 10/14, or about 71.4%.
That was all she wrote for the challenge. I found it difficult the first time around because, at that point, I had used Python exclusively for Computational Physics so I was not familiar with working on text. “What are words?”
I have been looking into linguistics using Python as well as visualizing data. So I want to apply a little of what I know.
Let’s look at word count. I want a dictionary that houses each unique word in the text as the key, and the value will be how many times the word shows up in the text.
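A short sketch of that dictionary, using collections.Counter from the standard library. The helper name word_counts and the sample sentence are only for illustration.

```python
from collections import Counter


def word_counts(tokens):
    """Map each unique token to the number of times it appears."""
    return dict(Counter(tokens))


counts = word_counts("the cat in the hat".split())
# counts == {"the": 2, "cat": 1, "in": 1, "hat": 1}
unique_words = len(counts)  # number of unique words: 4
```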
When I print the length of the resulting dictionary, there are 236 unique words in the text, so I will NOT put it all here. The word “a” has 33 occurrences, and “upupup” has only 1.
This kind of analysis can be considered a “raw frequency”. We can look at a log frequency next.
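As a sketch, here is one common definition of log frequency, log(1 + raw count). This particular formula is an assumption on my part; other variants exist. The function name is hypothetical.

```python
import math


def log_frequency(counts):
    """Log-scaled frequency, log(1 + count), for each word."""
    # The "+ 1" keeps a word that appears once from getting log(1) = 0.
    return {word: math.log(1 + count) for word, count in counts.items()}


freqs = log_frequency({"a": 33, "upupup": 1})
# freqs["a"] is log(34) and freqs["upupup"] is log(2)
```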
We can also go in and add a function to calculate augmented frequency. Augmented frequency is used when trying to evaluate the trends of a large number of different texts and make predictions. This keeps the data from longer texts from overwhelming the data of shorter ones.
The augmented frequency can be found by dividing the raw frequency by the count of the most frequent word.
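A sketch of that calculation, exactly as described: raw count divided by the largest count. Some references define augmented frequency as 0.5 + 0.5 times this ratio instead; I am sticking with the plain ratio here. The function name is made up.

```python
def augmented_frequency(counts):
    """Raw count divided by the count of the most frequent word."""
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}


aug = augmented_frequency({"the": 4, "cat": 2, "hat": 1})
# aug == {"the": 1.0, "cat": 0.5, "hat": 0.25}
```

The most frequent word always scores 1.0, no matter how long the text is.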
At this point, I would love to calculate the inverse document frequency. According to Wikipedia, this measures whether a word is common or rare across a set of documents. It is usually used with a large set of documents. I plan on expanding the program later, but for now we will stick with one document. The output won’t be so sexy though.
- N => the total number of documents
- denominator => the number of documents in which the word appears
Nothing informative, because with a single document everything just ends up being log(1), which is 0.
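To make that concrete, here is a sketch of the standard formula, log(N / number of documents containing the word). Each document is represented as a set of its tokens; that representation and the function name are my assumptions.

```python
import math


def inverse_document_frequency(word, documents):
    """log(N / n), where n is the number of documents containing `word`."""
    n_docs = len(documents)
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        raise ValueError("word does not appear in any document")
    return math.log(n_docs / containing)


one_doc = [{"cat", "hat"}]
idf_single = inverse_document_frequency("cat", one_doc)
# With one document, N = 1 and n = 1, so idf_single == log(1) == 0.0

two_docs = [{"cat", "hat"}, {"fish", "hat"}]
idf_cat = inverse_document_frequency("cat", two_docs)  # log(2 / 1)
idf_hat = inverse_document_frequency("hat", two_docs)  # log(2 / 2) == 0.0
```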
With all this information, we can make a stupid easy tf-idf; emphasis on the stupid.
First, I want to place all these types of frequencies into a pandas dataframe so that I can keep everything together.
I made the unique words the index and then assigned the values to the corresponding headers.
Then I calculated the tf-idf by multiplying the raw frequency and inverse document frequency columns.
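A rough sketch of what that dataframe could look like, on a toy sentence rather than the real text. The column names and the log(1 + count) definition are my assumptions. Note that with a single document the idf column is all zeros, so the tf-idf column is too.

```python
import math
from collections import Counter

import pandas as pd

tokens = "the cat in the hat".split()
counts = Counter(tokens)

# Unique words become the index; each frequency type gets its own column.
df = pd.Series(dict(counts), name="raw").to_frame()
df["log"] = df["raw"].apply(lambda c: math.log(1 + c))
df["augmented"] = df["raw"] / df["raw"].max()
df["idf"] = math.log(1 / 1)  # one document: N = 1 and the word is in it
df["tfidf"] = df["raw"] * df["idf"]

print(df)
```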
When I print out the description, I get:
The documentation asks you to pip install Plotly, but I used conda to install it. Then Plotly must be imported and authenticated using your username and API key.
I just followed the directions on the documentation to make a scatter plot and plotted the words vs their raw and log frequencies:
Vworri's interactive graph and data of "Raw Frequency, Log Frequency, Augmented Frequency" is a scatter chart, showing… (plot.ly)
Plotly allows for amazing interactivity. You can save the data and the graph online. The graph is above and the data can be found here:
'Untitled Plotly Dataset' is a Plotly Dataset created by vworri on Plotly. (plot.ly)
I just think it’s super cool. I have made the data and the graph public so, have at it.
This is a contour graph I made. The log frequencies are on the z axis.
Now, I want to make a matrix of cooccurrence probabilities of each unique word in the text compared with all other words. Because of the size of the list of unique words, it will be very expensive. It takes a while to return and the matrix will be very big!
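A sketch of how such a matrix could be built, here as a nested dictionary. The function names are hypothetical stand-ins and the symmetric window is my assumption. Every pair of unique words triggers a scan of the text, which is why it gets expensive: 236 unique words means 236 × 236 = 55,696 pairs.

```python
def cooccurrence_probability(tokens, primary, secondary, k):
    """Fraction of occurrences of `primary` with `secondary` within k tokens."""
    positions = [i for i, tok in enumerate(tokens) if tok == primary]
    if not positions:
        return 0.0
    hits = sum(
        1 for i in positions
        if secondary in tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
    )
    return hits / len(positions)


def cooccurrence_matrix(tokens, k):
    """matrix[primary][secondary] for every pair of unique words."""
    vocab = sorted(set(tokens))
    matrix = {
        p: {s: cooccurrence_probability(tokens, p, s, k) for s in vocab}
        for p in vocab
    }
    return vocab, matrix


vocab, matrix = cooccurrence_matrix("the cat in the hat".split(), k=1)
# matrix["cat"]["the"] == 1.0  ("cat" always has "the" next to it)
# matrix["the"]["cat"] == 0.5  (only one of the two "the"s is next to "cat")
```

The matrix is not symmetric, because each row is normalised by the count of its own primary word.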
Now, in the original challenge, the k value (the range in which we are looking for cooccurrence) will be between 1 and 5. I will make five contour plots, one for each value of k.
Because of the sheer number of data points, I could not make subplots. Below are the maps for each k value from 1 to 5 (in order). I had to start the range from 1 instead of 0 because I was using it to index the titles.
I also went into the createwordmatrix() and findcooc() functions and added a k argument that defaults to the k value from the command line.
This was really slow (I watched YouTube while waiting) because I am running this locally on my computer.
As you can see, the larger the range, the higher the probability of finding another unique word. Super intuitive.
This is the final product.
The next steps will be to expand the code to work with multiple text sources. But that is another post. Please message me if you have any ideas on cool ways of expanding this project.