When you want to know more about a subject, the first thing you do is google it or read newspaper articles. But this process can be very hard due to the sheer number of articles, platforms, and newspapers. Moreover, when mapping a controversy, it is very common to find many different actors talking about the same topic with different views and different words.
To get an overview of the situation without reading thousands of articles, we decided to create a protocol to analyze and compare how the vocabulary used to talk about a specific topic evolves over time across different sources. The aim of this project was, first, to verify the hypothesis of a large cultural gap between the worlds of newspapers, academia, and the web, and second, to check how external events can affect the evolution of the vocabulary over time.
We think this kind of lexical-evolution analysis is more effective if you know who is writing about the topic and keep in mind the point of view of that group of authors. That's why we adopted three different sources: Google Scholar, Google Search, and Lexis Nexis. We used each with a specific aim in mind: Google Scholar gives us the academics' point of view; Lexis Nexis is a large news aggregator that shows us how mass media (especially newspapers) talk about our topic; and Google Search lets us keep track of blogs and non-professional websites for an analysis of unofficial phenomena.
Our topic. For this project, our case study focused on the emergence and growth of the sharing economy around the world. After our initial research (www.sharelock.xyz), in which we analyzed the studies of the main experts (Rachel Botsman, Jeremiah Owyang, Neal Gorenflo), we defined four queries regarding this new economic model: “Sharing Economy”, “Collaborative Consumption”, “Collaborative Economy” and “On-demand Economy”. They all communicate the same core meaning, but with different connotations, and they are all used interchangeably.
Text collection. Using all four queries related to the sharing economy, we collected all the results from the three sources, filtered by year from 2005 to 2015. For both Lexis Nexis and Google Scholar we had to collect the results manually, one by one, since neither allows automated retrieval: Lexis Nexis returned a txt file, Google Scholar a PDF file that we converted into a txt file. For Google Search, the process was automated with the Moz plug-in to collect the resulting links, and yoyo to download the text from all the links. At first we kept the results of the four definitions separate in order to check the differences between them across the three sources, but we then decided to merge all the texts together, keeping only the division by year and by source, not by definition.
Protocol summary. Our protocol is based on R, a programming language for statistics and data mining that can also produce graphics to better understand the data. We wrote several R scripts combined with different R packages, mainly for text mining (the packages used are tm, SnowballC, and plyr).
- cleaning of the text database (script 1)
- word frequency, divided by source and by year (script 1)
- merging of the years (script 2)
- frequency divided by the number of articles of that year (script 3)
- comparison of the average use in a year with the total average of the past (script 3)
- conversion from absolute values to percentages (script 3)
- comparison of the sources over time (script 4)
The complete set of the scripts can be viewed here.
The script starts from a database of plain text and, thanks to the text-mining package, transforms everything to lower case, removes all punctuation and all English stopwords, and strips double white spaces.
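These cleanup steps can be sketched with the tm package; the sample texts below are invented placeholders, since the original database is not included here:

```r
library(tm)

# Toy input; the real protocol loads one txt file per source and year.
raw <- c("The Sharing Economy is growing!!",
         "Apps   and the on-demand economy.")
corpus <- VCorpus(VectorSource(raw))

corpus <- tm_map(corpus, content_transformer(tolower))      # lower case
corpus <- tm_map(corpus, removePunctuation)                 # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) # English stopwords
corpus <- tm_map(corpus, stripWhitespace)                   # collapse spaces
```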
It then analyzes the frequency of each word and creates a table with the word, its frequency, and the percentage of use of that word in the database being analyzed.
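A minimal sketch of how such a frequency table can be built with tm (the two-document corpus is again an invented placeholder):

```r
library(tm)

# Placeholder corpus standing in for one source/year of collected articles.
corpus <- VCorpus(VectorSource(c("sharing economy app",
                                 "sharing platform economy")))

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# One row per word: the word, its frequency, and its percentage of use.
word_table <- data.frame(word       = names(freq),
                         frequency  = freq,
                         percentage = 100 * freq / sum(freq))
```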
In fig. 3 you can see the first three rows of the table exported for the database of Google Scholar in 2012.
We used this first part of the script to analyze each source separately, divided by year. Afterwards, using the second part of the script, we combined all the data together. In fig. 4 you can see the first three lines of the table exported from the analysis of the Lexis Nexis database.
With this database of word frequencies organized by year, we decided to compare their evolution and distribution over time. First of all, the R script divides the frequency of each word by the number of articles (or results) collected in that year. Then it compares the average frequency of each year with the total average of the previous years. Using R's graphics package we plotted the results on a scatter plot: the y axis shows the average frequency in that year, the x axis the average frequency of the previous years. In this way the graph reveals three areas: in the top left we find the words of the “new vocabulary” (used in that year but not in the previous ones), in the bottom right the words of the “old vocabulary” (used before but now decayed), and in the diagonal area between the two the constant words. Words with a low frequency of use cluster in the bottom-left region and, as their frequency increases, they move toward the top-right region.
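A rough sketch of this comparison, assuming a hypothetical word-by-year matrix `freq` already normalized by the number of articles (the values here are random placeholders):

```r
# freq: hypothetical word-by-year matrix of per-article frequencies.
set.seed(1)
freq <- matrix(runif(30), nrow = 10,
               dimnames = list(paste0("word", 1:10),
                               c("2012", "2013", "2014")))

past    <- rowMeans(freq[, c("2012", "2013")])  # average of previous years
current <- freq[, "2014"]                       # average frequency this year

plot(past, current,
     xlab = "average frequency in previous years",
     ylab = "average frequency in 2014")
abline(0, 1, col = "red")  # diagonal area of the constant vocabulary
```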
One of our goals was to compare the use and evolution of the vocabulary over time, and this was possible by placing all the yearly plots side by side. In fig. 6 we show the scatter plots for eight years, and it is easy to see that until 2013 almost all the words are clustered in the red area. This means that the vocabulary stayed almost the same until 2014, when something relevant happened: in our case, 2013/2014 was the peak of the sharing economy.
To get a meaningful comparison between the three sources we also created a composition showing a scatter plot for each source over time. Still working on the comparability of the data, we transformed the values from absolute to percentage, because the sources had different orders of magnitude, and set the maximum value at (100, 100) so that the red area is a 45° line in each graph.
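The rescaling step can be sketched as follows; the absolute counts are invented placeholders for a single source:

```r
# Invented absolute frequencies for one source.
past    <- c(share = 120, economy = 80, app = 5)
current <- c(share = 130, economy = 60, app = 90)

to_percent <- function(v) 100 * v / max(v)  # maximum value mapped to 100

plot(to_percent(past), to_percent(current),
     xlim = c(0, 100), ylim = c(0, 100),
     xlab = "past average (% of maximum)",
     ylab = "current year (% of maximum)")
abline(0, 1, col = "red")  # 45-degree line of the constant vocabulary
```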
On one hand, this kind of visualization is useful for a general overview of the evolution of the vocabulary of a single source. On the other, it does not make it easy to track the evolution of a single word or to compare its use across different sources at a glance.
That’s why we have decided to visualize the same data in another way. The idea was to create a visualization per year for each word that summed up all the sources.
For each word, we plotted the values of the three sources on the same scatter plot and connected the three points with lines to draw a triangle. The shape of the triangle gives a clear idea of how the three sources use that word. A large triangle means that the use of that specific word differs greatly across the three sources; a small one means the use is similar. If instead of a triangle you see a line or a dot, it means that one or two sources did not use that word at all in that period.
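A minimal sketch of the triangle for one word in one year, with invented percentage values for the three sources:

```r
# Invented (past, current) percentages for one word in one year:
# Google Scholar, Lexis Nexis, Google Search.
past_vals    <- c(scholar = 40, lexis = 45, search = 20)
current_vals <- c(scholar = 42, lexis = 47, search = 70)

plot(past_vals, current_vals,
     xlim = c(0, 100), ylim = c(0, 100),
     xlab = "past average (%)", ylab = "current year (%)")
polygon(past_vals, current_vals, border = "grey40")  # connect the 3 sources
abline(0, 1, col = "red")  # constant-vocabulary diagonal
```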
An example is given in figure 8.
Using the same criteria as fig. 5, in fig. 8 you can see that in 2012 the word “sharing” has a constant use compared to the past in Google Scholar and Lexis Nexis, but an increasing use in Google Search. Fig. 9 shows the complete picture for the word “sharing” over a period of nine years, from 2007 to 2015.
Our redesign in Adobe Illustrator improves readability, using colors and symbols to represent the three sources. Highlighting the three areas of the “new vocabulary”, the “old vocabulary”, and the constant words allows the reader to perform a quick semantic analysis and get a complete overview at a glance:
A full version of the visualization can be viewed here.
The visualization of the results of this protocol enables many kinds of text and word analysis, such as identifying opinions through the most frequent words, following the evolution of the vocabulary, and comparing the language of many sources. In our case study it is clear that after a specific year the vocabulary changed substantially due to external factors. This outcome, combined with our knowledge of the topic, let us understand the reasons for this change.
The script could be automated further, or adapted to the needs of a project. In our case study we started from a vast database of articles from different sources to get a comprehensive overview of the debate, but the protocol can be used with a single source or with many other sources on the same topic. The protocol can be applied to any lexical research on any topic; the only requirement is a large amount of text to analyze.