A tale of my journalistic interest in text sources
It was digging into a collection of speeches by Fidel Castro that first helped me realize how useful text mining could be for journalists.
I was interested in exploring the idea of sacrifice in post-1959 Cuba: how it had become so widely accepted in the culture as an integral (and necessary) part of life. It occurred to me that maybe at some point on our route toward “communist prosperity,” we had forgotten that all that selfless suffering was simply supposed to be a means to an end, not an end in itself. Losing sight of that, I thought, had left us stuck in a pointless cycle of perpetual, mindless martyrdom.
Looking for sources to evaluate my hypothesis, I thought that the statements by the man who for 50 years had decided every aspect of our lives in Cuba — from the negotiations of the Missile Crisis to the design of the uniforms nurses wear today — were a good place to start.
It felt like a monumental task, however. A tireless speaker, Fidel Castro held several national and international records for the longest speeches, including one of over 7 hours in 1998 before the Cuban Parliament, and a 4.5-hour appearance at the UN General Assembly in 1960. The word count of five decades of his public addresses would certainly run to several million, and that felt overwhelming and discouraging for someone with (at the time) zero coding skills.
A bittersweet taste of automation
I was working as a tech reporter in Havana in 2013 when I came across a developer who, as a hobby, had scraped all of Fidel Castro’s speeches from a government website, and put together a searchable database.
When I told him about my idea, he offered to run the search for me, and a few days later he handed me a flash drive with a PDF file of the results. I would have liked to get my hands on the software and explore the issue further, but we didn’t know each other that well. With the healthy distrust that we have for strangers in communist Cuba, I understood that he didn’t want to be linked to some future use of his program, outside his control, that could get him in trouble. I thanked him and tried to make the most of the 66 pages of results.
Long story short, my theory about Cuban suffering was wrong. I was disappointed to see that there had been little variation in the discourse since the 1960s: it had always been sacrifice for the sake of sacrifice. Exploring other possible reasons behind that mentality, such as religious beliefs, was not as interesting to me, and that inquiry never turned into an article.
The real value of that experience, though, was seeing how much time could be saved through automation, and realizing that there were stories buried in piles of words waiting to be explored.
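For readers curious about what that kind of automated search involves, here is a minimal sketch in Python. It is not the developer's program, which I never saw. The `speeches` list, its sample sentences, and the `mentions_per_decade` function are all hypothetical, made up only to illustrate counting a keyword by decade.

```python
import re
from collections import Counter

# Hypothetical sample data: (year, text) pairs standing in for a speech archive.
# These sentences are invented for illustration, not quotes from real speeches.
speeches = [
    (1961, "El sacrificio de hoy es la victoria de mañana."),
    (1968, "Sin sacrificio no hay revolución; el sacrificio nos une."),
    (1995, "Pedimos un sacrificio más al pueblo."),
]

def mentions_per_decade(speeches, term):
    """Count whole-word, case-insensitive matches of `term`, grouped by decade."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    counts = Counter()
    for year, text in speeches:
        decade = (year // 10) * 10  # e.g. 1968 -> 1960
        counts[decade] += len(pattern.findall(text))
    return dict(counts)

print(mentions_per_decade(speeches, "sacrificio"))
```

A raw count like this only shows how often a word appears, not how it is used. Judging whether the discourse itself changed would still mean reading the passages around each match.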
Are text-data-driven stories a thing? (And how to find them)
A few years later, while I was doing my master’s in the UK, I went back to thinking about text mining, but this time my goal was to learn to do the analyses myself.
I knew that by the end of the course I would be going back to a data desert (Cuba), and if I wanted to be a data journalist, I’d better get creative.
With no public records and no quality (or even reliable) official statistics to work with, I thought that less jealously guarded text archives had some potential.
I soon learned that the toolbox of the text-miner-journalist was extensive, with widely varying levels of complexity. From PDF file processing to text extraction to visualization, there were dozens of principles, methods, math, and software to be mastered:
- NLP theory and practice
- OCR software
- Regular expressions
- Coding (preferably in more than one language)
- Specific libraries for NLP
- Text/Information retrieval
- Machine learning theory and practice
- Querying APIs
- Text data visualization principles (or “learning to hate word clouds”)
- Web scraping
- Version control tools
- Command line
…just to name a few.
Text-data for all: my JSK Fellowship project
Every time a new problem takes me deeper down the rabbit hole of text analysis skill acquisition, I wonder:
Does every journalist wanting to accomplish similar results need to go through this amount of training?
Now that I’m spending ten months at Stanford University as a JSK fellow, I want to find ways to lower that bar, and make text processing easier and faster for journalists.
I can’t think of a better place to do this: immersed in the entrepreneurial spirit of Silicon Valley, with access to brilliant minds, courses, and research at Stanford University (the people who “wrote the book”), and able to put everything in a journalistic perspective with the support of the JSK Fellowship program and my fellow fellows.
I hope to be able to integrate all those ingredients into a coherent answer.
Follow me here for updates about my work, or write me if you’d like to share your experiences, concerns, and ideas about textual sources and journalism.