One Coder Coding
On minimizing software tooling for qualitative data analysis, from a programmer who does a particular kind of qualitative user research.
An hour-long research interview can translate to something like 10 or even 20 single-spaced pages of text, depending on how fast your interviewee was talking. A good portion of that is “like” and “um” and aborted phrases said in the good, healthy course of thinking aloud. Maybe you have 20 or 30 interviews, and, bam, you have on your hands a stunningly unreadable and disorganized 200-to-600-page book. In addition to interviews, I also do a lot of observations, which involve frantic scribbling on paper with a pen; I then transcribe that scribbling into another unwieldy tome to scare people with. So, what is to be done with these heaps of digital texts?
Rule #1: You can’t reason about what isn’t there.
If you have 20 interviews where no one talked about X, you cannot say that 0% of the participants care about X, because it’s possible you simply did not ask the right X-related question. If you have 20 interviews and you had a question about X which 19 out of the 20 responded to (1 of the 20 was deeply engaged and passionate about something other than X, and you went with that instead and ran out of time), and of those 19, 8 hated X and 11 loved X, you still cannot say anything about the distribution of attitudes toward X in the general population (“58% love it!”), because you (probably) did not specifically sample your 20 interviewees to be representative. All you can talk about is the contexts and values as expressed in the interviews regarding X, and from that build a nuanced understanding of what it means to love or hate (or even have bittersweet longing for) X.
But how does one actually do the qualitative analysis? In the past, I’ve used “Constructing Grounded Theory” (Charmaz) and “Qualitative Research Design” (Maxwell) for practical exercises — these formed the basis of my earlier methods blog post, How I Do User Research. More recently, in working on my dissertation, I’ve been using “Qualitative Data Analysis: A Methods Sourcebook” (Miles and Huberman).
But a list of references is not an explanation. As far as I understand it, qualitative analysis is the systematic application of some processes after which some insight emerges; the insight itself ought to make sense once it appears, but coaxing it out of the researcher’s psyche may be difficult. I came to qualitative analysis in a roundabout way: I was doing computer science and social network analysis and natural language processing, and then, over the course of my graduate work, swung further and further afield to include other methods for understanding socio-technical systems.
I started out doing one sort of coding (programming) and then slowly did a lot of the other sort of coding (annotating qualitative data with labels as part of analysis). The other coding — annotation — made a lot of sense because it seemed very structured, and structure felt good. For a while I used Atlas.ti — a behemoth of a thing with myriad tiny windows, so that even when you are just clicking around in exasperation it sort of feels like progress.
Atlas.ti supports qualitative coding — annotation — of segments of text. It then supports searching by those codes, relating codes together in complex networks, and building more and more complex search queries. All this structure provided a huge comfort to me, because qualitative interview and observation data can be so terrifyingly complex and unapproachable. I am happy to report I haven’t had to use Atlas.ti in years. I’m sure it’s very useful for some tasks, but the problems it solves are not the problems I have. (I am picking on Atlas.ti in particular, but there are many qualitative coding packages and environments, all over-complicated in slightly different ways, IMHO, based on my periodic attempts to find a better tooling solution.)
Requirement #1: Iterative search when everything is changing
The fundamental thing that my ideal qualitative analysis tool needs to do is support searching the documents I have, and it must be robust to an evolving understanding of the documents.
I call my own tooling approach “grep++” or “fancy grep,” and, to be honest, it’s a misnomer, because grep itself is by far the fanciest and ++-iest thing going on. I have a giant directory of files with rigorous filenames that denote the context of sampling (participant group, date), the type of data (interview or observation), the researcher doing the data collection (I have data collected by 6 other researchers in addition to my own), plus a short, descriptive, memorable title. All files have blocks of metadata including start and end times, participants present, type of interaction (if appropriate), and so on. I can easily search all files, and once I find a few matching my query, I open them and examine the relevant sections. If appropriate, I label a section with a dated hashtag (#CodeName@YYMMDD), where the “@YYMMDD” part is optional. The codes themselves are then easily searchable.
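To make this concrete, here is a mocked-up primary document; the group, times, speaker, and code name are all invented for illustration, but the shape follows the conventions just described:

```
groupA_140312_interview_JS_kiosk-frustrations.txt

START: 10:04
END: 11:02
PARTICIPANTS: P07, JS (interviewer)
INTERACTION: scheduled one-on-one interview

P07: ...so I never trust the kiosk to save anything, I always, um, I
email it to myself first, every time. #DistrustOfAutomation@140519
```

Pulling up everything tagged with that code across the whole corpus is then a single command in a terminal (something like grep -rn "#DistrustOfAutomation" . at the top of the directory), and grep does not care that the files changed five minutes ago.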
For slightly more complex aggregation tasks, I have a sprinkling of Python code for things like “which interview has the least recent codes?” or “how long did I spend observing this type of interaction in that participant group?” or “give me all the codes!” I have a 112-line Python module for parsing the primary documents, and then I can use IPython Notebook and make whatever aggregated calculations I need in 5 lines or fewer.
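For a flavor, here is a heavily pared-down sketch of what such a parsing module could look like. It reads the invented file format from the example above; the actual 112-line module does more, so treat this as illustrative rather than as my real code:

```python
# Sketch of a parser for the (invented) primary-document format above.
# Assumes metadata lines like "START: 10:04" and dated hashtag codes
# like #DistrustOfAutomation@140519 embedded in the text.
import re
from pathlib import Path

CODE_RE = re.compile(r"#(\w+)(?:@(\d{6}))?")  # #CodeName or #CodeName@YYMMDD
META_RE = re.compile(r"^(START|END|PARTICIPANTS|INTERACTION):\s*(.*)$",
                     re.MULTILINE)

def parse_document(path):
    """Return (metadata dict, list of (code name, YYMMDD date or None))."""
    text = Path(path).read_text()
    meta = dict(META_RE.findall(text))
    codes = [(name, date or None) for name, date in CODE_RE.findall(text)]
    return meta, codes

def least_recently_coded(directory):
    """Which document's newest code is oldest? Undated/uncoded sorts first."""
    def newest_code_date(path):
        _, codes = parse_document(path)
        return max((date or "000000" for _, date in codes), default="000000")
    return min(Path(directory).glob("*.txt"), key=newest_code_date)
```

One pleasant property of @YYMMDD is that the dates sort correctly as plain strings, which keeps code like the above trivial.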
The main philosophical way in which what I am doing now feels different from using something like Atlas.ti is that I am constantly modifying the files. In Atlas.ti, the assumption is that the primary documents are static and unchanging, and you then make codes and write memos on top of that in the Atlas.ti environment. I need no such separation, because I have the life-saving magic of version control! You can have free private repositories on bitbucket.org (or pay to have private repositories on GitHub, or host a local repository if you have to keep your data only on specific machines) and have extensive, rich backups of all your data.
When I mentioned coding, a professor on my PhD committee made a sour sort of face and pointed out that “unless you know what you’re doing, you can get stuck in coding” and tread water fruitlessly. She is the one who turned me on to the Miles and Huberman and has been exposing me to other analytic processes and exercises besides (and in addition to) coding proper. That said, “grep++” has remained equally useful for staying close to the data.
But, coming back to Rule #1 — “you can’t reason about what isn’t there” — I think a major disadvantage of Atlas.ti (et al.) is that it makes the counts of code occurrences so front-and-center, and that everything about its interface makes it harder to have more than a few dozen codes. In other words, I would argue that the interface supports a really specific kind of analysis which, in my case, comes dangerously close to tempting me to make arguments about the relative occurrences of codes: “oh, there sure is a lot of ‘love X’ in here! Maybe that’s something!” So in my own approach, I have intentionally separated aggregate calculations (in IPython Notebook) from iterative search (in a terminal window with grep), so that I am not tempted by tooling design to be preoccupied with counts during the course of analysis.
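To show what that separation looks like in practice: assuming the hypothetical sketch above were saved as a module (say, qualparse.py) with the documents in a data/ directory, the notebook side of “give me all the codes!” might be a cell like this:

```python
# A notebook cell I open only when I deliberately want an aggregate.
# qualparse and the data/ layout are the invented examples from above.
from pathlib import Path
from qualparse import parse_document

codes = set()
for path in Path("data").glob("*.txt"):
    _, doc_codes = parse_document(path)
    codes.update(name for name, _ in doc_codes)
sorted(codes)  # "give me all the codes!" -- names only, no counts in sight
```

The terminal window never shows me aggregates, and the notebook is a separate, deliberate step; that separation is the point.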
It’s all personal
Software tooling for qualitative analysis must be subject to one rule (“you can’t reason about what isn’t there”) and one requirement (iterative search robust to changing document content). My version of not-overkill involves a plaintext editor, grep, version control, and a little bit of IPython Notebook.
My enthusiasm notwithstanding, I think that tooling can and should be a matter of personal taste and skill. The feeling of creative flow requires the right amount of challenge to not be bored, and the right amount of skill to not be anxious. All these processes (e.g., coding) are only as good as their ability to support the researcher in feeling inspired and generating some insight. Being bored (clicking buttons in Atlas.ti) and being anxious (feeling cultural or institutional pressure to learn programming) are both unhelpful states, and each calls for changing either the tooling set-up or the skill-set.