Deception Detection with Natural Language Processing
Here’s the code: https://github.com/Victor-Palacios/Deception_Detection_Capstone
Pamela Meyer, author of Liespotting, takes the stage for her TED talk, “How to spot a liar”. I ready my pencil, eager to take down every morsel of knowledge she has to offer.
Around 8 minutes into the talk, she uses the infamous Bill Clinton statement:
She then points to two indicators of deception: “non-contracted denial” (i.e., “did not” versus “didn’t”) and distancing language (e.g., “that woman”). Furthermore, she notes that “qualifying language” (e.g., “In all candor”, “To tell you the truth”, etc.), repetition of the question in its entirety, and too much detail are also telltale signs of deception.
I first saw this video in 2013 and it inspired my master’s thesis work: identifying the neural correlates of deception for gesture-based and speech-based lies. Here’s a sample of that work: https://www.jcss.gr.jp/meetings/JCSS2014/proceedings/pdf/JCSS2014_P2-37.pdf.
Flash forward to 2019 and deception still fascinates me. Now I’m wondering if I can find any other signs of deception using natural language processing. Better yet, can I make a machine learning (predictive analytics) model that can detect deception in speech? This led to three questions:
- Can I extract a dataset of true and false statements from the web to analyze verbal deception?
- Are certain patterns more common in either statement type (for example, fewer verbs, longer sentences, etc.)?
- How well can machine learning differentiate truth from lie?
Answers
Through “scraping” (data extraction from the web) with the programming language Python, I extracted and refined a dataset of over 1,000 data points from an original dataset of over 16,000. I also found several statistically significant patterns: character count and word count are significantly higher for truthful statements (p < .042), while average word length is significantly higher for falsehoods (p = .003). Adjectives and nouns are more common in truthful statements (p < .006), whereas verbs are more common in deceptive ones (p = .040). Finally, machine learning models were able to detect deception with accuracy rates of 57–60% (7–10 percentage points better than the baseline chance of 50%).
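As a sketch of how one of those comparisons might be run, here is an independent-samples t-test on word counts. The statements below are invented stand-ins, not the actual capstone dataset, and the column choice (word count) is just one of the features described above:

```python
# Sketch: compare word counts of truthful vs. deceptive statements
# with Welch's t-test. The statements are hypothetical placeholders,
# not the scraped capstone data.
from scipy import stats

truthful = [
    "I went to the store yesterday and bought apples, bread, and milk.",
    "The meeting started at nine and our manager presented the budget.",
    "She drove her old blue sedan to the office through heavy traffic.",
]
deceptive = [
    "I did not take it.",
    "I was working late.",
    "I never saw him.",
]

truth_counts = [len(s.split()) for s in truthful]
lie_counts = [len(s.split()) for s in deceptive]

# Welch's t-test does not assume equal variances between groups
t_stat, p_value = stats.ttest_ind(truth_counts, lie_counts, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With real data, the same call would be repeated per feature (character count, average word length, part-of-speech counts), ideally with a correction for multiple comparisons.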
But what does it all mean?
Why do liars use more verbs, while truth-tellers use more adjectives and nouns? It may be that liars want your attention focused on the purported actions that did not actually occur, while truth-tellers, because they’re recalling true memories, are prone to more detail in the form of descriptors and recounts of people, places, and things. As for the machine learning models, they detected deception only 7–10% better than chance, so they could still use some work.
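For a sense of what such a model looks like, here is a minimal bag-of-words classifier sketch. The labeled statements are invented for illustration; the actual capstone models and features may differ:

```python
# Sketch: a bag-of-words truth-vs-lie classifier.
# The labeled statements are invented placeholders; a real model
# would be trained on the scraped dataset described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

statements = [
    "I went to the store and bought bread and milk",
    "She described the old brick house on the corner in detail",
    "I did not take the money",
    "I did not see that woman",
]
labels = [1, 1, 0, 0]  # 1 = truthful, 0 = deceptive

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(statements, labels)

# In practice, accuracy would be estimated with cross-validation
# on held-out data, never on the training statements themselves.
pred = model.predict(["I did not take it"])[0]
```

Reported accuracy figures like 57–60% only mean something relative to the 50% baseline, which is why cross-validation on unseen statements matters here.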
What next?
There is still room for improvement, such as creating new independent variables from the dataset. For example, we might examine “non-contracted denials” and “qualifying language”. We could also fine-tune the classifiers or test other models. A more exciting area of research might be combining verbal deception detection with behavioral deception detection using machine learning!
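Features like those could be extracted with simple pattern matching. The cue lists below are illustrative, drawn from the indicators Meyer mentions, not an exhaustive or validated lexicon:

```python
import re

# Illustrative cue patterns based on the indicators discussed above;
# a real feature set would need to be validated against the dataset.
NON_CONTRACTED_DENIALS = re.compile(r"\b(did not|do not|was not|is not)\b", re.I)
QUALIFYING_PHRASES = re.compile(r"(in all candor|to tell you the truth|to be honest)", re.I)

def deception_cue_features(text: str) -> dict:
    """Count candidate deception cues in a statement."""
    return {
        "non_contracted_denials": len(NON_CONTRACTED_DENIALS.findall(text)),
        "qualifying_phrases": len(QUALIFYING_PHRASES.findall(text)),
    }

feats = deception_cue_features(
    "To tell you the truth, I did not have sexual relations with that woman."
)
print(feats)  # {'non_contracted_denials': 1, 'qualifying_phrases': 1}
```

Counts like these could then be added as columns alongside the character-count and part-of-speech features and fed to the same classifiers.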
