Towards Jungian typology with data science

Mattias Östmar
Published in Analytics Vidhya
9 min read · Aug 28, 2020

Carl Jung did not develop a quantitative methodology for assessing personality type, which most likely would have made his theory better received in the scientific community. However, he considered words spoken by the analysand (i.e. the client) to be primary empirical data, as well as paintings and similar material. Before the internet, more or less the only scientifically acceptable (and economically defensible) way to measure personality was self-reporting using rigorously constructed questionnaires. Today, advances in digital media and data science promise an assessment methodology more faithful to Jung’s original point that the psyche is dynamic, noisy and changes over time.

Psychologist Carl Jung proposed a theory of four cognitive functions in his 1921 book Psychological Types. His own original idea was not the four functions themselves, but how they are directed either outwards towards the external world of people and things or inwards towards the inner world of thoughts, emotions and subjective experience.

An outward inclination he called extraversion and an inward one he called introversion. Taken strictly, his original theory of personality types is based on the eight cognitive functions that result from combining the four core functions with their direction.

Today many more people know about his ideas through later developments such as the Myers-Briggs Type Indicator or a plethora of more or less valid and reliable online personality type questionnaires.

As both qualified professionals using such questionnaires and critics of such tests are well aware, there are many problems with getting an accurate profile of an individual by relying on self-reported psychological preferences. For an overview of best practices and pitfalls I suggest reading the more academic Essentials of Myers-Briggs Type Indicator Assessment by Dr. Naomi Quenk or the more accessible My True Type by Dr. A.J. Drenth.

The problem with self-reporting questionnaires

To sum it all up, I suggest focusing on the essential fact that Carl Jung’s main concern as a psychologist was to describe how the psyche is dynamic and how we develop the cognitive functions more and more over the course of our lives.

The point, then, was not and is not to pigeonhole people into a static category, but to analyse, and possibly facilitate the further development of, cognitive functions that have “gotten stuck” in an individual’s development and are thus causing neurosis and suffering in life.

The role of the analyst isn’t really to “nudge” people to grow beyond their current psychological state and potential hang-ups, but to help the person being analysed (the analysand in Jungian speak) to become more aware of the unconscious, repressed parts of the psyche and be able to acknowledge them and also integrate them into their conscious awareness.

The cognitive functions as a map to personal growth

The role of the four main cognitive functions in this process is to act as a map for psychological growth and wholeness. A person whose strongest function is inborn and has been the most nurtured since childhood has its opposite function as the least developed one, operating unconsciously. A lot of the trouble in a person’s life can be attributed to the least developed of the four functions, especially in the first half of life.

Only by becoming aware of and understanding the workings of one’s least developed cognitive function, together with the three increasingly more developed functions that mature into awareness and conscious operability before it, can one become psychologically whole and experience a deep sense of existential “completeness” and the joy of an unhindered ability to use one’s psyche freely and adequately in response to the myriad external events that life throws at us.

The problem of lack of tagged training data

Since the pioneering work published in 2003 by psychologist James W. Pennebaker and his colleagues, it is safe to say that word use reflects patterns in our social and psychological worlds. Those are, of course, the realms from which we draw conclusions about personality type and traits, no matter what methodology we use to get into the minds and hearts of people.

However, in order to make use of data science methods such as machine learning, we need lots of language samples from lots of people whose personality type we know. Studies have been done, especially correlating Pennebaker’s thoroughly researched LIWC word categories with personality type, but the number of individuals in each study has been very low from a machine learning perspective.

An example is the 2007 study of correlations between LIWC categories in stream-of-consciousness essays and self-reported Big Five and MBTI types from 80 Korean students, which showed statistically significant correlations. But in a machine learning experiment one would want a sample of at least around a thousand individuals, and to make use of recent years’ advances in e.g. deep learning, somewhere between 100,000 and millions of individuals.

The needed data might be created by companies, but…

In today’s world of internet-scale web apps where personality type data could be relevant, such as online dating apps, such data would actually be possible to access. At least for the companies that own those apps.

But some individuals interested in this area have taken the initiative to scrape content from personality type forums. On the popular data science community platform Kaggle.com there is, for example, a scraped dataset with forum posts from 8600 users together with their self-reported Myers-Briggs personality type.

The quality of that raw data could be questioned, since there is no control over which questionnaires the forum users have used to get their type, how well constructed those are, and how much self-understanding and capacity for self-observation those individuals have developed over the course of their lives.

And also, it isn’t likely that the content they post on that particular forum (personalitycafe.com) is of the stream-of-consciousness character that would probably yield the best results. Nevertheless, I think it’s a very good contribution people make when scraping online data in order to make training data available.

Maybe an organisation such as The Myers & Briggs Foundation, which concentrates what looks like most of the academic knowledge about Jungian type (even though in the slightly, but perhaps importantly, different form of Myers-Briggs personality type), would have the financial muscle and foresight to take the lead in exploring the new possibilities with data mining of personality type. Or maybe they won’t and don’t.

Or maybe the Centre for Applied Jungian Studies, dedicated to making Carl Jung’s ideas accessible to a wider audience, could help promote or even call for financing and development of more explorations into Jungian data science.

But instead it seems like the global research community, with James Pennebaker’s long-standing research in psychological text analysis at its epicenter, has taken the lead without any particular interest in Carl Jung’s depth psychology per se. Any advances made within the scientific community will of course spill over to everybody engaged in the assessment of and conversation about personality type, whether for commercial or therapeutic use.

Examples from Swedish social media analysis

Last year I did a data science experiment trying to predict the eight Jungian cognitive functions from Reddit posts, building on a previous experiment by another private individual.

Currently I’m exploring how to map and use Pennebaker’s LIWC-categories to make predictions about common-sense relations between cognitive functions and likely social behaviour in social media.

For instance, I’ve tried translating specific categories such as LIWC-categories 132 (Insight) and 133 (Cause) to Swedish in order to test them on Swedish Twitter users.

There is a recurring natural pattern at around 18% of every tweet’s total words

Human languages studied at scale have interesting patterns. Recent years’ radical improvements in machine translation from one language to another are living proof of that. I was actually pretty baffled when I first noticed patterns like the one seen above, in that instance for personal pronouns. The same goes for words and phrases related to cognitive reasoning (LIWC category 132) and causation (133). When analyzed across a large enough sample of individuals, with a large enough number of language samples from each of them, a natural pattern emerges.

Note that each dot in the chart above represents 1000 Twitter accounts with 323 tweets from each individual on average. A tweet is on average about 10 words long, so the results say that almost 2 out of 10 words in a tweet are related to thinking processes as defined by LIWC (and by the quality of my translations to Swedish, of course).
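As a rough illustration of the measurement described above, the share of category words can be computed as the number of matching words divided by all words. This is only a minimal sketch: the word set `THINKING_WORDS` is a hypothetical stand-in (the real LIWC dictionaries are much larger and are not reproduced here), the function names are made up for the example, and exact-match lookup is a simplifying assumption.

```python
import re

# Hypothetical stand-in for a translated LIWC category word list.
# The real LIWC dictionaries are far larger; this set is illustration only.
THINKING_WORDS = {"think", "know", "because", "why", "understand"}

TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def tokenize(text):
    """Lowercase a tweet and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def category_share(tweets, category_words):
    """Return the fraction of all words in `tweets` that belong to the category."""
    total = 0
    hits = 0
    for tweet in tweets:
        tokens = tokenize(tweet)
        total += len(tokens)
        hits += sum(1 for token in tokens if token in category_words)
    # Guard against an empty input so we never divide by zero.
    return hits / total if total else 0.0


tweets = [
    "I think it rains because of the clouds",
    "Why do we know so little",
]
share = category_share(tweets, THINKING_WORDS)
print(f"{share:.1%} of all words belong to the category")
```

With a share of about 0.18 and tweets of roughly 10 words, the expected number of category words per tweet is 0.18 × 10 = 1.8, which is where the "almost 2 out of 10 words" figure comes from.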

When looking at people’s language from this statistical high-level perspective you can very precisely measure correlations between linguistic patterns and the content of their communications. I wanted to make a quick check to see whether there is a relationship between how many words people use from LIWC categories 132 and 133 (a rough estimate of what I believe should map to the Jungian Thinking functions, whether introverted or extraverted) and how much they talk about family and friends.

I began with a personal hunch that the higher the degree of thinking-style words the lower the references to family (e.g. mom, brother, spouse) and friends (e.g. buddy, friend, guests) would be. Seems pretty obvious, doesn’t it? But actually a simple visual exploration of those two variables, measured as percentages of all words per tweet, showed no signs of relatedness at all.

This is just noise. Which perhaps is a finding in itself.

If there had been a statistical relationship between thinking-related LIWC words and how much people talk about family and friends, the dots and the regression line in the chart would have looked very different, like they do for the relationship between the age of trees and their height. Then a plot like the one above would look something like the one below.
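Besides eyeballing the charts, the same contrast can be put into numbers with a correlation coefficient and a least-squares line. The sketch below uses only synthetic data generated for the example (not the Twitter data from the experiment), and the means, spreads and the 0.4 growth rate are assumptions picked purely for illustration.

```python
import random
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def regression_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic "noise" case: two unrelated per-user word percentages.
thinking = [rng.gauss(18.0, 2.0) for _ in range(1000)]
family = [rng.gauss(1.0, 0.2) for _ in range(1000)]

# Synthetic "related" case: tree height depends on age plus some scatter.
age = [rng.uniform(1, 50) for _ in range(1000)]
height = [0.4 * a + rng.gauss(0, 2) for a in age]

r_noise = pearson_r(thinking, family)
r_related = pearson_r(age, height)
slope, intercept = regression_line(age, height)

print(f"unrelated variables: r = {r_noise:.2f}")
print(f"age vs. height:      r = {r_related:.2f}, slope = {slope:.2f}")
```

For unrelated variables the coefficient stays close to zero and the fitted line is nearly flat, while a real dependency gives a coefficient close to 1 (or -1) and a clearly tilted line.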

Personally, I’m fascinated by the fact observed by James W. Pennebaker and his colleagues that words that are usually thrown away before doing any machine learning experiments on natural human language exhibit several consistent patterns related to social status and personality type amongst others. He has written an easy-to-read popular science book about many of these findings called The Secret Lives of Pronouns which I highly recommend for anyone interested in exploring personality type with the means of data science.

My interest was really piqued when I recently reproduced his findings about personal pronouns in Swedish. The use of personal pronouns, he discovered, tells a lot about a person’s level of psychological and physical well-being (the two are of course related; it’s hard to be cool and happy with pain in the body). Since the usage, on a statistical level, exhibits clear signs of the well-known normal distribution, it is possible to identify with high precision people whose use of specific personal pronouns is consistently higher or lower than the population average.

Look at how neatly the use of self-reference pronouns (e.g. I, my, mine) in Swedish tweets clusters around a mean when sifting through hundreds of thousands of tweets from 9 different random samples of Swedish Twitter users!

Of all words found in all tweets from each user, about 2 per cent are self-referring pronouns. The experiment is repeated 9 times with 1000 users in each subsample (the dots).
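The repeated-subsample setup and the idea of spotting users who deviate from the average can be sketched as follows. Everything here is synthetic: the 2 per cent centre follows the figure above, while the spread of 0.6, the population size and the cut-off of 2 standard deviations are assumptions made only for this example.

```python
import random
from statistics import mean, stdev

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Synthetic population: one value per user, the share of self-referring
# pronouns in per cent. Centre 2.0 follows the article; spread 0.6 is assumed.
population = [rng.gauss(2.0, 0.6) for _ in range(50_000)]

# Repeat the experiment 9 times with 1000 randomly drawn users each.
subsample_means = [mean(rng.sample(population, 1000)) for _ in range(9)]


def z_scores(values):
    """Express each value as standard deviations away from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [(value - mu) / sigma for value in values]


# Flag users whose share lies more than 2 standard deviations from the mean.
scores = z_scores(population)
outliers = [i for i, score in enumerate(scores) if abs(score) > 2]

print("subsample means:", [round(m, 2) for m in subsample_means])
print(f"{len(outliers) / len(population):.1%} of users are beyond 2 standard deviations")
```

The subsample means land very close to each other because averaging over 1000 users smooths out individual variation, while the z-scores show how far each individual sits from the population average.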

Swedish research using LIWC and similar methods

Here in Sweden I have found very little academic research or published experiments in this area, except from the Swedish Defence Research Agency (FOI), who have repeatedly used Pennebaker’s LIWC for studies like

Linguistic markers of a radicalized mind-set among extreme adopters

Automatic Detection of Xenophobic Narratives: A Case Study on Swedish Alternative Media

A Machine Learning Approach towards Detecting Extreme Adopters in Digital Communities

All of the above are actually part-studies of the recent (2019) doctoral dissertation by Amendra Shrestha at Uppsala University in Sweden, which gives a very nice overview of data science approaches applied to analysing people’s online discourse, sometimes using LIWC. The dissertation was supervised by Lisa Kaati at FOI, who also holds a PhD from Uppsala University.

His full dissertation can be found here:

Techniques for analyzing digital environments from a security perspective

Mattias Östmar

Technology, philosophy and nature. https://www.linkedin.com/in/mattiasostmar/