How can CEOs/Consultants use natural language processing to find valuable insights from a large volume of opinionated text

Published in Latent · 11 min read · Nov 21, 2018

Finding addiction trends from Singapore Netizens

Abstract

In this truncated case study, we analyze addiction insights from a popular forum in Singapore and suggest ways to help the public-at-large in Singapore.

To perform this, we broke down the request into a sequential list of tasks. We first generated various lexicons of ‘addict’ and ‘drug’. We also added in the names of the most commonly abused drugs. We used this list to tag posts from the universe set to be of interest to us.

Next, we passed this text through corpora analysis, sentiment analysis and classification to identify patterns. The analysis was finished off with the human touch to ensure all quantitative data made fundamental sense.

From our findings, there are 3 curious issues which public institutes could further address as appropriate:

Are painkillers given to legitimate medical needs indirectly causing drug addiction? What should be done?
Is there a grey area in steroids consumption in Singapore, and should there be more outreach effort to educate people about the effects?
There is murmured chatter on study supplements/nootropics for the youths, perhaps driven by the demands to perform. Similarly, should there be outreach to our youths on this?

In terms of actionable steps to fill the gaps observed from these social insights, public institutes can consider:

Create a wellness site for better youth engagement in aiding early-stage addiction struggles

Quitting at the early stage is easiest for addicts
The engagement level of the current sites can be made better

Create a mobile app for anonymous posting and sharing of drug addiction struggles

To bounce off peers and trained counsellors
Anonymity is desired by addicts but is uncommon in current channels

Increase treatment coverage and publicity for managing sex and junk food addiction

These insights were found in the forum chatter but sparsely mentioned in public sites

Introduction

We embark on a project to mine and uncover insights on ‘addiction’ from a popular local Singapore forum. The information obtained is then cleaned, processed and analyzed using natural language processing algorithms. It is finished off with the human touch, to ensure all insights make fundamental sense.

These insights will be used to provide social intelligence to public institutes, to understand how the Singapore public-at-large feels about various issues pertaining to ‘addiction’ and ‘drug’. These empirical insights give public institutes a base of situational awareness, without subjectivity or speculation. Public institutes can tap into this to structure their outreach efforts.

Context

We first define our research terms. Addiction is a brain disorder characterized by compulsive substance use despite harmful consequences. The core pathology that drives the development and maintenance of an addiction is a biological process. It is reinforcing and intrinsically rewarding.

Singaporeans are a very internet-savvy bunch, supported by a world-class connectivity infrastructure. Singapore is deemed to be the world’s most ‘network-ready’ country by the World Economic Forum. People in Singapore spend over 12 hours a day online, with 71% of them spending it on social media and networks. By listening to what Singaporeans say online, we can tap into a goldmine of insights.

Netizens engage in discussions by posting and replying to messages in forums. The discussion topics range widely and are often open to whatever opinions users may have. Users often engage in questions and answers, comparisons, polls, topical debates and so on. It is not uncommon to see nonsensical, unsocial and ‘trolling’ behaviors, often done under the cloak of anonymity.

Through our machines, we selectively retrieve forum posts with insights on ‘addiction’, ‘drugs’ and the like. We then run this data through natural language processing algorithms to uncover insights. This is done through corpora comparison (comparing forum posts with a general English dictionary), sentiment analysis (positive or negative interpretation of text) and classification (grouping related conversational terms together). We then qualitatively examine the insights produced by the machines and suggest actionable steps which public institutes can take to better align with their goal of promoting better mental health.

Methodology

Refining our collection and retrieval mechanism

Defining search terms

We need to define the search terms for which we will retrieve data. We start with ‘addiction’ and its shortened wildcard form ‘addict*’. To get a better picture of the conversations, we use a wider list of related terms in our retrieval effort. For this, we plug in a list of commonly abused drugs and their corresponding street names. We also supplement the list with generic ‘drug’ and ‘addict’ terms. This process gives us more targeted search results.

Table 1 Terms used for the search query

Search terms: ketamine, clonazepam, amphetamine, stimulant, oxymorphone, klonopin, adderall, depressant, Dmt, methylphenidate, hydrocodone, study drugs, carisoprodol, ritalin, vicodin, recreational drugs, soma, methamphetamine, marijuana, drug, metadone, tramadol, tobacco, lorazepam, lsd, alcohol, ativan, mdma, drug addiction, morphine, ecstasy, drug abuse, buprenorphine, molly, addict, heroin, alprazolam, addiction, zolpidem, xanax, hallucinogen, ambien, oxycodone, tranquilzer, diazepam, oxycontin, opioid, valium, cocaine, sedative

Universe set

The data we collected is vast: 630,279 unique posts from 13,952 unique users across 13,980 unique threads.

Table 2 Description of the full data set

| | title | url | user | content |
| --- | --- | --- | --- | --- |
| count | 654613 | 654613 | 654613 | 646668 |
| unique | 13980 | 52891 | 13952 | 630279 |

top: Buffetlicious!, User-x, Content-x
freq: 99051644155913

The historical period of the data spans more than 5 years.

Table 3 Time range of data universe

| | Date time stamp |
| --- | --- |
| First data point | 25–03–2013, 12:35 AM |
| Final data point | 18–08–2018, 01:08 PM |

Analyzing the distribution of forum posts matching the search query

Within this large universe of posts, we need to be specific and tag only the posts that are related to what we are searching for. A post is deemed relevant if any of the terms in the keyword list (Table 1) appears in it. We mark these posts as relevant by tagging them ‘Addict’ in a newly created ‘Interest’ column in our dataset.

After tagging the posts, we do a keyword search to see the frequency of each keyword. From the table below, we can see scattered mentions of various drug names. For some names, the frequency count is sparse.

Table 4 Distribution of related post counts

| Term | Post count |
| --- | --- |
| ketamine | 11 |
| diazepam | 0 |
| xanax | 1 |
| oxymorphone | 0 |
| valium | 5 |
| oxycodone | 0 |
| dmt | 6 |
| clonazepam | 0 |
| oxycontin | 0 |
| carisoprodol | 0 |
| klonopin | 0 |
| cocaine | 36 |
| soma | 31 |
| methylphenidate | 0 |
| amphetamine | 0 |
| metadone | 0 |
| ritalin | 0 |
| adderall | 0 |
| lorazepam | 0 |
| methamphetamine | 10 |
| hydrocodone | 0 |
| ativan | 0 |
| tramadol | 6 |
| vicodin | 0 |
| morphine | 13 |
| lsd | 25 |
| marijuana | 27 |
| buprenorphine | 0 |
| mdma | 1 |
| tobacco | 85 |
| heroin | 49 |
| ecstasy | 9 |
| alcohol | 995 |
| zolpidem | 0 |
| molly | 18 |
| drug addiction | 19 |
| ambien | 148 |
| alprazolam | 0 |
| drug abuse | 33 |
| addict | 977 |
| opioid | 0 |
| study drugs | 0 |
| addiction | 175 |
| sedative | 11 |
| recreational drugs | 0 |
| hallucinogen | 2 |
| stimulant | 8 |
| drug | 1137 |
| tranquilzer | 0 |
| depressant | 23 |
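The tagging and counting steps described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the term subset, the sample posts and the helper names (`term_to_pattern`, `tag_post`, `post_counts`) are all assumptions. It also assumes whole-word matching, so that a term like ‘ambien’ does not fire on ‘ambience’; plain substring matching would count such posts as hits.

```python
import re
from collections import Counter

# Hypothetical subset of the Table 1 search terms; 'addict*' is the wildcard form.
SEARCH_TERMS = ["addict*", "drug", "drug abuse", "alcohol", "ketamine", "ambien"]


def term_to_pattern(term):
    """Compile one search term into a whole-word, case-insensitive regex.

    A trailing '*' is treated as a wildcard for any word ending.
    """
    if term.endswith("*"):
        body = re.escape(term[:-1]) + r"\w*"
    else:
        body = re.escape(term)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


PATTERNS = {term: term_to_pattern(term) for term in SEARCH_TERMS}


def tag_post(text):
    """Return 'Addict' if any search term appears in the post, else 'Base'."""
    return "Addict" if any(p.search(text) for p in PATTERNS.values()) else "Base"


def post_counts(posts):
    """Count, for each term, how many posts mention it at least once."""
    counts = Counter({term: 0 for term in SEARCH_TERMS})
    for text in posts:
        for term, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[term] += 1
    return counts


# Made-up posts for illustration only.
posts = [
    "He is addicted to alcohol and gambling.",
    "The ambience at this cafe is great.",
    "Caught with ketamine, a controlled drug.",
    "Which Xiaomi phone should I buy?",
]
interest = [tag_post(p) for p in posts]  # values for the 'Interest' column
counts = post_counts(posts)
```

Here the second post stays untagged because ‘ambience’ is not treated as a match for ‘ambien’. Whether that is the desired behaviour depends on how strict the retrieval is meant to be.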

Findings and discussion

Corpora comparison

In this analysis, we find distinguishing terms between bodies of text via the following 3 forms.

Analyze differentiated corpus within the post universe via terms
Analyze differentiated corpus within the post universe via categories
Analyze terms that are more characteristic of the corpus vs a standard dictionary

Analyze differentiated corpus within the post universe via terms

This compares the full corpus (all the collected forum posts) to a general English corpus. We can see the terms that stand out the most in our universe of text.

From the table below, the terms with the highest scaled f-score are those most colloquial to Singaporean speech, like ‘gagt’, ‘edmw’ and ‘sinkies’, which are not typically found in a standard English corpus. The terms in the table give us a sanity check on which words are most localized.

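Table 5 Top terms: scaled f-scores vs background

| Term | Corpus | Background | Scaled f-score |
| --- | --- | --- | --- |
| gagt | 6755 | 0 | 0.004906 |
| liao | 4071 | 425511 | 0.002563 |
| leh | 1972 | 351928 | 0.001272 |
| wrote | 35067 | 56421594 | 0.001185 |
| lah | 1656 | 456593 | 0.001033 |
| hdb | 1490 | 330515 | 0.000968 |
| dun | 1820 | 1673771 | 0.000823 |
| lor | 1169 | 617806 | 0.000695 |
| edmw | 916 | 0 | 0.000667 |
| sinkies | 910 | 0 | 0.000662 |
| hardwarezone | 926 | 51750 | 0.000662 |
| ish | 1529 | 1886194 | 0.00066 |
| xiaomi | 903 | 0 | 0.000657 |
| ppl | 1840 | 2936623 | 0.000647 |
| sinkie | 810 | 0 | 0.00059 |
| chiu | 786 | 549724 | 0.000477 |
| wah | 991 | 1423019 | 0.000475 |
| ceca | 671 | 86541 | 0.000474 |
| meh | 800 | 639766 | 0.000472 |
| kena | 645 | 85789 | 0.000455 |
| wo | 1368 | 3336121 | 0.00045 |
| pap | 1040 | 1940941 | 0.000444 |
| sia | 849 | 1135008 | 0.000437 |
| jiak | 599 | 0 | 0.000436 |
| knn | 586 | 126152 | 0.000408 |
| jin | 972 | 2171963 | 0.000395 |
| singaporeans | 560 | 190364 | 0.000381 |
| satki | 505 | 0 | 0.000368 |
| cpf | 568 | 370077 | 0.000364 |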

Using the list of key search queries, we tag the forum posts which we are interested in as ‘Addict’ and everything else as ‘Base’. We then use a scaled f-score to see how often these terms appear in one group relative to the other.

From the table below, we see terms that are more characteristic of our tagged posts. These terms include ‘alcohol’, ‘gambling’, ‘ketogenic’, ‘tobacco’, ‘disease’, ‘girlfriend’, ‘diabetes’ and ‘liver’. Going further down the list in the database, we see more terms like ‘insulin’ and ‘cigarettes’. These represent what Singaporeans typically talk about when discussing addiction and drugs.

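Table 6 Terms that are most associated with our queries

| Term | Addict freq | Base freq | Scaled f-score (Addict) |
| --- | --- | --- | --- |
| addicted | 320 | 0 | 1 |
| drugs | 785 | 0 | 1 |
| doo doo | 470 | 0 | 1 |
| addiction | 244 | 0 | 1 |
| addicted to | 212 | 0 | 1 |
| alcoholic | 287 | 1 | 0.999972 |
| drug | 729 | 3 | 0.999967 |
| addict | 184 | 1 | 0.999955 |
| alcohol | 916 | 6 | 0.999946 |
| doo | 624 | 5 | 0.999934 |
| addicts | 148 | 1 | 0.999911 |
| addictive | 143 | 1 | 0.999885 |
| ambience | 120 | 1 | 0.999348 |
| gambling | 339 | 59 | 0.997786 |
| the accused | 169 | 35 | 0.997169 |
| globalization | 102 | 2 | 0.997064 |
| ketogenic | 114 | 22 | 0.996443 |
| tobacco | 95 | 0 | 0.995206 |
| disease | 186 | 56 | 0.99509 |
| girlfriend | 115 | 33 | 0.994524 |
| yrs old | 117 | 35 | 0.994378 |
| accused | 211 | 74 | 0.993802 |
| diabetes | 139 | 49 | 0.993665 |
| the deceased | 95 | 21 | 0.992129 |
| liver | 139 | 58 | 0.9918 |
| casino | 149 | 67 | 0.990854 |
| smokers | 111 | 47 | 0.990416 |
| alcohol and | 85 | 0 | 0.990068 |
| my girlfriend | 86 | 8 | 0.989783 |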

At the other end of the spectrum, we also analyze the terms that are found predominantly in the untagged posts and not in our tagged posts of interest. These are common terms in discussions that have the least relevance to ‘drug’ and ‘addiction’. They include ‘ceca’, ‘lease’, ‘cpf’, ‘samsung’ and ‘xiaomi’. These are topics close to the hearts of Singaporeans and perhaps the source of some of their disgruntlement.

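Table 7 Terms that are most associated with the baseline forum posts

| Term | Addict freq | Base freq | Scaled f-score (Base) |
| --- | --- | --- | --- |
| ceca | 18 | 653 | 1 |
| lease | 9 | 398 | 0.995337 |
| cpf | 21 | 547 | 0.993762 |
| samsung | 36 | 841 | 0.991795 |
| xiaomi | 38 | 865 | 0.991179 |
| hdb | 66 | 1424 | 0.989853 |
| hr | 19 | 448 | 0.988955 |
| samsung sm | 33 | 645 | 0.987151 |
| from samsung | 35 | 680 | 0.987008 |
| sm | 36 | 695 | 0.986832 |
| sinkie | 41 | 769 | 0.986002 |
| mi | 29 | 548 | 0.985799 |
| pigu | 11 | 360 | 0.985197 |
| team | 36 | 617 | 0.983113 |
| goku | 0 | 301 | 0.982995 |
| code | 20 | 406 | 0.981977 |
| bui | 8 | 332 | 0.981799 |
| from xiaomi | 36 | 585 | 0.981228 |
| chiu | 47 | 739 | 0.980253 |
| wat | 29 | 476 | 0.979968 |
| knn | 35 | 551 | 0.979907 |
| posted | 46 | 707 | 0.979423 |
| pap | 64 | 976 | 0.979144 |
| loan | 8 | 323 | 0.978738 |
| satki | 30 | 475 | 0.978684 |
| from your | 30 | 475 | 0.978684 |
| wow | 28 | 454 | 0.978648 |
| google | 35 | 531 | 0.978354 |
| wrote how | 26 | 433 | 0.978334 |
| xiaomi mi | 12 | 339 | 0.977848 |
| ts | 65 | 955 | 0.977738 |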
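To give a feel for how an ‘Addict’ versus ‘Base’ score of this kind can be computed, here is a simplified sketch in the spirit of a scaled f-score. It is an assumption about the general approach, not the exact computation behind Tables 6 and 7: each term's precision (share of its mentions that fall in the category) and frequency (share of the category's counts) are squashed into the 0 to 1 range with a normal CDF and then combined with a harmonic mean. The function names are hypothetical, the counts are a handful of rows as read from the two tables, and the resulting values will not reproduce the published scores because the scaling depends on the full vocabulary.

```python
from statistics import NormalDist, mean, pstdev


def normcdf_scale(values):
    """Map raw values into (0, 1) using a normal CDF fitted to their mean and std."""
    dist = NormalDist(mean(values), pstdev(values))
    return [dist.cdf(v) for v in values]


def scaled_f_scores(cat_counts, other_counts, beta=1.0):
    """Harmonic mean of scaled precision and scaled frequency for each term.

    Assumes every term has a non-zero count in at least one of the two groups.
    """
    terms = list(cat_counts)
    total = sum(cat_counts.values())
    precision = [cat_counts[t] / (cat_counts[t] + other_counts[t]) for t in terms]
    frequency = [cat_counts[t] / total for t in terms]
    scaled_p = normcdf_scale(precision)
    scaled_f = normcdf_scale(frequency)
    return {
        t: (1 + beta ** 2) * p * f / (beta ** 2 * p + f)
        for t, p, f in zip(terms, scaled_p, scaled_f)
    }


def addict_vs_base(addict, base):
    """Score near 1 means characteristic of 'Addict'; near 0 means 'Base'."""
    a = scaled_f_scores(addict, base)
    b = scaled_f_scores(base, addict)
    return {t: a[t] if a[t] > b[t] else 1 - b[t] for t in addict}


# Term counts for a few rows, as read from Tables 6 and 7.
addict = {"alcohol": 916, "gambling": 339, "tobacco": 95, "cpf": 21, "xiaomi": 38, "hdb": 66}
base = {"alcohol": 6, "gambling": 59, "tobacco": 0, "cpf": 547, "xiaomi": 865, "hdb": 1424}

scores = addict_vs_base(addict, base)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.3f}")
```

Even on this tiny vocabulary, terms like ‘alcohol’ and ‘gambling’ land on the ‘Addict’ side while ‘cpf’ and ‘hdb’ land on the ‘Base’ side, which mirrors the ordering in the tables.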

Analyze differentiated corpus within the post universe via categories

We use the Empath library, which generates and validates lexical categories from a set of seed terms. We pass our forum posts through the library, which categorizes them into 1 out of its 200 standard topics. Instead of analyzing terms, we now analyze the categories that overarch the terms.

The categories that best characterize our tagged posts are ‘health’, ‘alcohol’, ‘medical emergency’, ‘crime’ and ‘violence’. The categories that characterize the other posts are ‘real estate’, ‘shopping’ and ‘giving’.

Table 9 Categories that best represent tagged posts and non-tagged posts

| Top Addict | Top Base |
| --- | --- |
| health | real_estate |
| alcohol | shopping |
| liquid | giving |
| eating | speaking |
| medical_emergency | payment |
| crime | communication |
| violence | business |
| children | economics |
| smell | money |
| injury | valuable |
| restaurant | work |
| stealing | traveling |
| death | banking |
| healing | positive_emotion |

Table 10 Empath categories between queries and baseline
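The idea of moving from terms to categories can be illustrated with a small lexicon lookup. This sketch does not call the Empath library; the mini-lexicon, the sample posts and the function names are hypothetical stand-ins, and a real category lexicon would be far larger.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon standing in for a real set of lexical categories.
LEXICON = {
    "health": {"doctor", "disease", "liver", "diabetes"},
    "alcohol": {"alcohol", "beer", "drunk"},
    "real_estate": {"hdb", "lease", "loan", "flat"},
    "shopping": {"buy", "price", "shop"},
}


def category_counts(text):
    """Count how many tokens of a post fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for category, words in LEXICON.items():
        counts[category] = sum(1 for tok in tokens if tok in words)
    return counts


def top_categories(posts, n=2):
    """Return the n categories with the most hits across a group of posts."""
    total = Counter()
    for post in posts:
        total.update(category_counts(post))
    return [cat for cat, hits in total.most_common(n) if hits > 0]


# Made-up posts for illustration only.
addict_posts = [
    "Too much alcohol will damage your liver.",
    "My doctor said the disease is diabetes.",
]
base_posts = [
    "Is the HDB lease worth the loan?",
    "Where to buy a flat at a good price?",
]

top_addict = top_categories(addict_posts)
top_base = top_categories(base_posts)
```

Comparing `top_addict` with `top_base` gives the same kind of side-by-side view as Table 9, only on a toy scale.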

Analyze terms that are more characteristic of the corpus vs a standard dictionary

The x-axis of the scatter graph shows how characteristic a term is of the corpus compared with a standard dictionary. The further to the left, the more common the word is in general English. The higher a term appears on the y-axis, the more strongly it is associated with the tagged posts compared to the non-tagged posts.

Table 11 Queries vs Baseline vs General English Corpus Comparison

Sentiment analysis for tagged posts

We ran the whole post universe through the Valence Aware Dictionary and sEntiment Reasoner (VADER) library, a lexicon and rule-based sentiment analysis tool. For each post, the tool provides a polarity score between -1 and 1. From the chart below, we can see that the sentiment score fluctuated a lot between negative and positive territory.

Sentiment over time
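To see why a lexicon-based polarity score stays between -1 and 1, here is a minimal sketch. It sums word valences and squashes the total with x / sqrt(x² + alpha), with alpha = 15, which is the normalization VADER applies to its compound score. Everything else is an assumption for illustration: the valence values and the function name are made up, and VADER's rules for negation, intensifiers, punctuation and capitalization are left out.

```python
import math
import re

# Illustrative valence values only; a real lexicon rates thousands of words.
VALENCE = {"good": 1.9, "great": 3.1, "bad": -2.5, "addicted": -1.5}


def compound_score(text, alpha=15.0):
    """Sum word valences and squash the total into the open interval (-1, 1)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = sum(VALENCE.get(tok, 0.0) for tok in tokens)
    return total / math.sqrt(total * total + alpha)


positive = compound_score("Great food, good service!")
negative = compound_score("Bad idea, he is addicted.")
neutral = compound_score("The forum thread is long.")
```

Because of the square-root squashing, very long posts with many sentiment words approach but never reach -1 or 1.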

To transform the sentiment into usable data for analysis, we smoothed it with a simple 50-period moving average. We can see that there are some cyclical patterns in the text. Polarity was negative at the end of 2012, positive in early 2012 and negative again at the start of 2013.

Sentiment over time — smoothed to 50 period
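A simple moving average of this kind can be sketched as below. The function name and the choice of a trailing window are assumptions; whether one ‘period’ is a post, a day or some other unit is not stated, so the sketch simply treats the input as an ordered list of scores.

```python
def moving_average(values, window=50):
    """Trailing simple moving average.

    The first window - 1 points are skipped because they lack a full window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just slid out of the window.
            running -= values[i - window]
        if i >= window - 1:
            smoothed.append(running / window)
    return smoothed


# Tiny example with a window of 3 instead of 50.
example = moving_average([1, 2, 3, 4, 5], window=3)
```

A larger window gives a smoother line but reacts more slowly to genuine shifts in sentiment, which is the trade-off behind picking 50 periods here and 100 periods for post volume.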

We then analyze the post volume over time for our universe. We can see a recency effect, as most of the collected posts came in from 2018 onwards.

Forum post volume
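Post volume over time can be obtained by bucketing post timestamps, for example by month. The sketch below assumes the timestamp layout shown in Table 3 (day, month, year, then a 12-hour clock) and monthly buckets; both are assumptions, as is the function name, and the third sample timestamp is made up.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, modelled on the values shown in Table 3.
TIMESTAMP_FORMAT = "%d-%m-%Y, %I:%M %p"


def monthly_volume(timestamps):
    """Count posts per calendar month, returned in chronological order."""
    counts = Counter()
    for raw in timestamps:
        # Table 3 renders dates with en dashes; normalise them to hyphens first.
        stamp = datetime.strptime(raw.replace("\u2013", "-"), TIMESTAMP_FORMAT)
        counts[stamp.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


volume = monthly_volume([
    "25\u201303\u20132013, 12:35 AM",  # first data point in Table 3
    "18\u201308\u20132018, 01:08 PM",  # final data point in Table 3
    "19-08-2018, 09:00 AM",            # made-up extra post
])
```

The resulting series of monthly counts is what a smoothing step would then be applied to.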

Likewise, we use a 100-period moving average to smooth out the fluctuations. We see a distinct spike in posts from 2018 onwards.

Forum post volume — smoothed to 100 period

Conclusion

1. We structurally mapped out the entire online conversation on ‘addiction’ related topics in our source using natural language processing

The high-frequency conversations centred around the themes of ‘carbs’, ‘crime’, ‘judge’, ‘hiv’. These were generally netizens’ discussions on news reports and controversial vice topics.

How to go on a healthy diet
Discussing the sentencing of criminal cases and reacting to the verdicts
How to reduce the odds of catching HIV when engaging in prostitution

For the medium-frequency conversations, there were mentions of ‘dpp’ (deputy public prosecutor), ‘gambling’, ‘ketogenic’, ‘pleasure’, ‘bail’, ‘chronic’, ‘girlfriend’ and ‘girl’. These reflect vice habits. Also interesting is that the person’s partner (i.e. girlfriend) is often mentioned in the conversation.

The low-frequency conversations that are nonetheless unique to ‘addiction’ had mentions of drugs like ‘morphine’, ‘heroin’, ‘ketamine’ and ‘sedative’.

There was some dissatisfaction that ‘kleptomania’, a mental disorder, is used as a get-out-of-jail-free card to mitigate convictions, whereas drug abusers have to face heavy punishments.
There were also conversations centred on the legalization of drugs in jurisdictions around the world and on other forms of addiction, such as sex and the internet.

Curious mentions are:

Are painkillers like Tramadol, prescribed for legitimate concerns, unintentionally causing drug addiction in patients?
Anabolic steroids are mentioned as being prevalent amongst bodybuilders and physical trainers. This group has a career concern to look their best physically. There is some doubt about what is appropriate. Public institutes can do some outreach and shed light on what is best for their well-being.
Among students, the pressure to perform has become even more paramount. There were mentions of nootropics being used and described as a ‘seductive’ drug. Other alternatives mentioned for enhancing study performance were omega and fish oil.

Actionable solution

Look into the curious mentions of unintended addiction, steroids and study drug uses

2. Tackle the addiction problem when it is early and easy

The 7 stages of an addiction are:

Initiation. People try substances for the first time before adulthood, often out of curiosity.
Experimentation. Substances are used in specific situations that are associated with fun, unwinding and a lack of consequences.
Regular use. A predictable pattern of substance use develops, and substances are sometimes used alone.
Risky use. The use of substances starts to threaten one’s own or others’ safety.
Dependence. The body develops tolerance, and physical and psychological dependence sets in.
Substance use disorder. The user loses control over the use of substances.
Treatment. The combination of detox, behavioral therapy and medication.

Rehabilitated drug users have shared online anecdotes with their own advice on dealing with addiction. Many mentioned:

Quit at the early stage when it is easiest
Get family support early before inevitably burning bridges later
Get involved in Alcoholics Anonymous (AA) / National Addictions Management Service (NAMS, in Singapore’s context) meetings early

We put this coping framework in the context of Singapore. In recent times, it has been reported that people start experimenting with drugs at a young age and that youths are generally more brazen now.

Actionable solution

Create a localized wellness site optimized for youth engagement, to deal with early stage addiction issues.

The site can share best practice methods in dealing with common issues like peer pressure, stresses faced at work or school. These are often a precursor to substance abuse.

Google is a good example of healthy redirection: searches on ‘how to commit suicide’ are redirected to Samaritans of Singapore (SOS).

Figure 1 SEO optimized redirection link

3. Singapore drug abusers relish anonymity in dealing with their demons

Drug abusers are going overseas to kick their habit, often paying a hefty fee for their rehabilitation packages. The reason is that they are fearful of seeing a doctor in Singapore because they might get arrested. By law, doctors in Singapore need to inform the Central Narcotics Bureau (CNB) if they are treating any patient for drug addiction.

To many, the cost of anonymous overseas rehabilitation centers is sky high. Many do not have the luxury and means to afford it. Unfortunately, the people who need the most help are the ones who are unable to pay for such access. They value anonymity in fixing their problems.

Actionable solution

Create an anonymous mobile app for addicts to freely share and learn about their problems.

Give addicts an environment of anonymity where they can seek help without fearing repercussions. This can be done through a convenient mobile platform: an application that allows addicts to post about their problems and receive help. There can be trained counselors or rehabilitating peers on the platform to bounce thoughts off.

For more posts like this, visit Latent App’s blog.

Written by Wen Jie Lee