Natural Language Processing (NLP) Will Transform the Traditional B2B Market Research Industry
- Xiao-Fei Zhang
Jul 28, 2017
AaaS vs. HiPPOs
A few years ago, there was an inside joke in my industry, namely market research for the IT industry: since everything is going to be “as a Service,” why not AaaS (Analysts as a Service)? We, the industry analysts, would all be out of jobs.
Luckily, we have dodged that bullet — at least for now.
It seems all the easy answers are a few taps or voice commands away: if we want to know where the closest Indian restaurant is, the best way to get there, India’s GDP growth, or anything else about India, we just Google it, or ask Siri or Alexa. How soon can you get it? Right away. And how much is it? Absolutely free.
For the hard questions, however, you’d still need to come to us, right? Yes, it will cost you an arm and a leg, and it may take a few hours, days, weeks, or even months to get some answers, but hey, where else are you going to find them? Can you really ask Google or Alexa “how much did IBM’s market share grow or shrink in the last 3 years in the XYZ market in ABC continent or country?” or “why is HCL doing better than some of the other Indian players in, say, the infrastructure business?”
So here comes another funny acronym (one of my favorites): HiPPO (highest paid person’s opinion). We (the analysts) are HiPPOs — certainly not the most expensive kind, but we are not cheap either. The first time I heard this one was at a keynote speech given at one of my company’s events. The topic was the machine age and how algorithms will replace HiPPOs. So here was an event hosted by a company full of experts, which paid an even bigger expert a lot of money to talk about how experts will be done away with by smarter machines. To put it bluntly, a HiPPO company hired another HiPPO to say that HiPPOs are done for. You have to appreciate the irony of it all.
Irony aside, the speech itself was brilliant, and this HiPPO has a point: if we keep peddling the same market research the same way and charging our clients the same price, our days are numbered. AaaS is coming! If we don’t do it, others will do it for us.
In this post, I will talk about how I think artificial intelligence (AI) will disrupt and transform our industry. Please note that the views expressed here are my personal views only.
Some AI technologies, I believe, while exciting, are still far from wide commercial adoption and/or primarily consumer focused. For example, deep neural networks for images and video have advanced by leaps and bounds, but we are still years away from fully autonomous driving, flying drones, and machine-generated live-action blockbuster films. Let’s leave that to Google, Amazon, Facebook, and others.
Instead, I will focus only on the language aspect of AI, namely natural language processing, or NLP.
What Is NLP and How Does It Work?
For those not familiar with machine learning or AI: natural language processing, commonly known as NLP, translates our speech (documents, sentences, and words) into numerical values, such as part of speech (noun, verb, adjective, subject, object, etc.), the position and sequence of words, and the distance between certain words, and runs them through computer programs to make sense of our speech computationally. Of course, this is quite a simplification. If you really want to learn more about NLP, I highly recommend the NLP course offered by Prof. Dan Jurafsky and Chris Manning at Stanford (Stanford University is a pioneer in NLP). The entire course is on YouTube.
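To make the idea of “translating speech into numerical values” concrete, here is a toy sketch of my own (not from the course): each sentence becomes a vector of word counts over a shared vocabulary, which a program can then compare and compute with.

```python
from collections import Counter

def vectorize(sentences):
    """Turn sentences into word-count vectors over a shared vocabulary."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    # One vector per sentence: how often each vocabulary word appears
    vectors = [[Counter(toks)[w] for w in vocab] for toks in tokenized]
    return vocab, vectors

vocab, vecs = vectorize(["the cloud is big", "the cloud computing market"])
```

Real NLP systems use far richer representations (parts of speech, word order, embeddings), but the principle is the same: text in, numbers out.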
Once machines can understand speech, they can also generate it. The comprehension part is also called Natural Language Understanding (NLU), and the generation part Natural Language Generation (NLG). For now, generation is still harder than understanding.
As you can see, NLP lets machines do “reading and comprehension at scale.” The sky is the limit when it comes to potential use-cases, and even with what we have so far, there are already some compelling ones. Common use-cases include:
- Sentiment analysis
- Topic extraction
- Content categorization/classification (a popular one)
- Text summarization
- Others
For example, we couldn’t do Google Translate without NLP. It also underpins chat-bots and virtual assistants (Google Home, Amazon Echo, etc.). Some weather reports, financial write-ups, and sports articles are already generated algorithmically. NLP can read hundreds or thousands of long reports and documents and create short synopses, something that would otherwise take tons of man-hours. Financial firms use NLP to analyze how Twitter feeds correlate with stock prices. There are many more.
To give you a better idea, let me walk you through a couple of quick use-cases to illustrate how it works.
Sentiment Analysis
Sentiment analysis focuses specifically on the polarity of our language: whether we like or dislike an object or topic, how much we like or dislike it, which parts of it we like or dislike, and so on.
Early sentiment analysis was primarily lexicon-based or rule-based (also considered a knowledge-based approach). For example, certain words such as “great” are assigned a higher rating (closer to 1.0) than “good” (above 0.5 but lower than 1.0), whereas “bad” or “terrible” are rated between 0 and -1. We can then quantify the positive or negative feelings expressed in a phrase, a sentence, or even a whole document. This requires building extensive “rule-books” first: a polarity dictionary, negation words, booster words, idioms, etc.
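A minimal sketch of such a lexicon-and-rules scorer, with made-up polarity values of my own (a real rule-book would be far larger and handle idioms, sarcasm, and scoping of negation):

```python
# Toy lexicon and rule-book; the numbers are illustrative, not from a real dictionary
LEXICON = {"great": 0.9, "good": 0.6, "bad": -0.6, "terrible": -0.9}
NEGATORS = {"not", "never", "no"}
BOOSTERS = {"very": 1.5, "really": 1.3}

def score(sentence):
    """Sum lexicon polarities, flipping sign after negators and scaling after boosters."""
    total, boost, negate = 0.0, 1.0, False
    for tok in sentence.lower().split():
        if tok in NEGATORS:
            negate = True
        elif tok in BOOSTERS:
            boost = BOOSTERS[tok]
        elif tok in LEXICON:
            val = LEXICON[tok] * boost
            total += -val if negate else val
            boost, negate = 1.0, False  # rules apply to the next polar word only
    return total
```

So “not good” scores negative and “very good” scores higher than plain “good,” exactly the kinds of rules that had to be mapped out by hand.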
We are now moving more towards the learning approach, or a combination of both. The first widely known research using ML for sentiment analysis (still widely cited today) was published in 2002: Bo Pang and Lillian Lee from Cornell University and Shivakumar Vaithyanathan from the IBM Almaden Research Center used the Internet Movie Database (IMDb) archive of movie reviews to train ML models. They used Naïve Bayes, Maximum Entropy (MaxEnt), and Support Vector Machine (SVM) models (I will not explain these models here; they are widely used, and there are plenty of resources online). The models somewhat accurately predicted movie ratings just by reading the reviews.
This approach was a huge step forward because we no longer have to manually map out all the “rules”; machines can do this at scale. They can also uncover latent aspects (i.e. different features of a product) and their ratings, which humans easily miss.
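To show what “learning the rules from labeled examples” looks like, here is a from-scratch multinomial Naïve Bayes (one of the models Pang, Lee, and Vaithyanathan used) trained on a handful of made-up reviews standing in for the IMDb corpus:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words counts."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        n = len(labels)
        self.priors = {c: math.log(labels.count(c) / n) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, lab in zip(docs, labels):
            self.counts[lab].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        for c in self.classes:
            lp = self.priors[c]
            for w in doc.lower().split():
                # Laplace smoothing keeps unseen words from zeroing out a class
                lp += math.log((self.counts[c][w] + 1)
                               / (self.totals[c] + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Made-up toy reviews in place of the real IMDb archive
reviews = ["a great film with great acting",
           "wonderful story and great cast",
           "a terrible boring movie",
           "boring plot and awful acting"]
labels = ["pos", "pos", "neg", "neg"]
clf = NaiveBayes().fit(reviews, labels)
```

No polarity dictionary is written anywhere; the word-class associations are estimated entirely from the labeled examples.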
News Classifier
A news classifier was the first machine learning (ML) project I commissioned. The concept is fairly simple and common: scrape the web for company news, winnow out the “noise” (information we don’t care about), and classify the rest into categories. Initially, there was only one type of information we needed, but we eventually built on top of that by adding more categories, such as executive appointments and M&A.
We decided to use open-source, supervised models (supervised means humans provide labeled examples and decide which features to analyze; for example, to predict housing prices, we use human knowledge to determine which variables, or features, such as size and neighborhood, are important). We collected a couple of thousand news articles (social media was not included initially) and categorized them by hand. We then used those to train different models.
We used Python and embedded the code into our own database application, which is also written in an open-source language. We narrowed the field to two models, Naïve Bayes and Support Vector Machine (SVM), and settled on SVM because it gave us the best results. When the tool was first deployed, we hit an accuracy rate of 70%-80%. The false positives (news items that were not relevant to us but made it through the model) were fed back into the model to help it “learn.” We eventually reached an accuracy of more than 90%.
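A rough sketch of that pipeline (hand-label, vectorize, train an SVM, predict on new items). The post names Python and SVM but not a specific library, so this sketch assumes scikit-learn, with made-up headlines and category names standing in for our actual training data:

```python
# Assumed library: scikit-learn (the post does not name the exact toolkit we used)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical hand-labeled headlines standing in for the scraped news corpus
headlines = [
    "Acme appoints new chief executive officer",
    "Globex names Jane Doe as CFO",
    "Acme to acquire Initech for $2 billion",
    "Globex completes merger with Umbrella Corp",
]
labels = ["exec_appointment", "exec_appointment", "m_and_a", "m_and_a"]

# TF-IDF turns each headline into a numeric vector; LinearSVC draws the boundary
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(headlines, labels)

prediction = clf.predict(["Globex to acquire Acme in merger"])[0]
```

In production, the same `predict` call runs over every scraped item, and misclassified items are added back to the labeled set for retraining, which is how the accuracy climbed over time.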
Although the model itself was simple, we learned a lot along the way. The biggest fringe benefit was learning how to normalize unstructured data (raw speech and text): tokenization, lemmatization, parsing, etc.
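As an illustration of that normalization step, here is a toy pipeline of my own (a real one would use a proper tokenizer and lemmatizer, such as NLTK’s WordNetLemmatizer, rather than this crude suffix-stripping):

```python
import re

def normalize(text):
    """Toy normalization: lowercase, tokenize, drop stop words, strip suffixes."""
    STOP = {"the", "a", "an", "and", "of", "to", "in"}
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if t not in STOP]
    # Naive stemming as a stand-in for real lemmatization
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

normalize("The company announced sweeping executive appointments")
```

Crude as it is, this shows why normalization matters: “announced” and “announcing” collapse to one token, so the model sees them as the same feature.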
NLP, Why Now?
NLP is one of the more mature fields of AI, partly because its early frameworks date back 60 or 70 years, to when linguists systematically broke down our language patterns and structures. However, it has only taken off recently, because all the forces are finally coming together, much as the first internal combustion engine was invented in the 1790s but Ford rolled out its Model T only in 1908.
More Data, Easy Data, and Right Data
There is more data because we are leaving more digital footprints behind. Access to data is also easier: web-scrapers are powerful and cheap, and major social media sites have APIs that allow others to mine user postings (not free, but very affordable).
To get things started, there are also plenty of data-sets online (often free) to help developers train ML models. There is also a misconception that when it comes to data, more is always better. Yes, if you are using deep neural networks, you do need a lot of data. But often the right kind of data is more important. For example, an AI vendor told me recently that to train their virtual agent, they needed only one to two hundred hours of transcripts of live contact-center agent/customer calls. The key was using calls only from the best agents.
In NLP, the more specialized and narrow the topic, the more accurate the model. For example, a chat-bot that deals with auto-insurance claims is a lot easier to build than a general-purpose bot. Think of it like this: a specific topic has its own jargon and unique word combinations, so it is easier to find patterns. This is a huge plus for industry analyst firms. Take our industry: how often does the word “computing” follow the word “cloud”? Probably 99% of the time. But if you randomly meet someone on the street, “cloud” can be followed by almost anything: cloud nine, cloud in the sky, cloud on the mountain, you name it…
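The “cloud computing” intuition is just a conditional word-frequency estimate, and it is easy to sketch. The mini-corpus below is made up to mimic industry-specific text:

```python
from collections import Counter

def next_word_probs(text, word):
    """Estimate P(next word | word) from adjacent word pairs in a text."""
    tokens = text.lower().split()
    counts = Counter(b for a, b in zip(tokens, tokens[1:]) if a == word)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Made-up mini-corpus standing in for a pile of industry documents
corpus = ("cloud computing adoption grows as cloud computing vendors cut "
          "prices while cloud services expand")
probs = next_word_probs(corpus, "cloud")
```

In a narrow corpus the distribution after “cloud” is sharply peaked, which is exactly what makes domain-specific models easier to get right than general-purpose ones.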
Smarter and Cheaper Tools
The days of relying solely on relational databases are over. We now have machine learning (ML) and natural language processing (NLP) tools to tackle unstructured data. There is no need to re-invent the wheel: the building blocks are already there, and you just need to know where to look for them.
The AI development community has taken to the open-source approach of freely sharing prepackaged ML code. Many tools and training data-sets are provided by research institutions (e.g. Stanford), developer communities, and large enterprises. “We are not Google” is no longer a good excuse for IT departments, because Google has opened its AI platform to the public via APIs and libraries (TensorFlow).
Even in the proprietary software camp, such as IBM Watson, AI tools are API-based and relatively affordable.
Easier to Learn
When I first embarked on ML a year and a half ago, I had no budget for data scientists. This was new for my developers too — they are from an external provider based in India. The company had done some ML projects for another client but couldn’t spare an expert for mine. It was a steep learning curve, but we finally made progress. On the advice of a marketing analytics startup’s CTO I met in San Francisco, I pulled one developer off current projects and sent him “back to school” for a few months. His “school” was the web (Coursera, Udacity, YouTube, Google, GitHub, etc.). I myself can’t code to save my life and have never taken an ML class on any campus, but I still learned tremendously through online courses and YouTube videos.
Yes, it helps to be very smart or to have a PhD in math from MIT, Harvard, Princeton, or Stanford, but to get the ball rolling and build quick use-cases, you don’t need either. There are plenty of materials online. You just have to have the curiosity and put in the hard work.
Why Does It Matter to Market Research?
Clients’ expectations for market research have changed: they want a more Google-like experience; the days of “waiting for months and spending hundreds of thousands of dollars for that single version of the truth” are over. Nowadays, clients want good-enough but relevant insights, and they want them quickly and on the fly. If traditional market research firms can’t deliver, clients will find someone else who can.
Because most traditional market research firms are still purely labor-based, they simply can’t scale like a technology company. With NLP, that becomes possible.
An Alternative to Traditional Primary Market Research
Traditional market research is composed of primary and secondary research.
Primary research essentially includes quant and qual:
Traditional quantitative research (quant), namely surveys, depends on asking predetermined, structured questions with multiple-choice answers and rating scales. With more sensitive topics, sampling bias (e.g. self-selection bias) and response bias (respondents not revealing their true beliefs) can render results unreliable. It also misses valuable insights from unstructured data, such as respondents’ written answers, which offer richer insights. Pollsters’ failure to predict Donald Trump’s 2016 election win was a case in point.
Qualitative research (focus groups, interviews, etc.) provides better insights but cannot scale. Data-gathering costs are exorbitant, sample sizes are too small, and, more importantly, study results come in unstructured formats. Pulling insights from transcripts is a manual process: expensive and slow. A professionally conducted focus group usually runs between a few hundred and a couple of thousand dollars per subject.
NLP can offer new alternatives, or at least complement the old methods. For example, sentiment analysis can gather hundreds, thousands, or even millions of opinions in a matter of minutes or hours. It can also uncover latent patterns easily missed by humans. Last year, an AI system called MogIA, developed by the Indian start-up Genic.ai, accurately predicted Trump’s presidential win by analyzing 20 million data points from online platforms like Twitter, Google, and others.
Automate Secondary Research
Secondary research means pulling insights from other people’s work (news articles, press releases, case-studies, speeches, videos, social media, etc.). Nowadays, this is mostly about searching the web. It probably accounts for at least half of what we do. And it is not just us: knowledge workers in general spend much of their day searching the web.
While Google has made search easy and quick, this is still a manual process. The more information we need to find, the more bodies we need to throw at it. We can move this process to low-cost locations such as India, but simply adding more and more people will eventually become untenable. Nor does adding more people necessarily make searching faster or more accurate.
Yes, building more automation tools can alleviate some of the pains of repetitive processes. But existing systems, such as Robotic Process Automation (RPA) — at least most current RPA tools — are still largely rule-based. They are good with structured data but often fall short with unstructured data. When we do secondary research, we still need human intuition to deal with the nuances of language, content, and context; a purely rule-based system won’t work. This is where I think investing in NLP will deliver the biggest bang for the buck. What NLP offers is “reading and comprehension” at scale: machines can read hundreds, thousands, or even millions of documents at once, pull out only the ones we need, or just the parts we need, and present them in the format we want. I have had start-ups contact me promising to automate 60% to 70% of what we do. While I don’t think the technology is quite there yet, I do think current NLP technology can automate as much as 20% to 30% of our existing processes.
Leverage Dark Data
Another source of valuable insights is our client interactions (face-to-face meetings, calls, and emails). We share our knowledge with clients, but we also learn from them: they are the pulse of the market. Most of the knowledge we acquire sits in our heads. A portion does get recorded, but mostly as unstructured data scattered across our PCs, emails, CRM systems, and even notebooks. This is what is commonly known as dark data.
So why don’t we leverage it? Because unstructured data is messy and noisy, and useless unless you can easily turn it into structured data. Until now, there were no such tools; with NLP, that is changing. Obviously, this is exploratory. You may need to try a few different projects until you hit the right one.
Where to Start (Different Approaches)
As to how to build a broader NLP capability, you can build, hire, buy/rent, or acquire.
Build Our Own by Leveraging Open Source
Because a large number of open-source ML and NLP tools and APIs are available, either free or relatively cheap, you can build this capability on your own. To be precise, this is a bespoke approach: you are simply building a chassis and pulling in different tools from different sources. The cost will mostly be in design and integration.
You can use your internal team, external providers, or a combination of both. There are start-ups that specialize in designing and building NLP-related products and can help you. If this is your first time doing this and your business unit leaders are not well versed in NLP or machine learning, it’s best to pick a company with a significant on-shore presence: you will want same-time-zone interactions as much as possible, and, while labor costs are high here, the US is also the leader in AI.
Pros & Cons
Here are some pros and cons:
Pros:
- You will own this capability
- More customization and flexibility
- More hand-holding in design phase
- Not locked into someone’s proprietary system — NLP/ML field is changing fast and new free/cheap tools are coming out every day
- Cost: this is a one-time build cost and you can scope it up or down; license costs are low because APIs are either free or very cheap
Cons:
- Requires you to be more hands-on and build some internal NLP knowledge base
- Still takes months to stand up
- Hidden costs: maintenance, upgrade/enhancement development cost in the future (not just labor cost but also business disruption cost)
Rent Third Party Proprietary Platforms or Partner with AI Companies
Another approach is to use a non-competing third-party vendor to supply the data and analysis. For example, there are quite a few social analytics software vendors out there.
Most of these companies already have a long list of data sources, including Twitter (Firehose), Facebook, Instagram, forums, Google+, news sites, etc. They will also add more sources if you ask. Keep in mind that most of these sources are open and easy to access (either free or cheap to access or scrape), but the advantage is that such a vendor can pull everything together into one channel. Some of them also host historical data.
Their solutions tend to be off the shelf, though some customization is allowed. Some also “rent out” their algorithms, letting you plug your internal data into their apps to be analyzed; however, that is more the exception than the rule. Also, because B2C companies (consumer electronics, retail, etc.) are way ahead of B2B companies, most social analytics firms are geared more towards B2C clients.
Pros & Cons
Pros:
- Off-the-shelf: you don’t have to build it from scratch
- On-boarding process is fast and easy
- Requires minimal internal resources and talent to manage the project
Cons:
- Cost: recurring yearly subscription cost
- Locks you into their proprietary system
- Little room to customize
- You depend solely on the vendor’s ability to upgrade and innovate: the NLP/ML field is progressing fast, and you may miss out on new free/cheap tools
Go Acquire an AI Company
If you have the cash, the quickest way is to acquire an AI company. Ideally the target should have:
- A roster of (ideally) B2B clients (proof that they know how to sell to large enterprises)
- A proven solution
- A strong R&D team
Obviously, given how hot AI has become in the last couple of years, getting something on the cheap will be difficult. Another risk is how well the target will fit into your own culture. More importantly, if you don’t have enough experience with NLP or machine learning, it will be hard to gauge the true value of an AI start-up.
Final Thoughts
In my opinion, the best approach is to start building your own capability, using both an internal team and external providers. You need to acquire internal capabilities (understanding NLP at least on a conceptual level) so you can tell what is easy, what is a need-to-have, what is a nice-to-have, and what is pure science fiction. I recommend combining internal and external resources because, in my experience, it takes more than a few months to train your own people on NLP; having outside start-ups hold your hand will shave months off design and use-case finding.
Once your organization gets its feet wet in NLP, you can then explore the buy/rent/partner and even acquisition options. Without some working experience with NLP, it is difficult to figure out what is possible and what is not.
