Machine Learning 101 for Startups — Part 4

ML Visualisation and Data Analysis based on Amazon QnA Dataset

Tanya Thakur
Kontiki AI
Mar 9, 2018

This is the fourth post in a series on ML techniques that help startups and businesses with decision making. The target audience for this post is business folks looking to understand how they can leverage Machine Learning in their business or product.

Today, data is the game changer, and if used properly, it leads to better decision making. The growing quantity and quality of available data help us build new insights and make predictions about what can and what will happen in the future.

We, at Kontiki Labs, know the importance of data and which machine learning operations, such as entity extraction and clustering, should be performed on it to help your business grow.

The aim is to gain insights and extract relevant topics from data to detect what your customers are saying about your products.

What is Machine Learning in the real world, and what can be done using it?

In the real world, Machine Learning is an approach in which computing methods learn patterns from data, and those patterns can be put to work to help an enterprise grow.

In this post, we will see how to get the most out of the Amazon QnA dataset using visualisation tools powered by KNN and clustering, and how the results can benefit a business.

Note: We have done our best to dig deep into the questions about electronic items answered by Amazon’s customers up to 2016. Our only limitation is the permission to access the Amazon ASINs (product IDs), but this does not affect the output retrieved or the positive impact our ML algorithm can have on your business.

The Amazon Data set

This Amazon question/answer dataset contains an enormous set of question and answer data from Amazon, around 1.4 million answered questions in total. Each record includes:

  1. ASIN: ID of the product.
  2. questionType: type of question, which can be ‘yes/no’ or ‘open-ended’.
  3. answerType: type of answer, which can be ‘Y’, ‘N’, or ‘?’ (if the polarity of the answer could not be predicted). Only present for yes/no questions.
  4. question: question text.
  5. answer: answer text.

In the next steps, you will see how our developers applied ML algorithms to the electronics product data and reshaped it into various formats for different business use cases.

The Data Cleanup

The data preparation process involves converting the ZIP file to JSON and then to CSV.

Once the ZIP file has been converted to JSON, we append the records to a CSV file as we go instead of holding the whole dataset in RAM. This keeps the local system from being overwhelmed by the massive dataset.
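To make the streaming idea concrete, here is a minimal sketch of writing records to a CSV one at a time. It is an illustration rather than the exact code used for this study: the function names `parse_record` and `stream_to_csv` and the column names in `FIELDS` (including the lowercase `asin` key) are assumptions, and it assumes the decompressed data holds one record per line, either as strict JSON or as a Python-style dict literal.

```python
import ast
import csv
import json

# Assumed column names, based on the fields listed for the dataset.
FIELDS = ["asin", "questionType", "answerType", "question", "answer"]


def parse_record(line):
    """Parse one line: try strict JSON first, then a Python-literal dict."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return ast.literal_eval(line)


def stream_to_csv(lines, csv_path, fields=FIELDS):
    """Write records to csv_path one by one and return the number of rows.

    Only the current line is held in memory, never the whole dataset.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        # Unknown keys are ignored; missing keys become empty cells.
        writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            writer.writerow(parse_record(line))
            count += 1
    return count
```

In practice, `lines` could be a file handle opened over the decompressed data, so the loop reads and writes one record at a time.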

Extracting text from ‘Read More’ till ‘Show Less’

The data cleanup process also required removing special characters, such as semicolons and smileys, from customers’ queries and answers.
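A cleanup step of this kind could look roughly like the sketch below. The patterns are assumptions chosen for illustration (which characters count as "special" depends on your data), and `clean_text` is a hypothetical helper name, not a function from the original code.

```python
import re

# Standalone text smileys such as :) ;-) :D (assumed pattern).
EMOTICON_RE = re.compile(r"(?<!\S)[:;=]-?[)(DPp](?!\S)")
# Anything that is not a letter, digit, whitespace or basic punctuation.
NON_TEXT_RE = re.compile(r"[^A-Za-z0-9\s.,?!'\-]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove smileys and special characters, then collapse extra spaces."""
    text = EMOTICON_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()
```

For example, `clean_text("Does it work; yes :) great!!")` drops the semicolon and the smiley and keeps the words and the exclamation marks.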

Running the Code: Output Analyses

Analysis 1: The question set comparison

Why Compare?
Product reviews and user comments not only help build trust between your brand and your customers but also give your store long-term exposure. Your new customer base depends on the satisfaction level of your existing customer base, because:

70% of customers consult reviews or ratings before making a final purchase. (PeopleClaim)

To get the most out of both reviews and comments, an organisation should know what types of queries customers have and how other customers like to respond to them.

What we did?
To help you understand the customer mindset, we analysed the two types of questions in the dataset: yes/no and open-ended.

The Conclusion
The conclusion is the pie chart below.

The blue sector represents open-ended questions, while the orange sector represents yes/no questions.

The blue sector covers more than 75% of the circle, which indicates that the majority of answered questions were open-ended rather than a simple yes or no.
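The split behind a pie chart like this can be computed in a few lines. The sketch below assumes you already have the `questionType` value of every record in a list; `question_type_shares` and `plot_shares` are hypothetical names, and the plotting helper assumes matplotlib is installed.

```python
from collections import Counter


def question_type_shares(question_types):
    """Return {question type: percentage of all questions}."""
    counts = Counter(question_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


def plot_shares(shares, path="question_types.png"):
    """Save a pie chart of the shares to path (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.pie(list(shares.values()), labels=list(shares.keys()), autopct="%1.1f%%")
    ax.set_title("Question types")
    fig.savefig(path)
    plt.close(fig)
```

Keeping the counting separate from the plotting means the percentages can be checked or reused in a report without drawing anything.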

Analysis 2: Understanding Customers’ Sentiments

Why Compare?
Building a business is a journey that should never end, and along the way comes the responsibility of honouring your commitment to customers by delivering high-quality work.
A shopkeeper in a physical store has a clear understanding of what his customers want, as they present verbal queries and suggestions, but what about an online store?
To bring the same experience online, we implemented sentiment analysis on all the questions and answers posted by Amazon’s customers.

What we did?
We implemented sentiment analysis on all questions and answers posted by users for Amazon electronic products. With this, we inferred each writer’s opinion on the specific topic and their polarity towards the product. The output told us whether the response to a product was positive, negative or neutral.

The Conclusion
The outcome of the analysis on question and answer set is in the pie chart below.

Question Sentiment vs Answer Sentiment

The motivation: Comparing the blue sectors in the two charts above shows that most of the neutral questions received a positive response from customers.
The bandwidth allocation: The ‘very negative’ and ‘negative’ sectors in the pie chart show where the problems are. With this, a leader can easily identify the areas that need attention and allot man-hours to them for better results in the future.
The revenue generator: The ‘very positive’ and ‘positive’ sectors help the organisation learn which products are creating real buzz in the market and how it can invest more in them to increase company sales.
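To show how texts end up in sectors such as ‘very negative’ or ‘positive’, here is a minimal sketch. It does not reproduce the sentiment model used for the charts above: a toy word-list scorer stands in for it, and the word lists, the thresholds in `to_label` and all function names are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Tiny illustrative word lists (an assumption; a real model would replace them).
POSITIVE = {"good", "great", "love", "excellent", "works", "perfect"}
NEGATIVE = {"bad", "broken", "poor", "terrible", "hate", "fails"}


def toy_polarity(text):
    """Score in [-1, 1]: (positive hits - negative hits) / all hits."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def to_label(score):
    """Map a polarity score to one of five sentiment labels (assumed cut-offs)."""
    if score <= -0.6:
        return "very negative"
    if score < -0.2:
        return "negative"
    if score <= 0.2:
        return "neutral"
    if score < 0.6:
        return "positive"
    return "very positive"


def sentiment_distribution(texts, scorer=toy_polarity):
    """Count how many texts fall under each sentiment label."""
    return Counter(to_label(scorer(t)) for t in texts)
```

Because `scorer` is a parameter, any function that returns a score between -1 and 1 can be swapped in. Running `sentiment_distribution` once on the questions and once on the answers gives the two sets of counts needed for a side-by-side comparison.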

Analysis 3: The Rating Predictions

Why Compare?
Ratings not only help customers with their purchase decisions but also drive more qualified buyers towards your product, boosting your sales by a significant amount.
The ads which appear online also play a significant role because they represent aggregated rating data for the product gathered from varied sources including third-party review aggregators, merchants, editorial sites, and consumers.

What we did?
We ran our algorithm to see how many answers for electronic products fall into each rating bracket.

The Conclusion
The conclusion is the bar chart below.

The product rating bar graph

The Y-axis represents the total number of answers, while the X-axis represents the product rating. So, the blue bar indicates that more than 50 thousand answers have a product rating in the bracket [0.9,1.0].

With these results, it becomes evident which products are the crowd-pullers. This can help the content management team, as they will know in which areas to build their content and how to target the right audience for company profits.
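The bracket counts behind such a bar chart can be sketched as follows. This assumes each answer already carries a predicted rating between 0 and 1 (how that rating is produced is outside this sketch), and `rating_bracket` and `answers_per_bracket` are hypothetical helper names.

```python
from collections import Counter


def rating_bracket(rating):
    """Return the 0.1-wide bracket label, e.g. '[0.9,1.0]', for a rating in [0, 1]."""
    if not 0.0 <= rating <= 1.0:
        raise ValueError(f"rating must be in [0, 1], got {rating}")
    # A rating of exactly 1.0 joins the top bracket instead of opening an eleventh one.
    index = min(int(rating * 10), 9)
    return f"[{index / 10:.1f},{(index + 1) / 10:.1f}]"


def answers_per_bracket(ratings):
    """Count how many answers fall into each bracket, lowest bracket first."""
    counts = Counter(rating_bracket(r) for r in ratings)
    return dict(sorted(counts.items()))
```

The resulting dictionary maps each bracket on the X-axis to the number of answers on the Y-axis, so it can be passed straight to a bar-chart function.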

Analysis 4: Get the product report

Why Compare?
In a highly ambitious global marketplace, the pressure on organisations to find new ways to deliver the best to their customers grows ever stronger. The increasing need for the industry to compete with its products in a global market across quality and service dimensions has given rise to the need to develop more dynamic inventory strategies.
Therefore, it becomes important for an organisation to detect shifts in demand in good time, and to know in which month or time slot customers show maximum interest in its products.

What we did?
Our study looked at each month, and then at each day within a month, to find the period in which customers showed maximum interest in the products.

The Conclusion
The conclusion is the bar chart below, which shows the peak times when customers were most active.

Monthly Comparison & Comparing on annual basis

The result can be a big boon for businesses, as it tells them at what time of the year they need maximum employee strength.
From the managerial side, an HR or support management team can judge in advance the time of year when the company will need mass hiring.

The graph highlights January and December, both of which are approaching the 10 thousand mark. This is a signal to the sales department to react quickly to changes in sales demand or inventory stock, and to respond faster to customer requirements.
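A month-by-month count of this kind can be sketched as below. It assumes every record carries a Unix timestamp (the field that holds it is not named in this post, so treat it as hypothetical), and `monthly_counts` and `busiest_month` are illustrative names.

```python
import calendar
from collections import Counter
from datetime import datetime, timezone


def monthly_counts(unix_times):
    """Return {month name: number of records} for all twelve months."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).month for ts in unix_times
    )
    # Include months with no activity so the chart always has twelve bars.
    return {calendar.month_name[m]: counts.get(m, 0) for m in range(1, 13)}


def busiest_month(unix_times):
    """Return the month with the most records (earliest month wins a tie)."""
    counts = monthly_counts(unix_times)
    return max(counts, key=counts.get)
```

The same pattern works for the day-level view: replace `.month` with `.day` and count over the records of a single month.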

The code gist

Below is an example prepared by our developers. The code will help you in collecting all the results explained above.

Conclusion

The study identified some important benefits that machine learning and data analysis can bring to building an efficient business. It suggests how we can dig into our existing dataset and derive approaches that are progressive and customer-centric. To bring the same experience to your business, we, at Kontiki Labs, have provided the code above for ML visualisation based on the Amazon QnA dataset, which you can leverage to monitor your business stats.

At Kontikilabs, we help our customers build such custom machine learning and deep learning tools, with a focus on the data available, the problems to be solved and the end-user objectives. You can connect with us on Twitter or by email: hi[at]kontikilabs[dot]com

Tanya Thakur
Kontiki AI

Developer Evangelist & Engineer | Community and Content Lead @kontikilabs