Exploring the Pulse of Opinion: A Dive into Sentiment Analysis through Reviews
The rapid development of technology around the world has had many effects, one of which is the emergence of online buying and selling sites, popularly known as electronic commerce. Electronic commerce, better known as e-commerce, is the activity of buying and selling goods and services, or transmitting funds or data, using electronic devices connected to the internet.
Today, consumers prefer e-commerce because they do not need to visit a physical store, and the number of e-commerce users continues to increase every year. As more people in Indonesia use e-commerce, more online crime occurs, so an analysis of the online market based on consumer assessments of the products sold is needed. To address this problem, consumer reviews of products on e-commerce sites can help other consumers be more careful when making online transactions.
The information contained in product reviews is valuable and, once processed through text mining, can be used as a policy-making tool. Text mining is more effective when the data is managed and processed with the help of software. Sentiment analysis is then carried out, which groups the text in a sentence or document by polarity to find out whether the opinion expressed is positive or negative.
Several methods can be used for this processing, such as Support Vector Machine, Decision Tree, Naïve Bayes, and K-Nearest Neighbors. Of these, Support Vector Machine is one of the methods that produces the best accuracy. This accuracy is reported in previous research on text classification, including news topic classification using Support Vector Machine. That study concludes that Support Vector Machine provides the most accurate predictions compared with other methods, with an accuracy of 92.24% [1]. Support Vector Machine has also been used in research on the classification of wood species and on classifying a person’s facial characteristics, with an accuracy of 90% for the true detection level [2]. On a movie review dataset, researchers used 40,000 reviews to train the classifier and 10,000 reviews to test model performance. They found that classification accuracy increases as the feature size increases, and Linear Support Vector Machine (LSVM) achieved an accuracy of 89.91% [3].
By applying Support Vector Machine, we expect to get the best final result and to see in which direction public sentiment leans. Sentiment analysis is a computational study of opinions that individuals express in the form of text. Based on this background, this article aims to classify product reviews and measure the accuracy of the sentiment analysis.
In this article, there are 7 (seven) steps to be carried out: data collection, data labeling, split the data, preprocess the data, feature extraction, model evaluation, and website integration. The discussion of each step is as follows:
Data collection
The data used as input in this system is a review dataset obtained from product reviews, such as from one of the largest e-commerce sites in the world.
Data labeling
After the data is collected, the next step is to label it. In this step, each review is assigned a label based on its rating value: a rating of 4 or 5 indicates the positive class, and a rating of 3 or below indicates the negative class.
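The labeling rule can be sketched in a few lines of Python. This is a minimal illustration, not the article's actual code: the sample reviews and the field names "text" and "rating" are assumptions made for the example.

```python
# Hypothetical sample reviews; the field names "text" and "rating" are assumptions.
reviews = [
    {"text": "Great product, works as described", "rating": 5},
    {"text": "Good value for the price", "rating": 4},
    {"text": "Average, nothing special", "rating": 3},
    {"text": "Stopped working after a week", "rating": 1},
]


def label_from_rating(rating: int) -> str:
    """Map a rating to a class: 4 or 5 is positive, 3 and below is negative."""
    return "positive" if rating >= 4 else "negative"


# Attach a label to every review based on its rating.
for review in reviews:
    review["label"] = label_from_rating(review["rating"])

labels = [review["label"] for review in reviews]
print(labels)
```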
Split the data
The next step is to split the review data into training data and testing data. For example, 70% of the total review data can be used as training data and 30% as testing data. The training and testing data are selected randomly so that the proportion of product review categories stays balanced.
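One way to perform such a split is scikit-learn's train_test_split. The sketch below uses placeholder data, and the stratify argument is an assumption added here to keep the class proportions equal in both parts; the article itself only states that the split is random.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 20 hypothetical reviews, half positive and half negative.
texts = [f"review {i}" for i in range(20)]
labels = ["positive"] * 10 + ["negative"] * 10

# 70% training data, 30% testing data, chosen at random.
# stratify=labels (an assumption) keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
```

The fixed random_state only makes the example reproducible; any seed can be used.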
Preprocess the data
After the data is split, the preprocessing step is carried out so that the data is ready to be processed in the next step. The preprocessing step consists of data preparation, tokenization, data cleansing, case folding, and stop words removal. The discussion of each step is as follows:
A. Data preparation, which prepares the data to be processed in later steps.
B. Tokenization, which breaks sentences into tokens or single words.
For example:
Input: Very oily and creamy. Not at all what expected…
Output: ‘very’, ‘oily’, ‘and’, ‘creamy’, ‘not’, ‘at’, ‘all’, ‘what’, ‘expected’
C. Data cleansing, which turns dirty data into quality data so that it can produce accurate information.
For example:
Input: ‘very’, ‘oily’, ‘and’, ‘creamy’, ‘not’, ‘at’, ‘all’, ‘what’, ‘expected’
Output: very oily and creamy not at all what expected
D. Case folding, which changes all letters in the review text to lowercase; other characters can be removed.
For example:
Input: very oily and creamy not at all what expected
Output: very oily and creamy not at all what expected
E. Stop words removal, which removes a collection of words that are considered meaningless.
For example:
Input: very oily and creamy not at all what expected
Output: ‘oily’, ‘creamy’, ‘expected’
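The preprocessing steps above can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the helper names are hypothetical, the cleansing rule (keep letters only) is one possible choice, and the stop-word set is a short hand-picked list chosen to reproduce the example, not a complete stop-word list.

```python
import re

# Small illustrative stop-word list (an assumption); real projects use a fuller list.
STOP_WORDS = {"very", "and", "not", "at", "all", "what", "the", "a", "is", "it"}


def tokenize(text):
    """Break a sentence into tokens on whitespace."""
    return text.split()


def clean(tokens):
    """Strip every character that is not a letter and drop empty tokens."""
    stripped = [re.sub(r"[^A-Za-z]", "", token) for token in tokens]
    return [token for token in stripped if token]


def case_fold(tokens):
    """Convert every token to lowercase."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text):
    """Run the steps in the order described above."""
    return remove_stop_words(case_fold(clean(tokenize(text))))


result = preprocess("Very oily and creamy. Not at all what expected…")
print(result)
```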
Feature extraction
In this step, the text is converted into numeric form because the classification method operates only on numeric data. Feature extraction is used to explore potential information and represent words as a feature vector. This feature vector will be used as input for the classification method in the next step. Feature extraction in this article uses TF-IDF (term frequency-inverse document frequency), which assesses how important a word is in a document. To perform TF-IDF calculations, you can use the Python library scikit-learn (sklearn), specifically TfidfVectorizer().
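As a minimal sketch of this step, the snippet below fits TfidfVectorizer on three short, made-up documents using its default settings. The documents are assumptions for illustration; the article does not state which vectorizer parameters were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical preprocessed reviews, one string per document.
docs = [
    "oily creamy expected",
    "great product fast delivery",
    "great price",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Each column corresponds to one word of the learned vocabulary.
print(sorted(vectorizer.vocabulary_))
print(X.shape)

# A word that occurs in more documents gets a lower IDF weight.
idf_great = vectorizer.idf_[vectorizer.vocabulary_["great"]]
idf_oily = vectorizer.idf_[vectorizer.vocabulary_["oily"]]
print(idf_great, idf_oily)
```

Here "great" appears in two documents and "oily" in only one, so "oily" receives the higher IDF weight.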
Model evaluation
The classification results are then evaluated to obtain an accuracy value, which shows whether the classification model is feasible or not. To describe the performance of the classification model, we can use a 2x2 confusion matrix. The classification process using the Support Vector Machine begins with the training data. In this article, the classification process uses the Support Vector Machine implementation in the Python library scikit-learn (sklearn). This is a binary classification with 2 (two) categories, specifically the positive class and the negative class of product reviews in e-commerce.
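A training run of this kind might look like the sketch below. It assumes LinearSVC as the Support Vector Machine variant and chains it with TfidfVectorizer in a pipeline; the article does not say which scikit-learn SVM class was used, and the training sentences are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical training set; a real model needs far more reviews.
train_texts = [
    "great product love it",
    "excellent quality fast delivery",
    "love the quality great price",
    "terrible product waste of money",
    "bad quality broke quickly",
    "awful waste terrible service",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

# TF-IDF features feed directly into a linear SVM (LinearSVC is an assumption).
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

predictions = model.predict(["great quality love it", "terrible waste of money"])
print(list(predictions))
```

Wrapping both steps in one pipeline means the same vocabulary learned during training is applied to new reviews at prediction time.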
In a 2x2 confusion matrix, there are 4 (four) possible cases. TP (True Positive) is positive-class data that is correctly predicted as positive. TN (True Negative) is negative-class data that is correctly predicted as negative. FP (False Positive) is negative-class data that is predicted as positive. FN (False Negative) is positive-class data that is predicted as negative.
Based on the values of TP, FP, FN, and TN for the positive class and the negative class, the values of precision, recall, and specificity can be calculated.
In addition to precision, recall, and specificity, the overall accuracy of the product review classification model using the Support Vector Machine is calculated as the sum of TP and TN divided by the total number of test data. If the resulting accuracy is above 80%, it can be concluded that the product review classification model using the Support Vector Machine method works well.
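The sketch below shows how these values can be derived from a confusion matrix, using the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / total. The true and predicted labels are made up for illustration and are not results from this article.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test labels and predictions (not the article's results).
y_true = ["positive"] * 7 + ["negative"] * 5
y_pred = ["positive"] * 6 + ["negative"] * 1 + ["negative"] * 4 + ["positive"] * 1

# With labels ordered [negative, positive], ravel() yields TN, FP, FN, TP.
matrix = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
tn, fp, fn, tp = (int(value) for value in matrix.ravel())

precision = tp / (tp + fp)    # share of predicted positives that are correct
recall = tp / (tp + fn)       # share of actual positives that are found
specificity = tn / (tn + fp)  # share of actual negatives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tp, fp, fn, tn)
print(precision, recall, specificity, accuracy)
```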
Website integration
The last step is website integration: the system created to process the data is integrated with a user interface to make it more accessible to users. Through the website, users can see the results of sentiment analysis based on reviews of e-commerce products. In this step, we can use Flask and pickle as a medium to connect the program and the designed website. The website integration process begins with entering sentences.
Then the system will classify them into positive or negative classes. The results of website integration can be seen in the figure below.
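A minimal sketch of how Flask and pickle could fit together is shown below. It is not the article's actual website code: the /predict route, the JSON field names, and the tiny stand-in model are all assumptions, and the model is pickled to bytes in memory only to keep the example self-contained (in practice it would typically be saved to and loaded from a file).

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Train a tiny stand-in model on hypothetical data (assumption for the sketch).
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(
    [
        "great product love it",
        "excellent quality fast delivery",
        "love the quality great price",
        "terrible product waste of money",
        "bad quality broke quickly",
        "awful waste terrible service",
    ],
    ["positive"] * 3 + ["negative"] * 3,
)

# pickle serializes the trained model; the website side restores it.
model_bytes = pickle.dumps(model)
loaded_model = pickle.loads(model_bytes)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    """Classify the sentence sent in the JSON field "text"."""
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text:
        return jsonify({"error": "missing 'text'"}), 400
    label = loaded_model.predict([text])[0]
    return jsonify({"text": text, "sentiment": str(label)})
```

The development server can then be started with the flask run command, and a web page can send the entered sentence to the route and display the returned class.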
Conclusion
Several methods can be used for sentiment classification, such as Support Vector Machine, Decision Tree, Naïve Bayes, and K-Nearest Neighbors. Of these, Support Vector Machine is one of the methods that produces the best accuracy, as reported in previous research on text classification, including news topic classification using Support Vector Machine. These studies conclude that Support Vector Machine is a classification method that provides the most accurate predictions compared with other methods, with accuracy above 80%.
References
[1] L. G. Irham, A. Adiwijaya, and U. N. Wisesty, “Klasifikasi Berita Bahasa Indonesia Menggunakan Mutual Information dan Support Vector Machine,” J. Media Inform. Budidarma, vol. 3, no. 4, p. 284, 2019, doi: 10.30865/mib.v3i4.1410.
[2] A. Setiyono and H. F. Pardede, “Klasifikasi Sms Spam Menggunakan Support Vector Machine,” J. Pilar Nusa Mandiri, vol. 15, no. 2, pp. 275–280, 2019, doi: 10.33480/pilar.v15i2.693.
[3] B. Das and S. Chakraborty, “An Improved Text Sentiment Classification Model Using TF-IDF and Next Word Negation,” arXiv preprint arXiv:1806.06407, 2018.