Industrial applications of topic models

Fatma Fatma
15 min read · Apr 4, 2019

In short, topic modeling is a text-mining technique for discovering topics in documents. A topic is a cluster of words that frequently occur together; topic modeling can connect words that have similar meanings and distinguish between uses of words with multiple meanings. Given that text documents are composed of words, a topic covered in more than one document can be expressed by a combination of strongly related words, and any given document can be associated with more than one topic. Thus, topic modeling can be used to infer hidden topics in a collection of text documents. The two key outputs from generating a topic model on a collection of documents are: 1) a list of topics (i.e., groups of words that frequently occur together) and 2) lists of the documents that are strongly associated with each of the topics. Ideally, each topic should be distinguishable from the other topics. Beyond this, scholars apply topic models in many different research areas.
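To make the two outputs concrete, here is a minimal sketch of a toy topic model. It assumes latent Dirichlet allocation (LDA) fitted with collapsed Gibbs sampling, which is only one of several possible topic models; the function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the tiny corpus are all illustrative choices, not something taken from the papers below.

```python
import random

def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]              # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word w in topic k
    topic_total = [0] * num_topics                            # tokens in topic k

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Output 1: the most frequent words of each topic.
    topics = []
    for k in range(num_topics):
        ranked = sorted(range(vocab_size), key=lambda i: -topic_word[k][i])
        topics.append([vocab[i] for i in ranked[:3]])

    # Output 2: how strongly each document is associated with each topic.
    doc_props = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        doc_props.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return topics, doc_props

# Hypothetical toy corpus, already tokenized.
docs = [
    "gene protein cell gene".split(),
    "protein cell dna gene".split(),
    "spam sms message spam".split(),
    "sms message filter spam".split(),
]
topics, doc_props = fit_lda(docs)
print(topics)
print(doc_props)
```

Each row of `doc_props` sums to 1, so it can be read as the share of each topic in that document. On a corpus this small the result depends heavily on the seed and the hyperparameters, so treat it as a demonstration of the output shape rather than of topic quality.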

Figure 1. Topic model applications

Figure 1 shows the fields that use topic models to improve their results; in other words, these are some current industrial applications of topic models. The following are abstracts of selected papers for each application.

Living lab: This study applies topic modeling analysis on a corpus of 86 publications in the Technology Innovation Management Review (TIM Review) to understand how the phenomenon of living labs has been approached in the recent innovation management literature. Although the analysis is performed on a corpus collected from only one journal, the TIM Review has published the largest number of special issues on living labs to date, thus it reflects the advancement of the area in the scholarly literature. According to the analysis, research approaches to living labs can be categorized under seven broad topics: 1) Design, 2) Ecosystem, 3) City, 4) University, 5) Innovation, 6) User, and 7) Living lab. Moreover, each topic includes a set of characteristic subtopics. A trend analysis suggests that the emphasis of research on living labs is moving away from a conceptual focus on what living labs are and who is involved in their ecosystems to practical applications of how to design and manage living labs, their processes, and participants, especially users, as key stakeholders and in novel application areas such as the urban city context [[1]]

Bio-informatics: Topic modeling is a useful method (in contrast to the traditional means of data reduction in bioinformatics) and enhances researchers’ ability to interpret biological information. Nevertheless, due to the lack of topic models optimized for specific biological data, the studies on topic modeling in biological data still have a long and challenging road ahead. In recent years, we have been witnessing exponential growth of biological data, such as microarray datasets. This situation also poses a great challenge, namely, how to extract hidden knowledge and relations from these data. As mentioned above, topic models have emerged as an effective method for discovering useful structure in collections. Therefore, a growing number of researchers are beginning to integrate topic models into various biological data, not only document collections. In these studies, we find that topic models act as more than a classification or clustering approach. They can model a biological object in terms of hidden “topics” that can reflect the underlying biological meaning more comprehensively. Therefore, topic models were recently shown to be a powerful tool for bioinformatics [[2]]

Summarization:

Opinion summarization: Summarizing newly discovered opinions is important for governments to improve their services and for companies to improve their products. Because no queries are posed beforehand, detecting opinions is similar to the task of topic detection at the sentence level. Besides telling which opinions are positive or negative, identifying which events are correlated with such opinions is also important [[3]]

Meeting summarization: Producing meeting documents requires an instantaneous recorder during meetings, which costs extra human resources and takes time to amend the file. However, a high-quality meeting document can enable users to recall the meeting content efficiently. The paper aims to discuss these issues. An application based on this framework is developed to help the users find topics and obtain summarizations of meeting contents without extra effort. This app uses the Bluemix speech recognizer to obtain speech transcripts. It then combines latent Dirichlet allocation and a TextTiling algorithm with the speech script of meetings to detect boundaries between different topics and evaluate the topics in each segment. TextTeaser, an open API based on a feature-based approach, is then used to summarize the speech transcripts [[4]]
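The idea of detecting boundaries between topics in a transcript can be sketched roughly as follows. This is a heavily simplified, TextTiling-style illustration and not the system from the paper: it compares plain bag-of-words vectors of adjacent sentences (no LDA, no depth scoring), and the function names, the `threshold` value, and the sample sentences are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_boundaries(sentences, threshold=0.1):
    """Return the indices at which a new topic segment is assumed to start."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [
        i + 1
        for i in range(len(bags) - 1)
        if cosine(bags[i], bags[i + 1]) < threshold  # low overlap suggests a topic shift
    ]

# Hypothetical transcript fragments.
sentences = [
    "the budget for the project is tight",
    "we must cut the project budget",
    "lunch menu has pizza today",
    "pizza and salad for lunch",
]
print(find_boundaries(sentences))  # [2]
```

Here the only boundary falls before the third sentence, where the vocabulary of neighbouring sentences stops overlapping. A real pipeline would compare larger blocks of text and could use topic distributions instead of raw word counts.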

Sentiment: With the expansion and acceptance of the World Wide Web, sentiment analysis has become an increasingly popular research area in information retrieval and web data analysis. Due to the huge amount of user-generated content on blogs, forums, social media, etc., sentiment analysis has attracted researchers both in academia and industry, since it deals with the extraction of opinions and sentiments. In this paper, we have presented a review of topic modeling, especially LDA-based techniques, in sentiment analysis. We have presented a detailed analysis of diverse approaches and techniques, and compared the accuracy of different systems among them. The results of different approaches have been summarized, analyzed and presented in a sophisticated fashion. This is an effort to explore different topic modeling techniques in the capacity of sentiment analysis and to provide a comprehensive comparison among them [[5]]

Chatbot: According to the paper, “Dialog evaluation is a challenging problem, especially for non task-oriented dialogs where conversational success is not well-defined. We propose to evaluate dialog quality using topic-based metrics that describe the ability of a conversational bot to sustain coherent and engaging conversations on a topic, and the diversity of topics that a bot can handle. To detect conversation topics per utterance, we adopt Deep Average Networks (DAN) and train a topic classifier on a variety of question and query data categorized into multiple topics. We propose a novel extension to DAN by adding a topic-word attention table that allows the system to jointly capture topic keywords in an utterance and perform topic classification. We compare our proposed topic based metrics with the ratings provided by users and show that our metrics both correlate with and complement human judgment. Our analysis is performed on tens of thousands of real human-bot dialogs from the Alexa Prize competition and highlights user expectations for conversational bots.”[[6]]

Topic tracking: Discovering and tracking topics in a text stream has attracted the interests of many researchers. A limitation of most existing methods is that they organize topics in flat structures. Topic hierarchy could reveal the potential relations between topics, which can help to find high quality topics when analyzing the text stream. In this paper, a hierarchical online non-negative matrix factorization method (HONMF) is proposed to generate topic hierarchies from text streams. The proposed method can dynamically adjust the topic hierarchy to adapt to the emerging, evolving, and fading processes of the topics. In the experiment, HONMF is evaluated under a variety of metrics. Compared with the baseline methods, our method can achieve better performance with competitive time efficiency[[7]]
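For readers unfamiliar with the building block behind this method, the following is a minimal sketch of plain, batch non-negative matrix factorization with the standard multiplicative update rules. It is not HONMF: there is no hierarchy and no online updating. The function names, the small document-term matrix, and the choice of rank are assumptions for illustration only.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def reconstruction_error(v, w, h):
    """Squared Frobenius distance between V and W @ H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h

# Hypothetical document-term counts: two documents per theme.
v = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
]
w, h = nmf(v, rank=2)
print(reconstruction_error(v, w, h))
```

Rows of `h` can be read as topics (weights over terms) and rows of `w` as the topic weights of each document. An online variant would update these factors as new documents arrive instead of refitting from scratch.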

Question and answer: There is increasing interest in text analysis based on unstructured data such as articles and comments, questions and answers. This is because they can be used to identify, evaluate, predict, and recommend features from unstructured text data, which is the opinion of people. The same holds true for TEL, where the MOOC service has evolved to automate debating, questioning and answering services based on the teaching-learning support system in order to generate question topics and to automatically classify the topics relevant to new questions based on question and answer data accumulated in the system. To that end, the present study proposes an LDA-based topic modeling. The proposed method enables the generation of a dictionary of question topics and the automatic classification of topics relevant to new questions [[8]]

Text categorization: Topic detection is defined as the task of finding the different themes in a collection of documents. One approach to topic detection is to find a topic for every document in the corpus. Any word or group of words that tells what the document is about is defined as the topic of the document. [[9]]

Similarity: Reputation management experts have to monitor — among others — Twitter constantly and decide, at any given time, what is being said about the entity of interest (a company, organization, personality…). Solving this reputation monitoring problem automatically as a topic detection task is both essential — manual processing of data is either costly or prohibitive — and challenging — topics of interest for reputation monitoring are usually fine-grained and suffer from data sparsity. We focus on a solution for the problem that (i) learns a pairwise tweet similarity function from previously annotated data, using all kinds of content-based and Twitter-based features; (ii) applies a clustering algorithm on the previously learned similarity function. Our experiments indicate that (i) Twitter signals can be used to improve the topic detection process with respect to using content signals only; (ii) learning a similarity function is a flexible and efficient way of introducing supervision in the topic detection clustering process. The performance of our best system is substantially better than state-of-the-art approaches and gets close to the inter-annotator agreement rate. A detailed qualitative inspection of the data further reveals two types of topics detected by reputation experts: reputation alerts / issues (which usually spike in time) and organizational topics (which are usually stable across time) [[10]]
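The two-step recipe described here (a pairwise similarity function, then clustering on top of it) can be sketched as below. This is only an illustration under strong assumptions: Jaccard word overlap stands in for the learned similarity function, a simple threshold-based single-link grouping stands in for the clustering algorithm, and the function names, the threshold, and the sample tweets are made up.

```python
def jaccard(a, b):
    """Word-overlap similarity; a stand-in for a learned similarity function."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(items, similarity, threshold=0.3):
    """Group items whose pairwise similarity reaches the threshold (single-link)."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical tweets about a made-up company.
tweets = [
    "acme phone battery drains fast",
    "battery on acme phone drains quickly",
    "acme opens new store in madrid",
    "new acme store opens downtown",
]
print(cluster(tweets, jaccard))  # [[0, 1], [2, 3]]
```

Because `cluster` takes the similarity function as an argument, a function trained on annotated tweet pairs could be swapped in for `jaccard` without changing the clustering step.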

Spam filter: At present, content-based methods are regarded as the more effective approach to Short Message Service (SMS) spam filtering. However, they usually use traditional text classification technologies, which are better suited to normal long texts; therefore, they often face serious challenges, such as sparse data and noisy data in SMS messages. In addition, the existing SMS spam filtering methods usually treat the task as a binary-class problem, which cannot provide different categories for multi-grain SMS spam filtering. In this paper, the authors propose a message topic model (MTM) for multi-grain SMS spam filtering. The MTM derives from the well-known probabilistic topic model, and is improved in this paper to make it more suitable for SMS spam filtering. Finally, the authors compare the MTM with the SVM and the standard LDA on the public SMS spam corpus. The experimental results show that the MTM is more effective for the task of SMS spam filtering. [[11],[12]]

Classification: With the overflow of Short Message Service (SMS) spam nowadays, many traditional text classification algorithms are used for SMS spam filtering. Nevertheless, because the content of SMS spam messages is miscellaneous and distinct from general text files (for example, much shorter and usually full of abbreviations, symbols, variant words, and distorted or deformed sentences), the traditional classifiers are not fit for the task of SMS spam filtering. In this paper, the authors propose a Short Message Biterm Topic Model (SM-BTM) which can be used to automatically learn latent semantic features from an SMS spam corpus for the task of SMS spam filtering. The SM-BTM is based on probabilistic topic model theory and the Biterm Topic Model (BTM). The experiments in this work show that the proposed SM-BTM can acquire higher-quality topic features than the original BTM, and is more suitable for identifying miscellaneous SMS spam [[13]]
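A biterm is an unordered pair of words that co-occur in the same short text, and biterm-based models learn topics from these pairs across the whole corpus rather than from sparse per-message word counts. The sketch below shows only this extraction step, not the SM-BTM model itself; the function name and the sample messages are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def extract_biterms(tokens):
    """Return every unordered word pair of one short message."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]

# Hypothetical SMS messages, already tokenized by whitespace.
messages = ["win free prize", "free prize now"]

# Pool the biterms of all messages into corpus-level counts.
biterm_counts = Counter()
for message in messages:
    biterm_counts.update(extract_biterms(message.split()))

print(extract_biterms("win free prize".split()))
# [('free', 'win'), ('prize', 'win'), ('free', 'prize')]
print(biterm_counts[("free", "prize")])  # 2
```

Pooling the pairs over the corpus is what helps with very short texts: a single message has few words, but the pair counts aggregated across many messages carry usable co-occurrence signal.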

Recommender system: This paper suggests using topic detection and tracking (TDT) technologies to group news items instead of common item-based clustering technologies [[14]]

Chemical topic modeling: According to the paper, “we adopted a probabilistic framework called “topic modeling” from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to “chemical topics” and investigating the relationships between those. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like “proteins”, “DNA”, or “steroids”. Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.”[[15]]

IoT and healthcare: The purpose of this study is to unravel key themes latent in the sparse but growing academic literature on the application of IoT in healthcare. Specifically, we performed topic modeling and identified five dominant clusters of research, namely, privacy and security, wireless network technologies, applications, data, and smart health and cloud. Our results show that research in healthcare IoT has mainly focused on the technical aspects with little attention to social concerns. In addition to categorizing and discussing the topics identified, the paper provides directions for future research. [[16]]

HR: Human-machine teaming aims to meld human cognitive strengths with the unique capabilities of smart machines. An issue within human-machine teaming is a lack of communication skills on the part of the machine such as the inability to know when to interrupt human teammates. A proposed solution to this issue is an intelligent interruption system that monitors the spoken communication of human teammates and predicts appropriate times to interrupt without disrupting the teaming interaction. The current research expands on a prosody-only task boundary model as an intelligent interruption system with a topic-only task boundary model. The topic-only task boundary model outperforms the prosody-only model with a 9.5% increase in the F1 score, but is limited in its ability to process topical data in real-time, a previous benefit of the prosody-only task boundary model.[[17]]

IoT: According to the paper, “The Internet of Things (IoT) provide intelligence for the communication between people and physical objects. An important and critical issue in the IoT service applications is how to match the suitable IoT services with service requests. To solve this problem, researchers use semantic modeling methods to make service matching. Semantic modeling methods in IoT extract meta-data from text using rule-based approaches or machine learning techniques often suffer from the scalability and sparseness since text provided by sensors is short and unstructured. In recent years, topic modeling has been used in IoT service matchmaking. However, most topic modeling methods do not perform well in IoT service matchmaking since the text is too short. In order to address the issues, this paper proposes a new topic modeling method to extract topic signatures provided by intelligent devices. The method extends the classical knowledge representation framework and improves the qualities of service information extraction, and this process is able to improve the effectiveness of service matchmaking in IoT service. The framework incorporates human cognition to improve the effectiveness of the algorithm and make the algorithm more robust in heterogeneous systems in the IoT. The usefulness of the method is illustrated via experiments using real datasets.”[[18]]

Healthcare: According to the paper, “Pharmacovigilance, and generally applications of natural language processing models to healthcare, have attracted growing attention over the recent years. In particular, drug reactions can be extracted from user reviews posted on the Web, and automated processing of this information represents a novel and exciting approach to personalized medicine and wide-scale drug tests. In medical applications, demographic information regarding the authors of these reviews such as age and gender is of primary importance; however, existing studies usually either assume that this information is available or overlook the issue entirely. In this work, we propose and compare several approaches to automated mining of demographic information from user-generated texts. We compare modern natural language processing techniques, including extensions of topic models and convolutional neural networks (CNN). We apply single-task and multi-task learning approaches to this problem. Based on a real-world dataset mined from a health-related web site, we conclude that while CNNs perform best in terms of predicting demographic information by jointly learning different user attributes, topic models provide additional information and reflect gender-specific and age-specific symptom profiles that may be of interest for a researcher.”[[19]]

Blockchain: According to the paper, “Cognitive manufacturing has brought about an innovative change to the 4th industrial revolution based technology in combination with blockchain distributed ledger, which guarantees reliability, safety, and security, and mining-based intelligence information technology. In addition, artificial intelligence, machine learning, and deep learning technologies are combined in processes for logistics, equipment, distribution, manufacturing, and quality management, so that an optimized intelligent manufacturing system is developed. This study proposes a topic mining process in blockchain-network-based cognitive manufacturing. The proposed method exploits the highly universal Fourier transform algorithm in order to analyze the context information of equipment and human body motion based on a variety of sensor input information in the cognitive manufacturing process. An accelerometer is used to analyze the movement of a worker in the manufacturing process and to measure the state energy of work, movement, rest, and others. Time is split in a certain unit and then a frequency domain is analyzed in real time. For the vulnerable security of smart devices, a side-chain-based distributed consensus blockchain network is utilized. If an event occurs, it is processed according to rules and the blocking of a transaction is saved in a distributed database. In the blockchain network, latent Dirichlet allocation (LDA) based topic encapsulation is used for the mining process. The improved blockchain distributed ledger is applied to the manufacturing process to distribute the ledger of information in a peer-to-peer blockchain network in order to jointly record and manage the information. Further, topic encapsulation, a formatted statistical inference method to analyze a semantic environment, is designed. Through data mining, the time-series-based sequential pattern continuously appearing in the manufacturing process and the correlations between items in the process are found. In the cognitive manufacturing, an equalization-based LDA method is used for associate-clustering the items with high frequency. In the blockchain network, a meaningful item in the manufacturing step is extracted as a representative topic. In a cognitive manufacturing process, through data mining, potential information is extracted and hidden rules are found. Accordingly, in the cognitive manufacturing process, the mining process makes decision-making possible, especially advanced decision-making, such as potential risk, quality prediction, trend prediction, production monitoring, fault diagnosis, and data distortion.”[[20]]
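The step in which sensor time is split into fixed units and each unit is analyzed in the frequency domain can be sketched as follows. This is a rough illustration with a naive discrete Fourier transform, not the pipeline from the paper; the function names, the 32 Hz sampling rate, and the synthetic accelerometer signal are all assumptions.

```python
import cmath
import math

def dft_magnitudes(window):
    """Magnitudes of the DFT bins 0..n/2 for one window of real samples."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window)))
        for k in range(n // 2 + 1)
    ]

def dominant_frequency(window, sample_rate):
    """Frequency in Hz of the strongest non-constant component in the window."""
    mags = dft_magnitudes(window)
    k = max(range(1, len(mags)), key=lambda i: mags[i])  # skip bin 0 (the constant offset)
    return k * sample_rate / len(window)

# Synthetic one-second window: a 4 Hz movement sampled at 32 Hz.
sample_rate = 32
window = [math.sin(2 * math.pi * 4 * t / sample_rate) for t in range(sample_rate)]
print(dominant_frequency(window, sample_rate))  # 4.0
```

A production system would normally use a fast Fourier transform library rather than this quadratic-time loop; the naive version is used here only to keep the example self-contained.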

References

[1] Westerlund, M., Leminen, S., & Rajahonka, M. (2018). A topic modelling analysis of living labs research. Technology Innovation Management Review, 8(7).

[2] Liu, L., Tang, L., Dong, W., Yao, S., & Zhou, W. (2016). An overview of topic modeling and its current applications in bioinformatics. SpringerPlus, 5(1), 1608.

[3] Ku, L. W., Lee, L. Y., Wu, T. H., & Chen, H. H. (2005, August). Major topic detection and its application to opinion summarization. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 627–628). ACM.

[4] Huang, T. C., Hsieh, C. H., & Wang, H. C. (2018). Automatic meeting summarization and topic detection system. Data Technologies and Applications, 52(3), 351–365.

[5] Rana, T. A., Cheah, Y. N., & Letchmunan, S. (2016). Topic modeling in sentiment analysis: a systematic review. Journal of ICT Research and Applications, 10(1), 76–93.

[6] Guo, F., Metallinou, A., Khatri, C., Raju, A., Venkatesh, A., & Ram, A. (2018). Topic-based evaluation for conversational bots. arXiv preprint arXiv:1801.03622.

[7] Tu, D., Chen, L., Lv, M., Shi, H., & Chen, G. (2018). Hierarchical online NMF for detecting and tracking topic hierarchies in a text stream. Pattern Recognition, 76, 203–214.

[8] Kim, K., Song, H. J., & Moon, N. (2017). Topic Modeling for Learner Question and Answer Analytics. In Advanced Multimedia and Ubiquitous Engineering (pp. 652–655). Springer, Singapore.

[9] Haribhakta, Y., Malgaonkar, A., & Kulkarni, P. (2012, September). Unsupervised topic detection model and its application in text categorization. In Proceedings of the CUBE International Information Technology Conference (pp. 314–319). ACM.

[10] Spina, D., Gonzalo, J., & Amigó, E. (2014, July). Learning similarity functions for topic detection in online reputation monitoring. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (pp. 527–536). ACM.

[11] Ma, J., Zhang, Y., Wang, Z., & Yu, K. (2016). A message topic model for multi-grain SMS spam filtering. International Journal of Technology and Human Interaction (IJTHI), 12(2), 83–95.

[12] Al Moubayed, N., Breckon, T., Matthews, P., & McGough, A. S. (2016, September). SMS spam filtering using probabilistic topic modelling and stacked denoising autoencoder. In International Conference on Artificial Neural Networks (pp. 423–430). Springer, Cham.

[13] Ma, J., Zhang, Y., Zhang, L., Yu, K., & Liu, J. (2017). Bi-Term Topic Model for SMS Classification. International Journal of Business Data Communications and Networking (IJBDCN), 13(2), 28–40.

[14] Qiu, J., Liao, L., & Li, P. (2009, July). News recommender system based on topic detection and tracking. In International Conference on Rough Sets and Knowledge Technology (pp. 690–697). Springer, Berlin, Heidelberg.

[15] Schneider, N., Fechner, N., Landrum, G. A., & Stiefl, N. (2017). Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach. Journal of Chemical Information and Modeling, 57(8), 1816–1831.

[16] Dantu, R., Dissanayake, I., & Nerur, S. (2019, January). Exploratory Analysis of Internet of Things (IoT) in Healthcare: A Topic Modeling Approach. In Proceedings of the 52nd Hawaii International Conference on System Sciences.

[17] Peters, N. S., Bradley, G. C., & Marshall-Bradley, T. (2019, February). Task Boundary Inference via Topic Modeling to Predict Interruption Timings for Human-Machine Teaming. In International Conference on Intelligent Human Systems Integration (pp. 783–788). Springer, Cham.

[18] Liu, Y., Du, F., Sun, J., Jiang, Y., He, J., Zhu, T., & Sun, C. (2018). A crowdsourcing-based topic model for service matchmaking in Internet of Things. Future Generation Computer Systems, 87, 186–197.

[19] Tutubalina, E., & Nikolenko, S. (2018). Exploring convolutional neural networks and topic models for user profiling from drug reviews. Multimedia Tools and Applications, 77(4), 4791–4809.

[20] Chung, K., Yoo, H., Choe, D., & Jung, H. (2018). Blockchain Network Based Topic Mining Process for Cognitive Manufacturing. Wireless Personal Communications, 1–15.
