Carlos Toxtli Hernandez, Claudia Flores Saviaga, Marco Maurier, Amandine Ribot, Temitayo Samson Bankole, Alexander Entrekin, Michael Cantley, Salvi Singh, Sumitra Reddy, Yenumula. V. Reddy.
Even though the advent of the Web, coupled with powerful search engines, has empowered knowledge workers to quickly find the information they need, doing so is still a time-consuming operation. Presently there are no readily available tools that can create and maintain an up-to-date personal knowledge base that can be readily consulted when needed. While organizing the entire Web as a semantic network is a long-term goal, creating a semantic network of personal knowledge sources that are continuously updated by crawlers and other devices is an attainable task. We created an app titled ExperTwin that collects personally relevant knowledge units (known as JANs) from the Web, email correspondence, and locally stored files, and organizes them as a semantic network that can be easily queried and visualized in many formats, just in time, when performing a knowledge-based task. The architecture of ExperTwin is based on the model of a "Society of Intelligent Agents", where each agent is responsible for a specific task. Collecting JANs from multiple sources, establishing their relevancy, and creating the personal semantic network are some of the many tasks performed by the individual agents. TensorFlow and Natural Language Processing (NLP) tools have been implemented to let ExperTwin learn from its users. Documenting the design and deployment of ExperTwin as a "Knowledge Advantage Machine" able to search for relevant information while the user performs a knowledge-based task is the main goal of the research presented in this post.
Today, the information in every field is expanding exponentially, while the half-life of that information is shrinking rapidly. Knowledge workers (KWs) face these twin challenges in keeping themselves current. Traditional methods of continuing education are not sufficient to keep knowledge workers up to date. This situation gives rise to the need for an intelligent tool that can automatically gather, refine, and present contextually relevant knowledge on demand at the time of need. This post describes ExperTwin, a tool we developed to address these challenges. It may be thought of metaphorically as an alter ego that dwells in cyberspace, collecting, refining, and transforming information into a usable knowledge base that is always current and relevant. For example, if a knowledge worker, say a researcher, is preparing a research paper, ExperTwin can present the latest relevant annotated bibliography. The same applies to any other situation where knowledge plays a key role. Just as Mechanical Advantage played a key role in the industrial era, the concept of Knowledge Advantage can be applied to deal with the information explosion problem.
The three main components of the solution are Knowledge Discovery, the Learning Agent, and Visualization. The architecture and the details of each component are described below.
The knowledge discovery step of ExperTwin is a crawling capability that extracts data from various sources. ExperTwin can discover knowledge from RSS feeds, email, and documents uploaded via drag and drop on the ExperTwin interface. The web crawler, or discovery agent, extracts data from the RSS feeds at regular intervals. All the information is inserted into the database as JANs.
There are four data acquisition mechanisms: a file drag-and-drop area in the user interface, a Chrome extension, an email bot, and a Rich Site Summary (RSS) crawler. The drag-and-drop mechanism is a dedicated area in the top bar of the user interface that can upload multiple files and nested directories. The Chrome extension is a button placed in the browser toolbar whose function is to send the URL the user is visiting. The email bot accepts as input article URLs, RSS URLs, and attached files.
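The email bot must decide which ingestion path each incoming item belongs to. The following sketch shows one plausible triage step; the routing heuristics and function names are illustrative assumptions, not taken from the ExperTwin codebase.

```python
# Hypothetical triage step for the email bot: route article URLs, RSS feed
# URLs, and attachments to the appropriate ingestion path. The suffix-based
# heuristic for detecting feed URLs is an illustrative assumption.
def triage_email(body_urls, attachments):
    routed = {"article": [], "rss": [], "file": []}
    for url in body_urls:
        # Feed URLs commonly end with a feed-like suffix.
        if url.endswith((".rss", ".xml", "/feed")):
            routed["rss"].append(url)
        else:
            routed["article"].append(url)
    routed["file"].extend(attachments)
    return routed

routed = triage_email(
    ["https://example.com/story", "https://example.com/feed"],
    ["notes.pdf"],
)
```

Each routed bucket would then be handled by the matching agent (article scraper, RSS crawler, or file importer).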
For the purpose of this research, we selected the Tech News domain. The data sources are close to fifty RSS feeds from various tech news websites, from which the web crawler, or discovery agent, extracts data at regular intervals. The news feeds are temporarily stored in a NoSQL MongoDB database until they are processed by a TensorFlow model that filters the relevant content, as outlined in the learning agent section below.
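The crawling step described above can be sketched as follows: parse one RSS feed and convert each item into a JAN document ready for insertion into MongoDB. The JAN field names and the sample feed are illustrative assumptions; the real schema may differ.

```python
# Sketch of the RSS discovery agent: parse a feed and turn each item into a
# JAN document. Field names are assumptions, not the actual ExperTwin schema.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Tech News</title>
  <item><title>New GPU announced</title>
        <link>https://example.com/gpu</link>
        <description>A faster GPU for deep learning.</description></item>
  <item><title>Browser update</title>
        <link>https://example.com/browser</link>
        <description>Security fixes in the latest release.</description></item>
</channel></rss>"""

def feed_to_jans(feed_xml, source="rss"):
    """Convert one RSS feed into a list of JAN dictionaries."""
    root = ET.fromstring(feed_xml)
    jans = []
    for item in root.iter("item"):
        jans.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
            "source": source,
            "fetched_at": datetime.now(timezone.utc).isoformat(),
        })
    return jans

jans = feed_to_jans(SAMPLE_FEED)
# In the real agent these would be stored in MongoDB, e.g. with pymongo:
# db.jans.insert_many(jans)
```

A production crawler would fetch each of the fifty feeds on a timer and deduplicate by URL before insertion.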
ExperTwin learns from each user's preferences and interactions to retrieve content tailored to that user, using Natural Language Processing and machine learning models.
Characterization of text articles using keywords has long been researched in the open literature. All data processing, such as data scraping, organization, filtering, preference learning, and data visualization, depends on the set of keywords that characterize the JANs; it is therefore pertinent to extract the right set of keywords for any given document in the database. Essentially, keyword extraction boils down to using word degrees, together with an optional auxiliary list (e.g. the title of an article), to determine with some degree of accuracy what a passage of text is about. The parameters to this program are the text of the article, the number of keywords desired, and optionally the title of the document. Upon receipt of the arguments, Rapid Automatic Keyword Extraction (RAKE) is used to obtain the keywords within the passage of text. Once the RAKE-extracted keywords have been handled as described, they are sorted by their degrees. The higher the degree, the higher the likelihood that the keyword is a good descriptive word for the passage of text. Once the words have been sorted, the last step is constructing the output dictionary, formatted as (Keyword: Weight) with the keywords serving as keys. The keywords are the top entries selected according to the number-of-keywords input parameter, but the weights still need to be calculated. The weight w_k of keyword k, with keyword degree δ_k and keyword frequency κ_k, is calculated as

w_k = δ_k / κ_k    (1)

Once the weights have been calculated and the dictionary constructed, the algorithm returns the dictionary to the caller.
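The degree-based weighting described above can be sketched in a few lines. This is a simplified stand-in for RAKE, not the actual ExperTwin implementation: candidate phrases are runs of words between stopwords, a word's degree accumulates the length of each phrase it appears in, and the weight is degree divided by frequency. The stopword list is a small illustrative subset.

```python
# Simplified RAKE-style keyword weighting: w_k = degree / frequency.
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for", "on", "with"}

def extract_keywords(text, num_keywords=5):
    # Split the text into candidate phrases at stopwords.
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    degree = defaultdict(int)
    freq = defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)  # co-occurrence degree within the phrase

    # Weight each keyword by degree / frequency, then keep the top entries.
    weights = {w: degree[w] / freq[w] for w in freq}
    top = sorted(weights, key=weights.get, reverse=True)[:num_keywords]
    return {w: weights[w] for w in top}

kw = extract_keywords(
    "Deep learning accelerates image recognition and speech recognition in modern systems"
)
```

Words that appear in long phrases receive high degrees, while words repeated in short phrases are penalized by their frequency.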
Outline of the Learning Agent
This section presents an operational description of the intelligence and adaptability behind the framework discussed so far. Given the collection of all documents, referred to as JANs in this framework, it is necessary to extract features from an initial set of documents based on user-defined preferences. The collection of JANs can be thought of as a community ontology from which we seek a personal, ontology-based selection. The user-defined preferences provide insight into the user's ontology within the database and therefore provide a basis for the learning component of the overall framework.
Several methods exist for representing terms in text mining so that they can serve as a platform for the learning component; such a representation provides a way to evaluate a search heuristic. The most widely used and promising approach has been the vector space model. In this model, a collection of documents D is represented by m-dimensional vectors, where each dimension represents a distinct term and m is the total number of distinct terms used in the collection. Each document in the collection has a corresponding vector representation V, where v_i is the weight of term d_i in that particular document. Thus the collection of documents is a matrix D ∈ R^(n×m), where n is the number of documents in the collection. The weights reflect the relative importance of the words/terms and can be determined using the tf-idf scheme: a term's weight is based on how often it appears in a particular document and how frequently it appears across the entire collection.
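The vector space model with tf-idf weighting can be sketched in plain Python. The tokenization (whitespace splitting) and the tf and idf formulas used here are one common variant, chosen for clarity; they are not taken from the ExperTwin codebase.

```python
# Minimal tf-idf sketch: each document becomes an m-dimensional row of the
# n x m matrix D, over the vocabulary of distinct terms.
import math

def tfidf_matrix(documents):
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    matrix = []
    for doc in tokenized:
        row = []
        for t in vocab:
            tf = doc.count(t) / len(doc)          # term frequency in this doc
            idf = math.log(n / df[t])             # rarity across the collection
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

vocab, D = tfidf_matrix([
    "neural networks learn features",
    "networks route packets",
    "packets carry data",
])
```

Terms shared by many documents receive low idf and thus low weight, while terms concentrated in few documents dominate their vectors.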
The machine learning algorithm employed here is a two-step method using Artificial Neural Networks (ANNs). The first ANN consists of five layers: one input layer, three hidden layers, and one output layer. The second ANN receives the classification output from the first and further processes the results according to user preferences; this is referred to as the preference learner. It helps monitor user preferences as they drift from the defined interests. To train the network, an initial set of documents was collected from the database on the server, consisting of 2125 articles from the web crawler. A set of user-defined keywords was generated, represented as K(user). Using the keyword extraction scheme defined above, keywords are generated for each document V_k, k = 1, 2, …, n. Each document class C_k is then classified as class 1 or class 0 based on the criteria in equation 3 of the paper.
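Equation 3 is not reproduced in this post. A plausible reading, sketched here purely as an assumption, is that a document is labeled class 1 when its extracted keywords overlap K(user) by at least some threshold, and class 0 otherwise.

```python
# Hypothetical labeling criterion (our assumption standing in for eq. 3):
# class 1 if the document's keywords overlap K(user), class 0 otherwise.
def label_document(doc_keywords, user_keywords, min_overlap=1):
    overlap = len(set(doc_keywords) & set(user_keywords))
    return 1 if overlap >= min_overlap else 0

K_user = {"machine learning", "neural networks", "nlp"}
docs = [{"nlp", "parsing"}, {"cooking", "travel"}]
labels = [label_document(d, K_user) for d in docs]
```

These labels would then serve as the training targets for the first ANN.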
As mentioned above, the second ANN incorporates exhibited user preferences for learning, in addition to the output of the first ANN. This preference learner is implemented as a neural network with multiple layers that trains on domain-relevant JANs, some of which have been identified as of primary interest to the user, while other JANs in this subset are not the user's first preference. The training data is a subset of the main dataset that was used to collect, discover, and catalog relevant news articles from among the articles scraped from the Internet; this subset comprises only the articles that have already been classified as JANs in the present domain context. Depending on the user's predisposition towards certain topics within the domain, JANs on these select topics are rated as ones, whereas others are rated as zeroes. The procedure is to extract all 890 previously classified documents from the MongoDB database and then vectorize all JANs using the bag-of-words approach. The dataset is divided into batches. This network consists of three layers with the ReLU activation function. As before, cross-entropy is the objective to be minimized, and the learning algorithm is backpropagation with Adam optimization, which adjusts the step size to minimize the cost function.
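To make the training loop concrete without pulling in TensorFlow, here is a dependency-free sketch of the same ingredients (cross-entropy objective, Adam-style updates) applied to a single-layer logistic model on toy bag-of-words vectors. The actual preference learner is a three-layer ReLU network; this simplified stand-in, with invented data, only illustrates the optimization procedure.

```python
# Toy illustration of training with cross-entropy loss and Adam updates.
# Single-layer logistic model standing in for the multi-layer ReLU network.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, epochs=300, lr=0.05):
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    # Adam state: first and second moment estimates per parameter.
    m_w, v_w = [0.0] * dim, [0.0] * dim
    m_b = v_b = 0.0
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi  # gradient of cross-entropy w.r.t. the logit
            for j in range(dim):
                gj = g * xi[j]
                m_w[j] = beta1 * m_w[j] + (1 - beta1) * gj
                v_w[j] = beta2 * v_w[j] + (1 - beta2) * gj * gj
                mhat = m_w[j] / (1 - beta1 ** t)
                vhat = v_w[j] / (1 - beta2 ** t)
                w[j] -= lr * mhat / (math.sqrt(vhat) + eps)
            m_b = beta1 * m_b + (1 - beta1) * g
            v_b = beta2 * v_b + (1 - beta2) * g * g
            b -= lr * (m_b / (1 - beta1 ** t)) / (math.sqrt(v_b / (1 - beta2 ** t)) + eps)
    return w, b

# Invented dataset: feature 0 marks a preferred topic, feature 1 an ignored one.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
w, b = train(X, y)
pred = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0.5 else 0
        for xi in X]
```

The real system performs the equivalent updates in batches over the 890 bag-of-words vectors.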
The application's interface is composed of four main parts: 1) a user settings section, 2) a section for seeing articles in detail, 3) a 2D/3D reorganization of the articles, and 4) a graph view displaying the relationships between the articles. Additionally, the interface has been built to adapt to any device size using the open source Bootstrap toolkit. Each section is detailed below.
The first part of the application interface is related to the user settings. Each user has access to their own knowledge-based system. For convenience, since all the backend technologies (server and databases) are located on the Google Developer Console/Cloud, we decided to integrate Google Sign-In into the application. Once logged in, users can see their topics of interest (keywords), which define the ranking of the articles retrieved by the server queries. At any time while using the application, the user can change, add, or remove any keyword using the interface's Add New and Delete buttons.
Article Details View
Once the user has selected a keyword from the dropdown or entered a query in the search bar, the second section of the interface is populated with results from our database. Following the ranking calculation explained in the previous section, a list of at most 125 relevant articles is received by the interface in the form of a JSON file. Of those 125 articles, the top ten, i.e. the ten articles ranked most highly against the query/keyword, are displayed as a list in descending ranking order. We chose to display only ten articles to keep the interface at a reasonable height and to avoid an endless, dictionary-like list. If the user desires to consult more articles, any of the 125 articles can still be accessed through the 3D view. For each article, a rating system is offered to the user: five clickable stars that light up in yellow or turn back to grey depending on the user's choice. The rating is used by the backend to influence the ranking of the articles.
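The client-side selection described above amounts to sorting the returned articles by rank and keeping the first ten. The field names in this sketch are assumptions about the JSON payload's shape.

```python
# Sketch of selecting the top ten of (at most) 125 ranked articles.
# "title" and "rank" are assumed field names for the JSON records.
articles = [{"title": f"Article {i}", "rank": i % 40} for i in range(125)]

top_ten = sorted(articles, key=lambda a: a["rank"], reverse=True)[:10]
```

The remaining articles stay available to the 3D view, which renders the full result set.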
3D Articles View
In case the user wants to see and access more than the ten most highly ranked articles, the user can toggle between the Graph view and the Article view using the Go to View button located in the navigation part of the interface. There are four types of 3D representations available: Table, Sphere, Helix, and Grid, in addition to the graph view displaying the relationships between the articles.
ExperTwin can be switched to an immersive experience in which the user navigates through the 3D environment. This view is compatible with commercial VR headsets that use a mobile device's screen, and it is loaded from a web browser.
In this post, we illustrated how a Knowledge Advantage Machine can be constructed by detailing how its different requirements can be met. However, we should note the limitations of the current implementation of ExperTwin. In the future, we hope to develop a plug-and-play capability whereby a generic ExperTwin could simply be connected to a knowledge base. We also plan to explore automatic summarization of the knowledge instead of simply presenting the contextually relevant JANs. Further work on the automatic determination of context will be essential for tools like ExperTwin to succeed.
The paper was presented at the 2018 IEEE Conferences on Internet of Things, Green Computing and Communications, Cyber, Physical and Social Computing, Smart Data, Blockchain, Computer and Information Technology, Congress on Cybermatics. The code can be found here. The paper can be found here.