DATA STORIES | DIGITAL HEALTHCARE | KNIME ANALYTICS PLATFORM

KNIMEZoBot: A Low Code Solution for Conversational Interrogation of Zotero Libraries with LLMs

Leverage AI and low-code for efficient and augmented literature review

Dayanjan S. Wijesinghe
Low Code for Data Science


Co-authors: Suad Alshammari, Pharm.D., MSc., Lama Basalelah, Pharm.D., MSc., Walaa Abu Rukbah, Pharm.D., MSc., Ali Alsuhibani, Pharm.D., and Dayanjan Wijesinghe, Ph.D., Virginia Commonwealth University, School of Pharmacy

Photo by Ryunosuke Kikuno on Unsplash.

The preprint from which this blog post is drawn is available at https://arxiv.org/abs/2311.04310. If you cite this work, please cite it as arXiv:2311.04310 [cs.HC] or https://doi.org/10.48550/arXiv.2311.04310

Introduction

Academicians, clinicians and researchers currently face a significant challenge of information overload. The rate at which research findings are published has been increasing exponentially over the past years, making it difficult to keep pace. Compounding this are the new AI-driven approaches optimized for natural language, which will further accelerate the pace of new findings and their publication in the coming years. Thus, there is a significant unmet need for a means by which we can continue to manage our knowledge and stay updated with current findings. Two specific scenarios within this problem space are:

  • 1. Synthesizing answers from multiple documents in an existing corpus of knowledge. Researchers tend to accumulate published findings relevant to their field of expertise, often in reference libraries. These text corpora grow over time as newer findings are added. Finding specific answers to a given question in this carefully curated content becomes more challenging as the number of publications in the library increases, requiring the user to hunt through multiple publications. Thus, an emerging need exists for an AI-driven platform that can answer specific questions from an ever-growing corpus of curated scientific literature in personal and group reference libraries.
  • 2. Summarizing knowledge through comprehensive literature reviews. Summarizing the current state of knowledge in a given domain uncovers gaps in knowledge and new areas for impactful research. This requires comprehensive literature reviews. A comprehensive literature review consists of first identifying and then collecting all publications relevant to the questions under investigation. Once this is complete, individuals need to read through the collected material in order to synthesize answers to specific questions. The traditional workflow for literature reviews can therefore be extremely time-consuming and labor-intensive. Given the accelerating pace of discovery, a literature review carried out through the traditional manual path yields a knowledge summary that is already outdated by the time it is completed.

Newer and easier ways to implement automated approaches are needed for researchers to ask questions from curated literature libraries and to periodically update their knowledge in their discipline in a timely manner.

The challenge of context window lengths

Recent advances in artificial intelligence (AI), primarily in Large Language Models, show great promise in addressing the above-mentioned challenges. Tools like ChatGPT, Claude and Bard can generate summaries of papers and even synthesize findings across multiple documents. However, several challenges currently limit the effective application of these platforms in their native “chat” formats for academic research. Primary among them is the length of the context window, that is, the amount of text the model can take in at once. The publicly available ChatGPT has a context window of approximately 5000 words, while Claude has one of approximately 75,000 words. Thus, the majority of academic publications are too long for ChatGPT. In the case of Claude, while a single publication can be summarized, use across multiple publications would be limited.
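The arithmetic behind this limitation can be illustrated with a small sketch. The window sizes are the approximate word counts quoted above; the paper lengths are hypothetical values chosen only for illustration, and real limits are measured in tokens rather than words.

```python
# Approximate context window sizes in words, as quoted in the text above.
CONTEXT_WINDOW_WORDS = {"ChatGPT": 5_000, "Claude": 75_000}


def fits_in_context(word_counts, model):
    """Return True if the combined documents fit in the model's window."""
    return sum(word_counts) <= CONTEXT_WINDOW_WORDS[model]


# Hypothetical library: 12 papers of 8,000 words each (illustrative only).
papers = [8_000] * 12

for model in CONTEXT_WINDOW_WORDS:
    single = fits_in_context(papers[:1], model)
    combined = fits_in_context(papers, model)
    print(f"{model}: single paper fits={single}, all papers fit={combined}")
```

Under these assumed lengths, one paper exceeds the smaller window, and the whole library exceeds both.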

RAG as a possible solution

A recent advance that addresses the limitations posed by context windows and facilitates summarization across multiple documents is Retrieval Augmented Generation (RAG). A RAG workflow begins by splitting large text corpora into small, overlapping text fragments. These fragments are then converted into vector representations and stored in a vector database. When a query is submitted, it too is converted into a vector. A vector similarity search then identifies the stored text fragments that are semantically closest to the query. The query, together with these relevant fragments, is passed to a Large Language Model (LLM), which generates a coherent and relevant response. Because only the relevant fragments are sent to the model, this approach works within the limits of the context window regardless of the overall size of the corpus. Repeating the process for successive queries creates a question-and-answer workflow that is valuable for extensive literature reviews, benefiting scholars and medical professionals alike.
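The retrieval step described above can be sketched in a few lines of Python. This is a toy illustration, not the KNIMEZoBot implementation: the `embed` function is a simple word-count vector standing in for a learned embedding model, the in-memory list stands in for a vector database, and all function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a word-count vector (assumption: a real system
    would call a learned embedding model instead)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return up to top_k chunks most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score > 0]


def build_prompt(query, retrieved):
    """Combine the query and retrieved fragments into one LLM prompt."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical text fragments standing in for chunks of stored papers.
chunks = [
    "KNIME is a free and open-source platform for data analytics.",
    "Zotero is a free and open-source reference manager.",
    "RAG retrieves relevant text fragments before generating an answer.",
]
query = "What is Zotero?"
print(build_prompt(query, retrieve(query, chunks, top_k=1)))
```

In a full pipeline the resulting prompt would be sent to the language model; that call is omitted here so the sketch runs without an API key.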

While the RAG-based approach has the potential to quickly summarize information, its implementation currently requires significant coding knowledge. Because the majority of academicians and clinicians are not familiar with coding, we decided to develop a code-free and open-source approach, called KNIMEZoBot, that allows anyone to implement a RAG-based system. Here, we combined three elements: the Konstanz Information Miner (KNIME), a code-free data science platform; Zotero, an open-source reference management system; and GPT-4 from OpenAI as the language model of choice for knowledge synthesis. Thus, KNIMEZoBot represents an innovative approach to augmenting the literature review workflow by combining the strengths of reference managers, scholarly databases, and AI.

Konstanz Information Miner (KNIME)

KNIME, the “Konstanz Information Miner,” is a free and open-source platform originating from the University of Konstanz in Germany, catering to data analytics, reporting, and integration needs, with a strong footing in data science and machine learning domains. It’s a community-enhanced tool, widely embraced by data professionals globally owing to its user-friendly, graphical interface enabling code-free workflow creation, modification, and visualization.

KNIME’s core strength is its extensive node repository facilitating seamless data pipeline construction for tasks ranging from data preprocessing to advanced analytics using a no-code/low-code approach. It boasts robust data integration, connecting effortlessly to various data sources like databases and web services, thus centralizing data for comprehensive analysis. Scalability is a hallmark of KNIME, adeptly managing small to large datasets, with ease of integration into big data frameworks like Apache Hadoop and Apache Spark.

The platform supports building, training, and evaluating machine learning models utilizing popular libraries such as scikit-learn and TensorFlow, alongside offering an array of statistical and analytical techniques. Automation is seamless with KNIME, allowing scheduled workflow executions, while its ecosystem facilitates collaborative efforts and workflow sharing.

The commercial component of KNIME adds advanced features and support, enriching its open-source ecosystem. KNIME is a versatile tool for creating insightful reports and visualizing data, and it finds applications across diverse fields including bioinformatics, predictive analytics, business intelligence, and industrial research.

Zotero

Zotero, a free and open-source reference management software, is cherished by a broad spectrum of academia and professionals for easing the collection, organization, and citation of research materials. Originating from George Mason University, it’s a boon for scholarly research and writing, streamlining reference, citation, and bibliography management.

Key facets include effortless reference collection from diverse sources like websites and academic journals, with automatic citation information extraction from web pages and PDFs. Its intuitive interface facilitates organizing references via folders, tags, and notes, ensuring easy retrieval. A hallmark feature is its citation and bibliography generation in numerous styles like APA and MLA, significantly reducing formatting time.

Integration with prevalent word processors like Microsoft Word and Google Docs allows direct citation insertion and bibliography generation in documents, ensuring accuracy and consistency. Its PDF management capability lets users attach, organize, and annotate PDFs within the reference library.

Zotero encourages collaborative research through shared library features, vital for research teams. It offers cloud synchronization for easy access across devices and data backup, enhancing data security. Browser extensions for Chrome and Firefox simplify capturing references online. Being open-source, it continually evolves through community contributions, and its cross-platform availability extends its reach. Its applications are broad: it aids academic research, education, library assistance, and professionals across legal, medical, and media fields in managing and citing large numbers of references.

Introducing KNIMEZoBot

KNIMEZoBot is an integration of Zotero and OpenAI through the code-free KNIME platform, designed to streamline literature reviews and research. The project combines the Zotero reference manager with OpenAI’s natural language processing capabilities via a RAG-based approach, using KNIME as the interface. The primary goal is to simplify retrieving PDFs from Zotero libraries and collections, and then to use OpenAI within KNIME workflows to ask insightful questions and extract key information from academic papers.

KNIMEZoBot uses a Retrieval-Augmented Generation (RAG) architecture, first conducting a semantic search to identify relevant passages in the retrieved PDFs. It then leverages large language models (in this case OpenAI’s GPT models) to synthesize natural language answers based on the extracted information. This enables KNIMEZoBot to provide informative responses to questions by efficiently searching academic papers and distilling salient facts and main points.

Figure 1: Underlying workflow for KNIMEZoBot.

First component (Setup Zotero):

In order to effectively use the KNIMEZoBot system, users need to follow a series of key steps. The first requirement is selecting the type of Zotero library they want to access — either a personal Zotero library or a group library. Based on that choice, users will need to input their corresponding Zotero API key, which allows the system to interface with the library.

Additionally, users will need to provide either their personal Zotero user ID if accessing their own library, or the group ID if accessing a shared group library. To assist users in easily finding and copying their user ID or group ID, we have included hyperlinks within the system interface that direct users to Zotero guides with instructions on locating that information.

Furthermore, to enable more targeted searches, users have the option to filter based on Zotero collections. This allows them to refine the content being retrieved from their library down to specific collections, rather than everything in the library. The system was designed to be flexible — some users may want to search across their entire library, while others may want to narrow in on papers from select collections.

Figure 2: Component “Setup Zotero” when executed — User interface of KNIMEZoBot. Users are required to complete the Zotero information fields.
Figure 3: Continuation of the first component when executed. We provided options to filter by specific collections.
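As a rough illustration of what these inputs translate to, the sketch below builds a request against the public Zotero Web API (version 3) from the same four pieces of information: library type, user or group ID, API key, and an optional collection. Whether the KNIME component issues exactly these requests is an assumption; the function name and the placeholder IDs and keys are hypothetical, and no network call is made.

```python
ZOTERO_API_BASE = "https://api.zotero.org"


def zotero_items_request(library_type, library_id, api_key, collection_key=None):
    """Build the URL and headers for listing items in a Zotero library.

    library_type is "user" (personal library) or "group" (shared library).
    """
    if library_type not in ("user", "group"):
        raise ValueError("library_type must be 'user' or 'group'")
    prefix = f"{ZOTERO_API_BASE}/{library_type}s/{library_id}"
    if collection_key:
        # Restrict the request to a single collection.
        url = f"{prefix}/collections/{collection_key}/items"
    else:
        # Whole library.
        url = f"{prefix}/items"
    headers = {"Zotero-API-Key": api_key, "Zotero-API-Version": "3"}
    return url, headers


# Placeholder values (hypothetical): replace with your own ID and key.
url, headers = zotero_items_request("group", "123456", "YOUR_API_KEY",
                                    collection_key="ABCD2345")
print(url)
```

The returned URL and headers could then be passed to any HTTP client to fetch the item metadata.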

Second component (Setup OpenAI):

The second core component of the system involves setting up the OpenAI environment according to the user’s preferences. Users can adjust key settings such as chunk size and chunk overlap. Chunk size sets the maximum length of each text fragment into which the documents are split before embedding, while chunk overlap sets how much text is shared between consecutive fragments, so that content spanning a fragment boundary is not lost. Giving users control over these parameters enables them to customize the configuration based on their specific computational needs and use case.

After entering their OpenAI API key, which grants access to the AI models, users can select from the range of models available through the OpenAI API, including but not limited to GPT-3.5 Turbo and GPT-4.

Figure 4: Component “Setup OpenAI” when executed. Users are required to select chunk size and overlap settings for text processing, enter their OpenAI API key, and select an AI model.
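To make the two settings concrete, here is a minimal sketch of overlapping chunking. It is not the KNIME node's actual algorithm: whitespace-separated words stand in for tokens, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(tokens, chunk_size, chunk_overlap):
    """Split a token list into chunks of at most chunk_size tokens,
    where consecutive chunks share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


# Assumption: words approximate tokens for the purpose of this example.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_text(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more duplicated text.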

Last component (Chat app):

The last component of the system is the Chat application, which provides an interactive interface for users to engage with their Zotero library. This chatbot-style app enables users to pose questions and queries about the content of their Zotero library in a natural conversational format. The seamless integration of the chatbot with the Zotero reference database creates a convenient and user-friendly method for users to search for information within their library.

In addition, users have the option to download their full conversation history with the chatbot in a .csv format. This allows users to save all of their questions and the chatbot’s responses so they can refer back to the information later.

Figure 5: Final Component “Chat App” — Chat Interface. This component allows users to ask questions and receive answers through a conversational chatbot. Users can also download their full chat history as a CSV file.
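For readers curious what such an export might contain, the following sketch serializes a list of question and answer pairs to CSV using Python's standard library. The column names and the sample exchange are assumptions for illustration; the actual file produced by the Chat App may be laid out differently.

```python
import csv
import io


def chat_history_to_csv(history):
    """Serialize (question, answer) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["question", "answer"])  # assumed column names
    writer.writerows(history)
    return buffer.getvalue()


# Hypothetical conversation used only to demonstrate the format.
history = [
    ("What is RAG?", "Retrieval Augmented Generation, a retrieval-based approach."),
]
print(chat_history_to_csv(history))
```

The `csv` module quotes fields that contain commas or line breaks, so long answers survive a round trip through spreadsheet software.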

Conclusion

In summary, KNIMEZoBot represents a promising integration of technologies to expedite literature reviews. By unifying the capabilities of Zotero, OpenAI, and KNIME, this system automates laborious tasks such as downloading and digesting academic papers. Researchers can save significant time while benefiting from state-of-the-art AI techniques for synthesizing knowledge in a low-code manner. This demonstrates the potential for AI to assume a greater role in accelerating informed research. While further enhancements to the accuracy and sophistication of the automated analysis remain desirable, KNIMEZoBot marks an important step toward streamlining access to critical information in existing literature for domain experts who are not coders by training. By facilitating more rapid and comprehensive understanding of prior work, this system could substantially benefit the research community and the knowledge-building process.

KNIMEZoBot — An application using a low code approach to enable Retrieval Augmented Generation based QnA sessions with your own Zotero Libraries.

Project Files

About the lead author

Dr. Suad Alshammari holds a Pharm.D. from the Northern Border University of Saudi Arabia and an MSc. from Virginia Commonwealth University School of Pharmacy. She is currently pursuing a PhD in Pharmacotherapy. Her research interests encompass the use of AI for accelerating research and drug discovery.
