Maximizing the Power of Bioinformatics Through Large Natural Language Models
As we cover every month here on our blog, bioinformatics is evolving extremely fast, particularly in recent years thanks to advances in artificial intelligence (AI). Every few months, a new tool, resource or database revolutionizes some aspect of the field; and among tools, AI constantly brings new ways to process and analyze data (as we covered, for example, when we touched on the latest foundational models for biology) as well as new ways to make better predictions (as exemplified by AlphaFold 3). Today we will touch on large language models (LLMs) trained on human-readable data such as human-written text and source code, like those that power the famous ChatGPT, and on their several applications to biology, now and in the near future.
Among all the advancements in AI, LLMs such as OpenAI’s GPT family (with the latest GPT-4o just out), Google’s Gemini, Meta’s Llama and Anthropic’s Claude models, among others, have emerged as powerful tools particularly suited to boost various bioinformatics workflows. We have already covered the rather “obvious” application of large language models as tools to process text, which opens up new ways to do text mining, ontology, and bioinformatics; more recently, we have observed some wider applications. Given the potential of this technology to reshape the way researchers approach data analysis and knowledge extraction, we at Nexco are closely monitoring its evolution and its concrete applications. In particular, we believe LLMs can provide cutting-edge technologies that deliver efficient and accurate data analysis solutions for our clients. Here we present some exciting ideas, examples, and discussions about this.
The Transformative Impact of Large Language Models in Bioinformatics
Streamlining Annotation of scRNA-seq Data
LLMs excel at automating and streamlining the annotation of complex biological data. Traditional methods often require extensive manual effort and expert knowledge, which can be both time-consuming and costly. LLMs, on the other hand, can quickly process and annotate data with high accuracy. This capability has recently been applied to the task of annotating single-cell RNA sequencing (scRNA-seq) data, in a systematic test where GPT-4 was used to annotate cell types in ten benchmark datasets covering five species and hundreds of tissue and cell types, including normal and cancer samples.
The study, published in Nature Methods, found that GPT-4 can match the accuracy of manual annotations by experts, particularly when using the top ten differential genes identified by a two-sided Wilcoxon test. And of course, the LLM can do it orders of magnitude faster, reducing the workload on researchers, who can then focus on interpreting the results.
Notably, GPT-4’s proficiency is not limited to identifying major cell types; it also excels in distinguishing subtypes, especially immune cells, and provides more granular annotations than manual methods. Its integration with standard analysis pipelines like Seurat3 allows for quick and efficient input processing, outperforming other annotation methods that often require additional steps. Moreover, GPT-4 displayed robust performance even when dealing with mixed and unknown cell types, subsampling, and noisy data.
The AI model reported in said paper has been integrated into a software package called GPTCelltype, designed for the R programming environment, which streamlines the transition from manual to automated annotation within existing single-cell analysis workflows.
At its core, the procedure consists of sending the LLM (via the standard OpenAI API) the top differential genes identified by the two-sided Wilcoxon test, together with a prompt that the authors of the study designed and optimized. The prompts look just like what one would imagine could work, for example:
Identify cell types of TissueName cells using the following markers separately for each row. Only provide the cell type name. Do not show numbers before the name. Some can be a mixture of multiple cell types.\n GeneList.
While there is surely ample room for improvement, given how sensitive the results of LLMs are to specific prompts, as DeepMind reported, it is truly remarkable that a rather simple, concise and straightforward prompt can achieve the results reported in the paper. With further improvements in the prompting and in other elements discussed in the paper, tools like GPTCelltype could become very powerful.
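To make the procedure more concrete, here is a minimal Python sketch of how the prompt template quoted above could be filled in before being sent to an LLM. It is not the GPTCelltype implementation (which is an R package): the function name, the one-row-per-cluster layout of the gene list, and the example markers are our own assumptions for illustration, and the actual API call is deliberately left out.

```python
def build_annotation_prompt(tissue_name, marker_genes_by_cluster, top_n=10):
    """Fill the prompt template quoted in the text (hypothetical helper).

    marker_genes_by_cluster maps a cluster label to its marker genes, assumed
    to be already ranked, for example by a two-sided Wilcoxon test.
    """
    header = (
        f"Identify cell types of {tissue_name} cells using the following markers "
        "separately for each row. Only provide the cell type name. "
        "Do not show numbers before the name. "
        "Some can be a mixture of multiple cell types."
    )
    # One row per cluster, keeping only the top_n genes (layout is an assumption).
    rows = [", ".join(genes[:top_n]) for genes in marker_genes_by_cluster.values()]
    return header + "\n" + "\n".join(rows)


# Illustrative input only: the tissue name and gene lists are made up for the example.
markers = {
    "cluster0": ["CD3D", "CD3E", "IL7R"],
    "cluster1": ["MS4A1", "CD79A", "CD79B"],
}
prompt = build_annotation_prompt("human PBMC", markers)
print(prompt)
```

The resulting string would then be passed as the user message to whichever chat model is being used, and the reply split line by line to recover one label per cluster.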
Facilitating Human-Computer Interaction Using Natural Language
One of the most user-friendly aspects of LLMs is their ability to facilitate interaction with data through natural language commands. This makes advanced bioinformatics tools more accessible to researchers without extensive computational backgrounds. As an example, we already covered how an LLM coupled to a speech recognition system helps users of the HandMol program manipulate molecules inside virtual environments, where the user is typically away from peripherals like the keyboard and the mouse and might even have their hands busy handling virtual objects.
Similar in spirit but with a totally different goal, a recently released tool called Omega is a conversational agent for bioimage analysis. It works as a plugin for napari, a popular image viewer for biological data, and it allows users to perform complex tasks formulated in natural language. This way, Omega bypasses the need to know the napari software in depth, empowering more researchers, including those with less training, to interact with image datasets much more easily.
In the short paper describing Omega, the author showcases how through requests such as “segment cell nuclei in the selected image on the napari viewer” followed by “count the number of segmented nuclei” and finally “return a table that lists the nuclei, their positions and areas” a user can easily obtain the corresponding graphs and tables, without having to type a single command.
Notably, Omega is not just a bypass for napari but actually also helps its users to learn, via an AI-augmented code editor that casts requests into scripts from which the user can learn, also introducing safeguards such as automatic commenting and error checks. This way, novice users can learn and advanced users can check that the code generated by the LLM is actually correct and mitigate the risk of errors.
Generic Data Analysis and the Future
Other similar, broader applications have been showcased since the early days of LLMs, such as tools that cast natural-language requests on the fly into code for the analysis of generic data tables, including plotting capabilities. We at Nexco are firm believers that these kinds of natural interaction interfaces, possibly coupled to robust speech recognition systems like those offered by models such as Whisper or even built natively into browsers, will make advanced bioinformatics more accessible than ever.
Moreover, generalist LLM-based tools that integrate natural-language requests with LLM-based and other forms of data analysis could in the future allow even greater flexibility than is currently possible. And as foundational models become more practical (see our previous blog post), we can forecast complete Artificial General Intelligence (AGI) agents that can automate an even larger proportion of the procedures that lead from raw data to new biology and applications.
Some Practical Considerations
Incorporating LLMs into bioinformatics workflows enhances efficiency and reproducibility. These models can process data faster than traditional methods, providing rapid results that keep pace with the demands of modern research. Additionally, their ability to produce consistent and reproducible annotations and analyses helps standardize bioinformatics practices, reducing variability and improving the reliability of research outcomes.
However, all these pipelines involve some kind of casting of natural-language requests into function calls or even into written code, a task at which LLMs excel yet are not free of errors and hallucinations. For the moment it is thus impossible to rely blindly on these tools: we need better prompting methods, deeper benchmarking campaigns, and caution when using the models.
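One simple form of caution is to never execute what the model returns without checking it first. The sketch below illustrates the idea under our own assumptions: the LLM is asked to reply with a small JSON object naming a function and its arguments, and the reply is run only if it matches a whitelist. The function names, the JSON layout and the toy table are hypothetical and not taken from any of the tools discussed above.

```python
import json


# Hypothetical analysis functions that the assistant is allowed to call.
def count_rows(table):
    return len(table)


def column_mean(table, column):
    values = [row[column] for row in table]
    return sum(values) / len(values)


# Whitelist: function name -> (callable, exact set of expected argument names).
ALLOWED_CALLS = {
    "count_rows": (count_rows, set()),
    "column_mean": (column_mean, {"column"}),
}


def run_validated_call(llm_reply, table):
    """Run an LLM-proposed call only if it is valid JSON, names an allowed
    function, and supplies exactly the expected arguments."""
    try:
        request = json.loads(llm_reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"reply is not valid JSON: {err}") from err
    if not isinstance(request, dict):
        raise ValueError("reply must be a JSON object")
    name = request.get("function")
    if not isinstance(name, str) or name not in ALLOWED_CALLS:
        raise ValueError(f"function not allowed: {name!r}")
    func, expected_args = ALLOWED_CALLS[name]
    arguments = request.get("arguments", {})
    if not isinstance(arguments, dict) or set(arguments) != expected_args:
        raise ValueError(f"unexpected arguments for {name}: {arguments!r}")
    return func(table, **arguments)


# Toy table and a reply such as an LLM might produce (both made up).
table = [{"area": 10.0}, {"area": 14.0}]
reply = '{"function": "column_mean", "arguments": {"column": "area"}}'
print(run_validated_call(reply, table))  # prints 12.0
```

A check like this does not make the model's choice of analysis correct, but it does guarantee that a hallucinated function name or malformed reply fails loudly instead of running.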
Cost Effectiveness
Costs must also be considered. Some models are open source, but they must be deployed somewhere, and this often carries significant costs; other models are accessed through paid APIs, which in exchange spare you from maintaining any running systems. Modern LLMs called via APIs, such as those from OpenAI, have become very cheap per token, but the number of tokens can escalate very quickly when processing large datasets.
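A quick back-of-the-envelope calculation helps to see how token counts add up. The sketch below is only illustrative: the function is our own, and the request counts, token counts and per-million-token prices are placeholder numbers, not actual prices from any provider.

```python
def estimate_api_cost(n_requests, input_tokens_per_request, output_tokens_per_request,
                      usd_per_million_input, usd_per_million_output):
    """Estimate the total API cost in USD for a batch of similar requests."""
    input_cost = n_requests * input_tokens_per_request * usd_per_million_input / 1_000_000
    output_cost = n_requests * output_tokens_per_request * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Placeholder scenario: 50,000 requests of 300 input and 50 output tokens each,
# at hypothetical prices of 5 and 15 USD per million input and output tokens.
cost = estimate_api_cost(
    n_requests=50_000,
    input_tokens_per_request=300,
    output_tokens_per_request=50,
    usd_per_million_input=5.0,
    usd_per_million_output=15.0,
)
print(f"Estimated cost: {cost:.2f} USD")  # prints "Estimated cost: 112.50 USD"
```

Plugging in the current prices of the chosen model and realistic prompt sizes for your dataset gives a first estimate to compare against the cost of hosting an open model yourself.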
How We at Nexco Can Put Large Natural Language Models to Work for You
At Nexco, we are dedicated to using the latest advancements in AI to enhance bioinformatics research. Here are some of the ways we can apply LLMs to improve your data analysis pipelines and, potentially, other kinds of procedures:
- Automated Data Annotation: We can develop new LLM-based software, or deploy existing tools such as the GPTCelltype package described above for scRNA-seq, to automate the annotation of datasets, helping you reduce the need for manual intervention and accelerate analyses.
- Natural Language Interfaces: By integrating conversational agents much like in Omega and in other examples shown above, we can enable researchers to interact with their data using natural language commands, simplifying complex workflows and making advanced analysis tools more accessible. By adding speech recognition and speech synthesis as in the HandMol example, we can further simplify human-computer interaction.
- Custom Analytical Solutions: We leverage the versatility of LLMs to develop tailored solutions that meet your specific needs in bioinformatics and molecular modeling, working together in the development, testing, fine-tuning, prompting, and implementation of LLM-based systems for data analysis.
At Nexco, our commitment to innovation ensures that you receive the most advanced, efficient, and reliable bioinformatics solutions available. By starting to integrate LLMs into our services and upcoming products, we empower you to unlock the full potential of your data and drive groundbreaking discoveries in your research.