Knowledge graphs and black box LLMs
What are the limitations of large language models (LLMs) in biological research? ChatGPT responds to this query with quite a comprehensive list that includes a lack of domain-specific knowledge, contextual understanding, access to up-to-date information, and interpretability and explainability.
Nevertheless, it has to be acknowledged that LLMs can have a transformative impact on biological and biomedical research. After all, these models have already been applied successfully to tasks based on biological sequence data, such as protein structure prediction, and could possibly be extended to the broader language of biochemistry. Specialized LLMs like chemical language models (CLMs) have the potential to outperform conventional drug discovery processes for traditional small-molecule drugs as well as antibodies. More broadly, there is a huge opportunity to use large-scale pre-trained language models to extract value from vast volumes of unannotated biomedical data.
Pre-training, of course, will be key to the development of biological domain-specific LLMs. Research shows that domains, such as biomedicine, with large volumes of unlabeled text benefit most from domain-specific pretraining, as opposed to starting from general-domain language models. Biomedical language models, pre-trained solely on domain-specific vocabulary, cover a much wider range of applications and, more importantly, substantially outperform currently available biomedical NLP tools.
However, there is a larger issue of interpretability and explainability when it comes to transformer-based LLMs.
The LLM Black Box
The development of Natural Language Processing (NLP) models has traditionally been rooted in white-box techniques that were inherently interpretable. Since then, however, the evolution has been towards more sophisticated black-box techniques that have undoubtedly facilitated state-of-the-art performance but have also obfuscated interpretability.
To understand the sheer scale of the interpretability challenge in LLMs, we turn to OpenAI’s paper “Language models can explain neurons in language models” from earlier this year, which opens with the sentence “Language models have become more capable and more widely deployed, but we do not understand how they work.” Millions of neurons need to be analyzed in order to fully understand LLMs, and the paper proposes an approach to automating interpretability so that it can be scaled to all neurons in a language model. The catch, however, is that “Neurons may not be explainable.”
So, even as work continues on interpretable LLMs, the life sciences industry needs a more immediate solution to harness the power of LLMs while mitigating issues such as interpretability and explainability. And knowledge graphs could be that solution.
Augmenting BioNLP Interpretability with Knowledge Graphs
One criticism of LLMs is that the predictions they generate, based on ‘statistically likely continuations of word sequences’, fail to capture the relational functionings that are central to scientific knowledge creation. These relational functionings, as it were, are critical to effective life sciences research.
Biomedical data is derived from different levels of biological organization, with disparate technologies and modalities, and scattered across multiple non-standardized data repositories. Researchers need to connect all these dots, across diverse data types, formats, and sources, and understand the relationships/dynamics between them in order to derive meaningful insights.
Knowledge graphs (KGs) have become a critical component of life sciences’ technology infrastructure because they help map the semantic or functional relationships between millions of different data points. They use NLP to create a semantic network that represents all objects in the system in terms of the relationships between them. Semantic data integration, based on ontology matching, helps organize and link disparate structured and unstructured information into a unified, human-readable, computationally accessible, and traceable knowledge graph that can be further queried for novel relationships and deeper insights.
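To make the idea of a queryable semantic network concrete, here is a minimal Python sketch. It assumes a toy graph stored as (subject, relation, object) triples; the entity names (GeneA, DrugD and so on), the relation labels, and the helper functions are all hypothetical placeholders rather than real curated data or any particular platform's API.

```python
from collections import defaultdict

# Hypothetical toy triples: (subject, relation, object). Not real curated data.
TRIPLES = [
    ("GeneA", "encodes", "ProteinA"),
    ("ProteinA", "participates_in", "PathwayX"),
    ("DrugD", "inhibits", "ProteinA"),
    ("PathwayX", "associated_with", "DiseaseY"),
]


def build_index(triples):
    """Map each subject to its outgoing (relation, object) edges."""
    index = defaultdict(list)
    for subj, rel, obj in triples:
        index[subj].append((rel, obj))
    return index


def find_paths(index, start, goal, max_hops=3):
    """Depth-first search for relation paths that lead from start to goal."""
    paths = []

    def walk(node, path, visited):
        if node == goal and path:
            paths.append(list(path))
            return
        if len(path) == max_hops:
            return
        for rel, obj in index.get(node, []):
            if obj not in visited:
                path.append((node, rel, obj))
                walk(obj, path, visited | {obj})
                path.pop()

    walk(start, [], {start})
    return paths


index = build_index(TRIPLES)
paths = find_paths(index, "DrugD", "DiseaseY")
for path in paths:
    # Each hop is printed with its relation, so the chain stays traceable.
    print(" -> ".join(f"{s} [{r}] {o}" for s, r, o in path))
```

Because every hop in a returned path is an explicit triple, a reader can see exactly which relationships connect the drug to the disease, which is the traceability property described above.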
Unifying LLMs and KGs
Combining these distinct ontology-driven and natural language-driven systems creates a synergistic technique that enhances the advantages of each while addressing the limitations of both. KGs can provide LLMs with the traceable factual knowledge required to address interpretability concerns.
One roadmap for the unification of LLMs and KGs proposes three different frameworks:
- KG-enhanced LLMs: The structured, traceable knowledge from KGs enhances the knowledge awareness and interpretability of LLMs. Incorporating KGs in the pre-training stage helps with the transfer of knowledge, whereas incorporating them in the inference stage enhances LLM performance in accessing domain-specific knowledge.
- LLM-augmented KGs: LLMs can be used in two different contexts. They can process the original corpus to extract the relations and entities that inform KG construction, and they can process the textual corpus in the KGs to enrich representation.
- Synergized LLMs + KGs: Both systems are unified into one general framework containing four layers. One, a data layer that processes the textual and structural data that can be expanded to incorporate multi-modal data, such as video, audio, and images. Two, the synergized model layer, where both systems' features are synergized to enhance capabilities and performance. Three, a technique layer to integrate related LLMs and KGs into the framework. And four, an application layer, for addressing different real-world applications.
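As an illustration of the first framework at the inference stage, the sketch below retrieves KG facts relevant to a question and places them, with their provenance, in the prompt that would be sent to an LLM. It is a simplified sketch under stated assumptions: the facts, source identifiers, and function names are hypothetical, the retrieval is naive substring matching, and no actual model is called.

```python
# Hypothetical KG facts with provenance; all identifiers are placeholders.
FACTS = [
    {"subject": "DrugD", "relation": "inhibits", "object": "ProteinA",
     "source": "source-001"},
    {"subject": "ProteinA", "relation": "participates_in", "object": "PathwayX",
     "source": "source-002"},
    {"subject": "GeneB", "relation": "encodes", "object": "ProteinB",
     "source": "source-003"},
]


def retrieve_facts(question, facts):
    """Return facts whose subject or object is mentioned in the question.

    Naive case-insensitive substring matching; a real system would use
    entity linking against an ontology instead.
    """
    q = question.lower()
    return [
        f for f in facts
        if f["subject"].lower() in q or f["object"].lower() in q
    ]


def build_prompt(question, facts):
    """Assemble a prompt that asks the model to cite numbered KG facts."""
    lines = ["Answer using only the facts below and cite them by number.", ""]
    for i, f in enumerate(facts, start=1):
        lines.append(
            f"[{i}] {f['subject']} {f['relation']} {f['object']} "
            f"(source: {f['source']})"
        )
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


question = "Which pathway could DrugD affect through ProteinA?"
selected = retrieve_facts(question, FACTS)
prompt = build_prompt(question, selected)
print(prompt)
```

The point of numbering the facts and carrying their sources into the prompt is that any answer can be traced back to specific graph statements rather than to opaque model weights.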
The KG-LLM advantage
A unified KG-LLM approach to BioNLP provides an immediate solution to the black box concerns that impede large-scale deployment in the life sciences. Incorporating domain-specific KGs, ontologies, and dictionaries can significantly enhance LLM performance in terms of semantic understanding and interpretability. At the same time, LLMs can help enrich KGs with real-world data from EHRs, scientific publications, and other sources, thereby expanding the scope and scale of semantic networks and enhancing biomedical research.
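The enrichment direction can be sketched in the same spirit. In the Python example below, a naive regular expression stands in for the LLM extraction step, purely so the example runs without a model; the documents, entity names, and the `extract_triples` and `enrich` helpers are hypothetical. What it aims to show is the bookkeeping: each extracted triple is merged into the graph together with the documents that support it.

```python
import re
from collections import defaultdict

# Toy pattern that recognizes "<entity> <relation> <entity>" statements.
PATTERN = re.compile(r"(\w+) (inhibits|activates|encodes) (\w+)")


def extract_triples(text):
    """Stand-in for an LLM extraction step: a naive regex over toy sentences."""
    return [m.groups() for m in PATTERN.finditer(text)]


def enrich(kg, documents, extractor=extract_triples):
    """Add extracted triples to kg, recording which documents support each."""
    for doc_id, text in documents.items():
        for triple in extractor(text):
            kg[triple].add(doc_id)
    return kg


# Hypothetical source documents keyed by a placeholder identifier.
documents = {
    "note-1": "DrugD inhibits ProteinA in cultured cells.",
    "paper-2": "We confirm that DrugD inhibits ProteinA. GeneA encodes ProteinA.",
}

kg = enrich(defaultdict(set), documents)
for triple, sources in sorted(kg.items()):
    print(triple, sorted(sources))
```

Passing the extractor in as a parameter keeps the merge logic independent of how extraction is done, so the regex could be swapped for a model-backed function without changing how provenance is recorded.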
At BioStrand, we have already created a comprehensive Knowledge Graph that integrates over 660 million objects, linked by more than 25 billion relationships, from the biosphere and from other data sources, such as scientific literature. Plus, our LENSai platform, powered by HYFT technology, leverages the latest advancements in LLMs to bridge the gap between syntax (multi-modal sequential and structural data) and semantics (functions). By integrating retrieval-augmented generation (RAG) models, we have been able to harness the reasoning capabilities of LLMs while simultaneously addressing several associated limitations such as knowledge cutoff, hallucinations, and lack of interpretability. Compared to closed-loop language modelling, this enhanced approach yields multiple benefits, including clear provenance and attribution, and up-to-date contextual reference as our knowledge base updates and expands. If you would like to integrate the power of a unified KG-LLM framework into your research, please drop us a line.