Want a More Intelligent Generative AI Chatbot? Start with Intelligent Content!

Feb 25, 2024

Radically improving the intelligence of generative AI chatbots requires intelligent content, a pivotal factor often overlooked in early implementations. Most retrieval-augmented generation (RAG) implementations assume that their content sources are unstructured. However, many organizations hold extensive collections of documents structured according to specific models, especially for user assistance (help) documentation. These structured collections can be instrumental in building AI systems that are more accurate and reliable, easier to maintain, and more transparent in their operation, in line with explainable AI (XAI) principles. These benefits can be realized at large scale and with a high level of automation.

An ideal framework for organizing and managing such structured content is the Darwin Information Typing Architecture (DITA). DITA is an XML-based framework designed for authoring, producing, and distributing both technical and non-technical information. As one of the most successful and broadly adopted open OASIS standards for documentation, DITA structures documents and data in a way that is both modular and reusable, which is particularly advantageous for large documentation sets. DITA defines a set of design principles and document types that encourage topic-oriented content rather than the traditional document-centric approach. This methodology promotes content reuse, streamlines management, and supports more efficient and automated distribution across platforms and formats.

The foundational elements of DITA include:

Topics: These are the primary content units within DITA, each covering a single subject or concept. Topics vary in type — ranging from concepts and tasks to references — each fulfilling a distinct role within the documentation.
Maps: DITA maps organize topics to mirror the structure of a document or publication. They enable the definition of topic relationships, facilitate content reuse, and allow for the customization of outputs for different audiences or objectives.
Specialization: DITA allows the creation of new, more specific types based on existing topics and maps. This process, known as specialization, enhances the adaptability of content to specific needs or domains without compromising compatibility with standard DITA processing tools.
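
To make these elements concrete, here is a minimal Python sketch that reads a DITA task topic and a small map using only the standard library. The sample XML, the file names, and the helper names describe_topic and map_hierarchy are illustrative assumptions, not part of any DITA toolkit; the element names (task, title, steps, cmd, map, topicref) are standard DITA.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content, made up for illustration.
TASK_XML = """
<task id="reset-password">
  <title>Reset your password</title>
  <shortdesc>Use the account page to set a new password.</shortdesc>
  <taskbody>
    <steps>
      <step><cmd>Open the account page.</cmd></step>
      <step><cmd>Select Reset password.</cmd></step>
    </steps>
  </taskbody>
</task>
"""

MAP_XML = """
<map>
  <title>Account help</title>
  <topicref href="accounts.dita" type="concept">
    <topicref href="reset-password.dita" type="task"/>
  </topicref>
</map>
"""


def describe_topic(xml_text):
    """Summarize one topic: its type is simply the root element name."""
    root = ET.fromstring(xml_text)
    return {
        "type": root.tag,
        "id": root.get("id"),
        "title": root.findtext("title"),
        "steps": [cmd.text for cmd in root.iter("cmd")],
    }


def map_hierarchy(xml_text):
    """Return (parent_href, child_href) pairs from nested topicrefs."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(node, parent_href):
        for ref in node.findall("topicref"):
            href = ref.get("href")
            pairs.append((parent_href, href))
            walk(ref, href)

    walk(root, None)
    return pairs


print(describe_topic(TASK_XML))
print(map_hierarchy(MAP_XML))
```

Because the topic type and the map hierarchy are explicit in the markup, no heuristics are needed to tell a procedure from a concept or to recover which topic sits under which.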

In the documentation industry, intelligent content is essentially synonymous with DITA. Intelligent content is modular, structured, reusable, and semantically rich, and it separates content from presentation. This makes the content predictably understandable by machines for processing, interpretation, and automation.

DITA-based documents are inherently self-descriptive due to their explicit topic typing, semantic tagging (which focuses on the content’s purpose rather than its appearance), and comprehensive metadata detailing aspects like authorship, publication date, intended audience, and content status. This rich metadata, embedded directly within the document’s structure, provides essential context and details about the content without reliance on external references. Furthermore, DITA’s container-based inheritance model is perfectly suited for enriching documents with uniform content taxonomy labels, either manually or through auto-classification. This enriches knowledge graph RAG solutions with more accurate and precise information when queried.
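
As a rough illustration of both points, the Python sketch below reads metadata from a topic's prolog and then pushes category labels down a map hierarchy. The sample XML and the function names topic_metadata and inherited_labels are assumptions made for this example, and the inheritance logic is a deliberately simplified stand-in for DITA's actual metadata cascade rules, which are more nuanced.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic with standard DITA prolog elements.
TOPIC_XML = """
<concept id="billing-overview">
  <title>Billing overview</title>
  <prolog>
    <author>Jane Doe</author>
    <critdates><created date="2024-02-25"/></critdates>
    <metadata><audience type="administrator"/></metadata>
  </prolog>
  <conbody><p>How billing works.</p></conbody>
</concept>
"""

# Hypothetical map in which categories are set on container topicrefs.
MAP_XML = """
<map>
  <topicref href="billing.dita">
    <topicmeta><category>Billing</category></topicmeta>
    <topicref href="invoices.dita"/>
    <topicref href="refunds.dita">
      <topicmeta><category>Refunds</category></topicmeta>
    </topicref>
  </topicref>
</map>
"""


def topic_metadata(topic_xml):
    """Pull self-describing metadata straight from the topic."""
    root = ET.fromstring(topic_xml)
    created = root.find("prolog/critdates/created")
    audience = root.find("prolog/metadata/audience")
    return {
        "topic_type": root.tag,
        "author": root.findtext("prolog/author"),
        "created": created.get("date") if created is not None else None,
        "audience": audience.get("type") if audience is not None else None,
    }


def inherited_labels(map_xml):
    """Give each topicref its own categories plus those of its ancestors."""
    root = ET.fromstring(map_xml)
    labels = {}

    def walk(node, inherited):
        for ref in node.findall("topicref"):
            own = [c.text for c in ref.findall("topicmeta/category")]
            combined = inherited + own
            labels[ref.get("href")] = combined
            walk(ref, combined)

    walk(root, [])
    return labels


print(topic_metadata(TOPIC_XML))
print(inherited_labels(MAP_XML))
```

Labeling one container is enough to tag everything beneath it, which is what makes uniform taxonomy enrichment practical across a large corpus.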

Given its foundation in an intelligent document object model (DOM), DITA is ideally positioned for automating the creation of knowledge graphs for retrieval-augmented generation. The DITA schema can act as the ontology for mapping document instances to a graph database, enabling the automatic generation and maintenance of a knowledge graph at any scale. This graph can then be efficiently queried and utilized in a neuro-symbolic RAG implementation.
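
One way to picture this mapping is a small Python sketch that turns a map and its topics into subject-predicate-object triples, the raw material of a knowledge graph. The sample content, the predicate names (CONTAINS, IS_A, HAS_TITLE, RELATED_TO), and the build_triples helper are all hypothetical choices for illustration; a real pipeline would load the triples into a graph database and would likely draw on far more of the DITA schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical topics keyed by file name.
TOPICS = {
    "install.dita": """<task id="install"><title>Install the agent</title>
        <related-links><link href="requirements.dita"/></related-links></task>""",
    "requirements.dita": """<reference id="requirements">
        <title>System requirements</title></reference>""",
}

MAP_XML = """<map><title>Agent guide</title>
  <topicref href="install.dita"/>
  <topicref href="requirements.dita"/>
</map>"""


def build_triples(map_xml, topics):
    """Derive graph triples from DITA structure rather than from free text."""
    triples = []

    # The map tells us which topics belong to the publication.
    map_root = ET.fromstring(map_xml)
    map_title = map_root.findtext("title")
    for ref in map_root.iter("topicref"):
        triples.append((map_title, "CONTAINS", ref.get("href")))

    # Each topic contributes its type, title, and explicit links.
    for href, xml_text in topics.items():
        root = ET.fromstring(xml_text)
        triples.append((href, "IS_A", root.tag))
        triples.append((href, "HAS_TITLE", root.findtext("title")))
        for link in root.iter("link"):
            triples.append((href, "RELATED_TO", link.get("href")))

    return triples


for triple in build_triples(MAP_XML, TOPICS):
    print(triple)
```

Every edge here comes from markup the authors already wrote, so the graph can be regenerated whenever the source changes instead of being curated by hand.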

Furthermore, the DITA ontology and the taxonomies used to generate the knowledge graph and enrich the content objects can be used to fine-tune the large language model or its embedding model for the domain, further improving retrieval accuracy for a given content corpus.

Regrettably, many early adopters of RAG technology did not consult their documentation teams about the availability of DITA-formatted content, missing out on its considerable advantages for generative AI applications. For organizations yet to adopt DITA, it may be worthwhile to consider moving from unstructured documents to a structured DITA format, given DITA’s substantial benefits beyond AI applications. If wholesale migration is not feasible, unstructured documents can still participate in a DITA-based knowledge graph by encapsulating them in DITA containers, or alternatively by adopting Lightweight DITA (LwDITA), a simplified form of DITA.

A knowledge graph based on DITA can significantly improve retrieval accuracy as well as the generation, referencing, and fact-checking of content. Graph-based RAG systems are a form of neuro-symbolic AI: they combine neural networks with symbolic AI (structured data, logic, rules, and relationships) and so avoid many common limitations of purely vector-based models, which remain prone to inaccuracies. With intelligent content, development teams can move beyond the constant need for fine-tuning, because the intelligence embedded in the DITA source allows versioning and maintenance of the knowledge graph to be automated, streamlining the development process for generative AI chatbots.
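
The retrieval side can be sketched in a few lines of Python. This assumes a seed topic has already been found by some other means (for example a vector or keyword search, which is not shown) and then follows explicit graph edges to gather related topics, keeping each passage tagged with its source so the answer can be referenced and checked. The graph, the passage text, and the function names expand_context and build_context are hypothetical.

```python
from collections import deque

# Hypothetical adjacency list derived from DITA links and map structure.
GRAPH = {
    "install.dita": ["requirements.dita", "troubleshoot.dita"],
    "requirements.dita": [],
    "troubleshoot.dita": ["logs.dita"],
    "logs.dita": [],
}

# Hypothetical passage text per topic.
TEXT = {
    "install.dita": "Run the installer as an administrator.",
    "requirements.dita": "The agent needs 2 GB of free disk space.",
    "troubleshoot.dita": "If the install fails, check the log files.",
    "logs.dita": "Logs are written to the agent data directory.",
}


def expand_context(seed, graph, max_hops=1):
    """Breadth-first walk from the seed topic, limited to max_hops edges."""
    seen = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, depth + 1))
    return order


def build_context(seed, graph, text, max_hops=1):
    """Assemble prompt context with a source tag on every passage."""
    topics = expand_context(seed, graph, max_hops)
    return "\n".join(f"[{href}] {text[href]}" for href in topics)


print(build_context("install.dita", GRAPH, TEXT, max_hops=1))
```

The hop limit is the main design choice in this sketch: a small value keeps the context focused on closely related topics, while a larger one trades precision for coverage.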

Sooner or later, developers are going to realize that the quality and utility of generative AI, especially for help and user assistance, depend at least as much on the quality and intelligence of the content as on the model and its implementation.


Written by Michael Iantosca
