WhyHow.AI

WhyHow.AI’s platform helps devs and non-technical domain experts build Agentic and RAG-native knowledge graphs

Case Study: Turning Congressional Hearing Transcripts into Knowledge Graphs


Interested in turning transcripts into a Temporal Knowledge Graph that lets you ask complex questions spanning multiple Congressmen, political issues, and Committees?

In this case study, we show how Congressional transcripts are turned into Temporal Knowledge Graphs that you can rely on for RAG and analytics. We show real questions and answers run against this system, and the kinds of business outcomes you can achieve with it. As far as we are aware, the combination of steps here is a relatively novel Knowledge Graph implementation.

Data used

We relied on publicly available Congressional Committee hearing transcripts that can be found online here. We used only a subset of the transcripts available, ingesting 30 hearings.

Sample Congressional Transcript used

Novel Knowledge Graph Architecture Overview

Nodes:

We have 6 types of Nodes: Topic, Committee, Organization, People, Hearing, Event

Triples (Sample List):

Hearing -> Has Challenges/Opportunities -> Challenges/Opportunities

Hearing -> Discussed -> Topic

People -> Participated in -> Hearing

People -> Affiliated with -> Organization

Chunks:

Chunks are standalone text objects. Chunks are tied to Triples, and many Chunks can be tied to a single Triple. Instead of being the unstructured source text of the Triple, the Chunks in this case are summaries and key points related to each Triple type.

The following examples show what the Chunks tied to different Triples look like:

So for example, the Chunk tied to the triple “People -> Affiliated with -> Organization” would look like the entirety of the information below:

Name: John D. Herold

Role: Lieutenant, Pennsylvania State Police

Context: Described the Pennsylvania State Police’s involvement in the security detail and the planning process leading up to the event.

Contact Info: {"email": "", "phone": "", "twitter": "", "website": "", "facebook": "", "linkedin": "", "instagram": ""}

While the Chunk tied to the triple “Hearing -> Discussed -> Topic” would look like the entirety of the information below:

Planning and Security Protocols: Failures in the security planning and execution were discussed, emphasizing the need for better preparedness.
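The link between Triples and Chunks described above can be sketched as a small data structure. This is only an illustration: the class names `Chunk` and `Triple`, the `chunks_for` helper, and the chunk IDs are assumptions made for the example, not the WhyHow SDK, and the sample chunk is split in two here purely to show that several Chunks can hang off one Triple.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    chunk_id: str
    text: str  # a summary or key point, not the raw transcript passage


@dataclass
class Triple:
    head: str
    relation: str
    tail: str
    chunk_ids: List[str] = field(default_factory=list)  # many chunks per triple


def chunks_for(triple: Triple, store: Dict[str, Chunk]) -> List[str]:
    """Return the text of every stored chunk linked to a triple, in link order."""
    return [store[cid].text for cid in triple.chunk_ids if cid in store]


# Hypothetical chunk store keyed by chunk ID.
store = {
    "c1": Chunk("c1", "Name: John D. Herold\nRole: Lieutenant, Pennsylvania State Police"),
    "c2": Chunk("c2", "Context: Described the Pennsylvania State Police's involvement in the security detail."),
}
affiliation = Triple(
    "John D. Herold", "Affiliated with", "Pennsylvania State Police", ["c1", "c2"]
)
linked_text = chunks_for(affiliation, store)
```

Once the right Triple is found, everything in `linked_text` can be placed into the LLM context as-is.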

1. Summary -> TextEmbedding
Triple: (Summary) -> [inherits_from] -> (TextEmbedding)
- Chunk_ids link to document summaries
- Example Chunk: "House Committee on Veterans Affairs discusses modernization efforts in Q1 2024. Key focus on digital transformation and claims processing."

2. FinancialInsight -> TextEmbedding
Triple: (FinancialInsight) -> [inherits_from] -> (TextEmbedding)
- Chunk_ids link to financial discussions
- Example Chunk: "FY2024 budget allocation increased by $2.3B for VA modernization initiatives, focusing on IT infrastructure."

3. Challenge -> TextEmbedding
Triple: (Challenge) -> [inherits_from] -> (TextEmbedding)
- Chunk_ids link to identified challenges
- Example Chunk: "Legacy system integration poses significant hurdles for VA modernization timeline, impacting claims processing speed."

4. KeyEvent -> TextEmbedding
Triple: (KeyEvent) -> [inherits_from] -> (TextEmbedding)
- Chunk_ids link to event records
- Example Chunk: "2024-01-15: Committee hearing on VA Claims Backlog Reduction Act introduced by Rep. Smith."

5. WhatMatters -> TextEmbedding
Triple: (WhatMatters) -> [inherits_from] -> (TextEmbedding)
- Chunk_ids link to key insights
- Example Chunk: "Veteran healthcare access remains top priority, with focus on rural area coverage and telehealth expansion."

6. Opportunity -> TextEmbedding
Triple: (Opportunity) -> [inherits_from] -> (TextEmbedding)
- Chunk_ids link to identified opportunities
- Example Chunk: "AI implementation could reduce claims processing time by 40%, potential for immediate implementation."

7. PersonContext -> TextEmbedding
Triple: (PersonContext) -> [inherits_from] -> (TextEmbedding)
- Chunk_ids link to person-related information
- Example Chunk: "Dr. Jane Smith, VA Chief Technology Officer, leading digital transformation initiative since 2023."

Link to Graph: https://main--whyhowai.netlify.app/public/graph/671c22353ccecb7033cacec5

With this graph architecture, where key points and summaries are tied to triples, you can focus on landing on the right set of triples through an unstructured search, and then bring in all the relevant key information through the linked chunks in a structured way.

Unique to WhyHow Architecture

A few things unique to the WhyHow graph infrastructure allow us to build this architecture in a simple way.

Firstly, Triples are embedded and retrieved through vector search. This avoids a common retrieval issue: having to use Text2Cypher to identify nodes and relationships and then construct a Cypher query just to land on the right Triple. This has been shown to improve retrieval accuracy by up to 3x.
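A minimal sketch of retrieving Triples by vector similarity, rather than by a generated Cypher query, might look like the following. The `embed` function is a toy word-count embedding used only to keep the example self-contained; a real system would use a learned embedding model, and none of these function names come from WhyHow.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts stand in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search_triples(
    question: str, triples: List[Tuple[str, str, str]], k: int = 2
) -> List[Tuple[str, str, str]]:
    """Rank triples by similarity between the question and the triple rendered as text."""
    q = embed(question)
    scored = [(cosine(q, embed(" ".join(t))), t) for t in triples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


triples = [
    ("James Comer", "Participated in",
     "Unsuitable Litigation: Oversight of Third-Party Litigation Funding"),
    ("Jim Jordan", "Participated in",
     "Hearing on the Weaponization of the Federal Government"),
    ("Maya Steinitz", "Affiliated with", "Boston University School of Law"),
]
top = search_triples("Which hearing did James Comer participate in?", triples, k=1)
```

The point of the design is that the question never has to be parsed into node and relationship names; the whole Triple is matched as one unit of text.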

Secondly, Triples are standalone objects in WhyHow that you can link Chunks to. This allows you to distill the key information that you want to retrieve per Triple, and bring it directly into the context once the right Triples are found. It avoids having to represent crucial information and context in a graph format (which complicates schema construction), and it lets you bring in information in a structured way after the initial unstructured vector search. This is similar in process to LinkedIn’s application of Knowledge Graphs in their system, where crucial information like ‘Steps to Reproduce’ is represented and retrieved similarly, and where the steps themselves are represented as individual ‘chunks’/‘nodes’.

Thirdly, WhyHow accepts data in a JSON format, which allows the output of any extraction framework to flow directly into graph creation. In this case, we use Unstructured for the initial transformation of the transcript data into the JSON structure needed to load into WhyHow.

Fourthly, because of the way that Chunks and the retrieval process are designed in the WhyHow system, you can easily include temporal data that governs how the answer is constructed. Temporal data has always been hard to model in Knowledge Graphs (to the point that leading KG experts typically advise against it), but it is an obviously important part of many workflows. Existing methods that attempt to model temporal data ingest it into the Knowledge Graph itself and then retrieve it with a structured Cypher query. Our architecture instead uses the LLM to help filter for temporal data.
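One way to picture this: each Chunk carries its date as plain text, and the current date is stated alongside it, so the LLM can decide what counts as "the past 2 months" or "upcoming". The sketch below assumes a hypothetical `render_temporal_context` helper and an arbitrary value for today's date; it only shows how the dated context could be laid out, not how WhyHow stores it.

```python
from datetime import date
from typing import List, Tuple


def render_temporal_context(chunks: List[Tuple[date, str]], today: date) -> str:
    """Prefix each chunk with its date and state today's date; filtering is left to the LLM."""
    lines = [f"Today's date: {today.isoformat()}"]
    for event_at, text in sorted(chunks, key=lambda c: c[0]):
        lines.append(f"[{event_at.isoformat()}] {text}")
    return "\n".join(lines)


chunks = [
    (date(2024, 9, 25), "Hearing on the Weaponization of the Federal Government"),
    (date(2023, 11, 14), "Emerging Therapies: Breakthroughs in the Battle Against Suicide?"),
]
# The value of `today` is an assumption chosen for the example.
context_block = render_temporal_context(chunks, today=date(2024, 10, 1))
```

No date arithmetic or temporal edges are needed in the graph itself; the dates travel with the Chunks and are interpreted at answer time.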

Blending the power of LLMs with structured knowledge representations like Knowledge Graphs is an important way to achieve business outcomes, and we think this temporal Knowledge Graph architecture will help unlock a lot of business value by making temporal data usable.

Data Transformation Process Used

First, we use Unstructured to turn the transcript information into a schema-aligned set of information on a per transcript basis.

Each transcript is turned into a JSON summarization that looks like this (partial JSON output below):

[
{
"intel_id": 2070,
"insights": {
"data": {
"summary": "The Congressional hearing on September 13, 2023, titled 'Unsuitable Litigation: Oversight of Third-Party Litigation Funding,' examined the growing influence of third-party litigation funding in the legal system, focusing on the ethical, financial, and judicial implications. Lawmakers discussed concerns over potential abuses, the motivations behind mass litigation, and the need for regulatory measures to enhance transparency and protect consumers.",
"financial": [
"The mass tort litigation sector generated approximately $443 billion in 2020, significantly impacting the economy.",
"Litigation funding firms have assets under management that grew from $11.3 billion to $13.5 billion from 2020 to 2022."
],
"challenges": [
"The prevalence of frivolous lawsuits is burdening the justice system and judicial resources.",
"Litigation funding lacks proper regulation and transparency, creating risks related to ethical standards and the integrity of the judicial process."
],
"key_events": [],
"key_people": [
{
"name": "James Comer",
"role": "Chairman of the Committee",
"context": "Led the hearing, focusing on the need for oversight of litigation funding practices.",
"contact_info": {
"email": "",
"phone": "",
"twitter": "",
"website": "",
"facebook": "",
"linkedin": "",
"instagram": ""
}
},
{
"name": "Jamie Raskin",
"role": "Ranking Member",
"context": "Raised concerns about the ethics of the Supreme Court Justices and the influence of wealth in the judicial system.",
"contact_info": {
"email": "",
"phone": "",
"twitter": "",
"website": "",
"facebook": "",
"linkedin": "",
"instagram": ""
}
},
{
"name": "Maya Steinitz",
"role": "Professor of Law, Boston University School of Law",
"context": "Testified on the implications of litigation funding and the need for regulation.",
"contact_info": {
"email": "",
"phone": "",
"twitter": "",
"website": "",
"facebook": "",
"linkedin": "",
"instagram": ""

Secondly, we use Senzing to perform Entity Resolution and to identify any extraction issues or mistakes in the underlying transcript. Senzing is an Entity Resolution platform with one of the best existing resolution processes for many common people and organizations. Given that many of the organizations and people mentioned in Congressional transcripts are famous and public, Senzing is a great way to resolve and add context to many of the figures mentioned.

What Senzing improved:

*1. People Matching*
• Found duplicates like Marcus Allen (FBI Whistleblower + Witness) and merged them
• Now one profile per person with all their roles
• Improved accuracy of person tracking across hearings

*2. Location Cleanup*
• 17 hearing records → 1 verified location (Rayburn House Office Building)
• Standardized address format
• Reduced duplicate location entries

*3. Organization Links*
Connected people to organizations automatically:
• Tristan Leavitt → Empower Oversight
• Michael Horowitz → DOJ
• Marcus Allen → FBI

*4. Role Clarity*
Example: Jim Jordan
• Role: Chairman, House Judiciary Committee
• Committee: Weaponization Subcommittee
• All roles consolidated in a single profile
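To make the "one profile per person with all their roles" outcome concrete, here is a deliberately naive stand-in. It is not Senzing and does not use the Senzing API: it merges mentions only when their normalized names match exactly, which real entity resolution goes well beyond. The `merge_people` helper and the differently spelled second mention are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, List


def merge_people(mentions: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group mentions by a normalized name and collect the distinct roles seen for each."""
    profiles: Dict[str, List[str]] = defaultdict(list)
    for mention in mentions:
        # Normalize case and whitespace so trivially different spellings collapse.
        key = " ".join(mention["name"].lower().split())
        if mention["role"] not in profiles[key]:
            profiles[key].append(mention["role"])
    return dict(profiles)


mentions = [
    {"name": "Marcus Allen", "role": "FBI Whistleblower"},
    {"name": "marcus  allen", "role": "Witness"},
    {"name": "Jim Jordan", "role": "Chairman, House Judiciary Committee"},
]
profiles = merge_people(mentions)
```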

Thirdly, we map this JSON schema into the WhyHow schema, and then import all the information into the WhyHow.AI KG Studio.

Nodes:
1. Hearing Node
Structure: {
name: str, # "Budget Hearing - Department of Defense"
label: "Hearing",
properties: {
intel_id: str,
event_at: datetime,
source_updated_at: datetime,
source: str,
type: str,
meeting_status: str,
room: str,
building: str,
chamber: str,
congress: str
},
chunk_ids: List[str] # Links to hearing content chunks
}

2. Person Node
Structure: {
name: str, # "John Smith"
label: "Person",
properties: {
role: str # "Senator" | "Representative" | "Witness"
},
chunk_ids: List[str] # Links to person-related chunks
}

3. Topic Node
Structure: {
name: str, # "Healthcare Reform"
label: "Topic",
properties: {
sentiment: str # "positive" | "negative" | "neutral"
},
chunk_ids: List[str] # Links to topic discussion chunks
}

4. Organization Node
Structure: {
name: str, # "Department of Defense"
label: "Organization",
properties: {
role: str # "Government Agency" | "Private Sector"
},
chunk_ids: List[str] # Links to organization-related chunks
}

5. Committee Node
Structure: {
name: str, # "Armed Services Committee"
label: "Committee",
properties: {
system_code: str
},
chunk_ids: List[str] # Links to committee-related chunks
}

6. Document Node
Structure: {
name: str, # "FY2024 Budget Proposal"
label: "Document",
properties: {
type: str, # "Report" | "Testimony" | "Statement"
format: str, # "PDF" | "DOC"
url: str
},
chunk_ids: List[str] # Links to document content chunks
}

7. Legislation Node
Structure: {
name: str, # "H.R. 1234"
label: "Legislation",
properties: {
status: str # "Proposed" | "Passed" | "Failed"
},
chunk_ids: List[str] # Links to legislation-related chunks
}
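The mapping step from the per-transcript JSON to nodes, triples, and chunks can be sketched roughly as follows. The input shape mirrors the partial JSON shown earlier; the output shape, the `map_insight` function, and the `triple_index` field are assumptions for illustration and are not the WhyHow import format.

```python
from typing import Any, Dict, List


def map_insight(record: Dict[str, Any], hearing_name: str) -> Dict[str, List[Dict[str, Any]]]:
    """Turn one per-transcript record into nodes, triples, and chunks (hypothetical shape)."""
    data = record["insights"]["data"]
    nodes: List[Dict[str, Any]] = [
        {"name": hearing_name, "label": "Hearing",
         "properties": {"intel_id": str(record["intel_id"])}}
    ]
    triples: List[Dict[str, Any]] = []
    chunks: List[Dict[str, Any]] = []
    for person in data.get("key_people", []):
        nodes.append({"name": person["name"], "label": "Person",
                      "properties": {"role": person["role"]}})
        triples.append({"head": person["name"], "relation": "Participated in",
                        "tail": hearing_name})
        # The person's context becomes a chunk tied to the triple just created.
        chunks.append({"triple_index": len(triples) - 1, "text": person["context"]})
    return {"nodes": nodes, "triples": triples, "chunks": chunks}


record = {
    "intel_id": 2070,
    "insights": {"data": {"key_people": [
        {"name": "James Comer", "role": "Chairman of the Committee",
         "context": "Led the hearing, focusing on the need for oversight of litigation funding practices."},
        {"name": "Maya Steinitz", "role": "Professor of Law, Boston University School of Law",
         "context": "Testified on the implications of litigation funding and the need for regulation."},
    ]}},
}
graph = map_insight(
    record, "Unsuitable Litigation: Oversight of Third-Party Litigation Funding"
)
```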

Lastly, we run a custom prompt that contextualizes the triples retrieved from the Knowledge Graph.

With this architecture in place, one really interesting thing is that we can continue to add Committee Meeting Key Points, Summaries, and other details to the Knowledge Graph easily, since it is just a matter of adding Chunks to the existing triples. If a Committee Meeting has new Key Points, additional Chunks are added to the appropriate triple.

This process took 20 dev hours, which can be broken down into the following:

  • 2 hours (10%) was spent looking at and understanding the data (Exploratory Data Analysis)
  • 13 hours (65%) was spent iterating on the schema, and figuring out what nodes should be in the graph, what nodes should be connected to what, what chunks should exist, how it should connect to the various triples, testing the retrieved answers with a set of questions, and iterating accordingly.
  • 2 hours (10%) was spent writing the code to create the set of triples to be loaded in
  • 3 hours (15%) was spent writing validation checks and output checks to catch any errors
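As an illustration of the kind of validation check mentioned in the last item, the sketch below verifies that every triple points at known nodes and known chunks before loading. The checks actually used are not described in this article, so the `validate_graph` function and its rules are assumptions.

```python
from typing import Any, Dict, List, Set


def validate_graph(
    node_names: Set[str], triples: List[Dict[str, Any]], chunk_ids: Set[str]
) -> List[str]:
    """Return a list of human-readable errors; an empty list means the graph passed."""
    errors: List[str] = []
    for i, triple in enumerate(triples):
        for end in ("head", "tail"):
            if triple[end] not in node_names:
                errors.append(f"triple {i}: unknown {end} node {triple[end]!r}")
        for cid in triple.get("chunk_ids", []):
            if cid not in chunk_ids:
                errors.append(f"triple {i}: missing chunk {cid!r}")
    return errors


node_names = {"Jim Jordan", "Hearing on the Weaponization of the Federal Government"}
triples = [
    {"head": "Jim Jordan", "relation": "Participated in",
     "tail": "Hearing on the Weaponization of the Federal Government", "chunk_ids": ["c1"]},
    {"head": "Marcus Allen", "relation": "Affiliated with", "tail": "FBI", "chunk_ids": ["c9"]},
]
errors = validate_graph(node_names, triples, chunk_ids={"c1"})
```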

Question & Answer against Temporal Congressional Knowledge Graphs

Prompt used for answer construction after retrieving the relevant context from the Knowledge Graph

import datetime

context = f"""
You are an AI assistant specializing in U.S. Congressional hearings,
legislative matters, and government affairs. Use the following information
to answer the user's question. The information is derived from a knowledge
graph of recent congressional hearings and related documents.

Relevant Nodes (these represent entities in the knowledge graph):
{node_context}

Relevant Relationships (these are full triples representing connections
between entities in the knowledge graph):
{triple_context}

Relevant Document Chunks (these are excerpts from related documents):
{chunk_context}

Answer the question using the following steps:
1. Analyze the relevant information, focusing on the specific topic of the
question.
2. Identify key points related to the question, including relevant people,
events, legislation, or policies.
3. Formulate a clear and concise answer that directly addresses the
question.
4. If the question asks for a comparison, ensure you address all parties
or aspects mentioned.
5. If there are any uncertainties or gaps in the information, mention them
explicitly.
6. If you don't have enough information to answer the question confidently,
say so. Be strict about this.
7. Include references to the source documents in your answer using
[Source: ID] format.
8. If the information is from a future date, treat it as speculative or
planned rather than confirmed.
9. At the end of your answer, list all the sources you referenced under
a "Sources:" heading.
10. It's currently {datetime.datetime.now().strftime("%Y-%m-%d")}.
Consider this when interpreting dates in the information.
11. List the sources, with titles, in the order they were referenced.

Question: {question}

Available Sources:
{sources_list}
"""
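The placeholders in the prompt have to be filled with plain text built from the retrieved graph objects. A minimal sketch of that step is shown below, using a shortened stand-in template; the `build_contexts` helper and the exact line formats are assumptions, since the article does not show how the strings are produced.

```python
from typing import Dict, List, Tuple


def build_contexts(
    triples: List[Tuple[str, str, str]], sources: Dict[str, str]
) -> Dict[str, str]:
    """Render retrieved triples and sources as the text blocks the prompt expects."""
    triple_context = "\n".join(f"{h} -> {r} -> {t}" for h, r, t in triples)
    sources_list = "\n".join(f"{sid}: {title}" for sid, title in sources.items())
    return {"triple_context": triple_context, "sources_list": sources_list}


# Shortened stand-in for the full prompt, keeping only two of its placeholders.
template = (
    "Relevant Relationships:\n{triple_context}\n\n"
    "Available Sources:\n{sources_list}"
)
retrieved = [
    ("Jim Jordan", "Participated in",
     "Hearing on the Weaponization of the Federal Government"),
]
sources = {"2021": "Hearing on the Weaponization of the Federal Government"}
prompt = template.format(**build_contexts(retrieved, sources))
```

Keeping the source IDs next to their titles is what lets the model cite them in the `[Source: ID]` format the prompt asks for.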

Analysis of Congressperson Historical Involvement

Question: Tell me everything congressman James Comer has been involved in in the past 2 months

Answer:

In the past two months, Congressman James Comer has been involved in the 
following activities:

1. **Hearing Participation**: James Comer participated in the congressional
hearing titled "Unsuitable Litigation: Oversight of Third-Party Litigation Funding,"
which took place on September 13, 2023. As the Chairman of the Committee on
Oversight and Accountability, he led the hearing, focusing on the need for
oversight of litigation funding practices. The hearing examined the growing
influence of third-party litigation funding in the legal system, discussing
both its potential benefits and risks [Source: 2070:"Unsuitable Litigation:
Oversight of Third-Party Litigation Funding"].

2. **Statements and Articles**: James Comer has been involved in authoring
several statements and articles related to litigation funding and its
implications:
- He provided a statement for the record for the American Property Casualty
Insurance Association (APCIA) [Source: 2070:UC - Comer - Statement for the
Record - American Property Casualty Insurance Association (APCIA)].
- He authored an article titled "A 'Boogeyman' National Security Threat in
Litigation Funding" published in Law 360, discussing national security
concerns related to litigation funding [Source: 2070:UC - Comer - Article - "
A 'Boogeyman' National Security Threat in Litigation Funding" - Law 360].
- He wrote an article for The Hill titled "Don't Fear Foreign Investment in
Lawsuits," addressing concerns about foreign investment in litigation
[Source: 2070:UC - Comer - Article - "Don't Fear Foreign Investment in
Lawsuits" - The Hill].
- He contributed to an article in US Law Week titled "Breaching a Litigation
Funding Agreement - the Sysco/Burford Story," which discusses issues related
to litigation funding agreements [Source: 2070:UC - Comer - Article -
"Breaching a Litigation Funding Agreement - the Sysco/Burford Story" - US Law
Week].
- He authored an article for Bloomberg Law titled "Litigation Finance Doesn't
Pose a Security Risk: Legal Insight," providing insights into the security
implications of litigation finance [Source: 2070:UC - Comer - Article -
"Litigation Finance Doesn't Pose a Security Risk: Legal Insight" - Bloomberg
Law].

3. **Letters and Communications**: James Comer has been involved in
correspondence related to oversight and accountability:
- He sent a letter to Grassley and Issa dated April 12, 2023, from the
State of Alabama Office of the Attorney General [Source: 2070:UC - Comer -
Letter to Grassley and Issa dated April 12, 2023 from State of AL OAG].
- He sent a letter to McCarthy and Issa dated March 1, 2023, from Underwood
Ranches, LP [Source: 2070:UC - Comer - Letter to McCarthy and Issa dated
March 1, 2023 from Underwood Ranches, LP].
These activities highlight Congressman Comer's focus on litigation funding
and its implications for national security, foreign investment, and legal
agreements.

**Sources:**
- 2070:"Unsuitable Litigation: Oversight of Third-Party Litigation Funding"
- 2070:UC - Comer - Statement for the Record - American Property Casualty
Insurance Association (APCIA)
- 2070:UC - Comer - Article - "A 'Boogeyman' National Security Threat in
Litigation Funding" - Law 360
- 2070:UC - Comer - Article - "Don't Fear Foreign Investment in Lawsuits" -
The Hill
- 2070:UC - Comer - Article - "Breaching a Litigation Funding Agreement -
the Sysco/Burford Story" - US Law Week
- 2070:UC - Comer - Article - "Litigation Finance Doesn't Pose a Security
Risk: Legal Insight" - Bloomberg Law
- 2070:UC - Comer - Letter to Grassley and Issa dated April 12, 2023 from
State of AL OAG
- 2070:UC - Comer - Letter to McCarthy and Issa dated March 1, 2023 from
Underwood Ranches, LP

Analysis of topics and government

Question: What are some upcoming changes to Medicare?

Answer:

Some upcoming changes to Medicare, as discussed in the context of the 
postponed congressional hearing titled "Meddling with Medicare: The
Biden-Harris Cover-Up," involve significant impacts stemming from the Inflation
Reduction Act (IRA). The IRA was intended to reduce healthcare costs but has
reportedly led to increased premiums for Medicare beneficiaries. Additionally,
there are significant changes affecting Medicare Part D, which are negatively
impacting seniors by altering premiums and plan availability [Source: Unknown
ID:Medicare Part D, Unknown ID:Inflation Reduction Act (IRA)].

These changes were to be addressed in the hearing, which was postponed, and
were expected to include testimonies from experts like Douglas Holtz-Eakin
from the American Action Forum and Joel White from the Council for Affordable
Health Coverage, who were to discuss the economic impacts and affordability
issues related to Medicare [Source: 2045:"Meddling with Medicare: The
Biden-Harris Cover-Up"].

Sources:
- Unknown ID:Medicare Part D
- Unknown ID:Inflation Reduction Act (IRA)
- 2045:"Meddling with Medicare: The Biden-Harris Cover-Up"

Comparative Analytics of different Congresspeople

Question: Compare the work of hearings that Jim Jordan and Dr. Carolyn Clancy have been involved in?

Answer:

Jim Jordan and Dr. Carolyn Clancy have been involved in different 
congressional hearings, each focusing on distinct issues.

**Jim Jordan**:
- Jim Jordan, as the Chairman of the House Judiciary Committee, participated
in the hearing titled "Hearing on the Weaponization of the Federal Government,"
which was scheduled for September 25, 2024. This hearing focused on the
alleged misuse of the FBI's security clearance process as a tool for
retaliation against whistleblowers. The hearing emphasized accountability for
the federal government and discussed legislative reforms for whistleblower
protection [Source: 2021].

**Dr. Carolyn Clancy**:
- Dr. Carolyn Clancy, serving as the Assistant Under Secretary for Health at
the VA, participated in the hearing titled "Emerging Therapies: Breakthroughs
in the Battle Against Suicide?" held on November 14, 2023. This hearing
focused on the potential of psychedelic-assisted therapies, such as MDMA and
psilocybin, in treating PTSD and reducing suicide rates among veterans.
The hearing highlighted the need for further research and training for
providers to facilitate broader access to these therapies [Source: 2061].

In summary, Jim Jordan's work in the hearings has been centered around
government accountability and whistleblower protection, while Dr. Carolyn
Clancy's involvement has been in exploring innovative therapies for mental
health issues among veterans. Each has contributed to discussions on
significant but different aspects of public policy and health.

**Sources:**
- 2021: Hearing on the Weaponization of the Federal Government
- 2061: Emerging Therapies: Breakthroughs in the Battle Against Suicide?

Why is this superior to Vector-Only RAG?

Superior analytics across multiple people and events:

  • To answer the question comparing Jim Jordan and Dr. Carolyn Clancy, we need an exhaustive list of the work each has been involved in, so that the two lists can be compared. Since Vector-Only Search is about identifying relevant chunks, it does not have a definitive list of the hearings that each person has been involved with. This means that an intermediary aggregation of data is required that can definitively list the hearings for each person.

An ability to perform and bring in multi-Document context:

  • In the question about Congressman James Comer’s involvement over the past 2 months, we can see that the involvements span multiple source documents (a hearing, statements, articles, and letters). This means that an intermediary aggregation of data is required that maps a Person to their involvement over time across multiple hearings and documents.
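The intermediary aggregation described in both points can be sketched as an index from each Person to the Hearings they participated in, built directly from the triples. The `hearings_by_person` helper is an assumption for illustration; the two hearings are the ones cited in the comparison answer above.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def hearings_by_person(triples: List[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Build a definitive Person -> Hearings index from 'Participated in' triples."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for head, relation, tail in triples:
        if relation == "Participated in":
            index[head].add(tail)
    return dict(index)


triples = [
    ("Jim Jordan", "Participated in",
     "Hearing on the Weaponization of the Federal Government"),
    ("Dr. Carolyn Clancy", "Participated in",
     "Emerging Therapies: Breakthroughs in the Battle Against Suicide?"),
    ("Jim Jordan", "Affiliated with", "House Judiciary Committee"),
]
index = hearings_by_person(triples)
# Hearings both people took part in; empty here, matching the answer above.
shared = index["Jim Jordan"] & index["Dr. Carolyn Clancy"]
```

A chunk-only vector search has no equivalent of `index`: it can return passages that mention a person, but it cannot say whether the list of their hearings is complete.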

WhyHow.AI provides tools, services and processes for Structured Knowledge, Knowledge Graphs and more reliable Agentic RAG solutions. If you are interested in exploring any of our tools (KG Studio, Knowledge Table [Open Source]) and services, feel free to chat with us here.

If you’re thinking about, in the process of, or have already incorporated knowledge graphs in RAG for accuracy, memory and determinism, follow our newsletter at WhyHow.AI or join our discussions about rules, determinism and knowledge graphs in RAG on our Discord.
