Complex-Pattern-System Extraction for Transparent Linguistic Intelligence

Extracting entire systems of complex patterns is a novel approach to comprehending the intricate frameworks that define sophisticated structures.

The process involves using algorithms with multi-step look-ahead capabilities to extend an initial set of patterns by identifying additional prominent patterns.

1. Introduction

Extracting entire systems of complex patterns is a novel approach to comprehending the intricate frameworks that define sophisticated structures. By leveraging algorithms with multi-step look-ahead capabilities, we can extend an initial set of patterns by identifying additional prominent patterns. This method, analogous to a top-down parser applying rewriting rules, allows us to explore multiple potential pathways and combinations, increasing the likelihood of uncovering the most relevant and accurate patterns within complex systems.

2. Definitions

  1. SEQUENCE: An ordered set of Nodes in a Directed Graph, connected at each step by a Directed Link.
  2. DOMINANT: The last element of a SEQUENCE.
  3. GOALS of a SEQUENCE: All Nodes directly linked from its DOMINANT.

Rules

  • WHEN an existing SEQUENCE (licensed by a PATTERN) is found in the Current Graph,
  • THEN a new SEQUENCE is added, licensed by the PATTERN_NAME corresponding to the found PATTERN, which adds an alternative PATH from the same Starting Node to each GOAL of the matched SEQUENCE (a sketch of this rule follows below).
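
The sketch below shows one possible reading of this rule in Python. It treats position Nodes as plain integers and lets each Token label a Directed Link between two Nodes, so a PATTERN is matched as a path of link labels. The function name apply_rule, the triple-based edge representation, and the optional at_node argument (used again in Step 3) are illustrative assumptions rather than part of the specification.

from collections import defaultdict

def apply_rule(edges, pattern, pattern_name, at_node=None):
    # edges: a set of (source_node, target_node, label) triples.
    # pattern: a tuple of link labels forming the PATTERN, e.g. ("b", "c").
    # pattern_name: the label licensing the new alternative PATH, e.g. "dd".
    # at_node: optionally restrict matching to one Starting Node (used in Step 3).
    links_from = defaultdict(list)
    for src, dst, label in edges:
        links_from[src].append((dst, label))

    starts = [at_node] if at_node is not None else list(links_from)
    new_links = set()
    for start in starts:                      # candidate Starting Nodes
        frontier = {start}
        for label in pattern:                 # follow a SEQUENCE spelling the PATTERN
            frontier = {dst for node in frontier
                        for dst, lab in links_from[node] if lab == label}
        for goal in frontier:                 # the Node reached after the DOMINANT plays the role of the GOAL
            # add an alternative PATH from the Starting Node to the GOAL, licensed by
            # PATTERN_NAME (a multi-token replacement is kept as one labelled link here)
            new_links.add((start, goal, pattern_name))
    return edges | new_links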

3. Process Overview

Step 1: Parse the rewriting rules given in source form.
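
The source form of the rules is not fixed here; assuming the simple "PATTERN -> PATTERN_NAME" notation used in Section 4, one rule per line, Step 1 might be sketched as:

def parse_rules(source: str):
    # Parse lines of the form "bc -> dd" into (pattern, pattern_name) pairs,
    # treating each character of the left-hand side as one Token.
    rules = []
    for line in source.splitlines():
        line = line.strip()
        if not line or "->" not in line:
            continue
        lhs, rhs = (part.strip() for part in line.split("->", 1))
        rules.append((tuple(lhs), rhs))
    return rules

rules = parse_rules("a -> b\nbc -> dd\nd -> ef")
# [(('a',), 'b'), (('b', 'c'), 'dd'), (('d',), 'ef')]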

Step 2: Parse the TOKENIZED input_text string and turn it into a linear SEQUENCE of Nodes in a Directed Graph. In the simplest case, each character is a Token. Before the PatternMatchingAgent processes the Graph, each Node represents one character Token of the input, including the added <start> and <end> Tokens. This is the initial state of the Current Graph.

Example: For the string “abc”, the initial Graph is:

<start> => “a” => “b” => “c” => <end>
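
Under the same position-node reading used in the apply_rule sketch above (the text here describes one Node per Token, so this lattice layout is an interpretation), Step 2 might look like:

def build_initial_graph(input_text: str):
    # Model the tokenized input as a linear lattice: integer position Nodes,
    # one labelled Directed Link per Token, with <start> and <end> added.
    tokens = ["<start>"] + list(input_text) + ["<end>"]
    return {(i, i + 1, token) for i, token in enumerate(tokens)}

print(sorted(build_initial_graph("abc")))
# [(0, 1, '<start>'), (1, 2, 'a'), (2, 3, 'b'), (3, 4, 'c'), (4, 5, '<end>')]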

Step 3: Iterate through the following process many times:

  • Sample a Node from the Current Graph and a Rule from Rewriting Rules.
  • Attempt to apply the sampled Rule at the sampled Node in the Current Graph.
  • Whenever a PATTERN is found as a path connecting some Node x to another Node y, an additional path is added connecting Node x to Node y, as specified by the PATTERN_NAME (the replacement string given in the rule); a sketch of this loop follows below.
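
A minimal sketch of this loop, reusing the hypothetical parse_rules, build_initial_graph, and apply_rule helpers from the earlier sketches; the iteration count and the uniform sampling are arbitrary choices made for illustration only.

import random

def expand_graph(edges, rules, iterations=1000):
    # Repeatedly sample a Node and a Rule, and attempt to apply the Rule there.
    edges = set(edges)
    for _ in range(iterations):
        node = random.choice(sorted({src for src, _, _ in edges}))
        pattern, pattern_name = random.choice(rules)
        edges = apply_rule(edges, pattern, pattern_name, at_node=node)
    return edges

current_graph = expand_graph(build_initial_graph("abc"),
                             parse_rules("a -> b\nbc -> dd\nd -> ef"))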

Step 4: Introduce new hypothetical rules based on sequences of tokens encountered frequently:

  • Monitor sequences of tokens that occur disproportionately often in the Current Graph.
  • Formulate new hypothetical rules from these frequent sequences, hypothesizing potential new patterns.
  • Validate these new hypothetical rules by attempting to apply them in similar contexts elsewhere in the Current Graph (a sketch of this step follows below).
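
One simple way to realize this step is to count contiguous token sequences and promote the unusually frequent ones to candidate rules. In the sketch below, the helper name propose_rules, the n-gram lengths, the threshold, and the invented PAT_ names are all assumptions.

from collections import Counter

def propose_rules(tokens, max_len=3, min_count=5):
    # Count contiguous token sequences and turn the disproportionately frequent
    # ones into hypothetical rules: the sequence becomes the PATTERN and a fresh
    # PATTERN_NAME is invented for it.
    counts = Counter(tuple(tokens[i:i + n])
                     for n in range(2, max_len + 1)
                     for i in range(len(tokens) - n + 1))
    hypothetical = []
    for seq, count in counts.most_common():
        if count < min_count:
            break
        hypothetical.append((seq, "PAT_" + "".join(seq)))   # invented name
    return hypothetical

Each proposed rule is only a hypothesis at this point; the validation in the last bullet decides whether it enters the Grammar.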

Step 5: Bootstrap the extension of the Grammar from a very large text:

  • Continuously parse and analyze large corpora of text to discover recurring sequences and patterns.
  • Dynamically update the set of rewriting rules with new patterns identified through the analysis of extensive text data.
  • Employ machine learning techniques to refine and validate the hypothesized rules, ensuring that only effective and relevant rules are incorporated into the Grammar (a sketch of this bootstrapping loop follows below).
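
A rough sketch of this bootstrapping loop, assuming the corpus arrives as an iterable of texts and that frequency counts drive rule promotion; the machine-learning validation mentioned above is left as a placeholder, since no specific technique is prescribed here.

from collections import Counter

def bootstrap_grammar(corpus, rules, max_len=3, min_count=100):
    # corpus: any iterable of input_text strings (e.g. lines of a very large file).
    # rules: the current list of (pattern, pattern_name) rewriting rules, extended in place.
    counts = Counter()
    known = {pattern for pattern, _ in rules}
    for input_text in corpus:
        tokens = ["<start>"] + list(input_text) + ["<end>"]
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seq = tuple(tokens[i:i + n])
                counts[seq] += 1
                if counts[seq] >= min_count and seq not in known:
                    # placeholder for validation (trial application, held-out scoring,
                    # or another learned filter) before the rule is accepted
                    rules.append((seq, "PAT_" + "".join(seq)))
                    known.add(seq)
    return rules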

Step 6: Return the resulting state of the Current Graph with all the added alternative paths and the newly introduced hypothetical rules.

4. Example of Plain Parsing

Starting with the text “abbcddef” and three Rules:

  1. a -> b
  2. bc -> dd
  3. d -> ef

Initial State of the Current Graph:

<start> => “a” => “b” => “b” => “c” => “d” => “d” => “e” => “f” => <end>

Explanation:

We start with the initial state of the graph representing the tokenized input text. We iteratively apply the given rules to find matching patterns within the graph.

When a pattern is found, new paths and nodes are added to the graph based on the rule.

This process continues, dynamically expanding the graph and incorporating new paths as rules are applied. Throughout this process, we encounter and count novel patterns, identifying the more commonly occurring ones.
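
Putting the earlier sketches together on this example (again assuming the hypothetical parse_rules, build_initial_graph, and apply_rule helpers from Section 3 are in scope), one exhaustive pass over the three Rules looks roughly like this:

rules = parse_rules("a -> b\nbc -> dd\nd -> ef")
graph = build_initial_graph("abbcddef")

for pattern, pattern_name in rules:        # apply each Rule everywhere, once
    graph = apply_rule(graph, pattern, pattern_name)

for src, dst, label in sorted(graph - build_initial_graph("abbcddef")):
    print(f"alternative path {src} -> {dst} licensed by {label!r}")

# alternative path 1 -> 2 licensed by 'b'
# alternative path 3 -> 5 licensed by 'dd'
# alternative path 5 -> 6 licensed by 'ef'
# alternative path 6 -> 7 licensed by 'ef'

Each printed triple is one of the alternative paths described above, added in parallel with the original token links; under this lattice reading, for example, the segment spanning “bc” now also carries a “dd” path.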

5. Concept of Systems of Complex Patterns

The newly formulated and validated sets of hypothetical rewriting rules that work together to implement segments of a grammar are referred to as “systems of complex patterns.” These systems are comprehensive, interconnected rule sets that can effectively parse and generate text based on observed linguistic structures. Explaining the idea means showing how such systems are developed, validated, and applied to improve the understanding and processing of language.

6. Applications and Future Prospects

The structured information from pattern-processed graphs may help Intelligent Linguistic Agents handle complex linguistic constructs with greater precision, identifying and utilizing intricate patterns that would otherwise be missed.

Another significant application of this method is training large language models (LLMs) on pattern-processed graphs instead of raw text. This approach could allow LLMs to understand complex language structures more deeply. It is possible that LLMs trained on graphs constructed by complex-pattern-system extraction can achieve a more nuanced and structured comprehension of the text.

7. Conclusion

By adding systems of complex patterns to the process, the new approach can continuously improve and refine linguistic grammars. This enables a deeper and more structured understanding of complex language structures.
