Fast Earley Parsing
This work was presented at the 2023 Mid-Atlantic Student Colloquium on Speech, Language and Learning. Poster linked here. Repository linked here.
Earley parsing is a parsing algorithm named for its inventor, Jay Earley, who first introduced it in 1968. It uses dynamic programming to parse strings according to a given context-free grammar. One reason for its notability is the relatively loose requirements it places on the grammar, compared to more common peers such as the CKY algorithm, which requires the grammar to be in Chomsky normal form. However, this lack of restriction results in a longer runtime for Earley parsing, which scales not only with the length of the input but also with the size of the grammar and the length of the grammar rules.
In this blog post, we explore one way to parallelize Earley parsing.
Before we get started, it will be helpful to familiarize yourself with grammars and with Earley parsing. As a quick recap and to establish shared vocabulary, I will give a brief summary of the standard serial Earley parsing steps. If you don’t need this review, feel free to skip straight to the ‘Fast Earley Parsing’ section.
Recall that Earley is a left-to-right dynamic programming parsing algorithm that maintains parsing states and corresponding elements for some input string w.
An Earley parsing state keeps track of the state of the parse: it describes a potentially partially completed grammar rule currently under consideration, as well as the span of the input string w to which we are considering applying that rule. As in other dynamic programming algorithms, a chain of Earley parsing states describes a path to the final parsing return value. I adopt notation from Joshua Goodman’s 1998 “Semiring Parsing” paper to describe these items:
[i, A → α⋅β, j]
- i: start index in the input string
- j: current end position in the input string
- A → αβ: the rule under consideration; the dot ⋅ marks how much of the rule’s right-hand side we have determined applies to the substring from the start index up to the current end position

Summarized: A ⇒ αβ ⇒* wᵢwᵢ₊₁…wⱼ₋₁β; that is, α derives the input substring wᵢ…wⱼ₋₁.
Given an input string w and a grammar G, the set of all possible Earley items consists of all combinations of start index, end index, grammar rule, and dot position. There are O(N²·|G|·L) such items, where N is the length of the input string, |G| is the number of rules in the grammar, and L is the maximum length of a grammar rule.
The initial Earley item is [0, ROOT → ⋅S, 0], the start of the root rule. The goal Earley item is [0, ROOT → S⋅, N], the completion of the root rule over the entire input.
Earley parsing elements are values paired with Earley items. Based on the semiring of interest, elements have different types. For example, the standard recognition parsing Earley element is a boolean, the inside parsing element is a float value, and the Viterbi-derivation parsing element is a tuple of <the highest-probability parse tree, probability of that parse tree>. In this blog post, we will focus on the inside semiring, so our elements are floats.
The initial Earley item is always paired with the semiring unity value element (1): ( [0, ROOT → ⋅S, 0], 1 ). Every other Earley item is initialized to the semiring additive identity (null-value) element (0): ( [i, R, j], 0 ). The element paired with the goal Earley item at the completion of Earley parsing is the returned value of the parse: ( [0, ROOT → S⋅, N], return ).
We will use a chart to associate items with their paired element values and to update those values as parsing proceeds.
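As a minimal illustration, here is a hypothetical Python sketch of such a chart for the inside semiring (the tuple encoding of items is my own, not from the post’s repository):

from collections import defaultdict

# Inside semiring: additive identity 0.0, unity 1.0.
# An item [i, rule with dot at dot_idx, j] is encoded as an (i, rule, dot_idx, j) tuple.
chart = defaultdict(float)            # absent items default to the additive identity, 0.0
ROOT_RULE = ("ROOT", ("S",))          # hypothetical encoding of ROOT → S
chart[(0, ROOT_RULE, 0, 0)] = 1.0     # initial item [0, ROOT → ⋅S, 0] paired with unity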
We read the input string from left to right in Earley parsing. At every input token index j, we process the Earley items with end index j:

[k, A → α⋅xβ, j], paired with element e₁.
For each item, we perform one of three possible operations:
- Predict — if the next token after the dot is a nonterminal, initialize the values of all new parsing states that expand that nonterminal: we “predict” that another parsing item will be part of the path to the final returned value. Each new parsing item is initialized with the value of the predicted rule (the semiring unity, TRUE, in recognition parsing; the rule’s probability in the inside semiring).
- Scan — if the next token after the dot is a terminal matching the next terminal in the input string, advance ⋅ one token further in the rule. Add the resulting element value to the element currently associated with the new Earley item, which has the dot one token further along in the rule and an end index one position further along in the input.
- Complete — if the item is complete (the dot is at the end of the rule), update the elements of all items that could have predicted this item.
In pseudocode, the serial algorithm looks roughly like the following sketch (agenda bookkeeping, which ensures each item is processed exactly once, is omitted):
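EARLEY(Input, Grammar):
    # chart maps items [k, rule-with-dot, j] to semiring elements
    chart = all items paired with semiring_zero
    chart[0, ROOT -> .S, 0] = semiring_one
    for j in range(len(Input) + 1):
        for each item [k, A -> alpha . x beta, j] with element e1:
            if x is a nonterminal:                                        # PREDICT
                for each rule x -> gamma in Grammar:
                    chart[j, x -> .gamma, j] = value(x -> gamma)
            else if x is a terminal and j < len(Input) and x == Input[j]:  # SCAN
                chart[k, A -> alpha x . beta, j + 1] plus= e1
            if the dot is at the end (the item is [k, A -> alpha ., j]):   # COMPLETE
                for each item [i, B -> gamma . A eta, k] with element e2:
                    chart[i, B -> gamma A . eta, j] plus= e2 times e1
    return chart[0, ROOT -> S., len(Input)]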
At the end, we return the element in the chart associated with the goal item.
Observe the times and plus operations — these are the multiplication and addition operators particular to the type of parsing at hand. Goodman writes in depth about this concept of semiring parsing. For now, simply note that for recognition parsing, which uses the boolean semiring, times is AND and plus is OR. When parsing for the inside value, times is multiplication and plus is addition. Likewise, when parsing in log space for the logarithm of the inside value, times is addition and plus is logsumexp.
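As a concrete sketch, each of these semirings can be bundled as a (zero, one, plus, times) tuple in Python (a hypothetical encoding of my own):

import math

# (additive identity, unity, plus, times)
BOOLEAN = (False, True, lambda a, b: a or b, lambda a, b: a and b)
INSIDE = (0.0, 1.0, lambda a, b: a + b, lambda a, b: a * b)

def logsumexp2(a, b):
    # numerically stable log(exp(a) + exp(b))
    if a == -math.inf: return b
    if b == -math.inf: return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

LOG_INSIDE = (-math.inf, 0.0, logsumexp2, lambda a, b: a + b)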
Fast Earley Parsing
Now that we are on the same page about Earley parsing, let’s get to the meat of this blog post: how to parallelize Earley parsing.
Ordering Constraints
Note that there is one simple but strict ordering constraint to make Earley parsing run correctly for all parsing objectives: all of an item’s children must be processed before the item itself is processed.
Consider the Completion step. The element e₃ associated with the parent item [i, B → γA⋅η, j] is updated based on the element e₁ of the child item [k, A → α⋅, j]. However, both the parent item and the child item have the same end index j. If we process the parent item before we process the child item, the parent element we process with will be its value prior to being updated by the child element, and the child’s contribution is lost. By enforcing that all of an item’s children are processed before the item itself, we avoid this issue.
Section 5 of Goodman’s paper, “Semiring Parser Execution,” suggests one way to do so: compute a dependency graph over all Earley items for the input string, topologically sort that graph, then process each item according to the Earley semantics in the order of the topological sort. The element associated with the goal item in the resulting chart is the final parse value.
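A sketch of this approach, assuming hypothetical helpers all_items, children (the items whose elements an item’s update reads), goal_item, and process (which applies the corresponding predict/scan/complete update to the chart):

from graphlib import TopologicalSorter

def parse_by_topological_sort(input_string, grammar, chart):
    # Map each item to its set of children; TopologicalSorter yields
    # predecessors (children) before the items that depend on them.
    graph = {item: set(children(item, grammar))
             for item in all_items(input_string, grammar)}
    for item in TopologicalSorter(graph).static_order():
        process(item, chart)  # apply the item's Earley update to the chart
    return chart[goal_item(input_string, grammar)]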
Parallelize Earley
As an alternative approach, we can observe that some Earley items have no dependencies between them and can therefore be processed in parallel; we just need to maintain the ordering constraint described above. Assuming that every rule in the grammar has either at least one terminal symbol or at least two nonterminal symbols in its expansion (i.e. it is not empty or self-looping), this ordering constraint can be captured in the following partial order relation ≤ between items A = [i_A, A → α⋅β, j_A] and B = [i_B, B → γ⋅δ, j_B]:

A ≤ B ⟺ j_A < j_B, or (j_A = j_B and i_A ≥ i_B)

All items with the same partial order value (i_A = i_B and j_A = j_B) do not have any dependencies between them, so they can be processed in any order. To be clear, if A ≤ B (and not B ≤ A), B must be processed after A is processed.
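In code, any schedule consistent with this relation is valid. A simple serial one (with items as hypothetical (start, dotstate, end) tuples) sorts by end index ascending, then start index descending; each group of items tied on both indices could instead be processed in parallel:

processing_order = sorted(items, key=lambda item: (item[2], -item[0]))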
To perform parallelization, we express all computations of Earley parsing as a series of matrix multiplications. To do this, we first encode Earley items with integer identifiers. Instead of a rule with dot notation, we encode each of these “dotstates” (a rule paired with a dot position) as an ID in our “dotstate” vocabulary: with |G| grammar rules of maximum length O(L), this results in O(|G|·L) entries in our dotstate vocabulary V.
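For example, a hypothetical toy grammar with rules ROOT → S, S → S S, and S → a yields a dotstate vocabulary with |V| = 7 entries:

# (rule, dot position) -> dotstate ID
# (ROOT → S, 0) -> 0   i.e. ROOT → ⋅S
# (ROOT → S, 1) -> 1   i.e. ROOT → S⋅
# (S → S S, 0)  -> 2   i.e. S → ⋅S S
# (S → S S, 1)  -> 3   i.e. S → S⋅S
# (S → S S, 2)  -> 4   i.e. S → S S⋅
# (S → a, 0)    -> 5   i.e. S → ⋅a
# (S → a, 1)    -> 6   i.e. S → a⋅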
With this new dotstate vocabulary, we now create special support matrices that encode properties of the grammar relevant to Earley parsing, such as valid scans and completions.
Initialization vector (I ∈ Rⱽ): one vector of size |V| holding the prediction value of every dotstate. This is a sparse vector: only dotstates in which the dot is at the beginning of the rule have a non-null semiring value. In the recognition semiring, this prediction value is TRUE; in the inside semiring, it is the probability of the rule in the defined probabilistic grammar.
Transition matrix (T ∈ {0,1}⁽ᴺ⁺ᵀ⁾ˣⱽˣⱽ, where N+T is the number of nonterminal plus terminal symbols): this sparse multi-hot tensor tells us which symbol advances which dotstate to which other dotstate. It represents both advancing steps: scan, and the latter half of completion. Concretely, T[s, d_a, d_b] = 1 if and only if d_a = A → α⋅sβ and d_b = A → αs⋅β. T.sum(0) gives us all possible dotstate transitions, over all symbols.
Completion matrix (C ∈ {0,1}ⱽˣⱽ): this sparse multi-hot matrix tells us which dotstates can be advanced by the completion of which other dotstates; this can be viewed as the first half of completion. Concretely, C[d_a, d_b] = 1 if and only if d_a = A → α⋅Bβ and d_b = B → γ⋅; that is, if the next symbol in the first-index dotstate is the left-hand nonterminal of the second-index dotstate, and the second-index dotstate is complete (the dot is at the end).
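Continuing the hypothetical toy grammar from above, C is 7×7 with a 1 wherever a dotstate whose dot precedes S meets a complete dotstate whose left-hand side is S:

# rows d_a (dot before S): ROOT → ⋅S, S → ⋅S S, S → S⋅S
# columns d_b (complete, LHS S): S → S S⋅, S → a⋅
# C[ROOT → ⋅S, S → S S⋅] = C[ROOT → ⋅S, S → a⋅] = 1
# C[S → ⋅S S,  S → S S⋅] = C[S → ⋅S S,  S → a⋅] = 1
# C[S → S⋅S,   S → S S⋅] = C[S → S⋅S,   S → a⋅] = 1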
Our Earley chart (Chart ∈ Rᶻˣᶻˣⱽ) keeps track of Earley item element values, indexed by end index, start index, and dotstate. Z is the length of the input plus one: the number of token boundary positions.
With this matrix-based infrastructure in place, we parallelize Earley as follows:
Initialize. Initialize all Earley chart values to the semiring additive identity:

Chart[all] = 0
Predict. Observe that predictions do not depend on any other item’s element value, so we can over-predict: rather than predicting on demand, we perform every prediction, at every position, during initialization:

Chart[diagonal, :] = I
Now, we scan and complete according to the partial order. In increasing order of end index j:

Complete. Consider items in reverse order of start index k:

these_scores = Chart[j, k] # |V|

Identify their “complement” items with end index k. Two items are complements if one is complete and its left-hand nonterminal is the next symbol after the dot in the other:

complement_scores = Chart[k] # Z x |V|

Combine the current scores with the complement scores:

scores_by_comp_index = C @ these_scores # |V|
scores_by_comp_index = scores_by_comp_index[None, :] # 1 x |V|
combined_scores = complement_scores * scores_by_comp_index # Z x |V|

Add the combined scores to the scores of their parent items:

combined_scores_by_parent_index = combined_scores @ T.sum(0) # Z x |V|
Chart[j] += combined_scores_by_parent_index

Scan. For items whose next symbol is the current input symbol input[j], advance the dot and copy scores into the next end index:

new_scan_scores = Chart[j] @ T[input[j]] # Z x |V|
Chart[j+1] += new_scan_scores

Return the score associated with the goal item:

Chart[Z-1, 0, ROOT → S⋅]
Pseudocode for parallelized Earley parsing is as follows:
SUPPORT(Prob_Grammar):
    G = Prob_Grammar.Grammar
    dotstate_vocabulary: Dict[Tuple[Rule, DotPosition], DotStateId] = {}
    for rule in G.rules:
        for dot_idx in range(len(rule) + 1):
            dotstate_vocabulary[(rule, dot_idx)] = size(dotstate_vocabulary)
    num_dotstate_pairs = size(dotstate_vocabulary)
    # Initialization (prediction) vector I
    I = [semiring_zero] of size (num_dotstate_pairs)
    # Completion matrix C
    C = [0] of size (num_dotstate_pairs, num_dotstate_pairs)
    # Transition matrix T
    T = [0] of size (size(G.terminals) + size(G.nonterminals),
                     num_dotstate_pairs,
                     num_dotstate_pairs)
    for rule in G.rules:
        for dot_idx in range(len(rule) + 1):
            dotstate_id = dotstate_vocabulary[(rule, dot_idx)]
            if dot_idx == 0:
                I[dotstate_id] = Prob_Grammar.parameter[rule]
            if dot_idx < len(rule):
                next_symbol = rule[dot_idx]
                next_dotstate_id = dotstate_vocabulary[(rule, dot_idx + 1)]
                T[next_symbol][dotstate_id][next_dotstate_id] = 1
                if next_symbol in G.nonterminals:
                    for child_rule in G.rules with left-hand side next_symbol:
                        child_dotstate_id = dotstate_vocabulary[(child_rule, len(child_rule))]
                        C[dotstate_id, child_dotstate_id] = 1
    return I, T, C, dotstate_vocabulary
EARLEY_PARALLEL(Input):
    # All the possible positions before and after tokens in the input string
    N = len(Input) + 1
    # I, T, C, and dotstate_vocabulary are the support structures built by SUPPORT.
    # Chart is R^(N x N x |V|); entry [end_pos, origin, dotstate] is the score of Earley item [origin, dotstate, end_pos]
    Chart = [semiring_zero] of size (N, N, num_dotstate_pairs)
    # PREDICT. Since all items with the same origin and end_pos are generated by prediction
    # steps, do all predictions in advance.
    Chart[range(N), range(N)] = I
    # Process items according to the partial order. An item is processed either by being COMPLETEd or SCANned.
    for j in range(N):
        for k in reversed(range(j)):
            # COMPLETE. For any complete item over the span [k, j], `times` its score with any
            # complement item with end index k, and `plus` the result into the parent item.
            # R^|V|: scores of the items currently being processed
            these_scores = Chart[j, k]
            # R^|V|: scores of the items currently being processed, re-indexed to the positions of complement items
            scores_by_comp_index = C @ these_scores
            # R^(N x |V|): score of every complement item [origin (free), dotstate, k]
            complement_scores = Chart[k]
            # R^(N x |V|): combined scores of current items and complement items
            combined_scores = scores_by_comp_index[None, :] * complement_scores
            # R^(N x |V|): combined scores, re-indexed to parent item positions
            combined_scores_by_parent_index = combined_scores @ T.sum(0)
            Chart[j] += combined_scores_by_parent_index
        # SCAN. For items whose next symbol is the current input symbol, copy scores
        # into the next end position with the dot advanced (scanned).
        if j < N - 1:
            # R^(N x |V|): current item scores, re-indexed by the advanced item
            new_scan_scores = Chart[j] @ T[Input[j]]
            Chart[j + 1] += new_scan_scores
    # Return the score of the completed root rule over the whole input.
    return Chart[len(Input), 0, dotstate_vocabulary[(G.root_rule, 1)]]
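To make this concrete, here is a minimal runnable NumPy sketch of the same procedure for the inside semiring, using the hypothetical toy grammar from earlier (an illustration of the technique, not the project’s actual implementation):

import numpy as np

# Hypothetical toy PCFG: ROOT → S (1.0), S → S S (0.3), S → a (0.7)
rules = [("ROOT", ("S",), 1.0), ("S", ("S", "S"), 0.3), ("S", ("a",), 0.7)]
nonterminals = {"ROOT", "S"}
symbol_id = {"ROOT": 0, "S": 1, "a": 2}

# SUPPORT: dotstate vocabulary plus the I, T, C structures
vocab = {}  # (rule index, dot position) -> dotstate ID
for r, (lhs, rhs, p) in enumerate(rules):
    for dot in range(len(rhs) + 1):
        vocab[(r, dot)] = len(vocab)
V = len(vocab)

I = np.zeros(V)                       # prediction values (rule probabilities)
T = np.zeros((len(symbol_id), V, V))  # T[s, d_a, d_b]: symbol s advances d_a to d_b
C = np.zeros((V, V))                  # C[d_a, d_b]: complete d_b advances d_a
for r, (lhs, rhs, p) in enumerate(rules):
    I[vocab[(r, 0)]] = p
    for dot, s in enumerate(rhs):
        T[symbol_id[s], vocab[(r, dot)], vocab[(r, dot + 1)]] = 1.0
        if s in nonterminals:
            for r2, (lhs2, rhs2, p2) in enumerate(rules):
                if lhs2 == s:
                    C[vocab[(r, dot)], vocab[(r2, len(rhs2))]] = 1.0

def inside(tokens):
    N = len(tokens) + 1
    chart = np.zeros((N, N, V))                     # [end, origin, dotstate]
    chart[np.arange(N), np.arange(N)] = I           # PREDICT everywhere up front
    T_any = T.sum(0)
    for j in range(N):
        for k in reversed(range(j)):                # COMPLETE
            by_comp = C @ chart[j, k]               # child scores, by awaiting dotstate
            combined = chart[k] * by_comp[None, :]  # pair with complements ending at k
            chart[j] += combined @ T_any            # advance the dot past the child
        if j < N - 1:                               # SCAN
            chart[j + 1] += chart[j] @ T[symbol_id[tokens[j]]]
    return chart[N - 1, 0, vocab[(0, 1)]]           # goal item [0, ROOT → S⋅, N-1]

print(inside(["a", "a"]))  # only derivation: ROOT → S → S S → a a, so 0.3 * 0.7 * 0.7 = 0.147

Note that every update inside the loop is a batched matrix operation over all origins and dotstates at once, which is exactly the kind of work a GPU can parallelize.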
Why do we care?
Why parallelize? Parallelization lends itself to efficient execution on GPUs. One fundamental reason behind the proliferation of the Transformer architecture for language models, despite its quadratic time complexity, is that it is highly parallelizable.
What can we do with a faster Earley parser? We started working on this because we wanted a way to induce a probabilistic grammar over a corpus of strings. Using a faster Earley parser, we could potentially compute and minimize the expected inside values over the corpus of strings. We didn’t finish this work, so if that sounds interesting to you, reach out to me.