[Solved]: Can an Earley Parser be made into a fuzzy parser similar to the Levenshtein Automata Algo for DFA?

Problem Detail: 

There is a way to perform fuzzy parsing (accepting strings even with typos, up to a certain edit distance) with a DFA and a run-time constructed Levenshtein automaton of the input word. Can something similar be done with an Earley parser? I'm finding it hard to understand the algorithm, let alone answer this question.

Asked By : Fruitful Approach

Answered By : babou

The answer is yes. However, I would not do it with an Earley parser, because there are simpler parsers with the same capabilities.

Basically, the Earley parser belongs to a family of general context-free parsers that produce all possible parses for a given string when the grammar is ambiguous.

There are two ways (at least) of understanding these parsers:

  • as a dynamic-programming interpretation of a pushdown automaton corresponding to the grammar, run on the input string;

  • as the construction of the intersection of the grammar with a finite state automaton.

When parsing a single string, the finite state automaton to consider is a linear automaton that recognizes only the string $w$ to be parsed, one symbol at a time (the number of states is $|w|+1$). If you apply the cross-product construction of an FA $A$ and a CF grammar $G$ (Bar-Hillel, Perles, Shamir 1961), you get a new CF grammar $F$ which generates $\mathcal L(A)\cap\mathcal L(G)$. The interesting point, usually overlooked, is that $F$ preserves the parse-trees used by $G$, up to a renaming of non-terminals (due to the cross-product).

Thus if the FA $A$ generates only your input string, the grammar $F$ will generate only that string (if it is in $\mathcal L(G)$; otherwise it generates the empty language $\emptyset$). Furthermore, it generates it with all the parse-trees that $G$ could use to generate it.
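
To make the construction concrete, here is a minimal sketch of the Bar-Hillel cross-product in Python. The function name `intersect`, the dict encoding of grammars (non-terminal → list of right-hand-side tuples) and the FA encoding are my own choices for illustration, not anything prescribed by the construction, and no effort is made to prune useless rules.

```python
from itertools import product

def intersect(grammar, start, states, delta, q0, finals):
    """Bar-Hillel/Perles/Shamir cross-product of a CF grammar with an FA (sketch).

    grammar : non-terminal -> list of right-hand sides (tuples of symbols)
    delta   : (state, terminal) -> set of successor states
    Returns (new_grammar, new_start).  Non-terminals of the result are
    triples (p, X, q), read as "X derives a string taking the FA from p to q"."""
    new, new_start = {}, "S'"
    new[new_start] = [((q0, start, f),) for f in finals]
    for lhs, rhss in grammar.items():
        for rhs in rhss:
            k = len(rhs)
            # One copy of each rule for every way of threading FA states
            # through its right-hand side.
            for seq in product(states, repeat=k + 1):
                body = tuple((seq[j], rhs[j], seq[j + 1]) for j in range(k))
                new.setdefault((seq[0], lhs, seq[k]), []).append(body)
    # A terminal triple (p, a, q) rewrites to the terminal a itself,
    # but only when the FA really has a transition p --a--> q.
    for (p, a), succs in delta.items():
        for q in succs:
            new.setdefault((p, a, q), []).append((a,))
    return new, new_start
```

Applied to the linear automaton of a string $w$ (states $0,\dots,|w|$, one transition per input symbol, final state $|w|$), this yields exactly the shared-forest grammar $F$ discussed below; in practice one would then prune the unreachable and unproductive triples.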

This grammar $F$ is what is usually called a shared parse forest, and all general CF parsing algorithms are more or less optimized versions of the cross-product construction, whether CYK, Earley, generalized LR or LL, or others. So everything I am saying applies to them too.

But, as you see, this generalizes to parsing a whole regular set, if anyone is interested in doing that.

That is precisely your question. You have a string $w$. You want to parse it up to some variations defined by a finite state transducer, which in your case is a transducer that produces all strings within some given Levenshtein edit distance of $w$ (but the origin of the transducer is immaterial). The set of those strings is a regular set that can be defined by an FA, with weighted transitions that compute the edit distance of each string.
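
As an illustration, here is one way such a weighted automaton can be built for the Levenshtein case. This is a sketch: the function name `levenshtein_nfa`, the pair encoding of states and the use of `None` for an ε-move (deletion of a symbol of $w$) are assumptions of mine, and the alphabet must be supplied explicitly.

```python
def levenshtein_nfa(w, k, alphabet):
    """Weighted NFA accepting every string within Levenshtein distance k of w.
    States are pairs (i, e): i symbols of w consumed, e errors committed so far.
    Returns (delta, start, finals); delta maps (state, symbol) to a set of
    (next_state, weight) pairs, with symbol None standing for an epsilon move."""
    delta = {}
    def arc(src, sym, dst, cost):
        delta.setdefault((src, sym), set()).add((dst, cost))
    n = len(w)
    for i in range(n + 1):
        for e in range(k + 1):
            if i < n:
                arc((i, e), w[i], (i + 1, e), 0)           # exact match
            if e < k:
                for a in alphabet:
                    arc((i, e), a, (i, e + 1), 1)          # insertion
                    if i < n and a != w[i]:
                        arc((i, e), a, (i + 1, e + 1), 1)  # substitution
                if i < n:
                    arc((i, e), None, (i + 1, e + 1), 1)   # deletion (epsilon)
    return delta, (0, 0), {(n, e) for e in range(k + 1)}
```

The minimum weight over accepting paths of this automaton is the edit distance to $w$; determinizing it (and dropping the weights) gives the plain, unweighted Levenshtein automaton mentioned in the question.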

If you do the cross-product with your grammar $G$, you get a shared parse forest grammar $F$ that generates all the strings in the intersection. Furthermore, you get the weights on some of the rules so that you can compute the edit distance of each of the accepted strings.

If desirable, this can be used to keep only the strings with minimal distance.
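
For instance, once the forest rules carry edit weights, the minimal distance can be read off by a standard least-cost-derivation fixpoint over the weighted grammar. The sketch below assumes a weighted variant of the dict encoding used earlier (non-terminal → list of (right-hand side, weight) pairs) with non-negative integer weights; the function name is hypothetical.

```python
def min_derivation_cost(wgrammar, start):
    """Least total weight of a terminal derivation from `start`.
    wgrammar: non-terminal -> list of (rhs, weight); symbols that are not
    keys of wgrammar are terminals and cost nothing.  Returns None if
    `start` derives no terminal string (empty forest)."""
    INF = float("inf")
    best = {A: INF for A in wgrammar}
    changed = True
    while changed:                        # Bellman-Ford-style relaxation
        changed = False
        for A, rules in wgrammar.items():
            for rhs, wt in rules:
                c = wt + sum(best.get(X, 0) for X in rhs)   # terminals cost 0
                if c < best[A]:
                    best[A] = c
                    changed = True
    return None if best[start] == INF else best[start]
```

Keeping a backpointer each time a value improves yields the minimal-distance parse tree itself, not just its cost.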

However, this can be improved a bit since composition with finite state machines is associative.

If you always use the same finite state transducer, as is the case in your question, the right way to go about it is to compose the grammar $G$ and the transducer (here the Levenshtein automaton) independently of the input string. This gives you a weighted grammar that you can use to parse the input string $w$. The problem is that parsing with the brutal intersection construction will give you strings at any Levenshtein distance, i.e. $\Sigma^*$.

It would be easy to prune that construction to get the same result as before, but the better way is a more controlled intersection construction, such as the dynamic programming organization used by most parsers in the literature, including Earley's, which avoids generating useless rules by computing distances and aborting any computational path as soon as it exceeds the desired threshold. Dynamic programming can also be used to compute directly the parse-forest (or parse-tree) for the strings that have the shortest distance to the input.
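
To give an idea of what such a controlled construction looks like, below is a sketch of an Earley-style recognizer that carries edit costs in its items and discards any item whose cost exceeds the threshold. It computes the least edit distance between the input and the language rather than the full forest; the item layout, the function name and the grammar encoding are assumptions of mine, in the spirit of least-errors parsers (e.g. Aho and Peterson), not a transcription of the answer above.

```python
from collections import deque

def fuzzy_earley_distance(grammar, start, w, max_cost):
    """Least number of edits (substitution/insertion/deletion, unit cost)
    turning w into a sentence of the grammar, searched up to max_cost;
    returns None if every sentence is further away than that.

    grammar: non-terminal -> list of right-hand sides (tuples of symbols);
    a symbol is a non-terminal iff it is a key of `grammar`.
    Items are (lhs, rhs, dot, origin); chart[i] maps each item to the
    cheapest cost found so far for the span origin..i."""
    n, nonterms = len(w), set(grammar)
    chart = [{} for _ in range(n + 1)]

    def add(pos, item, cost):
        # Keep an item only if it is new or cheaper, and within the budget.
        if cost > max_cost or cost >= chart[pos].get(item, max_cost + 1):
            return False
        chart[pos][item] = cost
        return True

    for rhs in grammar[start]:                      # seed the chart
        chart[0][(start, rhs, 0, 0)] = 0

    for i in range(n + 1):
        agenda = deque(chart[i])                    # items carried in so far
        while agenda:
            item = agenda.popleft()
            lhs, rhs, dot, origin = item
            cost = chart[i][item]
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in nonterms:
                    # Predictor: a sub-goal starts with cost 0 (its cost only
                    # covers its own span).
                    for r in grammar[sym]:
                        if add(i, (sym, r, 0, i), 0):
                            agenda.append((sym, r, 0, i))
                    # Re-combine with sub-goals already completed on an empty
                    # span at i (possible through insertions or epsilon rules).
                    for (l2, r2, d2, o2), c2 in list(chart[i].items()):
                        if l2 == sym and d2 == len(r2) and o2 == i:
                            if add(i, (lhs, rhs, dot + 1, origin), cost + c2):
                                agenda.append((lhs, rhs, dot + 1, origin))
                else:
                    # Insertion: pretend the terminal is present (cost 1).
                    if add(i, (lhs, rhs, dot + 1, origin), cost + 1):
                        agenda.append((lhs, rhs, dot + 1, origin))
                    if i < n:
                        # Scanner: exact match costs 0, substitution costs 1.
                        add(i + 1, (lhs, rhs, dot + 1, origin),
                            cost + (0 if w[i] == sym else 1))
            else:
                # Completer: lhs matched origin..i for `cost`; advance every
                # parent that was waiting for it at the origin position.
                for (l2, r2, d2, o2), c2 in list(chart[origin].items()):
                    if d2 < len(r2) and r2[d2] == lhs:
                        if add(i, (l2, r2, d2 + 1, o2), c2 + cost):
                            agenda.append((l2, r2, d2 + 1, o2))
            if i < n:
                # Deletion: skip the input symbol w[i] without moving the dot.
                add(i + 1, item, cost + 1)

    done = [c for (l, r, d, o), c in chart[n].items()
            if l == start and d == len(r) and o == 0]
    return min(done) if done else None
```

For example, with the toy grammar {'S': [('(', 'S', ')'), ()]} and input '(()', a threshold of 2 should report a distance of 1 (one missing closing parenthesis). Extending the items with backpointers would recover the corresponding minimal-distance parse trees.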


Question Source : http://cs.stackexchange.com/questions/44889
