World's most popular travel blog for travel bloggers.

[Solved]: How to find specificity of a regex match?

, , No Comments
Problem Detail: 

I'm thinking about a routing system. Imagine I have the two following regexes

  • pathpart1/pathpart2 => specific match that routes to controller1
  • .* => catch-all that routes to controller2

And I let them match on a URL, e.g. 'pathpart1/pathpart2'.

They both match, but I would want to give prevalence to the most specific regex, i.e. the regex where the cardinality of all possible matches of that regex is the lowest.

Is there a good way to calculate that the first regex has a low cardinality on its match-set (so I want to to go with that match) and the second has a very high cardinality on its match-set (i.e. it is completely not specific, so a match is basically a catch all last resort)...?

I do not know upfront which routes are registered with the router, so I can't loop over them in order of cardinality by hand (i.e. low cardinality first, and the catch-all last if all others don't match).

Asked By : Willem Mulder

Answered By : Raphael

There are two angles I see to define "specifity" in this context.

  1. Specifity of a regular expression $r$ that matches input $w$ is the (inverse of the) number of words of length $|w|$ that also match $r$.
  2. Specifity of a regular expression $r$ that matches input $w$ is (the ratio of) the (maximum) number of symbols of $w$ that are matched by non-wildcard symbols in $r$ (and $|w|$).

The first option follows your idea: choose among all matches the regular expressions that matches the fewest others. Since there are in general infinitely many matching strings, we have to restrict ourselves to a certain length, or a finite set of lengths.

The second option follows another approach. If the input string is well described by a given expression, long parts of it should be (more or less) explicitly described in the regular expression. For example, $(aa|ab)+.^*bba$ is a more specific match for $aaababaababababba$ than $.^*$.

Option 1

In this case, we have to count all words in $\mathcal{L}(r)$ that have length $n$. Luckily, this is easy once we have an unambiguous, regular grammar¹ $G$ for $\mathcal{L}(r)$. See here how this can be done.

Option 2

Let's assume we have an unambiguous regular expression (for instance by specifying lazy wildcards). In that case, we can translate the regular expression into a DFA. Run the word through it and count how many non-wildcard transitions are taken.

If the expression is ambiguous, we can do the same with the corresponding NFA if we explore all possible paths.


  1. Convert $r$ to an $NFA$, determinise, convert to right-regular grammar. All conversions are standard and algorithmically feasible (if not necessarily efficient).
Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/10786

0 comments:

Post a Comment

Let us know your responses and feedback