Archive

Posts Tagged ‘Java’

Java regular expressions – tricky quantifiers

June 2nd, 2010 Asiri Rathnayake No comments

Couple of days ago I was asked to implement temporary resource support for xwiki. It was a straight forward job; all I had to do was to parse a request URI, find a corresponding temporary file and serve it to the client – no big deal. The URI format was known which looked something like:

<prefix>/temp/<space>/<page>/<module>/<remainder>

Here <prefix> and <remainder> are paths themselves (with prefix being /xwiki/bin almost always). Now all that was needed is a mechanism to extract <space>, <page>, <module> and <remainder> segments; afterwards composing the absolute path to the temporary file was easy. It didn’t take me long to figure out that I needed a regular expression to do the dirty work – without much thinking I came up with the following code:

// ....

public class TempResourceAction
{
  private static final Pattern URI_PATTERN =
    Pattern.compile(".*/temp/([^/]*)/([^/]*)/([^/]*)/(.*)");

  //....

  private File getTemporaryFile(String URI)
  {
    Matcher matcher = pattern.matcher(URI);
    if (matcher.find()) {
      String space = matcher.group(1);
      String page = matcher.group(2);
      String module = matcher.group(3);
      String path = matcher.group(4);

      //...
    }
  }
}

Now, there is no doubt about the coolness of capturing groups; they provide a convenient method of extracting pieces of information from a formatted string. I was happy about the code and asked sergiu dumitriu to have a look at the code since this feature was bit too important. I was surprised and excited when he proved that my regular expression can actually be exploited to breach the security; a URI input of following sort will lead to chaos:

<prefix>/temp/space1/page1/module1/temp/space2/page2/module2/<remainder>

Result:

<space> - space2 (group 1)
<page> - page2 (group 2)
<module> - module2 (group 3)
<path> - remainder (group 4)

The problem is that the quantifier * (kleene star) at the beginning of my regular expression acts in a greedy fashion; which means the regular expression evaluation engine will try to repeat the preceding pattern (in this case – “.”) as many times as possible – leading to the consumption of the entire string. Afterwards when the engine realizes that the whole input string has been consumed while the pattern string has not been consumed entirely (i.e. a failed match), it will start to backtrack one character at a time and try to match the remainder of the pattern string against the string formed by the newly discarded characters (i.e. remainder of the input string). As we can see, this is a very expensive process and in my case leads to a match that I did not expect. Quite frankly the solution to this problem is simple – we replace the greedy quantifier with a reluctant one:

Pattern pattern =
  Pattern.compile(".*?/temp/([^/]*)/([^/]*)/([^/]*)/(.*)");

The reluctant quantifier *? forces the regular expression evaluation engine to act in an incremental fashion; first the expression .*? will be mapped to zero characters from the input string (note that a kleen star always implies zero or more characters), if that doesn’t work (i.e rest of the pattern string doesn’t match with the remainder of the input string) .*? will be mapped to one character, then to two characters and so and so on until the pattern string is entirely consumed (i.e. a match found). As we can see, with this modification my regular expression not only leads to correct results for all possible inputs but also operates faster because there is less backtracking involved.

Now to the really interesting part; sergiu suggested that I could improve the performance of my regular expression with the following modification:

Pattern pattern =
 Pattern.compile(".*?/temp/([^/]*+)/([^/]*+)/([^/]*+)/(.*+)");

I could understand greedy and reluctant qualifiers without much effort, but to understand possessive quantifiers I had to do some reading. The section about quantifiers on java regular expressions tutorial gives a rough idea about greedy, reluctant and possessive quantifiers. However, for a more thorough treatment of possessive quantifiers I had to refer this article. Simply put, possessive quantifiers can be thought of as an extension to greedy quantifiers; they work in a greedy fashion except that they never fall back (backtrack) even if a match is not found. This behavior implies that possessive quantifiers can sometimes reject legitimate inputs; for an example if we consider the input string aaab and the pattern strings .*b and .*+b,  the first pattern will result in a correct match while the second one will reject the input string – the reason is that the beginning expression .*+ of the second pattern consumes the entire input string and never backtracks to see if the rest of the pattern string b can be matched.

Now why would anyone want to use possessive quantifiers if they can sometimes reject valid inputs? Well it turns out that possessive quantifiers, if used wisely can make the regular expression evaluation to either match fast or fail fast. Possessive quantifiers avoid unnecessary attempts at matching an input string unlike reluctant or greedy quantifiers which force incremental matching or backtracking. This quality makes possessive quantifiers very attractive in terms of performance. However, one thing we need to make sure is that possessive quantifiers should be employed in a way that does not make the whole regular expression reject any valid inputs. In my regular expression if I used a possessive kleen star quantifier .*+ at the beginning of the expression it would make the entire regular expression useless. But on the other hand our use of possessive quantifiers for specifying path segments [^/]*+ and remainder segment .*+ makes complete sense because such usage simply makes the matching work fast (or fail fast on invalid inputs) in comparison to both reluctant and greedy approaches.

Categories: Java, Technology Tags: ,