An SML implementation of the CEK machine
The CEK machine [1] is often used to understand and formalize the evaluation of pure λ calculus. A good introduction to the CEK machine can be found in the lecture notes on principles of programming by Dr. Hayo Thielecke. As part of my research Dr. Hayo also recommended me to learn a bit of Standard ML in order to get a feeling of functional programming; and I decided to translate the CEK machine into SML. The result can be found here – it was fun to both learn and program is SML specially with the knowledge on λ calculus I gathered in lectures. My main reference on SML was Dr. Robert Harper‘s tutorial; this tutorial is easy to follow if you have a good understanding of λ calculus. Now the code might look little less professional because I’m new to SML, but it sure works:
[asiri@localhost sml]$ mlton cek.sml [asiri@localhost sml]$ ./cek --- http://asiri.rathnayake.org --- Input expression: (lambda x.x)5 Result: 5 [asiri@localhost sml]$ ./cek --- http://asiri.rathnayake.org --- Input expression: (lambda x.x)y Error occured while evaluating lambda expression: [Could not locate environment mapping for variable [y]] [asiri@localhost sml]$ ./cek --- http://asiri.rathnayake.org --- Input expression: (lambda x.lambda y.x y)(lambda y.y)5 Result: 5 [asiri@localhost sml]$
[1] https://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR197
Concrete syntax vs Abstract syntax
I recently had a discussion with my supervisor Dr. Hayo Thielecke regarding above topic. As per our discussion, concrete syntax is what is defined by a grammar (BNF) and in turn used to build a parser for the language. If we take a derivation of a string according to the productions of a grammar, it is normally called a parse tree. However, note that even though we call it a “parse tree”, there is no actual tree like data structure involved. On the other hand an abstract syntax tree (AST) is an actual data structure which is built while an input string is being parsed; this tree structure resembles the parse tree up to some degree but it is more “abstract” in a sense because we try to capture the “essence” of the program while discarding all unnecessary details (comments, spaces, some parenthesis etc.) depending on the language in question. It is useful to think of AST as the interface (or bridge) between the compiler front-end (lexer & parser) and compiler back-end (semantics & code generation). Another fine point is that sometimes we define a secondary grammar in order to understand ASTs of a language – often called abstract syntax of a language. It is most likely that this secondary grammar is ambiguous because the focus is on the structure of the AST – which is missing all the boiler-plate nodes of the corresponding parse tree. Regardless of these fine points, we sometimes hear these “tree” terms used interchangeably
Installing broadcom wireless drivers on centos5
Installing broadcom wireless drivers on a linux system is pretty simple, just download the driver source from http://www.broadcom.com/support/802.11/linux_sta.php and follow the instructions on the README file. However, if you are running CentOS5 you might be faced with an error message during the compilation phase that reads something like below:
[asiri@asiri-laptop hybrid_wl]$ make KBUILD_NOPEDANTIC=1 make -C /lib/modules/`uname -r`/build M=`pwd` make[1]: Entering directory `/usr/src/kernels/2.6.18-194.3.1.el5-PAE-i686' LD /home/asiri/software/hybrid_wl/built-in.o CC [M] /home/asiri/software/hybrid_wl/src/shared/linux_osl.o In file included from /home/asiri/software/hybrid_wl/src/shared/linux_osl.c:17: /home/asiri/software/hybrid_wl/src/include/typedefs.h:80: error: conflicting types for ‘bool’ include/linux/types.h:36: error: previous declaration of ‘bool’ was here make[2]: *** [/home/asiri/software/hybrid_wl/src/shared/linux_osl.o] Error 1 make[1]: *** [_module_/home/asiri/software/hybrid_wl] Error 2 make[1]: Leaving directory `/usr/src/kernels/2.6.18-194.3.1.el5-PAE-i686' make: *** [all] Error 2
Now I can’t say much about the reasons for this error but you can simply work around it by commenting out three lines from the source file src/include/typedefs.h (lines 79 – 81):
//ifndef TYPEDEF_BOOL //typedef unsigned char bool; //endif
I didn’t discover this solution myself; I found it on a forum couple of months back when I was setting up a different installation and thought of logging it here just in case I bump into it again
Java regular expressions – tricky quantifiers
Couple of days ago I was asked to implement temporary resource support for xwiki. It was a straight forward job; all I had to do was to parse a request URI, find a corresponding temporary file and serve it to the client – no big deal. The URI format was known which looked something like:
<prefix>/temp/<space>/<page>/<module>/<remainder>
Here <prefix> and <remainder> are paths themselves (with prefix being /xwiki/bin almost always). Now all that was needed is a mechanism to extract <space>, <page>, <module> and <remainder> segments; afterwards composing the absolute path to the temporary file was easy. It didn’t take me long to figure out that I needed a regular expression to do the dirty work – without much thinking I came up with the following code:
// ....
public class TempResourceAction
{
private static final Pattern URI_PATTERN =
Pattern.compile(".*/temp/([^/]*)/([^/]*)/([^/]*)/(.*)");
//....
private File getTemporaryFile(String URI)
{
Matcher matcher = pattern.matcher(URI);
if (matcher.find()) {
String space = matcher.group(1);
String page = matcher.group(2);
String module = matcher.group(3);
String path = matcher.group(4);
//...
}
}
}
Now, there is no doubt about the coolness of capturing groups; they provide a convenient method of extracting pieces of information from a formatted string. I was happy about the code and asked sergiu dumitriu to have a look at the code since this feature was bit too important. I was surprised and excited when he proved that my regular expression can actually be exploited to breach the security; a URI input of following sort will lead to chaos:
<prefix>/temp/space1/page1/module1/temp/space2/page2/module2/<remainder> Result: <space> - space2 (group 1) <page> - page2 (group 2) <module> - module2 (group 3) <path> - remainder (group 4)
The problem is that the quantifier * (kleene star) at the beginning of my regular expression acts in a greedy fashion; which means the regular expression evaluation engine will try to repeat the preceding pattern (in this case – “.”) as many times as possible – leading to the consumption of the entire string. Afterwards when the engine realizes that the whole input string has been consumed while the pattern string has not been consumed entirely (i.e. a failed match), it will start to backtrack one character at a time and try to match the remainder of the pattern string against the string formed by the newly discarded characters (i.e. remainder of the input string). As we can see, this is a very expensive process and in my case leads to a match that I did not expect. Quite frankly the solution to this problem is simple – we replace the greedy quantifier with a reluctant one:
Pattern pattern =
Pattern.compile(".*?/temp/([^/]*)/([^/]*)/([^/]*)/(.*)");
The reluctant quantifier *? forces the regular expression evaluation engine to act in an incremental fashion; first the expression .*? will be mapped to zero characters from the input string (note that a kleen star always implies zero or more characters), if that doesn’t work (i.e rest of the pattern string doesn’t match with the remainder of the input string) .*? will be mapped to one character, then to two characters and so and so on until the pattern string is entirely consumed (i.e. a match found). As we can see, with this modification my regular expression not only leads to correct results for all possible inputs but also operates faster because there is less backtracking involved.
Now to the really interesting part; sergiu suggested that I could improve the performance of my regular expression with the following modification:
Pattern pattern =
Pattern.compile(".*?/temp/([^/]*+)/([^/]*+)/([^/]*+)/(.*+)");
I could understand greedy and reluctant qualifiers without much effort, but to understand possessive quantifiers I had to do some reading. The section about quantifiers on java regular expressions tutorial gives a rough idea about greedy, reluctant and possessive quantifiers. However, for a more thorough treatment of possessive quantifiers I had to refer this article. Simply put, possessive quantifiers can be thought of as an extension to greedy quantifiers; they work in a greedy fashion except that they never fall back (backtrack) even if a match is not found. This behavior implies that possessive quantifiers can sometimes reject legitimate inputs; for an example if we consider the input string aaab and the pattern strings .*b and .*+b, the first pattern will result in a correct match while the second one will reject the input string – the reason is that the beginning expression .*+ of the second pattern consumes the entire input string and never backtracks to see if the rest of the pattern string b can be matched.
Now why would anyone want to use possessive quantifiers if they can sometimes reject valid inputs? Well it turns out that possessive quantifiers, if used wisely can make the regular expression evaluation to either match fast or fail fast. Possessive quantifiers avoid unnecessary attempts at matching an input string unlike reluctant or greedy quantifiers which force incremental matching or backtracking. This quality makes possessive quantifiers very attractive in terms of performance. However, one thing we need to make sure is that possessive quantifiers should be employed in a way that does not make the whole regular expression reject any valid inputs. In my regular expression if I used a possessive kleen star quantifier .*+ at the beginning of the expression it would make the entire regular expression useless. But on the other hand our use of possessive quantifiers for specifying path segments [^/]*+ and remainder segment .*+ makes complete sense because such usage simply makes the matching work fast (or fail fast on invalid inputs) in comparison to both reluctant and greedy approaches.
Tomcat on linux – ditching openjdk for sunjdk
One of the annoying problems I came across recently is that it’s really painful to get tomcat installed on Fedora (or CentOS) without being dependent on OpenJDK. I don’t like using OpenJDK because SunJDK is widely used and well tested (plus, I don’t see a distinct advantage of using OpenJDK). Also in my specific use case, XWiki has some problems when run on OpenJDK (Tomcat).
For me, the main advantage of using platform provided tomcat version (one that depends on OpenJDK) is that it is well integrated into the system (init scripts, updates etc.) but getting it to work with SunJDK is kind of hackish. There are several blog posts on the internet about how to configure the platform provided tomcat version to depend on SunJDK instead of OpenJDK (most solutions make use of the alternatives command) but none of them seems to be without side effects (and in my specific case, it just didn’t work – XWiki failed to start).
I finally made up my mind and decided to install tomcat and SunJDK from their official web sites. Right now I’m executing tomcat from my personal user account and I have set my JAVA_HOME and PATH variables correctly so that tomcat uses SunJDK rather than system provided OpenJDK. It might not be the perfect solution (administrative difficulties, security concerns etc.) but for my particular use case it was acceptable. However, I believe these disadvantages can be overcome with some effort (write an init script, create a separate user account for executing tomcat etc.) and that it’ll still be worthier than being dependent on OpenJDK.