By Raphael Proust - 2011-08-12
Str module has to go!
Str provides regular expressions in a non-reentrant, non-functional fashion.
While the OCaml distribution provides it in
otherlibs, it is installed by
default and so widely used, and implemented under the hood via a C library.
Regular expressions are used in several places in MirageOS, mainly for small
operations (splitting, getting an offset, etc.), and so having a portable
fallback written in pure OCaml would be very useful.
There are several possible ways to replace the
Str module, each with its own
set of perks and drawbacks:
fourth backend is to be included. Moreover each regexp library uses a slightly
different convention for regexps (e.g. see the
magic option in
vim) which means that a lot of translation code might be needed.
String.index and the likes).
This solution is portable and potentially efficient. However, the potential
efficiency comes from a low-level way of doing things.
Str and needed an estimation of performance cost in order to
assess the practicality of the solution.
There is a purely OCaml regexp library readily available, called
developed by Claude Marché from the LRI laboratory. You can find the
documentation and the source on the associated
webpage. After getting rid of mutexes
(which, in MirageOS, are of no use, because of the
concurrency), we benchmarked it against
Str. We also included the popular
Pcre (Perl Compatible Regular Expression) library that is widely used.
The benchmark (available on github) is really simple and measures three different factors:
MirageOS uses regexp in a specific pattern: a limited number of regexp constructions with a potentially high number of invocation (e.g. HTTP header parsing). The size of the strings on which regexps are used may vary. Because of this pattern, our benchmark does not take regexp construction overhead into account.
Here are the execution times of approximately 35000 string matching operations on strings of 20 to 60 bytes long.
Quite surprisingly for the string matching operation, the C based
is less efficient than the pure OCaml
Pcre results were even worse
Regexp library is lightweight, and so far faster than its C based
counterparts. One of the features
Regexp lacks is "group capture": the ability
to refer to blocks of a previously matched string. In
Pcre it is possible to
explicitly and selectively turn group capturing off via special syntax,
instead of the regular parentheses.
Str does not offer this, and thus
imposes the runtime cost of capture even when not necessary. In other words, the
slowdown/group capturing "is not a feature, it's a bug!"
With the introduction of
Regexp into the tree, the libraries available to MirageOS
applications are now
Str-free and safer to use across multiple backends. The main
drawback is a slight increase in verbosity of some parts of the code.
Benchmarking the substitution operation is also necessary to assess the
performance gain/loss (which we will do shortly).
In addition to cosmetic and speed considerations, it is important to consider the portability increase: MirageOS's standard library is Node.js compatible, a feature we will explore shortly!