Perl Regular Expressions

Why Perl is the ultimate regular expression powerhouse

I don’t get to use Perl very often these days, but when I do, it’s usually because I want to do some powerful data manipulation in a hurry. In the last couple of weeks, I’ve used it to extract and format a large amount of financial data for reporting purposes (it is the Practical Extraction and Report Language, after all), and currently I am using it to extract data from a series of text files containing technical trivia. The text files are in question-and-answer format, and the regexes I have defined to extract the data and write it to an XML-based format look like this (snipped):

while ($trivia =~ m/^Q\s\$([A-Z0-9]{3})\)\s(.*?)(?=^A)/smg ) {
my $q = "<question number=\"$1\">$2</question>\n";
push (@questions, $q);
}

while ($trivia =~ m/^A\s\$([A-Z0-9]{3})\)\s(.*?)(?=^Q|\Z)/smg ) {
my $answer = "<answer number=\"$1\">$2</answer>\n"; # not $a: sort uses $a and $b
push (@answers, $answer);
}

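It’s a two-pass approach using two separate patterns. The two expressions are nearly identical: apart from the leading Q or A, the only difference is the lookahead at the end. The question pattern uses (?=^A), so each question runs up to the next line that starts with A; the answer pattern uses (?=^Q|\Z), so each answer runs up to the next question or the end of the input. In both cases the lookahead acts as an anchor that tells the parser when to stop, without consuming what it finds.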
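To see the two passes end to end, here is a minimal, self-contained sketch. The sample data is invented for illustration, and its layout is only inferred from the patterns: a line starting with Q or A, a space, a literal $, a three-character ID, a closing parenthesis, then the text.

```perl
use strict;
use warnings;

# Hypothetical sample data; the layout is inferred from the patterns,
# not taken from the real trivia files.
my $trivia = <<'END';
Q $A01) Which modifier lets . match a newline?
A $A01) The /s modifier.
Q $A02) Which modifier lets ^ match at the start
of every line?
A $A02) The /m modifier.
END

my (@questions, @answers);

# Pass 1: each question runs up to the next line that starts with "A".
while ($trivia =~ m/^Q\s\$([A-Z0-9]{3})\)\s(.*?)(?=^A)/smg) {
    push @questions, "<question number=\"$1\">$2</question>\n";
}

# Pass 2: each answer runs up to the next line that starts with "Q",
# or up to the end of the input.
while ($trivia =~ m/^A\s\$([A-Z0-9]{3})\)\s(.*?)(?=^Q|\Z)/smg) {
    push @answers, "<answer number=\"$1\">$2</answer>\n";
}

print @questions, @answers;
```

Note that the captured text keeps the newline that ends each record, except for the final answer, where \Z stops the match just before the last newline.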

None of this is very hard in Java either; our patterns can be ported almost straight over:

Pattern patt = Pattern.compile(
"^Q\\s\\$([A-Z0-9]{3})\\)\\s(.*?)(?=^A|\\Z)",
Pattern.DOTALL | Pattern.MULTILINE);

However, what sets Perl apart is how much further its regex engine can go. For instance, it would be nice to be able to switch the anchoring condition dynamically, depending on what we had just captured in $1. If $1 was “Foo”, say, we could change the anchoring condition to “Bar”, giving us a dynamic regex that automagically seems to understand the semantics of the data it is processing as well as its syntax.

In Perl, this is possible using the (??{}) and (?{}) constructs. Both let you execute Perl code in the body of a regular expression. (?{}) simply runs the code at that point in the match, while (??{}) goes a step further and uses the code’s return value as a dynamic pattern to match at that position. You can even “feed” the code with capture groups such as $1 from the current pattern.
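A minimal sketch of the two constructs side by side may help. The variable names and test strings are made up for illustration: (?{}) only runs its code, while (??{}) matches whatever pattern its code returns.

```perl
use strict;
use warnings;

# (?{ ... }) runs code whenever the engine reaches it during matching;
# its return value is not used as a pattern.
my $hits = 0;
"aaa" =~ /^(?:a(?{ $hits++ }))+$/;
print "code block ran $hits time(s)\n";

# (??{ ... }) runs code and then matches the returned string as a pattern.
# Here a single digit must be followed by exactly that many "x" characters.
my $counted = qr/^(\d)(??{ 'x{' . $1 . '}' })$/;

for my $s ('3xxx', '3xx') {
    printf "%s: %s\n", $s, ($s =~ $counted ? 'match' : 'no match');
}
```

Because the engine may backtrack, a (?{}) block can run more often than the final match suggests, so side effects like the counter here are best kept to simple, anchored patterns.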

To illustrate, check out the following examples in Perl.

First, our test string that we will search upon:

my $string = "Hello World";

Let’s also create a couple of lookup tables:

my %sym = (
'Hello' => 'World',
'World' => 'Hello'
);

my %sym2 = (
'Hello' => '\w{5}'
);

And now let’s use them.

my $var = "Hello";

First, a simple lookup using a variable:

if ($string =~ m/$var/) {
print "Matched variable\n";
}

Now, let’s interpolate a hash lookup into the regex:

if ($string =~ m/$sym{'Hello'}/) {
print "Matched hash lookup\n";
}

Next, let’s make the lookup dynamic by passing a backreference into a (??{}) block:

if ($string =~ m/(Hello) (??{$sym{$1}})/) {
print "Matched dynamic hash lookup using backreference\n";
}

And finally (the best bit) – using a backreference, the hash lookup resolves to a string of regex metacharacters (\w{5} in this case)! This is exactly what I need to make the lookahead dynamic.

if ($string =~ m/(Hello) (??{$sym2{$1}})/) {
print "Matched dynamic hash lookup, resolving to regex metacharacters, using backreference\n";
}
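Applied back to the trivia files, the same trick might look like the sketch below. It is an assumption about how the dynamic lookahead could be wired up, not the finished script: %stop_at, %tag and the sample data are hypothetical, and the stop conditions are stored as qr// objects so that each one carries its own /m flag.

```perl
use strict;
use warnings;

# Hypothetical sample data in the layout the earlier patterns imply.
my $trivia = <<'END';
Q $A01) Which modifier lets . match a newline?
A $A01) The /s modifier.
Q $A02) Which modifier lets ^ match at every line start?
A $A02) The /m modifier.
END

# Hypothetical lookup table: record type => pattern marking where the record ends.
my %stop_at = (
    Q => qr/^A/m,       # a question ends where the next answer line begins
    A => qr/^Q|\Z/m,    # an answer ends at the next question or at end of input
);

# Hypothetical mapping from record type to XML element name.
my %tag = (Q => 'question', A => 'answer');
my @records;

# One pass: $1 is the record type, and (??{ ... }) picks the lookahead for it.
while ($trivia =~ m/^([QA])\s\$([A-Z0-9]{3})\)\s(.*?)(?=(??{ $stop_at{$1} }))/smg) {
    push @records, "<$tag{$1} number=\"$2\">$3</$tag{$1}>\n";
}

print @records;
```

The two separate passes collapse into one loop, and adding a new record type would only mean adding an entry to each lookup table.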

It’s things like this that make me reach for Perl time and time again.
